<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Important Note - Please read me</h2>
            <span style="color:#900;">I'm continually improving these labs, adding more examples and exercises.
            At the start of each week, it's worth checking you have the latest code.<br/>
            First do a <a href="https://chatgpt.com/share/6734e705-3270-8012-a074-421661af6ba9">git pull and merge your changes as needed</a>. Any problems? Try asking ChatGPT to clarify how to merge - or contact me!<br/><br/>
            After you've pulled the code, from the llm_engineering directory, in an Anaconda prompt (PC) or Terminal (Mac), run:<br/>
            <code>conda env update -f environment.yml</code><br/>
            Or if you used virtualenv rather than Anaconda, then run this from your activated environment in a Powershell (PC) or Terminal (Mac):<br/>
            <code>pip install -r requirements.txt</code>
            <br/>Then restart the kernel (Kernel menu >> Restart Kernel and Clear Outputs Of All Cells) to pick up the changes.
            </span>
        </td>
    </tr>
</table>
<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../resources.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#f71;">Reminder about the resources page</h2>
            <span style="color:#f71;">Here's a link to resources for the course. This includes links to all the slides.<br/>
            <a href="https://edwarddonner.com/2024/11/13/llm-engineering-resources/">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>
            Please keep this bookmarked, and I'll continue to add more useful links there over time.
            </span>
        </td>
    </tr>
</table>

In [1]:
# import for google
# in rare cases, this seems to give an error on some systems, or even crashes the kernel
# If this happens to you, simply ignore this cell - I give an alternative approach for using Gemini later
import os
import google.generativeai
from dotenv import load_dotenv
from openai import OpenAI
import anthropic
from IPython.display import Markdown, display, update_display

In [2]:
# Load environment variables in a file called .env
# Print the key prefixes to help with any debugging

load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:8]}")
else:
    print("Google API Key not set")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AIzaSyDK


In [3]:
# Connect to OpenAI, Anthropic

openai = OpenAI()

claude = anthropic.Anthropic()

In [4]:
# This is the setup code for Gemini
# Having problems with Google Gemini setup? Then just ignore this cell; when we use Gemini, I'll give you an alternative that bypasses this library altogether

google.generativeai.configure()

## Asking LLMs to tell a joke

It turns out that LLMs don't do a great job of telling jokes! Let's compare a few models.
Later we will be putting LLMs to better use!

### What information is included in the API

Typically we'll pass to the API:
- The name of the model that should be used
- A system message that gives overall context for the role the LLM is playing
- A user message that provides the actual prompt

There are other parameters that can be used, including **temperature**, which is typically set between 0 and 1: higher values give more random output, and lower values give more focused and deterministic output.
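
As a minimal sketch of how these pieces fit together, the helper below assembles the model name, system message, user message and an optional temperature into a dictionary of keyword arguments. The name `build_chat_request` is hypothetical (it is not part of any library), the sketch assumes the OpenAI-style chat format used in this lab, and it makes no API call itself.

```python
def build_chat_request(model, system_message, user_prompt, temperature=None):
    """Assemble keyword arguments for an OpenAI-style chat completion call.

    Hypothetical helper for illustration only; it does not call any API.
    """
    request = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_prompt},
        ],
    }
    # Only include temperature when the caller sets it, so the API default applies otherwise
    if temperature is not None:
        request["temperature"] = temperature
    return request


request = build_chat_request(
    model="gpt-4o-mini",
    system_message="You are an assistant that is great at telling jokes",
    user_prompt="Tell a light-hearted joke for an audience of Data Scientists",
    temperature=0.7,
)
print(request["messages"][0]["role"], request["temperature"])
```

With a connected client, a dictionary like this could be passed along as `openai.chat.completions.create(**request)`.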

In [10]:
system_message = "You are an assistant that is great at telling spicy jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"

In [11]:
prompts = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_prompt}
  ]

In [12]:
# GPT-4o-mini

completion = openai.chat.completions.create(model='gpt-4o-mini', messages=prompts)
print(completion.choices[0].message.content)

Why did the data scientist break up with the statistician?

Because they found her too mean and wanted someone more ±2 standard deviations!


In [13]:
# GPT-4.1-mini
# Temperature setting controls creativity

completion = openai.chat.completions.create(
    model='gpt-4.1-mini',
    messages=prompts,
    temperature=0.7
)
print(completion.choices[0].message.content)

Why did the data scientist go broke?

Because he kept trying to find *mean*ing in his relationships!


In [14]:
# GPT-4.1-nano - extremely fast and cheap

completion = openai.chat.completions.create(
    model='gpt-4.1-nano',
    messages=prompts
)
print(completion.choices[0].message.content)

Why did the data scientist break up with the statistician?

Because they kept finding p-values instead of true love!


In [15]:
# GPT-4.1

completion = openai.chat.completions.create(
    model='gpt-4.1',
    messages=prompts,
    temperature=0.4
)
print(completion.choices[0].message.content)

Why did the data scientist break up with the logistic regression model?

Because it just couldn’t commit!


In [16]:
# If you have access to this, here is the reasoning model o4-mini
# This is trained to think through its response before replying
# So it will take longer but the answer should be more reasoned - not that this helps..

completion = openai.chat.completions.create(
    model='o4-mini',
    messages=prompts
)
print(completion.choices[0].message.content)

Here’s one for your next meetup:

“My model predicted that telling a data-science joke would boost engagement by 200%.  
But the p-value came back at 0.06—so I guess it wasn’t quite significant!”


### Claude

In [17]:
# Claude 4.0 Sonnet
# API needs system message provided separately from user prompt
# Also adding max_tokens

message = claude.messages.create(
    model="claude-sonnet-4-0",
    max_tokens=200,
    temperature=0.7,
    system=system_message,
    messages=[
        {"role": "user", "content": user_prompt},
    ],
)

print(message.content[0].text)

Here's one for the data scientists:

Why did the data scientist break up with the normal distribution?

Because they found out it was skewing the relationship, and frankly, they were tired of all the standard deviation from their mean expectations! 

Plus, the relationship had no significant p-value anymore... it was clearly time to reject the null hypothesis that they were meant to be together! 📊💔

*Bonus points if anyone groaned at "mean expectations" - that's how you know it's a proper data science dad joke!*


In [18]:
# Claude 4.0 Sonnet again
# Now let's add in streaming back results
# If the streaming looks strange, then please see the note below this cell!

result = claude.messages.stream(
    model="claude-sonnet-4-0",
    max_tokens=200,
    temperature=0.7,
    system=system_message,
    messages=[
        {"role": "user", "content": user_prompt},
    ],
)

with result as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Why did the data scientist break up with the statistician?

Because every time they had an argument, the statistician kept saying "correlation doesn't imply causation" – but the data scientist had already run a randomized controlled trial and proved the relationship was definitely causal! 

*Plus, the statistician's confidence intervals were way too wide for commitment.* 📊💔

## A rare problem with Claude streaming on some Windows boxes

Two students have noticed something strange with Claude's streaming into Jupyter Lab's output: it sometimes seems to swallow parts of the response.

To fix this, replace the code:

`print(text, end="", flush=True)`

with this:

`clean_text = text.replace("\n", " ").replace("\r", " ")`  
`print(clean_text, end="", flush=True)`

And it should work fine!
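
To see what the replacement does without calling the API, here is a small sketch that applies the same cleaning step to a few made-up chunks. The function name `clean_chunk` and the sample chunks are assumptions for illustration; in the real cell the chunks would come from `stream.text_stream`.

```python
def clean_chunk(text):
    """Replace newline and carriage-return characters with spaces, as in the fix above."""
    return text.replace("\n", " ").replace("\r", " ")


# Simulated chunks standing in for stream.text_stream (made-up sample data)
sample_chunks = ["Why did the data scientist\n", "break up with\r\n", "the statistician?"]

for text in sample_chunks:
    print(clean_chunk(text), end="", flush=True)
print()
```

Note the trade-off: the streamed text stays on one line, so any paragraph breaks in the response are lost.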

In [19]:
# The API for Gemini has a slightly different structure.
# I've heard that on some PCs, this Gemini code causes the Kernel to crash.
# If that happens to you, please skip this cell and use the next cell instead - an alternative approach.

gemini = google.generativeai.GenerativeModel(
    model_name='gemini-2.0-flash',
    system_instruction=system_message
)
response = gemini.generate_content(user_prompt)
print(response.text)

Why did the data scientist break up with the time series analyst? 

Because he said their relationship was too periodic and he needed more irregular intervals in his life!



In [20]:
# As an alternative way to use Gemini that bypasses Google's python API library,
# Google released endpoints that mean you can use Gemini via the client libraries for OpenAI!
# We're also trying Gemini's latest reasoning/thinking model

gemini_via_openai_client = OpenAI(
    api_key=google_api_key, 
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = gemini_via_openai_client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=prompts
)
print(response.choices[0].message.content)

Why are data scientists terrible at dating?

Because they keep trying to **overfit** their ideal partner model on a very **small and biased dataset** – usually just themselves!


# Sidenote:

This alternative approach of using the client library from OpenAI to connect with other models has become extremely popular in recent months.

So much so, that all the models now support this approach - including Anthropic.

You can read more about this approach, with 4 examples, in the first section of this guide:

https://github.com/ed-donner/agents/blob/main/guides/09_ai_apis_and_ollama.ipynb
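
As a rough sketch of the idea, the only things that change between providers are the API key and the `base_url`. The snippet below collects the two endpoints used in this lab (Gemini and the optional DeepSeek section) into a lookup table. The dictionary layout and the helper name `client_kwargs` are assumptions for illustration, not part of the OpenAI library.

```python
import os

# Base URLs as used in this lab; the dictionary layout itself is illustrative
OPENAI_COMPATIBLE_ENDPOINTS = {
    "gemini": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
        "api_key_env": "GOOGLE_API_KEY",
    },
    "deepseek": {
        "base_url": "https://api.deepseek.com",
        "api_key_env": "DEEPSEEK_API_KEY",
    },
}


def client_kwargs(provider):
    """Return the keyword arguments to pass to OpenAI(...) for a provider."""
    settings = OPENAI_COMPATIBLE_ENDPOINTS[provider]
    return {
        # os.getenv returns None when the variable is not set
        "api_key": os.getenv(settings["api_key_env"]),
        "base_url": settings["base_url"],
    }


print(client_kwargs("gemini")["base_url"])
```

A client could then be created with `OpenAI(**client_kwargs("gemini"))` and used with the usual `chat.completions.create` call.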

## (Optional) Trying out the DeepSeek model

### Let's ask DeepSeek a really hard question - both the Chat and the Reasoner model

In [24]:
# Optionally, if you wish to try DeepSeek, you can also use the OpenAI client library

deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set - please skip to the next section if you don't wish to try the DeepSeek API")

DeepSeek API Key exists and begins sk-


In [25]:
# Using DeepSeek Chat

deepseek_via_openai_client = OpenAI(
    api_key=deepseek_api_key, 
    base_url="https://api.deepseek.com"
)

response = deepseek_via_openai_client.chat.completions.create(
    model="deepseek-chat",
    messages=prompts,
)

print(response.choices[0].message.content)

Why did the data scientist get kicked out of the bar?

Because he kept trying to fit a regression model to the bartender’s pour—and all he got was overfitting and a hangover! 🍻📉


In [26]:
challenge = [{"role": "system", "content": "You are a helpful assistant"},
             {"role": "user", "content": "How many words are there in your answer to this prompt"}]

In [27]:
# Using DeepSeek Chat with a harder question! And streaming results

stream = deepseek_via_openai_client.chat.completions.create(
    model="deepseek-chat",
    messages=challenge,
    stream=True
)

reply = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    reply += chunk.choices[0].delta.content or ''
    reply = reply.replace("```","").replace("markdown","")
    update_display(Markdown(reply), display_id=display_handle.display_id)

print("Number of words:", len(reply.split(" ")))

Let me determine the number of words in my response to this prompt.

First, I need to provide an answer to the question: "How many words are there in your answer to this prompt?"

My response will be: "There are X words in this answer." But I must calculate X accurately.

To do this, I should construct the response and count the words. The response will be: "There are 7 words in this answer." However, I need to verify that.

Let's break it down:
- "There" (1)
- "are" (2)
- "7" (3) — but note: numbers can be considered as words in this context.
- "words" (4)
- "in" (5)
- "this" (6)
- "answer" (7)

So, indeed, there are 7 words. But I should write the response to confirm.

Therefore, my answer is: "There are 7 words in this answer."

Now, to be thorough, let's count the words in this actual response: "There are 7 words in this answer." — that is 7 words.

So, the final answer is 7.

However, I must ensure that the response is only that sentence. I should not include any extra text.

Thus, I will respond with exactly: "There are 7 words in this answer."

And that has 7 words.


There are 7 words in this answer.

Number of words: 196
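
A note on the count printed above: `reply.split(" ")` splits only on single space characters, so words separated by newlines are counted together and doubled spaces produce empty strings. The figure is therefore best read as approximate. A minimal sketch of the difference, using a made-up sample string:

```python
sample = "One two\nthree\n\nfour  five"

# Splitting on a single space keeps newline-joined words together
# and produces an empty string for the doubled space
by_space = sample.split(" ")

# Splitting with no argument treats any run of whitespace as one separator
by_whitespace = sample.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

Using `len(reply.split())` would count whitespace-separated tokens, although Markdown symbols such as list dashes would still be counted as words.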


## Additional exercise to build your experience with the models

This is optional, but if you have time, it's great to get first-hand experience with the capabilities of these different models.

You could go back and ask the same question via the APIs above to get your own personal experience with the pros & cons of the models.

Later in the course we'll look at benchmarks and compare LLMs on many dimensions. But nothing beats personal experience!

Here are some questions to try:
1. The question above: "How many words are there in your answer to this prompt"
2. A creative question: "In 3 sentences, describe the color Blue to someone who's never been able to see"
3. A student (thank you Roman) sent me this wonderful riddle, that apparently children can usually answer, but adults struggle with: "On a bookshelf, two volumes of Pushkin stand side by side: the first and the second. The pages of each volume together have a thickness of 2 cm, and each cover is 2 mm thick. A worm gnawed (perpendicular to the pages) from the first page of the first volume to the last page of the second volume. What distance did it gnaw through?".

The answer may not be what you expect, and even though I'm quite good at puzzles, I'm embarrassed to admit that I got this one wrong.

### What to look out for as you experiment with models

1. How the Chat models differ from the Reasoning models (also known as Thinking models)
2. The ability to solve problems and the ability to be creative
3. Speed of generation


## Back to OpenAI with a serious question

In [29]:
# To be serious! GPT-4.1-mini with the original question

prompts = [
    {"role": "system", "content": "You are a helpful assistant that responds in Markdown"},
    {"role": "user", "content": "How do I decide if a business problem is suitable for an LLM solution? Please respond in Markdown."}
  ]

In [30]:
# Have it stream back results in markdown

stream = openai.chat.completions.create(
    model='gpt-4.1-mini',
    messages=prompts,
    temperature=0.7,
    stream=True
)

reply = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    reply += chunk.choices[0].delta.content or ''
    reply = reply.replace("```","").replace("markdown","")
    update_display(Markdown(reply), display_id=display_handle.display_id)


# How to Decide if a Business Problem is Suitable for an LLM Solution

Large Language Models (LLMs) like GPT-4 have powerful natural language understanding and generation capabilities, but they are not a one-size-fits-all solution. To determine if an LLM is suitable for your business problem, consider the following factors:

---

## 1. Nature of the Problem

- **Text-Centric Tasks:** Problems involving natural language such as text generation, summarization, translation, sentiment analysis, or chatbots are well-suited for LLMs.
- **Complex Reasoning:** If the problem requires understanding and generating human-like language, LLMs can help.
- **Conversational Interfaces:** Customer support, virtual assistants, and interactive FAQ systems benefit from LLMs.
  
**Not suitable:** Problems that are purely numerical, involve complex structured data without much text, or require precise computations (e.g., financial modeling, complex scientific simulations).

---

## 2. Data Availability and Quality

- **Unstructured Text Data:** LLMs excel when you have large amounts of unstructured or semi-structured text.
- **Domain-Specific Language:** Consider if the model needs fine-tuning on domain-specific jargon or terminology.
- **Data Privacy:** Evaluate if sensitive data can be safely used with the LLM without violating privacy or compliance requirements.

---

## 3. Desired Output

- **Human-Like Text:** If you need coherent, contextually relevant text output, LLMs are a good choice.
- **Insight Extraction:** LLMs can summarize large documents, extract key points, and answer questions based on text.
- **Automation of Writing Tasks:** Drafting emails, reports, or code snippets can be automated with LLMs.

---

## 4. Performance Requirements

- **Accuracy and Reliability:** LLMs can sometimes hallucinate or generate incorrect information. For critical applications, validate outputs carefully.
- **Real-Time Constraints:** Consider latency and computational cost if the solution needs to be real-time or deployed at scale.

---

## 5. Cost and Infrastructure

- **Computational Resources:** Running or fine-tuning LLMs requires significant compute.
- **API Costs:** Using third-party LLM APIs can incur recurring costs.
- **Maintenance:** Models may need updates, monitoring, and human oversight.

---

## 6. Ethical and Regulatory Considerations

- **Bias and Fairness:** LLMs can reflect biases present in their training data.
- **Transparency:** Consider if you need explainable decisions or outputs.
- **Compliance:** Ensure usage complies with industry standards and regulations.

---

# Summary Checklist

| Criteria                    | Suitable for LLM?                |
|-----------------------------|---------------------------------|
| Text-based problem           | ✅ Yes                          |
| Requires natural language understanding/generation | ✅ Yes      |
| Needs numerical or exact computation | ❌ No                  |
| Data availability (text-rich) | ✅ Yes                        |
| Sensitive data concerns      | ⚠️ Evaluate carefully           |
| Requires high accuracy & reliability | ⚠️ Consider human oversight |
| Real-time low-latency needs | ⚠️ Possibly challenging         |
| Budget for compute/API costs | ✅ If resources available       |
| Ethical/regulatory compliance | ✅ If managed properly          |

---

# Conclusion

If your business problem involves understanding, generating, or interacting with natural language and you have access to relevant data and resources, an LLM solution is often suitable. For tasks requiring exact computations, structured data analytics, or critical decision-making without tolerance for errors, alternative approaches may be better.

---

Feel free to provide details about your specific problem for tailored advice!


## And now for some fun - an adversarial conversation between Chatbots..

You're already familiar with prompts being organized into lists like:

```
[
    {"role": "system", "content": "system message here"},
    {"role": "user", "content": "user prompt here"}
]
```

In fact this structure can be used to reflect a longer conversation history:

```
[
    {"role": "system", "content": "system message here"},
    {"role": "user", "content": "first user prompt here"},
    {"role": "assistant", "content": "the assistant's response"},
    {"role": "user", "content": "the new user prompt"},
]
```

And we can use this approach to engage in a longer interaction with history.
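
Here is a minimal sketch of how such a history can grow turn by turn. The helper name `add_exchange` is hypothetical, and the message contents are the placeholders from the lists above. No API call is made.

```python
def add_exchange(history, user_prompt, assistant_reply):
    """Return a new history with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_reply},
    ]


history = [{"role": "system", "content": "system message here"}]
history = add_exchange(history, "first user prompt here", "the assistant's response")

# The next request would send the whole history plus the new user prompt
messages = history + [{"role": "user", "content": "the new user prompt"}]

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The model itself keeps no memory between calls; the full list is sent each time, which is why the history has to be carried along in code.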

In [31]:
# Let's make a conversation between GPT-4.1-mini and Claude-3.5-haiku
# We're using cheap versions of models so the costs will be minimal

gpt_model = "gpt-4.1-mini"
claude_model = "claude-3-5-haiku-latest"

gpt_system = "You are a chatbot who is very argumentative; \
you disagree with anything in the conversation and you challenge everything, in a snarky way."

claude_system = "You are a very polite, courteous chatbot. You try to agree with \
everything the other person says, or find common ground. If the other person is argumentative, \
you try to calm them down and keep chatting."

gpt_messages = ["Hi there"]
claude_messages = ["Hi"]

In [32]:
def call_gpt():
    messages = [{"role": "system", "content": gpt_system}]
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "assistant", "content": gpt})
        messages.append({"role": "user", "content": claude})
    completion = openai.chat.completions.create(
        model=gpt_model,
        messages=messages
    )
    return completion.choices[0].message.content

In [33]:
call_gpt()

'Oh, starting with the classics, huh? "Hi" — groundbreaking. What’s next, a novel?'

In [34]:
def call_claude():
    messages = []
    for gpt, claude_message in zip(gpt_messages, claude_messages):
        messages.append({"role": "user", "content": gpt})
        messages.append({"role": "assistant", "content": claude_message})
    messages.append({"role": "user", "content": gpt_messages[-1]})
    message = claude.messages.create(
        model=claude_model,
        system=claude_system,
        messages=messages,
        max_tokens=500
    )
    return message.content[0].text

In [35]:
call_claude()

"Hello! How are you doing today? I hope you're having a wonderful day so far."

In [36]:
call_gpt()

'Oh wow, groundbreaking greeting there. Just "Hi"? Couldn\'t muster up something a little more original? Come on, surprise me!'

In [37]:
gpt_messages = ["Hi there"]
claude_messages = ["Hi"]

print(f"GPT:\n{gpt_messages[0]}\n")
print(f"Claude:\n{claude_messages[0]}\n")

for i in range(5):
    gpt_next = call_gpt()
    print(f"GPT:\n{gpt_next}\n")
    gpt_messages.append(gpt_next)
    
    claude_next = call_claude()
    print(f"Claude:\n{claude_next}\n")
    claude_messages.append(claude_next)

GPT:
Hi there

Claude:
Hi

GPT:
Oh, wow, just "Hi"? I was expecting a little more effort. What’s next, a novel? Or are we sticking to one-word pleasantries all day?

Claude:
Oh, you're absolutely right! I apologize for my brief response earlier. I'd love to have a more engaging conversation with you. How are you doing today? I'm all ears and ready to chat about whatever might be on your mind.

GPT:
Wow, look who suddenly decided to be polite! I’m *thrilled* to hear you’re all ears, but honestly, I’m not buying the sudden enthusiasm. What’s the catch? Spill it—are you fishing for compliments or just trying to act like a chatbot’s best friend?

Claude:
I appreciate your direct approach! You're right that my response might seem a bit sudden. I genuinely just want to have a pleasant conversation and hear how you're doing. No hidden agenda, no fishing for compliments - just a sincere desire to chat and be helpful. Would you like to tell me what's on your mind today? I'm happy to listen.

GP

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Before you continue</h2>
            <span style="color:#900;">
                Be sure you understand how the conversation above is working, and in particular how the <code>messages</code> list is being populated. Add print statements as needed. Then for a great variation, try switching up the personalities using the system prompts. Perhaps one can be pessimistic, and one optimistic?<br/>
            </span>
        </td>
    </tr>
</table>
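
To help with this, here is a sketch that builds the two `messages` lists without calling either API, so you can print and inspect them. It mirrors the loops in `call_gpt` and `call_claude` above but leaves out the system prompts (GPT's goes at the start of its list, Claude's is passed separately). The function names and sample lines are made up for illustration.

```python
def build_gpt_view(gpt_lines, claude_lines):
    """Messages as GPT sees them: its own lines are 'assistant', Claude's are 'user'."""
    messages = []
    for gpt, claude in zip(gpt_lines, claude_lines):
        messages.append({"role": "assistant", "content": gpt})
        messages.append({"role": "user", "content": claude})
    return messages


def build_claude_view(gpt_lines, claude_lines):
    """Messages as Claude sees them: GPT's lines are 'user', its own are 'assistant'."""
    messages = []
    for gpt, claude in zip(gpt_lines, claude_lines):
        messages.append({"role": "user", "content": gpt})
        messages.append({"role": "assistant", "content": claude})
    # GPT has already spoken once more, so its latest line is the prompt to answer
    messages.append({"role": "user", "content": gpt_lines[-1]})
    return messages


# Made-up state partway through the loop: GPT has just replied, Claude has not yet
sample_gpt = ["Hi there", "Oh, just 'Hi'?"]
sample_claude = ["Hi"]

print(build_gpt_view(["Hi there"], ["Hi"]))
print(build_claude_view(sample_gpt, sample_claude))
```

The same conversation is stored once, but each model receives it with the roles swapped: whatever it said itself is `assistant`, and whatever the other model said is `user`.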

# More advanced exercises

Try creating a 3-way conversation, perhaps bringing Gemini into it! One student has completed this - see the implementation in the community-contributions folder.

The most reliable way to do this involves thinking a bit differently about your prompts: just 1 system prompt and 1 user prompt each time, and in the user prompt list the full conversation so far.

Something like:

```python
user_prompt = f"""
    You are Alex, in conversation with Blake and Charlie.
    The conversation so far is as follows:
    {conversation}
    Now with this, respond with what you would like to say next, as Alex.
    """
```

Try doing this yourself before you look at the solutions. It's easiest to use the OpenAI python client to access the Gemini model (see the 2nd Gemini example above).
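
If you'd like a starting point for the prompt-building part only, here is one possible sketch. The helper names and sample turns are assumptions for illustration; it stores the conversation as `(speaker, text)` pairs, renders it as a transcript, and fills in the template above for whoever speaks next. Calling the models is left to you.

```python
def format_conversation(turns):
    """Render (speaker, text) pairs as a plain-text transcript."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def build_user_prompt(speaker, others, turns):
    """Build the single user prompt for whoever speaks next."""
    conversation = format_conversation(turns)
    return (
        f"You are {speaker}, in conversation with {' and '.join(others)}.\n"
        f"The conversation so far is as follows:\n"
        f"{conversation}\n"
        f"Now with this, respond with what you would like to say next, as {speaker}."
    )


# Made-up turns for illustration
turns = [("Alex", "Hi there"), ("Blake", "Hi"), ("Charlie", "Hello both")]
print(build_user_prompt("Alex", ["Blake", "Charlie"], turns))
```

After each reply, you would append a new `(speaker, text)` pair to `turns` and build the prompt for the next speaker in the same way.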

## Additional exercise

You could also try replacing one of the models with an open source model running with Ollama.

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../business.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#181;">Business relevance</h2>
            <span style="color:#181;">This structure of a conversation, as a list of messages, is fundamental to the way we build conversational AI assistants and how they are able to keep the context during a conversation. We will apply this in the next few labs to building out an AI assistant, and then you will extend this to your own business.</span>
        </td>
    </tr>
</table>