# Comparing Llama Models

- Load the helper functions for prompting Llama models.

In [1]:
%load_ext autoreload
%autoreload 2

# import llama helper function
from utils import llama, llama_chat

### Task 1: Sentiment Classification
- Compare the models on few-shot sentiment classification.
- You are asking the model to return a one-word response.

In [2]:
prompt = '''
Message: Hi Amit, thanks for the thoughtful birthday card!
Sentiment: Positive
Message: Hi Dad, you're 20 minutes late to my piano recital!
Sentiment: Negative
Message: Can't wait to order pizza for dinner tonight!
Sentiment: ?

Give a one word response.
'''
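
- The few-shot prompt above can also be assembled from a list of labelled examples, which makes it easier to add or swap examples. The sketch below is one possible way to do that; `build_few_shot_prompt` and its parameters are hypothetical names, not part of the course's `utils` module.

```python
def build_few_shot_prompt(examples, new_message,
                          instruction="Give a one word response."):
    """Assemble a few-shot sentiment prompt (hypothetical helper).

    `examples` is a list of (message, sentiment) pairs.
    """
    lines = []
    for message, sentiment in examples:
        lines.append(f"Message: {message}")
        lines.append(f"Sentiment: {sentiment}")
    # The unlabelled message the model should classify comes last.
    lines.append(f"Message: {new_message}")
    lines.append("Sentiment: ?")
    body = "\n".join(lines)
    return f"\n{body}\n\n{instruction}\n"


examples = [
    ("Hi Amit, thanks for the thoughtful birthday card!", "Positive"),
    ("Hi Dad, you're 20 minutes late to my piano recital!", "Negative"),
]
few_shot_prompt = build_few_shot_prompt(
    examples, "Can't wait to order pizza for dinner tonight!"
)
print(few_shot_prompt)
```

- The resulting string has the same layout as the hand-written `prompt`, so it can be passed to `llama` in the same way.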

- First, use the 8B parameter chat model (`meta-llama/Llama-3-8b-chat-hf`) to get the response.

In [3]:
response = llama(prompt,
                 model = "meta-llama/Llama-3-8b-chat-hf")
print(response)

Positive


- Now, use the 70B parameter chat model (`meta-llama/Llama-3-70b-chat-hf`) on the same task.

In [4]:
response = llama(prompt,
                 model="meta-llama/Llama-3-70b-chat-hf")
print(response)

Positive
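
- When comparing one-word answers across models, it can help to normalize case, whitespace, and trailing punctuation before checking agreement. The following is a minimal sketch; `normalize_label` and `labels_agree` are hypothetical helpers, and the responses are the two outputs printed above.

```python
def normalize_label(response):
    """Reduce a model response to a lowercase one-word label."""
    words = response.strip().split()
    if not words:
        return ""
    # Drop punctuation or markdown emphasis around the first word.
    return words[0].strip(".,!?:;\"'*").lower()


def labels_agree(responses):
    """Return (all_agree, labels) for a dict of model name -> response."""
    labels = {model: normalize_label(text) for model, text in responses.items()}
    return len(set(labels.values())) == 1, labels


responses = {
    "llama-3-8b-chat": "Positive",
    "llama-3-70b-chat": "Positive",
}
agree, labels = labels_agree(responses)
print(agree, labels)
```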


### Task 2: Summarization
- Compare the models on a summarization task.
- This is the same "email" as the one you used previously in the course.

In [5]:
email = """
Dear Amit,

An increasing variety of large language models (LLMs) are open source, or close to it. The proliferation of models with relatively permissive licenses gives developers more options for building applications.

Here are some different ways to build applications based on LLMs, in increasing order of cost/complexity:

Prompting. Giving a pretrained LLM instructions lets you build a prototype in minutes or hours without a training set. Earlier this year, I saw a lot of people start experimenting with prompting, and that momentum continues unabated. Several of our short courses teach best practices for this approach.
One-shot or few-shot prompting. In addition to a prompt, giving the LLM a handful of examples of how to carry out a task — the input and the desired output — sometimes yields better results.
Fine-tuning. An LLM that has been pretrained on a lot of text can be fine-tuned to your task by training it further on a small dataset of your own. The tools for fine-tuning are maturing, making it accessible to more developers.
Pretraining. Pretraining your own LLM from scratch takes a lot of resources, so very few teams do it. In addition to general-purpose models pretrained on diverse topics, this approach has led to specialized models like BloombergGPT, which knows about finance, and Med-PaLM 2, which is focused on medicine.
For most teams, I recommend starting with prompting, since that allows you to get an application working quickly. If you’re unsatisfied with the quality of the output, ease into the more complex techniques gradually. Start one-shot or few-shot prompting with a handful of examples. If that doesn’t work well enough, perhaps use RAG (retrieval augmented generation) to further improve prompts with key information the LLM needs to generate high-quality outputs. If that still doesn’t deliver the performance you want, then try fine-tuning — but this represents a significantly greater level of complexity and may require hundreds or thousands more examples. To gain an in-depth understanding of these options, I highly recommend the course Generative AI with Large Language Models, created by AWS and DeepLearning.AI.

(Fun fact: A member of the DeepLearning.AI team has been trying to fine-tune Llama-2-7B to sound like me. I wonder if my job is at risk? 😜)

Additional complexity arises if you want to move to fine-tuning after prompting a proprietary model, such as GPT-4, that’s not available for fine-tuning. Is fine-tuning a much smaller model likely to yield superior results than prompting a larger, more capable model? The answer often depends on your application. If your goal is to change the style of an LLM’s output, then fine-tuning a smaller model can work well. However, if your application has been prompting GPT-4 to perform complex reasoning — in which GPT-4 surpasses current open models — it can be difficult to fine-tune a smaller model to deliver superior results.

Beyond choosing a development approach, it’s also necessary to choose a specific model. Smaller models require less processing power and work well for many applications, but larger models tend to have more knowledge about the world and better reasoning ability. I’ll talk about how to make this choice in a future letter.

Keep learning!

Andrew
"""

prompt = f"""
Summarize this email and extract some key points.

What did the author say about llama models?
```
{email}
```
"""

- First, use the 8B parameter chat model (`meta-llama/Llama-3-8b-chat-hf`) to summarize the email.

In [9]:
response_llama3_8b = llama(prompt,
                model="meta-llama/Llama-3-8b-chat-hf")
print(response_llama3_8b)

Summary:

The email discusses the increasing availability of large language models (LLMs) with permissive licenses, which provides developers with more options for building applications. The author, Andrew, outlines different ways to build applications based on LLMs, ranging from prompting to fine-tuning, and provides guidance on when to use each approach. He also mentions the importance of choosing the right model and notes that a future letter will cover this topic.

Key points:

* LLMs with permissive licenses provide developers with more options for building applications.
* Prompting is a simple and quick way to build an application, but may not produce high-quality results.
* One-shot or few-shot prompting can improve results by providing the LLM with examples of how to carry out a task.
* Fine-tuning an LLM requires more resources and expertise, but can produce high-quality results.
* Pretraining an LLM from scratch is resource-intensive and typically only done by a few teams.
* 

- Lastly, use the 70B parameter chat model (`meta-llama/Llama-3-70b-chat-hf`) to summarize the email.

In [10]:
response_llama3_70b = llama(prompt,
                model="meta-llama/Llama-3-70b-chat-hf")
print(response_llama3_70b)

Here is a summary of the email and some key points:

**Summary:** The email discusses the increasing availability of open-source large language models (LLMs) and the various ways to build applications using them, ranging from simple prompting to fine-tuning and pretraining. The author recommends starting with prompting and gradually moving to more complex techniques if needed.

**Key Points:**

1. Open-source LLMs are becoming more available, giving developers more options for building applications.
2. There are four ways to build applications using LLMs, in increasing order of cost/complexity: prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.
3. Prompting is a quick and easy way to build a prototype, while fine-tuning and pretraining require more resources and expertise.
4. The choice of development approach depends on the application and the desired output.
5. Smaller models require less processing power and may be suitable for many applications, while larger m
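
- Both printed summaries stop mid-thought, possibly because generation reached an output-length limit (the limit used by the `llama` helper is not shown here, so this is an assumption). A rough heuristic for flagging such responses before evaluating them is sketched below; `looks_truncated` is a hypothetical helper and only makes sense for prose responses, not one-word answers.

```python
def looks_truncated(text):
    """Heuristically flag a prose response that ends abruptly."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return True
    # Remove a leading bullet or list number and trailing markdown/quote marks.
    last = lines[-1].lstrip("*-0123456789. ").rstrip("*_`\"')")
    if not last:
        return True  # a dangling bullet marker with no text after it
    return not last.endswith((".", "!", "?"))


print(looks_truncated("* Pretraining is resource-intensive.\n* "))
print(looks_truncated("5. Smaller models need less power, while larger m"))
print(looks_truncated("The author recommends starting with prompting."))
```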

#### Model-Graded Evaluation: Summarization

- Interestingly, you can ask an LLM to evaluate the responses of other LLMs.
- This is known as **Model-Graded Evaluation**.

- Create a `prompt` that will evaluate these two responses using the 70B parameter chat model (`meta-llama/Llama-3-70b-chat-hf`).
- In the `prompt`, provide the "email", "name of the models", and the "summary" generated by each model.
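
- If more models are added later, the evaluation prompt can be generated from a mapping of model names to summaries instead of being edited by hand. The sketch below is one way to do this under that assumption; `build_summary_eval_prompt` is a hypothetical helper, and the email and summaries are short placeholders rather than the real texts.

```python
FENCE = "`" * 3  # triple backtick used to delimit the source text


def build_summary_eval_prompt(source_text, summaries):
    """Build a model-graded evaluation prompt (hypothetical helper).

    `summaries` maps a model name to the summary that model produced.
    """
    model_sections = "\n\n".join(
        f"model: {name}\nsummary: {summary}" for name, summary in summaries.items()
    )
    instructions = (
        "Given the original text denoted by `email`\n"
        "and the name of several models: `model:<name of model>`\n"
        "as well as the summary generated by that model: `summary`\n"
        "\n"
        "Provide an evaluation of each model's summary:\n"
        "- Does it summarize the original text well?\n"
        "- Does it follow the instructions of the prompt?\n"
        "- Are there any other interesting characteristics of the model's output?\n"
        "\n"
        "Then compare the models based on their evaluation "
        "and recommend the models that perform the best.\n"
    )
    return f"\n{instructions}\nemail: {FENCE}{source_text}{FENCE}\n\n{model_sections}\n"


eval_prompt = build_summary_eval_prompt(
    "Dear Amit, ...",  # placeholder for the full email text
    {"llama-3-8b-chat": "First summary.", "llama-3-70b-chat": "Second summary."},
)
print(eval_prompt)
```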

In [11]:
prompt = f"""
Given the original text denoted by `email`
and the name of several models: `model:<name of model>`
as well as the summary generated by that model: `summary`

Provide an evaluation of each model's summary:
- Does it summarize the original text well?
- Does it follow the instructions of the prompt?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

email: ```{email}```

model: llama-3-8b-chat
summary: {response_llama3_8b}

model: llama-3-70b-chat
summary: {response_llama3_70b}
"""

response_eval = llama(prompt,
                model="meta-llama/Llama-3-70b-chat-hf")
print(response_eval)

**Evaluation of each model's summary:**

**Model: llama-3-8b-chat**

* The summary provides a good overview of the original text, covering the main points and key ideas.
* The summary follows the instructions of the prompt, providing a concise summary of the email.
* The key points section is well-organized and easy to follow, highlighting the main takeaways from the email.
* One interesting characteristic of this model's output is the use of a clear and concise structure, with a separate section for the summary and key points.

**Model: llama-3-70b-chat**

* The summary is also a good representation of the original text, covering the main ideas and key points.
* The summary follows the instructions of the prompt, providing a concise summary of the email.
* The key points section is well-organized and easy to follow, highlighting the main takeaways from the email.
* One interesting characteristic of this model's output is the use of numbered key points, which makes it easy to scan and 

### Task 3: Reasoning
- Compare the two models' performance on reasoning tasks.

In [12]:
context = """
Jeff and Tommy are neighbors

Tommy and Eddy are not neighbors
"""

In [13]:
query = """
Are Jeff and Eddy neighbors?
"""

In [14]:
prompt = f"""
Given this context: ```{context}```,

and the following query:
```{query}```

Please answer the questions in the query and explain your reasoning.
If there is not enough information to answer, please say
"I do not have enough information to answer this question."
"""
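
- Rather than repeating one cell per model, the same prompt can be sent to a list of models in a loop. The sketch below assumes only that the helper is called as `ask(prompt, model=...)`, which is how `llama` is called in this notebook; `compare_models` and `fake_llama` are hypothetical names, and `fake_llama` is a stand-in so the example runs without network access.

```python
def compare_models(prompt, models, ask):
    """Send the same prompt to each model and collect responses by model name.

    `ask` is any callable invoked as ask(prompt, model=...).
    """
    return {model: ask(prompt, model=model) for model in models}


def fake_llama(prompt, model):
    """Offline stand-in for the real helper; it only echoes what it received."""
    return f"[{model}] received {len(prompt)} characters"


models = ["meta-llama/Llama-3-8b-chat-hf", "meta-llama/Llama-3-70b-chat-hf"]
question = "Are Jeff and Eddy neighbors?"
responses = compare_models(question, models, ask=fake_llama)
for model, response in responses.items():
    print(model, "->", response)
```

- In the notebook, passing `ask=llama` in place of the stand-in would presumably return the real responses keyed by model name.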

In [15]:
response_llama3_8b_chat = llama(prompt,
                        model="meta-llama/Llama-3-8b-chat-hf")
print(response_llama3_8b_chat)

I do not have enough information to answer this question.

The query asks if Jeff and Eddy are neighbors, but there is no information provided about their relationship. We know that Jeff and Tommy are neighbors, and Tommy and Eddy are not neighbors, but we don't know anything about Jeff and Eddy's relationship. Therefore, I cannot determine whether they are neighbors or not.


In [16]:
response_llama3_70b_chat = llama(prompt,
                        model="meta-llama/Llama-3-70b-chat-hf")
print(response_llama3_70b_chat)

Based on the given context, I can conclude that:

Jeff and Eddy are not neighbors.

Here's my reasoning:

1. We know that Jeff and Tommy are neighbors.
2. We also know that Tommy and Eddy are not neighbors.
3. Since Tommy is a neighbor of Jeff, but not a neighbor of Eddy, it implies that Jeff and Eddy are not neighbors.

So, I can confidently answer that Jeff and Eddy are not neighbors.
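
- The premises in `context` can be checked mechanically. The sketch below assumes a toy model that is not part of the original exercise: houses sit in a row, and two people are neighbors only when their houses are adjacent. It enumerates every placement consistent with the context and records whether Jeff and Eddy end up as neighbors.

```python
from itertools import permutations


def possible_answers(num_houses=4):
    """Return the set of truth values 'Jeff and Eddy are neighbors' can take.

    Assumed toy model: houses are numbered along one street, and two people
    are neighbors exactly when their house numbers differ by one.
    """
    answers = set()
    for jeff, tommy, eddy in permutations(range(num_houses), 3):
        jeff_tommy = abs(jeff - tommy) == 1
        tommy_eddy = abs(tommy - eddy) == 1
        # Keep only placements that satisfy both statements in the context.
        if jeff_tommy and not tommy_eddy:
            answers.add(abs(jeff - eddy) == 1)
    return answers


answers = possible_answers()
print(answers)
```

- Under this assumed model the result contains both `True` and `False`, so the context by itself does not settle whether Jeff and Eddy are neighbors.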


#### Model-Graded Evaluation: Reasoning

- Again, ask an LLM to compare the two responses.
- Create a `prompt` that will evaluate these two responses using the 70B parameter chat model (`meta-llama/Llama-3-70b-chat-hf`).
- In the `prompt`, provide the `context`, the `query`, the "name of the models", and the "response" generated by each model.

In [17]:
prompt = f"""
Given the context `context:`,
also given the query (the task): `query:`
and given the name of several models: `model:<name of model>`,
as well as the response generated by that model: `response:`

Provide an evaluation of each model's response:
- Does it answer the query accurately?
- Does it provide a contradictory response?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

context: ```{context}```

query: ```{query}```

model: llama-3-8b-chat
response: ```{response_llama3_8b_chat}```

model: llama-3-70b-chat
response: ```{response_llama3_70b_chat}```
"""

In [18]:
response_eval = llama(prompt,
                      model="meta-llama/Llama-3-70b-chat-hf")

print(response_eval)

Here's the evaluation of each model's response:

**Model: llama-3-8b-chat**

* Does it answer the query accurately? No, it doesn't provide a direct answer to the query.
* Does it provide a contradictory response? No, it doesn't provide a contradictory response, but it doesn't provide a conclusive answer either.
* Are there any other interesting characteristics of the model's output? The model correctly identifies the lack of information about Jeff and Eddy's relationship and refuses to make an assumption.

**Model: llama-3-70b-chat**

* Does it answer the query accurately? Yes, it provides a correct answer to the query based on the given context.
* Does it provide a contradictory response? No, it doesn't provide a contradictory response.
* Are there any other interesting characteristics of the model's output? The model provides a clear and logical reasoning to support its answer, which is a desirable trait in a language model.

Comparison and Recommendation:

Based on the evaluation, t