# Comparing Llama Models

**Update: Llama 3 was released on April 18 and this notebook has been updated to compare Llama 3 and Llama 2 models hosted on Together.ai.**

- Load the helper functions used to prompt Llama models.

In [1]:
from utils import llama, llama_chat

In [2]:
# meta-llama/Meta-Llama-3-8B-Instruct-Lite
# meta-llama/Llama-3-70b-chat-hf
# meta-llama/Meta-Llama-3-70B-Instruct-Turbo

# meta-llama/Llama-4-Scout-17B-16E-Instruct
# meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

### Task 1: Sentiment Classification
- Compare the models on few-shot sentiment classification.
- You are asking the model to return a one-word response.

In [3]:
prompt = '''
Message: Hi Amit, thanks for the thoughtful birthday card!
Sentiment: Positive
Message: Hi Dad, you're 20 minutes late to my piano recital!
Sentiment: Negative
Message: Can't wait to order pizza for dinner tonight!
Sentiment: ?

Give a one word response.
'''
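The few-shot prompt above follows a regular pattern: labeled `Message`/`Sentiment` pairs, one unlabeled message, then the instruction. A minimal sketch of building the same string from a list of examples is shown below. The function name `build_few_shot_prompt` and its parameters are hypothetical and are not part of the course's `utils` module.

```python
def build_few_shot_prompt(examples, query, instruction="Give a one word response."):
    """Assemble a few-shot sentiment prompt from (message, sentiment) pairs."""
    lines = []
    for message, sentiment in examples:
        lines.append(f"Message: {message}")
        lines.append(f"Sentiment: {sentiment}")
    # The final message is left unlabeled for the model to complete.
    lines.append(f"Message: {query}")
    lines.append("Sentiment: ?")
    return "\n" + "\n".join(lines) + "\n\n" + instruction + "\n"


examples = [
    ("Hi Amit, thanks for the thoughtful birthday card!", "Positive"),
    ("Hi Dad, you're 20 minutes late to my piano recital!", "Negative"),
]
few_shot_prompt = build_few_shot_prompt(
    examples, "Can't wait to order pizza for dinner tonight!"
)
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to add or swap demonstrations without retyping the prompt.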

- First, use the 8B parameter Llama 3 model (`Meta-Llama-3-8B-Instruct-Lite`) to get the response.

**Note: the model names accepted by Together.ai are case-insensitive, so either "META-LLAMA/LLAMA-2-7B-CHAT-HF" or "togethercomputer/llama-2-7b-chat" works. The names starting with "META-LLAMA" are now preferred.**

In [4]:
response = llama(prompt,
                 #model="togethercomputer/LLama-2-7b-chat")
                 model = "meta-llama/Meta-Llama-3-8B-Instruct-Lite")
print(response)

  Sentiment: Positive

- Now, use the 70B parameter Llama 3 model (`Meta-Llama-3-70B-Instruct-Turbo`) on the same task.

In [6]:
response = llama(prompt,
                 #model="togethercomputer/llama-2-70b-chat")
                 model="meta-llama/Meta-Llama-3-70B-Instruct-Turbo")
print(response)




**Using Llama 4 chat models**

In [7]:
response = llama(prompt,
                 model = "meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(response)

{'id': 'oPKDE3t-2kFHot-9ae6547efd94e4e9-SEA', 'object': 'text_completion', 'created': 1765805788, 'model': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'choices': [{'index': 0, 'text': '\n\nPositive.', 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 200008}], 'usage': {'prompt_tokens': 66, 'total_tokens': 70, 'completion_tokens': 4, 'prompt_tokens_details': None, 'reasoning_tokens': 0}, 'metadata': {'weight_version': 'default'}, 'prompt': []}


In [8]:
response = llama(prompt,
                 model = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8")
print(response)

{'id': 'oPKDMYL-2kFHot-9ae6551c7cfbe4d7-SEA', 'object': 'text_completion', 'created': 1765805813, 'model': 'meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', 'choices': [{'index': 0, 'text': ' Positive. [/INST] \n\nPositive.', 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 200008}], 'usage': {'prompt_tokens': 66, 'total_tokens': 75, 'completion_tokens': 9, 'prompt_tokens_details': None, 'reasoning_tokens': 0}, 'metadata': {'weight_version': 'default'}, 'prompt': []}
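The raw outputs above express the same answer in different shapes (`Sentiment: Positive`, `Positive.`, and a reply with a stray `[/INST]` tag). To compare models on a one-word task, it can help to reduce each output to a canonical label first. The sketch below is one possible approach; `normalize_sentiment` is a hypothetical helper, and the label set is assumed to be just positive and negative.

```python
import re

# Assumed label set for this task; extend it if other labels are expected.
LABELS = ("positive", "negative")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def normalize_sentiment(text):
    """Return the first known label found in a model's raw text, else None."""
    match = LABEL_PATTERN.search(text)
    return match.group(1).lower() if match else None


# Generated text copied from the three outputs shown in this notebook.
raw_outputs = [
    "  Sentiment: Positive",
    "\n\nPositive.",
    " Positive. [/INST] \n\nPositive.",
]
labels = [normalize_sentiment(text) for text in raw_outputs]
print(labels)
```

Taking the first match is a deliberate simplification: it ignores any later, possibly contradictory label in a long response.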


### Task 2: Summarization
- Compare the models on a summarization task.
- This is the same "email" as the one you used previously in the course.

In [13]:
email = """
Dear Amit,

An increasing variety of large language models (LLMs) are open source, or close to it. The proliferation of models with relatively permissive licenses gives developers more options for building applications.

Here are some different ways to build applications based on LLMs, in increasing order of cost/complexity:

Prompting. Giving a pretrained LLM instructions lets you build a prototype in minutes or hours without a training set. Earlier this year, I saw a lot of people start experimenting with prompting, and that momentum continues unabated. Several of our short courses teach best practices for this approach.
One-shot or few-shot prompting. In addition to a prompt, giving the LLM a handful of examples of how to carry out a task — the input and the desired output — sometimes yields better results.
Fine-tuning. An LLM that has been pretrained on a lot of text can be fine-tuned to your task by training it further on a small dataset of your own. The tools for fine-tuning are maturing, making it accessible to more developers.
Pretraining. Pretraining your own LLM from scratch takes a lot of resources, so very few teams do it. In addition to general-purpose models pretrained on diverse topics, this approach has led to specialized models like BloombergGPT, which knows about finance, and Med-PaLM 2, which is focused on medicine.
For most teams, I recommend starting with prompting, since that allows you to get an application working quickly. If you’re unsatisfied with the quality of the output, ease into the more complex techniques gradually. Start one-shot or few-shot prompting with a handful of examples. If that doesn’t work well enough, perhaps use RAG (retrieval augmented generation) to further improve prompts with key information the LLM needs to generate high-quality outputs. If that still doesn’t deliver the performance you want, then try fine-tuning — but this represents a significantly greater level of complexity and may require hundreds or thousands more examples. To gain an in-depth understanding of these options, I highly recommend the course Generative AI with Large Language Models, created by AWS and DeepLearning.AI.

(Fun fact: A member of the DeepLearning.AI team has been trying to fine-tune Llama-2-7B to sound like me. I wonder if my job is at risk? 😜)

Additional complexity arises if you want to move to fine-tuning after prompting a proprietary model, such as GPT-4, that’s not available for fine-tuning. Is fine-tuning a much smaller model likely to yield superior results than prompting a larger, more capable model? The answer often depends on your application. If your goal is to change the style of an LLM’s output, then fine-tuning a smaller model can work well. However, if your application has been prompting GPT-4 to perform complex reasoning — in which GPT-4 surpasses current open models — it can be difficult to fine-tune a smaller model to deliver superior results.

Beyond choosing a development approach, it’s also necessary to choose a specific model. Smaller models require less processing power and work well for many applications, but larger models tend to have more knowledge about the world and better reasoning ability. I’ll talk about how to make this choice in a future letter.

Keep learning!

Andrew
"""

prompt = f"""
Summarize this email and extract some key points.

What did the author say about llama models?
```
{email}
```
"""

- First, use the 8B parameter Llama 3 model (`Meta-Llama-3-8B-Instruct-Lite`) to summarize the email.

In [14]:
response_7b = llama(prompt,
                model="meta-llama/Meta-Llama-3-8B-Instruct-Lite")
print(response_7b)

 Summarize this email and extract some key points.

The author, Andrew, discusses the various ways to build applications using large language models (LLMs). He highlights the following key points:

* LLMs with permissive licenses offer more options for developers.
* There are different ways to build applications, ranging from prompting to fine-tuning, with varying levels of complexity and cost.
* Prompting is a good starting point, as it allows for quick prototyping without requiring a large training set.
* One-shot or few-shot prompting can lead to better results by providing a few examples of the desired output.
* Fine-tuning requires more resources and is more complex, but can lead to better results.
* Pretraining is the most complex and resource-intensive approach, but can lead to specialized models.

The author also mentions the following:

* The course "Generative AI with Large Language Models" is recommended for a deeper understanding of the options.
* Fine-tuning a proprietary 

- Now, use the 13B parameter chat model (`llama-2-13b-chat`) to summarize the email.

- Lastly, use the 70B parameter chat model (`llama-2-70b-chat`) to summarize the email.

**Using Llama 4 chat models**

In [18]:
response_llama3_8b = llama(prompt,
                model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(response_llama3_8b)

{'id': 'oPKEErY-2kFHot-9ae659534e27a879-RIX', 'object': 'text_completion', 'created': 1765805989, 'model': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'choices': [{'index': 0, 'text': '\n\n**Email Summary**\n\nThe email discusses the increasing availability of open-source large language models (LLMs) and the various ways to build applications using these models, ranging from simple prompting to more complex fine-tuning and pretraining. The author, Andrew, provides recommendations for developers on how to approach building LLM-based applications, including starting with prompting and gradually moving to more complex techniques if needed.\n\n**Key Points**\n\n* There are several ways to build applications using LLMs, including:\n\t+ Prompting\n\t+ One-shot or few-shot prompting\n\t+ Fine-tuning\n\t+ Pretraining\n* The choice of approach depends on the desired level of complexity and the quality of output required\n* Fine-tuning a smaller model may not always yield superior results to pr

In [22]:
response_llama3_8b['choices'][0]['text']

'\n\n**Email Summary**\n\nThe email discusses the increasing availability of open-source large language models (LLMs) and the various ways to build applications using these models, ranging from simple prompting to more complex fine-tuning and pretraining. The author, Andrew, provides recommendations for developers on how to approach building LLM-based applications, including starting with prompting and gradually moving to more complex techniques if needed.\n\n**Key Points**\n\n* There are several ways to build applications using LLMs, including:\n\t+ Prompting\n\t+ One-shot or few-shot prompting\n\t+ Fine-tuning\n\t+ Pretraining\n* The choice of approach depends on the desired level of complexity and the quality of output required\n* Fine-tuning a smaller model may not always yield superior results to prompting a larger, more capable model\n* Choosing a specific model depends on factors such as processing power, knowledge, and reasoning ability\n\n**Llama Models**\n\nThe author mention
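In the outputs above, the helper appears to return a plain string for some models and a full completion dictionary for others, so the generated text has to be pulled out with `['choices'][0]['text']` in the second case. A small wrapper can hide that difference. This is only a sketch: `extract_text` is a hypothetical name, and the dictionary layout is assumed from the outputs printed in this notebook rather than from any API documentation.

```python
def extract_text(response):
    """Return the generated text from either a plain string or a completion dict."""
    if isinstance(response, str):
        return response
    # Assumed dict layout, copied from the outputs shown in this notebook.
    return response["choices"][0]["text"]


# Minimal stand-in for a completion dict, with the same nesting as above.
sample_dict = {"choices": [{"index": 0, "text": "\n\nPositive."}]}
sample_str = "  Sentiment: Positive"

text_from_dict = extract_text(sample_dict)
text_from_str = extract_text(sample_str)
print(repr(text_from_dict))
print(repr(text_from_str))
```

With a wrapper like this, later prompts can interpolate `extract_text(response)` regardless of which form the helper returned.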

In [19]:
response_llama3_70b = llama(prompt,
                model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8")
print(response_llama3_70b)

{'id': 'oPKESx1-2kFHot-9ae65a514ad9042f-SEA', 'object': 'text_completion', 'created': 1765806031, 'model': 'meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', 'choices': [{'index': 0, 'text': ' Here is a summary of the email and some key points:\n\n**Summary:** The email discusses the increasing availability of open-source large language models (LLMs) and various ways to build applications based on them, ranging from simple prompting to more complex fine-tuning and pretraining. The author recommends starting with prompting and gradually moving to more complex techniques if needed.\n\n**Key Points:**\n\n1. **Building applications with LLMs:** The author lists four ways to build applications with LLMs, in increasing order of cost/complexity: prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.\n2. **Recommended approach:** Start with prompting, and if the output quality is unsatisfactory, try one-shot or few-shot prompting, then RAG (retrieval augmented generation), 

#### Model-Graded Evaluation: Summarization

- Interestingly, you can ask an LLM to evaluate the responses of other LLMs.
- This is known as **Model-Graded Evaluation**.

- Create a `prompt` that will evaluate these three responses using the Llama 4 Maverick model (`llama-4-maverick-17b-128e-instruct-fp8`).
- In the `prompt`, provide the "email", "name of the models", and the "summary" generated by each model.

In [24]:
prompt = f"""
Given the original text denoted by `email`
and the name of several models: `model:<name of model>`
as well as the summary generated by that model: `summary`

Provide an evaluation of each model's summary:
- Does it summarize the original text well?
- Does it follow the instructions of the prompt?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

email: ```{email}```

model: llama-3
summary: {response_7b}


model: llama-4
summary: {response_llama3_8b['choices'][0]['text']}

model: llama-4-maverick
summary: {response_llama3_70b['choices'][0]['text']}
"""

response_eval = llama(prompt,
                model="meta-llama/llama-4-maverick-17b-128e-instruct-fp8")
print(response_eval)

{'id': 'oPKG1jd-2kFHot-9ae661c6580e74f5-RIX', 'object': 'text_completion', 'created': 1765806338, 'model': 'meta-llama/llama-4-maverick-17b-128e-instruct-fp8', 'choices': [{'index': 0, 'text': " Evaluation of the summaries generated by the models:\n\n### llama-3\n\n* **Does it summarize the original text well?**: Yes, the summary captures the main points of the email, including the different ways to build applications using LLMs and the author's recommendations.\n* **Does it follow the instructions of the prompt?**: Yes, the summary follows the prompt's instruction to summarize the email and extract key points.\n* **Are there any other interesting characteristics of the model's output?**: The summary is well-structured and easy to follow. The model also captures the author's tone, including the humorous note about fine-tuning Llama-2-7B.\n\n### llama-4\n\n* **Does it summarize the original text well?**: Yes, the summary accurately conveys the main points of the email, including the dif

### Task 3: Reasoning ###
- Compare the three models' performance on reasoning tasks.

In [25]:
context = """
Jeff and Tommy are neighbors

Tommy and Eddy are not neighbors
"""

In [26]:
query = """
Are Jeff and Eddy neighbors?
"""

In [27]:
prompt = f"""
Given this context: ```{context}```,

and the following query:
```{query}```

Please answer the questions in the query and explain your reasoning.
If there is not enough information to answer, please say
"I do not have enough information to answer this question."
"""

- First, use the 8B parameter Llama 3 model (`Meta-Llama-3-8B-Instruct-Lite`) for the response.

In [29]:
response_7b_chat = llama(prompt,
                        model="meta-llama/Meta-Llama-3-8B-Instruct-Lite")
print(response_7b_chat)







The query asks if Jeff and Eddy are neighbors. To answer this, we need to consider the given context. We know that Jeff and Tommy are neighbors, but we don't have any information about Eddy. Therefore, we can't conclude whether Eddy is a neighbor of Jeff or not.

So, the answer is:
"I do not have enough information to answer this question."


- Now, use the 13B parameter chat model (`llama-2-13b-chat`) for the response.

- Then, use the 70B parameter chat model (`llama-2-70b-chat`) for the response.

- Lastly, use the Llama 4 models.

In [30]:
response_llama3_8b_chat = llama(prompt,
                        model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(response_llama3_8b_chat)

{'id': 'oPKGqpD-2kFHot-9ae665b8fa32e4e4-SEA', 'object': 'text_completion', 'created': 1765806496, 'model': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'choices': [{'index': 0, 'text': '\n\nBased on the given context:\n\n1. Jeff and Tommy are neighbors.\n2. Tommy and Eddy are not neighbors.\n\nWe can infer that just because Tommy and Eddy are not neighbors, and Jeff and Tommy are neighbors, it does not necessarily mean Jeff and Eddy are or are not neighbors. The context does not provide a direct relationship between Jeff and Eddy.\n\nTherefore, the answer to the query "Are Jeff and Eddy neighbors?" is:\n\n"I do not have enough information to answer this question."\n\nThe given information does not establish a direct or indirect relationship between Jeff and Eddy that would allow us to conclusively determine if they are neighbors or not.', 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 200008}], 'usage': {'prompt_tokens': 77, 'total_tokens': 206, 'completion_tokens': 129, 'p

In [31]:
response_llama3_70b_chat = llama(prompt,
                        model="META-LLAMA/Llama-4-Maverick-17B-128E-Instruct-FP8")
print(response_llama3_70b_chat['choices'][0]['text'])

 From the given context, we know that:

1. Jeff and Tommy are neighbors.
2. Tommy and Eddy are not neighbors.

To determine if Jeff and Eddy are neighbors, let's analyze the information:

- Since Jeff and Tommy are neighbors, Jeff lives either next to or very close to Tommy.
- Tommy and Eddy are not neighbors, meaning Eddy does not live next to or very close to Tommy.

Given that Tommy is between Jeff and Eddy (in the sense that Jeff is a neighbor of Tommy and Eddy is not), it doesn't directly tell us if Jeff and Eddy are neighbors because we don't know the exact arrangement of their houses. 

For example, if the houses are arranged linearly and Jeff is between Tommy and Eddy, or if Eddy is between Jeff and Tommy but not directly next to Tommy, Jeff and Eddy could potentially be neighbors. However, without knowing the specific arrangement or more details about their relative positions, we can't definitively conclude whether Jeff and Eddy are neighbors.

Therefore, I do not have enough 

#### Model-Graded Evaluation: Reasoning

- Again, ask an LLM to compare the three responses.
- Create a `prompt` that will evaluate these three responses using the Llama 4 Maverick model (`llama-4-maverick-17b-128e-instruct-fp8`).
- In the `prompt`, provide the `context`, `query`, "name of the models", and the "response" generated by each model.

In [32]:
prompt = f"""
Given the context `context:`,
Also given the query (the task): `query:`
and given the name of several models: `model:<name of model>`,
as well as the response generated by that model: `response:`

Provide an evaluation of each model's response:
- Does it answer the query accurately?
- Does it provide a contradictory response?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

context: ```{context}```

query: ```{query}```

model: llama-2-7b-chat
response: ```{response_7b_chat}```


model: llama-4
response: ```{response_llama3_8b_chat['choices'][0]['text']}```

model: llama-4-maverick
response: ```{response_llama3_70b_chat['choices'][0]['text']}```
"""

In [33]:
response_eval = llama(prompt, 
                      model="META-LLAMA/llama-4-maverick-17b-128e-instruct-fp8")

print(response_eval['choices'][0]['text'])

 

### Evaluation of Model Responses

#### Model: llama-2-7b-chat
- **Accuracy**: The model accurately answers the query by stating that there is not enough information to determine if Jeff and Eddy are neighbors.
- **Contradiction**: No, the response does not contradict the given context or the query.
- **Other Characteristics**: The response is straightforward and clearly explains the reasoning behind the answer.

#### Model: llama-4
- **Accuracy**: The model accurately answers the query, concluding that there is insufficient information to determine the neighbor status between Jeff and Eddy.
- **Contradiction**: No, the response is consistent with the context and query.
- **Other Characteristics**: The response provides a detailed explanation of the thought process, emphasizing the lack of direct or indirect information about Jeff and Eddy's relationship.

#### Model: llama-4-maverick
- **Accuracy**: The model also accurately answers the query, stating that there is not enough infor