## Comparing Llama Models

In [12]:
from modules.utils import llama, llama_chat

### Task 1: Sentiment Classification

Compare the models on sentiment classification using a few-shot prompt.

You are asking the model to return a one-word response.

In [13]:
prompt = '''
Message: Hi Amit, thanks for the thoughtful birthday card!
Sentiment: Positive
Message: Hi Dad, you're 20 minutes late to my piano recital!
Sentiment: Negative
Message: Can't wait to order pizza for dinner tonight!
Sentiment: ?

Give a one word response.
'''

First, use the 7B parameter chat model (llama-2-7b-chat) to get the response.

In [14]:
response = llama(prompt,
                 model="togethercomputer/llama-2-7b-chat")
print(response)

  Hungry


Now, use the 70B parameter chat model (llama-2-70b-chat) on the same task.

In [15]:
response = llama(prompt,
                 model="togethercomputer/llama-2-70b-chat")
print(response)

  Positive
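
The 7B model's answer ("Hungry") is a single word but not a sentiment label, so it can help to check each response against the expected labels before comparing models. The sketch below shows one possible check. `normalize_sentiment` and `ALLOWED_LABELS` are hypothetical names introduced here, not part of `modules.utils`, and the label set is an assumption based on the two labels in the few-shot examples.

```python
# Hypothetical helper, not part of the course's modules.utils.
# Assumption: the only valid labels are the two used in the few-shot examples.
ALLOWED_LABELS = {"positive", "negative"}


def normalize_sentiment(response):
    """Return the label in title case, or None if the response is not a known label."""
    # Strip surrounding whitespace and trailing punctuation such as "Positive."
    word = response.strip().strip(".!").lower()
    return word.capitalize() if word in ALLOWED_LABELS else None


print(normalize_sentiment("  Positive\n"))  # Positive
print(normalize_sentiment("  Hungry\n"))    # None
```

In the notebook, you could pass each `response` through this function to see at a glance which models stayed within the requested labels.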


### Task 2: Summarization

Compare the models on a summarization task.

This is the same "email" as the one you used previously in the course.

In [16]:
email = """
Dear Amit,

An increasing variety of large language models (LLMs) are open source, or close to it. The proliferation of models with relatively permissive licenses gives developers more options for building applications.

Here are some different ways to build applications based on LLMs, in increasing order of cost/complexity:

Prompting. Giving a pretrained LLM instructions lets you build a prototype in minutes or hours without a training set. Earlier this year, I saw a lot of people start experimenting with prompting, and that momentum continues unabated. Several of our short courses teach best practices for this approach.
One-shot or few-shot prompting. In addition to a prompt, giving the LLM a handful of examples of how to carry out a task — the input and the desired output — sometimes yields better results.
Fine-tuning. An LLM that has been pretrained on a lot of text can be fine-tuned to your task by training it further on a small dataset of your own. The tools for fine-tuning are maturing, making it accessible to more developers.
Pretraining. Pretraining your own LLM from scratch takes a lot of resources, so very few teams do it. In addition to general-purpose models pretrained on diverse topics, this approach has led to specialized models like BloombergGPT, which knows about finance, and Med-PaLM 2, which is focused on medicine.
For most teams, I recommend starting with prompting, since that allows you to get an application working quickly. If you're unsatisfied with the quality of the output, ease into the more complex techniques gradually. Start one-shot or few-shot prompting with a handful of examples. If that doesn't work well enough, perhaps use RAG (retrieval augmented generation) to further improve prompts with key information the LLM needs to generate high-quality outputs. If that still doesn't deliver the performance you want, then try fine-tuning — but this represents a significantly greater level of complexity and may require hundreds or thousands more examples. To gain an in-depth understanding of these options, I highly recommend the course Generative AI with Large Language Models, created by AWS and DeepLearning.AI.

(Fun fact: A member of the DeepLearning.AI team has been trying to fine-tune Llama-2-7B to sound like me. I wonder if my job is at risk? 😜)

Additional complexity arises if you want to move to fine-tuning after prompting a proprietary model, such as GPT-4, that's not available for fine-tuning. Is fine-tuning a much smaller model likely to yield superior results than prompting a larger, more capable model? The answer often depends on your application. If your goal is to change the style of an LLM's output, then fine-tuning a smaller model can work well. However, if your application has been prompting GPT-4 to perform complex reasoning — in which GPT-4 surpasses current open models — it can be difficult to fine-tune a smaller model to deliver superior results.

Beyond choosing a development approach, it's also necessary to choose a specific model. Smaller models require less processing power and work well for many applications, but larger models tend to have more knowledge about the world and better reasoning ability. I'll talk about how to make this choice in a future letter.

Keep learning!

Andrew
"""

prompt = f"""
Summarize this email and extract some key points.

What did the author say about llama models?
```
{email}
```
"""

First, use the 7B parameter chat model (llama-2-7b-chat) to summarize the email.

In [17]:
response_7b = llama(prompt,
                    model="togethercomputer/llama-2-7b-chat")
print(response_7b)

  The author discusses the proliferation of large language models (LLMs) and the various ways to build applications using them, ranging from prompting to fine-tuning. Key points include:

1. LLMs are becoming more open source, giving developers more options for building applications.
2. Different approaches to building applications include prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.
3. The author recommends starting with prompting for most teams, as it allows for quick development, and gradually moving to more complex techniques if needed.
4. Fine-tuning a smaller model can yield superior results than prompting a larger, more capable model in some cases, depending on the application.
5. Choosing a specific model also requires consideration, as smaller models require less processing power but may not have as much knowledge about the world or reasoning ability as larger models.
6. The author plans to discuss how to make this choice in a future letter.

Regard

Now, use the 13B parameter chat model (llama-2-13b-chat) to summarize the email.

In [18]:
response_13b = llama(prompt,
                     model="togethercomputer/llama-2-13b-chat")
print(response_13b)

  Sure! Here's a summary of the email and some key points:

Summary:
The author discusses different approaches to building applications using large language models (LLMs), ranging from prompting to fine-tuning. They recommend starting with prompting and gradually increasing the complexity of the techniques based on the desired output quality. The author also discusses the choice of model size and the trade-offs between smaller models and larger, more capable models.

Key points:

1. Prompting is a quick and easy way to build applications using LLMs, and it can be done with minimal resources.
2. One-shot or few-shot prompting can yield better results than just giving a pretrained LLM instructions.
3. Fine-tuning can deliver high-quality outputs, but it requires more resources and a larger dataset.
4. Pretraining an LLM from scratch is a complex and resource-intensive task, but it can lead to specialized models with a high level of knowledge.
5. The choice of development approach and mod

Lastly, use the 70B parameter chat model (llama-2-70b-chat) to summarize the email.

In [19]:
response_70b = llama(prompt,
                     model="togethercomputer/llama-2-70b-chat")
print(response_70b)

  The author of the email discusses the various ways to build applications using large language models (LLMs), including prompting, one-shot or few-shot prompting, fine-tuning, and pretraining. The author recommends starting with prompting and gradually moving on to more complex techniques if necessary. They also mention that fine-tuning a smaller model may not always yield superior results compared to prompting a larger, more capable model, and that the choice of model depends on the application.

The author also mentions that there are now many open-source LLMs available, which gives developers more options for building applications. They also mention that there are courses available to learn more about these options and how to use them effectively.

In a humorous aside, the author mentions that a member of the DeepLearning.AI team has been trying to fine-tune Llama-2-7B to sound like the author, and wonders if their job is at risk.

Some key points from the email include:

* There a
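
All three printed summaries end abruptly ("Regard", "mod", "There a"), which suggests the responses were cut off, possibly by an output-length limit. A rough way to compare the responses side by side is to look at their length and whether they end on sentence-final punctuation. The sketch below makes no assumptions about the `llama` helper. `summary_stats` is a hypothetical function, and the punctuation check is only a heuristic.

```python
def summary_stats(responses):
    """Map each model name to a word count and a rough 'looks complete' flag."""
    stats = {}
    for name, text in responses.items():
        stripped = text.strip()
        stats[name] = {
            "words": len(stripped.split()),
            # Heuristic: a complete response usually ends with . ! or ?
            "looks_complete": stripped.endswith((".", "!", "?")),
        }
    return stats


# Short stand-in strings; in the notebook you would pass response_7b, etc.
example = {
    "llama-2-7b-chat": "The author plans to discuss this in a future letter.\n\nRegard",
    "demo-complete": "The author recommends starting with prompting.",
}
for name, row in summary_stats(example).items():
    print(name, row)
```

With the real responses, the call would look like `summary_stats({"llama-2-7b-chat": response_7b, "llama-2-13b-chat": response_13b, "llama-2-70b-chat": response_70b})`.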

### Model-Graded Evaluation: Summarization

You can also ask an LLM to evaluate the responses of other LLMs.

This is known as model-graded evaluation.

Create a prompt that will evaluate these three responses using the 70B parameter chat model (llama-2-70b-chat).

In the prompt, provide the email, the name of each model, and the summary generated by each model.
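
If you want to evaluate a different set of models, it may be convenient to assemble this kind of prompt programmatically. The sketch below is one way to do that under a few assumptions: `build_eval_prompt` is a hypothetical helper (not part of `modules.utils`), the instruction wording mirrors the hand-written prompt in this notebook, and the summaries are supplied as a dictionary keyed by model name.

```python
FENCE = "`" * 3  # triple backticks used to delimit the email text


def build_eval_prompt(email, summaries):
    """Assemble a model-graded evaluation prompt from an email and a dict of summaries."""
    header = (
        "Given the original text denoted by `email`\n"
        "and the name of several models: `model:<name of model>`\n"
        "as well as the summary generated by that model: `summary`\n\n"
        "Provide an evaluation of each model's summary:\n"
        "- Does it summarize the original text well?\n"
        "- Does it follow the instructions of the prompt?\n"
        "- Are there any other interesting characteristics of the model's output?\n\n"
        "Then compare the models based on their evaluation "
        "and recommend the models that perform the best.\n\n"
    )
    # One "model / summary" block per entry, in insertion order.
    blocks = [f"model: {name}\nsummary: {text}\n" for name, text in summaries.items()]
    return header + f"email: {FENCE}{email}{FENCE}\n\n" + "\n".join(blocks)


# Stand-in inputs; in the notebook you would pass email and the response_* variables.
demo = build_eval_prompt(
    "Dear Amit, (email text)",
    {"llama-2-7b-chat": "Summary A", "llama-2-13b-chat": "Summary B"},
)
print(demo)
```

Keeping the summaries in a dictionary means adding or removing a model only changes the input, not the prompt template.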

In [20]:
prompt = f"""
Given the original text denoted by `email`
and the name of several models: `model:<name of model>`
as well as the summary generated by that model: `summary`

Provide an evaluation of each model's summary:
- Does it summarize the original text well?
- Does it follow the instructions of the prompt?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

email: ```{email}```

model: llama-2-7b-chat
summary: {response_7b}

model: llama-2-13b-chat
summary: {response_13b}

model: llama-2-70b-chat
summary: {response_70b}
"""

response_eval = llama(prompt,
                      model="togethercomputer/llama-2-70b-chat")
print(response_eval)

  Based on the summaries provided, it seems that all three models (Llama-2-7B-chat, Llama-2-13B-chat, and Llama-2-70B-chat) were able to accurately summarize the main points of the email. However, there are some differences in the level of detail and clarity provided by each model.

Llama-2-7B-chat's summary is the shortest and most concise of the three, focusing on the main points of the email such as the different approaches to building applications using LLMs and the trade-offs between model size and complexity. However, it does not provide as much detail or examples as the other two models.

Llama-2-13B-chat's summary is slightly longer and provides more examples and details about the different approaches to building applications using LLMs. It also highlights the author's recommendation to start with prompting and gradually increase the complexity of the techniques based on the desired output quality. However, it does not provide as much information about the choice of model size 