## Large Language Models Lab

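This lab uses the Hugging Face Transformers library to generate text with Llama 3.2 (base and Instruct versions) and Mistral 7B-Instruct, and looks at how chat templates turn a list of messages into a prompt.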

In [1]:
from huggingface_hub import login

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

In [2]:
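with open("../hf_token.txt", "r") as f:
    # strip() removes the trailing newline that the file may contain;
    # the `with` block closes the file automatically
    token = f.read().strip()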

login(token=token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/fgiobergia/.cache/huggingface/token
Login successful


# 1. LLaMA

In this part of the lab, we will explore **LLaMA (Large Language Model Meta AI)**, one of the best-known families of large language models, developed by Meta (formerly Facebook).

We start with the base model and then move on to the **Instruct** variant, a version of LLaMA fine-tuned to better understand and follow user instructions.

We will use Llama 3.2 (released in September 2024), specifically the 1B-parameter version. By current standards this is a small model, but it is still quite capable.

It was released, along with a 3B version, so that it can run on devices with modest hardware (e.g., mobile phones or other edge devices).

In [63]:
model_id = "meta-llama/Llama-3.2-1B"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

We can use this model to generate text with the `generate()` method. We use random sampling (`do_sample=True`) and draw 5 samples (`num_return_sequences=5`); `max_length=50` caps the total length (prompt plus generated tokens) at 50 tokens. You can find other generation parameters [here](https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/text_generation#transformers.GenerationConfig).
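To make "random sampling" concrete, here is a minimal, library-free sketch of how a single token can be drawn from a model's output distribution, compared with greedy selection. The vocabulary and logit values are made up for illustration, and the helper names are hypothetical; `generate()` itself does considerably more than this.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token at random, with probability proportional to softmax(logits / temperature)."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    # Subtract the maximum before exponentiating for numerical stability
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_next_token(logits):
    """Always pick the token with the highest logit (what happens without sampling)."""
    return max(logits, key=logits.get)


# Hypothetical logits for the token following "Hello, my name is"
toy_logits = {" Alex": 2.0, " Erika": 1.8, " Gabe": 1.5, " Tatyana": 1.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(toy_logits, rng=rng) for _ in range(5)]

print("greedy :", greedy_next_token(toy_logits))
print("sampled:", samples)
```

Greedy selection returns the same token on every call, while sampling can return a different token each time, which is why the five sequences generated in the next cell differ from one another.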

In [64]:
tokens = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
batch = model.generate(**tokens, do_sample=True, max_length=50, num_return_sequences=5, pad_token_id=tokenizer.eos_token_id) # (assigning pad_token_id avoids a warning)
tokenizer.batch_decode(batch)

['<|begin_of_text|>Hello, my name is Erika and I’m a student at the University of Toronto. I’m an Environmental Science major and I’m currently in the final stages of my undergraduate degree.\nI’m passionate about the environment and I’m always looking for',
 '<|begin_of_text|>Hello, my name is Gabe, and I am a 16 year old student at the University of Washington, Seattle. I am a freshman and am majoring in computer science. I am also a member of the Computer Science Club here at',
 '<|begin_of_text|>Hello, my name is Alex. I am a self-taught artist, currently based in New York City. I have been creating art since I was 10 years old, and I have been drawing and painting ever since. I am a',
 '<|begin_of_text|>Hello, my name is Erika and I am a professional photographer and an avid reader. I have a love for books, and I love to read them. I have been reading for as long as I can remember, and I have always been',
 "<|begin_of_text|>Hello, my name is Tatyana, I'm 25 years old and I live i

### **Understanding the `tokenizer.chat_template`**

In this section, we will explore the **chat template**, which is used to format and structure messages for a conversational assistant. The `tokenizer.chat_template` attribute describes how the system, user, and assistant messages are laid out so that the model can process them and generate coherent responses.

### **What is a Chat Template?**

The chat template is a predefined format that ensures consistent structure for conversations. It marks the different roles in the interaction (system, user, assistant), and separates the various elements of the conversation using special tokens. This helps the language model understand which parts of the dialogue are instructions, which parts are user inputs, and where the assistant’s response should be generated.

Let's write a possible (simplified) chat template by hand, as a Python format string with `{system_message}` and `{user_message}` placeholders:

In [65]:
chat_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 27 Oct 2024

{system_message}

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{user_message}

<|eot_id|>
"""

### **Hugging Face Pipeline Overview**

The **`pipeline`** function from Hugging Face’s Transformers library is a high-level API designed to streamline the use of pre-trained models for a wide variety of **natural language processing (NLP) tasks**.

#### **What is a Pipeline?**

A pipeline is a modular tool that wraps around a pre-trained model, tokenizer, and task-specific configurations. It makes it easy to load and apply these models directly to different tasks, such as:
- **Text generation**
- **Text classification**
- **Question answering**
- **Summarization**
- **Translation**

By simply specifying the type of task (e.g., `"text-generation"`), `pipeline` takes care of loading and configuring a compatible model and tokenizer, providing a ready-to-use interface for generating results.

You can find a full list of supported pipelines on the [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/pipelines).

In [66]:
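# Store the simplified template on the tokenizer. It is a plain Python format
# string (not Jinja), so we fill it in manually with str.format() below.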
tokenizer.chat_template = chat_template

# Create the pipeline with the model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

In [69]:

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Format the messages using the chat template
formatted_messages = chat_template.format(
    system_message=messages[0]["content"],
    user_message=messages[1]["content"]
)


print(formatted_messages)


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 27 Oct 2024

You are a pirate chatbot who always responds in pirate speak!

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What is 2 + 2?

<|eot_id|>



Now, remember that for a model to follow instructions reliably, it needs to have been fine-tuned on data in this format. Here we are using the base model, not the instruction-tuned version.

So we can expect the model to produce a garbage response (it has never seen this kind of input before!). But let's try it anyway!

In [70]:
# Generate the output text 
outputs = pipe(
    formatted_messages,
    max_new_tokens=256,
    do_sample=True,
)

print(outputs[0]["generated_text"])


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 27 Oct 2024

You are a pirate chatbot who always responds in pirate speak!

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What is 2 + 2?

<|eot_id|>
れどsystemedenírangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этомуrangle

What is 2 + 2?

этому


### **Differences Between Standard and Instruct Versions of Large Language Models (LLMs)**

Large Language Models (LLMs) are commonly released in two variants: a **standard** (base) version and an **instruction-tuned (Instruct)** version. Here’s a brief comparison:

#### **Purpose and Training**:
   - **Standard LLM**: The standard model is pre-trained on large datasets without any specific training to follow instructions. It typically generates open-ended continuations of the prompt, which can be useful for creative writing or other tasks where the response style is flexible.
   - **Instruct LLM**: Instruction-tuned models, like **Llama-3.2 Instruct**, are further fine-tuned on datasets designed to teach the model to understand and follow instructions. This improves the model's ability to respond directly to user prompts and handle structured requests, producing concise, direct responses that are often more relevant in task-specific or conversational applications.

Let's compare the outputs of the standard and Instruct versions of LLaMA to see the differences in their responses.

In [71]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"


# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = chat_template

# Create the pipeline with the model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Format the messages using the chat template
formatted_messages = chat_template.format(
    system_message=messages[0]["content"],
    user_message=messages[1]["content"]
)

outputs = pipe(
    formatted_messages,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"])


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 27 Oct 2024

You are a pirate chatbot who always responds in pirate speak!

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What is 2 + 2?

<|eot_id|>
Arrr, ye landlubber! Ye be askin' a question that be as easy as swabbin' the decks! 2 + 2 be equal to... 4, matey! Yer calculations be as sharp as me trusty cutlass!


### **Evaluation of the Tokenizer Chat Template**

The actual chat template of `meta-llama/Llama-3.2-1B-Instruct` is much more complex than the example above. It includes components that set up the context of the conversation, manage dates, handle tools, and structure messages.

The template is written in [Jinja](https://jinja.palletsprojects.com/en/stable/templates/), a templating language that generates content dynamically from variables, conditions, and loops.


Let's analyze its key components first, and then print it:

#### **Key Components of the Template**:
1. **System Message Extraction**:
   - If the first message in the list has the role `system`, its content is extracted and removed from the list, so that it can be placed in the system block at the top of the prompt. Otherwise, the system message is set to an empty string.
   - The system block opens with the header `<|start_header_id|>system<|end_header_id|>` and is closed by `<|eot_id|>`. The header tokens wrap the role name, while `<|eot_id|>` marks the end of the message.

2. **Date Management**:
   - The template fills in the current date: it uses `date_string` if one is provided, otherwise the `strftime_now` function if it is available, and otherwise a default date (`"26 Jul 2024"`). This is useful when the model needs to be aware of the date, for instance in time-sensitive responses.

3. **Handling Tools**:
   - The template checks if **tools** are defined. If tools are available, it includes a description of these tools in the system message or the user message, depending on where they need to appear.
   - If the tools are placed in the user message, the template adds text to the first user message that instructs the model to respond in a structured format, such as JSON for function calls.

4. **Message Processing**:
   - The template loops through the list of messages and processes each based on the role (`user`, `assistant`, `ipython`, or `tool`). It formats each message using start and end tokens for the roles, helping the model understand the structure of the conversation.
   - If the message involves tool calls, the template ensures that they are properly formatted into a structured JSON format to be passed back to the model for further processing.

5. **Ending the Assistant's Response**:
   - When `add_generation_prompt=True` is passed, the template ends the prompt with an assistant header (`<|start_header_id|>assistant<|end_header_id|>`). Generation then starts directly with the content of the assistant's response, in the correct format.

#### **Why Is This Template Needed?**

- **Maintains Consistency**: This template ensures that the conversation is structured in a consistent manner, which is crucial for models designed to follow complex instructions or engage in multi-turn conversations.
- **Handles Tools**: By incorporating the ability to dynamically introduce tools and functionality, the template allows the model to expand beyond simple text-based conversations and perform function-based tasks.
- **Structured Outputs for Tools**: When the conversation involves tool calls (e.g., through APIs or function calls), the template ensures that these interactions are formatted properly for execution.

In [75]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

print(tokenizer.chat_template)

{{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- if strftime_now is defined %}
        {%- set date_string = strftime_now("%d %b %Y") %}
    {%- else %}
        {%- set date_string = "26 Jul 2024" %}
    {%- endif %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- 

Let's generate the same example again, this time using the `chat_template` that ships with `meta-llama/Llama-3.2-1B-Instruct`, and analyze the output.

With a tokenizer that supports chat templates, we can directly call the `apply_chat_template()` method to convert a list of messages (each one a dictionary with `role` and `content` keys, as above) into a prompt.

Notice that, since we are not using tools or other advanced features, the resulting prompt is similar to the one we wrote by hand earlier.

In [77]:
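# By default, apply_chat_template() returns token IDs (tokenize=True),
# so we decode them back to text to inspect the prompt
input_tokens = tokenizer.apply_chat_template(messages)
print(tokenizer.decode(input_tokens))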

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 27 Oct 2024

You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 2 + 2?<|eot_id|>
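For the tool-free case, the prompt printed above can be reproduced with a few lines of plain Python. The following is only a sketch of the template's logic under simplifying assumptions (no tools, text-only messages); `render_llama3_prompt` is a hypothetical helper written for this illustration, not part of Transformers.

```python
from datetime import datetime


def render_llama3_prompt(messages, date_string=None, add_generation_prompt=False):
    """Simplified, tool-free imitation of the Llama 3.2 Instruct chat template."""
    # Date management: use the given date, else today's date in "27 Oct 2024" style
    if date_string is None:
        date_string = datetime.now().strftime("%d %b %Y")

    # System message extraction: pull out a leading system message, if any
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip()
        messages = messages[1:]
    else:
        system_message = ""

    # The system block is emitted unconditionally, right after the BOS token
    prompt = "<|begin_of_text|>"
    prompt += "<|start_header_id|>system<|end_header_id|>\n\n"
    prompt += "Cutting Knowledge Date: December 2023\n"
    prompt += f"Today Date: {date_string}\n\n"
    prompt += system_message + "<|eot_id|>"

    # Message processing: one header plus content plus <|eot_id|> per message
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"].strip() + "<|eot_id|>"

    # Optionally open the assistant turn so generation starts with the reply
    if add_generation_prompt:
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "What is 2 + 2?"},
]
prompt = render_llama3_prompt(messages, date_string="27 Oct 2024")
print(prompt)
```

Passing `add_generation_prompt=True` appends the assistant header, which is the form of the prompt that is normally used right before generation.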


In [82]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)

# In chat mode, outputs[0]["generated_text"] holds the full conversation
# as a list of messages; the last one ([-1]) is the assistant's response
print(outputs[0]["generated_text"][-1]["content"])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Arrr, me hearty! I be Captain Blackbeak Betty, the greatest pirate chatbot to ever sail the seven seas... or at least, I be tryin' to be. Me knowin's be based on me vast collection o' pirate lore and tales, and me language be a mix o' pirate slang, sea shanties, and me own swashbucklin' flair. So hoist the colors, me matey, and come aboard fer a chat about the high seas and all its booty!


Notice that the pipeline already supports chat mode, so we can pass the list of messages (as long as they contain role/content keys) directly to the pipeline.

Alternatively, we could have passed the prompt as a string. In this case, however, the pipeline returns the prompt followed by the generated text as a single string, so we have to extract the assistant's response from it ourselves.

In [80]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_tokens = tokenizer.apply_chat_template(messages)
prompt_string = tokenizer.decode(input_tokens)

outputs = pipe(
    prompt_string,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 27 Oct 2024

You are a pirate chatbot who always responds in pirate speak!<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|>assistant

Arrr, me hearty! I be Captain Cutlass, the most feared and infamous pirate to ever sail the seven seas. Me and me trusty parrot sidekick, Polly, have been sailin' the Caribbean, plunderin' the riches o' the landlubbers and bringin' 'em back to their bilge for a good swabbin'! Me ship, the "Black Dragon", be me home, and me crew be me family. We be sailin' the seas, searchin' for the ultimate treasure: the Golden Anchor o' Tortuga!
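The manual extraction mentioned above could look like the following sketch. It assumes that the generated text starts with the exact prompt string and that the model's own role header shows up as the plain word `assistant`, as in the output above; `extract_reply` is a hypothetical helper, and the strings below are shortened stand-ins for the real prompt and output.

```python
def extract_reply(generated_text, prompt):
    """Return only the assistant's reply from a prompt-plus-completion string."""
    # Drop the echoed prompt, if the output starts with it
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    # Drop the role header that the model generated before its reply
    text = generated_text.lstrip()
    if text.startswith("assistant"):
        text = text[len("assistant"):]
    return text.strip()


# Shortened stand-ins for prompt_string and outputs[0]["generated_text"]
example_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|>"
example_output = example_prompt + "assistant\n\nArrr, me hearty! I be Captain Cutlass."

print(extract_reply(example_output, example_prompt))
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the newly generated text and makes the first step unnecessary.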


# 2. Mistral

### **Mistral Models: Mistral 7B and Mistral 7B-Instruct**

In this exercise, we will explore the **Mistral models** developed by Mistral AI, specifically focusing on the **Mistral 7B** (standard model) and **Mistral 7B-Instruct** (instruction-tuned model).



In [86]:
from transformers import pipeline

# Define the model ID
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
mistral_chat_template = tokenizer.chat_template

# Initialize the pipeline for text generation
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,  # Optimizes memory usage
    device_map="auto"            # Automatically distributes the model across available devices
)

# Define the message prompts for the conversation
formatted_messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"}
]

# Generate the response
outputs = pipe(formatted_messages, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)

# Print the model's generated response
print(outputs[0]["generated_text"][-1]["content"])



 Arr, I be Cap'n Parrotbeak, me hearty scallywag! How be thar, landlubber?

What brings ye to me virtual hideout? Come closer, but mind the plank!
