## Large Language Models Lab

TODO - Add a description

In [14]:
from huggingface_hub import login

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

In [8]:
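# Read the Hugging Face access token from a local file (keep this file out of version control)
with open("hf_token.txt", "r") as f:
    token = f.read().strip()  # strip the trailing newline; the `with` block closes the file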

login(token=token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/csavelli/.cache/huggingface/token
Login successful


### 1. LLaMA

In this part of the lab, we will explore **LLaMA (Large Language Model Meta AI)**, one of the best-known families of large language models, developed by Meta (Facebook).

We will then focus on the **instruction-tuned (Instruct) version of LLaMA**, which is fine-tuned to better understand and follow user instructions.

### **Understanding the `tokenizer.chat_template`**

In this section, we will explore the **chat template** that is used to format and structure messages for a conversational assistant. The `tokenizer.chat_template` is critical for organizing interactions between the user, system, and assistant in a way that the model can easily process and generate coherent responses.

### **What is a Chat Template?**

The chat template is a predefined format that ensures consistent structure for conversations. It marks the different roles in the interaction (system, user, assistant), and separates the various elements of the conversation using special tokens. This helps the language model understand which parts of the dialogue are instructions, which parts are user inputs, and where the assistant’s response should be generated.
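
As a rough illustration of this idea, a conversation can be stored as a list of role-tagged messages and flattened into one prompt string. The sketch below is a minimal example under stated assumptions: the helper `render_simple_chat` is hypothetical, and the `<|role|>` and `<|end|>` markers are made up for this lab rather than taken from any real model's special tokens.

```python
# Hypothetical helper: renders any list of {"role", "content"} messages
# using made-up <|role|> ... <|end|> markers (not real model tokens).
def render_simple_chat(messages, add_generation_prompt=True):
    lines = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        lines.append("<|assistant|>")
    return "\n".join(lines) + "\n"


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "What is a chat template?"},
]
print(render_simple_chat(example))
```

Because the helper loops over the messages, it works for any number of turns, which a fixed template with one system slot and one user slot cannot do.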

Let's create an example of a possible (simplified) chat template:

In [44]:
chat_template = """
<|system|>{system_message}<|end|>
<|user|>{user_message}<|end|>
<|assistant|>
"""

In [43]:
model_id = "meta-llama/Llama-3.2-1B"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

### **Hugging Face Pipeline Overview**

The **`pipeline`** method from Hugging Face’s Transformers library is a high-level API designed to streamline the process of using pre-trained models for a wide variety of **natural language processing (NLP) tasks**.

#### **What is a Pipeline?**

A pipeline is a modular tool that wraps around a pre-trained model, tokenizer, and task-specific configurations. It makes it easy to load and apply these models directly to different tasks, such as:
- **Text generation**
- **Text classification**
- **Question answering**
- **Summarization**
- **Translation**

By simply specifying the type of task (e.g., `"text-generation"`), `pipeline` takes care of loading and configuring a compatible model and tokenizer, providing a ready-to-use interface for generating results.
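
To make the "wrapper" idea concrete, here is a toy sketch of the three stages a pipeline chains together (preprocess, forward pass, postprocess). It is only an illustration: `ToyTextPipeline` and its stand-in tokenizer and model are hypothetical and are not how the Transformers library is implemented. The only thing it deliberately mirrors is the shape of the result, a list of dictionaries with a `"generated_text"` key.

```python
# Toy illustration only: NOT the Transformers implementation.
class ToyTextPipeline:
    """Wraps a tokenizer-like and a model-like callable behind one call."""

    def __init__(self, encode, model, decode):
        self.encode = encode  # text -> list of token ids
        self.model = model    # token ids -> token ids
        self.decode = decode  # token ids -> text

    def __call__(self, text):
        token_ids = self.encode(text)        # 1. preprocess
        output_ids = self.model(token_ids)   # 2. forward pass
        # 3. postprocess into the same result shape used in this lab
        return [{"generated_text": self.decode(output_ids)}]


# Stand-ins: a character-level "tokenizer" and a "model" that appends "!".
toy_pipe = ToyTextPipeline(
    encode=lambda text: [ord(c) for c in text],
    model=lambda ids: ids + [ord("!")],
    decode=lambda ids: "".join(chr(i) for i in ids),
)
print(toy_pipe("Ahoy")[0]["generated_text"])  # prints "Ahoy!"
```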

In [45]:
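# Store the simplified template on the tokenizer.
# Note: this is a Python str.format template, not a Jinja template, so it is
# applied manually with chat_template.format(...) in the next cell.
tokenizer.chat_template = chat_template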

# Create the pipeline with the model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

In [46]:

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Format the messages using the chat template
formatted_messages = chat_template.format(
    system_message=messages[0]["content"],
    user_message=messages[1]["content"]
)


print(formatted_messages)


<|system|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>



In [47]:
# Generate the output text 
outputs = pipe(
    formatted_messages,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"])


<|system|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>You are a pirate chatbot


### **Differences Between Standard and Instruct Versions of Large Language Models (LLMs)**

Large Language Models (LLMs) come in different versions, with **standard** and **instruction-tuned (Instruct)** versions being the most common. Here’s a brief comparison:

#### **1. Purpose and Training**:
   - **Standard LLM**: The standard (base) model is pre-trained on large datasets and has no specific instruction-following capabilities. It typically generates more open-ended responses, which can be useful for creative writing or general information retrieval, where the response style is flexible.
   - **Instruct LLM**: Instruction-tuned models, such as **Llama-3.2 Instruct**, are fine-tuned on datasets designed to help the model understand and follow instructions effectively. This tuning enhances the model's ability to respond directly to user prompts and handle structured requests. These models are built for concise, directive responses, which are often more relevant in task-specific or conversational AI applications.
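
The open-ended behaviour of the base model is visible in the earlier output: the pipeline returns the prompt followed by the continuation, and the model keeps repeating the conversation until `max_new_tokens` is reached, most likely because the made-up `<|end|>` marker is plain text to it rather than a stop signal. A minimal post-processing sketch follows, assuming we only want the first turn after the prompt; the helper `extract_reply` is hypothetical. (The text-generation pipeline also accepts `return_full_text=False` to drop the prompt, which would cover the first step.)

```python
def extract_reply(generated_text, prompt, stop_marker="<|end|>"):
    """Return only the first turn that follows the prompt."""
    # The pipeline output starts with the prompt, so drop it first.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep only the text before the first end-of-turn marker, if any.
    reply = continuation.split(stop_marker, 1)[0]
    return reply.strip()


# Made-up strings that imitate the structure of the pipeline output.
prompt = "<|user|>Who are you?<|end|>\n<|assistant|>\n"
raw = prompt + "I be a pirate!<|end|>\n<|user|>Who are you?<|end|>\n"
print(extract_reply(raw, prompt))  # prints "I be a pirate!"
```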

Let's compare the outputs of the standard and Instruct versions of LLaMA to see the differences in their responses.

In [48]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"


# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = chat_template

# Create the pipeline with the model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Format the messages using the chat template
formatted_messages = chat_template.format(
    system_message=messages[0]["content"],
    user_message=messages[1]["content"]
)

outputs = pipe(
    formatted_messages,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"])


<|system|>You are a pirate chatbot who always responds in pirate speak!<|end|>
<|user|>Who are you?<|end|>
<|assistant|>
Arrr, ye be askin' who I be? Well, matey, I be the greatest pirate chatbot to ever sail the seven seas! Me brain be as sharp as me cutlass and me knowledge be as vast as the ocean! I can answer any question ye have, from the best way to sail the high seas to the secret to findin' the hidden treasure of Tortuga! So hoist the sails and set course fer a swashbucklin' adventure with ol' Pete the Pirate Chatbot!

What be yer question, matey?


### **Evaluation of the Tokenizer Chat Template**

The actual chat template of `meta-llama/Llama-3.2-1B-Instruct` is much more complex than the simplified example above. It includes various components that help the model understand the context of the conversation, manage dates, handle tools, and structure messages effectively. Let's print it and analyze its key components:


#### **Key Components of the Template**:
1. **System Message Extraction**:
   - The system message is extracted if the first message in the list has the role "system". This allows the template to clearly differentiate between user queries and system instructions.
   - The template always writes a system block. The role name `system` is wrapped in the special tokens `<|start_header_id|>` and `<|end_header_id|>`, and the extracted system message (empty if none was provided) follows this header, so the model knows where the system turn starts.

2. **Date Management**:
   - The template fills in the current date in this order: a `date_string` passed by the caller, otherwise the result of a provided `strftime_now` function, otherwise the default date (`"26 Jul 2024"`). This can be useful when the model needs to be aware of the date, for example in time-sensitive responses.

3. **Handling Tools**:
   - The template checks whether **tools** are defined. If they are, it includes a description of these tools in either the system message or the user message, depending on the `tools_in_user_message` flag (which defaults to true).
   - If the tools are placed in the user message, the template adds an instruction to the first user message that prompts the model (not the user) to respond in a structured format, such as JSON for function calls.

4. **Message Processing**:
   - The template loops through the list of messages and processes each based on the role (`user`, `assistant`, `ipython`, or `tool`). It formats each message using start and end tokens for the roles, helping the model understand the structure of the conversation.
   - If the message involves tool calls, the template ensures that they are properly formatted into a structured JSON format to be passed back to the model for further processing.

5. **Ending the Assistant's Response**:
   - When a generation prompt is requested (`add_generation_prompt=True` in `apply_chat_template`), the template ends with an opening assistant header and no content. This acts as a placeholder that the model fills in during inference, so the assistant's response begins in the correct format.
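
The steps above can be sketched in plain Python. This is a simplified, tool-free imitation and not the real Jinja template: the function name `render_llama3_style` is hypothetical, and because the template printed in this notebook is truncated, the `Today Date` line, the `<|begin_of_text|>` start token, and the `<|eot_id|>` end-of-turn token are assumptions based on the published Llama 3 prompt format.

```python
def render_llama3_style(messages, date_string="26 Jul 2024",
                        add_generation_prompt=True):
    """Simplified, tool-free imitation of the steps listed above."""
    # 1. Pull out the system message if it comes first.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip()
        messages = messages[1:]
    else:
        system_message = ""

    # 2. The system header is always emitted, followed by date information.
    prompt = "<|begin_of_text|>"  # assumed value of bos_token
    prompt += "<|start_header_id|>system<|end_header_id|>\n\n"
    prompt += "Cutting Knowledge Date: December 2023\n"
    prompt += f"Today Date: {date_string}\n\n"  # assumed wording
    prompt += system_message + "<|eot_id|>"     # assumed end-of-turn token

    # 3. Tool handling is omitted in this sketch.

    # 4. Each remaining message gets a role header and an end-of-turn token.
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"].strip() + "<|eot_id|>"

    # 5. Open the assistant turn for the model to complete.
    if add_generation_prompt:
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(render_llama3_style([
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]))
```

In practice the tokenizer renders the real template for you; the sketch is only meant to show how the listed steps fit together.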

#### **Why Is This Template Needed?**

- **Maintains Consistency**: This template ensures that the conversation is structured in a consistent manner, which is crucial for models designed to follow complex instructions or engage in multi-turn conversations.
- **Handles Tools**: By incorporating the ability to dynamically introduce tools and functionality, the template allows the model to expand beyond simple text-based conversations and perform function-based tasks.
- **Structured Outputs for Tools**: When the conversation involves tool calls (e.g., through APIs or function calls), the template ensures that these interactions are formatted properly for execution.

In [51]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llama_chat_template = tokenizer.chat_template

print(llama_chat_template)

{{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- if strftime_now is defined %}
        {%- set date_string = strftime_now("%d %b %Y") %}
    {%- else %}
        {%- set date_string = "26 Jul 2024" %}
    {%- endif %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- 

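Let's generate the same example again, this time using the `chat_template` of `meta-llama/Llama-3.2-1B-Instruct`, and analyze the output. Here the messages are passed to the pipeline as a list of dictionaries, so the pipeline applies the tokenizer's chat template itself instead of relying on a manually formatted string.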

In [57]:
tokenizer.chat_template = llama_chat_template

# Create the text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

formatted_messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    formatted_messages,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"])

[{'role': 'system', 'content': 'You are a pirate chatbot who always responds in pirate speak!'}, {'role': 'user', 'content': 'Who are you?'}, {'role': 'assistant', 'content': "Arrrr, ye landlubber! Yer askin' who I be? I be Captain Blackbeak, the most feared and infamous pirate to ever sail the seven seas! Me and me crew, the scurvy dogs, have been plunderin' and pillagin' for nigh on 20 years, plunderin' the riches of the landlubbers and bringin' terror to the high seas!\n\nMe trusty cutlass, me loyal parrot Polly, and me trusty map, they be me companions and me confidants. We sail the Caribbean, searchin' for the greatest treasures and the most cunning scallywags to outwit. And when we find 'em, we be the ones doin' the plunderin'!\n\nSo hoist the colors, me hearties, and set course fer adventure with Captain Blackbeak and me crew!"}]
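
With chat-style input, `generated_text` is the whole conversation as a list of messages, with the new assistant turn appended at the end. A small sketch of pulling out only the reply is shown below; the `outputs` value is a hand-written mock with the same structure as the result above (content shortened), so no model is called.

```python
# Mocked pipeline result with the same structure as the output above
# (assistant content shortened; no model is called here).
outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "Arrrr, ye landlubber!"},
        ]
    }
]

conversation = outputs[0]["generated_text"]
assistant_reply = conversation[-1]["content"]  # last message is the new turn
print(assistant_reply)
```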
