## Assignment 4: Large language models and in-context learning
In this assignment, we are going to work with a large language model locally.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch 
import numpy as np 

### 1. Generation and hyperparameters (10')
1.1 Load a pre-trained language model, `meta-llama/Llama-3.2-1B`, using `AutoModelForCausalLM` class. This requires creating an account at huggingface and creating an access token. You'll also need to submit a request at [huggingface](https://huggingface.co/meta-llama/Llama-3.2-1B) to use it.
An $X$ B model contains $X$ billion parameters. To run inference on this model, you'll need $4X$ GB memory on full precision. You can reduce it to $2X$ GB by specifying `torch_dtype=torch.float16` in the `from_pretrained()` step. If your computer does not have 2GB spare memory, you can use Google Colab or a smaller model like `gpt2-large`. To speed up the inference, you can load the model onto cuda (for the machines with NVidia GPU) or mps (for the machines with Apple Silicon).  

In [21]:
import os

auth = ""

os.environ[auth] = ""

from huggingface_hub import login
login(token=os.environ[auth])

modelname = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(modelname, token=auth)
model = AutoModelForCausalLM.from_pretrained(
    modelname, 
    torch_dtype=torch.float16, 
    use_auth_token=True
)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\wheel\.cache\huggingface\token
Login successful




config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

1.2 Use the built-in `generation` method. Starting from the first word of the sentence "An apple a day keeps the doctors away", generate up to 10 new tokens, using three methods: greedy, beam search, nucleus sampling. Following are the corresponding configurations:  
- Greedy: `do_sample=False`    
- Beam search: `num_beams=k`    
- Nucleus sampling: `do_sample=True, top_p=0.8`  

[The official doc](https://huggingface.co/docs/transformers/en/generation_strategies#customize-text-generation) briefly describes the generation modes, and [the GenerationConfig doc](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig) explains in details the generation parameters.  
How long does it take to generate a sequence? What is the generation quality?  

The documentation explains that modifying the decoding strategy can have an impact on the generation quality, possibly lowering repetition or forming more coherent sentences. The amount of time it takes to generate a sequence is based on the amount of tokens that need to be generated. For example in this exercise we are maxing the decoding at 10 tokens. There are also configuration options to end computation of a certain amount of time has passed. 

In [35]:
# Greedy search decoding
import time

input_text = "An"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
start = time.time()
outputs = model.generate(input_ids, max_new_tokens=10, do_sample=False)
end = time.time()

print(tokenizer.decode(outputs[0]))
print(f"Time taken: {end - start}s")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|>An 18-year-old woman presents to the emergency department
Time taken: 17.331308841705322s


This sequence generation took around 17.33 seconds and has good sentence quality.

In [36]:
# TODO: Beam search decoding
start = time.time()
outputs = model.generate(input_ids, max_new_tokens=10, num_beams=5)
end = time.time()
print(tokenizer.decode(outputs[0]))
print(f"Time taken: {end - start}s")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|>An 8-year-old boy is brought to the physician
Time taken: 93.51181674003601s


This sequence generation took around 93.5 seconds and has good sentence quality.

In [37]:
# TODO: Nucleus sampling
start = time.time()
outputs = model.generate(input_ids, max_new_tokens=10, do_sample=True, top_p=0.8)
end = time.time()
print(tokenizer.decode(outputs[0]))
print(f"Time taken: {end - start}s")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|>Anatomy of a Killer: The Secret Science of At
Time taken: 24.856630325317383s


This sentence generation took around 24.86 seconds and has good sentence quality however this time the intput text was continued to form the world Anatomy instead of keeping the first word as "An". This time it had also formed maybe a title of some type.

1.3 Change the `temperature` generation parameter. Comment on the results.  

In [38]:
# TODO: Change the temperature parameter
start = time.time()
outputs = model.generate(input_ids, max_new_tokens=10, do_sample=True, top_p=0.8, temperature=0.7)
end = time.time()
print(tokenizer.decode(outputs[0]))
print(f"Time taken: {end - start}s")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|>An Introduction to the Mathematics of Physics
by Paul M
Time taken: 20.635331392288208s


This senquence generation took around 20.64 seconds and has good sentence quality this time forming a title of some kind along with the author.

### 2. Zero-shot classification using a language model (10')  
Here we are going to turn the pre-trained language model into a classifier.  
Give a classification question, an obvious solution is to prompt the model so that it only generates the answer. However, it's not always possible to let the LM generate the final answer. Following is a way: Frame the input text as "[Problem] The answer is". Query the probabilities of the next token being "0" vs "1" (or "True" vs "False", depending on the problem). Then, pass them through a softmax to get the probabilities. This is the probability of the model. Now you have an LM that can do the sentiment analysis task.  

Pick N=10 questions in the validation set of SST in Assignment 1-2. Use this zero-shot classifier to solve them. Comment on the results.  

In [31]:
from datasets import load_dataset 
import pandas as pd
import torch.nn.functional as F

ds = load_dataset("glue", "sst2")
val_data = pd.DataFrame(ds["validation"][:10])
X_val_text = val_data["sentence"]
Y_val = val_data["label"]

In [42]:
def zero_shot_learner(model, tokenizer, X_val_text, Y_val):

    correct_predictions = 0
    total_predictions = len(X_val_text)
    
    for text, true_label in zip(X_val_text, Y_val):
        prompt_0 = f"{text} The answer is 0."
        prompt_1 = f"{text} The answer is 1."
        
        inputs_0 = tokenizer(prompt_0, return_tensors="pt", truncation=True, max_length=512)
        inputs_1 = tokenizer(prompt_1, return_tensors="pt", truncation=True, max_length=512)
        
        with torch.no_grad():
            outputs_0 = model(**inputs_0)
            outputs_1 = model(**inputs_1)
        
        logits_0 = outputs_0.logits[0, -1, :]
        logits_1 = outputs_1.logits[0, -1, :]
        
        prob_0 = F.softmax(logits_0, dim=-1)[tokenizer.convert_tokens_to_ids("0")]
        prob_1 = F.softmax(logits_1, dim=-1)[tokenizer.convert_tokens_to_ids("1")]
        
        predicted_label = 0 if prob_0 > prob_1 else 1
        
        if predicted_label == true_label:
            correct_predictions += 1
    
    accuracy = correct_predictions / total_predictions
    return accuracy

accuracy = zero_shot_learner(model, tokenizer, X_val_text, Y_val)
print(f"Validation Accuracy: {accuracy:.2f}")

Validation Accuracy: 0.70


The zero shot classifier had a validation accuracy of 0.7 meaning that it performs relatively well considering that it relies solely based on the pre-trained knowledge and has a generally good grasp of the semantics in the data. 

### 3. Few-shot in-context learning with language model (10')  
How to avoid querying the probabilities? One popular method to improve the format-following behavior of the model using in-context learning (few-shot).  
Few-shot learning places a few demonstration examples in the prompt. Each demonstration example follows the same format, for example: "[Example Problem] The answer is False\n". Finally, the problem is concatenated into prompt: "[Problem] The answer is" 
Hopefully, through this few-shot demonstration, the model learns the format specification and can output the results directly. 

Randomlly pick $k=2$ examples in the SST's training set. Use this in-context learner to attempt the same set of N questions in the previous problem. Comment on the results.  

In [44]:
import random
def in_context_learner(model, tokenizer, X_val_text, Y_val):
    
    examples = random.sample(list(zip(X_val_text, Y_val)), 2)

    demonstration_prompt = ""
    for example_text, example_label in examples:
        demonstration_prompt += f"{example_text} The answer is {example_label}\n"
    
    correct_predictions = 0
    total_predictions = len(X_val_text)
    
    for text, true_label in zip(X_val_text, Y_val):
        prompt = demonstration_prompt + f"{text} The answer is"
        
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=inputs.input_ids.shape[1] + 2,  # Allow space for the answer
                pad_token_id=tokenizer.eos_token_id  # Handle padding for models like GPT-2
            )
        
        # Decode the model's output
        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract the predicted answer
        predicted_label = output_text.split("The answer is")[-1].strip().split()[0]
        
        # Compare with the true label
        if predicted_label == str(true_label):
            correct_predictions += 1
    
    # Calculate accuracy
    accuracy = correct_predictions / total_predictions
    return accuracy

accuracy = in_context_learner(model, tokenizer, X_val_text, Y_val)
print(f"Validation Accuracy: {accuracy:.2f}")

Validation Accuracy: 0.60


The in context learner had a validation accuracy of 0.6 showing that it struggles a little more than the zero shot classifier. This could be due to the small sample size not being sufficiently diverse