# HW4: Text Generation and Attention Mechanism Analysis

This homework has two sections:
1. **Text Generation:** using multiple models with different parameters like temperature and max tokens.
2. **Understanding Attention Mechanisms:** using BertViz for models of different sizes to analyze how their attention mechanisms differ.


## Part 1: Text Generation with HuggingFace Models
We'll experiment with different models and generation parameters, including temperature and max tokens, to see how they affect the model's responses.


In [1]:
!pip install transformers torch
!pip install huggingface_hub



### Load Models and Tokenizers
We'll use a few models for text generation and tweak generation parameters.

In [8]:
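import os
from huggingface_hub import login

# Log in to the Hugging Face account with an access token.
# Assumption: the token is stored in the HF_TOKEN environment variable instead of being hardcoded here.
login(token=os.environ["HF_TOKEN"], add_to_git_credential=True)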

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load GPT-2
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Load LLaMA-2-7B (gated repo: access must be granted on the model page, otherwise this raises OSError)
model_name = "meta-llama/Llama-2-7b-chat-hf"
llama_tokenizer = AutoTokenizer.from_pretrained(model_name)
llama_model = AutoModelForCausalLM.from_pretrained(model_name)



OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.
403 Client Error. (Request ID: Root=1-6726499e-2c020e6d5c1a06c572eb4e48;c97ec144-6c97-4f64-9648-f8fbcbde04f2)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-chat-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to ask for access.

### Generate Text with Different Parameters
We'll use the GPT-2 model and experiment with the following generation parameters:
- `temperature`
- `max_new_tokens`
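
To see why `temperature` changes the output, it helps to look at what it does to the next-token distribution: the logits are divided by the temperature before the softmax. The following is a minimal sketch in plain Python, not the library's implementation. The helper name `softmax_with_temperature` and the logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A temperature below 1 sharpens the distribution toward the most likely token, while a temperature above 1 flattens it, so sampling picks unlikely tokens more often. `max_new_tokens` only caps how many tokens are sampled after the prompt.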

In [None]:
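# Define a function to generate text
def generate_text(model, tokenizer, prompt, temperature=1.0, max_new_tokens=50):
    """Sample a continuation of `prompt` and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,  # sampling must be enabled for temperature to take effect
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
    # output[0] contains the prompt tokens followed by the generated tokens
    return tokenizer.decode(output[0], skip_special_tokens=True)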

# Test the function with GPT-2 and different parameters
prompt = "The future of AI is"
gpt2_output1 = generate_text(gpt2_model, gpt2_tokenizer, prompt, temperature=0.7)
gpt2_output2 = generate_text(gpt2_model, gpt2_tokenizer, prompt, temperature=1.5, max_new_tokens=100)

print("Output with temperature 0.7:")
print(gpt2_output1)
print("\nOutput with temperature 1.5:")
print(gpt2_output2)

Now, we'll generate text using the LLaMA-2 model with different parameters.

In [None]:
# Generate text with LLaMA-2 and different parameters
llama_output1 = generate_text(llama_model, llama_tokenizer, prompt, temperature=0.8)
llama_output2 = generate_text(llama_model, llama_tokenizer, prompt, temperature=1.2, max_new_tokens=100)

print("LLaMA-2 output with temperature 0.8:")
print(llama_output1)
print("\nLLaMA-2 output with temperature 1.2:")
print(llama_output2)

## Part 2: Attention Mechanism Analysis with BertViz
We'll use BertViz to analyze how the attention mechanisms differ between a smaller model (GPT-2) and a larger model (LLaMA-2-7B).

In [None]:
!pip install bertviz

### Load the Models
We'll load GPT-2 and LLaMA-2 for the attention visualization task.

In [None]:
# Load the smaller model (GPT-2)
small_tokenizer = AutoTokenizer.from_pretrained("gpt2")
small_model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

# Load the larger model (LLaMA-2-7B)
large_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
large_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", output_attentions=True)

### Tokenize Input for Both Models

In [None]:
text = "The future of AI holds immense potential."

# Tokenize for GPT-2
small_inputs = small_tokenizer(text, return_tensors="pt")

# Tokenize for LLaMA-2
large_inputs = large_tokenizer(text, return_tensors="pt")

### Get the Outputs with Attention

In [None]:
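import torch

# Run a forward pass without tracking gradients. Each output's `.attentions`
# is a tuple with one (batch, heads, seq_len, seq_len) tensor per layer.
with torch.no_grad():
    small_outputs = small_model(**small_inputs)  # attention from GPT-2
    large_outputs = large_model(**large_inputs)  # attention from LLaMA-2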

### Visualize Attention using BertViz

In [None]:
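from bertviz import head_view

# head_view expects the attention tuple and a list of token strings,
# so convert the input IDs back to tokens first.
small_tokens = small_tokenizer.convert_ids_to_tokens(small_inputs.input_ids[0])
large_tokens = large_tokenizer.convert_ids_to_tokens(large_inputs.input_ids[0])

# GPT-2 visualization
head_view(small_outputs.attentions, small_tokens)

# LLaMA-2 visualization
head_view(large_outputs.attentions, large_tokens)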

## Analysis of Attention Mechanisms
With the attention heads of both models visualized, we can compare them:

- **GPT-2 (small model)**: The base `gpt2` checkpoint has about 124M parameters, arranged as 12 layers with 12 attention heads each (144 heads in total). With fewer heads, we expect more of them to concentrate on a limited context, such as recent tokens or single-word relationships. This is a tendency to check in the head view, not a guaranteed property.
- **LLaMA-2-7B (large model)**: The 7B model has 32 layers with 32 heads each (1,024 heads in total). More heads leave room for specialization, with some heads tracking local context and others long-range dependencies, which would show up as more varied attention distributions across heads.
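
One way to make this qualitative comparison measurable is to compute the entropy of each attention row: a head that concentrates on one or two tokens has low entropy, while a head that spreads its weight widely has high entropy. The sketch below is a plain-Python illustration under that assumption. The function names and the toy matrices are hypothetical and are not output from either model.

```python
import math

def row_entropy(weights):
    """Shannon entropy (in nats) of one attention row; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def mean_attention_entropy(attention):
    """Average row entropy for one head's (seq_len x seq_len) attention matrix."""
    return sum(row_entropy(row) for row in attention) / len(attention)

# Toy causal attention matrices for a 3-token input (illustrative values, not model output).
focused_head = [
    [1.0, 0.0, 0.0],
    [0.05, 0.95, 0.0],
    [0.02, 0.03, 0.95],
]
diffuse_head = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
]

print("focused head:", round(mean_attention_entropy(focused_head), 3))
print("diffuse head:", round(mean_attention_entropy(diffuse_head), 3))
```

With real outputs, a single head's matrix can be passed in as `small_outputs.attentions[layer][0, head].tolist()`, where `layer` and `head` are the indices of interest.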