# Introduction to Language models (LMs) - Homework

In [1]:
import os
from transformers import AutoTokenizer,AutoModelForCausalLM, AutoConfig, pipeline
os.environ["HTTP_PROXY"]="proxy.alcf.anl.gov:3128"
os.environ["HTTPS_PROXY"]="proxy.alcf.anl.gov:3128"
os.environ["http_proxy"]="proxy.alcf.anl.gov:3128"
os.environ["https_proxy"]="proxy.alcf.anl.gov:3128"
os.environ["ftp_proxy"]="proxy.alcf.anl.gov:3128"



## Homework

1. Load in a generative model using the HuggingFace pipeline and generate text using a batch of prompts.
  * Play with generative parameters such as temperature, max_new_tokens, and the model itself and explain the effect on the legibility of the model response. Try at least 4 different parameter/model combinations.
  * Models that can be used include:
    * `google/gemma-2-2b-it`
    * `microsoft/Phi-3-mini-4k-instruct`
    * `meta-llama/Llama-3.2-1B`
    * Any model from this list: [Text-generation models](https://huggingface.co/models?pipeline_tag=text-generation)
    * `gpt2` if having trouble loading these models in
  * This guide should help! [Text-generation strategies](https://huggingface.co/docs/transformers/en/generation_strategies)

  
2. Load in 2 models of different parameter size (e.g. GPT2, meta-llama/Llama-2-7b-chat-hf, or distilbert/distilgpt2) and analyze the BertViz for each. How do the attention mechanisms change depending on model size?

### 1. Load in a generative model using the HuggingFace pipeline and generate text using a batch of prompts.

In [2]:
prompts = ['What should I do tomorrow?',
           "Hello, my name is",
           "Now, to be honest, I "]

# I was having trouble loading the other models
generator = pipeline("text-generation", model="gpt2", framework='pt', device=0)

for max_length in [10,20,30]:
    print("max_length:", max_length)
    for input_text in prompts:
        print(generator(input_text, max_length=max_length, num_return_sequences=3, temperature = 1.0))


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


max_length: 10




[{'generated_text': 'What should I do tomorrow?\n\nAt the'}, {'generated_text': 'What should I do tomorrow?\n\nI can'}, {'generated_text': 'What should I do tomorrow?\n\nAfter completing'}]
[{'generated_text': 'Hello, my name is Nita, and I'}, {'generated_text': 'Hello, my name is William (Wyatt'}, {'generated_text': 'Hello, my name is Lilli and I am'}]
[{'generated_text': 'Now, to be honest, I \xa0did'}, {'generated_text': 'Now, to be honest, I \xa0still'}, {'generated_text': 'Now, to be honest, I \xa0didn'}]
max_length: 20
[{'generated_text': 'What should I do tomorrow?\n\nAfternoon dinner, 9/1/24\n\n4'}, {'generated_text': "What should I do tomorrow? I don't have the words to answer my question about what to do"}, {'generated_text': 'What should I do tomorrow?\n\nWhen should I look up?\n\nWhen am I ready'}]




[{'generated_text': "Hello, my name is Nick Lyon. I'm a web developer who creates and builds web projects from"}, {'generated_text': 'Hello, my name is Zane, and here is an overview of the new additions to the community'}, {'generated_text': 'Hello, my name is Ben, and I just came back from Australia to go with my parents.'}]
[{'generated_text': 'Now, to be honest, I urned them the more because they were of high caliber.'}, {'generated_text': "Now, to be honest, I \xa0didn't really like the current version of the interface."}, {'generated_text': "Now, to be honest, I \xa0didn't realize how much of a big deal it was"}]
max_length: 30




[{'generated_text': 'What should I do tomorrow?\n\nWhen you are ready to make your next meal, ask the waiter to put on the dishwasher.\n\n'}, {'generated_text': 'What should I do tomorrow? What should I do next?\n\nIf I see this or want to learn more about it, you can read more'}, {'generated_text': "What should I do tomorrow?\n\nHere's the basic outline.\n\nI just wrote the code in Python. This means this program should save"}]
[{'generated_text': "Hello, my name is Mike. I'm an actor and I play a character who happens to have lived a bit in the States in the last three"}, {'generated_text': "Hello, my name is Jack. I'm the owner of the company called Aventura, LLC and can't stop talking about the product with my"}, {'generated_text': "Hello, my name is Richard. I'm very small, I have a very large dick. I have been in business. I have lived here for"}]
[{'generated_text': "Now, to be honest, I iced up at lunch a little less before the game in a way I hadn't done in all my years with"}

With a small max_length (10 tokens, a limit that includes the prompt), the model has room for only a few new tokens, so the continuations are cut off before they make much sense. At max_length=20 there is enough room for a complete thought to be generated. At max_length=30 the responses start to turn into rambling, run-on sentences.
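
Because `max_length` caps the prompt and the generated text together, the room left for new text shrinks as the prompt grows. The arithmetic can be sketched as below. The helper name `new_token_budget` and the prompt length of 6 tokens are hypothetical, chosen only for illustration, and were not measured with the gpt2 tokenizer.

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when max_length caps prompt + output."""
    return max(max_length - prompt_tokens, 0)


# Hypothetical 6-token prompt, with the three max_length values tried above.
for max_length in [10, 20, 30]:
    print("max_length:", max_length, "new tokens:", new_token_budget(6, max_length))
```

Passing `max_new_tokens` to the pipeline instead of `max_length` sets this budget directly, so it no longer depends on the prompt length.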

In [3]:
prompts = ['What should I do tomorrow?',
           "Hello, my name is",
           "Now, to be honest, I "]

generator = pipeline("text-generation", model="gpt2", framework='pt', device=0)

for temp in [0.5,1.0,1.5]:
    print("temperature:", temp)
    for input_text in prompts:
        print(generator(input_text, max_length=20, num_return_sequences=3, temperature = temp))



temperature: 0.5
[{'generated_text': "What should I do tomorrow?\n\nI've got a lot of questions.\n\nWhat should"}, {'generated_text': 'What should I do tomorrow?\n\n"I think I\'ll be fine. I\'m just going'}, {'generated_text': 'What should I do tomorrow?\n\nI will do everything I can to make sure my kids are'}]
[{'generated_text': 'Hello, my name is Joe. I was just about to say goodbye to my family, but I'}, {'generated_text': "Hello, my name is Dan. I'm a professional writer and I'm a big fan of the"}, {'generated_text': 'Hello, my name is Nick, and I am a professional athlete. I am a professional wrestler,'}]




[{'generated_text': "Now, to be honest, I \xa0didn't really like this. I was a little bit"}, {'generated_text': "Now, to be honest, I \xa0didn't really get to see that movie. The movie"}, {'generated_text': "Now, to be honest, I \xa0didn't really care about the question about the number of"}]
temperature: 1.0
[{'generated_text': "What should I do tomorrow?\n\nThe following are several things I've been following on a daily"}, {'generated_text': "What should I do tomorrow? First, I'll set up a project and ask for it. Then"}, {'generated_text': 'What should I do tomorrow?\n\n\nI have recently been experiencing a feeling of dread and insecurity in'}]
[{'generated_text': 'Hello, my name is Eren J. He graduated from college and has lived in Japan since 2011'}, {'generated_text': "Hello, my name is Matt, and I'm using Emacs for all kinds of stuff. So I"}, {'generated_text': 'Hello, my name is Ego", readjusted from the very beginning, and not in my'}]




[{'generated_text': "Now, to be honest, I \xa0had little trouble with using this. There's also a"}, {'generated_text': 'Now, to be honest, I \xa0didn\'t really think that I\'ll get my "final'}, {'generated_text': "Now, to be honest, I \xa0didn't care about his personal security at all - I"}]
temperature: 1.5
[{'generated_text': 'What should I do tomorrow?\nAfternoon coffee. You may want another toasted, and perhaps'}, {'generated_text': 'What should I do tomorrow? I should do nothing.\n\nSo we started getting ready to head'}, {'generated_text': 'What should I do tomorrow? Well, most obviously before he has finished with his party or before the'}]
[{'generated_text': "Hello, my name is Sankhopp. Is this the most well-known nickname you're"}, {'generated_text': 'Hello, my name is Richard Gifford of Toronto and my mission to provide a space for families'}, {'generated_text': 'Hello, my name is Jelena, and thank you all for following in this footsteps I am'}]
[{'generated_text': 'Now, to b

Increasing the temperature makes the generated text more "unexpected," because sampling spreads more probability onto less likely tokens. For example, the names generated after "Hello, my name is" were common American first names at the lower temperature (Joe, Dan, Nick) and less common ones at the higher temperature (Sankhopp, Jelena).
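
The effect of temperature can be illustrated without loading a model. The sketch below divides a set of logits by the temperature before applying softmax, which is the standard way temperature enters sampling. The logit values are made up for illustration and do not come from gpt2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for four candidate next tokens (illustrative values only).
logits = [4.0, 2.0, 1.0, 0.5]
for temp in [0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, temp)
    print("temperature:", temp, [round(p, 3) for p in probs])
```

At a low temperature the most likely token takes almost all of the probability mass. At a higher temperature the distribution flattens, so rarer tokens (such as unusual names) are sampled more often.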

In [4]:
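import os
from huggingface_hub import login
# Read the access token from the environment (HF_TOKEN is an assumed variable name); never hardcode it in a notebook.
login(token=os.environ["HF_TOKEN"])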
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B", device=0)
for input_text in prompts:
    print(generator(input_text, max_length=20, num_return_sequences=3, temperature = 1.0))



[{'generated_text': 'What should I do tomorrow? Is there anything to do? Or should I just be in the'}, {'generated_text': 'What should I do tomorrow? Can I talk to my brother?\nHe’s been talking about the'}, {'generated_text': "What should I do tomorrow? (and today?)\nI've been thinking a lot about that today"}]




[{'generated_text': "Hello, my name is Jason. Welcome to my site!\nI'm an avid collector of vintage"}, {'generated_text': 'Hello, my name is Mr. Garry. This is my first year teaching in the '}, {'generated_text': 'Hello, my name is Landon and I have a blog. This is an intro blog and'}]
[{'generated_text': 'Now, to be honest, I \xa0wasn’t sure that I wanted a new project'}, {'generated_text': 'Now, to be honest, I \xa0have been looking for a while on the internet for'}, {'generated_text': "Now, to be honest, I \xa0didn't really want a new computer and a new"}]


Llama-3.2-1B seems better able to match the tone of the prompt than gpt2. For example, in response to "What should I do tomorrow?" it continues with more questions, whereas gpt2 mostly did not. In general, this made its responses more coherent.

### 2. Load in 2 models of different parameter size and analyze the BertViz for each. How do the attention mechanisms change depending on model size?

#### smaller model - gpt2

#### larger model - llama

In [1]:
! pip install bertviz

Defaulting to user installation because normal site-packages is not writeable


In [2]:
from transformers import AutoTokenizer, AutoModel, utils, AutoModelForCausalLM

from bertviz import model_view
utils.logging.set_verbosity_error()  # Suppress standard warnings
input_text = "The fault, dear Brutus, is not in our stars"



In [3]:
model_name='openai-community/gpt2'
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')
outputs = model(inputs)  # Run model

attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view



<IPython.core.display.Javascript object>

In [4]:
model_name = 'meta-llama/Llama-3.2-1B'
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')
outputs = model(inputs)  # Run model

attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

<IPython.core.display.Javascript object>

For gpt2, BertViz had fewer layers and heads to display (12 layers with 12 heads each) than for Llama-3.2-1B (16 layers with 32 heads each). In both models, most of the attention went to the first one or two tokens of the sentence. However, a few heads in some layers had each token attend most strongly to itself. The proportion of tokens that attend most strongly to themselves seems higher in gpt2 than in Llama-3.2-1B.
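
The visual impression above could be checked numerically. The sketch below takes one head's attention matrix and measures how often a token's strongest attention lands on itself versus on the first token. It is an illustrative helper, not part of BertViz. The function name `attention_targets` and the toy matrix are hypothetical. With the notebook's `attention` tuple (one tensor per layer, shaped batch × heads × seq_len × seq_len), a single head could be passed in as `attention[layer][0, head].tolist()`.

```python
def attention_targets(attn):
    """Summarize where each token's strongest attention lands for one head.

    attn[i][j] is the weight token i places on token j.
    The first token is skipped because, in a causal model, it can only
    attend to itself. Returns (fraction attending most to themselves,
    fraction attending most to the first token).
    """
    n = len(attn)
    self_hits = 0
    first_hits = 0
    for i in range(1, n):
        row = attn[i][: i + 1]  # causal mask: only positions 0..i are visible
        top = row.index(max(row))
        if top == i:
            self_hits += 1
        elif top == 0:
            first_hits += 1
    return self_hits / (n - 1), first_hits / (n - 1)


# Toy 4-token causal attention matrix (made-up values).
toy_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
    [0.1, 0.1, 0.1, 0.7],
]
self_frac, first_frac = attention_targets(toy_head)
print("self:", self_frac, "first token:", first_frac)
```

Averaging these two fractions over all heads and layers of each model would give one number per model to compare, rather than relying on the plots alone.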