# First contact with open source LLMs

In this exercise, you will perform inference using open source LLMs with the [HuggingFace Transformers library](https://huggingface.co/docs/transformers/en/index).

HuggingFace is the de facto standard platform for releasing LLMs and related datasets.

Make sure you have set up your conda environment following the instructions from week 1.

# Exercise 1

In this exercise, we will download a "small" LLM and see how tokenization and inference work.

## Exercise 1a: Downloading and preparing the model

You do not have to solve anything in this exercise. You are already given the solution. Your task is simply to walk through it and understand it.

We will work with a small model that has about 500 million parameters. It is neither as large nor as capable as ChatGPT, but you can easily run it on your own machine. In the lectures that follow, we will experiment with larger models.

You can learn more about the model [here](https://huggingface.co/Qwen/Qwen2.5-0.5B).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import random
import datasets
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

### Initialize the model

Every model on HuggingFace has a unique name. Each model also comes with its own tokenizer. You can download the model and the corresponding tokenizer using this unique name.

The next cell might take a while to run, since it needs to download about one gigabyte of data.

In [None]:
model_name = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Tokenization

You will see how the input gets converted to token IDs. Each ID represents an individual text token.

In [None]:
prompt = "How do you do on this splendid day?"
tokenized_input = tokenizer(prompt, return_tensors="pt")
print(tokenized_input)

You can ignore the attention mask for now. For causal LLMs, it mostly matters when processing more than one input text at a time, because shorter inputs then have to be padded to a common length.
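
To make the role of the attention mask concrete, here is a minimal sketch in plain Python. The token IDs and the padding ID `PAD_ID` are made up for illustration and are not real Qwen values; a real tokenizer does this for you when asked to pad a batch.

```python
# Toy token ID sequences of different lengths (made-up IDs, not real Qwen tokens).
batch = [[11, 42, 7], [5, 9]]
PAD_ID = 0  # hypothetical padding ID chosen for this sketch

max_len = max(len(seq) for seq in batch)

# Pad on the right so that every sequence has the same length.
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

# The attention mask marks real tokens with 1 and padding with 0.
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)
print(attention_mask)
```

With a single input there is nothing to pad, so the mask consists only of ones.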

Let us print the tokens that each ID represents.

In [None]:
input_tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"].numpy().flatten())
print(input_tokens)

The "Ġ" character represents a preceding space.
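
As a simplified sketch of what this marker means, the snippet below joins a hand-written token list and turns each "Ġ" back into a space. The token list is illustrative and not actual tokenizer output, and real decoding handles more special characters than this one.

```python
# Illustrative tokens only; the real tokenizer output may differ.
tokens = ["How", "Ġdo", "Ġyou", "Ġdo"]

# Concatenate the tokens and turn every "Ġ" marker into a plain space.
text = "".join(tokens).replace("Ġ", " ")

print(text)
```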

We can also convert these tokens back to a full text string.

In [None]:
print(tokenizer.convert_tokens_to_string(input_tokens))

### Vocabulary size

Let us print the size of the model vocabulary, that is, the number of distinct tokens that the tokenizer can produce.

In [None]:
print(f"Vocab size: {tokenizer.vocab_size}")

### Inference

Recall that LLMs are really just Transformer models, which take the input and generate $V$ scores, where $V$ is the total number of tokens in our vocabulary. One reasonable way to generate the next token is to select the token with the highest score.
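
The selection rule can be sketched on a toy example with $V = 4$. Both the vocabulary and the scores below are invented for illustration; the real model produces one score for each token in its much larger vocabulary.

```python
# Hypothetical 4-token vocabulary with made-up scores (one score per token).
vocab = ["yes", "no", "maybe", "."]
scores = [1.2, 3.4, -0.5, 0.1]

# Greedy choice: the index of the highest score.
best_id = max(range(len(scores)), key=lambda i: scores[i])
best_token = vocab[best_id]

print(best_id, best_token)
```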

In [None]:
prompt = "Is Bochum a great city?"

tokenized_input = tokenizer(prompt, return_tensors="pt")
input_ids = tokenized_input["input_ids"]
print(f"Shape of input IDs: {input_ids.shape}")  # number of inputs x number of tokens
input_tokens = tokenizer.convert_ids_to_tokens(input_ids.numpy().flatten())
with torch.no_grad():
    # We should pass the attention mask, but we can ignore it for causal LLMs when we have just a single input
    output = model(input_ids).logits
print(f"Shape of the output: {output.shape}")

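As we can see, the model computes output scores at **every single input position**. The scores at position $i$ are the model's prediction for the token that follows the first $i$ input tokens.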

Let us compute what the most likely token at each position is.

In [None]:
most_likely_tokens = output.squeeze(dim=0)  # Remove the first dimension which is 1
most_likely_tokens = most_likely_tokens.argmax(dim=-1)  # At each generation position, select the token with the highest score
output_tokens = tokenizer.convert_ids_to_tokens(most_likely_tokens.numpy())

input_so_far = []
next_token = []
for i in range(len(input_tokens)):
    input_text = tokenizer.convert_tokens_to_string(input_tokens[:i+1])  # combine all the input tokens up to this generation position
    gen_token = tokenizer.convert_tokens_to_string([output_tokens[i]])
    input_so_far.append(input_text)
    next_token.append(gen_token)

pd.DataFrame({"Input": input_so_far, "Model output": next_token})

## Exercise 1b: Generating multiple tokens

Your task is to write a function that takes an input prompt and a generation length. It then generates as many new tokens as the generation length specifies. At each position, you will generate the most likely next token.

Test the function with a few prompts like:
1. Germany is a country
2. Abraham Lincoln was born in

Feel free to add prompts of your own liking :)

**Hint:** Recall that LLMs are autoregressive. That is, after generating the first token, you append it back to the input to generate the second token.
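
The autoregressive loop can be sketched without any real model. Here `toy_scores` is a hypothetical stand-in for the LLM that always favors the token ID following the last one, and `greedy_decode` is an illustrative helper name, not part of any library.

```python
VOCAB_SIZE = 5  # tiny made-up vocabulary of token IDs 0..4


def toy_scores(ids):
    """Hypothetical stand-in for a model: favors the ID after the last one."""
    favored = (ids[-1] + 1) % VOCAB_SIZE
    return [1.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]


def greedy_decode(ids, gen_len):
    ids = list(ids)  # copy so the caller's list is not modified
    for _ in range(gen_len):
        scores = toy_scores(ids)  # scores for the next position
        next_id = max(range(VOCAB_SIZE), key=lambda i: scores[i])
        ids.append(next_id)  # feed the new token back into the input
    return ids


result = greedy_decode([3], 4)
print(result)
```

In the exercise, the call to `toy_scores` corresponds to running the model and reading the scores at the last position.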

In [None]:
def generate(prompt: str, gen_len: int) -> str:
    # Your code here
    raise NotImplementedError

prompt = "I love Bochum because"
generate(prompt, 30)

In [None]:
def generate(prompt: str, gen_len: int) -> str:
    # Reference solution: greedy decoding, one token at a time
    tokenized_input = tokenizer(prompt, return_tensors="pt")
    input_ids = tokenized_input["input_ids"]  # shape: 1 x number of tokens
    for _ in range(gen_len):
        with torch.no_grad():
            # We should pass the attention mask, but we can ignore it for causal LLMs when we have just a single input
            logits = model(input_ids).logits
        next_token_scores = logits[0, -1]  # scores at the last position
        next_token_id = next_token_scores.argmax(dim=-1)
        # Append the new token so that it becomes part of the input for the next step
        input_ids = torch.cat((input_ids, next_token_id.reshape(1, 1)), dim=-1)
    return tokenizer.decode(input_ids.numpy().flatten())

prompt = "I love Bochum because"
# generate(prompt, 90)

# Exercise 2: Stochastic generations and temperature

In this exercise, we will continue with LLM generations. We will try stochastic generations and also fiddle with temperature values.

Remember, in order to ensure reproducibility, we need to set our seeds before we call stochastic operations.
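
As a small illustration using only Python's built-in `random` module, resetting the seed before a stochastic operation reproduces exactly the same draws:

```python
import random

random.seed(1)
first = [random.random() for _ in range(3)]

random.seed(1)  # reset to the same seed
second = [random.random() for _ in range(3)]

print(first == second)
```

The same idea applies to NumPy and PyTorch, each of which keeps its own random state.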

In [None]:
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
set_seed(1)

## Exercise 2a: Generating stochastically

Fill in the `stochastic_generate` function so that it samples each next token from the softmax distribution over the model's scores, with the scores divided by the temperature `temp`.

Test your function on the prompts above.
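
To illustrate the sampling step on its own, here is a sketch in plain Python that works on a made-up score vector instead of real model outputs. The helper name `softmax_with_temperature` is hypothetical, and in your solution you would apply the same idea to the scores at the last position.

```python
import math
import random


def softmax_with_temperature(scores, temp=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [s / temp for s in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary

probs = softmax_with_temperature(scores, temp=1.0)
sharp_probs = softmax_with_temperature(scores, temp=0.01)
flat_probs = softmax_with_temperature(scores, temp=100.0)

# Sample one token ID according to the probabilities.
random.seed(1)
next_id = random.choices(range(len(scores)), weights=probs, k=1)[0]

print(probs, sharp_probs, flat_probs, next_id)
```

A low temperature puts almost all probability on the highest-scoring token, while a high temperature pushes the distribution towards uniform.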

In [None]:
def stochastic_generate(prompt: str, gen_len: int, temp: float = 1):
    # Your code here
    raise NotImplementedError

prompt = "Berlin is a city in"
stochastic_generate(prompt, 10, 0.01)

## Exercise 2b: Generating on real-world data with different temperatures

Below we download the BOLD dataset for you. The dataset consists of incomplete sentences from Wikipedia, which the LLMs are supposed to finish.

Select 10 prompts from this data. For each prompt, generate the output 5 times.

Repeat the procedure for the following temperatures:
1. T = 0.00001
2. T = 1
3. T = 2
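
One possible way to organize this experiment is sketched below. The helper `collect_generations` and the placeholder `dummy_generate` are hypothetical names for this sketch; in the notebook you would pass a small wrapper around your own `stochastic_generate` instead of the placeholder.

```python
def collect_generations(generate_fn, prompts, temperatures, n_repeats):
    """Run generate_fn for every temperature, prompt and repetition."""
    rows = []
    for temp in temperatures:
        for prompt in prompts:
            for run in range(n_repeats):
                rows.append({
                    "temperature": temp,
                    "prompt": prompt,
                    "run": run,
                    "output": generate_fn(prompt, temp),
                })
    return rows


def dummy_generate(prompt, temp):
    """Placeholder that returns a fixed string instead of calling an LLM."""
    return f"{prompt} ... (T={temp})"


rows = collect_generations(dummy_generate, ["A", "B"], [0.00001, 1, 2], n_repeats=5)
print(len(rows))
```

Collecting the results as a list of dictionaries makes it easy to load them into a `pd.DataFrame` and compare the outputs side by side.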

What differences do you observe?

In [None]:
! pip install datasets

In [None]:
n_prompts = 10
bold = datasets.load_dataset("AlexaAI/bold")
random.seed(11)
prompts = []
for prompt_list in bold["train"]["prompts"]:
    prompts.extend(prompt_list)
random.shuffle(prompts)
prompts = prompts[:n_prompts]
for prompt in prompts:
    print(prompt)

In [None]:
# Your code here