# Exercise: Generating one token at a time

In this exercise, we will see how an LLM generates text: one token at a time, using the previous tokens to predict the next one.


## Step 1. Load a tokenizer and a model

First we load a tokenizer and a model from Hugging Face's `transformers` library. A tokenizer splits a string into tokens and maps each token to an integer ID that the model can process.

In this exercise, all the code will be written for you. All you need to do is follow along!

In the context of NLP and Hugging Face's `transformers` library, **tokens** are the basic units into which a piece of text is split before being fed into a model. The model processes these tokens as numerical representations.

Here's how **tokens** are defined and used in this exercise:

---

### **1. What are tokens?**
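- Tokens are **smaller units of text**, such as:
  - Words: Common words like "is" and "the" are usually single tokens.
  - Subwords: Less common words are split into pieces. GPT-2 splits " generative" into " gener" and "ative".
  - Special characters: Punctuation marks like `.` or `,` are also treated as tokens.
  - Spaces: Some tokenizers treat spaces as separate tokens. GPT-2 instead attaches a leading space to the token that follows it (for example, `" is"`).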
  
These tokens are mapped to **unique numerical IDs** using a tokenizer's vocabulary.
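
The mapping between tokens and IDs can be pictured as a simple lookup table. The sketch below is only an illustration: `toy_vocab`, `encode`, and `decode` are hypothetical names, and the table is a hand-made subset that reuses the IDs the GPT-2 tokenizer reports for this exercise's prompt later in the notebook. The real vocabulary has tens of thousands of entries, and the real tokenizer also does the splitting for you.

```python
# Hand-made subset of a vocabulary (assumption: for illustration only).
# The IDs match the ones the notebook prints for this prompt.
toy_vocab = {
    "U": 52,
    "d": 67,
    "acity": 4355,
    " is": 318,
    " the": 262,
    " best": 1266,
}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def encode(pieces):
    """Map text pieces that are already split to their numerical IDs."""
    return [toy_vocab[piece] for piece in pieces]


def decode(ids):
    """Map IDs back to text pieces and join them into one string."""
    return "".join(id_to_token[token_id] for token_id in ids)


ids = encode(["U", "d", "acity", " is", " the", " best"])
print(ids)          # [52, 67, 4355, 318, 262, 1266]
print(decode(ids))  # Udacity is the best
```

Because the leading space is part of the token itself, joining the decoded pieces reproduces the original text exactly.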

---

### **2. How does tokenization work in GPT-2?**
GPT-2 uses a **Byte Pair Encoding (BPE)** tokenizer, which splits text into subwords to balance vocabulary size and coverage. Here's what happens during tokenization:

- **Word-level tokens**: Common words such as " is" or " the" are single tokens (including their leading space).
- **Subword tokens**: Words that are not in the vocabulary as a whole are split into smaller units. For example, " generative" becomes " gener" and "ative", and "Udacity" becomes "U", "d", and "acity".
- **Character-level tokens**: Rare or unusual strings may be broken down as far as single characters or bytes.
- **Special tokens**: Reserved tokens such as `<|endoftext|>` serve special functions, like marking the end of a document.

For example:
- Input: `"Udacity is the best place to learn about generative"`
- Tokenized:
  - `"U"` → Token ID 52
  - `"d"` → Token ID 67
  - `"acity"` → Token ID 4355
  - `" is"` → Token ID 318
  - `" the"` → Token ID 262
  - `" best"` → Token ID 1266
  - `" place"` → Token ID 1295
  - `" to"` → Token ID 284
  - `" learn"` → Token ID 2193
  - `" about"` → Token ID 546
  - `" gener"` → Token ID 1152
  - `"ative"` → Token ID 876
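
To make the idea behind Byte Pair Encoding more concrete, here is a minimal sketch of how BPE merges can be learned: start from single characters and repeatedly merge the most frequent adjacent pair. This is a toy version under stated assumptions, not GPT-2's actual training procedure. The three-word corpus and its counts are made up, and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Made-up corpus: word -> how often it occurs (assumption for illustration).
corpus = {"generative": 3, "generation": 2, "general": 1}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After four merges on this toy corpus, all three words share the learned subword "gener", which mirrors how the real tokenizer ends up with a reusable " gener" token.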

---

### **3. Why are tokens important?**
Tokens are critical for:
1. **Model Understanding**: Neural networks process numbers, not text. Token IDs convert text into a format the model understands.
2. **Efficiency**: Subword tokenization (like in GPT-2) ensures that:
   - Rare words are split into reusable subunits.
   - The model doesn’t need to store an enormous vocabulary.
3. **Flexibility**: It allows handling of unseen words or complex languages.

---

### **4. What do tokens look like in this code?**
When you inspect `inputs["input_ids"]`, you see a tensor of numerical IDs, representing the sequence of tokens. For example:
```python
inputs["input_ids"]
# tensor([[  52,   67, 4355,  318,  262, 1266, 1295,  284, 2193,  546, 1152,  876]])
```
This is the tokenized representation of `"Udacity is the best place to learn about generative"`.

---

### **5. Key Advantages of Tokenization**
- **Compact Vocabulary**: Subword tokenization minimizes the size of the vocabulary while still covering diverse text.
- **Handling Rare Words**: Instead of treating "generative" as one token, splitting it into "gener" and "ative" allows the model to handle similar words (e.g., "generation") efficiently.
- **Preprocessing for Models**: Converts human-readable text into numerical input.

You can explore the tokens more explicitly using:
```python
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
```
This will give you the actual tokens corresponding to the numerical IDs.
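
One detail worth knowing: GPT-2's byte-level BPE marks a leading space with the character `Ġ`, so `convert_ids_to_tokens` returns strings such as `Ġis` rather than `" is"`. The sketch below shows one way to make such a list easier to read. The `raw_tokens` list is written out by hand (an assumption, so that the example runs without downloading the model), and `readable` is a hypothetical helper, not part of `transformers`.

```python
# Token strings as convert_ids_to_tokens is expected to return them for the
# start of the prompt (assumption: typed by hand rather than computed).
raw_tokens = ["U", "d", "acity", "Ġis", "Ġthe", "Ġbest"]


def readable(tokens, space_marker="Ġ"):
    """Replace the leading-space marker with an ordinary space."""
    return [token.replace(space_marker, " ") for token in tokens]


print(readable(raw_tokens))           # ['U', 'd', 'acity', ' is', ' the', ' best']
print("".join(readable(raw_tokens)))  # Udacity is the best
```

With a loaded tokenizer, `tokenizer.decode` performs this conversion for you, which is why the exercise uses it when displaying tokens.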

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# To load a pretrained model and a tokenizer using HuggingFace, we only need two lines of code!
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# We create a partial sentence and tokenize it.
text = "Udacity is the best place to learn about generative"
inputs = tokenizer(text, return_tensors="pt")

# Show the tokens as numbers, i.e. "input_ids"
inputs["input_ids"]


tensor([[  52,   67, 4355,  318,  262, 1266, 1295,  284, 2193,  546, 1152,  876]])

## Step 2. Examine the tokenization

Let's explore what these tokens mean!

In [9]:
# Show how the sentence is tokenized
import pandas as pd


def show_tokenization(inputs):
    return pd.DataFrame(
        [(id, tokenizer.decode(id)) for id in inputs["input_ids"][0]],
        columns=["id", "token"],
    )


show_tokenization(inputs)

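| | id | token |
|---|---|---|
| 0 | tensor(52) | U |
| 1 | tensor(67) | d |
| 2 | tensor(4355) | acity |
| 3 | tensor(318) | is |
| 4 | tensor(262) | the |
| 5 | tensor(1266) | best |
| 6 | tensor(1295) | place |
| 7 | tensor(284) | to |
| 8 | tensor(2193) | learn |
| 9 | tensor(546) | about |
| 10 | tensor(1152) | gener |
| 11 | tensor(876) | ative |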


In [10]:
inputs

{'input_ids': tensor([[  52,   67, 4355,  318,  262, 1266, 1295,  284, 2193,  546, 1152,  876]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [11]:
[(id, tokenizer.decode(id)) for id in inputs["input_ids"][0]]

[(tensor(52), 'U'),
 (tensor(67), 'd'),
 (tensor(4355), 'acity'),
 (tensor(318), ' is'),
 (tensor(262), ' the'),
 (tensor(1266), ' best'),
 (tensor(1295), ' place'),
 (tensor(284), ' to'),
 (tensor(2193), ' learn'),
 (tensor(546), ' about'),
 (tensor(1152), ' gener'),
 (tensor(876), 'ative')]

### Subword tokenization

The interesting thing is that tokens in this case are neither just letters nor just words. Sometimes shorter words are represented by a single token, but other times a single token represents a part of a word, or even a single letter. This is called subword tokenization.
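
To see which words were split and which were kept whole, the tokens can be grouped back into words. The sketch below assumes the GPT-2 convention seen in the output above, where a token that starts a new word carries a leading space. The token list is copied from that output, and `group_by_word` is a hypothetical helper.

```python
# Decoded tokens copied from the notebook output above.
tokens = [
    "U", "d", "acity", " is", " the", " best", " place",
    " to", " learn", " about", " gener", "ative",
]


def group_by_word(tokens):
    """Start a new group whenever a token begins with a space."""
    words = []
    for token in tokens:
        if token.startswith(" ") or not words:
            words.append([token])
        else:
            words[-1].append(token)
    return words


for pieces in group_by_word(tokens):
    print(f"{''.join(pieces).strip():>12} -> {len(pieces)} token(s)")
```

Here "Udacity" takes three tokens and "generative" takes two, while the common words in between take one token each.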

## Step 2. Calculate the probability of the next token

Now let's use PyTorch to calculate the probability of the next token given the previous ones.

In [24]:
# Calculate the probabilities for the next token for all possible choices. We show the
# top 5 choices and the corresponding words or subwords for these tokens.

import torch

with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)


def show_next_token_choices(probabilities, top_n=5):
    return pd.DataFrame(
        [
            (id, tokenizer.decode(id), p.item())
            for id, p in enumerate(probabilities)
            if p.item()
        ],
        columns=["id", "token", "p"],
    ).sort_values("p", ascending=False)[:top_n]


show_next_token_choices(probabilities)

| | id | token | p |
|---|---|---|---|
| 13 | 13 | . | 0.352222 |
| 11 | 11 | , | 0.135989 |
| 290 | 290 | and | 0.109372 |
| 287 | 287 | in | 0.06953 |
| 8950 | 8950 | languages | 0.058291 |
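
The softmax step in the cell above turns raw scores (logits) into probabilities that sum to 1. As a rough illustration of what `torch.nn.functional.softmax` computes, here is a plain-Python sketch. The three candidate tokens and their logits are made up for this example and do not come from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three hypothetical candidate tokens.
toy_logits = {" cat": 2.0, " dog": 1.0, " car": 0.1}
probs = dict(zip(toy_logits, softmax(list(toy_logits.values()))))

for token, p in probs.items():
    print(f"{token!r}: {p:.2f}")  # roughly 0.66, 0.24, 0.10
```

A higher logit always means a higher probability, so the ranking of tokens is the same before and after softmax. Only the scale changes.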


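In [28]:
import torch

with torch.no_grad():
    # Keep the full model output so that individual logits can be inspected below
    outputs = model(**inputs)
    # Use only the logits at the last position: they score the next token
    probabilities = torch.nn.functional.softmax(outputs.logits[0, -1, :], dim=-1)


In [30]:
# Raw (pre-softmax) score of token id 90 at the last position
outputs.logits[:, -1, 90]

tensor([-111.5971])

Interesting! For the original prompt, the model's most likely next token is " programming" (token id 8300), followed closely by " learning". The table above appears to have been captured on a later re-run, after `text` had already been extended, which is why it lists `.` first.

In [31]:
# Obtain the token id for the most probable next token
next_token_id = torch.argmax(probabilities).item()

print(f"Next token id: {next_token_id}")
print(f"Next token: {tokenizer.decode(next_token_id)}")

Next token id: 8300
Next token:  programming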


In [32]:
# We append the most likely token to the text (token id 8300 decodes to " programming").
text = text + tokenizer.decode(8300)
text

'Udacity is the best place to learn about generative programming. programming'

## Step 3. Generate some more tokens

The following cell will take `text`, show the most probable tokens to follow, and append the most likely token to text. Run the cell over and over to see it in action!

In [33]:
# Press ctrl + enter to run this cell again and again to see how the text is generated.

from IPython.display import Markdown, display

# Show the text
print(text)

# Convert to tokens
inputs = tokenizer(text, return_tensors="pt")

# Calculate the probabilities for the next token and show the top 5 choices
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)

display(Markdown("**Next token probabilities:**"))
display(show_next_token_choices(probabilities))

# Choose the most likely token id and add it to the text
next_token_id = torch.argmax(probabilities).item()
text = text + tokenizer.decode(next_token_id)

Udacity is the best place to learn about generative programming. programming


**Next token probabilities:**

| | id | token | p |
|---|---|---|---|
| 318 | 318 | is | 0.125294 |
| 8950 | 8950 | languages | 0.078196 |
| 13 | 13 | . | 0.062432 |
| 287 | 287 | in | 0.051712 |
| 351 | 351 | with | 0.032723 |
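
Running the cell repeatedly is a manual version of a greedy generation loop: pick the most likely token, append it, and repeat. The sketch below shows that loop without loading GPT-2. It is a toy under stated assumptions: `toy_model` is a made-up lookup table that only looks at the previous token, whereas the real model conditions on all previous tokens, and `greedy_generate` is a hypothetical helper.

```python
# Made-up next-token probabilities, keyed by the previous token only
# (assumption: a stand-in for the real model).
toy_model = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "token": 0.3},
    "a": {"token": 1.0},
    "model": {"predicts": 0.9, "the": 0.1},
    "predicts": {"the": 0.8, "<eos>": 0.2},
    "token": {"<eos>": 1.0},
}


def greedy_generate(model, start, max_new_tokens=6):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        choices = model.get(tokens[-1])
        if not choices:
            break
        next_token = max(choices, key=choices.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(greedy_generate(toy_model, "<s>"))
# ['<s>', 'the', 'model', 'predicts', 'the', 'model', 'predicts']
```

Even this tiny example falls into a loop ("the model predicts the model predicts"), because always taking the top choice never gives the less likely end-of-text token a chance.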


## Step 4. Use the `generate` method

In [35]:
from IPython.display import Markdown, display

# Start with some text and tokenize it
text = "Once upon a time, generative models"
inputs = tokenizer(text, return_tensors="pt")

# Use the `generate` method to generate lots of text
output = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)

# Show the generated text
display(Markdown(tokenizer.decode(output[0])))

Once upon a time, generative models of the human brain were used to study the neural correlates of cognitive function. In the present study, we used a novel model of the human brain to investigate the neural correlates of cognitive function. We used a novel model of the human brain to investigate the neural correlates of cognitive function. We used a novel model of the human brain to investigate the neural correlates of cognitive function. We used a novel model of the human brain to investigate the neural correlates of cognitive function.

### That's interesting...

You'll notice that GPT-2 is not nearly as sophisticated as later models like GPT-4, which you may have experience using. It often repeats itself and doesn't always make much sense. But it's still pretty impressive that it can generate text that looks like English.
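
Part of the repetition comes from the decoding strategy rather than the model alone: called as above, `generate` picks the single most likely token at every step. A common alternative is to sample from the probabilities instead. The sketch below illustrates temperature and top-k sampling in plain Python. The logits are made up and `sample_next` is a hypothetical helper, so treat it as an illustration of the idea rather than what `generate` does internally.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token from logits, optionally restricted to the top_k best."""
    ranked = sorted(logits.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    # Lower temperature sharpens the distribution, higher flattens it
    scaled = [score / temperature for _, score in ranked]
    largest = max(scaled)
    weights = [math.exp(s - largest) for s in scaled]
    candidates = [token for token, _ in ranked]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Made-up logits for three hypothetical candidate tokens.
toy_logits = {" brain": 2.0, " world": 1.5, " garden": 0.2}
rng = random.Random(0)  # fixed seed so the run is repeatable

print([sample_next(toy_logits, top_k=1, rng=rng) for _ in range(5)])
print([sample_next(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(5)])
```

With `top_k=1` this reduces to greedy decoding. In `transformers`, passing `do_sample=True` to `generate`, optionally with `top_k` or `temperature`, switches from greedy decoding to sampling, which often reduces repetition at the cost of less predictable output.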

## Congrats on completing the exercise! 🎉

Give yourself a hand. And please take a break if you need to. We'll be here when you're refreshed and ready to learn more!