## Exercise: generating one token at a time

In this exercise, we will see how an LLM generates text: one token at a time, using the previous tokens to predict the next one.
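To make the idea concrete before loading a real model, here is a minimal sketch of that loop in plain Python. The scores in `TOY_BIGRAM_SCORES` are made up for illustration, and the helper names `predict_next` and `generate` are hypothetical. Unlike an LLM, which conditions on all previous tokens, this toy only looks at the last one.

```python
# Hypothetical score table: for each "last token", made-up scores for candidate
# next tokens. These numbers do not come from any real model.
TOY_BIGRAM_SCORES = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the highest-scoring next token given the tokens so far (greedy choice)."""
    scores = TOY_BIGRAM_SCORES[tokens[-1]]
    return max(scores, key=scores.get)


def generate(max_new_tokens=10):
    """Build a sequence one token at a time, feeding each prediction back in."""
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the artificial start marker


generated = generate()
print(generated)
```

Each pass through the loop appends one token and then uses the longer sequence as the input for the next prediction, which is the pattern the rest of this exercise follows with a real model.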

## Step 1: Load a tokenizer and a model

First we load a tokenizer and a model from HuggingFace's transformers library. A tokenizer splits a string into tokens and maps each token to a number (a token ID) that the model can understand.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# To load a pre-trained model and a tokenizer using HuggingFace, we only need two lines of code!
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# We create a partial sentence and tokenize it.
text = "Udacity is the best place to learn about generative"
inputs = tokenizer(text, return_tensors="pt")

# Show the tokens as numbers, i.e., "input_ids"
inputs["input_ids"]

tensor([[  52,   67, 4355,  318,  262, 1266, 1295,  284, 2193,  546, 1152,  876]])

In [14]:
# Let's experiment
# inputs["input_ids"]: Retrieves the entire tensor of token IDs from the dictionary
# [0]: Accesses the first row of that tensor, which contains the token IDs for the input text

decode_text = tokenizer.decode(inputs["input_ids"][0])
print(decode_text)

Udacity is the best place to learn about generative


## Step 2: Examine the tokenization

Let's explore what these tokens mean!

In [12]:
# Show how the sentence is tokenized
import pandas as pd

def show_tokenization(inputs):
    return pd.DataFrame(
        [(token_id, tokenizer.decode(token_id)) for token_id in inputs["input_ids"][0]],
        columns=["id", "token"],
    )
    
show_tokenization(inputs)

|   | id | token |
|---|---|---|
| 0 | tensor(52) | U |
| 1 | tensor(67) | d |
| 2 | tensor(4355) | acity |
| 3 | tensor(318) | is |
| 4 | tensor(262) | the |
| 5 | tensor(1266) | best |
| 6 | tensor(1295) | place |
| 7 | tensor(284) | to |
| 8 | tensor(2193) | learn |
| 9 | tensor(546) | about |


### Subword tokenization

The interesting thing is that tokens in this case are neither just letters nor just words. Sometimes a short word is represented by a single token, but other times a single token represents part of a word, or even a single letter. This is called subword tokenization.
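As a rough illustration of how one word can end up as several tokens, here is a minimal sketch of a greedy longest-match splitter. This is not the algorithm GPT-2 uses (GPT-2 uses byte-pair encoding), and the `TOY_PIECES` vocabulary and the `toy_subword_tokenize` helper are hypothetical, chosen only so the result resembles the table above.

```python
# Hypothetical vocabulary of known multi-character pieces (not GPT-2's real vocabulary).
TOY_PIECES = {"acity", "learn", "ing", "the"}


def toy_subword_tokenize(word, pieces=TOY_PIECES):
    """Split `word` greedily: at each position take the longest known piece,
    falling back to a single character when no known piece matches."""
    tokens = []
    position = 0
    while position < len(word):
        # Try the longest candidate first, shrinking one character at a time.
        for end in range(len(word), position, -1):
            candidate = word[position:end]
            is_single_char = end == position + 1
            if candidate in pieces or is_single_char:
                tokens.append(candidate)
                position = end
                break
    return tokens


print(toy_subword_tokenize("Udacity"))
print(toy_subword_tokenize("learning"))
print(toy_subword_tokenize("the"))
```

With this toy vocabulary, a common word such as "the" stays whole, while a rarer word is broken into smaller pieces and single letters.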

## Step 3: Calculate the probability of the next token

Now let's use PyTorch to calculate the probability of the next token given the previous ones.
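The arithmetic behind this step can be sketched without any dependencies. The sketch assumes the model emits one raw score (a logit) per candidate token, as causal language models typically do, and that a softmax turns those scores into probabilities. The candidate tokens and their scores in `toy_logits` are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow and does not change the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens (made-up values).
toy_logits = {" AI": 2.0, " models": 1.0, " art": 0.0}

probabilities = dict(zip(toy_logits, softmax(list(toy_logits.values()))))

# Show candidates from most to least likely.
for token, p in sorted(probabilities.items(), key=lambda item: item[1], reverse=True):
    print(f"{token!r}: {p:.3f}")
```

The token with the highest logit also gets the highest probability. A real model does the same thing over its whole vocabulary instead of three candidates.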