
## 🧩 Introduction: Converting Text to Tokens (and Back) for GPT Training

Before training or experimenting with a GPT-style language model, it's essential to understand how text is represented internally.  
Transformer models like **GPT-2** do not operate directly on raw text; they work with **token IDs**, numerical representations produced by a tokenizer.

This first section introduces a minimal and practical workflow for:

1. **Encoding text into token IDs**  
2. **Passing token IDs through a GPT model**  
3. **Decoding generated token IDs back into readable text**

We use the `tiktoken` library (the same tokenizer family used by OpenAI models) together with a lightweight GPT-2 implementation. This lets us test the full round-trip:  
**text → tokens → model → tokens → text**.

### Helper Functions

The notebook defines two small utility functions:

- **`text_to_token_ids(text, tokenizer)`**  
  Converts text into a PyTorch tensor of token IDs and adds a batch dimension.

- **`token_ids_to_text(token_ids, tokenizer)`**  
  Converts model-generated token IDs back into a human-readable string.
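
To make "adds a batch dimension" concrete, here is a minimal plain-Python sketch that uses nested lists in place of PyTorch tensors. The token IDs are the ones the `gpt2` tokenizer produces for "every effort moves" later in this notebook; the helper names `add_batch_dim` and `remove_batch_dim` are hypothetical and only mimic what `unsqueeze(0)` and `squeeze(0)` do on a tensor.

```python
# Plain-Python stand-ins for tensor.unsqueeze(0) / tensor.squeeze(0).
# The helper names are hypothetical, chosen for this illustration only.

def add_batch_dim(token_ids):
    """Wrap a flat list of token IDs into a batch containing one sequence."""
    return [token_ids]


def remove_batch_dim(batch):
    """Unwrap a batch of size 1 back into a flat list of token IDs."""
    assert len(batch) == 1, "expected exactly one sequence in the batch"
    return batch[0]


encoded = [16833, 3626, 6100]    # "every effort moves" (gpt2 vocabulary)
batch = add_batch_dim(encoded)   # shape (1, 3): one sequence, three tokens
print(batch)
print(remove_batch_dim(batch))
```

The model expects input of shape `(batch_size, num_tokens)`, which is why even a single prompt is wrapped this way before it is passed in.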

### Model and Tokenizer Setup

We instantiate:

- A GPT-2 tokenizer (`gpt2` vocabulary)
- A compact GPT-2-style model using the configuration `GPT_CONFIG_124M`
- A simple greedy generation function `generate_text_simple`

### Running a Minimal Generation

Finally, we feed an input prompt through the encoder → model → decoder pipeline and print the generated continuation.  
This establishes a clear, minimal foundation for understanding **how data flows through a GPT model**, which is crucial before implementing training loops, loss calculation, and optimization.
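
The greedy generation loop can be sketched as follows. This is an illustration under stated assumptions, not the actual `generate_text_simple` from the `gpt2` module (whose implementation is not shown in this notebook): `toy_next_token_scores` is a hypothetical stand-in for the model, and the loop simply appends the highest-scoring token at each step while cropping the context to `context_size`.

```python
def toy_next_token_scores(token_ids, vocab_size=5):
    """Hypothetical stand-in for a model: always favors (last token + 1) % vocab_size."""
    favored = (token_ids[-1] + 1) % vocab_size
    return [1.0 if t == favored else 0.0 for t in range(vocab_size)]


def greedy_generate(token_ids, max_new_tokens, context_size):
    """Append the highest-scoring next token, one step at a time."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_size:]            # keep only the supported context
        scores = toy_next_token_scores(context)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(next_id)
    return ids


generated = greedy_generate([0, 1], max_new_tokens=4, context_size=3)
print(generated)
```

Greedy decoding is deterministic: the same prompt and the same model weights always yield the same continuation.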

---


### 📚 Inspiration & Citation

This exercise is inspired by the following work. If you use this notebook or its accompanying code, please cite it accordingly:

```yaml
cff-version: 1.2.0
message: "If you use this book or its accompanying code, please cite it as follows."
title: "Build A Large Language Model (From Scratch), Published by Manning, ISBN 978-1633437166"
abstract: "This book provides a comprehensive, step-by-step guide to implementing a ChatGPT-like large language model from scratch in PyTorch."
date-released: 2024-09-12
authors:
  - family-names: "Raschka"
    given-names: "Sebastian"
license: "Apache-2.0"
url: "https://www.manning.com/books/build-a-large-language-model-from-scratch"
repository-code: "https://github.com/rasbt/LLMs-from-scratch"
keywords:
  - large language models
  - natural language processing
  - artificial intelligence
  - PyTorch
  - machine learning
  - deep learning
```

In [13]:
import torch
import tiktoken
from gpt2 import GPTModel, generate_text_simple, GPT_CONFIG_124M


def text_to_token_ids(text, tokenizer):
    # Encode the text and add a batch dimension: shape (1, num_tokens)
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)


def token_ids_to_text(token_ids, tokenizer):
    # Remove the batch dimension and decode the IDs back into a string
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())


start_context = "Let me see if I can make it"
gpt_tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.eval()  # evaluation mode (e.g. disables dropout)

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, gpt_tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)

print("Output: ", token_ids_to_text(token_ids, gpt_tokenizer))


Output:  Let me see if I can make it Fuk brewer recycle exerted Anchorage baff Herogle respondents breakdown


***Incoherent text is generated*** because the model has not been trained yet. Let's now focus on a simple technique to train the model!

### Instruct the model to target the next token

GPT models are unsupervised (more precisely, self-supervised) learners: the label for each position is simply the next token in the text. We now use two short samples to show how (input, target) pairs are built and how processing a batch of such pairs generalizes to training a GPT.

In [38]:
# Define two input texts and encode them as token IDs
text1 = "every effort moves"
text2 = "I really like"

inputs = torch.cat(
    (text_to_token_ids(text1, gpt_tokenizer),
     text_to_token_ids(text2, gpt_tokenizer)),
    dim=0
)

print("Input batches: ", inputs)

# Define the target strings: the inputs shifted by one token.
# We will measure how much probability the model assigns to them.
target1 = " effort moves you"
target2 = " really like chocolate"

targets = torch.cat(
    (text_to_token_ids(target1, gpt_tokenizer),
     text_to_token_ids(target2, gpt_tokenizer)),
    dim=0
)

print("To target: ", targets)

Input batches:  tensor([[16833,  3626,  6100],
        [   40,  1107,   588]])
To target:  tensor([[ 3626,  6100,   345],
        [ 1107,   588, 11311]])


We encoded both the input and the target texts to underline how important it is that each target is the input shifted by one token (including the leading space that belongs to each token): this shift is the "natural" label the model is trained to predict.
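
The shift can also be seen without a tokenizer. A minimal sketch, assuming the full sentence "every effort moves you" is already encoded (the four IDs below are taken from the tensors printed above): the inputs are all tokens except the last, and the targets are all tokens except the first.

```python
# Token IDs for "every effort moves you", read off the printed tensors above
token_ids = [16833, 3626, 6100, 345]

inputs = token_ids[:-1]   # every token except the last
targets = token_ids[1:]   # the same sequence shifted left by one position

# At each position the model sees inputs[i] (plus everything before it)
# and should predict targets[i].
for seen, expected in zip(inputs, targets):
    print(f"after token {seen:>5} predict token {expected:>5}")
```

Every position in a sequence therefore provides its own training example, with no manual labeling required.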

Next, let's extract the probabilities the model assigns to each token:

In [48]:

with torch.no_grad():  # no gradient tracking needed for inference
    logits = model(inputs)

# Turn the logits into probabilities over the vocabulary
probas = torch.softmax(logits, dim=-1)
print("Probas shape: ", probas.shape)

# Most likely token ID at each position
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token Ids: ", token_ids)

print("Target batch1: ", token_ids_to_text(token_ids=targets[0], tokenizer=gpt_tokenizer))
print("Model output batch1: ", token_ids_to_text(token_ids=token_ids[0].flatten(), tokenizer=gpt_tokenizer))

Probas shape:  torch.Size([2, 3, 50257])
Token Ids:  tensor([[[19127],
         [ 1790],
         [18350]],

        [[45721],
         [32673],
         [ 2132]]])
Target batch1:   effort moves you
Model output batch1:   rack short Fa


Running the code above makes it clear that the model produces random text because ***it's not trained yet***. This is where the loss comes in: it not only measures the quality of the generated text but also provides the signal used for training.
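
The softmax and argmax steps used in that cell can be sketched in plain Python on a toy four-token vocabulary (the logit values are made up for illustration). Because softmax preserves the ordering of the logits, the argmax of the probabilities selects the same token as the argmax of the raw logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 0.5, -1.0, 0.1]   # toy scores for a 4-token vocabulary
probas = softmax(logits)

pred_from_probas = argmax(probas)
pred_from_logits = argmax(logits)
print(probas, pred_from_probas, pred_from_logits)
```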

### Softmax probability evaluation

It's now time to look at the initial probability scores that the model assigns to the target tokens, printed below. For each position in the two targets, this is the probability that training needs to increase relative to every other entry in the `probas` distribution.


In [55]:

# For each position (0, 1, 2), pick the probability of that position's target token
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1: ", target_probas_1)


text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2: ", target_probas_2)

Text 1:  tensor([1.3901e-05, 1.2262e-05, 1.8820e-05])
Text 2:  tensor([1.4962e-05, 7.2043e-06, 1.0832e-05])
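
The indexing expression `probas[text_idx, [0, 1, 2], targets[text_idx]]` picks, for each of the three positions, the probability assigned to that position's target token. A plain-Python sketch of the same selection, using a made-up distribution over a four-token vocabulary instead of the real 50,257 entries:

```python
# Toy probabilities for one sequence: 3 positions x 4 vocabulary entries.
# The values are invented for illustration; each row sums to 1.
probas = [
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
]
targets = [3, 0, 2]   # target token ID at each position

# Same selection as probas[[0, 1, 2], targets] on a tensor
target_probas = [probas[pos][tok] for pos, tok in enumerate(targets)]
print(target_probas)
```

Training aims to push each selected value toward 1, which necessarily pushes the other entries in the same row toward 0.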


To evaluate the probability ... to improve the model so that its negative log probability gets as close as possible to 0.

> The sequential computation of:
> - logits
> - probabilities
> - target probabilities
> - log probabilities
> - average log probability
> - negative average log probability

***is known as cross entropy loss calculation***
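
As a sanity check, the last three steps can be replayed in plain Python starting from the six target probabilities printed above. The values are copied from the notebook output, so the result is only accurate to the printed precision.

```python
import math

# Target probabilities copied from the "Text 1" and "Text 2" output above
target_probas = [
    1.3901e-05, 1.2262e-05, 1.8820e-05,   # text 1
    1.4962e-05, 7.2043e-06, 1.0832e-05,   # text 2
]

log_probas = [math.log(p) for p in target_probas]    # log probabilities
avg_log_proba = sum(log_probas) / len(log_probas)    # average log probability
loss = -avg_log_proba                                # negative average log probability

print(f"cross entropy loss: {loss:.4f}")
```

A perfect model would assign probability 1 to every target, giving a log probability of 0 at each position and therefore a loss of 0.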

Below we compute the cross entropy (a measure of the difference between the target and the predicted distribution) and the perplexity, which measures the model's uncertainty about the predicted token. In the results that follow, think of perplexity as the effective number of vocabulary tokens the model is choosing between at each prediction step.

In [65]:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print("Logarithmic probability: ", log_probas)

# Negative average log probability of the target tokens
neg_avg_log_probas = torch.mean(log_probas) * -1
print("Negative avg log prob: ", neg_avg_log_probas)

# Flatten logits from three to two dimensions: (batch, tokens, vocab) -> (batch * tokens, vocab)
logits_flat = logits.flatten(0, 1)
# Flatten targets from two dimensions to one: (batch, tokens) -> (batch * tokens,)
targets_flat = targets.flatten()

print("\n\nFlatten logits: ", logits_flat.shape)
print("Flatten targets:", targets_flat.shape)

# cross_entropy applies softmax and the negative log-likelihood in a single call
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print("Cross entropy loss calculation", loss)
print("\nPerplexity: ", torch.exp(loss))


Logarithmic probability:  tensor([-11.1836, -11.3090, -10.8806, -11.1100, -11.8408, -11.4330])
Negative avg log prob:  tensor(11.2928)


Flatten logits:  torch.Size([6, 50257])
Flatten targets: torch.Size([6])
Cross entropy loss calculation tensor(11.2928)

Perplexity:  tensor(80243.4922)
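
Perplexity is the exponential of the cross entropy loss. For context, a model that spread its probability uniformly over the 50,257 tokens of the `gpt2` vocabulary would have a loss of ln(50257) and a perplexity of exactly 50,257, so the untrained model here does somewhat worse than uniform guessing on these six targets. A minimal check in plain Python, using the rounded loss value printed above (so the result differs slightly from the tensor output):

```python
import math

loss = 11.2928        # cross entropy loss printed above (rounded to 4 decimals)
vocab_size = 50257    # last dimension of the probas tensor

perplexity = math.exp(loss)           # exp(cross entropy)
uniform_loss = math.log(vocab_size)   # loss of a uniform guess over the vocabulary

print(f"perplexity:   {perplexity:.1f}")
print(f"uniform loss: {uniform_loss:.4f}")
print("worse than uniform guessing:", perplexity > vocab_size)
```

As training lowers the loss, the perplexity shrinks with it; a loss of 0 corresponds to a perplexity of 1, meaning the model is certain of the correct token.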
