# Transformers

Source: Sansievero-2024

Many trace the most recent wave of advances in generative AI to the introduction of a class of models called transformers in 2017.
Their most well-known applications are large language models (LLMs). In this notebook, we’ll explore the core ideas behind transformers and how they work, with a focus on one of the most common applications: language modeling.

At its core, a language model (LM) is a probabilistic model that learns to predict the next word (or token) in a sequence based on the preceding or surrounding words. Doing so captures the language’s underlying structure and patterns, allowing the model to generate realistic and coherent text. For example, given the sentence “I began my day eating”, an LM might predict the next word as “breakfast” with a high probability.

Transformers are designed to handle long-range dependencies and complex relationships between words efficiently and expressively. For example, imagine that you want to use an LM to summarize a news article, which might contain hundreds or even thousands of words. Traditional LMs, such as RNNs, struggle with long contexts, so the summary might skip critical details from the beginning of the article. Transformer-based LMs, however, show strong results in this task. Besides high-quality generations, transformers have other properties, such as efficient parallelization of training, scalability, and knowledge transfer, making them popular and well suited for multiple tasks. At the heart of this innovation lies a mechanism called self-attention, which allows the model to weigh the importance of each word in the context of the entire sequence.

## A Language Model in Action
In this section, we will load and interact with an existing small (pretrained) transformer model to get a high-level understanding of how such models work. We’ll pick a small model you can run directly on your own hardware, but keep in mind that the same principles apply to the larger (over 100 times larger!) and more powerful models that have since been released.

### Tokenizing Text
Let’s begin our journey to generate text based on an initial input. For example, given the phrase "it was a dark and stormy", we want the model to generate words to continue it. Models can’t receive text directly as input; their input must be data represented as numbers. To feed text into a model, we must first find a way to turn sequences into numbers. This process is called tokenization, a crucial step in any NLP pipeline.
An easy option would be to split the text into individual characters and assign each a unique numerical ID. This scheme could be helpful for languages such as Chinese, where each character carries a lot of information. In languages like English, this creates a small token vocabulary, and there will be few unknown tokens (characters not found during training) when running inference. However, this method requires many tokens to represent a string, which is bad for performance and erases some of the structure and meaning of the text, a downside for accuracy. Each character carries little information, making it hard for the model to learn the underlying structure of the text.
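
As a rough illustration of the character-level scheme, the sketch below builds a vocabulary from the characters of the example phrase and encodes it. The names `char_vocab` and `char_ids` are made up for this example and are not part of any library.

```python
# Toy character-level tokenizer: every distinct character gets its own ID.
text = "it was a dark and stormy"

# Build the vocabulary from the characters seen in the text (sorted for determinism).
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
id_to_char = {idx: ch for ch, idx in char_vocab.items()}

# Encode: one ID per character, spaces included.
char_ids = [char_vocab[ch] for ch in text]

print(len(char_vocab))  # vocabulary size: 13 distinct characters
print(len(char_ids))    # sequence length: 24 tokens for six words

# Decoding simply maps each ID back to its character.
decoded = "".join(id_to_char[i] for i in char_ids)
print(decoded)
```

The vocabulary stays tiny, but six words already cost 24 tokens, which is the trade-off described above.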

Another approach could be to split the text into individual words. While this lets us capture more meaning per token, it has the downsides that we need to deal with more unknown words (e.g., typos or slang), we need to deal with different forms of the same word (e.g., “run”, “runs”, and “running”), and we might end up with a very large vocabulary, which could easily be over half a million words for languages such as English.
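
The unknown-word problem can be seen with a toy word-level tokenizer. This is a minimal sketch: the vocabulary, the `<unk>` entry, and the `encode_words` helper are hypothetical and only serve the illustration.

```python
# Toy word-level tokenizer with a fixed vocabulary and an <unk> fallback.
# The vocabulary below is hypothetical and deliberately tiny.
word_vocab = {"<unk>": 0, "it": 1, "was": 2, "a": 3, "dark": 4, "and": 5, "stormy": 6}


def encode_words(text):
    """Split on whitespace and map each word to its ID, or to <unk> if unseen."""
    unk_id = word_vocab["<unk>"]
    return [word_vocab.get(word, unk_id) for word in text.lower().split()]


known = encode_words("It was a dark and stormy")
noisy = encode_words("It was a darkk and stormier")

print(known)  # [1, 2, 3, 4, 5, 6]
print(noisy)  # [1, 2, 3, 0, 5, 0]  (the typo and the inflected form are lost)
```

Both "darkk" and "stormier" collapse into the same ID, so the model can no longer tell them apart.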
Modern tokenization strategies strike a balance between these two extremes, splitting the text into subwords that capture both the structure and meaning of the text while still handling unknown words and different forms of the same word (Figure 2-3). Character sequences that usually appear together (such as the most frequent words) can be assigned a single token that represents the whole word or group. Long or complicated words, or words with many inflections, may be split into multiple tokens, where each one usually represents a meaningful section of the word.
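
To make the subword idea concrete, here is a simplified sketch that splits a word by greedily matching the longest entry of a small vocabulary. The vocabulary and the `split_into_subwords` helper are hypothetical. Real tokenizers learn their vocabularies from large corpora and use their own algorithms, so this is not how any particular tokenizer works.

```python
# Hypothetical subword vocabulary: frequent words kept whole, plus reusable pieces.
subword_vocab = {"run", "storm", "dark", "ing", "s", "y", "er"}


def split_into_subwords(word):
    """Greedily take the longest vocabulary entry that prefixes the remaining text."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a match is found.
        for end in range(len(word), start, -1):
            if word[start:end] in subword_vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(split_into_subwords("stormy"))   # ['storm', 'y']
print(split_into_subwords("runs"))     # ['run', 's']
print(split_into_subwords("darkest"))  # ['dark', 'e', 's', 't']
```

Known stems stay intact, inflections become short extra tokens, and text the vocabulary does not cover degrades to characters instead of an unknown token.
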
There is no single best tokenizer; each LM comes with its own. Tokenizers differ in the number of tokens they support (the vocabulary size) and in their tokenization strategy. For example, the GPT-2 tokenizer averages 1.3 tokens per word.
Let’s find out how the SmolLM tokenizer handles a sentence. We’ll first use the transformers library to load the tokenizer corresponding to SmolLM-135M. Then we’ll run the input text (also called the prompt) through the tokenizer to encode the string into numbers representing the tokens. For demonstration purposes, we’ll use the decode() method to convert each ID back into its corresponding token:

In [24]:
from transformers import AutoTokenizer

prompt = "It was a dark and stormy"
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
input_ids = tokenizer(prompt).input_ids
input_ids

[1589, 436, 253, 3605, 284, 43471]

In [25]:
for t in input_ids:
    print(t, "\t:", tokenizer.decode(t))

1589 	: It
436 	:  was
253 	:  a
3605 	:  dark
284 	:  and
43471 	:  stormy


In [26]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")

In [27]:
# We tokenize again, this time asking the tokenizer to return a PyTorch
# tensor, which is what the model expects, rather than a list of integers
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model(input_ids)
outputs.logits.shape # An output for each input token

torch.Size([1, 6, 49152])

- The first dimension of the output is the number of batches (1 because we just ran a single sequence through the model).
- The second dimension is the sequence length, or the number of tokens in the input sequence (6 in this case).
- The third dimension is the vocabulary size. We get a list of ~50,000 numbers (49,152 here) for each token in the original sequence.

These are the raw model outputs, or logits, that correspond to the tokens in the vocabulary. For every input token, the model predicts how likely each token in the vocabulary is to continue the sequence up to that point. With our example sentence, the model will predict logits for “It”, “It was”, “It was a”, and so on. Higher logit values mean the model considers the corresponding token a more likely continuation of the sequence. Table 2-1 shows the input sequences, the most likely token ID, and its corresponding token.
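
Logits are unnormalized scores, not probabilities. A common way to turn them into probabilities is the softmax function, sketched below in plain Python. The four-value `toy_logits` list is invented for the example and does not come from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary (the model here has 49,152).
toy_logits = [2.0, 1.0, 0.1, -1.0]
toy_probs = softmax(toy_logits)

# The ordering is preserved: the highest logit gets the highest probability.
best = max(range(len(toy_probs)), key=toy_probs.__getitem__)
print(best, [round(p, 3) for p in toy_probs])
```

With the tensors in this notebook, `torch.softmax(outputs.logits[0, -1], dim=-1)` would play the same role for the last position.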

For each position we get a series of ~50,000 logits, one per token in the tokenizer's vocabulary. Here we retrieve the logits at the first position (index 0), which score the candidates for token 2 of the prompt:

In [28]:
p2_logits = model(input_ids).logits[0, 0]
p2_logits

tensor([14.2227,  2.9400,  2.8643,  ...,  9.5122,  9.8202,  8.5702],
       grad_fn=<SelectBackward0>)

In [29]:
p2_logits.argmax()

tensor(314)

In [30]:
tokenizer.decode(p2_logits.argmax())

' is'

In [32]:
# Last series of logits (i.e. after the last token in the prompt)
p7_logits = model(input_ids).logits[0, -1]
p7_logits.argmax()

tensor(3163)

In [33]:
# Token corresponding to id 3163
tokenizer.decode(p7_logits.argmax())

' night'

In [36]:
# Other potential candidates
import torch

top10_logits = torch.topk(p7_logits, 10)
for index in top10_logits.indices:
    print(tokenizer.decode(index))

 night
 day
 time
 evening
 winter
 sea
 morning
 month
 summer
 afternoon
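
Picking the top candidate and appending it to the input, then repeating, is the simplest way to turn these predictions into a continuation. The sketch below shows only that loop. It replaces the model with a hypothetical lookup table, `toy_next_token_scores`, whose scores are invented. A real model conditions on the whole sequence rather than on the last token alone.

```python
# Hypothetical stand-in for the model: maps the last token to scored candidates.
# The scores are made up; a real model would return logits over the full vocabulary.
toy_next_token_scores = {
    " stormy": {" night": 9.1, " day": 7.4},
    " night": {".": 8.0, " and": 6.5},
}


def greedy_continue(tokens, max_new_tokens=2):
    """Repeatedly append the highest-scoring candidate for the current last token."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    for _ in range(max_new_tokens):
        candidates = toy_next_token_scores.get(tokens[-1])
        if candidates is None:
            break  # the toy table has no entry for this token
        tokens.append(max(candidates, key=candidates.get))
    return tokens


prompt_tokens = ["It", " was", " a", " dark", " and", " stormy"]
generated = greedy_continue(prompt_tokens)
print("".join(generated))  # It was a dark and stormy night.
```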
