# How Transformer LLMs Work

Instructors: Jay Alammar, Maarten Grootendorst

Co-authors of ["Hands-On Large Language Models"](https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/)

Official code repo [here](https://github.com/HandsOnLLM/Hands-On-Large-Language-Models)

## Introduction

The transformer architecture was introduced in the 2017 paper ["Attention is all you need"](https://arxiv.org/abs/1706.03762) for machine translation tasks: for example, taking an English sentence as input and having the network output a German sentence. The same architecture turned out to be very good at taking a **prompt** as input and outputting a **response** to that prompt, such as a question and the answer to that question. This started the rise of **large language models**.

The original transformer architecture consists of two parts, an encoder and a decoder. Consider the translation example:

- the encoder preprocesses the input text to extract the context needed to perform the translation
- the decoder uses that context to generate the translated sentence

In language models, the encoder provides rich, context-sensitive representations of the input text; it is the basis for the BERT model and for most of the embedding models used in RAG applications. The decoder performs text generation tasks such as summarizing text, writing code and answering questions; it is the basis for most popular LLMs, such as those from OpenAI and Meta.

A generative model takes in a text prompt and generates text in response, one token at a time. Tokens are words or word fragments that can be fed into the LLM.

Here is how the generation process works:

- The model starts by mapping each input token into an embedding vector that captures the meaning of that token.

- After that, the model passes those token embeddings through a stack of transformer blocks, where each block is a neural network (NN) designed to learn flexibly from the data and scale well on GPUs.
    - Each block is made up of an **attention layer** and a feed-forward network.
    
- The model then uses the output vectors of the transformer blocks and passes them to the language modelling head, which generates the output token.

<p align="center" width="100%">
    <img src="images/intro.png" width="400px" />
</p>

The *magic* of LLMs comes not just from the architecture but also from the incredibly rich data that the models learnt from.

## Understanding Language Models: Language as a Bag-of-Words

Language is a tricky concept for computers since text is unstructured in nature and loses its meaning when represented by zeros and ones or individual characters.

Early language representation techniques, such as **Bag-of-Words** and word embeddings, predate transformers.

<img src="images/historic.png" width="400px" />

Bag-of-words represents text by splitting it into tokens and building a "vocabulary", the collection of distinct tokens. The input text is then represented as a vector with the same length as the vocabulary, in which each element is the frequency of the corresponding token in the input.

<img src="images/bow.png" width="400px" />

Tokens that do not appear in the input are represented too (as zeros), because a sentence is characterized not only by the words it contains but also by the words it does not.

Notice that the bag-of-words representation ignores the semantic nature and meaning of text.
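As a minimal sketch of the idea, the snippet below builds a vocabulary from two short sentences and turns one of them into a count vector. The sentences, the whitespace tokenization and the function names are assumptions made for illustration; real tokenizers are more elaborate.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the distinct tokens of all documents, in first-seen order."""
    vocabulary = []
    for document in documents:
        for token in document.lower().split():
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(text.lower().split())
    # Tokens absent from the text get a count of zero.
    return [counts[token] for token in vocabulary]

documents = ["that is a cute dog", "my cat is cute"]
vocabulary = build_vocabulary(documents)
vector = bag_of_words("my cat is cute", vocabulary)

print(vocabulary)  # ['that', 'is', 'a', 'cute', 'dog', 'my', 'cat']
print(vector)      # [0, 1, 0, 1, 0, 1, 1]
```

The vector has one entry per vocabulary token, including zeros for the tokens that the sentence does not contain.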

## Understanding Language Models: (Word) Embeddings

Released in 2013, **Word2Vec** was one of the first successful attempts at capturing the meaning of text in embeddings.

It learned the semantic representation of words by training on vast amounts of textual data, such as the entire Wikipedia.

Word2Vec is a framework for learning word embeddings (vectors of values) by estimating the likelihood that a given word is surrounded by other words.

Starting from a random vector for every token (the embedding), the NN tries to predict the neighboring tokens, and by doing so it "learns" an embedding that captures the meaning of the word. The number of properties or values an embedding has is called the number of dimensions, which can be quite large. In practice, you do not know what exactly each property represents. However, words with similar meanings have similar embeddings, dimension by dimension.

<img src="images/word_embeddings.png" width="400px" />
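One common way to check that two embeddings are "similar" is cosine similarity. The notes do not name a specific measure, so treat this choice as an assumption. The 4-dimensional vectors below are hand-picked for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
embeddings = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.1],
    "house": [0.1, 0.0, 0.9, 0.8],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_house = cosine_similarity(embeddings["cat"], embeddings["house"])

print(f"cat vs dog:   {sim_cat_dog:.3f}")
print(f"cat vs house: {sim_cat_house:.3f}")
```

With these made-up values, "cat" ends up much closer to "dog" than to "house", which is the behavior one hopes a trained model produces.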

Models like Word2Vec that convert textual input to embeddings are called "representation models", as they attempt to represent text as values.

Similar techniques can be applied to entire sentences to create sentence embeddings, and to longer texts such as documents to create document embeddings.

<img src="images/types_of_embeddings.png" width="400px" />

In the example above the word "vocalization" is split into two tokens. After passing them through the representation model, one can average the two token embeddings to get an embedding for the original word.
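The averaging step can be sketched in a few lines. The exact token split ("vocal" + "ization") and the 3-dimensional values are assumptions for illustration only.

```python
def average_embeddings(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

# Hypothetical token embeddings for the two pieces of "vocalization".
token_embeddings = {
    "vocal":   [0.2, 0.6, -0.4],
    "ization": [0.4, 0.2, 0.0],
}

word_embedding = average_embeddings(list(token_embeddings.values()))
print([round(value, 2) for value in word_embedding])  # [0.3, 0.4, -0.2]
```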

## Understanding Language Models: Encoding and Decoding Context with Attention

The flaw of Word2Vec is that it creates static embeddings: the same embedding is generated for the same word regardless of the context.

An improvement was to use **Recurrent Neural Networks (RNNs)** to model entire sequences of words. They are used for **encoding**, that is, representing an input sentence as a single embedding, and for **decoding**, that is, generating the output sentence.

Each word of the output is generated in an autoregressive way, since it uses all of the input and the previously generated output words.

The input words are first converted to embeddings, using for example Word2Vec, and then fed one after another into the encoder, which produces the context in the form of a single embedding. The decoder uses this context embedding to generate the output, one word at a time, using the previously generated words.

A single context embedding struggles to capture the entire context of a long sentence. In 2014 a solution called **"attention"** was introduced. It allows a model to focus on parts of the input that are relevant to one another. It selects which words are important in a given sentence to generate an output.

Instead of passing only a context embedding to the decoder, we pass the hidden states of all input words. The decoder then uses the attention mechanism to look at the entire sequence while highlighting the words that are more relevant.

<img src="images/attention_model.png" width="500px" />

The problem with the sequential nature of this architecture is that it precludes parallelization during the training of the model.

## Understanding Language Models: Transformers

The true power of attention was first explored in the 2017 paper ["Attention is all you need"](https://arxiv.org/abs/1706.03762). This paper introduced the **transformer** architecture, which is based solely on attention, without the RNN. This architecture allows the model to be trained in parallel, which speeds up computation significantly.

The transformer consists of stacked encoder and decoder blocks.

In the encoder the input is converted to embeddings (whose values are random at the start of training). Then **self-attention**, which is attention focused only on the input, processes these embeddings and updates them. The updated embeddings contain more contextualized information as a result of the attention mechanism. They are then passed to a feed-forward NN to finally create contextualized token embeddings.

<img src="images/encoder.png" width="500px" />

Notice that the encoder is meant for representing text and does a good job at generating embeddings.

After the encoder has processed the input, the decoder takes the previously generated words and passes them through **masked self-attention** (similar to the self-attention of the encoder). The resulting intermediate embeddings are passed, together with the embeddings from the encoder, to another attention network, which thus processes both what has been generated and what you already have. Its output is again passed to a feed-forward NN, which finally generates the next word in the sequence.

<img src="images/decoder.png" width="500px" />

Masked self-attention masks future positions so that any given token can only attend to tokens that came before it.

In 2018, a new architecture called **Bidirectional Encoder Representations from Transformers (BERT)** was introduced. It has more applications beyond translation than the original transformer.

BERT is an encoder-only architecture that focuses on representing language and generating contextualized word embeddings.
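A minimal sketch of such a mask for a sequence of four tokens follows. It assumes, as is usual in practice, that a token may also attend to itself; the function name is made up for this example.

```python
def causal_mask(num_tokens):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]

mask = causal_mask(4)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

Each row is one token; the zeros to the right of the diagonal are the future positions that stay hidden from it.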

The encoder blocks are the same as before (self-attention + NN) but the input contains an additional "classification" (CLS) token which is used as a representation for the entire input. This CLS token is often used for fine-tuning the model on specific tasks like classification.

<img src="images/bert.png" width="600px" />

To train a model like BERT you use a technique called masked language modelling, in which you randomly mask a number of words in the input sequence and have the model predict them. By doing so the model learns to represent language as it attempts to reconstruct these masked words.

<img src="images/bert_train.png" width="500px" />
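The data preparation for this training objective can be sketched as follows. This is a simplification under stated assumptions: every selected token is replaced by a `[MASK]` placeholder, the default masking probability of 0.15 is an illustrative choice, and `mask_tokens` is a made-up helper, not a library function.

```python
import random

def mask_tokens(tokens, mask_probability=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; return the masked input and the targets to predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            masked.append(mask_token)
            labels.append(token)   # the model must recover this word
        else:
            masked.append(token)
            labels.append(None)    # nothing to predict at this position
    return masked, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, mask_probability=0.3, seed=1)
print(masked)
print(labels)
```

During pre-training, the loss would only be computed at the positions where `labels` is not `None`.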

Training is typically a two-step approach:

- First you apply masked language modelling on a very large amount of data; this is called "pre-training".

- After that you can fine-tune the pre-trained model on different downstream tasks such as classification, named entity recognition and paraphrase identification.

On the other hand, **generative models** such as the first **Generative Pre-trained Transformer (GPT-1)** use only a decoder, with masked self-attention and a feed-forward NN, to generate the next word.

<img src="images/GPT1.png" width="400px" />

Models have a maximum "context length", which is the maximum number of tokens they can process at a given time. It counts both the input tokens and the previously generated tokens.

GPT-1 had 117 million parameters; GPT-3 already reached 175 billion parameters.

The year 2024 has been the year of generative AI (so far), with many proprietary and open-source models:

<img src="images/2024.png" width="400px" />

## Architectural Overview

One important intuition for understanding how transformers work is that they generate tokens one by one. Moreover, every generated token also depends on the previously generated ones.

The transformer is made of three major components:

- Tokenizer: the component that breaks the text down into chunks (tokens).
- Stack of Transformer Blocks: the NN where most of the computation happens.
    - The model has an embedding associated with every token, which replaces the token when the model starts processing its inputs.
- Language Modelling Head: another NN that outputs a probability distribution over all the tokens. From this distribution a token is picked according to one of many *decoding strategies*:
    - One can choose the token with the highest probability (greedy decoding, `temperature = 0`)
    - Alternatively, one can add randomness by sampling among the tokens with the highest probabilities (`temperature > 0`)
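The two decoding strategies in the list above can be sketched as follows. The scores (logits), the three-token vocabulary and the helper names are assumptions made for illustration; the sketch follows the common convention that the scores are divided by the temperature before the softmax.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def choose_token(logits, temperature=0.0, rng=None):
    """Greedy pick when temperature is 0, otherwise sample from the distribution."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda index: logits[index])
    rng = rng or random
    probabilities = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probabilities, k=1)[0]

vocabulary = ["Paris", "London", "Rome"]   # hypothetical tiny vocabulary
logits = [2.0, 1.0, 0.1]                   # hypothetical scores from the head

greedy = vocabulary[choose_token(logits, temperature=0.0)]
sampled = vocabulary[choose_token(logits, temperature=0.8, rng=random.Random(0))]
print(greedy)   # Paris
print(sampled)
```

Greedy decoding always returns the same token for the same scores, while sampling may return a less likely token from time to time.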

Another property that makes transformers better than RNNs is that they process all of their input tokens in parallel, which is time efficient.

In decoder-only LLMs, the generated token comes from the output at the final token position. Once the first token is generated, we feed the entire prompt plus the generated token into the transformer again. This time, however, the results already computed for the earlier tokens are cached to speed up the computation (KV caching).

<img src="images/archi.png" width="600px" />
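The generation loop with caching can be illustrated with a toy sketch. Nothing here is a real model: `compute_key_value` and `toy_next_token` are hypothetical stand-ins, and the only point is to count how many per-token computations are needed when earlier results are reused.

```python
def compute_key_value(token):
    """Hypothetical stand-in for the per-token key/value computation."""
    return (token, len(token))

def toy_next_token(tokens, kv_cache):
    """Hypothetical stand-in for the model: names the next token by position."""
    return f"tok{len(kv_cache)}"

def generate(prompt_tokens, num_new_tokens):
    tokens = list(prompt_tokens)
    kv_cache = []       # one cached entry per already processed token
    computations = 0    # number of per-token computations performed
    for _ in range(num_new_tokens):
        # Only tokens that are not yet in the cache need to be processed.
        for token in tokens[len(kv_cache):]:
            kv_cache.append(compute_key_value(token))
            computations += 1
        tokens.append(toy_next_token(tokens, kv_cache))
    return tokens, computations

tokens, computations = generate(["The", "cat", "sat"], num_new_tokens=3)
print(tokens)        # ['The', 'cat', 'sat', 'tok3', 'tok4', 'tok5']
print(computations)  # 5
```

Without the cache, the same three steps would process 3 + 4 + 5 = 12 tokens instead of 5, because the whole sequence would be recomputed at every step.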

## The Transformer Block

Each transformer block is made of two major components:

- Feed-Forward Neural Network (the second component in the block)
    - Intuitively, it acts as a store of information and statistics, and it is able to estimate the most probable token after the ones already generated.

<img src="images/transformer_block.jpg" width="600px" />

- Self-Attention Layer (the first component in the block)
    - Allows the model to attend to previous tokens and to incorporate the context in its understanding of the token it is currently looking at.
        - It first assigns a score to how relevant each of the input tokens is to the token currently being processed ("**relevance scoring**").
        - It then combines the relevant information into the representation.

<img src="images/transformer_block_2.jpg" width="600px" />

## Self-Attention

Self-attention happens in what is called an *attention head*. It is carried out using three matrices, called the Query, Key and Value projection matrices, whose parameters are learned during training; they are multiplied with the token vectors (both previous and current) to produce the queries, keys and values. Intuitively, each row of the resulting matrices is a vector representing one token of the sequence.

The end goal of **relevance scoring** is to assign a score to every previous token to determine which are the most relevant for the token currently being processed. This is done by combining the current token's row of the Query matrix with the whole Key matrix. The scores are then combined with the Value matrix to get weighted values for each of the previous tokens. We then sum the weighted values to get the output of the *attention head*.

<img src="images/self_attention.png" width="600px" />
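The two steps, relevance scoring and combining information, can be sketched for a single attention head as follows. The query, key and value vectors are assumed to be already projected and their numbers are invented; the division by the square root of the dimension and the softmax follow the scaled dot-product attention of the original paper.

```python
import math

def softmax(values):
    highest = max(values)
    exps = [math.exp(value - highest) for value in values]
    total = sum(exps)
    return [value / total for value in exps]

def attention_head(query, keys, values):
    """Attention output for the current token given all keys and values."""
    scale = math.sqrt(len(query))
    # Relevance scoring: dot product of the query with every key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Combining information: weighted sum of the value vectors.
    output = [sum(weight * value[d] for weight, value in zip(weights, values))
              for d in range(len(values[0]))]
    return weights, output

# Hypothetical vectors for three tokens; the last one is the current token.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

weights, output = attention_head(query, keys, values)
print([round(weight, 3) for weight in weights])
print([round(value, 3) for value in output])
```

The first and third tokens have keys that match the query, so they receive larger weights and contribute more to the output than the second token.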

For details on the calculation see the course [Attention in Transformers: Concepts and Code in PyTorch](https://learn.deeplearning.ai/courses/attention-in-transformers-concepts-and-code-in-pytorch/lesson/han2t/introduction?courseName=attention-in-transformers-concepts-and-code-in-pytorch).

In self-attention the calculation happens in parallel in multiple attention heads, where each attention head has its own set of keys, queries and values weight matrices. Then the information is combined across all heads.

<img src="images/self_attention_base.jpg" width="400px" />

A more efficient strategy is to share the same keys and values across all heads, so that only the queries are specific to each head and fewer parameters have to be estimated. This approach is called *Multi-Query attention*.

More recently, *grouped-query attention* has been introduced as an efficient attention mechanism that uses separate keys and values for different groups of attention heads, leading to better results than sharing a single set of key and value matrices across all heads.

<img src="images/grouped_attention.jpg" width="500px" />


So if a model uses *grouped-query attention*, it will state how many heads and how many groups it is using.
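The relationship between the three variants can be sketched by listing which key/value set each query head uses. The head and group counts are arbitrary examples, and the assumption that heads are split evenly into consecutive groups is made for illustration.

```python
def kv_assignment(num_heads, num_kv_groups):
    """For every query head, return the index of the key/value set it uses."""
    if num_heads % num_kv_groups != 0:
        raise ValueError("num_heads must be divisible by num_kv_groups")
    heads_per_group = num_heads // num_kv_groups
    return [head // heads_per_group for head in range(num_heads)]

multi_head = kv_assignment(num_heads=8, num_kv_groups=8)   # one K/V set per head
grouped = kv_assignment(num_heads=8, num_kv_groups=2)      # heads share within groups
multi_query = kv_assignment(num_heads=8, num_kv_groups=1)  # all heads share one set

print(multi_head)   # [0, 1, 2, 3, 4, 5, 6, 7]
print(grouped)      # [0, 0, 0, 0, 1, 1, 1, 1]
print(multi_query)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Seen this way, multi-head and multi-query attention are the two extremes of grouped-query attention.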

Another important recent idea for improving the efficiency of attention is *sparse attention*, in which a token only looks back at a subset of the previous ones. Local attention, for instance, improves efficiency by only paying attention to a small number of previous positions. See the paper [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509).
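The simplest case, local attention with a fixed window, can be sketched as a mask. This is only an illustration; the paper linked above uses other sparsity patterns as well, and the window size here is arbitrary.

```python
def local_attention_mask(num_tokens, window):
    """mask[i][j] is True when token i may attend to token j.

    Each token sees itself and at most window - 1 tokens right before it.
    """
    return [[i - window < j <= i for j in range(num_tokens)]
            for i in range(num_tokens)]

mask = local_attention_mask(num_tokens=5, window=2)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0 0
# 1 1 0 0 0
# 0 1 1 0 0
# 0 0 1 1 0
# 0 0 0 1 1
```

Every row has at most `window` ones, so the cost per token stays constant no matter how long the sequence grows.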

More recently, ideas like [*ring attention*](https://coconut-mode.com/posts/ring-attention/) have made it possible for these models to handle context sizes of 100,000 or even 1 million tokens.

## Recent Improvements

In the original 2017 paper the authors used **positional encoding**, a technique that adds positional information to the token embeddings, since tokens are processed in parallel but the order of the words in the sequence matters. It is applied right before the tokens enter the transformer blocks.

In more modern versions (2024) this form of positional encoding is no longer used. It has been replaced by **rotary embeddings**, and positional information is now added at the self-attention level. Moreover, layer normalization has been moved before the self-attention and feed-forward layers, because experimental results showed that models do better with this setup.
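A minimal sketch of rotary embeddings follows: each pair of dimensions of a query or key vector is rotated by an angle proportional to the token position. The pairing of adjacent dimensions and the base of 10000 follow the usual formulation but are assumptions as far as these notes are concerned, and the vectors are invented.

```python
import math

def rotate(vector, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        # Each pair of dimensions rotates at its own frequency.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 2.0, 3.0, 4.0]   # hypothetical query vector
key = [0.5, -1.0, 2.0, 0.0]    # hypothetical key vector

# Same distance between positions (2), different absolute positions.
score_near_start = dot(rotate(query, 5), rotate(key, 3))
score_further_on = dot(rotate(query, 12), rotate(key, 10))
print(round(score_near_start, 6), round(score_further_on, 6))
```

The two scores agree up to rounding: after rotation, the attention score depends only on how far apart the two tokens are, not on where they sit in the sequence.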

Both versions use **residual connections**, which go around each layer and add the layer's input back to its output, so that the information from the beginning of the layer is carried along.

<img src="images/modern_transformer.jpg" width="800px" />
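The modern ordering (normalize first, then apply the layer, then add the residual) can be sketched for a single token vector. The layer normalization below has no learnable scale or shift, and `attention_fn` and `feed_forward_fn` are hypothetical stand-ins for the real layers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    variance = sum((value - mean) ** 2 for value in x) / len(x)
    return [(value - mean) / math.sqrt(variance + eps) for value in x]

def pre_norm_block(x, attention_fn, feed_forward_fn):
    """Transformer block with normalization before each layer."""
    attended = attention_fn(layer_norm(x))
    x = [a + b for a, b in zip(x, attended)]        # residual connection
    transformed = feed_forward_fn(layer_norm(x))
    return [a + b for a, b in zip(x, transformed)]  # residual connection

def zero_layer(vector):
    """A stand-in layer that contributes nothing."""
    return [0.0] * len(vector)

def halve_layer(vector):
    """A stand-in layer that returns half of its input."""
    return [0.5 * value for value in vector]

x = [1.0, 2.0, 3.0]
passthrough = pre_norm_block(x, zero_layer, zero_layer)
updated = pre_norm_block(x, halve_layer, halve_layer)
print(passthrough)  # [1.0, 2.0, 3.0]
print([round(value, 3) for value in updated])
```

When both layers contribute nothing, the input passes through unchanged, which is exactly what the residual connections are there to guarantee.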

## Mixture of Experts (MoE)

One more recent development in LLMs is the Mixture of Experts (MoE). Instead of a single feed-forward NN, each layer has multiple sub-networks, each called an *expert*, plus a *router* that decides which expert should process each token. The *router* works as a classifier: it is a small feed-forward NN itself, which produces a probability score indicating how well suited each expert is for that particular input. It can choose the single best expert, or select several experts and aggregate their outputs using a weighted mean.

Experts might tend to focus on specific kinds of tokens, such as punctuation (, . : & ?, etc.), verbs, conjunctions and articles (the, and, if, not, etc.), or visual descriptions (dark, outer, yellow, etc.), and on how to process them.

<img src="images/moe.jpg" width="700px" />

The traditional feed-forward NN is called a *dense network*, since all of its parameters are activated and used (at least to some degree) to find complex relationships in the information processed by the attention mechanism. It is the largest component of the LLM.

With MoE, by contrast, there are many networks. When the input flows through the expert layer, one or more experts are selected to process it, and the rest of the experts remain inactive. This is called a *sparse model*, since only a subset of experts is activated at a given time.

<img src="images/moe_detail.jpg" width="700px" />
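The routing and the sparse activation described above can be sketched as follows. In a real model the router scores come from a small NN applied to the token; here they are hypothetical numbers. Renormalizing the weights over the selected experts is one common choice and is an assumption of this sketch, as are the toy experts that simply scale their input.

```python
import math

def softmax(values):
    highest = max(values)
    exps = [math.exp(value - highest) for value in values]
    total = sum(exps)
    return [value / total for value in exps]

def route(router_logits, top_k=2):
    """Pick the top_k experts and renormalize their probabilities."""
    probabilities = softmax(router_logits)
    ranked = sorted(range(len(probabilities)),
                    key=lambda index: probabilities[index], reverse=True)
    chosen = ranked[:top_k]
    total = sum(probabilities[index] for index in chosen)
    return {index: probabilities[index] / total for index in chosen}

def moe_layer(token, experts, router_logits, top_k=2):
    """Weighted mean of the outputs of the selected experts only."""
    output = [0.0] * len(token)
    for index, weight in route(router_logits, top_k).items():
        expert_output = experts[index](token)   # inactive experts never run
        output = [o + weight * e for o, e in zip(output, expert_output)]
    return output

def make_expert(scale):
    """Hypothetical expert: multiplies every value by a constant."""
    return lambda token: [scale * value for value in token]

experts = [make_expert(scale) for scale in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 2.0, -1.0]   # hypothetical router scores

weights = route(router_logits, top_k=2)
output = moe_layer([1.0, 2.0], experts, router_logits, top_k=2)
print(weights)  # {1: 0.5, 2: 0.5}
print(output)   # [2.5, 5.0]
```

Only experts 1 and 2 are evaluated for this token; the other two stay inactive, which is what makes the layer sparse.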

A major benefit of MoE concerns its computational requirements. On the one hand you have more parameters to load (one set of parameters for each expert); on the other hand, only the parameters of the active experts are used at inference. For example, in the model *Mixtral 8x7B* there are 8 experts with 5.6B parameters each, for a total of about 45B parameters among the experts. However, when the model runs, only 2 experts are used at a time, resulting in "only" about 11B expert parameters being used.

<img src="images/mixtral.jpg" width="700px" />
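The arithmetic behind these figures is easy to check. The snippet uses only the numbers quoted above (8 experts, 5.6B parameters each, 2 active) and counts expert parameters only, not the rest of the model.

```python
num_experts = 8
params_per_expert_b = 5.6   # billions of parameters, as quoted above
active_experts = 2

total_expert_params_b = num_experts * params_per_expert_b
active_expert_params_b = active_experts * params_per_expert_b
active_fraction = active_expert_params_b / total_expert_params_b

print(f"loaded expert parameters: {total_expert_params_b:.1f}B")   # 44.8B
print(f"active expert parameters: {active_expert_params_b:.1f}B")  # 11.2B
print(f"fraction active per token: {active_fraction:.0%}")         # 25%
```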

Although the total number of sparse parameters might seem large, during inference the model actually uses far less compute. This makes MoE models excellent for running in production. Moreover, performance tends to be higher than with traditional models; however, there is a risk of overfitting on a single expert, so these models should be trained carefully.