## Introduction

The purpose of these notebooks is for me to understand LLMs well enough to implement one from scratch and to fine-tune an existing model.

## What is an LLM

An LLM is a neural network with a specific architecture that allows it to "understand" and generate human language. More precisely, it is a model that learns the probability of a sequence of words:

$$P(w_1,w_2, \ldots, w_n)$$

or, more practically, the probability of the next word given the words that precede it:

$$P(w_t | w_1, w_2, \ldots, w_{t-1})$$


This lets it predict the next word in a sentence or fill in missing words. With this raw skill, LLMs can later be fine-tuned for more complex tasks such as classification and instruction following. Another characteristic of LLMs is that they are trained on large amounts of data and tend to have a large number of parameters.
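
As a toy illustration of the two probabilities above, the sketch below estimates next-word probabilities from bigram counts and multiplies them to score a whole sentence. A bigram model conditions only on the previous word, which is a strong simplification of what an LLM does. The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Estimate P(next word | previous word) from the counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def sentence_prob(sentence):
    """Multiply the conditional probability of each word in turn."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= next_word_probs(prev).get(nxt, 0.0)
    return prob


print(next_word_probs("the"))
print(sentence_prob("the cat sat on the rug"))
```

An LLM plays the role of `next_word_probs`, except that it conditions on the whole preceding context and its probabilities come from a trained neural network rather than from counts.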

What allows LLMs to predict the next word accurately is the type of architecture they use: the transformer architecture. To be precise, the part of this architecture that makes LLMs particularly useful is the self-attention mechanism. This mechanism allows LLMs to pay attention to specific parts of the input in order to produce an output.

## Building An LLM

The process of building an LLM involves **pretraining** and **fine-tuning**.

1. Pretraining: Pretraining gives the model the ability to predict the next word in a sequence by training the neural network on large amounts of data. The data used during pretraining is unlabeled: training is self-supervised, since the target at each position is simply the next word of the text itself.
2. Fine-tuning: Fine-tuning is further training after the pretraining phase. Its purpose is to bring out the latent abilities of the model. There are two standard types of fine-tuning: instruction fine-tuning, where a model is trained on instruction-answer pairs, and classification fine-tuning, where it is trained on texts paired with class labels. Note that the data used at this stage has to be labeled.
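
To make the self-supervised part concrete, here is a minimal sketch of how training examples for next-word prediction could be built from raw text: the targets are the inputs shifted by one position, so no manual labels are needed. The function name `make_training_pairs` and the whitespace "tokenizer" are assumptions for illustration.

```python
def make_training_pairs(tokens, context_size):
    """Slide a window over the tokens; the target is the input shifted by one."""
    pairs = []
    for i in range(len(tokens) - context_size):
        inputs = tokens[i : i + context_size]
        targets = tokens[i + 1 : i + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


# Whitespace splitting stands in for a real tokenizer (an assumption).
tokens = "the cat sat on the mat".split()
for inputs, targets in make_training_pairs(tokens, context_size=3):
    print(inputs, "->", targets)
```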

## Transformer Architecture

Let's understand the original transformer architecture, whose purpose was to perform machine translation.

The transformer architecture consists of two submodules: an **encoder** and a **decoder**.

- Encoder: The encoder takes the raw input text and *encodes* it into a set of numerical representations (vectors), known as the context, that capture the contextual information of the input.
- Decoder: The decoder takes the output of the encoder and transforms it into output text. The output is generated one token at a time, where each step depends on all previously generated tokens; this is known as autoregressive decoding.

In translation, the encoder would take the input text to be translated and transform it into vectors. The decoder would then receive these vectors and produce the final translation from them. In a sense, the encoder transforms the text to be translated into its essential semantic form, while the decoder takes this semantic form and expresses it in the desired language.
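
The autoregressive loop described above can be sketched as follows. This is only a toy: `toy_scores` is a hypothetical stand-in for a trained decoder and looks only at the last token, whereas a real decoder conditions on all previous tokens and on the encoder output. Picking the top-scoring token at each step (greedy decoding) is also just one possible choice.

```python
def greedy_decode(next_token_scores, start_tokens, end_token, max_new_tokens=10):
    """Generate tokens one at a time, feeding each choice back in as input."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # receives every token generated so far
        best = max(scores, key=scores.get)   # greedy: pick the top-scoring token
        tokens.append(best)
        if best == end_token:
            break
    return tokens


# Hypothetical stand-in for a trained decoder: a lookup table keyed on the last token.
TOY_TABLE = {
    "<s>": {"hello": 0.9, "world": 0.1},
    "hello": {"world": 0.8, "</s>": 0.2},
    "world": {"</s>": 1.0},
}


def toy_scores(tokens):
    return TOY_TABLE[tokens[-1]]


print(greedy_decode(toy_scores, ["<s>"], "</s>"))
# prints ['<s>', 'hello', 'world', '</s>']
```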

![image.png](attachment:image.png)

### Self Attention ###

A key part of the transformer architecture is the self-attention mechanism, which allows the model to weigh the importance of the words in an input sequence with respect to each other. This allows the model to capture long-range dependencies and contextual relationships within the input data.
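
A minimal sketch of the idea, using scaled dot-product attention in plain Python. It makes one loud simplification: queries, keys and values are the embeddings themselves, whereas a real transformer first multiplies the embeddings by learned weight matrices. The embeddings are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention without learned projections.

    Simplification: queries, keys and values are the embeddings themselves.
    """
    dim = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Score the query against every position in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        context = [
            sum(w * value[d] for w, value in zip(weights, embeddings))
            for d in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up 2-dimensional token embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token attends to every token in the sequence, and each output vector is the corresponding weighted mix of the inputs.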

After the introduction of the transformer architecture, variants such as BERT and the various GPT models appeared. While the purpose of GPT models is generation, BERT excels at masked word prediction. 

## BERT ##

BERT is a model whose purpose is to fill in the blank; it consists of an encoder plus other modules. BERT receives a list of tokens, some of which are masked (unknown), and passes this list to the encoder, which produces a list of contextual embeddings. The embeddings of the masked tokens are then passed to a classification head, which projects each embedding to the size of the vocabulary and produces logits, which can then be turned into probabilities using softmax.
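
The last step (embedding, then logits, then probabilities) can be sketched as below for a single masked position. Everything here is made up for illustration: the 4-word vocabulary, the 3-dimensional embedding, and the head weights. A linear layer without bias is assumed for the classification head.

```python
import math

# Hypothetical 4-word vocabulary and a 3-dimensional contextual embedding
# for one masked position (all values are made up).
vocab = ["cat", "dog", "mat", "sat"]
masked_embedding = [0.5, -1.0, 2.0]

# Made-up weights of the classification head: one row per vocabulary word.
head_weights = [
    [0.2, 0.1, 0.4],
    [0.3, 0.5, -0.2],
    [-0.1, 0.0, 1.0],
    [0.0, -0.3, 0.1],
]


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Project the embedding to vocabulary size: one logit per word.
logits = [sum(w * e for w, e in zip(row, masked_embedding)) for row in head_weights]
probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
print("prediction:", prediction)
```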

## GPT ##

GPT is a model whose purpose is to predict the next word given a sequence. Unlike BERT, GPT has no encoder, only a decoder. Note that even though GPT has no encoder, it has an embedding layer, whose purpose is to transform the input sequence into embeddings. Unlike the encoder, this layer has no self-attention mechanism. GPT receives a list of tokens, embeds those tokens, and then applies the decoder to the embeddings: it essentially performs masked (causal) self-attention over them, so that each position can attend only to itself and earlier positions, and generates text autoregressively.
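
A minimal sketch of the masking step, assuming the raw attention scores have already been computed. Position `i` is allowed to look only at positions `j <= i`; future positions get weight 0, which is equivalent to setting their scores to negative infinity before the softmax. The scores are made up (all equal, so the effect of the mask is easy to read).

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask, then softmax each row.

    scores[i][j] is the raw attention score of position i attending to j.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # drop future positions
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        # Future positions get weight 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Made-up raw scores for a 3-token sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print(row)
```

The first token can attend only to itself, the second splits its attention over the first two tokens, and so on. This is what lets the model be trained on next-word prediction without seeing the answer.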

![image.png](attachment:image.png)

## Building an LLM ##