<a href="https://colab.research.google.com/github/datacraft-paris/2311-Cerisara-LLM/blob/main/LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of contents
1. [Introduction](https://www.google.com/)
2. [A Brief Overview of LLMs](#paragraph1) (This notebook)
    1. [Background: _Decoder-only_ Language Model](#language_models_are_conditional_probabilities)
    2. [Transformers](#transformers)
    3. [Large Language Model](#llms)
    4. [Exercises](#exercices)


# Background: _Decoder-only_ Language Model (LM) <a name="language_models_are_conditional_probabilities"></a>



Traditionally, a language model (LM) is a function that assigns a probability to a given sequence of words (or of letters, tokens, etc.). For a sequence of words $\textbf{s}$, a trained language model $P_{LM}$ assigns the probability:

> $$P(\textbf{s}) = \prod_{i=1}^{|\textbf{s}|} P_{LM}({s_{i}}|s_{<i})$$
where $s_{<i}$ stands for all the previous words.

For example, for the sentence "He runs fast":

> $P$(He runs fast) = $P_{LM}$("He") $\times$ $P_{LM}$("runs" | "He") $\times$ $P_{LM}$("fast" | "He runs")


Indeed, we can see this as **the probability of predicting the next word** at each time step in the sentence.
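
The factorization above can be sketched in a few lines of Python. The conditional probabilities in `toy_lm` are made-up values chosen for illustration (a trained model would supply them), and `sequence_probability` is a hypothetical helper name.

```python
# Hypothetical conditional probabilities P_LM(word | previous words).
# The numbers are invented for illustration only.
toy_lm = {
    (): {"He": 0.2, "She": 0.2, "They": 0.6},
    ("He",): {"runs": 0.5, "walks": 0.5},
    ("He", "runs"): {"fast": 0.25, "slowly": 0.75},
}

def sequence_probability(words, lm):
    """Multiply the next-word probabilities over every position of the sequence."""
    prob = 1.0
    for i, word in enumerate(words):
        context = tuple(words[:i])   # s_{<i}: all the previous words
        prob *= lm[context][word]    # P_LM(s_i | s_{<i})
    return prob

p = sequence_probability(["He", "runs", "fast"], toy_lm)
print(p)  # product 0.2 * 0.5 * 0.25
```

Each factor is a next-word prediction, so a model that is good at predicting the next word also assigns high probability to whole sentences.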

In a neural language model, words are represented as vectors. We can illustrate the language modelling task as:

<p align="right">
  <img src="https://github.com/datacraft-paris/2311-Cerisara-LLM/blob/main/illustrations/lm_head.png?raw=true:, width=400" alt="transformer" width=800 class="right">
</p>

Each word vector in the sentence goes through a Linear layer that predicts a score for each possible following word. We then normalize these scores to obtain probabilities.

As you can see in the figure, the model doesn't always give the highest score to the right token. The learning procedure consists of ensuring that the model assigns a higher probability to the true next tokens.

As the figure also shows, predicting the right tokens requires meaningful word vectors. One model that can give us meaningful word vectors is the Transformer.



# Transformers <a name="transformers"></a>

The Transformer architecture was first developed for machine translation. It was designed to model the long-range dependencies between words in a sentence, which is essential for that task.

The transformer has two core components: the Attention module and the MLP module. We will not define the attention mechanism in detail here, nor the transformer architecture itself. What interests us is the parameters present in the Transformer.

<p align="center">
  <img src="https://github.com/datacraft-paris/2311-Cerisara-LLM/blob/main/illustrations/transformer_parameters.png?raw=true:, width=300" alt="transformer" width=600 class="center">
</p>


The transformer takes as input a sequence of vectors $E$, and also outputs a sequence of vectors $C$ of the same length. So what is the difference between the two sequences?

The difference is that:
1. $E$, the first sequence of vectors, consists of vectors that are independent of each other.
2. $C$, the vectors output by the Transformer, are contextualized, meaning each vector incorporates information about other tokens in the sequence (in a decoder-only model, the tokens that precede it).

So, the Transformer literally _transforms_ the input sequence, and this transformation is a contextualization. These contextual vectors are then passed to the LM Head.
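
To make the difference between $E$ and $C$ concrete, here is a toy sketch. It does not implement attention: as a stand-in, each output vector is simply the average of the input vectors up to its own position, assuming a causal (decoder-only) setting. The function name `contextualize` and all vector values are hypothetical.

```python
# Toy stand-in for contextualization: a causal average of the input vectors.
# This is NOT the real attention mechanism, only an illustration of "context".
def contextualize(E):
    dim = len(E[0])
    C = []
    for i in range(len(E)):
        visible = E[: i + 1]  # vectors at positions <= i
        C.append([sum(v[k] for v in visible) / len(visible) for k in range(dim)])
    return C

# Hypothetical 2-dimensional input vectors for "He", "runs", "fast".
E1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same sentence with the first word swapped for another one.
E2 = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

C1 = contextualize(E1)
C2 = contextualize(E2)

# The input vector for "runs" is identical in both sentences...
print(E1[1] == E2[1])
# ...but its contextual vector changes because the preceding word changed.
print(C1[1], C2[1])
```

The input vector of a word is the same whatever its neighbours are, whereas its output vector depends on the rest of the sequence.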




# LLM = LM trained with *HUGE* Transformers on *HUGE* text corpora. <a name="llms"></a>

The transformer takes as input a sequence of vectors $E$ and outputs a sequence of vectors $C$. As such, it is not a language model, but just a neural network that can be used on any task or domain (computer vision, speech processing, etc.). To perform language modeling, we add a matrix $W \in \mathbb{R}^{d \times |V|}$ on top of the transformer's output. This matrix contains one vector $W_v$ of dimension $d$ for each word (or token) $v$ seen in the training set. Given an output vector $\textbf{c}$ of the transformer, the probability of a given word $v$ in the vocabulary $V$ is computed as:

$$
P_{LM}(y = v|\textbf{c}) = \frac{\exp(W_{v} \cdot \textbf{c})}{\sum\limits_{v' \in V}\exp(W_{v'} \cdot \textbf{c})}
$$

This is the softmax normalization of the scores $W_{v} \cdot \textbf{c}$. For example, let's say the input sentence "He runs fast" is fed forward through the transformer, giving the output $C=[\textbf{c}_{\small He}, \textbf{c}_{\small runs}, \textbf{c}_{\small fast}]$. The probability that "fast" follows the word "runs" is computed as:

$$
P_{LM}(\text{fast}|\textbf{c}_{\small runs}; W) = \frac{\exp(W_{\text{fast}} \cdot \textbf{c}_{\small runs})}{\sum\limits_{v' \in V}\exp(W_{v'} \cdot \textbf{c}_{\small runs})}
$$
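
As a minimal sketch of the LM head computation, assuming the usual softmax normalization (exponentiate the scores, then divide by their sum), the snippet below uses a made-up 3-token vocabulary and 2-dimensional vectors. `lm_head_probabilities` is a hypothetical helper, and $W$ is stored as one row per token for readability.

```python
import math

def lm_head_probabilities(W, c):
    """Return P_LM(v | c) for every token v, given one row of W per token."""
    scores = [sum(w_k * c_k for w_k, c_k in zip(w_v, c)) for w_v in W]  # W_v . c
    max_score = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and weights (d = 2); all values are invented.
vocab = ["He", "runs", "fast"]
W = [[0.1, 0.3],    # W_He
     [0.2, -0.4],   # W_runs
     [1.5, 0.9]]    # W_fast
c_runs = [1.0, 0.5]  # made-up contextual vector for "runs"

probs = lm_head_probabilities(W, c_runs)
for token, prob in zip(vocab, probs):
    print(f"P({token} | c_runs) = {prob:.3f}")
```

The probabilities are positive and sum to 1, and the token whose vector has the largest dot product with $\textbf{c}_{\small runs}$ gets the highest probability.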

This is the language modelling objective applied to a transformer. Usually, the transformer language model is trained on a huge text corpus $D$ chunked into long sequences $\textbf{s}$ (for example, Llama 2 is trained on sequences of 4096 tokens).

Training the transformer language model is achieved by maximizing the likelihood of the training corpus $D$:

>$$
\underset{\theta}{\mathrm{argmax}} \prod_{s \in D} \prod_{i=1}^{|\textbf{s}|} P_{LM}({s_{i}}|s_{<i}; \theta)
$$
where $\theta$ denotes the parameters of the LLM: the matrix $W$ and all the trainable matrices in the transformer model.
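
In practice, the product above is usually handled through its logarithm: maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which is a sum over tokens. The sketch below illustrates this on invented numbers. The two "models" are just hypothetical lists of the probabilities assigned to the true next tokens.

```python
import math

def negative_log_likelihood(token_probs):
    """Sum of -log P_LM(s_i | s_{<i}) over every token of every sequence."""
    return -sum(math.log(p) for seq in token_probs for p in seq)

# Hypothetical probabilities assigned to the true next tokens of a tiny
# corpus D made of two sequences (values invented for illustration).
weak_model = [[0.2, 0.5, 0.25], [0.1, 0.3]]
better_model = [[0.4, 0.7, 0.6], [0.5, 0.6]]

nll_weak = negative_log_likelihood(weak_model)
nll_better = negative_log_likelihood(better_model)

print(nll_weak)
print(nll_better)  # lower, because the true tokens are more likely
```

Training adjusts $\theta$ so that this quantity decreases on the training corpus.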

When the training corpus is large enough (>1T tokens) and the transformer contains enough parameters (>7B parameters), the language model learns to perform well on a variety of language tasks (translation, classification, summarization, coding, etc.).

# Exercises <a name="exercices"></a>

```python
from transformers import AutoModelForCausalLM
```

## Dissecting an LLM

1. Download StableLM 3B and print the model.

2. Identify the parameters of the attention module. What are their dimensions?

3. Identify the parameters of the MLP module. What are their dimensions?

4. Identify the language modelling matrix (LM Head). What are its dimensions?

5. Get the tokenizer associated with the LLM. Tokenize the sentence "He runs" and forward it through the LLM to get the last hidden states. What are their dimensions?

## The number of parameters of the LLM

1. What exactly is the number of parameters of the model?

```python
# use the .num_parameters() method of the model
```

2. How many parameters does the attention module have? And the MLP module? What percentage of the total does each represent?
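
As a hint for the counting, here is a sketch of the arithmetic on hypothetical sizes. It assumes four square projection matrices in the attention module and two matrices in the MLP, without biases. These assumptions and the values of `d_model`, `d_ff` and `n_layers` are illustrative and are not those of StableLM 3B: read the real shapes from the printed model.

```python
def count(shapes):
    """Total number of scalars in a list of (rows, cols) weight shapes."""
    return sum(rows * cols for rows, cols in shapes)

# Hypothetical sizes, NOT those of StableLM 3B.
d_model, d_ff, n_layers = 512, 2048, 6

# Assumed layout: query, key, value and output projections, no biases.
attention_shapes = [(d_model, d_model)] * 4
# Assumed layout: one up projection and one down projection, no biases.
mlp_shapes = [(d_model, d_ff), (d_ff, d_model)]

attention_total = n_layers * count(attention_shapes)
mlp_total = n_layers * count(mlp_shapes)
block_total = attention_total + mlp_total

# Percentages are relative to attention + MLP only (embeddings, norms and
# the LM head are ignored in this sketch).
print(f"attention: {attention_total:,} ({100 * attention_total / block_total:.1f}%)")
print(f"MLP:       {mlp_total:,} ({100 * mlp_total / block_total:.1f}%)")
```

For the actual exercise, the same sums can be taken over the shapes found in the printed model, and the percentages computed against the total number of parameters.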