# Large language models (LLMs): Theoretical applications

LLMs are models that have a large number of parameters and can process language in various ways, such as text generation and translation from one language to another (e.g. French to English).

These models commonly use the Transformer architecture, which was introduced in 2017 in the paper "Attention Is All You Need". Since then, a multitude of LLM architectures have been designed.


![en_chapter1_transformers_chrono.svg](images/en_chapter1_transformers_chrono.svg)

Image credit: https://huggingface.co/learn/nlp-course/chapter1/4

Generally, there are 3 types of LLMs I will discuss here:

* Encoder-decoder Transformers
* Encoder-only Transformers
* Decoder-only Transformers

## Overview of Transformers

We will now introduce the "vanilla" Transformer architecture from "Attention Is All You Need". It consists of an encoder region and a decoder region, with connections between them.


![Transformer_Arch.png](images/Transformer_Arch.png)

The encoder/decoder regions are each made of stacked blocks.

![Transformer_Enc_Dec_Blocks.png](images/Transformer_Enc_Dec_Blocks.png)

Each encoder block consists of a self-attention layer connected to a feed-forward layer.   

The decoder block also starts with a self-attention layer which is then connected to an encoder-decoder attention layer and followed by a feed-forward layer.

I will go into detail on what "attention" means later.

![encode_decode.png](images/encode_decode.png)

### Adding text + tensors into the picture

Words are turned into vectors based on their location within a vocabulary.

For example with a vocabulary of nine words, each word in the vocabulary can be depicted as a one-hot encoding within this vocab.

![wordembedding.png](images/wordembedding.png)

Image credit: https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
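As a minimal sketch of the one-hot idea above: the nine words below are placeholders chosen for illustration (they are not taken from the figure), and `one_hot` is a hypothetical helper name.

```python
# Hypothetical nine-word vocabulary; the words are placeholders for illustration.
vocab = ["a", "cat", "dog", "machines", "on", "sat", "the", "thinking", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

thinking_vec = one_hot("thinking")
print(thinking_vec)
```

In practice the one-hot index is typically used to look up a dense, learned embedding vector rather than being fed to the model directly.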

Once the words are encoded as vectors, each vector flows through the encoder layers, as in the following example for the two words "Thinking" and "Machines".

Each word flows through the model along its own path, but these paths interact inside the attention layers.

### Positional encoding

Positional encoding accounts for the order of the words in the input sequence.

The Transformer adds a vector to each input embedding. In the original paper these vectors follow a fixed sinusoidal pattern (learned positional embeddings are a common alternative), which helps the model determine the position of each word.

![transformer_positional_encoding_vectors.png](images/transformer_positional_encoding_vectors.png)

Image credit: https://jalammar.github.io/illustrated-transformer/

![encoder_with_tensors_2.png](images/encoder_with_tensors_2.png)

Image credit: https://jalammar.github.io/illustrated-transformer/
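The sketch below shows one way to compute such position vectors, assuming the sinusoidal formula from "Attention Is All You Need" with sines on even dimensions and cosines on odd dimensions. The function names and the toy embedding are assumptions made for illustration.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (even dims: sin, odd dims: cos)."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embedding, position):
    """Add the positional vector to a word embedding, element by element."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# Toy 4-dimensional embedding (made-up values) placed at position 0.
print(add_position([1.0, 1.0, 1.0, 1.0], position=0))
```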

### Self-attention mechanisms

Now I will explain self-attention at a very high level.

Say the following sentence is an input sentence we want to translate:

**“The animal didn't cross the street because it was too tired”**

When the model processes the word “it”, self-attention associates “it” with “animal”.

As the model processes each word in the input sequence, self-attention looks at other positions in the input sequence for clues that lead to a better encoding of this word.

![transformer_self-attention_visualization.png](images/transformer_self-attention_visualization.png)

Image credit: https://jalammar.github.io/illustrated-transformer/

For self-attention there are 5 general steps:

1. Generate query, key and value vectors for each word:

* These vectors are created by multiplying each word's embedding by three weight matrices that are learned during training.
* These vectors are abstractions that are useful for calculating and thinking about attention.


2. Calculate a score for each word in the input sentence against every other word.

* Say we’re calculating the self-attention for the first word in this example, “Thinking”.
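* We need to score each word of the input sentence against this word. The score is the dot product of the query vector of “Thinking” with the key vector of the word being scored.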
* The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

3. Divide the scores by the square root of the dimension of the key vectors to stabilize the gradients. This is then passed through a softmax operation.

4. Multiply each value vector by the softmax score.
* The goal is to keep intact the values of the word(s) we want to focus on and to drown out irrelevant words.

5. Sum up the weighted value vectors.
* This produces the output of the self-attention layer for the word at this position.

![self-attention-output.png](images/self-attention-output.png)

Image credit: https://jalammar.github.io/illustrated-transformer/
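The five steps above can be sketched in plain Python as follows. This is a toy illustration, not an efficient implementation: the helper names, the two 2-dimensional embeddings, and the identity weight matrices are all made up for the example.

```python
import math

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, w_q, w_k, w_v):
    # Step 1: query, key and value vectors for each word.
    queries = [matvec(x, w_q) for x in embeddings]
    keys = [matvec(x, w_k) for x in embeddings]
    values = [matvec(x, w_v) for x in embeddings]
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Step 2: score this word against every word (dot product q . k).
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        # Step 3: scale by sqrt(d_k), then apply softmax.
        weights = softmax([s / math.sqrt(d_k) for s in scores])
        # Steps 4 and 5: weight each value vector and sum them up.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy inputs (made up): two 2-dimensional embeddings, identity weight matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(embeddings, identity, identity, identity)
print(attended)
```

With these toy inputs each output row equals the attention weights for that word, so the first word places most of its weight on itself.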

### Multi-head attention

In practice, multiple attention heads are used. This:
1. Expands the model’s ability to focus on different positions and prevents the attention from being dominated by the word itself.
2. Gives the attention layer multiple “representation subspaces”, because each head has its own set of Query/Key/Value weight matrices.

![transformer_multi-headed_self-attention-recap.png](images/transformer_multi-headed_self-attention-recap.png)

Image credit: https://jalammar.github.io/illustrated-transformer/
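A rough sketch of multi-head attention, under the assumption that each head runs scaled dot-product attention with its own Query/Key/Value matrices and that the head outputs are concatenated and mixed by an output matrix (called `w_o` here). All names and numbers are hypothetical.

```python
import math

def project(rows, matrix):
    """Multiply each row vector by a matrix given as a list of rows."""
    return [[sum(x * m_row[j] for x, m_row in zip(row, matrix))
             for j in range(len(matrix[0]))] for row in rows]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = project(x, w_q), project(x, w_k), project(x, w_v)
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(len(qi)) for kj in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples, one per attention head."""
    per_head = [attention_head(x, *weights) for weights in heads]
    # Concatenate the head outputs position by position, then mix them with w_o.
    concatenated = [sum((head[i] for head in per_head), []) for i in range(len(x))]
    return project(concatenated, w_o)

# Toy example (made-up numbers): 2 tokens, model size 2, two heads of size 1.
x = [[1.0, 0.0], [0.0, 1.0]]
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])
w_o = [[1.0, 0.0], [0.0, 1.0]]
result = multi_head_attention(x, [head_a, head_b], w_o)
print(result)
```

Here each head only looks at one coordinate of the input, which is a crude stand-in for the idea of separate representation subspaces.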

The attention patterns can become significantly more complex as the number of heads increases!
* Note: each color here represents the attention from different attention heads.

![transformer_self-attention_visualization_3.png](images/transformer_self-attention_visualization_3.png)

Image credit: https://jalammar.github.io/illustrated-transformer/

### Adding in decoders

Now that we know how the encoder layers work, the decoder layers are much more straightforward to understand:

The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence.

In the decoder, the self-attention layer only attends to earlier positions in the output sequence. Future positions are masked (by setting their scores to -inf) before the softmax step in the self-attention calculation.
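A small sketch of the masking step, assuming the attention scores for one position are already available as a list. The function name and the score values are made up for illustration.

```python
import math

def masked_softmax(scores, position):
    """Softmax over scores where entries after `position` are masked out."""
    masked = [s if j <= position else float("-inf") for j, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# The word at position 1 may only attend to positions 0 and 1.
weights = masked_softmax([2.0, 1.0, 3.0, 0.5], position=1)
print(weights)
```

Because the masked scores become exactly zero after the softmax, the future positions contribute nothing to the weighted sum of value vectors.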

The “Encoder-Decoder Attention” layer creates its Queries matrix from the layer below it, and takes the Keys and Values matrices from the output of the encoder stack.

![encode_decode.png](images/encode_decode.png)

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output.

The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did.

And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

![transformer_decoding_2.gif](images/transformer_decoding_2.gif)



Image credit: https://jalammar.github.io/illustrated-transformer/

### How do we turn the output of the decoder stack into a word?

This is done by the final Linear layer, followed by a Softmax layer.

The Linear layer projects the vector produced by the stack of decoders into a much larger vector called the logits vector.

If our model knows 10,000 unique English words learned from its training dataset, the logits vector is 10,000 cells wide, and each cell corresponds to the score of a unique word.

The softmax layer turns those scores into probabilities. The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

![transformer_decoder_output_softmax.png](images/transformer_decoder_output_softmax.png)

Image credit: https://jalammar.github.io/illustrated-transformer/
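As a toy sketch of this last step, the example below uses a hypothetical six-word vocabulary in place of the 10,000 words, with made-up weights and a made-up 2-dimensional decoder output.

```python
import math

# Hypothetical tiny vocabulary and made-up weights, for illustration only.
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def linear(vector, weights, bias):
    """Project a decoder output vector onto one logit per vocabulary word."""
    return [sum(x * w for x, w in zip(vector, row)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

decoder_output = [0.5, -1.0]
# One row of weights per vocabulary word.
weights = [[0.1, 0.2], [0.0, 0.3], [1.0, -1.0], [2.0, -2.0], [0.5, 0.5], [-1.0, 0.0]]
bias = [0.0] * len(vocab)

logits = linear(decoder_output, weights, bias)
probabilities = softmax(logits)
# Pick the cell with the highest probability.
predicted_word = vocab[probabilities.index(max(probabilities))]
print(predicted_word)
```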

### Training

To visualize training, let’s assume our output vocabulary only contains six words: “a”, “am”, “i”, “thanks”, “student”, and “\<eos\>” (short for ‘end of sentence’).

Each word in the vocabulary can be represented as a one-hot encoding.

![one-hot-vocabulary-example.png](images/one-hot-vocabulary-example.png)

Image credit: https://jalammar.github.io/illustrated-transformer/

Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

We want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

![transformer_logits_output_and_label.png](images/transformer_logits_output_and_label.png)

Image credit: https://jalammar.github.io/illustrated-transformer/

To compare these distributions, we can measure the difference between them with a loss such as cross-entropy or Kullback–Leibler divergence. Training then uses back-propagation to minimize this loss.
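To make the comparison concrete, here is a small sketch of cross-entropy between a one-hot target for “thanks” and two predicted distributions. The probability values are invented for the example and are not read from the figure.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Vocabulary order: a, am, i, thanks, student, <eos>
target = [0, 0, 0, 1, 0, 0]                      # one-hot encoding of "thanks"
untrained = [0.2, 0.2, 0.1, 0.2, 0.2, 0.1]       # made-up output of an untrained model
trained = [0.01, 0.01, 0.01, 0.95, 0.01, 0.01]   # made-up output after training

print(cross_entropy(target, untrained))
print(cross_entropy(target, trained))
```

The loss shrinks as the predicted probability of the correct word approaches 1, which is what training aims for.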

A more complex situation is translating the sentence: “je suis étudiant” into “i am a student” as can be seen in the example:

![output_target_probability_distributions.png](images/output_target_probability_distributions.png)

Image credit: https://jalammar.github.io/illustrated-transformer/

### Advantages and disadvantages

**Advantages:**

* Sequence-to-sequence tasks: Well-suited for tasks where the input and output sequences have different lengths, such as machine translation or summarization.
* Information encoding: The encoder condenses the input into contextual representations (one vector per input token, rather than a single fixed-size vector), which the decoder then attends to when generating the output sequence.

**Disadvantages:**

* Computationally expensive: Requires processing the entire input sequence before generating any part of the output sequence, which can be computationally expensive.
* Not suitable for autoencoding tasks: May not be the best choice for tasks where the input and output sequences are expected to be similar or identical.


## Tokenization for language models

Now we will discuss the different ways that language models recognize and “read” text.

Humans do this inherently because they previously learned phonetic sounds. Machines don’t have phonetic knowledge so they need to be told how to break text into standard units to process it.
They use a system called “tokenization”, where sequences of text are broken into smaller parts, or “tokens”, and then fed as input.

![text-processing---machines-vs-humans.png](images/text-processing---machines-vs-humans.png)

Image credit: https://blog.floydhub.com/tokenization-nlp/

### Tokenizing based on "words"

Given the syntax of the English language, breaking sentences into word-level chunks, or tokens, seems like the most natural approach.

Although this seems easy, it can actually be done in different ways as shown in the following diagram.

![tokenize_words.png](images/tokenize_words.png)

Image credit: https://blog.floydhub.com/tokenization-nlp/

There are some issues with this approach though:

* You need a big vocabulary: The model can only learn the words in its training vocabulary, and any word outside it is treated as unknown. Because words are not broken into sub-words, the model also misses the relationship between forms such as “talk”, “talks”, “talked” and “talking”.
* Words are combined: There may be some confusion about what exactly constitutes a word. Some words such as “sun” and “flower” are compounded to make sunflower. Are these one word or multiple?
* Some languages don’t segment by spaces.
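As a rough illustration of two word-level strategies and of the unknown-word problem, consider the sketch below. The example sentence, the tiny vocabulary, and the `<unk>` placeholder are assumptions made for the example.

```python
import re

sentence = "Let's tokenize! Isn't this easy?"

# Option 1: split on whitespace only; punctuation stays attached to words.
by_space = sentence.split()

# Option 2: separate runs of word characters from punctuation marks.
by_punctuation = re.findall(r"\w+|[^\w\s]", sentence)

print(by_space)
print(by_punctuation)

# Words outside a fixed (hypothetical) vocabulary collapse to an unknown token.
vocab = {"talk", "talks"}
mapped = [word if word in vocab else "<unk>" for word in ["talk", "talked"]]
print(mapped)
```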

### Character-based tokenization

To address these issues, we can try tokenizing the input text character by character.

![chars-tokenization.png](images/chars-tokenization.png)

Image credit: https://blog.floydhub.com/tokenization-nlp/

Issues with this approach:
* Lack of meaning: Unlike words, characters don’t have any inherent meaning, so there is no guarantee that the resultant learned representations will have any meaning.
* Increased input computation: With word-level tokens, a 7-word sentence is split into 7 input tokens. However, assuming an average of 5 letters per word (in the English language), you now have 35 inputs to process. This increases the scale of the inputs you need to process.
* Limits network choices: Increasing the size of your input sequences at the character level also limits the type of neural networks you can use.
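The growth in input length can be checked with a quick sketch. The sentence is made up so that it has 7 words averaging 5 letters each, and spaces are ignored as tokens for simplicity.

```python
sentence = "These seven words become many more tokens"

word_tokens = sentence.split()
# Character-level tokens, ignoring spaces for this comparison.
char_tokens = [c for c in sentence if c != " "]

print(len(word_tokens), len(char_tokens))
```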

### Subword tokenization
This tokenization type deals with a potentially infinite vocabulary via a finite list of known subword units.

There are different ways of doing this:

**Byte-pair encoding**

![Byte_Pair_enc.webp](images/Byte_Pair_enc.webp)

Image credit: https://towardsdatascience.com/tokenization-algorithms-explained-e25d5f4322ac

BPE was initially introduced to help compress data by finding common byte pair combinations.

This tokenization first forms a base vocabulary, which is the collection of all unique characters present in the corpus. We also calculate the frequency of each word and represent each word as a list of individual characters from the base vocabulary.

Now merging begins. At each step, the pair of adjacent tokens that occurs most often is merged and added to the vocabulary as a new token. We keep adding tokens in this way until the maximum vocabulary size is reached.
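The merge loop can be sketched as follows. This is a simplified illustration: the word counts are a toy corpus, `learn_bpe` is a hypothetical function name, and a fixed number of merges stands in for the maximum vocabulary size.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} mapping."""
    # Represent each word as a tuple of single-character tokens.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged token.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, corpus = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

After three merges on this toy corpus, the word "hug" is represented by a single token, while "hugs" is split into "hug" and "s".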

**Word-piece tokenizer**



Word-piece tokenization is similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data:

![WordPieceTok.webp](images/WordPieceTok.webp)

Word-piece and BPE go through every potential option at each step and pick the tokens to merge based on the highest frequency/likelihood. In this sense each is a greedy algorithm, which picks the best option at each step of its iteration.
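One commonly described way to score Word-piece merges is to divide the pair frequency by the product of the frequencies of its two parts. The sketch below assumes that scoring rule, which may differ from the original implementation, and it omits the usual `##` prefix for word-internal tokens. The corpus is a toy example.

```python
from collections import Counter

def best_wordpiece_pair(corpus):
    """Pick the pair with the highest freq(pair) / (freq(first) * freq(second))."""
    token_counts, pair_counts = Counter(), Counter()
    for symbols, count in corpus.items():
        for symbol in symbols:
            token_counts[symbol] += count
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += count

    def score(pair):
        return pair_counts[pair] / (token_counts[pair[0]] * token_counts[pair[1]])

    return max(pair_counts, key=score)

# Toy corpus: each word is a tuple of characters mapped to its count.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
print(best_wordpiece_pair(corpus))
```

On this corpus the score favors the pair ("g", "s"), whereas a pure frequency count would pick ("u", "g"), because "u" is so common on its own that pairs containing it are penalized.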

However, this greedy algorithm can result in a potentially ambiguous final token vocabulary. This is especially the case when there is more than one way to encode a particular word. How do you choose which subword units to use?

**Unigram**

Unigram sticks to predicting the most likely token based on probabilities learned during training, without looking at the surrounding context. For example, how likely it is that the next word is “learning” depends only on how often the word “learning” turns up in the training set.

To generate a unigram subword token set you need to first define the desired final size of your token set and also a starting seed subword token set.

Then:

1. Work out the probability for each subword token
2. Work out the loss that would result if each subword token were dropped. The loss is worked out via the algorithm described in the paper (Kudo 2018), which is an expectation-maximization algorithm.
3. Drop the tokens whose removal increases the loss the least. You can choose how many to drop here, e.g. the bottom 10% or 20% of subword tokens ranked by their loss values. Note that you need to keep single characters to be able to deal with out-of-vocabulary words.
4. Repeat these steps until you reach your desired final vocabulary size or until there is no change in token numbers after successive iterations.
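The sketch below illustrates only one ingredient of the steps above: given token probabilities, find the most likely way to split a word, and see how the result changes when a token is dropped. The probabilities are made up and unnormalized, `best_segmentation` is a hypothetical helper, and the full expectation-maximization training loop is not shown.

```python
import math

def best_segmentation(word, token_probs):
    """Most likely split of `word` into known tokens under a unigram model."""
    # best[i] holds (log-probability, tokens) for the best split of word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][1] is not None:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[len(word)]

# Made-up token probabilities for illustration.
token_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.1, "ug": 0.2, "hug": 0.15}
log_prob, tokens = best_segmentation("hug", token_probs)

# Dropping "hug" forces a less likely split; the drop in log-probability
# is the kind of quantity the loss in step 2 is built from.
without_hug = {t: p for t, p in token_probs.items() if t != "hug"}
log_prob_dropped, tokens_dropped = best_segmentation("hug", without_hug)
print(tokens, tokens_dropped, log_prob - log_prob_dropped)
```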

## Encoder-only Transformers

In addition to the encoder-decoder architecture shown here, there are various other architectures that are either encoder-only or decoder-only models.

### Bidirectional Encoder Representations from Transformers (BERT) model



Encoder-only models use only the encoder stack of the Transformer.

These models are usually used for "understanding" natural language; however, they typically are not used for text generation. Examples of uses for these models are:

1. Determining how positive or negative a movie’s reviews are. (Sentiment Analysis)
2. Summarizing long legal contracts. (Summarization)
3. Differentiating words that have multiple meanings (like ‘bank’) based on the surrounding text. (Polysemy resolution)

These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.
The attention mechanisms of these models can access all the words in the initial sentence.

The most common encoder-only architectures are:

* ALBERT
* BERT
* DistilBERT
* ELECTRA
* RoBERTa

As an example, let's consider the BERT model in a little more detail.

![BERT_Explanation.webp](images/BERT_Explanation.webp)

Image credit: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

The BERT model is bidirectionally trained to have a deeper sense of language context and flow than single-direction language models.

The Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

Part of BERT's pre-training is next sentence prediction (alongside masked language modeling, where the model predicts randomly masked words). In this task, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are pairs in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence.
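A minimal sketch of how such sentence pairs could be assembled, assuming a document is given as a list of sentences and the random sentences come from a separate list. The function name, the placeholder sentences, and the fixed random seed are assumptions for illustration.

```python
import random

def make_nsp_pairs(document, corpus_sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from one document."""
    examples = []
    for first, second in zip(document, document[1:]):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            examples.append((first, second, True))
        else:
            # Negative example: a random sentence that is not the true next one.
            candidates = [s for s in corpus_sentences if s != second]
            examples.append((first, rng.choice(candidates), False))
    return examples

rng = random.Random(0)
doc = ["s1", "s2", "s3", "s4"]      # placeholder sentences of one document
others = ["x1", "x2", "x3"]         # placeholder sentences from the rest of the corpus
examples = make_nsp_pairs(doc, others, rng)
print(examples)
```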

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.

![BERT_input_sent.webp](images/BERT_input_sent.webp)

Image credit: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
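The three preprocessing steps can be sketched as below, assuming the sentences are already tokenized. `build_bert_input` is a hypothetical helper, and the example tokens are illustrative.

```python
def build_bert_input(tokens_a, tokens_b):
    """Assemble tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
print(tokens)
print(segment_ids)
print(position_ids)
```

In the model itself, each of these three id sequences is mapped to an embedding, and the token, sentence, and positional embeddings are summed for every position.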

To predict if the second sentence is indeed connected to the first, the following steps are performed:

1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3. The probability of IsNextSequence is calculated with softmax.

### Advantages and disadvantages

**Advantages:**

* Contextualized embeddings: Good for tasks where contextualized embeddings of input tokens are crucial, such as natural language understanding.
* Parallel processing: Allows for parallel processing of input tokens, making it computationally efficient.

**Disadvantages:**

* Not designed for sequence generation: Might not perform well on tasks that require sequential generation of output, as there is no inherent mechanism for auto-regressive decoding.


## Decoder-only models

### Generative Pre-trained Transformer (GPT)-2

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.

The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving text generation.

Examples of these include:
* CTRL
* GPT
* GPT-2
* Transformer XL

Let's discuss one of the most popular models, GPT-2 in a little more detail.

The architecture of GPT-2 is inspired by the paper "Generating Wikipedia by Summarizing Long Sequences", which proposed another arrangement of the Transformer block that can do language modeling. That model threw away the encoder and is thus known as the “Transformer-Decoder”.

![transformer-decoder-intro.png](images/transformer-decoder-intro.png)

Image credit: https://jalammar.github.io/illustrated-gpt2/

An important difference between the GPT-2 architecture and encoder-only architectures such as BERT is the type of attention mechanism used.

In models such as BERT, the self-attention mechanism has access to tokens to the left and right of the query token. However, in decoder-based models such as GPT-2, masked self-attention is used instead, which allows access only to tokens to the left of the query.

The masked self-attention mechanism is important for GPT-2 since it allows the model to be trained for token-by-token generation without simply "memorizing" the future tokens.

![self-attention-and-masked-self-attention.png](images/self-attention-and-masked-self-attention.png)

Image credit: https://jalammar.github.io/illustrated-gpt2/

Masked self-attention builds an understanding of the relevant preceding words that explain the context of a given word before it is passed through a neural network. It assigns a score to how relevant each word in the segment is, and then adds up their vector representations weighted by those scores. The result is then passed through the feed-forward network, resulting in an output vector.

![gpt2-self-attention-example-2.png](images/gpt2-self-attention-example-2.png)

Image credit: https://jalammar.github.io/illustrated-gpt2/

The resulting vector then needs to be converted to an output token. A common method of choosing this output token is known as top-k sampling.

Here, the output vector is multiplied by the token embeddings, which results in a score for each token in the vocabulary. The scores are turned into probabilities, and the output token is then sampled from the k highest-scoring tokens according to these probabilities.

![gpt2-output.png](images/gpt2-output.png)

Image credit: https://jalammar.github.io/illustrated-gpt2/
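A small sketch of top-k sampling, assuming the per-token scores (logits) are already computed. The function name, the scores, and the fixed random seed are made up for illustration.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Softmax numerators over the kept tokens; rng.choices normalizes the weights.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [1.0, 3.0, 0.5, 2.5]   # made-up scores for a 4-token vocabulary
print(top_k_sample(logits, k=2, rng=rng))
```

With `k=1` this reduces to always picking the highest-scoring token, while larger values of `k` allow more varied output.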

### Advantages and disadvantages

**Advantages:**

* Auto-regressive generation: Well-suited for tasks that require sequential generation, as the model can generate one token at a time based on the previous tokens.
* Variable-length output: Can handle tasks where the output sequence length is not fixed.

**Disadvantages:**

* No separate encoding of the input: There is no encoder, so the input (prompt) is only seen through left-to-right masked attention rather than bidirectionally, which might be a limitation for certain tasks.
* Potential for inefficiency: Decoding token by token can be less computationally efficient compared to parallel processing.

## Additional architectures

In addition to text, Transformer-based models have also been applied to other data sources such as images and graphs. Here I will describe two particular architectures:
1. Vision Transformers
2. Graph Transformers

### Vision Transformers

The Vision Transformer (ViT) is an architecture that uses self-attention mechanisms to process images.

The way this works is:

1. Split image into patches (size is fixed)
2. Flatten the image patches
3. Create lower-dimensional linear embeddings from these flattened image patches and include positional embeddings
4. Feed the sequence as an input to a transformer encoder
5. Pre-train the ViT model with image labels in a fully supervised manner on a big dataset
6. Fine-tune on the downstream dataset for image classification

![vision-transformer-vit.png](images/vision-transformer-vit.png)

Image credit: Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
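The first two steps (splitting and flattening) can be sketched as below. This assumes a single-channel image stored as a list of rows, with height and width divisible by the patch size. The function name and the 4x4 toy image are made up.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row to flatten it into one vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
print(patches)
```

Each flattened patch would then be linearly projected and combined with a positional embedding before entering the Transformer encoder.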

### Graph Transformers

![Graphformer.png](images/Graphformer.png)

Image credit: Yang, Junhan, et al. "GraphFormers: GNN-nested transformers for representation learning on textual graph." Advances in Neural Information Processing Systems 34 (2021): 28798-28810.

References:



https://huggingface.co/learn/nlp-course/chapter1/4

https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/

https://jalammar.github.io/illustrated-transformer/

https://towardsdatascience.com/tokenization-algorithms-explained-e25d5f4322ac

https://blog.floydhub.com/tokenization-nlp/

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

https://jalammar.github.io/illustrated-gpt2/

Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

Yang, Junhan, et al. "GraphFormers: GNN-nested transformers for representation learning on textual graph." Advances in Neural Information Processing Systems 34 (2021): 28798-28810.