# Introduction to Transformers

In [19]:
import sys
!{sys.executable} -m pip install transformers
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install sentencepiece

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Why Transformers?

The Transformer architecture was introduced for the task of machine translation. The models previously used for this task were RNNs (recurrent neural networks) and LSTMs (long short-term memory networks). Their main problems were slow, sequential processing and degrading performance on longer sentences (inputs).

What do Transformers do differently?
* The input sentence is processed all at once, whereas RNNs and LSTMs process it word by word. This allows for parallelization and thus improves speed.
* Self-attention, together with processing the whole sentence at once, keeps the information about dependencies between words intact. In RNNs and LSTMs, the dependencies on other words are passed along in the hidden states, so more of this information is lost the longer the sequence is.
* Apart from word embeddings, which are vector representations of the words, Transformers also have positional embeddings. Because the sentence is no longer processed word by word, we need a different way to include information about the position of each word in the sentence.


## 1. 🏞️ The journey from the input...

First, we need to tokenize the input. A tokenizer has a vocabulary in which each token (a word or a part of a word) has an ID. Always use the tokenizer that was used during the pre-training and fine-tuning of your model; otherwise you will get complete gibberish. The tokenizer also has special tokens such as EOS (end of sentence) and BOS (beginning of sentence).

During training (and even during inference), you may want to process multiple inputs at once. As the model works with tensors, the inputs need to be tokenized to `input_ids` of the same length. Because of that, the tokenized input also contains an "attention mask", which indicates which parts of the tokenized sequence are relevant and which are only padding added to meet the same-length requirement.

In [20]:
from transformers import AutoTokenizer


tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small") # Download the corresponding tokenizer
input_sentence = ["This is the supercalifragilisticexpialidocious input text.", "second text"]

# We enable padding, as we need both sentences tokenized to sequences of the same length
tokenized_input = tokenizer_t5(input_sentence, padding=True)

print("Original output: {}".format(input_sentence))
print("Tokenized output (IDs): {}".format(tokenized_input.input_ids))

# attention mask vector provides information on which parts of the sample are relevant (are not padding)
print("Tokenized output (attention mask): {}".format(tokenized_input.attention_mask))

# The tokenizer has a way of dealing with words which are not in its vocab.
# A token that starts a new word (follows a whitespace) is prefixed with "▁"; if a word has to be split, its remaining pieces have no "▁" prefix
print("Tokenized output (in tokens) - 1st sequence: {}".format(tokenizer_t5.convert_ids_to_tokens(tokenized_input.input_ids[0])))


Original output: ['This is the supercalifragilisticexpialidocious input text.', 'second text']
Tokenized output (IDs): [[100, 19, 8, 1355, 15534, 20791, 173, 3040, 994, 102, 23, 4288, 7171, 2936, 3785, 1499, 5, 1], [511, 1499, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Tokenized output (attention mask): [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Tokenized output (in tokens) - 1st sequence: ['▁This', '▁is', '▁the', '▁super', 'cali', 'frag', 'il', 'istic', 'ex', 'p', 'i', 'ali', 'doc', 'ious', '▁input', '▁text', '.', '</s>']


### 1.1 Embeddings
The tokenized input then needs to be transformed into a vector representation. This representation needs to hold information about the semantic properties of a word (*Input Embedding*) and positional information about where the word is in the input sequence (*Positional Embedding*). Each token's embedding has the size of the model's hidden dimension (512 for `t5-small`).


In [21]:
from transformers import T5ForConditionalGeneration

input_sentence = ["This is the supercalifragilisticexpialidocious input text."]
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
input_ids = tokenizer_t5(input_sentence, return_tensors="pt")
outputs = model_t5.generate(
    input_ids.input_ids,
    return_dict_in_generate=True,
    output_hidden_states=True,
    output_attentions=True,
    output_scores=True,
)

# encoder_hidden_states contains the word embeddings, followed by the output of each encoder layer
word_embeddings_for_input_sentence = outputs.encoder_hidden_states[0]
print("Size of the input ids tensor: {}".format(input_ids.input_ids.shape))
print("Size of the word embedding for the input sequence: {}".format(word_embeddings_for_input_sentence.shape))



Size of the input ids tensor: torch.Size([1, 18])
Size of the word embedding for the input sequence: torch.Size([1, 18, 512])


### 1.2 What is Self-attention?
The self-attention layer is a building block of the Transformer architecture. It lets the representation of each token include information about its connections to the other words/tokens in the input sequence. Below you can see how the attention score is computed:

![attention_score.png](https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)
[[source]](https://jalammar.github.io/illustrated-transformer/)

The input and output of every self-attention layer will be embeddings / hidden states.

The embeddings are passed to three separate linear layers, which compute the **query matrix Q**, the **key matrix K** and the **value matrix V** by multiplying the embeddings by their own weights ($W^{Q}$, $W^{K}$, $W^{V}$).  

Each word queries its scores against all words from the input sequence; this is done by matrix multiplication of $Q$ and $K^T$. The results are divided by $\sqrt{d_k}$ (for stability purposes), and a softmax function is applied so that the values in each row add up to 1. The result is then multiplied by the matrix $V$, which applies the attention scores to the corresponding values (imagine $q_1$ being compared against every row of $K$, and the resulting weights being used to form a weighted sum of the rows of $V$).
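
The computation described above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming that $Q$, $K$ and $V$ are already given as small lists of rows (the linear layers that produce them are skipped); the helper names `matmul`, `transpose`, `softmax` and `scaled_dot_product_attention` are made up for this example and are not part of the `transformers` library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over a single row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                 # (seq_len, seq_len)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # divide by sqrt(d_k)
    weights = [softmax(row) for row in scaled]                       # each row sums to 1
    return matmul(weights, v), weights

# Toy Q, K, V for a sequence of 3 tokens with d_k = 2 (values chosen by hand)
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(q, k, v)
print(weights[0])  # attention of the first token over all three tokens
print(output[0])   # the first token's new representation, a weighted sum of the rows of V
```

Each row of `weights` tells us how much one token attends to every token in the sequence, and each row of `output` has the same size as a row of `v`.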

#### 1.2.1 Multi-head attention
Instead of computing only one attention score, we use multi-head attention. What changes?
* ($W^{Q}$, $W^{K}$, $W^{V}$) are split and reshaped into a tuple of smaller ($W^{Q}$, $W^{K}$, $W^{V}$) for each head
* in each head, Q, K and V are computed and the attention score is computed using $d_k$ = embedding size / number of heads
* the results of all heads are concatenated, so that the output again has the same size as the input embedding.

Computing multiple attention scores in parallel allows for a richer representation.
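
The split, per-head attention and concatenation steps can be sketched as follows. This is a simplified illustration in plain Python, assuming the projections are identities (the same toy matrix is used as Q, K and V) and that heads are formed by slicing the feature dimension; the function names are hypothetical. Real implementations typically also apply a final linear layer to the concatenated result, which is left out here.

```python
import math

def attention(q, k, v):
    """Scaled dot-product attention for a single head (matrices as lists of rows)."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out

def split_heads(x, num_heads):
    """Split the feature dimension of x into num_heads equal slices."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    heads = [
        attention(qh, kh, vh)
        for qh, kh, vh in zip(split_heads(q, num_heads), split_heads(k, num_heads), split_heads(v, num_heads))
    ]
    # Concatenate the per-head outputs along the feature dimension
    return [sum((head[i] for head in heads), []) for i in range(len(q))]

# Toy input: 3 tokens, embedding size 4, so each of the 2 heads works with d_k = 2
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, x, x, num_heads=2)
print(len(out), len(out[0]))  # same shape as the input: 3 tokens, 4 features
```

Each head only sees its own slice of the features, so the heads can learn to attend to different kinds of relationships between tokens.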

#### 1.2.2 Types of self-attention layers

* Encoder self-attention - Q, K and V are computed from the input sequence embedding or the hidden state of the previous encoder layer
* Decoder self-attention - Q, K and V are computed from the target/output sequence embedding or the hidden state of the previous decoder layer
* Encoder-Decoder attention (cross-attention) - Q is computed from the target/output sequence embedding or the hidden state of the previous decoder layer, K and V are computed from the last hidden state of the Encoder



![transformer_architecture.png](https://heidloff.net/assets/img/2023/02/transformers.png)
[[source]](https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/)


## 2. 💪 Pre-training
The magic of deep language models lies in their good preliminary knowledge of language. They obtain this knowledge during the so-called pre-training phase, where they are trained on a different task: a token classification task that is an instance of Language Modeling.


### 2.1 Language Modeling

In Language Modeling tasks, models are asked to 'guess' the **right word in the context**.

This task comes in two main instances:

* **Masked Language Modeling (MLM)**: Models guess the correct token **within context**. This objective best prepares the model for **classification tasks** (Named Entity Recognition, Sequence Classification)

![image.png](https://www.rohanawhad.com/content/images/size/w1600/2022/04/image.png)
[[source]](https://www.rohanawhad.com/improvements-of-spanbert-over-bert/)

* **Causal Language Modeling (CLM)**: Models guess the **following token** from the previous context. This objective is better for preparing the model for **generation**, such as Dialogue, Summarization, Translation.

![image.png](https://gcdnb.pbrd.co/images/Bx4h6Lordx0y.png?o=1)  
![image.png](https://gcdnb.pbrd.co/images/rb7bmZS11gtl.png?o=1)
![image.png](https://gcdnb.pbrd.co/images/gXYffjzLIk7n.png?o=1)

Other variants: Replaced (synthetic) token detection (ELECTRA), Mask infilling (BART, T5), Sequence reconstruction (BART), Sequence Classification (BERT).
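
To make the difference between the two main objectives concrete, here is a small sketch of how training inputs and labels could be built from one tokenized sentence. It is a simplification under stated assumptions: the function names, the `[MASK]` string and the masking probability are illustrative choices, and real pre-training pipelines use more elaborate masking schemes.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked LM: hide some tokens; the model must recover them from both sides of the context."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)    # the model is asked to predict this token
        else:
            inputs.append(tok)
            labels.append(None)   # this position does not contribute to the loss
    return inputs, labels

def make_clm_example(tokens):
    """Causal LM: at every position, the label is simply the next token."""
    return tokens[:-1], tokens[1:]

tokens = ["the", "cat", "sat", "on", "the", "mat"]

mlm_inputs, mlm_labels = make_mlm_example(tokens, mask_prob=0.5)
clm_inputs, clm_labels = make_clm_example(tokens)

print("MLM inputs:", mlm_inputs)
print("MLM labels:", mlm_labels)
print("CLM inputs:", clm_inputs)
print("CLM labels:", clm_labels)
```

In the MLM case the model sees the context on both sides of a masked position, while in the CLM case every label depends only on the tokens to its left.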


## 3. 💡 Fine-tuning for specific tasks
As stated above, Transformers can be used in multiple ways, depending on the desired output. We can either use the whole architecture as shown above or use only the decoder or only the encoder. In every task the input is a sequence of tokens. The output, and how we acquire it, depends on the task we choose. However, we can sort the tasks into two categories: **Token classification** and **Text generation**.

### 3.1 Token classification
For these tasks, we need the best representation of each word within its context (both left and right). Models pre-trained on tasks such as MLM are best suited for token classification. For the classification itself, we want the best representation of the input sequence, and therefore we use the last hidden state of the encoder. 

* **Text classification (positive/negative)**: Before the sequence we wish to classify, we add a $[CLS]$ token. After receiving the hidden states, we process them through a linear layer and classify the sentence based on the value for the $[CLS]$ token.

* **Token classification (NER)**: A linear layer maps the hidden states to logits, which give the most probable label for each token. During training, a cross-entropy loss is computed on these logits.

* **Extractive QA**: A linear layer maps the hidden states to logits, which are used to find the best candidates for the start and end tokens of the answer in the sequence. During training, a cross-entropy loss is computed on these logits.
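
The "hidden states, then linear layer, then most probable label" pattern shared by these tasks can be sketched as below. All numbers are hypothetical and chosen by hand; for brevity the same toy weights are reused for the per-token and the sentence-level prediction, whereas real models train a separate head for each task.

```python
def linear(hidden_state, weights, bias):
    """One linear layer: returns one logit per label for a single hidden state."""
    return [
        sum(h * w for h, w in zip(hidden_state, label_weights)) + b
        for label_weights, b in zip(weights, bias)
    ]

def argmax(values):
    """Index of the largest value (the first one in case of ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical last encoder hidden states: 3 tokens, hidden size 2.
# The first position plays the role of the [CLS] token.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Hypothetical classifier weights for 2 labels (one weight vector per label)
weights = [[1.0, -1.0], [-1.0, 1.0]]
bias = [0.0, 0.0]

# Token classification: one label per token
token_logits = [linear(h, weights, bias) for h in hidden_states]
token_labels = [argmax(logits) for logits in token_logits]

# Text classification: only the hidden state of the first ([CLS]) position is used
sentence_label = argmax(linear(hidden_states[0], weights, bias))

print(token_labels)
print(sentence_label)
```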

### 3.2 Text generation
In these tasks, the end goal is to guess the most probable token following the previous tokens. Models pre-trained on tasks similar to CLM are best suited for text generation. The following tasks overlap to some extent, as both take text as input and produce text as output. They need next-token generation, which is provided by the decoder, and as such can be performed by either a traditional Encoder-Decoder or a Decoder-only Transformer.

In both cases, the embeddings are processed (through either an Encoder-Decoder or a Decoder-only Transformer), the decoder hidden states are passed through a linear layer, and we get logits from which we determine the most probable token.


* Text generation (Completion, code generation): Given a sequence of words, the model generates the next word. This task can be trained on unlabeled data and is often used with Decoder-only Transformers. 
* Sequence to sequence (Summarization, Translation, etc.): These tasks need the model to learn to map between pairs of texts (English to German, article to abstract). They are mainly about transforming the input into a different text (translating it into a different language / simplifying the text while maintaining its meaning) and benefit from understanding the input sequence. Because of that, they are used mainly with Encoder-Decoder models.

![animation.gif](https://jalammar.github.io/images/t/transformer_decoding_2.gif)
[[source]](https://jalammar.github.io/illustrated-transformer/)

### 3.3 Decoder or Encoder-Decoder?
With the rise of ChatGPT and seeing its capabilities, it is a valid question whether a GPT (Decoder-only) model would suffice even on tasks previously thought to be better suited for a seq2seq model. A [paper comparing Encoder-Decoder and Decoder-only models](https://arxiv.org/pdf/2304.04052.pdf) on machine translation gives some reasons why we shouldn't focus only on Decoder-only large language models for all text generation tasks. 

Reasons for using Decoder-only models:
* Smaller size, and they can be trained on much more data (unsupervised)

Reasons against using Decoder-only models:
* No encoder hidden states in the decoder self-attention layers
* The only information about the input sequence is in the decoder hidden states

This results in more hallucination as the generation index grows, and in attention degeneration, as the hidden states cannot hold information about the input sequence.

The best rule of thumb would be: if you need the model not to diverge from the initial input -> Encoder-Decoder. If you would greatly benefit from exposing the model to a large quantity of data, and diverging from the meaning of the input sequence is not a massive drawback for you -> Decoder-only.

## 4. 🏞️ The journey from the input...to the output - Text Generation

Even though it may seem that we are done once we have a trained model, there is still the generation itself. While the weights in the decoder (and encoder) are fixed, we can still tweak the generated output by using different generation strategies as well as Logit Processors.

### 4.1 🗺️ Decoding strategies
While iteratively generating the output tokens, there are multiple strategies that can be used to achieve the "best" output text.

In [22]:
from transformers import AutoModelForCausalLM, AutoTokenizer
input_sentence = "Two thousand years ago, the "
# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for GPT-style models
model_gpt = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
input = tokenizer_gpt(input_sentence, return_tensors="pt")


* **Greedy search**: At each iteration, the most probable token is generated (most probable token at a time)

In [None]:
outputs=model_gpt.generate(**input, early_stopping=True, max_length=50) #greedy search
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernacular of the ancient Egyptians was the language of the gods. The language of the gods was the language of the gods. The language of the gods was the language of the gods. The language of the gods was']

* **Beam search**: At each iteration, the $N$ beams with the best overall probability are stored. After the generation is complete, the beam with the highest overall probability is returned (most probable output as a whole)


In [23]:
outputs=model_gpt.generate(**input, num_beams=2, num_return_sequences=2, max_length=50) #beam search
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernal equinox was the first time that the sun and moon met, and the sun and moon were the first time that the sun and moon met. The sun and moon were the first time that the sun', 'Two thousand years ago, the vernal equinox was the first time that the sun and moon met, and the sun and moon were the first time that the sun and moon met.\n\nThe sun and moon were the first time that']
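
The difference between greedy search and beam search can also be seen on a toy example where the next-token probabilities are written out by hand. The table `NEXT_TOKEN_PROBS` and both functions are made up for this illustration and only approximate what `generate` does internally (there is no end-of-sequence handling or length normalization here).

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token
NEXT_TOKEN_PROBS = {
    None: {"A": 0.5, "B": 0.4, "C": 0.1},   # first step, no previous token
    "A": {"A": 0.4, "B": 0.3, "C": 0.3},
    "B": {"A": 0.9, "B": 0.05, "C": 0.05},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}

def next_probs(sequence):
    return NEXT_TOKEN_PROBS[sequence[-1] if sequence else None]

def greedy_search(steps):
    """Pick the single most probable token at every step."""
    sequence, log_prob = [], 0.0
    for _ in range(steps):
        token, prob = max(next_probs(sequence).items(), key=lambda item: item[1])
        sequence.append(token)
        log_prob += math.log(prob)
    return sequence, log_prob

def beam_search(steps, num_beams):
    """Keep the num_beams best partial sequences and return the best complete one."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [token], score + math.log(prob))
            for seq, score in beams
            for token, prob in next_probs(seq).items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

greedy_seq, greedy_lp = greedy_search(steps=2)
beam_seq, beam_lp = beam_search(steps=2, num_beams=2)

print(greedy_seq, math.exp(greedy_lp))
print(beam_seq, math.exp(beam_lp))
```

Greedy search commits to the locally best first token, while beam search keeps a second candidate alive and ends up with a sequence that has a higher overall probability.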


* **Multinomial sampling**: The next token is sampled at random from the full probability distribution over tokens
* **Top-$K$ sampling**: The next token is sampled only from the $K$ tokens with the highest probabilities.  
* **Nucleus sampling (top-$p$ sampling)**: The tokens are sorted by probability, and the next token is sampled only from the smallest set of most probable tokens whose cumulative probability reaches $p$
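
The three sampling variants differ only in which tokens remain eligible before the random draw. The sketch below shows this on a hand-written distribution; the helper names and the probabilities are illustrative assumptions and do not mirror the internals of `generate`.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Multinomial sampling: draw one token according to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution
probs = {"the": 0.5, "a": 0.25, "cat": 0.15, "dog": 0.07, "zebra": 0.03}
rng = random.Random(0)

top_k_probs = top_k_filter(probs, k=2)
top_p_probs = top_p_filter(probs, p=0.8)

print(sample(probs, rng))        # multinomial sampling: any token can be drawn
print(sample(top_k_probs, rng))  # top-k sampling: only "the" or "a"
print(sample(top_p_probs, rng))  # top-p sampling: only "the", "a" or "cat"
```

With top-$K$ the number of candidates is fixed, while with top-$p$ it adapts to how peaked the distribution is.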


In [30]:
outputs=model_gpt.generate(**input, do_sample=True, top_k=0, max_length=50, early_stopping=True) # multinomial sampling (top_k=0 disables the default top-k filtering)
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))
outputs=model_gpt.generate(**input, do_sample=True, top_k=50, max_length=50, early_stopping=True) # Top-k sampling
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))
outputs=model_gpt.generate(**input, do_sample=True, top_p=0.1, top_k=0, max_length=50, early_stopping=True) #Top-p sampling
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Two thousand years ago, the ices were broken through by a combination of magic and fire. When the last ices were broken through caffeine, caffeine turned into fire. So far, it's been known that the same happens to a lot of people"]


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernal equinox was a great and beautiful place for the sun to shine in the warm, golden sunlight of the Eastern Mediterranean. The vernal equinox, known as the Golden Age of the Roman']
['Two thousand years ago, the vernacular of the English language filling the world with the words "mortal" and "flesh" was the English language. The English language was the language of the living. The English damage was done by the']


* **Temperature**: A parameter that rescales the logits (they are divided by the temperature) before the softmax function is applied. A higher temperature can lead to less probable words being generated (a more unexpected generated sequence); a low temperature leads to more conservative outputs. 
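
The effect of the temperature on the token distribution can be checked on a few hand-picked logits. This is a minimal sketch; the function name and the logit values are assumptions made for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens

sharp = softmax_with_temperature(logits, temperature=0.5)    # low temperature
neutral = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
flat = softmax_with_temperature(logits, temperature=2.0)     # high temperature

print(sharp)
print(neutral)
print(flat)
```

A low temperature concentrates the probability mass on the most likely token, while a high temperature flattens the distribution and gives less likely tokens a better chance of being sampled.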


In [24]:
outputs=model_gpt.generate(**input, early_stopping=True, max_length=50, do_sample=True, top_k=0, temperature=0.7) # multinomial sampling with a higher temperature
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))
outputs=model_gpt.generate(**input, early_stopping=True, max_length=50, do_sample=True, top_k=0, temperature=0.001) # multinomial sampling with a very low temperature
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernal equinox Greyhound - which is now the most popular London-trained, £3 billion-a-year transport system in the world - welcomed an arrival in the east to help with the discovery of']
['Two thousand years ago, the vernacular of the ancient Egyptians was the language of the gods. The language of the gods was the language of the gods. The language of the gods was the language of the gods. The language of the gods was']


### 4.2 👮 Logit Processors
Logit processors can be applied to the computed logits to alter them and therefore change the next most probable token. They do not depend on any decoding strategy. Some basic uses are enforcing a minimal or maximal length of the output, forbidding specific words, etc. Here we show a naive approach to creating a Logit Processor that enforces a token that should appear in the output.

In [29]:

from transformers import LogitsProcessor, LogitsProcessorList, BatchEncoding
import torch

class EnforceWordProcessor(LogitsProcessor):
    def __init__(self, desired_input: BatchEncoding, param: float = 200.):
        print(desired_input.input_ids)
        assert len(desired_input.input_ids) == 1  # the enforced word must be a single token
        self.desired_input = desired_input.input_ids[0]
        self.param = param  # indicates how strongly we want to enforce the word

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        last_id = input_ids[-1][-1] # ID of the last generated token (whose score we are adjusting)
        if last_id != self.desired_input:
            scores[-1,self.desired_input] = scores[-1,self.desired_input]/self.param
        else:
            scores[-1,self.desired_input] = -float("inf") # we only want to generate the token once
        return scores
    
enforced_word = " stupid" # we need the whitespace for the tokenizer to understand this as one token only

custom_processor = EnforceWordProcessor(tokenizer_gpt(enforced_word, add_special_tokens=False))

inputs = tokenizer_gpt(input_sentence, return_tensors="pt")

# A repetition penalty is needed, as the naive Logit Processor leads to repetition
# Here we are using the beam search decoding strategy, which is much more likely to fall into the pit of repetition
outputs = model_gpt.generate(**inputs, logits_processor=LogitsProcessorList([custom_processor]), repetition_penalty=5., max_length=140, num_beams=5)
output_text = tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True)

output_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[8531]


["Two thousand years ago, the vernal equinox was a time of great stupidity stupidness. It stupidened stupid people and made stupid people dumb.\n\nBut now it stupidens everyone stupidly. And that stupidness has stupidened everybody stupidly because stupid people don't want stupid things to happen. They stupidly believe stupid things will be done stupidly if they do stupid things in stupid ways. So stupid people have stupid beliefs stupidly about stupid things. That's why I stupidly stupid myself stupidly.\n\nThe stupidest thing you can do is not stupid at stupid times. You just stupid yourself stupidly when you think stupid things are going on stupidly"]

### Are Logit Processors worth the work?

There are use cases where a logit processor can help you enforce rules on how the output should be structured. A really good example of that is the [**jsonformer**](https://github.com/1rgs/jsonformer) repo.

# ✋ Hands-on: Adjusting the output without retraining using Logit Processors

In this first short session, we would like you to take the naive Logit Processor shown earlier and try to adjust it to perform better:
* Enforce not a single-token word, but a sequence of multiple tokens.
* Try to fix the repetition problem without using a repetition penalty.

You can also experiment with the different decoding strategies that we talked about above.