# Introduction to Transformers

1. **A description of a Transformer model**
2. **Pre-training a language model**
3. **Overview of specific tasks**
4. **How to manipulate the output, without re-training the model (generation strategies etc.)**
5. **Hands-on: Logit Processors**


In [2]:
import sys
!{sys.executable} -m pip install transformers[sentencepiece]==4.19.1
!{sys.executable} -m pip install torch



## Why Transformers?

The transformers structure was introduced regarding the task of **Machine Translation**. Models, that have been used for this task prior were RNN's (recurrent neural networks) models. Their main problems were: being slow and dropping performance with longer sentences (inputs).

What do the Transformers do differently?

- The **input sentence is processed at once**, where the RNN processed the sentence word by word. This allows for parallelization and thus improving the speed
- Using the **"self-attention"** and the sentence being processed at once helps to keep the information about dependencies intact, where the RNN approach was that the dependencies with the other words were passed down in the hidden states and as such the information was lost the longer the output sentence was.
- Apart from having word embeddings, which is a vector representation of the input, the transformers also have positional embeddings, because the sentence is no longer processed word by word, so we need a different way to include the information about the position of the word in the sentence.


## 1. 🏞️ The journey from the input...


First we need to tokenize the input. Tokenizers are basically a dictionary where each word/ syllable/character has an ID. You **always** need to **use the tokenizer that was used during the pre-training and fine-tuning of your model**, otherwise you will get complete gibberish. The tokenizer also has special tokens such as EOS (end of sentence), BOS (beggining of sentence) etc. (+ you can add special tokens)

When training (or in inference) it is possible, that you will want to process multiple inputs at once. As the model works with tensors, we need the inputs tokenized to input_ids of the same length. Because of that, the tokenized input also has the "attention mask" which is an indicator which parts of the tokenized sequence are relevant and which are only the padding for ensuring the same length requirement.


In [3]:
from transformers import AutoTokenizer


tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small") # Download the corresponding tokenizer
input_sentence = ["This is the supercalifragilisticexpialidocious input text.", "second text"]

# We enable padding, as we need both sentence tokenized to the sequence of the same length
tokenized_input = tokenizer_t5(input_sentence, padding=True)

print("Original output: {}".format(input_sentence))
print("Tokenized output (IDs): {}".format(tokenized_input.input_ids))

# attention mask vector provides information on which parts of the sample are relevant (are not padding)
print("Tokenized output (attention mask): {}".format(tokenized_input.attention_mask))

# The tokenizer has a way of dealing with words which are not in its vocab.
# Here if the word start after a whitespace it has "_" in front whereas if the word has to be split its parts have tokens without the "_"
print("Tokenized output (in tokens) - 1st sequence: {}".format(tokenizer_t5.convert_ids_to_tokens(tokenized_input.input_ids[0])))


Original output: ['This is the supercalifragilisticexpialidocious input text.', 'second text']
Tokenized output (IDs): [[100, 19, 8, 1355, 15534, 20791, 173, 3040, 994, 102, 23, 4288, 7171, 2936, 3785, 1499, 5, 1], [511, 1499, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Tokenized output (attention mask): [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Tokenized output (in tokens) - 1st sequence: ['▁This', '▁is', '▁the', '▁super', 'cali', 'frag', 'il', 'istic', 'ex', 'p', 'i', 'ali', 'doc', 'ious', '▁input', '▁text', '.', '</s>']


### 1.1 Embeddings

The tokenized input then needs to be transformed into a vector representation. This representantion needs to hold information about the semantic property of a word (_Input Embedding_) and positional information about where the word is in the input sequence (_Positional Embedding_). Each word's embedding has size equivalent of the hidden states (default 512).


In [4]:
from transformers import T5ForConditionalGeneration

input_sentence = ["This is the supercalifragilisticexpialidocious input text."]
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
input_ids = tokenizer_t5(input_sentence, return_tensors="pt")
outputs=model_t5.generate(input_ids.input_ids, return_dict_in_generate=True,output_hidden_states=True, output_attentions=True, output_scores=True)

#encoder hidden states contain word embedding and then all outputs of each encoder layer
word_embeddings_for_input_sentence = outputs.encoder_hidden_states[0] 
print("Size of the input ids tensor: {}".format(input_ids.input_ids.shape))
print("Size of the word embedding for the input sequence: {}".format(word_embeddings_for_input_sentence.shape))

Size of the input ids tensor: torch.Size([1, 18])
Size of the word embedding for the input sequence: torch.Size([1, 18, 512])


### 1.2 What is Self-attention?

Self-attention layer is a building block of the Transformer model, which is able to include the information about the connection between individual words/tokens from the input sequence. Below you can see how the attention score is computed

![attention_score.png](https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)
[[source]](https://jalammar.github.io/illustrated-transformer/)

Input and output of every self-attention layer are embeddings / hidden states.

The embeddings are passed to three separate linear layers, which compute the desired matrices **query matrix Q**, **key matrix K** and a **value matrix V** by multiplying the embbedding by their own weights ($W^{Q}$, $W^{K}$, $W^{V}$).

Each word queries the scores to all words from the input sequence, this is done by matrix multiplication of $Q$ and $K^T$. The results are divided (for stability purposes) and a softmax function (values in a row ad up to 1) is applied. The result is then multiplied with the matrix $V$. This applies the attention scores to the corresponding values (imagine $q_1$ being applied to all of matrix $K$ to be multiplied by $v_1$)

#### 1.2.1 Multi-head attention

Instead of computing only one attention score we use the multi-head attention. What changes?

- ($W^{Q}$, $W^{K}$, $W^{V}$) are split and resized into a tuple of smaller ($W^{Q}$, $W^{K}$, $W^{V}$) for each head
- in each head Q, K and V is computed and the attention score is computed using $d_k$ = embedding size/ # heads
- the results for each head are concatonated together and transformed, so they could be used as input for another encoder/decoder.

Allowing multiple attention score to be computed we allow for a richer representation.

#### 1.2.2 Types of self-attention layers

- Encoder self-attention (Multi-head attention) - Q,K and V are computed from the input sequence embedding or a hidden state of a previous encoder
- Decoder self-attention (Masked multi-head attention)- Q,K and V are computed from the target/output sequence embedding or a hidden state of a previous decoder
- Encoder-Decoder cross-attention (Multi-head attention)- Q is computed from the target/output sequence embedding or a hidden state of a previous decoder, K and V are from the last hidden state of the Encoder

![attentions.png](https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)
[[source]](https://jalammar.github.io/illustrated-transformer/)


## 2.💪 Pre-training & Fine-tuning

The magic of Deep language models lays in their good preliminary knowledge of language. They obtain this knowledge during so-called pre-training phase, where they are trained for a different token classification task an instance of Language Modeling.

### 2.1 Language Modeling

In the instances of Language Modeling task, models are asked to solve a task of 'guessing' the **right word in the context**.

This task comes in two main instances:

- **Masked Language Modeling (MLM)**: Models guess the correct token **within context**. This objective best prepares the model for **classification tasks** (Named Entity Recognition, Sequence Classification)

![image.png](https://www.rohanawhad.com/content/images/size/w1600/2022/04/image.png)
[[source]](https://www.rohanawhad.com/improvements-of-spanbert-over-bert/)

- **Causal Langauge Modeling (CLM)**: Models guess the **following token** from previous context. This objective is better for preparing the model for **generation**, such as Dialogue, Summarization, Translation.

![CLM_1](images/CLM_1_new.png)  
![CLM_2](images/CLM_2_new.png)
![CLM_3](images/CLM_3_new.png)

Other variances: Synthetic token classification (ALBERT), Mask infilling (BART, T5), Sequence reconstruction (BART), Sequence Classification (BERT).

### 2.2 Fine-tuning

This is the process of training a pre-trained model for a specific task with new data (usually supervised learning). A specific task corresponds to a "head", aka a linear layer, which is trained to weight the last hidden states accordingly.


## 3.💡Specific tasks

As stated above Transformers can be used in multiple ways, depending on what the desired output is. We can either use the whole architecture as shown above or use only the decoder or the encoder. In every task the input is sequence of tokens. The output and how we aquire it depends on which task we choose. However we can sort them into two categories: **Token classification** and **Text generation**

### 3.1 Token classification

For these tasks we need to have the best representation for each word within its context(left and right). Model pretrained on tasks such as MLM are best suited for token classification. For the classification itself, we want the best representation not a generated token, so we usually use the last hidden state of the encoder.

- **Sequence classification (positive/negative)**: Before the sequence we wish to classify we add a $[CLS]$ token. We classify the sentence based on the value for the $[CLS]$ token.

- **Token classification (NER)**: For each token we compute cross-entropy loss across all labels classify each token as the most probable label.

- **Extractive QA**: For each token we compute cross-entropy loss to find the best candidates for start and end of the answer tokens in the sequence.

### 3.2 Text generation

With these tasks the end goal is to guess the most probable token following the previous tokens. Model pre-trained on tasks similar to CLM are best suitable for text generation task. The following tasks have a certain overlap, as they both have text as input and output. These tasks need next token generation, which is provided by the decoder and as such can be performed by either a traditional Encoder-Decoder or Decoder-only Transformer.

In both cases the embeddings are processed (either through a ED or D transformer) the "best" - in some cases most probable token (depends on the generation strategy) is generated.

- **Text generation (Completion, code generation)**: Given a sequence of words, the model generates the next word. This task can be trained on unlabeled data and is often used with Decoder-only transformers.
- **Sequence to sequence (Summarization, Translation, etc.)**: These tasks need the model to learn to map pairs of text (english to german, article to abstract). These tasks are mainly about transcribing the input into a different text (translate to a different language / simplify the text but maintain the meaning) and benefit from understanding the input sequence. Because of that they are used mainly with Encoder-Decoder models.

![animation.gif](https://jalammar.github.io/images/t/transformer_decoding_2.gif)
[[source]](https://jalammar.github.io/illustrated-transformer/)

### 3.3 Decoder or Encoder-Decoder?

With the rise of ChatGPT and seing its capabilities it is a valid question if a GPT (Decoder only) model would not suffice even on tasks, previously though better suited for a seq2seq task. A [Paper comparing Encoder-Decoder and Decoder-only models](https://arxiv.org/pdf/2304.04052.pdf) on machine translation showed some reasons, why we shouldn't focus only on decoder-only large language models for all text generation tasks.

Reasons for using Decoder-only models:

- Smaller size and can be trained on much more data (unsupervised)

Reasons against using Decoder-only models:

- No encoder hidden states in the decoder self-attention layers
- The only information about the input sequence is in the decoder hidden states
  This results in more hallucination with the growing index of generation and the attention degeneration, as the hidden states cannot hold information about the input sequence

The best rule-of-thunb would be: If you need the model to not diverge from initial input -> Encoder-Decoder. Would you greatly benefit from exposing the model to a large quantity of data and diverging from the input sequence meaning is not a massive drawback for you -> Decoder-only


## 4. 🏞️ The journey from the input...to the output - Text Generation

Even though it may seem, that once we have a trained model we are done, there is the generation itself. While the weights in the decoder (and encoder) are set, we can still tweak the generated output, by using multiple generation strategies as well as using Logit Processors.


### 4.1 🗺️ Decoding strategies

While iteratively generating the output tokens, there are multiple strategies, that can be used to achieve the "best" output text.


In [5]:
from transformers import AutoModelWithLMHead, AutoTokenizer
input_sentence = "Two thousand years ago, the "
model_gpt = AutoModelWithLMHead.from_pretrained("gpt2")
tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
input = tokenizer_gpt(input_sentence, return_tensors="pt")




- **Greedy search**: At each iteration, the most probable token is generated (most probable token at a time)


In [6]:
outputs=model_gpt.generate(**input, early_stopping=True, max_length=50) #greedy search
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernacular of the ancient Egyptians was the language of the gods. The language of the gods was the language of the gods. The language of the gods was the language of the gods. The language of the gods was']


- **Beam search**: At each iteration $N$ number of beams with the best overall probability are stored. After the generation is complete the beam with highes overall probability is returned (most probable output as a whole)


In [7]:
outputs=model_gpt.generate(**input, num_beams=2, num_return_sequences=2, max_length=50) #beam search
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernal equinox was the first time that the sun and moon met, and the sun and moon were the first time that the sun and moon met. The sun and moon were the first time that the sun', 'Two thousand years ago, the vernal equinox was the first time that the sun and moon met, and the sun and moon were the first time that the sun and moon met.\n\nThe sun and moon were the first time that']


- **Multinomial sampling**: sampling the token distribution
- **Top-$K$ sampling**: sampling only $K$ samples with the highest probabilities.
- **Nucleus sampling (top-$p$ sampling)**: Computing a cumulative distribution function and sampling only till the cut-off at the $p$ quantile


In [8]:
outputs=model_gpt.generate(**input, do_sample=True, max_length=50, early_stopping=True) # multinomial sampling
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))
outputs=model_gpt.generate(**input, do_sample=True, top_k=50, max_length=50, early_stopping=True) # Top-k sampling
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))
outputs=model_gpt.generate(**input, do_sample=True, top_p=0.90, top_k=0, max_length=50, early_stopping=True) #Top-p sampling
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the iced tea industry had an average harvest of 1,000 tons of fresh tea per year. In the 1950s and 60s, we were drinking 4 to 5 tons of new tea per day, making the average daily']


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the urns became the home of the ancient world by the rise of Prometheus. It became the center of the universe in some way or another.\n\nHow long has the earth been inhabited by men?\n\nThe']
["Two thousand years ago, the vernacular of 'Lordhun' is a whole new weapon, as magic is reworked with a different twist, those ooze crystals grow larger to represent their wielder.\n\nTheir intricacies change and"]


- **Temperature**: A parameter which adjusts the logits before applying the softmax function. Greater temperature can lead to less probable words generated (more unexpected generated sequence), small temperature will lead to more conservative outputs.


In [9]:
outputs=model_gpt.generate(**input, early_stopping=True, max_length=50, do_sample=True, top_k=0, temperature=0.7) #multinomial sampling with high temp
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))
outputs=model_gpt.generate(**input, early_stopping=True, max_length=50, do_sample=True, top_k=0, temperature = 0.001) #multinomial sampling with low temp
print(tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Two thousand years ago, the vernacular was a rich language of the ancient world, which was combined with an advanced culture that was identifiable with the written language of the vernacular. The vernacular was a mixture of two different dialects']
['Two thousand years ago, the vernacular of the ancient Egyptians was the language of the gods. The language of the gods was the language of the gods. The language of the gods was the language of the gods. The language of the gods was']


### 4.2 👮 Logit Processors

Logit processors can be applied to the computed logits, to alter them and therefore change the next most probable token. They are not dependant on any decoding strategy. Some of the basic usages are enforcing minimal or maximal length of the output, forbidding specific words etc. Here we will show you a naive approach to creating a Logit processor which enforces a token that should be in the output.


In [10]:

from transformers import LogitsProcessor, LogitsProcessorList, BatchEncoding
import torch

class EnforceWordProcessor(LogitsProcessor):
    def __init__(self, desired_input: BatchEncoding, param: float = 5.5):
        print(desired_input)
        assert len(desired_input.input_ids) ==1
        self.desired_input = desired_input.input_ids[0]
        self.param = param #the parameter which indicates how much we want to enforce the word
        pass

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        last_id = input_ids[-1][-1] # ID of the last generated token (we are adjusting the next score)
        if last_id != self.desired_input:
            scores[-1,self.desired_input] = scores[-1,self.desired_input]*self.param
        else:
            scores[-1,self.desired_input] = -float("inf") # we only want to generate the token once
        return scores
    
enforced_word = " stupid" # we need the whitespace for the tokenizer to understand this as one token only

custom_processor = EnforceWordProcessor(tokenizer_gpt(enforced_word, add_special_tokens=False))

inputs = tokenizer_gpt(input_sentence, return_tensors="pt")

#repetition penalty is needed as the naive Logit Processor leads to repetition, so to avoid that, we include a rep. penalty
# Here we are using a beam search decodeing strategy, which is much more likely to fall into the pit of repetition
outputs = model_gpt.generate(**inputs, logits_processor=LogitsProcessorList([custom_processor]), max_length=140, repetition_penalty=3.0)
output_text = tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True)

output_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'input_ids': [8531], 'attention_mask': [1]}


['Two thousand years ago, the vernacular of English was a mixture between "the old" and stupid. The word for it is literally translated as:\nThe Old Man\'s Wife (or Woman) [ edit ]\n\n']

### Are Logit Processors worth the work?

The are use-cases, where a logit processor can help you enforce a rule of how structured the output should be. A really good example of that is the [**jsonformers**](https://github.com/1rgs/jsonformer) repo.


## 5. ✋ Hands-on: Adjusting the output without retraining using Logit Processors

In this first short session we would ask you, to take the previously shown naive Logit Processor and try to adjust it to perform better by:

- enforce not a single token word, but a sequence of multiple tokens.
- Try to fix the repetition problem without using a repetition penalty

You can also experiment with different decoding strategies, which we talked about above.


In [11]:
from transformers import LogitsProcessor, LogitsProcessorList
import torch

class YourVeryOwnLogitProcessor(LogitsProcessor):
    def __init__(self):
        pass

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        # The __call__ method has to return scores
        return scores
   
custom_processor = YourVeryOwnLogitProcessor()

text = "This is the start of the sequence"

inputs = tokenizer_gpt(text, return_tensors="pt") # tokenize the input sequence
outputs = model_gpt.generate(**inputs, logits_processor=LogitsProcessorList([custom_processor])) # generate the output - now using greedy search
output_text = tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True) # decode the output

output_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['This is the start of the sequence.\n\nThe first thing to do is to create a new']

## [Hands-on] solution


In [97]:
from transformers import LogitsProcessor, LogitsProcessorList, BatchEncoding
import torch
from numpy import sign, log

class EnforceTokenSequenceProcessor(LogitsProcessor):
    def __init__(self, desired_input: BatchEncoding, input: BatchEncoding, param: float =1.1):
        self.start_len = len(input.input_ids) # Length of the original input
        self.desired_input = desired_input.input_ids # Sequence of tokens we want to generate
        self.desired_input_len = len(self.desired_input) # Length of the desired sequence
        self.param = param # Parameter which indicates how much we want to enforce the sequence (trunctuated to the value of 1 if <1)
        self.num_gen_tokens = 0 # Index on how many tokens from the sequence are generated
        self.blacklist = [] # A list of tokens from the desired sequence which have already been generated
        pass

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        last_id = input_ids[-1][-1] # ID of the last generated token (we are adjusting the next score)
        gen_input_length = input_ids.shape[1]-self.start_len # length of generated text AFTER the input text

        # the exponent is for enforcing the sequence more, the longer the token is not generated
        # logarithm function because it is a lower growing function but
        # we have to trunctuate the function for param < 1 
        enforcement_param = log(self.param**gen_input_length) if not log(self.param**gen_input_length) < 1 else 1

        # If we are generating the sequence we move to the next token and add the generate token to blacklist
        if last_id == self.desired_input[self.num_gen_tokens]:
            self.blacklist.append(self.desired_input[self.num_gen_tokens])
            if self.num_gen_tokens < self.desired_input_len-1:
                self.num_gen_tokens+=1

        # If we stopped generating sequence not because of completing it we reset and start enforcing all over
        elif last_id != self.desired_input[self.num_gen_tokens] and self.num_gen_tokens > 0 and self.num_gen_tokens+1 != self.desired_input_len:
            self.blacklist = []
            self.num_gen_tokens = 0

        # Adjustment of scores

        # Rise the scores of the token we want to generate
        # sign function make sure we multiply scores > 0 and divide scores < 0 - to always increase the score 
        scores[-1,self.desired_input[self.num_gen_tokens]] = scores[-1,self.desired_input[self.num_gen_tokens]]*enforcement_param**sign(scores[-1,self.desired_input[self.num_gen_tokens]])
        # Make sure the previous tokens from the sequence cannot be generated
        scores[-1,self.blacklist] = -float("inf")
        return scores
    
enforced_sentence = " hope you had a great ML Prague conference"
input_sentence = "Well,"

inputs = tokenizer_gpt(input_sentence, return_tensors="pt")
tokenized_enforced_sequence = tokenizer_gpt(enforced_sentence, add_special_tokens=False)

# Initialize tour Logit Processor
custom_processor = EnforceTokenSequenceProcessor(tokenized_enforced_sequence, inputs, param = 2.1)

#repetition penalty is needed as the naive Logit Processor leads to repetition, so to avoid that, we include a rep. penalty
outputs = model_gpt.generate(**inputs, logits_processor=LogitsProcessorList([custom_processor]),max_length=140, repetition_penalty=3.0) #greedy search
output_text = tokenizer_gpt.batch_decode(outputs, skip_special_tokens=True)

output_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Well, I hope you had a great ML Prague conference.\nI'm sure there are many more that will be coming soon!"]

As you can see, the way of enforcing the given sentence is not really an elegant way. Also, if you try using it with beam search, you will find, that nothing will be generated, as there are other sequences whose joined probability is higher.

If you want to use beam search as the decoding strategy, and are interested in forced words generation try [**constrained beam search**](https://huggingface.co/blog/constrained-beam-search)
