# Transformers

On Transformers, TimeSformers, And Attention

```{image} ../figs/deep_nlp/transformers/entelecheia_transformers.png
:alt: transformers
:class: bg-primary mb-1
:width: 70%
:align: center
```


Transformers are a highly influential and powerful class of Deep Learning models that have become a standard in numerous Natural Language Processing (NLP) tasks, and are on the verge of revolutionizing the field of Computer Vision as well.

In 2017, researchers from Google Brain published a groundbreaking paper titled "Attention Is All You Need" {cite}`vaswani2017attention`. This paper introduced the Transformer model, which has since become a major force in the field of deep learning, particularly for NLP tasks.

The Transformer model is built upon the concept of attention, a mechanism that allows the model to focus on specific parts of the input data. Attention enables the model to weigh the importance of different input elements, thereby allowing it to concentrate on the most relevant aspects when processing data. This ability to focus on the most pertinent information has proven to be particularly effective for tasks that involve understanding and manipulating sequences of data, such as text or time-series information.

The Transformer model has demonstrated impressive performance across a wide range of NLP tasks. Some notable examples include machine translation, where the model converts text from one language to another; text summarization, which involves generating a concise summary of a longer text; and question answering, where the model provides answers to questions based on a given context.

Aside from its exceptional performance in NLP, the Transformer model has also shown great promise in other domains, such as image classification and speech recognition. In these tasks, the attention mechanism helps the model focus on the most significant features of the input data, which leads to better overall performance. As the model continues to be refined and adapted for different applications, it is poised to have a substantial impact on a wide range of fields, from NLP and computer vision to speech processing and beyond.


```{figure} ../figs/deep_nlp/transformers/transformers-history.jpeg
---
width: 70%
name: fig-transformers-history
---
History of Transformers
```


In 2020, Google Brain posed an intriguing question: "Will Transformers be as effective on images as they are on text?" To explore this, a team of researchers published a paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" {cite}`dosovitskiy2020image`. This paper demonstrated the potential of Transformer models to excel in various Computer Vision tasks, including image classification and object detection.

The authors of the paper proposed a new Transformer-based architecture, called Vision Transformer (ViT), which treats image patches as if they were words in a text. By dividing an image into fixed-size, non-overlapping patches, and then flattening and linearly embedding them into a sequence of tokens, the model can apply the powerful attention mechanisms of Transformers to image data. The ViT model has shown impressive results on a variety of Computer Vision tasks, proving that the Transformer architecture can be highly effective for image processing as well as text-based tasks.

At the beginning of 2021, Facebook researchers took the concept of Transformers even further by introducing TimeSformer, a new variation of the Transformer model specifically designed for video understanding. They published a paper titled "Is Space-Time Attention All You Need for Video Understanding?" {cite}`bertasius2021space`. TimeSformer leverages the attention mechanism to process not only spatial information in video frames but also temporal information across multiple frames. This approach allows the model to recognize and analyze patterns in both space and time, making it well-suited for tasks such as action recognition and video classification.

These advancements in Transformer-based models for both image and video processing showcase the versatility of the Transformer architecture. By adapting the attention mechanism to different types of data, researchers are discovering new ways to leverage the power of Transformers across a wide range of domains.


## Why do we need transformers?

Transformers were introduced to address several limitations and challenges associated with previous models, particularly those based on Recurrent Neural Networks (RNNs). RNNs were designed to process sequences of data, such as text or audio, by maintaining an internal state that can capture information about previous elements in the sequence. However, there were a number of issues with RNNs that motivated the development of the Transformer architecture.

One of the main drawbacks of RNNs is their inherently sequential nature. In order to process an input sequence, RNNs must process each element one at a time, starting with the first and continuing through to the last. This sequential operation makes it challenging to parallelize RNN models, which in turn limits their computational efficiency and scalability. In contrast, Transformers can process all elements of a sequence simultaneously, enabling them to take full advantage of parallel computing resources and significantly improve training and inference times.

Another issue with RNNs is the difficulty they face in capturing long-range dependencies between distant elements within a sequence. Due to their sequential processing, RNNs can suffer from problems like gradient explosion or vanishing gradients, which make it challenging for them to learn and maintain information about relationships between far-apart elements. Transformers, on the other hand, leverage the attention mechanism to directly model relationships between all elements in the input, regardless of their positions in the sequence. This allows Transformers to better capture long-range dependencies and improves their overall performance in tasks that require understanding complex relationships within the data.

To illustrate the difference between RNNs and Transformers, consider the task of translating a sentence from English to Italian. With an RNN-based approach, the first word of the source sentence is passed into the encoder along with an initial state. The resulting state is then passed to the next encoder step (the same network applied again), together with the second word of the sentence, and so on, until the last word. The final encoder state is then fed into the decoder, which generates the first translated word and a new state. This state is passed to the next decoder step, and the process continues until the entire sentence is translated. Because each step must wait for the previous one, this approach is inherently slower than the parallel processing enabled by Transformer models, which can consider the entire input sequence at once.



```{figure} ../figs/deep_nlp/transformers/problem-rnn.gif
---
width: 70%
name: fig-problem-rnn
---
Problem with RNNs
```


## Attention is all you need?

```{figure} ../figs/deep_nlp/transformers/attention.gif
---
width: 70%
name: fig-attention
---
Attention
```

The attention mechanism is a key component of the Transformer architecture that allows it to extract relevant information from a sentence or sequence in a parallelized manner, as opposed to the sequential processing of traditional Recurrent Neural Networks (RNNs). The attention mechanism enables the model to selectively focus on different parts of the input data that are most relevant to the task at hand, and it can do so simultaneously for all elements in the sequence.

Consider the example sentence: `I gave my dog Charlie some food.`

In this context, let's explore how the attention mechanism can identify important relationships between words in the sentence to better understand its meaning.

1. Focusing on the word "gave", we may want to determine which words in the sentence provide context to the action of giving. In this case, we could ask, "Who gave the food to the dog?" The attention mechanism would recognize the importance of the word "I" in answering this question, as it indicates the subject performing the action.
2. Similarly, we might want to determine the recipient of the action. By asking, "To whom did I give the food?", the attention mechanism would identify the words "dog" and "Charlie" as crucial for understanding the recipient of the action.
3. Finally, to determine the object of the action, we could ask, "What did I give to the dog?" Here, the attention mechanism would focus on the word "food", as it represents the object being given.

In each of these cases, the attention mechanism can identify and focus on the most relevant words in the sentence to provide context and meaning to the central action of "gave". By doing so, the model can extract essential information and relationships within the sentence in a highly parallelized and efficient manner. This powerful mechanism is at the core of the Transformer architecture, enabling it to excel in various natural language processing tasks, such as machine translation, text summarization, and question answering.


### How do we implement this attention mechanism?

```{figure} ../figs/deep_nlp/transformers/attention-calculate.gif
---
width: 70%
name: fig-attention-calculate
---
Attention mechanism
```

To understand and implement the attention mechanism, let's break it down into a series of steps. We'll use an analogy with databases to help explain the process more clearly.

1. **Encoding words as vectors**: First, we represent the sentence as a set of vectors. Each word in the sentence is encoded into a vector using a word embedding mechanism. We'll call these vectors "keys" (K).
2. **Defining the query**: We then define a query (Q) as a vector representing the word we want to focus on. The query can be a word from the same sentence (self-attention) or a word from another sentence (cross-attention).
3. **Computing similarity**: We calculate the similarity between the query (Q) and each of the keys (K) in the sentence. This is typically done by computing the dot product between the query vector and the transpose of the key vectors. The result is a vector of scores, with each score representing the similarity between the query and a key.
4. **Normalization**: We normalize the scores by applying the softmax function, which converts the scores into a probability distribution. The resulting vector contains the "attention weights" that signify the importance of each word in the sentence with respect to the query.
5. **Computing the context**: We multiply the attention weights by the sentence vectors, which in this role are called "values" (V), with one value vector per key (K). This results in a context vector (C), a weighted sum of the words in the sentence that captures the relevant context for the word we want to focus on.
6. **Linear transformation**: Finally, the context vector (C) is passed through a linear layer, which is a fully connected layer, to obtain the final output of the attention mechanism.
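
The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the word embeddings and the output projection `W_o` are random stand-ins, the function names are hypothetical, and the scores are divided by $\sqrt{d}$ as in the original Transformer paper, a scaling the steps above do not mention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v) -> context (n_q, d_v), weights (n_q, n_k)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # step 3: similarity of every query with every key
    weights = softmax(scores, axis=-1)   # step 4: each row becomes a probability distribution
    context = weights @ V                # step 5: weighted sum of the sentence vectors
    return context, weights

rng = np.random.default_rng(0)
n_words, d_model = 7, 8                    # "I gave my dog Charlie some food" has 7 words
X = rng.normal(size=(n_words, d_model))    # step 1: random stand-ins for word embeddings
W_o = rng.normal(size=(d_model, d_model))  # step 6: hypothetical (untrained) linear layer

# Step 2: in self-attention the queries come from the same sentence as the keys.
context, weights = attention(X, X, X)
output = context @ W_o
```

Row `i` of `weights` holds the attention that word `i` pays to every word in the sentence. Because all queries are stacked in one matrix, every row is computed in the same matrix product, which is where the parallelism comes from.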

```{figure} ../figs/deep_nlp/transformers/attention-focus.jpeg
---
width: 70%
name: fig-attention-focus
---
Attention implementation
```

To summarize, the attention mechanism involves encoding words in a sentence as vectors, defining a query vector for the word of interest, calculating the similarity between the query and each word in the sentence, and using the resulting attention weights to compute a context vector that captures the relevant information about the word of focus. This context vector is then transformed by a linear layer to obtain the final output.

By following these steps, the attention mechanism can selectively focus on different parts of the input sentence to provide the necessary context for understanding and processing the information. This powerful mechanism is at the core of the Transformer architecture and plays a crucial role in its success in various natural language processing tasks.


### Multi-head attention

The basic attention mechanism described earlier focuses on a single word and its context within a sentence. However, to effectively capture various relationships between words, we need to examine the input from multiple perspectives. This is where the multi-head attention mechanism comes into play.

Multi-head attention is an extension of the basic attention mechanism that utilizes multiple sets of queries, keys, and values, known as "heads". Each head has its own set of learnable parameters and is designed to capture different aspects of the relationships between words in the input. By using multiple heads, the model can attend to different parts of the input simultaneously and understand the input more comprehensively.

Here's how multi-head attention works:

1. **Multiple sets of queries, keys, and values**: Instead of using a single set of query, key, and value vectors, we use multiple sets (or "heads") with their own learnable parameters. Each head focuses on different aspects of the relationships between words in the input.
2. **Compute attention for each head**: For each head, we perform the same attention mechanism steps as before, namely, computing similarity scores, normalizing the scores using the softmax function, computing the context vector, and transforming the context vector using a linear layer.
3. **Concatenate the results**: Once we have obtained the context vectors for each head, we concatenate these vectors to form a single, combined context vector. This concatenated vector captures information from all the attention heads and provides a more comprehensive understanding of the relationships between words in the input.
4. **Final linear transformation**: The concatenated vector is passed through an additional linear layer to produce the final output of the multi-head attention mechanism.
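
A minimal NumPy sketch of these steps follows. It assumes the arrangement used in the original Transformer paper, where each head's learnable parameters are projection matrices applied *before* the attention computation, so it differs slightly from step 2 above. All weights are random stand-ins and the function name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own slice
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    # Step 1: one set of queries, keys, and values per head
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Step 2: attention is computed independently for every head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                      # (n_heads, n, d_head)
    # Step 3: concatenate the heads back into one vector per word
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    # Step 4: final linear transformation
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 7, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each slice `weights[h]` is a separate attention matrix, so different heads are free to focus on different relationships between the same words.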

```{figure} ../figs/deep_nlp/transformers/attention-multihead.jpeg
---
width: 70%
name: fig-attention-multihead
---
Multi-head attention
```

In summary, multi-head attention enhances the basic attention mechanism by using multiple heads to focus on different parts of the input simultaneously. This allows the model to capture a more comprehensive understanding of the relationships between words in the input, ultimately improving its performance in various natural language processing tasks.


## Transformer Architecture

The Transformer architecture, designed for various natural language processing tasks such as translation, consists of two primary components: the encoder and the decoder. Let's consider a Transformer model trained to translate a sentence from English to Italian and examine the roles of the encoder and decoder in this process.

```{figure} ../figs/deep_nlp/transformers/attention-architecture.gif
---
width: 70%
name: fig-attention-architecture
---
Transformer architecture
```

### Encoder

The encoder's role is to convert the input sentence into a meaningful vector representation. The steps involved are as follows:

1. **Tokenization**: The input sentence in English is first tokenized into individual words.
2. **Word Embedding**: Each word is converted into a vector using a word embedding mechanism.
3. **Positional Encoding**: Since the Transformer does not process the sentence sequentially, positional encoding vectors are added to the word embeddings to retain information about the order of the words in the sentence. These vectors are computed using sine and cosine functions and are of the same size as the word embedding vectors.
4. **Multi-head Attention**: The combined vectors are passed through the multi-head attention mechanism, which captures the relationships between words in the input sentence.
5. **Normalization and Feed-Forward Neural Network**: The output from the attention mechanism is normalized and passed through a feed-forward neural network.
6. **Stacking Encoder Layers**: The encoding process can be repeated multiple times, with each layer refining the sentence representation further.
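
The positional encoding in step 3 can be sketched as follows. The text above only states that sine and cosine functions are used; the specific formula here, with the constant 10000 and sines on even indices and cosines on odd indices, is the one from the original Transformer paper. The function name and the toy sizes are illustrative, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings of shape (n_positions, d_model)."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even embedding indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices use sine
    pe[:, 1::2] = np.cos(angles)   # odd indices use cosine
    return pe

pe = positional_encoding(n_positions=7, d_model=8)
embeddings = np.zeros((7, 8))        # placeholder for real word embeddings
encoder_input = embeddings + pe      # same size, so the two can simply be added
```

Because every position receives a different pattern of values, two identical words at different positions end up with different input vectors.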

### Decoder

The decoder's responsibility is to transform the encoded vector representation into a translated sentence in the target language, in this case, Italian. The steps involved are:

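1. **Input Preparation**: The decoder takes as input the previously translated words in Italian and the output from the encoder. At the very first step nothing has been translated yet, so decoding starts from a special start-of-sequence token; in the iteration shown here, the input consists of the first two translated words.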
2. **Positional Encoding and Multi-head Attention**: The decoder applies positional encoding and multi-head attention mechanisms to the translated words in Italian.
3. **Combining with the Encoder Output**: The output from this attention step is combined with the output from the encoder by calculating attention again, this time with queries taken from the decoder and keys and values taken from the encoder (cross-attention).
4. **Normalization and Feed-Forward Neural Network**: The combined result is normalized and passed through a feed-forward neural network.
5. **Predicting Next Word**: The output from the neural network is a vector of potential candidates for the next word in the translated Italian sentence.
6. **Iterative Decoding**: In the next iteration, the decoder takes as input the first three translated words in Italian along with the encoder's output. This process is repeated until the entire translated sentence is generated.

In summary, the Transformer architecture consists of an encoder that converts the input sentence into a vector representation and a decoder that translates the encoded vector into the target language. The model leverages multi-head attention, positional encoding, and feed-forward neural networks to capture and process the relationships between words effectively, resulting in improved performance in various natural language processing tasks.


```{figure} ../figs/deep_nlp/transformers/transformer-best.jpeg
---
width: 70%
name: fig-transformer-best
---
Problems with the Transformer architecture
```


## Problems with the Transformer architecture

```{figure} ../figs/deep_nlp/transformers/transformer-problem.gif
---
width: 70%
name: fig-transformer-problem
---
Problems with the Transformer architecture
```


The Transformer is a very powerful architecture, but one of its strengths is also its weakness: attention is expensive to compute.

- To calculate the attention of each word with respect to all the others, we have to perform $N^2$ calculations, where $N$ is the number of words in the sequence.
- Graphically, you can imagine an $N \times N$ matrix that has to be filled with the attention value of each word with respect to every other word.
- Optionally, and usually in the decoder, masked attention can be used so that the attention of a word with respect to the words that follow it is not calculated.
- Masked attention therefore computes the attention of a word only with respect to itself and the previous words.
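
The $N \times N$ matrix and the effect of masking can be made concrete with a short NumPy sketch. The scores are random stand-ins and the function name is hypothetical; the mask is implemented in the usual way, by setting the scores of future positions to $-\infty$ before the softmax.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) raw similarities -> weights where word i ignores words j > i."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # future words get a score of -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                   # exp(-inf) = 0, so no weight on the future
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
scores = rng.normal(size=(n, n))    # the full N x N matrix: n * n = 25 entries
weights = causal_attention_weights(scores)
```

The upper triangle of `weights` is exactly zero, so each word only receives information from itself and the words before it. The matrix still has $N^2$ entries, which is why doubling the sequence length quadruples the cost.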


```{figure} ../figs/deep_nlp/transformers/transformer-N2.gif
---
width: 70%
name: fig-transformer-N2
---
Attention matrix
```


## Attention Is Not All You Need

In March 2021, Google researchers published a paper titled "Attention Is Not All You Need" {cite}`dong2021attention`.

- The researchers analyzed the behaviour of the self-attention mechanism on its own, without any of the other components of the Transformer.
- They found that the output of stacked self-attention layers converges to a rank-1 matrix at a doubly exponential rate with respect to depth.
- In such a matrix all token representations have collapsed toward the same vector, which means that this mechanism, by itself, is practically useless.
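
The collapse can be observed in a toy experiment. The sketch below is a deliberately simplified setting and not the paper's experiment: it stacks attention layers with random query and key projections, uses an identity value projection (an assumption made here to keep the demo short), and has no skip connections or MLP. The helper `row_spread` is a hypothetical measure introduced for this illustration: the largest distance between any two token vectors, which is zero exactly when all rows are identical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_spread(X):
    """Largest distance between any two rows; 0 means all rows are identical."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

rng = np.random.default_rng(0)
n, d, depth = 6, 4, 4
X = rng.normal(size=(n, d))          # random stand-ins for token embeddings
spreads = [row_spread(X)]

for _ in range(depth):
    W_q = rng.normal(size=(d, d)) / np.sqrt(d)
    W_k = rng.normal(size=(d, d)) / np.sqrt(d)
    P = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d))   # attention matrix, rows sum to 1
    X = P @ X                        # pure attention: no skip connection, no MLP
    spreads.append(row_spread(X))

print(spreads)
```

Each new row is a weighted average of the previous rows, so the rows can only move closer together, and the printed spread shrinks from layer to layer.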


### So why are transformers so powerful?

- The researchers found that self-attention is not the only component that makes transformers so powerful.
- Their power comes from a tug of war between self-attention, which tends to reduce the rank of the matrix, and two other components of transformers: skip connections and the MLP.
- Skip connections diversify the paths that information can take through the network, so that not every path passes through every attention layer.
- This drastically reduces the probability of the model converging to a rank-1 matrix.
- The MLP, in turn, can increase the rank of the matrix thanks to the non-linearity of its activation function.
- Therefore, attention is not all you need: skip connections and the MLP are also necessary to make transformers powerful.
- With these components in place, the transformer architecture can use the self-attention mechanism to its advantage and achieve impressive results.


```{figure} ../figs/deep_nlp/transformers/transformer-tug-of-war.jpeg
---
width: 70%
name: fig-transformer-tug-of-war
---
Tug of war between the self-attention mechanism and skip connections and MLP
```


## Vision Transformers

"If Transformers have been found to be so effective in the field of Natural Language Processing, how will they perform with images?"

```{figure} ../figs/deep_nlp/transformers/vit.jpeg
---
width: 70%
name: fig-vit
---
Vision Transformer
```


If we consider a picture of a dog standing in front of a wall, we can imagine that the dog is the main subject of the picture and the wall is the background.

- This is because we are focusing on the dominant subject of the picture.
- This is the same concept that we use to understand the meaning of a sentence.
- This is exactly what the self-attention mechanism applied to images does.


### How to input images into a transformer?

- A first solution would be to use all the pixels of the image and pass them to the transformer.
- This solution is not very efficient, because the calculation of attention is quadratic in the number of input elements.
- For an image with $N$ pixels per side there are $N^2$ pixels, so attention over individual pixels would require $O(N^4)$ calculations.
- This is not a viable solution for images of realistic size.

```{figure} ../figs/deep_nlp/transformers/vit-pixels.gif
---
width: 70%
name: fig-vit-pixels
---
Vision Transformer with pixels
```


The solution is simple.

- The image is divided into patches.
- Each patch is converted into a vector using a linear projection.
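
A minimal NumPy sketch of these two steps is given below. The image size (224 × 224), the embedding size (64), and the random projection matrix `W_embed` are assumptions chosen for illustration; the 16 × 16 patch size matches the title of the ViT paper. The function name `patchify` is hypothetical.

```python
import numpy as np

def patchify(image, patch_size):
    """image: (H, W, C) -> (num_patches, patch_size * patch_size * C)."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be divisible by the patch size"
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)             # one flattened row per patch

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                     # random stand-in for an RGB image
patches = patchify(image, patch_size=16)              # shape (196, 768)

W_embed = rng.normal(size=(patches.shape[1], 64)) * 0.02   # untrained linear projection
tokens = patches @ W_embed                            # shape (196, 64): one vector per patch

n_pixels = 224 * 224              # 50,176 tokens if every pixel were a token
n_patches = patches.shape[0]      # 196 tokens with 16 x 16 patches
print(n_pixels ** 2, n_patches ** 2)
```

With one token per pixel the attention matrix would have roughly 2.5 billion entries; with patches it has 38,416. That reduction is what makes the approach practical.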

```{figure} ../figs/deep_nlp/transformers/vit-projection.gif
---
width: 70%
name: fig-vit-projection
---
Linear projection of the patches
```


### Vision Transformer Architecture

- Vectors obtained from a linear projection are then coupled with positional encoding vectors.
- The result is then passed to a classic transformer architecture.
- The result is a vector that represents the image.
- The vector is then passed to a classifier to obtain the final result.

```{figure} ../figs/deep_nlp/transformers/vit-architecture.gif
---
width: 70%
name: fig-vit-architecture
---
Vision Transformer Architecture
```


## Transformer in Transformer

In the transition from patch to vector, any kind of information about the position of pixels in the patch is lost. Is it possible to find a better way to get the vectors to submit to the transformer?

The authors of Transformer in Transformer (TnT) {cite}`han2021transformer` point out that the arrangement of pixels within a portion of the image is information we would not want to lose in order to make a quality prediction.

- Their proposal is to take each individual patch ($p \times p$) of the image, which is itself an image with 3 RGB channels, and transform it into a $c$-channel tensor.
- This tensor is then divided into $p^\prime$ parts with $p^\prime<p$; in the example, $p^\prime=4$.
- This yields $p^\prime$ vectors in $c$ dimensions.
- These vectors now contain information about the arrangement of pixels within the patch.
- They are then concatenated and linearly projected to make them the same size as the vector obtained from the linear projection of the original patch, and are combined with it.
- By doing this, the authors have managed to further improve performance on various computer vision tasks.


```{figure} ../figs/deep_nlp/transformers/tnt.gif
---
width: 70%
name: fig-tnt
---
Transformer in Transformer
```


```{figure} ../figs/deep_nlp/transformers/tnt-architecture.gif
---
width: 70%
name: fig-tnt-architecture
---
Transformer in Transformer Architecture
```


## TimeSformers

In 2021 Facebook researchers tried to apply this architecture to video as well.

- The idea is to divide the video into frames and then apply the same procedure as for images.
- There is only one small detail that makes them different from Vision Transformers.
- You have to take into account the temporal dimension of the video besides the spatial dimension.

![](../figs/deep_nlp/transformers/timesformer-video.gif)


The authors have suggested several new attention mechanisms, from those that focus exclusively on space, used primarily as a reference point, to those that compute attention axially, scattered, or jointly between space and time.


```{figure} ../figs/deep_nlp/transformers/timesformer-architecture.gif
---
width: 70%
name: fig-timesformer-architecture
---
TimeSformer Architecture
```

```{figure} ../figs/deep_nlp/transformers/timesformer-attentions.jpeg
---
width: 70%
name: fig-timesformer-attentions
---
TimeSformer Attention Mechanisms
```


- The method that has achieved the best results is Divided Space-Time Attention.
- Given a frame at instant $t$ and one of its patches as a query, it computes the spatial attention over the whole frame and then the temporal attention over the same patch position in the previous and next frames.
- Why does this approach work so well?
- The reason is that it learns more separable features than other approaches and is therefore better able to distinguish videos from different categories.
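
The idea of dividing attention between space and time can be sketched in NumPy as follows. This is a simplified illustration, not the TimeSformer implementation: it omits the learned projections and residual connections, applies spatial attention before temporal attention as in the description above, and lets the temporal step attend over all frames at the same patch position rather than only the neighbouring ones. The sizes `T`, `N`, and `D` and the function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (batch, n, d). Attention among the n tokens of each batch entry."""
    d = X.shape[-1]
    scores = X @ np.swapaxes(X, -1, -2) / np.sqrt(d)   # (batch, n, n)
    return softmax(scores, axis=-1) @ X

def divided_space_time_attention(tokens):
    """tokens: (T, N, D) = frames, patches per frame, embedding size."""
    # Spatial step: each patch attends to the N patches of its own frame.
    spatial = self_attention(tokens)                        # batched over the T frames
    # Temporal step: each patch attends to the same patch position across frames.
    temporal = self_attention(spatial.transpose(1, 0, 2))   # (N, T, D), batched over N
    return temporal.transpose(1, 0, 2)                      # back to (T, N, D)

rng = np.random.default_rng(0)
T, N, D = 8, 196, 16
tokens = rng.normal(size=(T, N, D))      # random stand-ins for patch embeddings
out = divided_space_time_attention(tokens)

joint_pairs = (T * N) ** 2               # joint attention: every token with every token
divided_pairs = T * N * (N + T)          # divided: N comparisons in space + T in time per token
print(joint_pairs, divided_pairs)
```

For these sizes, joint space-time attention compares 2,458,624 token pairs while the divided variant compares 319,872, which illustrates why splitting the two dimensions is much cheaper.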


We can see this in the following visualization where each video is represented by a point in space and its colour represents the category it belongs to.

```{figure} ../figs/deep_nlp/transformers/timesformer-divide.jpeg
---
width: 70%
name: fig-timesformer-divide
---
TimeSformer Divide Space-Time Attention
```


- The authors found that the higher the resolution, the better the accuracy of the model, up to a point.
- As for the number of frames, the accuracy again increases as the number of frames increases.
- It was not possible to run tests with more frames than shown in the graph, so the accuracy could potentially still improve.
- The upper limit of this improvement is not yet known.


```{figure} ../figs/deep_nlp/transformers/timesformer-resolution.jpeg
---
width: 70%
name: fig-timesformer-resolution
---
TimeSformer Resolution
```


In Vision Transformers it is known that a larger training dataset often results in better accuracy. This was also checked by the authors on TimeSformers and again, as the number of training videos considered increases, the accuracy also increases.

```{figure} ../figs/deep_nlp/transformers/timesformer-dataset.jpeg
---
width: 70%
name: fig-timesformer-dataset
---
TimeSformer Dataset
```


Transformers have just landed in the world of computer vision and seem to be more than determined to replace traditional convolutional networks or at least carve out an important role for themselves in this area.

- Transformers are a powerful architecture that has revolutionized the field of Natural Language Processing.
- They have been able to achieve impressive results in various tasks.
- They have also been applied to other areas such as computer vision.
- The results obtained are very promising and it is likely that they will continue to improve in the future.


## Multimodal Machine Learning

Now that we have a single architecture capable of working with different types of data, we can start to think about how to combine them.

- This is called multimodal machine learning.
- People are able to combine information from several sources to draw their own inferences.
- They simultaneously receive data by observing the world around them with their eyes, but also by smelling its scents, listening to its sounds or touching its shapes.
- This is why we are able to understand the world around us.
- We can now try to replicate this ability in machines.
- The problem lies in treating all the different inputs in the same way without losing information.
- Transformers are a good candidate for this task.
- They are able to process different types of data and combine them in a single architecture.


## VATT: Transformers for Multimodal Self-Supervised Learning

One of the most important applications of Transformers in the field of Multimodal Machine Learning is VATT {cite}`akbari2021vatt`.

```{figure} ../figs/deep_nlp/transformers/vatt.gif
---
width: 70%
name: fig-vatt
---
VATT Architecture
```


The proposed architecture is composed of a single Transformer Encoder on which three distinct forward calls are made.

- One call is made for each type of input data (video, audio, and text), which is always first transformed into a sequence of tokens.
- The transformer takes these sequences as input and returns three distinct sets of features.
- The features are then given as input to a contrastive estimation block that calculates a single loss and performs the backward pass.
- In this way, the loss reflects the error made on all three types of data considered.
- Over the training epochs, the model therefore learns to reduce it by better managing the information coming from all three sources.
- VATT represents the culmination of what Multimodal Machine Learning had been trying to achieve for years: a single model that handles completely different types of data together.
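
To give a feel for how a single contrastive loss can tie the three sets of features together, here is a generic InfoNCE-style sketch in NumPy. It is not VATT's exact objective: the loss form, the temperature value, the pairing of modalities, and the function name `info_nce` are all assumptions made for illustration, and the features are random stand-ins for the encoder outputs.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """a, b: (batch, d) features from two modalities; row i of a matches row i of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # unit-length features
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # cosine similarity of every pair
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # matching pairs lie on the diagonal

rng = np.random.default_rng(0)
video, audio, text = (rng.normal(size=(4, 16)) for _ in range(3))

# One scalar loss that depends on all three modalities at once.
loss = info_nce(video, audio) + info_nce(video, text)
print(loss)
```

The loss is small when features of matching clips are close and features of mismatched clips are far apart, so minimizing it pushes the shared encoder to align the three modalities.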


## GATO: A Generalist Agent

Is it possible to realize a neural network capable of receiving inputs of different types and then being able to perform different tasks?

- This is the question that the authors of GATO {cite}`reed2022generalist` have tried to answer.
- GATO is a multi-modal, multi-task, multi-embodiment generalist that represents one of the most impressive achievements in this field today.
- How does it work?
- GATO is composed of a single Transformer (a decoder-only model, according to the paper) that receives as input a sequence of tokens representing the different types of data.
- Thanks to this unification of inputs and to the Transformer architecture, GATO is able to learn to combine the different types of data and to perform different tasks, achieving an unprecedented level of generalisation.


![](../figs/deep_nlp/transformers/gato.gif)


![](../figs/deep_nlp/transformers/gato-examples.gif)
