# Transformers

[On Transformers, TimeSformers, And Attention](https://www.topbots.com/transformers-timesformers-and-attention/)

![](../figs/deep_nlp/transformers/entelecheia_transformers.png)

The Transformer is a powerful deep learning architecture that has become the standard for many Natural Language Processing (NLP) tasks and is poised to reshape the field of Computer Vision as well.

- Google Brain published the paper "Attention Is All You Need" in 2017 {cite}`vaswani2017attention`.
- The paper introduced the Transformer, a deep learning model built around attention, a mechanism for focusing on the most relevant parts of the input.
- The model performs well on many NLP tasks, including machine translation, summarization, and question answering.
- It has also been applied successfully to other domains, including image classification and speech recognition.


![](../figs/deep_nlp/transformers/transformers-history.jpeg)

In 2020, researchers at Google Brain asked: will Transformers be as effective on images?

- The paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" {cite}`dosovitskiy2020image` introduced the Vision Transformer (ViT).
- The model performs well on image classification, and Transformer-based models have since been applied to other Computer Vision tasks such as object detection.

At the beginning of 2021, Facebook researchers published a Transformer variant for video understanding, called TimeSformer.

- The paper "Is space-time attention all you need for video understanding?" {cite}`bertasius2021space` was published in 2021.


## Why do we need transformers?

What are the problems with the previous models?

- Previous models were based on Recurrent Neural Networks (RNNs).
- RNNs process sequences of data, such as text or audio.
- One of their main problems is sequential operation.
- The model must process the input in order: the first word, then the second word, and so on.
- Because each step depends on the previous one, the computation is difficult to parallelize.
- Other problems include vanishing and exploding gradients and difficulty capturing dependencies between distant words in the same sentence.

For example, to translate a sentence from English to Italian with this type of network, the first word of the sentence is passed into an encoder together with an initial state. The resulting state is passed into a second encoder together with the second word, and so on until the last word. The state produced by the last encoder is then passed to a decoder, which outputs the first translated word and a new state; that state is passed to the next decoder, and so on.

![](../figs/deep_nlp/transformers/problem-rnn.gif)
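The sequential dependency can be made concrete with a minimal sketch. The cell below is a plain tanh recurrence with random, untrained weights; the names `rnn_step`, `encode`, `W_x`, `W_h` and all sizes are illustrative assumptions, not taken from a specific model. The only point is that step *t* cannot start before step *t-1* has finished.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 4, 3, 5

# Hypothetical, untrained parameters of a single recurrent cell.
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))

def rnn_step(x_t, h_prev):
    """One encoder step: combine the current word with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def encode(inputs):
    """Run the cell over the sequence, one word at a time."""
    h = np.zeros(hidden_dim)              # initial state
    states = []
    for x_t in inputs:                    # each step needs the previous state
        h = rnn_step(x_t, h)
        states.append(h)
    return np.stack(states)

sentence = rng.normal(size=(seq_len, embed_dim))  # stand-in word embeddings
states = encode(sentence)
final_state = states[-1]                  # what would be handed to the decoder
```

The `for` loop is the bottleneck: no matter how much hardware is available, the five steps run one after another.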

## Attention is all you need?

Is there a mechanism that can be computed in parallel and still lets us extract the information we need from the sentence?

![](../figs/deep_nlp/transformers/attention.gif)

> `I gave my dog Charlie some food.`

- Focusing on the word “gave,” which other words in the sentence should we pay attention to in order to add context to “gave”?
- You might ask yourself, “Who gave the food to the dog?” In this case, the attention mechanism would focus on the word “I.”
- If you ask yourself, “To whom did I give the food?”, the attention mechanism would focus on the words “dog” and “Charlie.”
- If you ask yourself, “What did I give to the dog?”, the attention mechanism would focus on the word “food.”

### How do we implement this attention mechanism?

- To understand how attention is computed, we can draw a parallel with databases.
- When we search a database, we submit a query (Q) and look among the available data for one or more keys that satisfy the query.
- Here, the keys are the words in the sentence and the query is the word we want to focus on.
- The result of the search is the value associated with the key that satisfies the query.
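The database analogy can be sketched in a few lines. A Python `dict` performs a hard lookup: the query must equal a key exactly. Attention replaces that with a soft lookup, where every key contributes in proportion to how well it matches the query. The toy table, the toy vectors, and the helper `soft_lookup` are made up for illustration.

```python
import math

# Hard lookup: the query must match one key exactly.
table = {"dog": "Charlie", "gave": "verb"}    # toy table
hard_result = table["dog"]

# Soft lookup: keys and queries are vectors, and every entry
# contributes according to its similarity with the query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 2-d keys
values = [10.0, 20.0, 30.0]                   # toy payloads, one per key

def soft_lookup(query, keys, values):
    # Dot product as the similarity score between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the values instead of a single hit.
    result = sum(w * v for w, v in zip(weights, values))
    return weights, result

weights, result = soft_lookup([5.0, -5.0], keys, values)
```

With this query almost all of the weight lands on the first key, so `result` stays close to the first value, but the other entries still contribute a little.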


![](../figs/deep_nlp/transformers/attention-calculate.gif)

- We begin by treating the sentence on which to compute attention as a set of vectors.
- Each word is encoded, via a word embedding mechanism, into a vector K.
- These vectors are the keys to search against for a given query.
- A query is a vector Q that represents the word we want to focus on.
- The query can be a word from the same sentence (self-attention) or a word from another sentence (cross-attention).
- We then compute the similarity between the query Q and each of the available keys K.
- The similarity is computed by multiplying the query Q by the transpose of the keys K, that is, by taking the dot product of Q with each key.
- The result is a vector of scores, one per key, measuring how similar the query is to that key.
- The scores are normalized into a probability distribution by applying the softmax function.
- The output of the softmax is a vector of probabilities: the attention weights.
- The attention weights are then multiplied by the word vectors of the sentence (in this simplified description, the keys themselves), so that each word vector is scaled by its weight.
- Summing the scaled vectors gives the context vector C for the word we want to focus on.
- C has the same dimensionality as the keys K; each of its elements is a weighted sum of the corresponding elements of the keys.
- C is then passed to a linear (fully connected) layer to obtain the final result of the attention mechanism.
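The steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the embeddings `K` and the linear layer `W_out` are random stand-ins, and the function name `attention` and the dimension `d = 8` are arbitrary choices.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, K, W_out):
    """Attention for one query vector q over the rows of K."""
    scores = q @ K.T              # one similarity score per word
    weights = softmax(scores)     # attention weights, sum to 1
    context = weights @ K         # context vector C: weighted sum of the keys
    return weights, context @ W_out   # final linear layer

rng = np.random.default_rng(0)
words = "I gave my dog Charlie some food".split()
d = 8
K = rng.normal(size=(len(words), d))   # stand-in word embeddings (keys)
W_out = rng.normal(size=(d, d))        # stand-in linear layer
q = K[words.index("gave")]             # self-attention: the query is "gave"

weights, output = attention(q, K, W_out)
```

The original Transformer additionally learns separate projections for queries, keys, and values and scales the scores by the square root of the key dimension; both are left out here to stay close to the description above.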


![](../figs/deep_nlp/transformers/attention-focus.jpeg)

- Each vector represents a word in the sentence.
- The word we want to focus on is represented by the vector Q.
- We compute the similarity between Q and each of the word vectors in the sentence.
- The similarity is the dot product of Q with the vector of each word.
- Each dot product is a scalar that measures the similarity between Q and that word.
- The scalars are passed together through a softmax function, which normalizes them to values between 0 and 1 that sum to 1.
- The result of the softmax is the attention vector.
- The attention vector has one entry per word in the sentence; each entry is the attention that should be given to that word.
- Each word vector is then multiplied by its entry in the attention vector.
- Summing the results gives a single vector with the same dimension as a word vector: the weighted sum of the words in the sentence.
- The weighted sum is then passed through a linear (fully connected) layer to obtain the final result.
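The walk-through handles one query at a time. Because each query is independent of the others, the same recipe can be applied to every word at once with two matrix products, which is what makes attention easy to parallelize. The sketch below assumes random stand-in embeddings `X` and omits the final linear layer; `self_attention` is a hypothetical helper name.

```python
import numpy as np

def self_attention(X):
    """Apply the single-query recipe to every row of X at once."""
    scores = X @ X.T                                   # (n, n) pairwise similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights, weights @ X                        # shapes (n, n) and (n, d)

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 8))    # 7 words, 8-dimensional stand-in embeddings
weights, contexts = self_attention(X)
```

Row *i* of `weights` is the attention vector for word *i*, and row *i* of `contexts` is its weighted sum, so no loop over positions is needed.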

### Multi-head attention

This mechanism would be sufficient if we wanted to focus on a single word. However, we want to look at the sentence from several points of view.

- With a similar mechanism, we can run several attention computations in parallel, each with its own queries and keys, to focus on different words in the sentence.
- The results are then concatenated into a single vector that summarizes all the attention mechanisms.
- This mechanism is called multi-head attention.

![](../figs/deep_nlp/transformers/attention-multihead.jpeg)
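A minimal sketch of multi-head attention, assuming two heads and random, untrained projection matrices. Each head gets its own query, key, and value projections (as in the original Transformer paper, which also scales the scores by the square root of the head dimension); the names `multi_head_attention`, `heads`, `W_out` and all sizes are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_out):
    """Run one attention per head, concatenate, then mix with W_out."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # head-specific views of X
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)                  # (n_words, d_head)
    return np.concatenate(outputs, axis=-1) @ W_out  # (n_words, d_model)

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 7, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(n_words, d_model))              # stand-in embeddings
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_out = rng.normal(size=(n_heads * d_head, d_model))

output = multi_head_attention(X, heads, W_out)
```

Because every head has different projections, each one can assign its attention weights differently, which is the "several points of view" idea; the concatenation followed by `W_out` merges them back into one vector per word.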

## Transformer Architecture
