<a target="_blank" href="https://colab.research.google.com/github/avakanski/Fall-2022-Python-Programming-for-Data-Science/blob/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2019%20-%20Transformer%20Networks/Lecture%2019%20-%20Transformer%20Networks.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a name='section0'></a>
# Lecture 19 Transformer Networks

- [19.1 Introduction to Transformers](#section1)
- [19.2 Self-attention Mechanism](#section2)
- [19.3 Multi-head Attention](#section3)
- [19.4 Encoder Block](#section4)
- [19.5 Fine-tuning a Pretrained BERT Model](#section5)
- [19.6 Decoder Block](#section6)
- [19.7 Vision Transformers](#section7)





<a name='section1'></a>

# 19.1 Introduction to Transformers 

***Transformer Neural Networks***, or simply ***Transformers***, are a neural network architecture introduced in 2017 in the now famous paper [“Attention is all you need”](https://arxiv.org/abs/1706.03762). The title refers to the attention mechanism, which forms the basis for data processing with Transformers.

Transformer models have been the predominant type of deep learning models used for NLP in recent years. They have largely replaced Recurrent Neural Networks in NLP tasks, and most current Large Language Models employ the Transformer network architecture. Transformer networks have also been adapted for other tasks, where they have matched or outperformed other machine learning models: image and video processing, protein and DNA sequence prediction, time-series analysis, and reinforcement learning. Consequently, Transformers are currently arguably the most important neural network architecture.

<a name='section2'></a>

# 19.2 Self-attention Mechanism

***Self-attention*** in neural networks is a mechanism that lets a model attend to the most relevant portions of the input data when making predictions. For instance, in NLP, the self-attention mechanism is used to identify the words in a sentence that are significant for a given (query) word in that sentence. That is, the model should pay more attention to some words in a sentence, and less attention to other words that are less relevant for a given task.

In the following figure two sentences are shown, where in the left subfigure the word "it" refers to "street", while in the right subfigure the word "it" refers to "animal". Understanding the relationships between the words in such sentences has been challenging with traditional NLP approaches. Transformers use the self-attention mechanism to model the relationships between all words in a sentence, and assign weights to the other words in the sentence based on their importance. In the left subfigure, the mechanism estimated that the ***query word*** "it" is most related to the word "street", but the word "it" is also somewhat related to the words "The" and "animal". These words are referred to as ***key words*** for the query word "it". The width of the lines connecting the words and the intensity of the blue color signify the attention scores: the wider and bluer the line, the higher the attention score between the two words.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2019%20-%20Transformer%20Networks/images/attn_1.png' width=600px/>



Specifically, the Transformer network compares each word to every other word in the sentence, and calculates an attention score for each pair. This is shown in the next figure, where, for example, the word "caves" has the highest ***attention scores*** for the words "glacier" and "formed". The attention scores are calculated as the dot (inner) product of the input representations of every two words. That is, for each Query word $Q$ and Key word $K$, the attention score is $Q\cdot K$.


<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2019%20-%20Transformer%20Networks/images/attn_2.png' width=250px/>
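The pairwise dot products described above can be sketched in a few lines of plain Python. This is only an illustration: the 4-dimensional vectors below are made-up values standing in for word embeddings (they are not taken from the figure or from any trained model), and `dot` is a small helper defined here.

```python
# Hypothetical 4-dimensional embeddings for three words (made-up values).
embeddings = {
    "glacier": [1.0, 0.5, 0.0, 0.2],
    "caves":   [0.9, 0.4, 0.1, 0.3],
    "the":     [0.0, 0.1, 0.9, 0.0],
}

def dot(u, v):
    """Dot (inner) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

words = list(embeddings)

# scores[i][j] is the raw attention score between query word i and key word j.
scores = [[dot(embeddings[q], embeddings[k]) for k in words] for q in words]

for word, row in zip(words, scores):
    print(word, [round(s, 2) for s in row])
```

With these illustrative vectors, the query word "caves" gets a much larger score for "glacier" than for "the", because the vectors for "caves" and "glacier" point in similar directions.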



As we mentioned in the previous lecture, Transformers employ word embeddings for representing the individual words in sequences. Recall that ***word embeddings*** convert each word into a feature vector, such that the feature vectors of words with similar semantic meaning have close spatial positions in the embedding space. Therefore, the attention scores are dot products of the feature vectors (embeddings) for each pair of words in a sentence.

The attention scores obtained for each word are first scaled (by dividing the values by $\sqrt d$) and then normalized by applying a softmax function, so that they lie in the [0,1] range and sum to 1. That is, the attention scores are calculated as $a_{ij}=softmax(\frac{Q_i\cdot K_j}{\sqrt d})$, where the softmax is taken over all key words $j$ for a given query word $i$, and $d$ is the dimensionality of the feature vectors in the embedding space. The resulting scaled and normalized attention scores are then multiplied with the initial representation of the words, which in the self-attention module is referred to as ***value*** or $V$. The output for each query word is therefore a weighted sum of the value vectors.
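As a minimal sketch of the formula above for a single query word, the following plain-Python snippet scales the dot products by $\sqrt d$, applies a softmax, and uses the resulting weights to combine the value vectors. The matrix `X` holds made-up embeddings, and the helper names `softmax` and `attention_weights` are chosen here for illustration only.

```python
import math

def softmax(xs):
    """Normalize a list of scores into positive weights that sum to 1."""
    m = max(xs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Return softmax(query . key_j / sqrt(d)) over all keys j."""
    d = len(query)
    scaled = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scaled)

# Made-up 4-dimensional embeddings for three words (one row per word).
X = [
    [1.0, 0.5, 0.0, 0.2],
    [0.9, 0.4, 0.1, 0.3],
    [0.0, 0.1, 0.9, 0.0],
]

# The second word acts as the query; all words act as keys and values.
weights = attention_weights(X[1], X)

# New representation of the query word: weighted sum of the value vectors.
output = [sum(w * v[c] for w, v in zip(weights, X)) for c in range(len(X[0]))]

print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

The weights always sum to 1, so the output is a blend of the value vectors in which the words most similar to the query contribute the most.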

This is shown in the next figure. The left subfigure shows the attention scores calculated as the product of the input representations of the words $Q$ and $K$, which are afterwards multiplied with the input representations $V$ to obtain the output of the module. Note that for text classification, all three terms Query, Key, and Value are the input representations of the words in a sentence. However, the original Transformer was developed for machine translation, where the words in the target language are queries, and the words in the source language are pairs of keys and values. This terminology is also related to search engines, which compare queries to keys and determine values. Self-attention works in a similar way, where each query word is matched to the key words, and a weighted value is returned.
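In matrix form, if the query, key, and value vectors of all words are stacked as the rows of the matrices $Q$, $K$, and $V$, the computation described above is commonly written as a single expression, with the softmax applied to each row separately. The name $\text{Attention}$ is the notation used in the original paper, not something defined earlier in this lecture:

$$\text{Attention}(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt d}\right)V$$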

The right subfigure below shows how self-attention is implemented in Transformer networks. Namely, `Matmul` stands for a matrix multiplication layer that calculates the dot products $Q\cdot K$, which are afterwards scaled by dividing by $\sqrt d$. Then there is an optional masking layer, and the final attention scores are obtained by passing the result through a `Softmax` layer to obtain $softmax(\frac{Q_i\cdot K_j}{\sqrt d})$. Finally, the attention scores are multiplied with $V$ via another matrix multiplication layer `Matmul` to calculate the output of the self-attention module.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2019%20-%20Transformer%20Networks/images/attn_3.png' width=400px/>
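The sequence of blocks described above (`Matmul`, scale, optional mask, `Softmax`, `Matmul`) can be sketched as one function. This is a dependency-free illustration, not the implementation used by any particular library: matrices are lists of rows, the input `X` contains made-up values, and the mask convention (`True` means the query may attend to that key, and every row keeps at least one key) is an assumption made for this example.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    scores = matmul(Q, transpose(K))                              # first Matmul
    scores = [[s / math.sqrt(d) for s in row] for row in scores]  # scale
    if mask is not None:                                          # optional mask
        # Masked positions get -inf, so their softmax weight becomes 0.
        scores = [[s if keep else -math.inf for s, keep in zip(row, mask_row)]
                  for row, mask_row in zip(scores, mask)]
    weights = softmax_rows(scores)                                # Softmax
    return matmul(weights, V), weights                            # second Matmul

# Made-up embeddings for three words; in self-attention Q, K, and V are all X.
X = [
    [1.0, 0.5, 0.0, 0.2],
    [0.9, 0.4, 0.1, 0.3],
    [0.0, 0.1, 0.9, 0.0],
]
out, weights = scaled_dot_product_attention(X, X, X)

# Example mask: each word may attend only to itself and to earlier words.
n = len(X)
causal_mask = [[j <= i for j in range(n)] for i in range(n)]
masked_out, masked_weights = scaled_dot_product_attention(X, X, X, mask=causal_mask)

print([[round(w, 3) for w in row] for row in masked_weights])
```

With the example mask, the first word can attend only to itself, so its output equals its own value vector, while later words mix in information from the words before them.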

In conclusion, self-attention is applied to determine the meaning of the words in a sentence based on the context. That is, Transformers use the attention scores to modify the input vector representation of each word and generate a new representation based on the context of the sentence. During the training of the network, the representations of the words are updated and projected into a new embedding space that takes the context into account.

<a name='section3'></a>

# 19.3 Multi-Head Attention

Transformer networks include multiple self-attention modules in their architecture. Each self-attention module is called an ***attention head***, and the aggregation of the outputs of multiple attention heads is called ***multi-head attention***. For instance, the original Transformer model had 8 attention heads, while the GPT-3 language model has 96 attention heads.

The multi-head attention module is shown in the next figure. The inputs are first passed through a linear layer (i.e., dense or fully-connected layer) and then fed to the multiple attention heads. The outputs of all attention heads are concatenated and passed through one more linear (dense) layer.

A logical question is why multiple attention heads are needed. The reason is that multiple attention modules can learn different relationships between the words in sentences. Each module can extract context independently from the other modules, which allows the model to capture less obvious context and enhances its learning capabilities.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2019%20-%20Transformer%20Networks/images/multihead_1.png' width=500px/>
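The data flow through the multi-head attention module (linear layers, parallel attention heads, concatenation, final linear layer) can be sketched as follows. This is a simplified, dependency-free illustration under several assumptions: the weight matrices `W_q`, `W_k`, `W_v`, and `W_o` are random and untrained, bias terms are omitted, each head works on an equal slice of the projected features, and the input `X` contains made-up values. In practice, a deep learning framework provides this layer.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random weights standing in for a trained linear layer."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # Linear layers applied to the inputs (random, untrained weights).
    W_q, W_k, W_v, W_o = (random_matrix(d_model, d_model, rng) for _ in range(4))
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)

    # Each head attends over its own slice of the projected features.
    head_outputs = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        head_outputs.append(attention([row[cols] for row in Q],
                                      [row[cols] for row in K],
                                      [row[cols] for row in V]))

    # Concatenate the head outputs word by word, then apply the final linear layer.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Made-up 4-dimensional embeddings for three words, processed by two heads.
X = [
    [1.0, 0.5, 0.0, 0.2],
    [0.9, 0.4, 0.1, 0.3],
    [0.0, 0.1, 0.9, 0.0],
]
out = multi_head_attention(X, num_heads=2, rng=random.Random(0))
print(len(out), len(out[0]))
```

The output has the same shape as the input (one vector of size `d_model` per word), which is what allows such modules to be stacked.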

<a name='section4'></a>

# 19.4 Encoder

work in progress ...

<a name='section6'></a>

# References

1. Complete Machine Learning Package, Jean de Dieu Nyandwi, available at: [https://github.com/Nyandwi/machine_learning_complete](https://github.com/Nyandwi/machine_learning_complete).
2. Deep Learning with Python, Francois Chollet, Second Edition, Manning Publications, 2021.


[BACK TO TOP](#section0)