# Lecture 19 - Transformer Networks

[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_19-Transformer_Networks/Lecture_19-Transformer_Networks.ipynb)
[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_19-Transformer_Networks/Lecture_19-Transformer_Networks.ipynb)

<a id='top'></a>

- [19.1 Introduction to Transformers](#19.1-introduction-to-transformers)
- [19.2 Self-attention Mechanism](#19.2-self-attention-mechanism)
- [19.3 Multi-head Attention](#19.3-multi-head-attention)
- [19.4 Encoder Block](#19.4-encoder-block)
- [19.5 Positional Encoding](#19.5-positional-encoding)
- [19.6 Using a Transformer Model for Classification](#19.6-using-a-transformer-model-for-classification)
- [19.7 Decoder Sub-network](#19.7-decoder-sub-network)
- [19.8 Vision Transformers](#19.8-vision-transformers)
- [References](#references)

## 19.1 Introduction to Transformers <a name='19.1-introduction-to-transformers'></a>

**Transformer Neural Networks**, or simply **Transformers**, are a neural network architecture introduced in 2017 in the now-famous paper [“Attention is all you need”](https://arxiv.org/abs/1706.03762). The title refers to the attention mechanism, which forms the basis for data processing with Transformers.

Transformer Networks have been the predominant type of Deep Learning models for NLP in recent years. They have replaced Recurrent Neural Networks in most NLP tasks, and nearly all Large Language Models employ the Transformer Network architecture. Transformer Networks have also been adapted for other tasks: they have outperformed other Machine Learning models in image and video processing, protein and DNA sequence prediction, and time-series data processing, and they have been used for reinforcement learning. Consequently, Transformers are arguably the most important Neural Network architecture at present.

## 19.2 Self-attention Mechanism <a name='19.2-self-attention-mechanism'></a>

**Self-attention** in NNs is a mechanism that lets a model attend to the most relevant portions of the data when making predictions. For instance, in NLP, the self-attention mechanism is used to identify the words in a sentence that are significant for a given query word in that sentence. That is, the model should pay more attention to some words, and less attention to other words that are less relevant for a given task.

In the following two sentences, in the left subfigure the word "it" refers to "street", while in the right subfigure the word "it" refers to "animal". Understanding the relationships between the words in such sentences has been challenging with traditional NLP approaches. Transformers use the self-attention mechanism to model the relationships between all words in a sentence, and assign weights to the other words in the sentence based on their importance. In the left subfigure, the mechanism estimated that the **query word** "it" is most related to the word "street", but the word "it" is also somewhat related to the words "The" and "animal". These words are referred to as **key words** for the query word "it". The intensity of the lines connecting the words, as well as the intensity of the blue color, signifies the attention scores (i.e., weights). The wider and bluer the lines, the higher the attention scores between two words are.

<img src="images/attn_1.png" width="700">

*Figure: Attention to words in sentences.*

Specifically, a Transformer Network compares each word to every other word in the sentence, and calculates attention scores. This is shown in the next figure, where, for example, the word "caves" has the highest **attention scores** for the words "glacier" and "formed". The attention scores are calculated as the dot (i.e., inner) product of the input representations of two words. That is, for each Query word $Q$ and Key word $K$, the attention score is $Q\cdot K$.


<img src="images/attn_2.png" width="300">

*Figure: Attention scores.*

Transformers employ word embeddings for representing the individual words in text sequences (where each text sequence can have one or several sentences). Recall from the previous lectures that **word embeddings** are vector representations of words, such that the vectors of words that have similar semantic meaning have close spatial positions in the embeddings space. Therefore, the attention scores are dot products of the embedding vectors for each pair of words in sentences.

The obtained attention scores for each word are first scaled (by dividing the values by $\sqrt d$) and afterward normalized so that they lie in the [0,1] range and sum to 1 for each query word (by applying a softmax function over the key words). That is, the attention scores are calculated as $a_{ij}=\text{softmax}\left(\frac{Q_i\cdot K_j}{\sqrt d}\right)$, where $d$ is the dimensionality of the embedding vectors. As we stated in previous lectures, the dimensionality of embedding vectors in modern Large Language Models typically ranges from 768 to 4,096 dimensions. Scaling the values by $\sqrt d$ keeps the magnitude of the dot products from growing with the dimensionality, which improves the flow of the gradients through the softmax during training.

The resulting scaled and normalized attention scores are then multiplied with the initial representation of the words, which in the self-attention module is referred to as **value** or $V$. This is shown in the next figure. The left subfigure shows the attention scores calculated as the product of the input representations of the words $Q$ and $K$, which are afterwards multiplied with the input representation $V$ to obtain the output of the self-attention module. Note that for text classification, all three terms Query, Key, and Value are the same input representation of the words in sentences. However, the original Transformer was developed for machine translation, where the words in the target language are queries, and the words in the source language are pairs of keys and values. This terminology is also related to search engines, which compare queries to keys, and return values (e.g., the user submits a query, the search engine identifies key words within the query to search for, and it returns the results of the search as values). Self-attention works in a similar way, where each query word is matched to other key words, and a weighted value is returned.

The right subfigure below shows how self-attention is implemented in Transformer Networks. Namely, `Matmul` stands for a matrix multiplication layer which calculates the dot product $Q\cdot K$, which is afterwards scaled by $\sqrt d$. Then there is an optional masking layer, and afterward the final attention scores are obtained by passing the result through a `Softmax` layer to obtain $\text{softmax}\left(\frac{Q_i\cdot K_j}{\sqrt d}\right)$. Finally, the attention scores are multiplied with $V$ via another matrix multiplication layer `Matmul` to calculate the output of the self-attention module. The optional masking layer can be used for two purposes: (a) to ensure that attention scores are not calculated for the padding tokens in padded sequences (e.g., 0 is often used as the padding token), but are calculated only for the positions that contain actual words; or (b) to set the attention scores for future tokens to zero, so that the model can only attend to previous tokens (as explained in the section below on decoder sub-networks).

<img src="images/attn_3.png" width="400">

*Figure: Self-attention in Transformer Networks*
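The steps in the figure can be sketched in a few lines of NumPy. This is a minimal illustration rather than the Keras implementation: the function names, the toy input, and the convention that `True` in the mask means "may attend" are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: arrays of shape (seq_len, d).
    mask: optional boolean array (seq_len, seq_len), True = may attend."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # Matmul + Scale, shape (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~0 weight after softmax
    weights = softmax(scores, axis=-1)         # Softmax: each row sums to 1
    return weights @ V, weights                # second Matmul

# Toy input: 4 tokens with 8-dimensional embeddings (random values, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Self-attention: Query, Key, and Value are the same representation
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # every row sums to 1
```

Passing a padding mask such as `np.array([[True, True, True, False]] * 4)` would give the last (padded) position zero weight for every query.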

In conclusion, self-attention is applied to capture the meaning of each word in a sentence based on its surrounding context. That is, Transformers use the attention scores to generate context-sensitive representations for each word based on the context of the sentence. During training, the representations of the words are refined and projected into a new embeddings space that takes the context into account.

## 19.3 Multi-Head Attention <a name='19.3-multi-head-attention'></a>

Transformer Networks include multiple self-attention modules in their architecture. Each self-attention module is called an **attention head**, and the aggregation of the outputs of multiple attention heads is called **multi-head attention**. For instance, the original Transformer model had 8 attention heads, while the Llama 3 8B language model has 32 attention heads.

The multi-head attention module is shown in the next figure, where the inputs are first passed through a linear layer (i.e., fully-connected dense layer), next they are fed to the multiple attention heads, and the outputs of all attention heads are concatenated, and passed through one more linear layer.

A logical question one may ask is: why are multiple attention heads needed? The reason is that multiple attention modules can learn different relationships between the words in sentences. Each module can extract context independently from the other modules, which allows the model to capture less obvious context and enhances its learning capabilities. For example, one head may capture the relationship between the nouns and numerical values in sentences, another head may focus on the relationship between the adjectives in sentences, another head may focus on rhyming words, etc. Also, if one head becomes too specialized in capturing one type of pattern, the other heads can compensate for it and provide redundancy that can improve the overall performance of the model.

Also, the computations of each attention head can be performed in parallel on different workers, which allows for accelerating the training and scaling up the models.

<img src="images/multihead_1.png" width="600">

*Figure: Multi-head attention*
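The flow in the figure (linear layers, parallel attention heads, concatenation, final linear layer) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the weight matrices are random stand-ins for learned parameters, and each head uses `d_model // num_heads` dimensions, as in the original paper; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model). W_q, W_k, W_v, W_o: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Linear layers, then one slice of the projections per head
    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed for all heads at once
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the final linear layer
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 16)
```

Because the heads are independent slices of the same tensors, they are computed here in a single batched matrix multiplication, which is what makes the parallel computation straightforward.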

## 19.4 Encoder Block <a name='19.4-encoder-block'></a>

The **Encoder Block** in Transformer Networks is shown in the next figure. It processes the input word embeddings and extracts representations in text data that can afterwards be used for different NLP tasks.

The components in the Encoder Block are:

- *Multi-head Attention layer*, which, as explained, consists of multiple self-attention modules.
- *Dropout layer*, which is a regular dropout layer.
- *Residual connections*, which are skip connections in neural networks, where the input to a layer is added to the processed output of the layer. Residual connections were popularized by the ResNet models, as they were shown to stabilize the training and mitigate the problems of *vanishing and exploding gradients* in neural networks (i.e., cases when the gradients become too small or too large during training). In the figure, the `Add` term in the layer refers to the residual connection, which adds the input embeddings to the output of the Dropout layer.
- *Layer Normalization*, which is an operation similar to batch normalization in CNNs. However, instead of normalizing each feature across the samples in a batch, it normalizes the features of each token independently of the other tokens and samples, and scales them to have 0 mean and 1 standard deviation. This type of normalization is more adequate for text data. And, as we learned in the previous lectures, normalization improves the flow of gradients during training. The `Norm` term in the figure refers to the Layer Normalization operation.
- *Feed Forward network*, which consists of 2 fully-connected (dense) layers that extract useful data representations.
- The Encoder Block also contains one more *Dropout layer*, and another *Add & Norm* layer that forms a residual connection for the input to the Feed Forward network and applies a layer normalization operation.

Larger Transformer networks typically include several encoder blocks in a sequence. For instance, in the original paper the authors used 6 encoder blocks.

<img src="images/enc_1.png" width="250">

*Figure: Encoder block*
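To make the `Add & Norm` step concrete, here is a small NumPy sketch of a residual connection followed by layer normalization. It is a simplified illustration: the learnable scale and shift parameters of a full layer normalization are omitted, and the random arrays are stand-ins for real embeddings and attention outputs.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalizes each token vector (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
inputs = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # 4 tokens, 8 features each
sublayer_output = rng.normal(size=(4, 8))              # stand-in for the attention output

# "Add & Norm": residual connection, then layer normalization
add_norm = layer_norm(inputs + sublayer_output)

print(add_norm.mean(axis=-1).round(6))  # approximately 0 for every token
print(add_norm.std(axis=-1).round(3))   # approximately 1 for every token
```

Note that the statistics are computed per token over its features, so the result for one token does not depend on the other tokens or on the batch size.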

The implementation of the Encoder Block in Keras and TensorFlow is shown in the cell following the imported libraries.

The Encoder Block is implemented as a custom layer which is a subclass of the `Layer` class in Keras. The `__init__()` constructor method lists the definitions of the layers in the Encoder, and the method `call()` provides the forward pass with the flow of information through the layers.

- *Multi-head attention* layer is implemented in Keras, and it can be directly imported. The arguments in the layer are: `num_heads` is the number of attention heads, and `key_dim` is the size of each attention head for the query and key (set below to the dimension of the embeddings of the input tokens).
- *Dropout* and *Normalization* layers are also directly imported, with arguments `rate` for the dropout rate, and `epsilon`, which is a small float added to the variance to avoid division by 0.
- *Feed forward network* includes 2 dense layers, with the number of neurons set to `ff_dim` and `embed_dim`, respectively.

The `call()` method specifies the forward pass of the network, and takes two parameters: `inputs` (the input embeddings to the network) and `training` (an argument which can be True or False). For the dropout layers, during the model training this argument is set to True and dropout is applied, while during inference the argument is set to False and dropout is not applied.

Each step in the `call()` method performs the data processing for one layer. Note that the `multi_head_attention` layer receives `inputs` twice: once as the query and once as the value. In Keras, the key defaults to the value when it is not provided, so the query, key, and value are all the same in self-attention. Also note the residual connections, which are implemented inside the calls to the layer normalization, e.g., the inputs are added to the output of the multi-head attention.

In [None]:
import tensorflow as tf
from tensorflow import keras
from keras.layers import MultiHeadAttention, LayerNormalization, Dropout, Dense, Embedding, Layer
from keras.layers import Input, GlobalAveragePooling1D
from keras.optimizers import Adam
from keras import Sequential, Model

In [None]:
class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.multi_head_attention = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.feed_forward_net = Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim),])
        self.layer_normalization1 = LayerNormalization(epsilon=1e-6)
        self.layer_normalization2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        # Self-attention: the same inputs are used as query and value (key defaults to value)
        multi_head_att_output = self.multi_head_attention(inputs, inputs)
        multi_head_att_dropout = self.dropout1(multi_head_att_output, training=training)
        # Add & Norm: residual connection followed by layer normalization
        add_norm_output_1 = self.layer_normalization1(inputs + multi_head_att_dropout)
        # Feed forward network
        feed_forward_output = self.feed_forward_net(add_norm_output_1)
        feed_forward_dropout = self.dropout2(feed_forward_output, training=training)
        # Second Add & Norm
        add_norm_output_2 = self.layer_normalization2(add_norm_output_1 + feed_forward_dropout)
        return add_norm_output_2

## 19.5 Positional Encoding <a name='19.5-positional-encoding'></a>

We mentioned that Transformers use word embeddings as inputs. However, the embeddings alone don't provide information about the order of words in sentences. Understandably, the order of the words in a sentence is important, and a different order of the words can convey a different meaning. To provide such information, the Transformer Network introduces a **positional encoding** for each word that is added to the input embedding, as shown in the next figure.

<img src="images/positional_encoding_1.png" width="300">

*Figure: Positional encoding*

There are several different ways to implement positional encoding. In the original Transformer paper, the positional encoding is a vector that has the same size as the word embedding vector, and the authors used sine and cosine functions of different frequencies to create the position vectors, so that all values lie in the range from -1 to 1. Using such positional encoding, each encoding vector corresponds to a unique position in a sequence of words. This type is called *sinusoidal positional encoding*.
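A minimal NumPy sketch of the sinusoidal positional encoding is shown below. It follows the formulas from the original paper, where even dimensions use $\sin(pos/10000^{2i/d})$ and odd dimensions use $\cos(pos/10000^{2i/d})$; the function name and the assumption of an even embedding size are choices made for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(maxlen, embed_dim):
    """Returns an array of shape (maxlen, embed_dim); assumes an even embed_dim."""
    positions = np.arange(maxlen)[:, np.newaxis]           # (maxlen, 1)
    even_dims = np.arange(0, embed_dim, 2)[np.newaxis, :]  # (1, embed_dim / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / embed_dim)

    encoding = np.zeros((maxlen, embed_dim))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

pe = sinusoidal_positional_encoding(maxlen=200, embed_dim=32)
print(pe.shape)   # (200, 32)
print(pe[0, :4])  # position 0: [0. 1. 0. 1.]
```

The resulting array can be added directly to a `(maxlen, embed_dim)` matrix of word embeddings, and it requires no training.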

Another popular way to implement positional encoding is by learning the vector representations for the positions in the input text, in the same way the word embeddings are learned. This type of positional encoding is referred to as *learned positional encoding*.

The following cell implements the addition of learned positional encoding to word embeddings in Keras. For both the token and positional embedding vectors we will use the `Embedding` layer in Keras, which we introduced in the previous lecture. The arguments in the `Embedding` layer are the input dimension `input_dim` and the dimension of the embedding vectors `output_dim`. For the token embeddings layer, the input dimension is the size of the vocabulary (`vocab_size` below), whereas for the positional embeddings layer the input dimension is the length of the text sequences (`maxlen` below).

In the `call` method, first the length of the text sequences is assigned to `maxlen`. The function `tf.range` is similar to NumPy's `arange` and creates numbers in the range from `start` to `limit` (exclusive) with a step `delta`. Next, the two separate `Embedding` layers are called, and the sum of the token and positional embeddings is returned.

In [None]:
class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_embeddings = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.positional_embeddings = Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, inputs):
        maxlen = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        position_embeddings = self.positional_embeddings(positions)
        input_embeddings = self.token_embeddings(inputs)
        return input_embeddings + position_embeddings

## 19.6 Using a Transformer Model for Classification <a name='19.6-using-a-transformer-model-for-classification'></a>

### Model Definition

We will now employ the layers that we defined above, to create a Transformer model for text classification.

It is a simple model that consists of the following parts:

- **Encoder**, which includes an `Input` layer that defines the maximum length of input sequences, `TokenAndPositionEmbedding` layer, and the `TransformerEncoder` layer.
- **Classifier**, which consists of a `GlobalAveragePooling1D` layer, followed by two `Dropout` and two `Dense` layers. Global Average Pooling averages the output vectors of the encoder over all words in a sequence, producing a single vector of size `embed_dim` for each sequence, and it passes those vectors to the dense layers to classify the text sequences.
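As a quick shape check, the pooling step can be mimicked in NumPy. The array below is a random stand-in for the encoder output with the shapes used in this section (200 tokens, 32 features); the variable names are hypothetical.

```python
import numpy as np

# Stand-in for the encoder output: batch of 2 sequences, 200 tokens, 32 features per token
rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(2, 200, 32))

# Average over the token axis (axis=1), which corresponds to what
# GlobalAveragePooling1D does for inputs shaped (batch, steps, features)
pooled = encoder_output.mean(axis=1)
print(pooled.shape)  # (2, 32)
```

Each sequence is therefore summarized by one 32-dimensional vector before it reaches the dense layers.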

In [None]:
maxlen = 200  # Maximum length of input sequences is 200 words
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Dense layer size in the feed forward network inside transformer
vocab_size = 20000  # The size of the vocabulary is 20k words

# encoder
inputs = Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, num_heads, ff_dim)(embedding_layer, training=True)

# classifier
x = GlobalAveragePooling1D()(x)
x = Dropout(0.1)(x)
x = Dense(100, activation="relu")(x)
x = Dropout(0.1)(x)
outputs = Dense(1, activation="sigmoid")(x)

model = Model(inputs=inputs, outputs=outputs)

The summary of the model is shown below.

In [None]:
model.summary()

### Loading the Dataset

Let's apply the model for sentiment analysis of the movie reviews in the IMDB dataset. The data is loaded from the Keras datasets, and it contains 25,000 training sequences and 25,000 validation sequences.

In [None]:
from keras.preprocessing.sequence import pad_sequences

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = pad_sequences(x_train, maxlen=maxlen)
x_val = pad_sequences(x_val, maxlen=maxlen)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 2s 0us/step
25000 Training sequences
25000 Validation sequences


### Model Training

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=64, epochs=10, validation_data=(x_val, y_val))

Epoch 1/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 19s 24ms/step - accuracy: 0.5652 - loss: 0.6764 - val_accuracy: 0.8119 - val_loss: 0.4656
Epoch 2/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.8448 - loss: 0.3845 - val_accuracy: 0.8706 - val_loss: 0.3065
Epoch 3/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9069 - loss: 0.2389 - val_accuracy: 0.8766 - val_loss: 0.2966
Epoch 4/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9254 - loss: 0.1961 - val_accuracy: 0.8730 - val_loss: 0.3121
Epoch 5/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9444 - loss: 0.1596 - val_accuracy: 0.8736 - val_loss: 0.3257
Epoch 6/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9560 - loss: 0.1269 - val_accuracy: 0.8712 - val_loss: 0.3491
Epoch 7/10
391/391

<keras.src.callbacks.history.History at 0x7c6e729fcf10>

### Model Evaluation

The next cell evaluates the accuracy of the model on the validation dataset.

In [None]:
# Evaluate the model on the validation set
loss, accuracy = model.evaluate(x_val, y_val)
print(f'Test Accuracy: {accuracy:.4f}')

782/782 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8529 - loss: 0.4955
Test Accuracy: 0.8522


Let's also predict the class labels for the first and second reviews in the validation dataset, and compare the predictions to the ground-truth labels.

In [None]:
# Make prediction on two validation samples
predictions = model.predict(x_val[0:2])
predictions

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 979ms/step


array([[0.0094576 ],
       [0.99988115]], dtype=float32)

In [None]:
# Compare to the labels for the two validation samples
y_val[0:2]

array([0, 1])

If we inspect the first review in the next cell, we will notice that the dataset was loaded as tokenized and indexed sentences. We can find the actual text in the reviews by retrieving the `word_index` for the dataset, which if you recall from the previous lecture, is a dictionary that has words from the training dataset as keys and the assigned indices as values. Several examples of words and indices are shown in the next cell.

In [None]:
print(x_val[0])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     1   591   202    14    31     6   717    10    10 18142 10698     5
     4   360     7     4   177  5760   394   354     4   123     9  1035
  1035  1035    10    10    13    92   124    89   

In [None]:
# Get the word index from the IMDB dataset
word_index = keras.datasets.imdb.get_word_index()

# Display several keys and values from word_index
sorted(word_index.items(), key=lambda x: x[1])[:10]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 1s 1us/step


[('the', 1),
 ('and', 2),
 ('a', 3),
 ('of', 4),
 ('to', 5),
 ('is', 6),
 ('br', 7),
 ('in', 8),
 ('it', 9),
 ('i', 10)]

Now, since we know the corresponding words, we can display the words in the first two reviews. It is noticeable that the first review is indeed negative, and the second is positive.

In [None]:
# Function to convert indices to words
def decode_review(sequence):
    """Decodes a sequence of integers back to words."""
    reverse_word_index = {value: key for key, value in word_index.items()}
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in sequence if i > 0])

print('Review 1:', decode_review(x_val[0]))

print('Review 2:', decode_review(x_val[1]))

Review 1: ? please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss
Review 2: psychological trickery it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely one of her darker characters but she plays it so perfectly and convincingly that it's scary michael burns does a good job as the mute young man regular altman player michael murphy has a small part the ? moody set fits the content of the story very well in short this movie is a powerful study of loneliness sexual repression and

Let's also consider one more example, where we will provide two sample sentences that are not in the training dataset, and we will obtain the predictions by the model. The first sentence has a positive sentiment, and the second is negative.

In [None]:
# Sample sentences to evaluate
sample_sentences = ["Excellent movie I loved it great cast performance",
                    "It was a terrible movie horrible script"]

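To tokenize the sentences, we will again use the `word_index` and assign the corresponding indices to the words in the sample sentences. Note that `imdb.load_data` shifts all word indices by 3 by default (the argument `index_from=3`), which is why `decode_review` above subtracts 3; therefore, we add the same offset here. Words that are not in the vocabulary are mapped to the index 2, which `load_data` uses by default for out-of-vocabulary words. The outputs are shown below. For instance, the word `excellent` has the index 318 in `word_index`, and it is encoded as 321.

In [None]:
# Tokenize the sample sentences by converting words to indices using word_index
def encode_review(review, index_from=3, oov_index=2):
    """Encodes words with the same index offset that imdb.load_data applies."""
    encoded = []
    for word in review.lower().split():
        index = word_index.get(word)
        # Unknown words and words outside the vocabulary map to the OOV index
        in_vocab = index is not None and index + index_from < vocab_size
        encoded.append(index + index_from if in_vocab else oov_index)
    return encoded

# Encode the sample sentences
sample_sentences_encoded = [encode_review(sentence) for sentence in sample_sentences]
sample_sentences_encoded

[[321, 20, 13, 447, 12, 87, 177, 239], [12, 16, 6, 394, 20, 527, 229]]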

In [None]:
word_index['excellent']

318

Next, let's pad the tokenized sample sentences, and ask the model to predict the sentiment. The prediction by the model is a positive review for the first sentence, and negative review for the second sentence.

In [None]:
# Pad the sequences
sample_sentences_padded = pad_sequences(sample_sentences_encoded, maxlen=200)

# Make predictions
predictions = model.predict(sample_sentences_padded)
predictions

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step


array([[0.54321724],
       [0.2842807 ]], dtype=float32)

## 19.7 Decoder Sub-network <a name='19.7-decoder-sub-network'></a>

The Transformer Network in the original paper was designed for machine translation. Unlike the text classification task, where the model predicts a class label for an input text sentence, in machine translation the model takes a text sentence in a source language and predicts the corresponding text sentence in a target language. Therefore, both the input and output of the model are text sequences. These types of models are called **sequence-to-sequence models**, a term that is often abbreviated to **seq2seq models**. Besides machine translation, other NLP tasks that employ seq2seq models include question answering, text summarization, dialog generation, and others.

The architecture of Transformer Networks designed to handle seq2seq tasks consists of encoder and decoder sub-networks.

- **Encoder sub-network** takes a source text sequence as an input, and extracts a useful representation of the text data.
- **Decoder sub-network** takes a target text sequence as an input, and it also receives the intermediate representation from the encoder sub-network. The decoder combines the information from the target sequence and the encoded source sequence, and learns to predict the next word (token) in the target sequence.

This is shown in the next figure, where the French sequence "Je suis etudiant" is translated into "I am a student". The decoder outputs one word at each time step until the end-of-sequence is reached.

<img src="images/transformer_decoding_2.gif" width="700">

*Figure: Decoder block*

Models that predict future values based on past observations, under the assumption that the current value depends on previous values, are called **autoregressive models**. Autoregressive text generation involves iteratively generating one token at a time, by predicting the next word or token based on the preceding words in the sequence. This approach is what allows chatbots to produce coherent and relevant responses.

An example of autoregressive text generation is shown in the next figure. For instance, the first word generated by the model is *Binge*, and given this word, the model assigns probabilities to all possible next words. The considered words shown in the top line in the figure include *on*, *and*, *of*, *is*, etc. The generated word is *drinking*. Next, the model considers the words *Binge drinking* and generates the next word. This process is repeated for the entire output text, where the model takes as input all previously generated words, and generates the next word in the sequence. Note that this process is computationally expensive, since the model needs to repeatedly evaluate all previous words to generate every new word; as a result, inference with autoregressive models requires significant resources and time.

<img src="images/next-word-prediction.png" width="600">

*Figure: Autoregressive text generation. Source: [7].*
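The generation loop described above can be sketched in plain Python. The "model" below is a hypothetical lookup table of next-word probabilities invented for this example (it conditions only on the last word, whereas a real Transformer decoder conditions on the entire prefix), and the word with the highest probability is selected at each step, a strategy known as greedy decoding.

```python
# Hypothetical next-token probabilities, keyed by the most recent token only.
# A real Transformer decoder would condition on the entire generated prefix.
TOY_MODEL = {
    "<start>": {"Binge": 0.9, "The": 0.1},
    "Binge": {"drinking": 0.6, "on": 0.2, "and": 0.2},
    "drinking": {"is": 0.7, "and": 0.3},
    "is": {"harmful": 0.8, "<end>": 0.2},
    "harmful": {"<end>": 1.0},
}

def next_token_probabilities(tokens):
    """Stand-in for a decoder forward pass over all tokens generated so far."""
    return TOY_MODEL[tokens[-1]]

def generate(max_new_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(tokens)  # the model receives the whole prefix
        best = max(probs, key=probs.get)          # greedy choice: most probable token
        if best == "<end>":                       # stop at the end-of-sequence token
            break
        tokens.append(best)                       # feed the new token back as input
    return tokens[1:]

print(" ".join(generate()))  # Binge drinking is harmful
```

The loop makes the cost visible: every new word requires another call to the model with a longer prefix.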

The architecture of the decoder is similar to the encoder and it is shown in the next figure. The upper part of the decoder is practically the same as the encoder, and it consists of a multi-head attention module with residual connections and layer normalization, followed by a feed-forward network with residual connections and layer normalization. The main difference from the encoder is the *masked multi-head attention* module in the lower part of the decoder. This module is inserted before the multi-head attention module in the decoder. The masked multi-head attention module applies masking to the next words in the target sequence, so that the network does not have access to those words. That is, during training, if the model needs to predict the 4th word in a sentence, masks will be applied to all words after the 3rd word, so that the model has access only to the words 1, 2, and 3, in order to predict the 4th word.

<img src="images/transformer.png" width="700">

*Figure: Transformer Network*

For example, in the next figure, for predicting the word *with* the model has access to the attention coefficients only for the words *Your*, *journey*, and *starts*, whereas the attention coefficients for the words *with*, *one*, and *step* are masked and the model does not have access to those values. This ensures that the model uses only the previous words to predict the next words in the target sequence. This type of mask is referred to as a **causal attention mask**.

<img src="images/causal_attention_mask.png" width="500">

*Figure: Causal attention mask. Source: [8].*
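A causal attention mask can be sketched with NumPy as a lower-triangular matrix applied to the attention scores before the softmax. This is a minimal sketch under stated assumptions: the random `scores` matrix stands in for the scaled dot products between queries and keys, and the helper names `causal_mask` and `masked_softmax` are hypothetical, not taken from a specific library.

```python
import numpy as np

def causal_mask(seq_len):
    # True where the query position (row) may attend to the key position (column)
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Masked positions get a score of -inf, so their softmax weight is exactly 0
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["Your", "journey", "starts", "with", "one", "step"]
rng = np.random.default_rng(0)
# Stand-in for the attention scores between every pair of tokens
scores = rng.normal(size=(len(tokens), len(tokens)))

weights = masked_softmax(scores, causal_mask(len(tokens)))
print(np.round(weights, 2))
```

In the resulting matrix, the row for *starts* has nonzero weights only for *Your*, *journey*, and *starts*, and every row still sums to 1.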

This also explains why, in the figure of the Transformer network, the inputs to the decoder sub-network are labeled "Outputs (shifted right)". The target sequence is shifted one position to the right before it is fed into the decoder, so that the decoder input for each prediction contains only the words that precede the word to be predicted. E.g., after predicting the 4th word, the input to the decoder for predicting the 5th word consists of words 1, 2, 3, and 4, and so on.
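The shifted decoder inputs can be illustrated with a short sketch. It assumes that the shift is implemented by prepending a start token to the target sequence, which is a common convention; the token names `<start>` and `<end>` and the helper `shift_right` are hypothetical.

```python
def shift_right(target, start_token="<start>", end_token="<end>"):
    # Decoder input: the target sequence moved one position to the right
    decoder_input = [start_token] + target
    # Labels: the word the decoder should predict at each position
    labels = target + [end_token]
    return decoder_input, labels

target = ["Your", "journey", "starts", "with", "one", "step"]
decoder_input, labels = shift_right(target)

# With a causal mask, position t sees decoder_input[:t + 1] and predicts labels[t]
for t in range(len(labels)):
    print(decoder_input[: t + 1], "->", labels[t])
```

For instance, the line for the label *with* lists `<start>`, *Your*, *journey*, and *starts* as the visible inputs.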

Finally, the output representations from the decoder are passed to a linear (dense) layer followed by a softmax layer, which outputs a probability for each word in the vocabulary learned from the training dataset, i.e., a probability distribution over the possible next words.
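The final projection can be sketched as follows. The sizes `d_model` and `vocab_size`, the random weights, and the function name `next_token_distribution` are hypothetical toy choices; in a trained model the weights are learned and the vocabulary is much larger.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5  # hypothetical toy sizes

# Stand-in for the decoder output representation at the last position
decoder_output = rng.normal(size=(d_model,))

# Parameters of the linear (dense) layer; random here, learned in practice
W = rng.normal(size=(d_model, vocab_size))
b = np.zeros(vocab_size)

def next_token_distribution(h, W, b):
    logits = h @ W + b                    # linear layer: one score per word
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

probs = next_token_distribution(decoder_output, W, b)
print(probs, probs.sum())
```

The output has one entry per vocabulary word, and the entries sum to 1.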

Also note the `Nx` marks in the figure. They indicate that the encoder and decoder blocks shown are repeated multiple times in the network. In the original Transformer Network, there are 6 encoder blocks and 6 decoder blocks. Stacking multiple blocks in the encoder and decoder sub-networks increases the learning capacity, as it allows the model to learn more abstract representations.

Note that Recurrent Neural Networks are also used as seq2seq models. Transformer Networks have several advantages over RNNs: they can inspect entire text sequences at once, capture context in long sequences, can be parallelized, and are more powerful in general. Conversely, RNNs process a sequence one token at a time, so they have difficulty finding correlations in long sequences (the information needs to pass through many processing steps), cannot perform parallel computations across the sequence (are slow to train), and their gradients can become unstable.

## 19.8 Vision Transformers <a name='19.8-vision-transformers'></a>

After the initial success of Transformer Networks in NLP, they were adapted for computer vision tasks as well. The initial Transformer model for vision tasks, proposed in 2021, is called **Vision Transformer (ViT)**.

The architecture of ViT is very similar to the Transformers used in NLP. However, Transformer Networks were designed for working with sequential data, while images are spatial data. Treating each pixel in an image as a separate token would be impractical and too time-consuming. Therefore, ViT splits each image into a set of smaller image patches (16x16 pixels) and uses the sequence of patches as input to the model (i.e., each image patch is considered a token). Each image patch is first flattened into a one-dimensional vector, and these vectors are then passed through a dense layer to learn an embedding for each patch. A learnable class embedding is prepended to the sequence of patch embeddings, positional embeddings are added, and the resulting sequence is fed to a standard Transformer encoder. The class embedding is a single extra token that does not correspond to any patch; its output representation from the encoder is used as the representation of the whole image for classification. The encoder block in ViT closely follows the encoder in the original Transformer Network. The steps are depicted in the figure below.

<img src="images/vision_transformer.gif" width="700">

*Figure: Vision Transformer*
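The patch-based input pipeline can be sketched with NumPy. This is a minimal sketch under stated assumptions: the 224x224 RGB input size, the random weights, and the names `image_to_patches`, `W_embed`, `cls_token`, and `pos_embed` are illustrative choices, and the sketch uses a single class token prepended to the patch sequence, as in the ViT paper.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh * gw, patch * patch * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))    # assumed input size, stand-in for a real image
patches = image_to_patches(image)    # one row per patch

d_model = 768                        # embedding dimension of ViT-Base
# Random stand-ins for parameters that are learned during training
W_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
cls_token = np.zeros((1, d_model))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0] + 1, d_model))

# Dense projection of each patch, class token prepended, positional embeddings added
tokens = np.concatenate([cls_token, patches @ W_embed], axis=0) + pos_embed
print(patches.shape, tokens.shape)   # (196, 768) (197, 768)
```

With these assumed sizes, a 224x224 image becomes a sequence of 196 patch tokens plus one class token, which is far shorter than a sequence of individual pixels.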

The authors trained 3 versions of ViT, called Base (12 encoder blocks, embedding dimension of 768, 86M parameters), Large (24 encoder blocks, embedding dimension of 1,024, 307M parameters), and Huge (32 encoder blocks, embedding dimension of 1,280, 632M parameters).
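The parameter counts above can be roughly checked with a back-of-the-envelope calculation. The sketch below counts only the weight matrices of the encoder blocks and assumes that the feed-forward hidden layer is 4 times wider than the embedding dimension; biases, layer normalization, the embeddings, and the classification head are ignored, so the results are slightly below the reported totals. The function name `approx_encoder_params` is hypothetical.

```python
def approx_encoder_params(num_blocks, d_model, mlp_ratio=4):
    # Attention: query, key, value, and output projections (d_model x d_model each)
    attention = 4 * d_model * d_model
    # Feed-forward network: two dense layers, d_model -> mlp_ratio * d_model -> d_model
    mlp = 2 * d_model * (mlp_ratio * d_model)
    return num_blocks * (attention + mlp)

for name, blocks, dim in [("Base", 12, 768), ("Large", 24, 1024), ("Huge", 32, 1280)]:
    print(f"{name}: ~{approx_encoder_params(blocks, dim) / 1e6:.0f}M weights")
# Base: ~85M weights
# Large: ~302M weights
# Huge: ~629M weights
```

The estimate shows that almost all parameters are in the encoder blocks, and that the count grows linearly with the number of blocks and quadratically with the embedding dimension.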

Various other versions of vision transformers have been introduced since, including MaxViT (Multi-axis ViT), Swin (Shifted Window ViT), DeiT (Data-efficient image Transformer), T2T-ViT (Token-to-token ViT), and others. These models achieved higher accuracy on many vision tasks in comparison to Convolutional Neural Networks (EfficientNet, ConvNeXt, NFNet). The following figure shows the accuracy on ImageNet.

<img src="images/imagenet_accuracy.png" width="500">

*Figure: Accuracy on the ImageNet dataset*

## References <a name='references'></a>

1. The Illustrated Transformer, Jay Alammar, available at: [https://jalammar.github.io/illustrated-transformer/](https://jalammar.github.io/illustrated-transformer/).
2. Keras Examples, Text classification with Transformer, available at: [https://keras.io/examples/nlp/text_classification_with_transformer/](https://keras.io/examples/nlp/text_classification_with_transformer/).
3. Using Pretrained BERT for Text Classification, Jean de Dieu Nyandwi, available at: [https://github.com/Nyandwi/machine_learning_complete/blob/main/9_nlp_with_tensorflow/5_using_pretrained_bert_for_text_classification.ipynb](https://github.com/Nyandwi/machine_learning_complete/blob/main/9_nlp_with_tensorflow/5_using_pretrained_bert_for_text_classification.ipynb).
4. Deep Learning with Python, Francois Chollet, Second Edition, Manning Publications, 2021.
5. TensorFlow Tutorials, Neural Machine Translation with a Transformer and Keras, available at: [https://www.tensorflow.org/text/tutorials/transformer](https://www.tensorflow.org/text/tutorials/transformer).
6. How the Vision Transformer (ViT) Works in 10 Minutes: An Image is Worth 16x16 Words, Nikolas Adaloglou, available at: [https://theaisummer.com/vision-transformer/](https://theaisummer.com/vision-transformer/).
7. Benedetta Cevoli, Chris Watkins and Kathleen Rastle, "Prediction as a basis for skilled reading: insights from modern language models," Royal Society Open Science 9(6), 2022.
8. Building a GPT-Style LLM Classifier From Scratch, Sebastian Raschka, available at: [https://www.linkedin.com/pulse/building-gpt-style-llm-classifier-from-scratch-sebastian-raschka-phd-itp5c/](https://www.linkedin.com/pulse/building-gpt-style-llm-classifier-from-scratch-sebastian-raschka-phd-itp5c/).



[BACK TO TOP](#top)