![Banner](img/AI_Special_Program_Banner.jpg)

## Introduction to Transformer Models - Exercise - Sample Solution
---

This exercise is slightly different from the previous ones.
Your task is to find out more about Transformer models on your own. To do this, you will read the paper that is the cornerstone of many modern Transformer applications. The paper, titled **Attention Is All You Need**, was published by Google scientists in 2017 and can be downloaded [here](https://arxiv.org/abs/1706.03762).

If some of the terms in the paper sound unfamiliar to you, don't hesitate to consult other sources on the internet. It is also recommended to watch the following video, which explains the components of a Transformer model, step by step, in an excellent way:
* [Illustrated Guide to Transformers](https://www.youtube.com/watch?v=4Bdc55j80l8)

After you have done your research, try to answer the following questions:

## Question 1: What are the differences between Transformer Models and RNNs/LSTMs?

Answer:

Transformer Models do not rely on recurrence or convolutions to handle long-range dependencies in input data. Instead, they use a mechanism called self-attention. RNNs process data sequentially, one step at a time, and struggle with long-range dependencies because information from early positions has to pass through every intermediate step. Transformers use attention to relate any two positions in the data directly. This enables efficient handling of distant relationships and much greater parallelization during training. Because of that, Transformers are highly effective for large-scale language tasks and have achieved state-of-the-art results in many NLP applications. However, for shorter sequences or tasks where the sequential nature of the data is paramount, RNNs, including LSTMs and GRUs, can be more efficient in terms of memory and computational power. The choice between the two typically depends on criteria such as the size of the data, the complexity of the task, and computational limitations.

## Question 2: What does self-attention do in the context of Transformers?

Answer:

Self-attention is a method used in Transformer Models to compute a representation of a sequence by relating different positions within the sequence. This mechanism enables the model to assign varying degrees of importance to different segments of the input sequence when processing a particular part of the sequence. For instance, in a sentence such as **'His aim was to improve'**, self-attention would function in the following way: The model may assign significant importance to **'His'** and **'improve'** when processing the word **'aim'**, as these words are crucial to comprehending the meaning of **'aim'** in this context. The Transformer Model adjusts its attention to each word dynamically, based on its relevance to the word being processed. This enables the model to capture the nuances and dependencies within the sentence, allowing it to handle complex, context-dependent relationships in the data.
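The weighting described above can be sketched with scaled dot-product attention, the formulation used in the paper. The following is a minimal illustration only: the 2-dimensional embeddings are made-up values, and queries, keys and values are taken to be the raw embeddings, whereas a real Transformer derives them through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings for the example sentence (values are made up).
tokens = ["His", "aim", "was", "to", "improve"]
embeddings = [[1.0, 0.2], [0.9, 0.8], [0.1, 0.1], [0.0, 0.2], [0.7, 1.0]]

# Assumption: no learned projections, so Q = K = V = embeddings.
outputs, weights = self_attention(embeddings, embeddings, embeddings)

aim = tokens.index("aim")
for token, weight in zip(tokens, weights[aim]):
    print(f"attention from 'aim' to {token!r}: {weight:.3f}")
```

Each row of `weights` sums to 1, so every output vector is a weighted mix of all value vectors in the sentence. With these made-up numbers, 'aim' attends more strongly to 'improve' than to 'was'; in a trained model, the learned projections determine which words receive weight.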

## Question 3: What are positional encodings and why are they needed?

Answer:

Positional encodings are an important concept for Transformer Models as they provide information about the position of each token in the input sequence. The Transformer architecture does not inherently capture sequential order or the position of elements since it lacks recurrence or convolutions. To address this, positional encodings are added to the input embeddings, giving the model a sense of the order of words in the sequence. These encodings have the same dimensions as the embeddings, allowing for their summation. This approach is crucial for the model to comprehend the sequence order, which is essential for tasks such as language understanding and translation.

To understand positional encoding, consider the example of a sentence with three words: "I like sports." In a model without positional encoding, the words are represented only by their embeddings, which are vectors of numbers.

However, these embeddings give no indication of the order of the words. Positional encoding adds another vector to each word embedding, representing the position of each word in the sequence. As a simplified illustration, assuming 3-dimensional embeddings, 'I' could be represented by a positional vector of <code>[1, 0, 0]</code>, while 'like' would be <code>[0, 1, 0]</code>, and 'sports' would be <code>[0, 0, 1]</code>. By adding these positional vectors to the original word embeddings, the model can distinguish the word order: 'I' comes first, followed by 'like', and 'sports' is third.
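The paper itself does not use one vector slot per position. Instead, it uses fixed sine and cosine functions of different frequencies: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. A minimal sketch of that scheme follows; the 4-dimensional embeddings are made-up values and the function name is chosen for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: one d_model-dimensional vector per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical 4-dimensional embeddings for "I like sports" (values are made up).
tokens = ["I", "like", "sports"]
embeddings = [
    [0.1, 0.3, 0.2, 0.5],
    [0.7, 0.1, 0.4, 0.2],
    [0.3, 0.9, 0.6, 0.1],
]

pe = positional_encoding(seq_len=len(tokens), d_model=4)

# The encoding has the same dimension as the embedding, so the two are summed.
encoded = [[e + p for e, p in zip(emb, pos_vec)]
           for emb, pos_vec in zip(embeddings, pe)]

for token, vector in zip(tokens, encoded):
    print(token, [round(x, 3) for x in vector])
```

Because the encoding depends only on the position, the same word receives a different input vector at a different position, which is what lets the model tell word orders apart.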

## Question 4: What are some of the limitations/challenges associated with Transformer Models?

Answer:

The paper itself primarily focuses on the advantages and implementation of the Transformer. However, Transformer Models, although revolutionary in many respects, have limitations. Training them requires substantial computational resources, especially with large datasets. They also have memory efficiency issues because the self-attention mechanism scales quadratically with sequence length, which makes them less practical for very long sequences. Furthermore, these models may struggle to generalize to data that differs significantly from their training set. Another challenge is their lack of interpretability, as it can be difficult to understand how they arrive at a decision. Additionally, their performance often depends on large datasets, which can be limiting in situations where such data is scarce or unavailable.
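The quadratic scaling can be made concrete with a rough estimate of the memory needed for the attention score matrices alone. This is a back-of-the-envelope sketch under stated assumptions: one sequence, one layer, 8 attention heads (the number used in the paper's base model), and 4 bytes per stored value. The function name is hypothetical, and real implementations may use less or more memory.

```python
def attention_score_bytes(seq_len, num_heads=8, bytes_per_value=4):
    """Rough memory for the attention score matrices of one layer and one sequence."""
    # Each head stores one score per (query position, key position) pair.
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (512, 1024, 2048):
    mib = attention_score_bytes(n) / 2**20
    print(f"sequence length {n}: {mib:.0f} MiB")
```

Doubling the sequence length quadruples the estimate, which is why very long inputs quickly become expensive.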