# Deep learning recap

### Summary

The text discusses the evolution of machine learning from hard-coded rules for classification, exemplified by the Iris dataset, to modern approaches using deep learning and transformer architectures, which enable tasks like image recognition and natural language processing for complex data analysis and generation.

### Highlights

- 💾 Early approaches relied on hard-coded rules to distinguish classes, such as classifying Iris flowers based on petal and sepal dimensions.
- 🌸 The Iris dataset is a well-known example illustrating early machine learning classification techniques.
- 🤖 Modern machine learning utilizes deep learning models, including transformer architectures developed in 2017.
- 🧠 Deep learning employs neural networks with interconnected neurons that learn patterns in complex data through layers.
- ⚙️ Neurons apply mathematical operations, and the connections (weights) between them are adjusted during training.
- 📊 Training involves iteratively updating weights to minimize the difference between predicted and actual outputs.
- 🗣️ Advanced models, like large language models, can now perform tasks such as image recognition and natural language conversation, for example recognizing a flower in a photo and discussing it.
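
The idea of iteratively updating weights to shrink the gap between predicted and actual outputs can be sketched with a single weight. This is a minimal illustration, not how a real deep network is trained: the data, the learning rate, and the `train` helper are all made up for the example.

```python
# Toy training loop: fit y = w * x by gradient descent on mean squared error.
# The data, learning rate, and step count are made-up illustrative values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the "true" weight w = 2


def train(xs, ys, lr=0.01, steps=200):
    w = 0.0  # initial weight
    for _ in range(steps):
        # Gradient of mean((w * x - y) ** 2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # move the weight against the gradient
    return w


w = train(xs, ys)
print(w)  # approaches 2.0 as the prediction error shrinks
```

A real network repeats the same kind of update for millions of weights at once, with the gradients computed by backpropagation.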

# The problem with RNNs

### Summary

The text explains the limitations of recurrent neural networks (RNNs) in processing language, particularly long sequences, and introduces transformers as a more effective architecture. Transformers overcome the sequential processing and memory issues of RNNs by enabling parallelization and utilizing an "attention" mechanism to focus on the most relevant words for context.

### Highlights

- 🧠 Different neural network types excel at specific tasks; convolutional neural networks (CNNs) are good for vision, but traditional neural networks struggle with language.
- 🗣️ Recurrent neural networks (RNNs) were the primary language processing models before transformers, processing text sequentially.
- ⏳ RNNs struggled with long texts, often "forgetting" information from the beginning, hindering context understanding.
- 🔗 Understanding language often requires context from earlier in the text, as demonstrated by pronoun references across sentences.
- ⚡ Sequential processing in RNNs prevented effective parallelization, leading to long training times and limitations on dataset size.
- 🚀 Transformers address these limitations by allowing parallel processing and incorporating an "attention" mechanism.
- 🎯 The "attention" mechanism enables transformers to weigh the importance of different words in a text, leading to better contextual understanding and the development of powerful language models.
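
The sequential bottleneck and the "forgetting" effect can be seen in a toy recurrent cell. This is a rough sketch under stated assumptions: a single hidden unit, and hypothetical fixed weights `w_in` and `w_rec` that a real RNN would learn.

```python
import math


def rnn_forward(inputs, w_in=0.5, w_rec=0.9):
    """Run a one-unit recurrent cell over a sequence of numbers.

    w_in and w_rec are hypothetical fixed weights; a real RNN learns them.
    """
    h = 0.0  # hidden state, the network's only "memory"
    states = []
    for x in inputs:
        # Each step needs the previous h, so the steps cannot run in parallel.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


# Only the first input is non-zero; its trace in the hidden state fades step by step.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0])
print(states)
```

The loop makes both problems visible: step `t` cannot start before step `t - 1` finishes, and the influence of the first input on the hidden state gets smaller at every later step.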

# The solution: attention is all you need

### Summary

The text discusses the "attention" mechanism introduced in the 2017 paper "Attention is All You Need," which is fundamental to transformer models. This mechanism allows the model to weigh the importance of different parts of the input sequence when generating the output, enabling it to handle long-range dependencies and improve tasks like language translation by considering word order and context. Self-attention, a specific type, focuses on relationships within a single input sequence.

### Highlights

- 📰 The 2017 paper "Attention is All You Need" introduced transformer models and the crucial attention mechanism.
- ⚖️ The attention mechanism assigns weights to input tokens, indicating their importance for generating the output.
- 🔎 Unlike processing the entire input uniformly, attention allows the model to selectively focus on relevant parts at each step.
- 🌐 This weighting scheme enables transformers to understand and manage long-range dependencies in data.
- 🗣️ Language translation benefits significantly from attention, as it allows the model to consider word order differences between languages, unlike simple word-to-word translation.
- 📊 Visualizations show that attention mechanisms in translation don't have a one-to-one word correspondence but focus on relevant tokens for accurate output.
- 🔁 Self-attention is a specific type of attention that computes relationships within a single input sequence, capturing contextual information between elements.

# The transformer architecture

### Summary

The provided text introduces the Transformer architecture, highlighting its ability to process all input words simultaneously, unlike Recurrent Neural Networks (RNNs) which process words sequentially. The explanation will delve into the encoder-decoder structure of Transformers, starting with a detailed look at the encoder block.

### Highlights

- 🤖 Transformers are powerful models capable of processing information efficiently.
- 🗺️ The text provides an overview diagram of the Transformer architecture.
- 🧱 Understanding the Transformer architecture involves breaking down its components step-by-step.
- 🗣️ Translating between languages like French and English serves as a use case for understanding these models.
- 🔄 Recurrent Neural Networks (RNNs) process input words one after another.
- ⚡ Transformers can process all input words at once, enabling parallel processing.
- 🏗️ The Transformer architecture employs an encoder-decoder structure.

# Input embeddings

### Summary

The first step in the Transformer architecture is to create input embeddings, which convert text into a numerical format the model can process. This involves tokenizing the text, mapping each token to a unique ID from a vocabulary, looking up each ID's embedding vector (pre-trained or learned together with the rest of the model), and adding positional encoding to capture word order. Padding or truncation may be applied so that all sequences have the same length. The resulting input embeddings are then passed to the encoder block.

### Highlights

- 🔢 Input embeddings are the first step in the Transformer architecture, converting text to numbers.
- 🧩 Text is broken down into tokens, which can be words, subwords, or characters.
- 📚 Each token is mapped to a unique numerical ID based on a predefined vocabulary.
- 📊 An embedding matrix stores vector representations (word embeddings) for each token, capturing semantic and syntactic information.
- 📍 Positional encoding adds information about the position of each token in the sequence.
- 📏 Padding or truncation may be used to ensure all input sequences have the same length.
- ➡️ The generated input embeddings are then fed into the encoder block of the Transformer.

### Code Examples

- `I love natural language processing` might be tokenized into: `['I', 'love', 'natural', 'language', 'processing']`.
- Example embedding vector for 'I': `[0.2, -0.5, 0.8, ...]`.
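
The whole pipeline (tokenize, map to IDs, pad or truncate, look up embeddings, add positional encoding) might look roughly like the sketch below. Several things here are assumptions for illustration: the tiny `vocab`, whitespace tokenization, the made-up `embedding_matrix` values, and the sinusoidal form of the positional encoding (the scheme from the original Transformer paper, though these notes do not say which scheme is used).

```python
import math

# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = {"<pad>": 0, "<unk>": 1, "I": 2, "love": 3, "natural": 4, "language": 5, "processing": 6}
d_model = 4  # embedding size, kept tiny for readability

# Made-up embedding matrix with one row per vocabulary entry; real values are learned.
embedding_matrix = [[0.1 * (row + col) for col in range(d_model)] for row in range(len(vocab))]


def encode(text, max_len):
    """Whitespace-tokenize, map tokens to IDs, then truncate or pad to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    return [
        math.sin(position / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


def embed(ids):
    """Look up each token's embedding and add its positional encoding."""
    return [
        [e + p for e, p in zip(embedding_matrix[tok_id], positional_encoding(pos, d_model))]
        for pos, tok_id in enumerate(ids)
    ]


ids = encode("I love natural language processing", max_len=6)
inputs = embed(ids)  # one d_model-sized vector per position, ready for the encoder
print(ids)
```

Unknown words fall back to the `<unk>` ID and short sequences are filled with `<pad>`, so every sequence ends up with exactly `max_len` positions.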

# Multi-headed attention

### Summary

Following the input embeddings and positional encoding, the data is processed by the encoder block, which includes a multi-head attention layer and a feedforward layer. The multi-head attention mechanism is a core component that allows the model to weigh the importance of different tokens by calculating attention vectors. This involves creating query, key, and value vectors for each token, calculating similarity scores using dot products, scaling these scores, applying the softmax function to get attention weights, and finally computing a weighted sum of the value vectors. The "multi-head" aspect refers to multiple parallel attention mechanisms that allow the model to capture different relationships within the data.

### Highlights

- 🧠 The encoder block processes input embeddings and positional encodings using multi-head attention and a feedforward layer.
- ⚖️ The multi-head attention layer weighs the importance of different tokens in the input.
- ❓ Query, key, and value vectors are created for each token to calculate attention.
- 🎯 Similarity scores between query and key vectors determine the attention each token should receive.
- 📈 The softmax function converts the scaled similarity scores into attention weights (probabilities that sum to 1).
- ➕ A weighted sum of value vectors based on attention weights produces the attention vector.
- 🐙 Multiple "heads" in the attention mechanism allow the model to learn diverse patterns in the data.

### Code Examples

- Calculation of similarity score: $\text{similarity}(Q_i, K_j) = Q_i \cdot K_j$
- Scaling of similarity scores: $\text{scaled\_similarity}(Q_i, K_j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}$, where $d_k$ is the dimension of the key vectors.
- Application of softmax to get attention weights: $\text{Attention}(Q_i, K_j) = \text{softmax}_j(\text{scaled\_similarity}(Q_i, K_j))$, with the softmax taken over all keys $j$.
- Calculation of the attention vector: $\text{AttentionVector}_i = \sum_j \text{Attention}(Q_i, K_j) \cdot V_j$
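
The four steps above can be put together for a single attention head as in the following sketch. It uses plain Python lists rather than a tensor library, and the `Q`, `K`, and `V` values are made up; in a real model they come from learned linear projections of the token embeddings.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Attention vector: weighted sum of the value vectors.
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
    return outputs


# Made-up 2-dimensional query, key, and value vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
print(out)
```

Multi-head attention runs several such heads in parallel, each with its own projections, and concatenates their outputs.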

# Feed-forward layer

### Summary

Following the multi-head attention layer, the output is processed by a feedforward neural network within the encoder block. This network consists of linear transformations with an activation function in between, enabling the model to learn complex, non-linear transformations of each token's representation. The feedforward layer operates independently on each token's representation, refining the information gathered by the self-attention mechanism and allowing for parallel processing. The output of this layer is a refined representation of each token, ready for further processing or the decoder block.

### Highlights

- 🧠 The output of the multi-head attention layer is passed to a feedforward neural network.
- 🔗 This network applies complex non-linear transformations to each token's representation.
- 📈 The feedforward layer typically involves two linear transformations with a non-linear activation function in between.
- 🔄 The first linear transformation reshapes and projects the token representations to a higher dimension.
- ⚡ An activation function introduces non-linearity to the model.
- 📉 The second linear transformation projects the representations back down, typically to the original model dimension.
- 🚀 These feedforward operations are applied independently to each token, allowing for parallel computation and increased speed.

### Code Examples

- First linear transformation: $h_i = W_1 x_i + b_1$, where $x_i$ is the input token representation, $W_1$ is the weight matrix, and $b_1$ is the bias.
- Application of activation function (e.g., ReLU): $a_i = \text{ReLU}(h_i) = \max(0, h_i)$.
- Second linear transformation: $y_i = W_2 a_i + b_2$, where $W_2$ is the weight matrix, $b_2$ is the bias, and $y_i$ is the output representation for token $i$.
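
These three formulas translate almost line for line into code. The sketch below is illustrative only: the weights are hand-picked rather than learned, and the dimensions (2 expanded to 4, then back to 2) are far smaller than in a real model.

```python
def linear(x, weights, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(h):
    """Element-wise max(0, h)."""
    return [max(0.0, v) for v in h]


def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between, for one token."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


# Hand-picked weights: expand from 2 to 4 dimensions, then project back to 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, -2.0], [-3.0, 0.5]]
# Each token is processed independently, so this loop could run in parallel.
outputs = [feed_forward(x, W1, b1, W2, b2) for x in tokens]
print(outputs)
```

With these particular weights the layer computes the element-wise absolute value of each token vector, a non-linear function that no single linear transformation could produce.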

# Masked multihead attention

### Summary

The decoder block receives the desired output sequences, such as the English translations in a French-to-English task. These outputs undergo embedding transformations and receive positional encodings, similar to the input in the encoder. However, unlike the encoder, the decoder processes these output embeddings through a masked multi-head attention layer. This masking ensures that when the model is predicting a word, it can only attend to the words that precede it in the output sequence, forcing it to learn the sequential generation of the target language.

### Highlights

- 🎯 The decoder block processes the target output sequences.
- 🗣️ In a translation task, the target language words are fed into the decoder.
- 🔢 These outputs also undergo embedding and positional encoding steps.
- 🎭 A masked multi-head attention layer is used in the decoder.
- 🙈 Masking prevents the model from seeing future words in the output sequence during training.
- 👁️ The model learns to predict the next word based only on the preceding words.
- ⚙️ This masked attention mechanism is crucial for sequential output generation.
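
One common way to implement the mask is to set the scores for future positions to negative infinity before the softmax, so that they receive zero attention weight. The sketch below assumes that approach; the `scores` values and the helper names are made up for illustration.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because every position may attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw attention scores for a 3-token target sequence.
scores = [[0.5, 1.2, -0.3], [0.1, 0.4, 2.0], [1.0, 1.0, 1.0]]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
print(weights)
```

The first token can only attend to itself, the second to the first two tokens, and so on, which matches the left-to-right order in which the output is generated.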

# Predicting the final outputs

### Summary

The decoder block's second multi-head attention layer takes input from both the encoder output and the decoder's own masked multi-head attention layer: the queries come from the masked attention output, while the keys and values come from the encoder output. It calculates attention scores between the current output token and the entire encoder output, generating a context vector that highlights the parts of the input sequence most relevant to predicting the next output token. This is followed by a feedforward layer, a linear layer that projects the result to one score (logit) per vocabulary token, and a softmax layer that turns those scores into a probability distribution over possible next tokens. This process is repeated for each token in the output sequence, using previously generated tokens as input for subsequent predictions.

### Highlights

- 🤝 The decoder's multi-head attention receives input from both the encoder and its own masked attention.
- 🔍 It calculates attention scores between the current output and the encoder outputs to identify relevant input parts.
- 💡 A context vector is created by weighting encoder outputs with these attention scores.
- ⚙️ A feedforward layer further processes the information.
- 📊 A linear layer projects the result to one score per vocabulary token, and a softmax layer turns those scores into a probability distribution for the next token.
- 🔄 The decoder block repeats this process for each token in the output sequence.
- 🎉 The explanation provides a comprehensive overview of how Transformer models function.
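
The last two layers can be sketched as a projection to vocabulary-sized scores followed by a softmax. Everything concrete here is an assumption for illustration: the four-word `vocab`, the weights `W` and `b`, the example decoder output, and the greedy choice of the most probable token (sampling and beam search are common alternatives).

```python
import math

# Hypothetical 4-token target vocabulary.
vocab = ["<eos>", "I", "love", "flowers"]


def next_token_probs(decoder_output, W, b):
    """Project a decoder output vector to vocabulary logits, then apply softmax."""
    logits = [sum(w * x for w, x in zip(row, decoder_output)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - m) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up projection weights: one row per vocabulary entry.
W = [[0.1, 0.0], [0.5, -0.2], [2.0, 1.0], [0.3, 0.8]]
b = [0.0, 0.0, 0.0, 0.0]

probs = next_token_probs([1.0, 0.5], W, b)
predicted = vocab[probs.index(max(probs))]  # greedy choice: most probable token
print(predicted)
```

During generation the predicted token is appended to the decoder input and the whole decoder runs again, until an end-of-sequence token is produced.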