In [19]:
import torch
from torch import nn 
import pandas as pd
import math 
import torch.nn.functional as F

## Embedding block 

In [20]:
class InputEmbedding (nn.Module) :
    def __init__(self,d_model:int, vocab_size:int):
        super(InputEmbedding,self).__init__()

        self.d_model= d_model
        self.vocab_size = vocab_size
        self.embedding= nn.Embedding(vocab_size,d_model)

    def forward(self,x):
        # this multiplication helps maintain the appropriate variance of the input embeddings.
        return self.embedding(x)*math.sqrt(self.d_model)

## Positional encoding block 

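For even dimensions:
$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
$$
For odd dimensions:
$$
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
$$

Here $pos$ is the token position and $i$ indexes the sine/cosine pairs along the embedding dimension. The code computes the divisor in log space, as $\exp\left(-\frac{2i}{d_{model}} \ln 10000\right)$, which is mathematically identical and is commonly done for numerical stability.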
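
The equivalence between the direct divisor and its log-scale form can be checked numerically. The sketch below uses plain Python (only `math`) rather than PyTorch to stay dependency-free; `d_model = 8` and the helper names are arbitrary choices made for illustration.

```python
import math

d_model = 8  # small, arbitrary size chosen for illustration

def direct_divisor(i: int) -> float:
    """1 / 10000^(2i / d_model), written exactly as in the formula."""
    return 1.0 / (10000.0 ** (2 * i / d_model))

def log_scale_divisor(i: int) -> float:
    """The same quantity computed as exp(-2i * ln(10000) / d_model)."""
    return math.exp(2 * i * (-math.log(10000.0) / d_model))

def positional_encoding(pos: int) -> list:
    """Interleave sin (even indices) and cos (odd indices) for one position."""
    row = []
    for i in range(d_model // 2):
        angle = pos * log_scale_divisor(i)
        row.extend([math.sin(angle), math.cos(angle)])
    return row

# Both forms of the divisor agree for every pair index i.
for i in range(d_model // 2):
    assert math.isclose(direct_divisor(i), log_scale_divisor(i), rel_tol=1e-9)

pe0 = positional_encoding(0)  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
pe3 = positional_encoding(3)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for any implementation of this encoding.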

In [21]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super(PositionalEncoding, self).__init__()

        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(p=dropout)

        # Create a matrix of seq_len * d_model
        Pos_enc = torch.zeros(seq_len, d_model)
        # Create a vector of shape (seq_len)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
        # Compute the divisor
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        Pos_enc[:, 0::2] = torch.sin(position * div_term)
        Pos_enc[:, 1::2] = torch.cos(position * div_term)

        # Add an additional dimension for the batch size
        Pos_enc = Pos_enc.unsqueeze(0)  # [1, seq_len, d_model]
        # This tensor is not a learnable parameter, so save it with the model as a buffer
        self.register_buffer("Pos_enc", Pos_enc)

    def forward(self, x):
        # Add the positional encoding to every embedding in the batch.
        # The encoding is fixed, so no gradient is tracked for it.
        x = x + self.Pos_enc[:, :x.size(1), :].requires_grad_(False)
        x = self.dropout(x)
        return x


## Add and Norm block 
These blocks incorporate two essential components: a residual connection and a LayerNormalization layer.

In [22]:
class AddNorm (nn.Module):

    def __init__(self, d_model,dropout:float):
        super(AddNorm,self).__init__()
        self.dropout = nn.Dropout(dropout)

        # norm block 
        self.normlayer = nn.LayerNorm(d_model)
        

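    def forward(self,x, sublayer):
        # `sublayer` is the already-computed sublayer output (a tensor), not a module.
        # Note: this computes x + Dropout(LayerNorm(sublayer)); the original
        # Transformer instead applies LayerNorm(x + Dropout(sublayer)).
        x = x + self.dropout(self.normlayer(sublayer))
        return x 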
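
The order of the residual addition and the normalization matters. As a rough illustration, the sketch below compares the ordering used in `AddNorm` above, `x + LayerNorm(sublayer)`, with the post-norm ordering of the original Transformer paper, `LayerNorm(x + sublayer)`. It is plain Python with dropout and the learned affine parameters left out, and the numbers are made up.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (no affine)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add(u, v):
    """Element-wise sum of two vectors (the residual connection)."""
    return [a + b for a, b in zip(u, v)]

x = [1.0, 2.0, 3.0, 4.0]        # block input for one token (made-up values)
sub = [0.5, -0.5, 1.5, -1.5]    # stand-in for a sublayer output

post_ln = layer_norm(add(x, sub))  # LayerNorm(x + sublayer): output is normalized
variant = add(x, layer_norm(sub))  # x + LayerNorm(sublayer): residual stream is not
```

In the post-norm form the block output always has zero mean and unit variance. In the variant, only the update is normalized, so the output keeps the mean of `x`.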

## Position-Wise Feed-Forward Network (FFN):

The FFN consists of two fully connected layers. The hidden dimension $d_{ff}$ is generally set to about four times the token embedding dimension $d_{model}$, which is why it is sometimes called an expand-and-contract network.
A nonlinearity, usually ReLU, is applied at the hidden layer.

The FFN transforms the features of each position in the input sequence independently, using the same weights at every position.
Mixing information across positions is left to the attention layers; the FFN then refines each token's representation on its own.

The **expanding** step increases the dimensionality of the representations, allowing the model to capture more complex features and interactions in the data, while the **contracting** step projects them back to $d_{model}$, keeping the most relevant information so that the output can be added to the residual connection and passed to the next block.

$$FFN(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)W_2 + b_2$$
where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters.
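
To make the formula concrete, here is a small worked example in plain Python with $d_{model} = 2$ and $d_{ff} = 4$. The weights are hypothetical values picked by hand so the arithmetic is easy to follow; a real model learns them.

```python
def relu(v):
    """max(0, .) applied element-wise."""
    return [max(0.0, a) for a in v]

def linear(x, W, b):
    """Compute xW + b for a row vector x and a weight matrix W (in_dim x out_dim)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2 for a single position."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hypothetical weights: expand from 2 to 4 dimensions, then contract back to 2.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

x = [1.0, 2.0]
hidden = relu(linear(x, W1, b1))  # [1.0, 1.0, 0.0, 3.0]: the negative unit is zeroed
y = ffn(x, W1, b1, W2, b2)        # approximately [2.6, -0.6]

# The same weights are applied to every position independently.
sequence = [[1.0, 2.0], [0.0, 0.0]]
outputs = [ffn(token, W1, b1, W2, b2) for token in sequence]
```

Applying `ffn` to each row of `sequence` separately gives the same result as applying it to the whole sequence, which is what "position-wise" means.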








In [23]:
class FFN(nn.Module):
    def __init__(self,d_model, dropout):
        super(FFN,self).__init__()

        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)

        # FFN block: expand to d_ff = 4 * d_model, then project back to d_model
        d_ff = d_model * 4
        self.linear1 = nn.Linear(self.d_model, d_ff)  # W_1 and b_1
        self.linear2 = nn.Linear(d_ff, self.d_model)  # W_2 and b_2


    def forward(self,x):
        x = self.linear2(self.dropout(F.relu(self.linear1(x))))
        return x 

        

## Multihead self-attention 

Attention mechanisms were introduced to give the model access to all sequence elements at each time step. The key is to be selective and determine which words matter most in a given context.

Self-attention enriches each input embedding with information about its context. It lets the model assign different weights to the words in a sequence, focusing more on relevant parts and less on irrelevant ones.

Each word in a sequence is transformed into three vectors, Query (Q), Key (K), and Value (V), by multiplying the word's embedding by learnable weight matrices. These vectors play different roles in the attention layer:
- Q -> The query vector represents the word for which we want to calculate attention scores. It is compared with the keys of all words in the sequence to determine their relevance to the current word.
- K -> The key vector represents a word as a candidate to be attended to. Each word has its own key vector, and the keys are compared with the query to determine how relevant each word is to the query word.
- V -> The value vector carries the information of the word itself. Once the relevance of each word has been determined (using keys and queries), the values are combined to create the output.

The dot product of one word's Query vector with another word's Key vector, divided by the square root of the key dimension $d_k$, gives a score for how strongly the two words are related. The scores are passed through a softmax to obtain attention weights, and the weights are used to compute a weighted sum of the Value vectors, the context vector.
The result is a new representation for each word that takes into account its relationship with all the other words in the sentence, and so captures the context in which the word appears.
$$
Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$
Multi-head attention lets the model focus on different aspects of the input simultaneously, improving its ability to capture complex relationships within the sequence.
1. Splitting into heads: the input is projected into multiple smaller representations of size $d_k = d_{model} / h$, called "heads". Each head has its own learned weight matrices for the query (Q), key (K), and value (V) transformations.
2. Parallel computation: each head performs its own attention calculation independently, producing its own attention weights and its own output.
3. Concatenation and linear transformation: the outputs of all heads are concatenated and multiplied by a learned output weight matrix, which combines the information from the different heads.
$$
MultiHead(Q, K, V) = Concat(head_1, \dots, head_h) W^O
$$
$$
\textrm{where} \quad head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
$$
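
The scaled dot-product formula can be traced by hand on a tiny example. The sketch below is plain Python operating on lists of row vectors, with made-up `Q`, `K`, `V` values and a mask convention (0 means "blocked") chosen to match the `mask == 0` test used in the code cell that follows. It is an illustration, not the notebook's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention for lists of row vectors.
    mask[i][j] == 0 blocks query i from attending to key j."""
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights

# Two tokens with d_k = 2 (made-up values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out, weights = attention(Q, K, V)

# Causal mask: token 0 may only attend to itself, token 1 to both tokens.
causal = [[1, 0], [1, 1]]
masked_out, masked_weights = attention(Q, K, V, causal)
```

Each row of `weights` sums to 1, and each query puts most weight on the key it matches. With the causal mask, the first token's weight on the second token becomes exactly 0.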


In [24]:
class MultiHeadAttention(nn.Module):
    def __init__(self, h: int , d_model:int, dropout:float):
        super(MultiHeadAttention,self).__init__()

        self.h = h 
        self.d_model = d_model
 
        # check that d_model can be split evenly across the h heads
        assert d_model  % h == 0, "d_model is not divisible by h"
        # per-head dimension
        self.d_k = d_model// h

        # linear transformation matrices
        self.w_q = nn.Linear(d_model,d_model)
        self.w_k = nn.Linear(d_model,d_model) 
        self.w_v = nn.Linear(d_model,d_model) 
        self.w_o = nn.Linear(d_model,d_model)

        self.dropout = nn.Dropout(dropout)


    @staticmethod
    def attention(Q,K,V,mask,dropout: nn.Dropout):

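        # the scores are scaled by the query/key dimension
        d_k = Q.shape[-1]
        # scores: [batch_size, num_heads, seq_len_q, seq_len_k]
        attention_values = (Q @ K.transpose(-2,-1))/ math.sqrt(d_k)
        # MASKED ATTENTION: positions where mask == 0 are blocked
        if mask is not None:
            # add a head dimension so the mask broadcasts over all heads:
            # (batch_size, 1, seq_len) -> (batch_size, 1, 1, seq_len), or
            # (batch_size, seq_len, seq_len) -> (batch_size, 1, seq_len, seq_len)
            mask = mask.unsqueeze(1)
            attention_values = attention_values.masked_fill(mask == 0, float('-inf'))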
        # dim=-1 so that the softmax function normalizes the scores for each query across all keys
        attention_values = attention_values.softmax(dim = -1) 
        attention_values = dropout(attention_values)
        attention_f_values = attention_values @ V

        return  attention_f_values, attention_values



    def forward(self,Q,K,V,mask = None):

        # linear transformation 
        # -> (batch_size, seq_length , d_model)
        Q = self.w_q(Q) 
        K = self.w_k(K)
        V = self.w_v(V)

        # split into heads by viewing each matrix as (batch_size, seq_length, h, d_k),
        # then transpose to (batch_size, num_heads, seq_length, d_k)
        Q = Q.view(Q.shape[0],Q.shape[1], self.h , self.d_k).transpose(1,2)
        K = K.view(K.shape[0],K.shape[1], self.h , self.d_k).transpose(1,2)
        V = V.view(V.shape[0],V.shape[1], self.h , self.d_k).transpose(1,2)

        x, self.attention_values = MultiHeadAttention.attention(Q,K,V,mask,self.dropout)

        # transpose back to (batch_size, seq_length, num_heads, d_k), then
        # concatenate the results of all the heads: (batch_size, seq_len, d_model)
        x = x.transpose(1,2).contiguous().view(x.shape[0],-1,self.h*self.d_k)

        x = self.w_o(x)

        return x 
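
The `view(...).transpose(1, 2)` step in `forward` can be hard to picture. A rough plain-Python equivalent for a single sequence is sketched below: each token vector of length `d_model` is cut into `h` contiguous chunks of length `d_k`, and merging simply concatenates them again. The helper names `split_heads` and `merge_heads` are made up for this illustration.

```python
def split_heads(x, h):
    """Split each token vector of length d_model into h chunks of length d_k.
    Input:  x[seq_len][d_model]; output: heads[h][seq_len][d_k]."""
    d_model = len(x[0])
    assert d_model % h == 0, "d_model is not divisible by h"
    d_k = d_model // h
    return [[tok[i * d_k:(i + 1) * d_k] for tok in x] for i in range(h)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head chunks of every token."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Two tokens, d_model = 6, split across h = 3 heads (d_k = 2).
x = [[0, 1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10, 11]]
heads = split_heads(x, h=3)
restored = merge_heads(heads)
```

Here `heads[0]` holds features 0 and 1 of every token, `heads[1]` features 2 and 3, and so on. Attention then runs on each head separately before `merge_heads` restores the `(seq_len, d_model)` layout.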


## Encoder wrapper 

In [25]:
class EncoderBlock(nn.Module):
    def __init__(self,SelfAttention_block :MultiHeadAttention, FFN_block :FFN ,dropout:float,d_model) :
        super(EncoderBlock,self).__init__()

        self.SelfAttention_block = SelfAttention_block 
        self.FFN_block = FFN_block
        #ModuleList for storing and iterating over a list of modules.
        self.AddNorm_block = nn.ModuleList([AddNorm(d_model,dropout) for _ in range(2)])

    def forward(self ,x , mask):
        x = self.AddNorm_block[0](x,self.SelfAttention_block(x,x,x,mask))
        x = self.AddNorm_block[1](x,self.FFN_block(x))

        return x 
    
class Encoder(nn.Module):

    def __init__(self, layers :nn.ModuleList,d_model):
        super(Encoder, self).__init__()
        self.layers = layers 
        self.normlayer = nn.LayerNorm(d_model)

    def forward(self,x,mask):
        for layer in self.layers:
            x = layer(x,mask)
        return self.normlayer(x) 


## Decoder wrapper

In [26]:
class DecoderBlock (nn.Module):
    def __init__(self,MaskedSelfAtt: MultiHeadAttention,CrossAttention:MultiHeadAttention, FFN_block : FFN, dropout:float,d_model):
        super(DecoderBlock,self).__init__()
        self.MaskedSelfAtt = MaskedSelfAtt
        self.CrossAttention = CrossAttention
        self.FFN_block = FFN_block 
        self.AddNorm = nn.ModuleList([AddNorm(d_model,dropout) for _ in range(3)])

        
    def forward(self, x , encoder_output , encoder_mask , decoder_mask ):
        x = self.AddNorm[0](x, self.MaskedSelfAtt(x,x,x,decoder_mask))
        x = self.AddNorm[1](x, self.CrossAttention(x,encoder_output,encoder_output,encoder_mask))
        # Why does the FFN take no other input here? It acts on x alone, position by position.
        
        x = self.AddNorm[2](x, self.FFN_block(x))
        return x 
    
class Decoder (nn.Module):
    def __init__(self,layers:nn.ModuleList,d_model):
        super(Decoder,self).__init__()

        self.layers = layers 
        self.normlayer = nn.LayerNorm(d_model)
    def forward (self,x,encoder_output,encoder_mask, decoder_mask):
        for layer in self.layers:
            x = layer(x,encoder_output,encoder_mask, decoder_mask)
        return self.normlayer(x)

## Classification head

In [27]:
class ClassificationHead (nn.Module):
    def __init__(self,d_model,vocab_size):
        super(ClassificationHead,self).__init__()
        self.linear = nn.Linear(d_model, vocab_size)

    def forward (self,x):
        logits = self.linear(x)
        probabilities = F.softmax(logits, dim=-1)
        return probabilities

## Transformer wrapper 

During inference, the encoder output can be reused. This is common practice in sequence-to-sequence tasks such as machine translation: the encoder processes the input sequence once, and the decoder then generates the output sequence token by token, attending to the same encoder output at every step.
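
A minimal sketch of that inference pattern is shown below. `ToyModel`, its `encode` and `decode_step` methods, and the `BOS`/`EOS` ids are all hypothetical stand-ins: the `Transformer` class in this notebook exposes only a single `forward`, so it would need separate encode and decode entry points to be used this way.

```python
BOS, EOS = 1, 2  # hypothetical special-token ids

class ToyModel:
    """Stand-in for an encoder-decoder model; it only counts calls."""
    def __init__(self):
        self.encoder_calls = 0
        self.decoder_calls = 0

    def encode(self, src):
        self.encoder_calls += 1
        return list(src)  # pretend this is the encoder output ("memory")

    def decode_step(self, memory, generated):
        """Return the next token id given the memory and the tokens so far.
        This toy rule copies the source and then emits EOS."""
        self.decoder_calls += 1
        step = len(generated) - 1  # number of tokens produced after BOS
        return memory[step] if step < len(memory) else EOS

def greedy_decode(model, src, max_len=20):
    memory = model.encode(src)  # computed once, reused at every step
    generated = [BOS]
    while len(generated) < max_len:
        next_token = model.decode_step(memory, generated)
        generated.append(next_token)
        if next_token == EOS:
            break
    return generated

model = ToyModel()
result = greedy_decode(model, src=[7, 8, 9])
```

The encoder runs once while the decoder runs once per generated token, which is where the saving comes from for long outputs.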

In [28]:
class Transformer(nn.Module):
    def __init__(self,src_vocab_size:int,tgt_vocab_size:int,src_seq_len:int,tgt_seq_len:int,dropout:float,
                 d_model: int = 512, N:int = 6,h:int = 8 ):
        super(Transformer,self).__init__()
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size
        self.src_seq_len = src_seq_len
        self.tgt_seq_len = tgt_seq_len
        self.N = N
        self.h = h 
        self.dropout = dropout
        self.d_model = d_model

        self.initialize()

    def initialize(self):
        # embedding layer 
        self.src_embedd = InputEmbedding(self.d_model,self.src_vocab_size)
        self.tgt_embedd = InputEmbedding(self.d_model,self.tgt_vocab_size)

        # positional encoding layer 
        self.src_pos = PositionalEncoding(self.d_model,self.src_seq_len,self.dropout)
        self.tgt_pos = PositionalEncoding(self.d_model,self.tgt_seq_len,self.dropout)

        # Encoders block 
        encoder_blocks = []
        for _ in range(self.N):
            encoder_self_attention = MultiHeadAttention(self.h,self.d_model,self.dropout)
            FFN_layer = FFN(self.d_model,self.dropout)
            encoder_block = EncoderBlock(encoder_self_attention,FFN_layer,self.dropout,self.d_model)
            encoder_blocks.append(encoder_block)
        
        #decoders block 
        decoder_blocks = []
        for _ in range(self.N):
            decoder_self_attention = MultiHeadAttention(self.h, self.d_model, self.dropout) 
            decoder_cross_attention = MultiHeadAttention(self.h, self.d_model, self.dropout) 
            FFN_layer = FFN(self.d_model,self.dropout)
            decoder_block = DecoderBlock(decoder_self_attention,decoder_cross_attention,FFN_layer,self.dropout,self.d_model)
            decoder_blocks.append(decoder_block)

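        # create encoder and decoder
        # wrap the lists in nn.ModuleList so the blocks' parameters are registered
        self.encoder = Encoder(nn.ModuleList(encoder_blocks), self.d_model)
        self.decoder = Decoder(nn.ModuleList(decoder_blocks), self.d_model)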

        # classification head 
        self.classification_head = ClassificationHead(self.d_model,self.tgt_vocab_size)

    def forward(self,src,tgt , tgt_mask=None,src_mask=None):
        src_embedd = self.src_embedd(src)
        src_pos =  self.src_pos(src_embedd)
        encoder_output = self.encoder(src_pos, src_mask)

        tgt_embedd = self.tgt_embedd(tgt)
        tgt_pos =  self.tgt_pos(tgt_embedd)
        decoder_output = self.decoder(tgt_pos,encoder_output,src_mask, tgt_mask)

        classify = self.classification_head(decoder_output)

        return classify


In [31]:
d_model = 512
src_vocab_size = 10000
tgt_vocab_size = 10000
src_seq_len = 100
tgt_seq_len = 100
dropout = 0.1
batch_size = 32


src = torch.randint(0, src_vocab_size, (batch_size, src_seq_len))  
tgt = torch.randint(0, tgt_vocab_size, (batch_size, tgt_seq_len))

# Create masks
def create_masks(src, tgt, src_pad_idx, tgt_pad_idx):
    src_mask = (src != src_pad_idx).unsqueeze(-2)
    tgt_mask = (tgt != tgt_pad_idx).unsqueeze(-2)
    
    # Subsequent mask for the target sequence (for autoregressive decoding)
    size = tgt.size(1)  # get seq_len for matrix
    nopeak_mask = torch.tril(torch.ones((1, size, size), device=tgt.device)).bool()
    tgt_mask = tgt_mask & nopeak_mask
    
    return src_mask, tgt_mask

src_pad_idx = 0  # Assume 0 is the padding index
tgt_pad_idx = 0  # Assume 0 is the padding index

# src_mask prevents attention to source padding tokens.
# tgt_mask enforces autoregressive decoding by masking future target tokens.
src_mask, tgt_mask = create_masks(src, tgt, src_pad_idx, tgt_pad_idx)


model = Transformer( src_vocab_size, tgt_vocab_size, 
                     src_seq_len, tgt_seq_len, dropout)
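# Pass the masks by keyword: forward() declares tgt_mask before src_mask,
# so positional arguments would swap them.
x = model(src, tgt, tgt_mask=tgt_mask, src_mask=src_mask)
print(x)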

tensor([[[1.1472e-04, 1.7049e-04, 1.1314e-04,  ..., 6.1232e-05,
          1.2523e-04, 7.9625e-05],
         [1.1360e-05, 8.4765e-05, 4.0626e-05,  ..., 1.7509e-04,
          6.2689e-05, 2.7224e-04],
         [2.8805e-04, 1.3874e-04, 1.6954e-04,  ..., 1.2933e-04,
          1.8462e-04, 1.8869e-04],
         ...,
         [7.2598e-05, 9.9030e-05, 1.4380e-04,  ..., 3.4544e-05,
          6.2701e-05, 2.0583e-04],
         [2.1606e-04, 2.2514e-05, 1.1037e-04,  ..., 1.6708e-04,
          5.8799e-05, 1.0975e-04],
         [2.8416e-05, 2.0974e-04, 6.6932e-05,  ..., 9.1073e-05,
          1.4830e-04, 3.2595e-05]],

        [[4.7407e-05, 4.5760e-05, 1.1797e-04,  ..., 6.1123e-05,
          9.2908e-05, 4.9942e-05],
         [1.4749e-04, 1.0644e-04, 3.4683e-04,  ..., 8.0093e-05,
          6.9632e-05, 6.1360e-05],
         [7.7821e-05, 2.5662e-05, 1.2113e-04,  ..., 4.0259e-05,
          5.7442e-05, 7.6457e-05],
         ...,
         [5.6182e-05, 6.1656e-05, 1.0672e-04,  ..., 1.7616e-04,
          7.205

In [33]:
df = pd.read_csv("eng_-french.csv")
english_sentences = df['English words/sentences'].tolist()
french_sentences = df['French words/sentences'].tolist()

# Question 3 (Theory)

#### **1. Autoencoders: Explain what autoencoders are and how they are related to sequence-to-sequence modeling. How are they related to Transformers?**
> ##### Autoencoders: 
An autoencoder is a type of neural network designed to learn efficient representations of its input, which can be used for dimensionality reduction or feature learning. It is structured as follows:

- **input layer**: receives an input $x$ of size $d$.
- **encoder**: encodes the input into a latent representation of size $k \ll d$; it usually consists of one or more layers of neurons with activation functions.
- **latent/bottleneck layer**: the latent representation produced by the encoder. The reduced dimensionality forces the model to learn a compressed version of the input that keeps only its most important features.
- **decoder**: reconstructs the input $x$ from the latent representation, either to reproduce the original input or to feed a supervised task.

Autoencoders serve several purposes: dimensionality reduction (compressing the input to far fewer dimensions and reconstructing it afterwards, for storage or visualization), feature learning (learning abstract features in an unsupervised manner for later use in supervised tasks), and sequence-to-sequence (seq2seq) tasks.

> ##### Encoder-Decoder seq2seq models: 
For seq2seq tasks such as language translation or text summarization, a standard feedforward network is insufficient as an encoder because the data is sequential. RNNs or LSTMs are used instead, and the autoencoder's encoder-decoder structure is adapted to them. This design also handles a difficulty of plain RNN/LSTM models, namely input and output sequences of variable and differing lengths:
- **RNN/LSTM sequence encoder**: processes the tokens of the input sequence one by one and summarizes them in a fixed-length latent vector (known as the context vector in seq2seq). After the last token, the encoder passes this vector to the decoder.
- **RNN/LSTM sequence decoder**: takes the context vector from the encoder and generates the output sequence. It produces one element at a time, updating its hidden state based on the previously generated elements, which allows flexible mappings between sequences of different lengths.

> ##### Transformers 
The main disadvantage of RNN-based encoder-decoder seq2seq models is the fixed-length context vector. With long sentences the model struggles to learn long-term dependencies because of vanishing/exploding gradients, so it mostly remembers the most recent parts of the input. Another problem is that there is no way to attend to some words more than others according to their importance. This is where Transformers come in: they keep the encoder-decoder structure but rely solely on attention mechanisms to weigh the importance of different words in a sequence. Each position can attend to all other positions, so the model can capture dependencies regardless of distance. Structurally:

- encoder: consists of multiple layers of self-attention and feedforward networks. The input sequence is processed all at once, which allows parallelization.
- decoder: similar to the encoder, but with an additional attention mechanism that focuses on the relevant parts of the input sequence when generating each output element.

#### **2.Normalization: What are the differences between the notions of batch and layer normalization? Explain your answer** 

Normalization techniques can significantly reduce training time by rescaling activations to a common range, so that the network is not biased towards features with larger values. Batch normalization and layer normalization are two popular methods. The key differences between them are:

>##### Batch Normalization (BatchNorm) 
- Normalizes the activations across a mini-batch: for each feature, it computes the mean and variance over the examples in the current mini-batch of training data.
- Each activation is normalized as
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}
$$
$\mu_B$: batch mean <br>
$\sigma_B^2$: batch variance <br>
$\epsilon$: a small constant for numerical stability 
- Adds noise during training, because the statistics change from one mini-batch to the next. This acts as a form of regularization and reduces the risk of overfitting.
- Involves more bookkeeping: the mean and variance are computed for each mini-batch during training, and running estimates must be kept for use at inference.
- Typically used in convolutional neural networks (CNNs) and feedforward neural networks (FFNs), where batch processing is common.
>##### Layer Normalization (LayerNorm)
- Normalizes the activations across the feature dimensions of each example independently, instead of across the batch dimension.
- Each activation is normalized as
$$
\hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma^2_L + \epsilon}}
$$
$\mu_L$: mean over the features of a layer <br>
$\sigma_L^2$: variance over the features of a layer <br>
$\epsilon$: a small constant for numerical stability
- Does not inherently add noise during training, so it lacks the regularization effect of batch normalization.
- Simpler to apply: the mean and variance are computed for each example individually, so it behaves the same in training and inference and does not depend on the batch size.
- Mostly used in recurrent neural networks (RNNs) and Transformer-based architectures, where input lengths can vary and batch sizes may be small or varied.

After normalization with either technique, a learned affine transformation is applied to the normalized features:
$$
y_i = \gamma \hat{x}_i + \beta
$$
where $\gamma$ and $\beta$ are learnable parameters.
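
The difference between the two techniques comes down to which axis the statistics are taken over. The sketch below shows this in plain Python on a made-up 4 x 3 batch (4 examples, 3 features). It leaves out the learned $\gamma$ and $\beta$ and the running statistics that batch normalization keeps for inference.

```python
import math

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) using statistics over the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example (row) using statistics over its own features."""
    return [normalize(row) for row in batch]

# 4 examples with 3 features on very different scales (made-up values).
batch = [[1.0, 10.0, 100.0],
         [2.0, 20.0, 200.0],
         [3.0, 30.0, 300.0],
         [4.0, 40.0, 400.0]]

bn = batch_norm(batch)  # every column now has mean ~0 across the batch
ln = layer_norm(batch)  # every row now has mean ~0 across its features

# Layer norm of one example does not depend on the rest of the batch.
ln_single = layer_norm([batch[0]])[0]
```

`ln_single` equals `ln[0]` exactly, while the batch-normalized value of an example changes whenever the other examples in the mini-batch change.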

#### **3. Show how training and inference of RNNs and Transformers for sequence-to-sequence modeling are done in an autoregressive fashion. What trick do we use when training the decoder part? Explain your answer in detail.** 

Both Recurrent Neural Networks (RNNs) and Transformers are used for sequence-to-sequence (seq2seq) tasks. These models generate output sequences one element at a time in an autoregressive fashion, where autoregressive modeling means that the model predicts the next element in the sequence based on the elements that have been generated so far. This approach is used in both training and inference phases, though the techniques differ slightly to ensure effective learning and generation.

> **RNN in seq2seq**: 

In RNNs, the input sequence is given sequentially. Each element of the input sequence is fed into the RNN's encoder one at a time, and the encoder updates its hidden state with each new element, capturing the temporal dependencies in the sequence. 

> **Transformers in seq2seq**

Transformers, unlike RNNs, process the input sequence in parallel rather than sequentially. This is possible because of the self-attention mechanism, which lets the model attend to all positions in the input sequence simultaneously and so capture complex dependencies more efficiently.


>**Inference in seq2seq modeling**  

During inference, both models feed the token they generated at the previous time step back in as input for the next step. This autoregressive loop continues until an end-of-sequence token is generated, so each new element is conditioned on everything generated so far.

>**Training in Seq2Seq Models with Teacher Forcing**

During training, the trick is to feed the decoder the ground-truth target tokens from the previous time steps instead of the model's own predictions, a technique called "teacher forcing". Feeding back the model's own, often wrong, early predictions would lead to slow convergence and instability during training. In a Transformer decoder, teacher forcing is combined with a causal mask on the self-attention, so that all target positions are trained in parallel while each position still sees only the tokens before it. The loss is computed between the predicted output sequence and the actual target sequence, allowing the model to adjust its parameters to improve its performance.
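
As a small illustration of teacher forcing, the sketch below builds the decoder input and the training labels from one ground-truth target sequence, together with the causal mask that keeps each position from looking ahead. It is plain Python; the `BOS`/`EOS` ids and the helper names are assumptions made for this example.

```python
BOS, EOS = 1, 2  # hypothetical special-token ids

def teacher_forcing_pair(target):
    """Build decoder input and labels from one ground-truth target sequence."""
    decoder_input = [BOS] + target  # what the decoder sees (shifted right)
    labels = target + [EOS]         # what it must predict at each position
    return decoder_input, labels

def causal_mask(size):
    """mask[i][j] == 1 if position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]

target = [5, 6, 7]  # ground-truth token ids (made-up values)
decoder_input, labels = teacher_forcing_pair(target)
mask = causal_mask(len(decoder_input))
```

At position `t` the decoder receives `decoder_input[:t + 1]` and is trained to predict `labels[t]`, so its input is always the true previous token rather than its own guess. The lower-triangular `mask` is what allows all positions to be trained in one parallel pass.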


#### **4. Positional embeddings: Explain in your own words what positional embeddings are and why they are introduced? What alternative methods (other than the sin-cos based method we discussed in the lecture) are usually considered for incorporating positional embeddings? Explain your results verbally as well as formally.**

#### **5. Attention mechanism: What are the main differences between the attention models considered in RNNs and the Transformer-models? What is the main motivation to incorporate attention to RNNs? Why do we consider using multi-headed attention in Transformers? What is modeled through it? Explain your answer in detail.**