<h1><center>
    Implementing, fine-tuning and visualizing transformer architectures. <br/>
  
    
    Project 1
</center></h1>

# Task 1
Implement the feed-forward pass of the original transformer network using only numpy, i.e. without machine learning frameworks.

Note: All subtasks are voluntary and serve as a guideline for how we would implement the forward pass. You can also implement the parts in a different order, or put everything in one class or function. The forward pass should return a numpy array.

Please initialize the projections using Glorot initialization.

In [3]:
# You will test your implementation on a single array:
import numpy as np
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])


In [4]:
def init_weights(y_rows, x_cols):
    """
    Initialise the weights of a layer with Glorot normal initialisation.

    :param y_rows: The number of rows of the weight matrix (inputs to the layer).
    :type y_rows: int
    :param x_cols: The number of columns of the weight matrix (outputs of the layer).
    :type x_cols: int
    :return: The initialised weights of shape (y_rows, x_cols).
    :rtype: np.ndarray
    """
    # Glorot normal: zero mean, variance 2 / (fan_in + fan_out)
    return np.random.normal(loc=0.0, scale=np.sqrt(2.0 / (y_rows + x_cols)), size=(y_rows, x_cols))

In [5]:
# token embedding matrix: 1000 x 512 random floats, Glorot normal
word_embeddings = init_weights(1000, 512)
word_embeddings.shape

(1000, 512)

In [6]:
word_embeddings

array([[-0.038666  ,  0.0110876 , -0.00198816, ..., -0.00793881,
        -0.00731678,  0.05975828],
       [ 0.04740794,  0.02777291,  0.04728539, ...,  0.0096149 ,
        -0.04204589, -0.05074325],
       [ 0.01354499, -0.02552212,  0.01505498, ..., -0.06429986,
         0.00700919,  0.01385261],
       ...,
       [ 0.01120186,  0.02460554,  0.04894044, ..., -0.0032413 ,
        -0.00678045, -0.01369281],
       [ 0.02624195,  0.05351129,  0.03768188, ...,  0.00552568,
         0.01751928,  0.0140576 ],
       [ 0.0201187 ,  0.10398219, -0.06068838, ...,  0.04591414,
        -0.03330898,  0.00720756]])
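As a quick sanity check on the initialisation, the empirical standard deviation of a Glorot normal matrix should be close to $\sqrt{2 / (\text{fan\_in} + \text{fan\_out})}$. The following is a minimal sketch; the helper `glorot_normal` and the seeded generator `rng` are illustrative names that are not part of the notebook.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Glorot (Xavier) normal initialisation: N(0, 2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)  # seeded only to make the check repeatable
W = glorot_normal(1000, 512, rng)
expected_std = np.sqrt(2.0 / (1000 + 512))

# The two numbers should agree to within sampling noise
print(W.shape, float(W.std()), float(expected_std))
```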

In [7]:
selected_embeddings = word_embeddings[forward_pass_array]

## Task 1.1
Implement the sine/cosine positional encoding used in the original paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Implement the token embedding.

In [8]:
def positional_encoding(positions, d_model:int = 512):
    """
    Return the sinusoidal positional encoding matrix of shape (len(positions), d_model).

    Only the length of ``positions`` is used: row ``pos`` encodes position ``pos``.
    Even columns hold sines and odd columns hold cosines of different frequencies.

    :param positions: sequence with one entry per position (e.g. the embedded tokens)
    :type positions: list or np.ndarray
    :param d_model: model dimension
    :type d_model: int
    :return: positional encoding matrix
    :rtype: np.ndarray
    """
    result_matrix = np.zeros((len(positions), d_model))
    for pos in range(len(positions)):
        for i in range(d_model):
            if i % 2 == 0:
                # even column i = 2k: sin(pos / 10000^(2k / d_model))
                result_matrix[pos,i] = np.sin(pos / (10000 ** (i / d_model)))
            else:
                # odd column i = 2k + 1: cos(pos / 10000^(2k / d_model))
                result_matrix[pos,i] = np.cos(pos / (10000 ** ((i - 1) / d_model)))
    return result_matrix

# Now we can create the positional encoding matrix


positional_encoding_matrix = positional_encoding(selected_embeddings, len(selected_embeddings[0]))

In [9]:
positional_encoding_matrix

array([[ 0.00000000e+00,  1.00000000e+00,  0.00000000e+00, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 8.41470985e-01,  5.40302306e-01,  8.21856190e-01, ...,
         9.99999994e-01,  1.03663293e-04,  9.99999995e-01],
       [ 9.09297427e-01, -4.16146837e-01,  9.36414739e-01, ...,
         9.99999977e-01,  2.07326584e-04,  9.99999979e-01],
       [ 1.41120008e-01, -9.89992497e-01,  2.45085415e-01, ...,
         9.99999948e-01,  3.10989874e-04,  9.99999952e-01],
       [-7.56802495e-01, -6.53643621e-01, -6.57166863e-01, ...,
         9.99999908e-01,  4.14653159e-04,  9.99999914e-01],
       [-9.58924275e-01,  2.83662185e-01, -9.93854779e-01, ...,
         9.99999856e-01,  5.18316441e-04,  9.99999866e-01]])

In [10]:
positional_encoding_matrix.shape

(6, 512)
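The same encoding, $PE_{(pos, 2k)} = \sin(pos / 10000^{2k/d_{model}})$ and $PE_{(pos, 2k+1)} = \cos(pos / 10000^{2k/d_{model}})$, can also be computed without Python loops. The sketch below is one possible vectorised variant; the function name `positional_encoding_vectorised` is hypothetical, and it assumes that `d_model` is even.

```python
import numpy as np

def positional_encoding_vectorised(seq_len, d_model=512):
    """Sinusoidal positional encoding computed with broadcasting.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even columns
    encoding[:, 1::2] = np.cos(angles)  # odd columns
    return encoding

pe = positional_encoding_vectorised(6, 512)
print(pe.shape)
```

Broadcasting a column of positions against a row of frequencies yields the whole angle matrix in one step, which matters once sequences get longer than a handful of tokens.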

## Task 1.2
Implement a dense layer with the number of hidden units as an argument.

In [11]:
def dense_layer(X:np.array, hidden_units:int):
    """
    Single dense layer with a specified number of hidden units.
    Maps an input of shape (n_rows, input_dim) to an output of shape (n_rows, hidden_units).
    Note that new weights are drawn on every call.

    :param X: input matrix
    :type X: np.array
    :param hidden_units: number of hidden units
    :type hidden_units: int
    :return: output of the layer
    :rtype: np.array
    """
    input_dim = X.shape[1]

    # Glorot normal initialisation of the weights
    weight = init_weights(input_dim, hidden_units)

    bias = np.zeros(hidden_units)

    # matrix product of the input and the weights, plus the bias
    return np.matmul(X, weight) + bias

In [12]:
dense_output = dense_layer(positional_encoding_matrix, 512)
dense_output.shape

(6, 512)
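Because `dense_layer` draws new weights on every call, two calls with the same input give different outputs. A layer that keeps its parameters between calls could look like the following sketch; the class name `Dense` and its `rng` argument are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

class Dense:
    """Dense layer that keeps its Glorot-initialised weights between calls."""

    def __init__(self, input_dim, hidden_units, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        std = np.sqrt(2.0 / (input_dim + hidden_units))
        self.weight = rng.normal(0.0, std, size=(input_dim, hidden_units))
        self.bias = np.zeros(hidden_units)

    def __call__(self, X):
        # (n_rows, input_dim) @ (input_dim, hidden_units) -> (n_rows, hidden_units)
        return X @ self.weight + self.bias

layer = Dense(512, 2048, rng=np.random.default_rng(0))
X = np.ones((6, 512))
out = layer(X)
print(out.shape)
```

Storing the weights on the object makes the forward pass deterministic for a given layer, which is what a trained network would need.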

## Task 1.3
Implement all activation functions such that they are compatible with the dense layer.

In [13]:
# ReLU activation, applied element-wise so that it works on the output of the dense layer.
def activation(hidden_layer:np.array):
    """ReLU activation

    :param hidden_layer: hidden layer for activation
    :type hidden_layer: np.array
    :return: ReLU activated hidden layer
    :rtype: np.array
    """
    return np.maximum(0, hidden_layer)

In [14]:
activation(dense_output)

array([[0.        , 0.34080006, 0.78258567, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.36876947, 1.03083418, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.43418891, 1.09910465, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.44263979, 0.95976103, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.36540362, 0.68448107, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.26594423, 0.39367441, ..., 0.        , 0.        ,
        0.        ]])

## Task 1.4
Implement the skip (residual) connections.

In [15]:
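# Implement the skip (residual) connections.
def skip_connections(hidden_layer, input_layer):
    """Skip (residual) connection: add the sub-layer input to the sub-layer output.

    :param hidden_layer: output of the sub-layer
    :type hidden_layer: np.array
    :param input_layer: input of the sub-layer, same shape as hidden_layer
    :type input_layer: np.array
    :return: element-wise sum of both arrays
    :rtype: np.array
    """
    return hidden_layer + input_layer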

## Task 1.5
Implement layer normalization.

In [16]:
# Implement layer normalization.
def normalisation_layer(hidden_layer):
    """Layer normalisation without learnable gain and bias.

    Each row (position) is normalised over its features, in contrast to
    batch normalisation, which normalises column-wise over the batch.

    :param hidden_layer: matrix of shape (seq_len, d_model)
    :type hidden_layer: np.array
    :return: normalised matrix of the same shape
    :rtype: np.array
    """
    mean = np.mean(hidden_layer, axis=1, keepdims=True)
    variance = np.var(hidden_layer, axis=1, keepdims=True)
    # the small constant avoids division by zero
    return (hidden_layer - mean) / np.sqrt(variance + 1e-8)
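Layer normalisation is usually defined with a learnable gain and bias applied after the normalisation. The sketch below adds both as optional arguments; the names `gamma`, `beta` and `eps` are illustrative assumptions, and with the default values the function only standardises each row.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-6):
    """Normalise each row over the feature axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    if gamma is None:
        gamma = np.ones(x.shape[-1])   # gain, initialised to 1
    if beta is None:
        beta = np.zeros(x.shape[-1])   # bias, initialised to 0
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(6, 512))
y = layer_norm(x)

# Each row should now have mean close to 0 and standard deviation close to 1
print(y.mean(axis=-1), y.std(axis=-1))
```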

## Task 1.6
Implement dropout.

In [17]:
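# Implement dropout
def drop_out(hidden_layer, dropout_rate):
    """Dropout: set each element to zero with probability dropout_rate.

    The kept elements are not rescaled.

    :param hidden_layer: input array
    :type hidden_layer: np.array
    :param dropout_rate: probability of dropping an element, between 0 and 1
    :type dropout_rate: float
    :return: array of the same shape with the dropped elements set to zero
    :rtype: np.array
    """
    # draw a binary keep mask: 1 with probability 1 - dropout_rate
    keep_mask = np.random.binomial(1, 1 - dropout_rate, size=hidden_layer.shape)
    return hidden_layer * keep_mask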
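A common variant is inverted dropout, which divides the kept activations by the keep probability so that the expected value of the output matches the input and no rescaling is needed at inference time. This is a sketch of that variant, not the implementation used above; `inverted_dropout`, `rng` and `training` are hypothetical names.

```python
import numpy as np

def inverted_dropout(x, dropout_rate, rng, training=True):
    """Zero each element with probability dropout_rate and rescale the rest."""
    if not training or dropout_rate == 0.0:
        return x  # dropout is switched off at inference time
    keep_prob = 1.0 - dropout_rate
    keep_mask = rng.random(x.shape) < keep_prob
    return x * keep_mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 512))
dropped = inverted_dropout(x, dropout_rate=0.1, rng=rng)

# The mean stays close to 1 because the kept values are scaled by 1 / 0.9
print(float(dropped.mean()))
```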


## Task 1.7
Implement the attention mechanism.

In [18]:
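def softmax(input_matrix):
    """Row-wise softmax activation

    :param input_matrix: Input matrix of shape (n_rows, n_cols)
    :type input_matrix: np.array
    :return: matrix of the same shape whose rows sum to 1
    :rtype: np.array
    """
    # subtract the row maximum before exponentiating for numerical stability
    shifted = input_matrix - np.max(input_matrix, axis=1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / np.sum(exponentials, axis=1, keepdims=True)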
 
def attention(query, key, value, mask=None, dropout_rate=None):
    """Compute 'Scaled Dot Product Attention'

    :param query: query matrix of shape (seq_len_q, d_k).
    :type query: np.array
    :param key: key matrix of shape (seq_len_k, d_k).
    :type key: np.array
    :param value: value matrix of shape (seq_len_k, d_v).
    :type value: np.array
    :param mask: if True, apply a causal mask to the scores, defaults to None
    :type mask: bool, optional
    :param dropout_rate: dropout rate applied to the attention weights, defaults to None
    :type dropout_rate: float, optional
    :return: output of shape (seq_len_q, d_v).
    :rtype: np.array
    """
    dimension_k = key.shape[1]
    # scaled dot product between every query and every key
    scores = np.matmul(query, key.T) / np.sqrt(dimension_k)
    # `if mask:` so that mask=False does not trigger masking
    if mask:
        scores = mask_attention_scores(scores, mask_value=-np.inf)
    predicted_attention = softmax(scores)
    if dropout_rate is not None:
        predicted_attention = drop_out(predicted_attention, dropout_rate=dropout_rate)
    # weighted sum of the values
    return np.matmul(predicted_attention, value)

# Implement the mask function

def mask_attention_scores(scores, mask_value=-np.inf):
    """
    Apply a causal (lower-triangular) mask to attention scores.

    :param scores: Attention scores of shape (seq_len, seq_len).
    :param mask_value: Value written to the masked (future) positions.

    :returns: Masked attention scores of shape (seq_len, seq_len).
    """
    # Lower-triangular matrix of ones: position i may attend to positions <= i
    mask = np.tril(np.ones(scores.shape), k=0)

    # Set masked positions to mask_value
    return np.where(mask == 0, mask_value, scores)




# implement a multi-head attention layer
def multihead_attention(query, key, value, query_projections, key_projections, value_projections, o_projections, mask:bool = None, dropout_rate:float = None, num_heads:int = 8):
    """
    Compute multi-head attention given query, key, and value matrices.

    :param query: query matrix resulting from the dot product of x and the query weights, of shape (seq_len, d_model).
    :type query: np.array
    :param key: key matrix resulting from the dot product of x and the key weights, of shape (seq_len, d_model).
    :type key: np.array
    :param value: value matrix resulting from the dot product of x and the value weights, of shape (seq_len, d_model).
    :type value: np.array
    :param query_projections: Query projections of shape (d_model, d_model).
    :type query_projections: np.array
    :param key_projections: Key projections of shape (d_model, d_model).
    :type key_projections: np.array
    :param value_projections: Value projections of shape (d_model, d_model).
    :type value_projections: np.array
    :param o_projections: Output projections of shape (d_model, d_model).
    :type o_projections: np.array
    :param mask: Whether to apply a causal mask, defaults to None
    :type mask: bool, optional
    :param dropout_rate: Dropout rate, defaults to None
    :type dropout_rate: float, optional
    :param num_heads: Number of attention heads; must divide d_model.
    :type num_heads: int, optional

    :returns: Output of multi-head attention of shape (seq_len, d_model).
    """
    
    # Project query, key, and value matrices using the learnable projections
    query_projected = np.matmul(query, query_projections)
    key_projected = np.matmul(key, key_projections)
    value_projected = np.matmul(value, value_projections)

    # Split the feature axis into num_heads heads of size d_model / num_heads
    query_heads = np.split(query_projected, num_heads, axis=1)
    key_heads = np.split(key_projected, num_heads, axis=1)
    value_heads = np.split(value_projected, num_heads, axis=1)

    # Compute scaled dot product attention for each head
    head_outputs = []
    for i in range(num_heads):
        head_output = attention(query_heads[i], key_heads[i], value_heads[i], mask, dropout_rate)
        head_outputs.append(head_output)

    # Concatenate the head outputs back to shape (seq_len, d_model)
    outputs = np.concatenate(head_outputs, axis=-1)

    # Apply the output projection
    outputs_projected = np.matmul(outputs, o_projections)

    return outputs_projected


In [19]:
positional_encoding_matrix.shape

(6, 512)

In [20]:
# query_weights, key_weights, value_weights, o_weights = init_weights(512, 512), init_weights(512, 512), init_weights(512, 512), init_weights(512, 512)
# query, key, value = np.matmul(positional_encoding_matrix, query_weights), np.matmul(positional_encoding_matrix, key_weights), np.matmul(positional_encoding_matrix, value_weights)

# attention_output = multihead_attention(query, key, value, query_weights, key_weights, value_weights, o_weights, mask = False, dropout_rate = 0.1, num_heads = 8)
# print(attention_output.shape)
# attention_output

In [21]:

# drop one row of the query matrix and check the output
# query_weights, key_weights, value_weights, o_weights = init_weights(512, 512), init_weights(512, 512), init_weights(512, 512), init_weights(512, 512)
# query, key, value = np.matmul(positional_encoding_matrix, init_weights(512, 512)), np.matmul(positional_encoding_matrix, init_weights(512, 512)), np.matmul(positional_encoding_matrix, init_weights(512, 512))
# attention_masked = multihead_attention(query, key, value, query_weights, key_weights, value_weights, o_weights, mask = True, dropout_rate = 0.1, num_heads = 8)
# print(attention_masked.shape)
# attention_masked
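To see what the causal mask does, it can help to look at the attention weights themselves rather than at the attention output. The sketch below is a self-contained illustration with a hypothetical helper `causal_attention_weights`; it uses random queries and keys for a single head of size 64, and no dropout.

```python
import numpy as np

def causal_attention_weights(query, key):
    """Softmax attention weights with a lower-triangular (causal) mask."""
    seq_len, d_k = query.shape
    scores = query @ key.T / np.sqrt(d_k)

    # True above the diagonal: these are the future positions to hide
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Subtract the row maximum for numerical stability, then normalise
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 64))
k = rng.normal(size=(6, 64))
weights = causal_attention_weights(q, k)
print(np.round(weights, 2))
```

Every row sums to 1, all entries above the diagonal are exactly 0, and the first position can only attend to itself.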

## Task 1.8
Implement the position-wise feed-forward network.

In [22]:
# included in Task 1.14
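For reference, the position-wise feed-forward network computes $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ and applies the same weights to every position independently. The following is a minimal stand-alone sketch; the function name `position_wise_ffn` and the zero biases are assumptions made for illustration and may differ from the class-based version in Task 1.14.

```python
import numpy as np

def position_wise_ffn(x, W_1, b_1, W_2, b_2):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to every position."""
    hidden = np.maximum(0.0, x @ W_1 + b_1)  # (seq_len, d_hidden)
    return hidden @ W_2 + b_2                # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, d_hidden = 512, 2048
glorot_std = np.sqrt(2.0 / (d_model + d_hidden))
W_1 = rng.normal(0.0, glorot_std, size=(d_model, d_hidden))
W_2 = rng.normal(0.0, glorot_std, size=(d_hidden, d_model))
b_1, b_2 = np.zeros(d_hidden), np.zeros(d_model)

x = rng.normal(size=(6, d_model))
out = position_wise_ffn(x, W_1, b_1, W_2, b_2)
print(out.shape)
```

Since no information is mixed across rows, feeding a single position through the network gives the same result as the corresponding row of the full output.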

## Task 1.9
Implement the encoder attention.

In [23]:
# included in Task 1.14

## Task 1.10
Implement the encoder.

In [24]:
# included in Task 1.14


## Task 1.11
Implement the decoder attention.

In [25]:
# included in Task 1.14

## Task 1.12
Implement the encoder-decoder attention.

In [26]:
# included in Task 1.14

## Task 1.13
Implement the decoder.

In [27]:
# included in Task 1.14



## Task 1.14
Implement the transformer architecture (e.g. by creating a Transformer class that includes the steps before).

In [32]:
class Transformer():
    def __init__(self, vocab_size, d_model, attention_layers, d_hidden, dropout=0.1):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.attention_layers = attention_layers
        self.d_hidden = d_hidden
        self.dropout = dropout

        # initialise weights for embedding
        self.W_embedding_encoder = init_weights(vocab_size, d_model)
        self.W_embedding_decoder = init_weights(vocab_size, d_model)

        def layer_weights():
            # fresh list of (d_model, d_model) matrices, one per layer,
            # so that no two parameter sets share the same values
            return [init_weights(d_model, d_model) for _ in range(attention_layers)]

        # initialise weights for encoder attention
        self.W_K_encoder, self.W_Q_encoder, self.W_V_encoder = layer_weights(), layer_weights(), layer_weights()
        self.query_projections_enc, self.key_projections_enc, self.value_projections_enc, self.o_projections_enc = layer_weights(), layer_weights(), layer_weights(), layer_weights()

        # initialise weights for decoder attention
        self.W_K_decoder, self.W_Q_decoder, self.W_V_decoder = layer_weights(), layer_weights(), layer_weights()
        self.query_projections_dec, self.key_projections_dec, self.value_projections_dec, self.o_projections_dec = layer_weights(), layer_weights(), layer_weights(), layer_weights()

        # initialise weights for encoder-decoder attention
        self.W_Q_encoder_decoder, self.W_K_encoder_decoder, self.W_V_encoder_decoder = layer_weights(), layer_weights(), layer_weights()
        self.query_projections_enc_dec, self.key_projections_enc_dec, self.value_projections_enc_dec, self.o_projections_enc_dec = layer_weights(), layer_weights(), layer_weights(), layer_weights()

        # parameters for encoder feed forward
        self.W_1_enc = [init_weights(d_model, d_hidden) for i in range(attention_layers)]
        self.b_1_enc = [init_weights(1, d_hidden) for i in range(attention_layers)]
        self.W_2_enc = [init_weights(d_hidden, d_model) for i in range(attention_layers)]
        self.b_2_enc = [init_weights(1, d_model) for i in range(attention_layers)]

        # parameters for decoder feed forward
        self.W_1_dec = [init_weights(d_model, d_hidden) for i in range(attention_layers)]
        self.b_1_dec = [init_weights(1, d_hidden) for i in range(attention_layers)]
        self.W_2_dec = [init_weights(d_hidden, d_model) for i in range(attention_layers)]
        self.b_2_dec = [init_weights(1, d_model) for i in range(attention_layers)]

        # parameters for the final linear layer; the decoder output is flattened, so the input size
        # depends on the length of the notebook-level array forward_pass_array_output,
        # which must be defined before the class is instantiated
        self.W, self.b = init_weights(len(forward_pass_array_output) * d_model, self.vocab_size), init_weights(1, self.vocab_size)

    def forward(self, input_enc, input_dec, mask=None):
        output_enc = self.encoder(input_enc, n_layers=self.attention_layers)
        # the decoder attends to the encoder output and returns probabilities over the vocabulary
        out_prob = self.decoder(input_dec, output_enc, n_layers=self.attention_layers, mask=mask)
        return out_prob

    def encoder(self, x, n_layers=6):
        x = self.W_embedding_encoder[x]
        x = x + positional_encoding(x, self.d_model)
        x = drop_out(x, self.dropout)  # apply dropout to the sums of the embeddings and the positional encodings

        for i in range(n_layers):
            # get query, key, value
            query_encoder = np.matmul(x, self.W_Q_encoder[i])
            key_encoder = np.matmul(x, self.W_K_encoder[i])
            value_encoder = np.matmul(x, self.W_V_encoder[i])
            self_attention = multihead_attention(query_encoder, key_encoder, value_encoder, self.query_projections_enc[i], self.key_projections_enc[i], self.value_projections_enc[i], self.o_projections_enc[i], mask=None, dropout_rate = self.dropout, num_heads = 8)

            # apply dropout to the output of the sub-layer, before it is added to the sub-layer input and normalized
            # (assigning instead of `x +=` so that the input is added only once)
            x = skip_connections(drop_out(self_attention, self.dropout), x)
            x = normalisation_layer(x)

            # feed forward (dropout is already applied inside feed_forward)
            x = skip_connections(self.feed_forward(x, self.W_1_enc[i], self.b_1_enc[i], self.W_2_enc[i], self.b_2_enc[i]), x)
            x = normalisation_layer(x)
        return x


    def decoder(self, x, y, n_layers=6, mask=None):
        """
        x: input of decoder (token ids),
        y: output of encoder,
        n_layers: number of layers in decoder,
        mask: whether to apply a causal mask in the decoder self-attention
        """
        x = self.W_embedding_decoder[x]
        x = x + positional_encoding(x, self.d_model)
        x = drop_out(x, self.dropout)  # apply dropout to the sums of the embeddings and the positional encodings

        for i in range(n_layers):
            # get query, key, value for decoder attention
            query_decoder = np.matmul(x, self.W_Q_decoder[i])
            key_decoder = np.matmul(x, self.W_K_decoder[i])
            value_decoder = np.matmul(x, self.W_V_decoder[i])
            # decoder attention
            self_attention = multihead_attention(query_decoder, key_decoder, value_decoder, self.query_projections_dec[i], self.key_projections_dec[i], self.value_projections_dec[i], self.o_projections_dec[i], mask=mask, dropout_rate = self.dropout, num_heads = 8)
            # apply dropout to the output of the sub-layer, before it is added to the sub-layer input and normalized
            # (assigning instead of `x +=` so that the input is added only once)
            x = skip_connections(drop_out(self_attention, self.dropout), x)
            x = normalisation_layer(x)

            # get query, key, value for encoder-decoder attention
            query_encoder_decoder = np.matmul(x, self.W_Q_encoder_decoder[i])  # from decoder attention
            key_encoder_decoder = np.matmul(y, self.W_K_encoder_decoder[i])  # from encoder output
            value_encoder_decoder = np.matmul(y, self.W_V_encoder_decoder[i])  # from encoder output
            # encoder-decoder attention
            cross_attention = multihead_attention(query_encoder_decoder, key_encoder_decoder, value_encoder_decoder, self.query_projections_enc_dec[i], self.key_projections_enc_dec[i], self.value_projections_enc_dec[i], self.o_projections_enc_dec[i], mask=None, dropout_rate = self.dropout, num_heads = 8)
            x = skip_connections(drop_out(cross_attention, self.dropout), x)
            x = normalisation_layer(x)

            # feed forward (dropout is already applied inside feed_forward)
            x = skip_connections(self.feed_forward(x, self.W_1_dec[i], self.b_1_dec[i], self.W_2_dec[i], self.b_2_dec[i]), x)
            x = normalisation_layer(x)

        # flatten 2D matrix to 1D vector
        x = x.flatten()
        # linear layer, output of shape (1, vocab_size)
        x = self.linear_layer(x)
        output = softmax(x)

        return output

    
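    def linear_layer(self, x):
        # final projection onto the vocabulary; no dropout here,
        # because zeroing logits would distort the softmax output
        x = np.matmul(x, self.W) + self.b
        return x

    def feed_forward(self, x, W_1, b_1, W_2, b_2):
        # position-wise feed-forward network: max(0, x W_1 + b_1) W_2 + b_2, followed by dropout
        x = np.matmul(x, W_1) + b_1
        x = activation(x)
        x = np.matmul(x, W_2) + b_2
        x = drop_out(x, self.dropout)
        return x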


## Task 1.15
Test the forward pass using the following arrays.

In [33]:
forward_pass_array_input = np.array([1, 40, 50, 60, 17, 12]) # input for encoder
forward_pass_array_output = np.array([12, 48, 50, 63]) # temporary input for decoder

transformer = Transformer(vocab_size=1000, d_model=512, attention_layers=6, d_hidden=2048, dropout=0.1)
output = transformer.forward(forward_pass_array_input, forward_pass_array_output, mask=True)

print(output.max())
print(output.argmax())

0.03173424513326655
171


# Task 2

In the following, we want to use pretrained models. From here on, you are allowed to use any machine learning framework of your choice. Moreover, you will use [Hugging Face](https://huggingface.co/), which is compatible with tensorflow, keras, torch and other machine learning frameworks.

We want to fine-tune a pretrained model to determine whether Yelp reviews are positive or negative. The data set is available for [Tensorflow](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews) and [(py)torch](https://pytorch.org/text/stable/datasets.html#yelpreviewpolarity). Given the text of a review, we want to determine whether the Yelp review is positive or negative. The data set is pre-split into a training and a test set. Please use the training data to fine-tune your model, and use the test data to evaluate your model's performance. This exercise does not necessarily end in having a SOTA model; the goal is for you to use and fine-tune SOTA pretrained large language models.

Problem Setting:

The label $y$ of a Yelp review $T$ is either positive or negative. Given a Yelp review $T$, determine whether its polarity $y$ is positive or negative. The training set is $\mathcal{D} = \{(T_1, y_1), \ldots, (T_N, y_N)\}$, where $T_i$ is review $i$ and $y_i$ is the polarity label of $T_i$. A suitable evaluation metric for this type of problem is $\rightarrow$ see Theory Question 1.

In the following, please solve all subtasks.

## Theory Question 1
Which metric is threshold-independent and suitable for evaluating the problem setting described in Task 2? Please list the pros and cons of three different metrics that might be suitable, define an evaluation protocol, and decide which evaluation suits this problem best.

: 

## Task 2.0
Load the Yelp Review Polarity dataset.

: 

## Task 2.1
Decide on a suitable language model from the HuggingFace model zoo (a library providing pretrained models).

: 

## Task 2.2

For the model to process text in the intended way, we need the tokenizer that was used during training. Luckily, Hugging Face provides both pretrained models and tokenizers. Having decided on a language model in Task 2.1, please load the corresponding tokenizer.

: 

## Task 2.3
Load the language model from HuggingFace.

: 

## Task 2.4
Fine-tune your model on the Yelp Review Polarity training data set. Note: If you have computational limitations consider fine-tuning only on part of the training dataset. 

: 

## Task 2.5
Evaluate your model, following the evaluation protocol you defined in Theory Question 1, on the test part of the Yelp Review Polarity data set.

: 

# Task 3
Visualize and interpret the attention weights of one correctly and one incorrectly classified example from the Yelp Review Polarity test data using [BertViz](https://github.com/jessevig/bertviz)'s model_view.

: 

## Theory Question 2
In your own words, describe how the attention mechanism in a transformer works in the case of self-attention and cross-attention, identifying in each case the keys, queries, and values. Give two examples of alignment models and describe how they affect the output using a simple example. This part of the written report can be done in collaboration with your group.

: 

# Task 4
Please describe your team's implementation of this project, including your personal contribution, in 1000-1500 characters. Each team member must explain the main aspects of the team's implementation, and may not discuss this summary with other students. You are allowed to use figures and tables to clarify. This summary constitutes a separately and individually graded piece of work.

In [None]:

: 