<h1><center>
    Implementing, fine-tuning and visualizing transformer architectures. <br/>
  
    
    Project 1
</center></h1>

# Task 1
Implement the feed-forward pass of the original transformer network using only NumPy, i.e. without machine learning frameworks.

Note: All subtasks are voluntary and serve as a guideline for how we would implement the forward pass. You can also implement the parts in a different order or put everything into one class/function. The forward pass should return a NumPy array.

Please initialize the weights using Glorot initialization.

In [1]:
# You will test your implementation on a single array:
import numpy as np
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])


In [5]:
def initialise_glorot_weights(y_rows, x_cols):
    """
    Initialise the weights of a layer with Glorot normal initialisation.

    :param y_rows: The number of inputs to the layer (rows of the weight matrix).
    :type y_rows: int
    :param x_cols: The number of outputs of the layer (columns of the weight matrix).
    :type x_cols: int
    :return: The initialised weights of shape (y_rows, x_cols).
    :rtype: np.ndarray
    """
    # Glorot normal: zero mean, standard deviation sqrt(2 / (fan_in + fan_out))
    return np.random.normal(loc=0.0, scale=np.sqrt(2.0 / (y_rows + x_cols)), size=(y_rows, x_cols))

In [8]:
# token embedding matrix: 1000 x 512 random floats, Glorot normal
word_embeddings = initialise_glorot_weights(1000, 512)
word_embeddings.shape

(1000, 512)

In [26]:
selected_embeddings = word_embeddings[forward_pass_array]

## Task 1.1
Implement the sine/cosine positional encoding used in the original paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Implement the token embedding.

In [27]:
def positional_encoding(positions:list, embeddings:list, d_model:int = 512):
    """
    Return the sinusoidal positional encoding matrix of shape (len(positions), len(embeddings)).
    Even columns use the sine and odd columns the cosine of the same frequency.

    :param positions: sequence with one entry per token; only its length is used
    :type positions: list
    :param embeddings: a single embedding vector; only its length is used
    :type embeddings: list
    :param d_model: model dimension
    :type d_model: int
    :return: positional encoding matrix
    :rtype: np.ndarray
    """
    result_matrix = np.zeros((len(positions), len(embeddings)))
    for pos in range(len(positions)):
        for i in range(len(embeddings)):
            if i % 2 == 0:
                result_matrix[pos,i] = np.sin(pos / (10000 ** (i / d_model)))
            else:
                result_matrix[pos,i] = np.cos(pos / (10000 ** ((i - 1) / d_model)))
    return result_matrix

# Now we can create the positional encoding matrix


positional_encoding_matrix = positional_encoding(selected_embeddings, selected_embeddings[0], len(selected_embeddings[0]))

In [28]:
positional_encoding_matrix

array([[ 0.00000000e+00,  1.00000000e+00,  0.00000000e+00, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 8.41470985e-01,  5.40302306e-01,  8.21856190e-01, ...,
         9.99999994e-01,  1.03663293e-04,  9.99999995e-01],
       [ 9.09297427e-01, -4.16146837e-01,  9.36414739e-01, ...,
         9.99999977e-01,  2.07326584e-04,  9.99999979e-01],
       [ 1.41120008e-01, -9.89992497e-01,  2.45085415e-01, ...,
         9.99999948e-01,  3.10989874e-04,  9.99999952e-01],
       [-7.56802495e-01, -6.53643621e-01, -6.57166863e-01, ...,
         9.99999908e-01,  4.14653159e-04,  9.99999914e-01],
       [-9.58924275e-01,  2.83662185e-01, -9.93854779e-01, ...,
         9.99999856e-01,  5.18316441e-04,  9.99999866e-01]])

In [29]:
positional_encoding_matrix.shape

(6, 512)
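
The double loop above can also be written without Python loops. The following is a minimal sketch, not the reference solution; the function name `positional_encoding_vectorised` is hypothetical and the code assumes an even `d_model`.

```python
import numpy as np

def positional_encoding_vectorised(seq_len: int, d_model: int = 512) -> np.ndarray:
    """Sinusoidal positional encoding computed with broadcasting (assumes even d_model)."""
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even columns
    encoding[:, 1::2] = np.cos(angles)  # odd columns reuse the frequency of the preceding even column
    return encoding

pe = positional_encoding_vectorised(6, 512)
print(pe.shape)
```

For six positions and `d_model = 512` this should reproduce the matrix printed above, for example `sin(1)` and `cos(1)` in the first two columns of the second row.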

## Task 1.2
Implement a dense layer with the number of hidden units as an argument.

In [13]:
def dense_layer(X:np.array, hidden_units:int):
    """
    Single dense layer with a specified number of hidden units.
    Maps an input of shape (seq_len, input_dim) to an output of shape (seq_len, hidden_units).
    Note that the weights are re-initialised on every call.

    :param X: input matrix
    :type X: np.array
    :param hidden_units: number of hidden units
    :type hidden_units: int
    :return: output of the layer
    :rtype: np.array
    """
    input_dim = X.shape[1]

    # Glorot normal initialisation of the weights
    weight = initialise_glorot_weights(input_dim, hidden_units)

    bias = np.zeros(hidden_units)

    # the matrix product of input and weights plus the bias gives the layer output
    return np.matmul(X, weight) + bias

In [14]:
dense_output = dense_layer(positional_encoding_matrix, 512)
dense_output.shape

(6, 512)

## Task 1.3
Implement all activation functions such that they are compatible with the dense layer.

In [15]:
def activation(hidden_layer:np.array):
    """ReLU activation, applied element-wise.

    :param hidden_layer: output of a dense layer
    :type hidden_layer: np.array
    :return: ReLU-activated hidden layer
    :rtype: np.array
    """
    return np.maximum(0, hidden_layer)

In [16]:
activation(dense_output)

array([[0.0386763 , 0.16269847, 0.        , ..., 0.13063814, 0.35234958,
        0.        ],
       [0.33376418, 0.19914226, 0.        , ..., 0.        , 0.50838357,
        0.        ],
       [0.56363832, 0.21748447, 0.        , ..., 0.        , 0.50922896,
        0.        ],
       [0.6138117 , 0.2571192 , 0.        , ..., 0.        , 0.42028939,
        0.        ],
       [0.47351487, 0.29935418, 0.        , ..., 0.        , 0.3563683 ,
        0.        ],
       [0.22648588, 0.3043982 , 0.        , ..., 0.        , 0.40995253,
        0.        ]])

## Task 1.4
Implement the skip (residual) connections.

In [17]:
def skip_connections(hidden_layer, input_layer):
    """Skip (residual) connection: add the sublayer input to the sublayer output.

    :param hidden_layer: output of the sublayer
    :type hidden_layer: np.array
    :param input_layer: input of the sublayer, same shape as hidden_layer
    :type input_layer: np.array
    :return: element-wise sum of both matrices
    :rtype: np.array
    """
    return hidden_layer + input_layer

## Task 1.5
Implement layer normalization.

In [18]:
def normalisation_layer(hidden_layer, epsilon=1e-8):
    """Layer normalisation: each row (token) is normalised over its features,
    in contrast to batch normalisation, which normalises column-wise over the batch.

    :param hidden_layer: matrix of shape (seq_len, d_model)
    :type hidden_layer: np.array
    :param epsilon: small constant that avoids division by zero
    :type epsilon: float
    :return: normalised matrix with zero mean and unit variance per row
    :rtype: np.array
    """
    # mean and variance are computed per row, so they are not passed in as arguments
    mean = np.mean(hidden_layer, axis=1, keepdims=True)
    variance = np.var(hidden_layer, axis=1, keepdims=True)
    return (hidden_layer - mean) / np.sqrt(variance + epsilon)
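
In the original architecture, the skip connection and the layer normalisation are applied together around every sublayer, i.e. `LayerNorm(x + Sublayer(x))`. The following is a minimal sketch of that combination; the names `layer_norm` and `add_and_norm` and the learnable `gain` and `bias` vectors are assumptions for illustration and do not appear in the cells above.

```python
import numpy as np

def layer_norm(x, gain, bias, epsilon=1e-6):
    """Normalise every row over its features, then rescale with gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(variance + epsilon) + bias

def add_and_norm(x, sublayer_output, gain, bias):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer_output, gain, bias)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))                # sublayer input
sublayer_output = rng.normal(size=(6, 512))  # stand-in for an attention or feed-forward output
gain, bias = np.ones(512), np.zeros(512)     # identity rescaling at initialisation

normalised = add_and_norm(x, sublayer_output, gain, bias)
print(normalised.shape)
```

With `gain` set to ones and `bias` set to zeros, every row of the result has approximately zero mean and unit standard deviation.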

## Task 1.6
Implement dropout.

In [19]:
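def dropout(hidden_layer, dropout_rate):
    """Inverted dropout: zero each entry with probability dropout_rate and
    rescale the remaining entries so that the expected value is unchanged.

    :param hidden_layer: matrix to apply dropout to
    :type hidden_layer: np.array
    :param dropout_rate: probability of dropping an entry, in [0, 1)
    :type dropout_rate: float
    :return: matrix of the same shape with dropped entries set to zero
    :rtype: np.array
    """
    keep_probability = 1 - dropout_rate
    mask = np.random.binomial(1, keep_probability, size=hidden_layer.shape)
    return hidden_layer * mask / keep_probability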


## Task 1.7
Implement the attention mechanism.

In [36]:
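def softmax(input_matrix):
    """Row-wise softmax.

    :param input_matrix: matrix of scores, one row per query position
    :type input_matrix: np.array
    :return: matrix of the same shape whose rows sum to 1
    :rtype: np.array
    """
    # subtract the row maximum for numerical stability; the result is unchanged
    exponentials = np.exp(input_matrix - np.max(input_matrix, axis=1, keepdims=True))
    return exponentials / np.sum(exponentials, axis=1, keepdims=True)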
 
def attention(query, key, value, mask=None, dropout=None):
    """Compute 'Scaled Dot Product Attention'

    :param query: query matrix resulting from the product of x and the query weights
    :type query: np.array
    :param key: key matrix resulting from the product of x and the key weights
    :type key: np.array
    :param value: value matrix resulting from the product of x and the value weights
    :type value: np.array
    :param mask: boolean matrix of shape (seq_len, seq_len); True marks positions that must not be attended to, defaults to None
    :type mask: np.array, optional
    :param dropout: dropout rate; currently unused, defaults to None
    :type dropout: float, optional
    :return: output of shape (seq_len, d_v), where d_v is the number of columns of value.
    :rtype: np.array
    """
    dimension_k = key.shape[1]
    scores = np.matmul(query, key.T) / np.sqrt(dimension_k)
    if mask is not None:
        # masked positions get a score of -inf and therefore a weight of 0 after the softmax
        scores = np.where(mask, -np.inf, scores)
    return np.matmul(softmax(scores), value)

def decoder_mask(input_matrix:np.ndarray):
    """Mask the attention scores in the decoder so that the model cannot cheat by looking ahead in the sequence.

    :param input_matrix: square matrix of attention scores before the softmax
    :type input_matrix: np.ndarray
    :return: scores with all entries above the diagonal set to -inf
    :rtype: np.ndarray
    """
    attn_shape = input_matrix.shape
    # ones above the main diagonal mark the future positions
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return np.where(subsequent_mask == 1, -np.inf, input_matrix)

# implement a multi-head attention layer
def multihead_attention(query, key, value, d_model:int = 512, num_heads:int = 8):
    """
    Compute multi-head attention given query, key, and value matrices.

    :param query: Query matrix of shape (seq_len, d_model).
    :param key: Key matrix of shape (seq_len, d_model).
    :param value: Value matrix of shape (seq_len, d_model).
    :param d_model: Model dimensionality.
    :param num_heads: Number of heads for attention.

    :returns: Output of multi-head attention of shape (seq_len, d_model).
    """
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

    # Split matrices column-wise into one block of shape (seq_len, d_model / num_heads) per head
    query_heads = np.array(np.split(query, num_heads, axis=1))
    key_heads = np.array(np.split(key, num_heads, axis=1))
    value_heads = np.array(np.split(value, num_heads, axis=1))

    # Compute scaled dot product attention for each head
    head_outputs = []
    for i in range(num_heads):
        query_i = query_heads[i]
        key_i = key_heads[i]
        value_i = value_heads[i]

        # Attention weights of shape (seq_len, seq_len), computed from this head's slices only
        dimension_k = key_i.shape[1]
        attention_weights = softmax(np.matmul(query_i, key_i.T) / np.sqrt(dimension_k))

        # Weighted sum over the value rows, shape (seq_len, d_model / num_heads)
        head_output = np.matmul(attention_weights, value_i)

        head_outputs.append(head_output)

    # Concatenate head outputs back to shape (seq_len, d_model)
    outputs = np.concatenate(head_outputs, axis=-1)

    # Optional output projection with Glorot-initialised weights:
    # W = initialise_glorot_weights(d_model, d_model)
    # outputs = np.matmul(outputs, W)

    return outputs


In [37]:
positional_encoding_matrix.shape

(6, 512)

In [38]:
query_weights, key_weights, value_weights = initialise_glorot_weights(512, 512), initialise_glorot_weights(512, 512), initialise_glorot_weights(512, 512)
query, key, value = np.matmul(positional_encoding_matrix, query_weights), np.matmul(positional_encoding_matrix, key_weights), np.matmul(positional_encoding_matrix, value_weights)
# store the result under a new name so that the attention function is not overwritten
attention_output = multihead_attention(query, key, value)
attention_output.shape

(6, 512)

In [14]:
# Variant of the scaled dot-product attention that takes the key dimension as an explicit argument.
# It relies on the softmax function defined above.
def attention(Q, K, V, dimension_k, mask=None, dropout=None):
    """Compute 'Scaled Dot Product Attention'

    :param Q: query matrix of shape (seq_len, d_k)
    :type Q: np.array
    :param K: key matrix of shape (seq_len, d_k)
    :type K: np.array
    :param V: value matrix of shape (seq_len, d_v)
    :type V: np.array
    :param dimension_k: dimension d_k of the keys, used to scale the scores
    :type dimension_k: int
    :param mask: currently unused, defaults to None
    :param dropout: currently unused, defaults to None
    :return: output of shape (seq_len, d_v)
    :rtype: np.array
    """
    return np.matmul(softmax(np.matmul(Q, K.T) / np.sqrt(dimension_k)), V)
    


## Task 1.8
Implement the position-wise feed-forward network.
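
One possible shape for this subtask is sketched below, assuming the paper's formulation `FFN(x) = max(0, x W1 + b1) W2 + b2` with an inner dimension of 2048. The class name `PositionwiseFeedForward`, the helper `glorot_normal` and the seeds are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def glorot_normal(rows, cols, rng):
    """Glorot normal initialisation (hypothetical helper for this sketch)."""
    return rng.normal(0.0, np.sqrt(2.0 / (rows + cols)), size=(rows, cols))

class PositionwiseFeedForward:
    """Two dense layers with a ReLU in between, applied to every position independently."""

    def __init__(self, d_model=512, d_ff=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = glorot_normal(d_model, d_ff, rng)
        self.b1 = np.zeros(d_ff)
        self.w2 = glorot_normal(d_ff, d_model, rng)
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        hidden = np.maximum(0.0, x @ self.w1 + self.b1)  # expand to d_ff and apply ReLU
        return hidden @ self.w2 + self.b2                # project back to d_model

ffn = PositionwiseFeedForward()
x = np.random.default_rng(1).normal(size=(6, 512))
out = ffn(x)
print(out.shape)
```

Because the weights are stored on the object, repeated calls use the same parameters, and feeding a single row gives the same result as the corresponding row of the full output.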

## Task 1.9
Implement the encoder attention.

## Task 1.10
Implement the encoder.
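
A compact sketch of how the earlier pieces could be combined into an encoder follows. It assumes the paper's stack of six identical layers, each consisting of multi-head self-attention and a feed-forward network, both wrapped in a residual connection and layer normalisation. Biases, dropout and the learnable layer-norm parameters are left out for brevity, and all names (`EncoderLayer`, `encoder`, `glorot`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot(rows, cols):
    """Glorot normal initialisation (hypothetical helper for this sketch)."""
    return rng.normal(0.0, np.sqrt(2.0 / (rows + cols)), size=(rows, cols))

def softmax(x):
    exponentials = np.exp(x - x.max(axis=-1, keepdims=True))
    return exponentials / exponentials.sum(axis=-1, keepdims=True)

def layer_norm(x, epsilon=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + epsilon)

class EncoderLayer:
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        self.num_heads = num_heads
        self.w_q, self.w_k, self.w_v, self.w_o = (glorot(d_model, d_model) for _ in range(4))
        self.w_1, self.w_2 = glorot(d_model, d_ff), glorot(d_ff, d_model)

    def self_attention(self, x):
        seq_len, d_model = x.shape
        d_head = d_model // self.num_heads

        def split(matrix):
            # (seq_len, d_model) -> (num_heads, seq_len, d_head)
            return matrix.reshape(seq_len, self.num_heads, d_head).transpose(1, 0, 2)

        q, k, v = split(x @ self.w_q), split(x @ self.w_k), split(x @ self.w_v)
        weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (num_heads, seq_len, seq_len)
        heads = weights @ v                                            # (num_heads, seq_len, d_head)
        concatenated = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
        return concatenated @ self.w_o

    def __call__(self, x):
        x = layer_norm(x + self.self_attention(x))               # sublayer 1: self-attention
        feed_forward = np.maximum(0.0, x @ self.w_1) @ self.w_2  # sublayer 2: feed-forward network
        return layer_norm(x + feed_forward)

def encoder(x, num_layers=6):
    """Pass the input through a stack of freshly initialised encoder layers."""
    for layer in [EncoderLayer(d_model=x.shape[1]) for _ in range(num_layers)]:
        x = layer(x)
    return x

x = rng.normal(size=(6, 512))  # stand-in for token embeddings plus positional encoding
encoded = encoder(x)
print(encoded.shape)
```

The input and output shapes match, which is what allows the layers to be stacked.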

## Task 1.11
Implement the decoder attention.
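
The decoder's self-attention differs from the encoder's in that each position may only attend to itself and to earlier positions. The sketch below illustrates this with a single head; the function name `masked_self_attention` and the random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    exponentials = np.exp(x - x.max(axis=-1, keepdims=True))
    return exponentials / exponentials.sum(axis=-1, keepdims=True)

def masked_self_attention(query, key, value):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    seq_len, d_k = key.shape
    scores = query @ key.T / np.sqrt(d_k)
    # True above the main diagonal marks future positions
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # -inf becomes a weight of 0 after the softmax
    weights = softmax(scores)
    return weights @ value, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 64)) for _ in range(3))
output, weights = masked_self_attention(q, k, v)
print(np.round(weights[:3, :3], 2))
```

The printed weights are lower triangular, and the first position can only attend to itself, so its output equals the first row of `v`.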

## Task 1.12
Implement the encoder-decoder attention.
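
In encoder-decoder (cross) attention, the queries come from the decoder while the keys and values come from the encoder output, so the two sequences may have different lengths. The following single-head sketch uses hypothetical names (`cross_attention`, `decoder_states`, `encoder_output`) and random data purely for illustration.

```python
import numpy as np

def softmax(x):
    exponentials = np.exp(x - x.max(axis=-1, keepdims=True))
    return exponentials / exponentials.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    """Single-head attention of decoder positions over encoder positions."""
    query = decoder_states @ w_q  # (target_len, d_k)
    key = encoder_output @ w_k    # (source_len, d_k)
    value = encoder_output @ w_v  # (source_len, d_k)
    weights = softmax(query @ key.T / np.sqrt(key.shape[1]))  # (target_len, source_len)
    return weights @ value, weights

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
scale = np.sqrt(2.0 / (d_model + d_k))  # Glorot normal standard deviation
w_q, w_k, w_v = (rng.normal(0.0, scale, size=(d_model, d_k)) for _ in range(3))

encoder_output = rng.normal(size=(6, d_model))  # six source tokens
decoder_states = rng.normal(size=(4, d_model))  # four target tokens
context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

No look-ahead mask is applied here, since every decoder position is allowed to see the whole source sequence.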

## Task 1.13
Implement the decoder.

## Task 1.14
Implement the transformer architecture (e.g. by creating a Transformer class that includes the steps before).

In [15]:
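class Transformer():
    """Placeholder for the full architecture; encoder and decoder are not wired up yet."""

    def __init__(self, input_array):
        # keep the token ids so that a forward pass can be added later
        self.input_array = input_array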

## Task 1.15
Test the forward pass using the following array.

In [None]:
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])
Transformer(forward_pass_array)

# Task 2

In the following, we want to use pretrained models. From here on, you are allowed to use any machine learning framework of your choice. Moreover, you will use [Hugging Face](https://huggingface.co/), which is compatible with TensorFlow, Keras, PyTorch and other machine learning frameworks.

We want to fine-tune a pretrained model to determine whether Yelp reviews are positive or negative. The data set is available for [Tensorflow](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews) and [(py)torch](https://pytorch.org/text/stable/datasets.html#yelpreviewpolarity). Given the text of a review, we want to determine whether the Yelp review is positive or negative. The data set is pre-split into a training and a test set. Please use the training data to fine-tune your model and the test data to evaluate your model's performance. The goal of this exercise is not necessarily a SOTA model but for you to use and fine-tune SOTA pretrained large language models.

Problem Setting:

The label $y$ of a Yelp review $T$ is either positive or negative. Given a Yelp review $T$, determine its polarity $y$. The training set is $\mathcal{D} = \{(T_1, y_1), \ldots, (T_N, y_N)\}$, where $T_i$ is review $i$ and $y_i$ is $T_i$'s polarity feedback. A suitable evaluation metric for this type of problem is $\rightarrow$ see Theory Question 1.

In the following, please solve all subtasks.

## Theory Question 1
Which metric for evaluating the problem setting described in Task 2 is threshold-independent? Please list pros and cons of three different metrics that might be suitable, define an evaluation protocol and decide which evaluation suits this problem best.
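
To make the notion of threshold independence concrete, the sketch below computes the area under the ROC curve, one candidate metric, directly from predicted scores without choosing a decision threshold. It uses the pairwise-ranking formulation; the function name `roc_auc` and the toy data are assumptions for illustration, and the pairwise comparison is only practical for small samples.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random positive example scores higher than a random negative one."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels == 1]
    negatives = scores[labels == 0]
    # compare every positive score with every negative score; ties count as half a win
    differences = positives[:, np.newaxis] - negatives[np.newaxis, :]
    wins = (differences > 0).sum() + 0.5 * (differences == 0).sum()
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.65, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 8 of the 9 positive/negative pairs are ranked correctly
```

Metrics such as accuracy or F1, in contrast, require the scores to be turned into hard labels first and therefore depend on the chosen threshold.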

## Task 2.0
Load the Yelp Review Polarity dataset.

## Task 2.1
Decide on a suitable language model from the HuggingFace model zoo (a library providing pretrained models).

## Task 2.2

For the model to process text the intended way, we need the tokenizer that was used during training. Luckily, Hugging Face provides both pretrained models and tokenizers. After deciding on a language model in Task 2.1, please load the corresponding tokenizer.

## Task 2.3
Load the language model from HuggingFace.
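
Tasks 2.0 to 2.3 could be combined as in the following sketch, which assumes the Hugging Face `datasets` and `transformers` packages are installed and that the data set is available on the Hub under the id `yelp_polarity`. The checkpoint `distilbert-base-uncased`, the function name `load_yelp_and_model`, the subset size and the maximum length are illustrative assumptions, not recommendations from the assignment.

```python
MODEL_NAME = "distilbert-base-uncased"  # assumption: any sequence-classification checkpoint could be used

def load_yelp_and_model(model_name=MODEL_NAME, train_size=2000, seed=0):
    """Load a training subset, the test split, the tokenizer and the model."""
    # imported lazily so that the heavy dependencies are only needed when the function is called
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    dataset = load_dataset("yelp_polarity")  # splits "train" and "test" with columns "text" and "label"
    train_subset = dataset["train"].shuffle(seed=seed).select(range(train_size))

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    def tokenize(batch):
        # truncate long reviews to keep memory use manageable
        return tokenizer(batch["text"], truncation=True, max_length=256)

    train_tokens = train_subset.map(tokenize, batched=True)
    test_tokens = dataset["test"].map(tokenize, batched=True)
    return model, tokenizer, train_tokens, test_tokens
```

Calling `load_yelp_and_model()` downloads the data and the checkpoint on first use; the returned objects can then be passed to a training loop of your choice.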

## Task 2.4
Fine-tune your model on the Yelp Review Polarity training data set. Note: If you have computational limitations, consider fine-tuning on only part of the training data set.

## Task 2.5
Evaluate your model, following the evaluation protocol you defined in Theory Question 1, on the test part of the Yelp Review Polarity data set.

# Task 3
Visualize and interpret the attention weights of one correctly and one incorrectly classified example from the Yelp Review Polarity test data using [BertViz](https://github.com/jessevig/bertviz)'s model_view.

## Theory Question 2
In your own words, describe how the attention mechanism in a transformer works in the case of self-attention and cross-attention, identifying in each case the keys, queries, and values. Give two examples of alignment models and describe how they affect the output using a simple example. This part of the written report can be done in collaboration with your group.

# Task 4
Please describe your team's implementation of this project, including your personal contribution, in 1000-1500 characters. Each team member must explain the main aspects of the team's implementation, and may not discuss this summary with other students. You are allowed to use figures and tables to clarify. This summary constitutes a separately and individually graded piece of work.