<h1><center>
    Implementing, fine-tuning and visualizing transformer architectures. <br/>
  
    
    Project 1
</center></h1>

# Task 1
Implement the feed-forward pass of the original transformer network using only numpy, i.e. without machine learning frameworks.

Note: All subtasks are voluntary and serve as a guideline for how we would implement the forward pass. You can also implement the parts in a different order, or implement everything in one class or function. The forward pass should return a NumPy array.

Please initialize the weights using Glorot initialization.
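
As a reference for the initialization requested above, here is a minimal sketch of Glorot (Xavier) initialization in NumPy. The helper names `glorot_normal` and `glorot_uniform`, the `rng` argument, and the example shape 512 x 2048 are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) matrix from N(0, 2 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Hypothetical example shape: a 512 -> 2048 projection
W = glorot_normal(512, 2048, np.random.default_rng(0))
print(W.shape)
```

Both variants give the weights the same variance, 2 / (fan_in + fan_out); they differ only in the shape of the distribution.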

In [4]:
# You will test your implementation on a single array:
import numpy as np
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])


## Task 1.1
Implement the sine/cosine positional encoding used in the original paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Implement the token embedding.

In [None]:
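# token embedding matrix: 1000 x 512 random floats, Glorot normal initialization
vocab_size, d_model = 1000, 512
word_embedding = np.random.normal(loc=0.0, scale=np.sqrt(2.0 / (vocab_size + d_model)), size=(vocab_size, d_model))
# select words from the embedding matrix by index
token_embeddings = word_embedding[forward_pass_array]
# matrix 6 x 512
token_embeddings.shape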


In [25]:
def positional_encoding(positions:list, embeddings:list, d_model:int = 512):
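    """
    Return the sinusoidal positional encoding matrix of shape
    (len(positions), len(embeddings)).

    Even columns use the sine and odd columns the cosine, with frequencies
    that decrease as the column index grows.

    :param positions: sequence with one entry per position (only its length is used)
    :type positions: list
    :param embeddings: sequence with one entry per embedding dimension (only its length is used)
    :type embeddings: list
    :param d_model: model dimension
    :type d_model: int
    """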
    result_matrix = np.zeros((len(positions), len(embeddings)))
    for pos in range(len(positions)):
        for i in range(len(embeddings)):
            if i % 2 == 0:
                result_matrix[pos,i] = np.sin(pos / (10000 ** (i / d_model)))
            else:
                result_matrix[pos,i] = np.cos(pos / (10000 ** ((i - 1) / d_model)))
    return result_matrix

# Now we can create the positional encoding matrix
# (toy example: 2 positions, model dimension 6)
positional_encoding_matrix = positional_encoding(list(range(2)), list(range(6)), d_model=6)

In [26]:
positional_encoding_matrix

array([[0.        , 1.        , 0.        , 1.        , 0.        ,
        1.        ],
       [0.84147098, 0.54030231, 0.04639922, 0.99892298, 0.00215443,
        0.99999768]])

In [27]:
positional_encoding_matrix.shape

(2, 6)
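
The nested loops above can also be written without explicit Python loops. The following is a minimal vectorized sketch of the same sine/cosine scheme; the function name `positional_encoding_vectorized` and its `seq_len` argument are assumptions made for this illustration.

```python
import numpy as np

def positional_encoding_vectorized(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # shape (1, d_model)
    # Columns 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])     # even columns: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])     # odd columns: cosine
    return encoding

pe = positional_encoding_vectorized(seq_len=6, d_model=512)
print(pe.shape)
```

For the six-token test array this yields one 512-dimensional encoding per position, which can be added element-wise to a 6 x 512 token embedding matrix.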

## Task 1.2
Implement a dense layer with the number of hidden units as an argument.

In [28]:
def dense_layer(X:np.array, hidden_units:int):
    """
    Single dense layer with a specified number of hidden units.
    Maps an input of shape (n, input_dim) to an output of shape (n, hidden_units).

    :param X: input matrix
    :type X: np.array
    :param hidden_units: number of hidden units
    :type hidden_units: int
    """
    input_dim = X.shape[1]
    
    # Glorot normal initialization of the weights
    weight = np.random.normal(loc=0.0, scale=np.sqrt(2.0 / (input_dim + hidden_units)), size=(input_dim, hidden_units))

    bias = np.zeros(hidden_units)
    
    # matrix multiplication plus bias gives the output of the layer
    return np.matmul(X, weight) + bias

In [29]:
dense_output = dense_layer(positional_encoding_matrix, 512)
dense_output.shape

(2, 512)

## Task 1.3
Implement all activation functions such that they are compatible with the dense layer.

In [19]:
# Implement all activation functions such that they are compatible with the dense layer.
def activation(hidden_layer:np.array):
    """ReLU activation, applied element-wise.

    :param hidden_layer: hidden layer for activation
    :type hidden_layer: np.array
    :return: ReLU-activated hidden layer
    :rtype: np.array
    """
    # ReLU sets all negative entries to zero
    return np.maximum(0, hidden_layer)

In [20]:
activation(dense_output)

array([[0.0158035 , 0.1954243 , 0.11861342, ..., 0.0784443 , 0.04377799,
        0.        ],
       [0.        , 0.25335698, 0.05216577, ..., 0.03803676, 0.02747156,
        0.        ]])

## Task 1.4
Implement the skip (residual) connections.

In [11]:
#Implement the skip (residual) connections.
def skip_connections(hidden_layer, input_layer):
    """Skip (residual) connection: add the sublayer input to its output.

    :param hidden_layer: output of the sublayer
    :type hidden_layer: np.array
    :param input_layer: input of the sublayer, same shape as hidden_layer
    :type input_layer: np.array
    """
    return hidden_layer + input_layer

## Task 1.5
Implement layer normalization.

In [52]:
# Implement layer normalization.
def normalisation_layer(hidden_layer, epsilon=1e-8):
    """Layer normalisation: each row (token) is normalised over its own
    feature dimension, instead of the column-wise normalisation over the
    batch used in batch normalisation.

    :param hidden_layer: input matrix of shape (n, d_model)
    :type hidden_layer: np.array
    :param epsilon: small constant that avoids division by zero
    :type epsilon: float
    """
    # Mean and variance are computed per row
    mean = np.mean(hidden_layer, axis=1, keepdims=True)
    variance = np.var(hidden_layer, axis=1, keepdims=True)
    return (hidden_layer - mean) / np.sqrt(variance + epsilon)
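
In the original paper each sublayer is wrapped as LayerNorm(x + Sublayer(x)), and layer normalization usually carries a learnable gain and bias. The sketch below combines both ideas; the names `layer_norm`, `add_and_norm`, `gamma` and `beta` and the toy shapes are assumptions chosen for this illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalise each row of x over its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_output, gamma, beta):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer_output, gamma, beta)

d_model = 8                                    # toy model dimension
rng = np.random.default_rng(0)
x = rng.normal(size=(3, d_model))              # sublayer input
sub = rng.normal(size=(3, d_model))            # stand-in for a sublayer output
gamma, beta = np.ones(d_model), np.zeros(d_model)
out = add_and_norm(x, sub, gamma, beta)
print(out.mean(axis=-1))                       # each row mean is approximately 0
```

With `gamma` set to ones and `beta` to zeros, as in this forward-pass-only setting, the gain and bias have no effect and the result matches plain layer normalization.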

## Task 1.6
Implement dropout.

In [13]:
# Implement dropout
def dropout(hidden_layer, dropout_rate):
    """Dropout: set each entry to zero with probability dropout_rate.

    :param hidden_layer: input matrix
    :type hidden_layer: np.array
    :param dropout_rate: probability of dropping an entry, between 0 and 1
    :type dropout_rate: float
    """
    # Implement the dropout
    return hidden_layer * np.random.binomial(1, 1 - dropout_rate, size=hidden_layer.shape)
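
A common variant is inverted dropout, which rescales the surviving entries by 1 / (1 - dropout_rate) so that the expected activation stays the same and no rescaling is needed at inference time. The sketch below is one possible way to write it; the function name `inverted_dropout` and the `training` and `rng` arguments are assumptions for this illustration.

```python
import numpy as np

def inverted_dropout(x, dropout_rate, training=True, rng=None):
    """Zero entries with probability dropout_rate and rescale the rest."""
    if not training or dropout_rate == 0.0:
        return x                               # dropout is a no-op at inference
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - dropout_rate
    mask = rng.random(x.shape) < keep_prob     # True where the entry is kept
    return x * mask / keep_prob

x = np.ones((1000, 100))
y = inverted_dropout(x, dropout_rate=0.1, rng=np.random.default_rng(0))
print(y.mean())                                # close to 1.0
```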


## Task 1.7
Implement the attention mechanism.

In [37]:

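def softmax(input_matrix, axis=1):
    """Softmax along the given axis.

    :param input_matrix: input scores
    :type input_matrix: np.ndarray
    :param axis: axis along which the softmax is computed
    :type axis: int
    """
    # Subtract the maximum for numerical stability; the result is unchanged
    shifted = np.exp(input_matrix - np.max(input_matrix, axis=axis, keepdims=True))
    return shifted / np.sum(shifted, axis=axis, keepdims=True)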

# implement a multi-head attention layer
def multihead_attention(query, key, value, num_heads, mask=None):
    """
    Applies multi-head attention to the given query, key, and value matrices.

    :param query: A NumPy array representing the queries. It should have shape `(sequence_length, feature_dim)`.
    :type query: numpy.ndarray
    :param key: A NumPy array representing the keys. It should have shape `(sequence_length, feature_dim)`.
    :type key: numpy.ndarray
    :param value: A NumPy array representing the values. It should have shape `(sequence_length, feature_dim)`.
    :type value: numpy.ndarray
    :param num_heads: An integer representing the number of attention heads to use. `feature_dim` must be divisible by it.
    :type num_heads: int
    :param mask: An optional mask that can be broadcast to `(num_heads, sequence_length, sequence_length)` and contains `0`s
        where elements should be masked and `1`s elsewhere.
    :type mask: numpy.ndarray, optional
    :return: A tuple containing the output of the multi-head attention and the attention weights.
        The output has shape `(sequence_length, feature_dim)`.
        The attention weights have shape `(num_heads, sequence_length, sequence_length)`.
    :rtype: tuple[numpy.ndarray, numpy.ndarray]

    **Example**

    query = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
    key = np.array([[9, 10, 11, 12], [13, 14, 15, 16]])
    value = np.array([[17, 18, 19, 20], [21, 22, 23, 24]])
    num_heads = 2
    output, attention_weights = multihead_attention(query, key, value, num_heads)
    """
    
    # Get the number of features for the query, key, and value tensors
    feature_dim = query.shape[-1]

    # Split the query, key, and value tensors into num_heads separate tensors
    query = np.reshape(query, (-1, num_heads, feature_dim // num_heads))
    key = np.reshape(key, (-1, num_heads, feature_dim // num_heads))
    value = np.reshape(value, (-1, num_heads, feature_dim // num_heads))

    # Transpose the query, key, and value tensors to prepare for matrix multiplication
    query = np.transpose(query, (1, 0, 2))
    key = np.transpose(key, (1, 0, 2))
    value = np.transpose(value, (1, 0, 2))

    # Compute the dot products of the query and key tensors
    dot_products = np.matmul(query, key.transpose(0, 2, 1))

    # Scale the dot products by the square root of the per-head feature dimension
    dot_products = dot_products / np.sqrt(feature_dim // num_heads)

    # Apply the mask (if any)
    if mask is not None:
        dot_products = np.where(mask == 0, -1e9, dot_products)

    # Apply the softmax function along the last dimension
    attention_weights = softmax(dot_products, axis=-1)

    # Compute the weighted sum of the value tensors using the attention weights
    output = np.matmul(attention_weights, value)

    # Transpose and reshape the output tensor to match the shape of the input tensors
    output = np.transpose(output, (1, 0, 2))
    output = np.reshape(output, (-1, feature_dim))

    return output, attention_weights
    

In [38]:

# feature_dim (4) must be divisible by num_heads (2)
query = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
key = np.array([[9, 10, 11, 12], [13, 14, 15, 16]])
value = np.array([[17, 18, 19, 20], [21, 22, 23, 24]])
num_heads = 2
output, attention_weights = multihead_attention(query, key, value, num_heads)
print(output)
print(attention_weights)
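
In the original paper, multi-head attention also applies learned linear projections to the queries, keys and values before splitting them into heads, and a final output projection after concatenating the heads. The following is a minimal self-attention sketch with such projections, using Glorot-initialized random matrices in place of trained weights. The function names, the `rng` argument and the toy input are assumptions for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def glorot(fan_in, fan_out, rng):
    """Glorot normal initialization."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def multihead_self_attention(x, num_heads, rng):
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    # Untrained projection matrices, one each for queries, keys, values, output
    W_q, W_k, W_v, W_o = (glorot(d_model, d_model, rng) for _ in range(4))

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))                  # stand-in for 6 embedded tokens
out, weights = multihead_self_attention(x, num_heads=8, rng=rng)
print(out.shape, weights.shape)
```

Splitting after the projection keeps each head's slice of the feature dimension contiguous, so the reshape and transpose pair is an exact inverse of the concatenation at the end.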

In [14]:
# Implement the attention mechanism.
# softmax function from numpy.



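def softmax(input_matrix):
    """Row-wise softmax of a 2D matrix.

    :param input_matrix: matrix of scores, one row per query
    :type input_matrix: np.ndarray
    """
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = np.exp(input_matrix - np.max(input_matrix, axis=1, keepdims=True))
    return shifted / np.sum(shifted, axis=1, keepdims=True)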
 
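def attention(Q, K, V, dimension_k, mask=None, dropout=None):
    """Compute 'Scaled Dot Product Attention': softmax(Q K^T / sqrt(d_k)) V

    :param Q: query matrix of shape (n_queries, d_k)
    :type Q: np.ndarray
    :param K: key matrix of shape (n_keys, d_k)
    :type K: np.ndarray
    :param V: value matrix of shape (n_keys, d_v)
    :type V: np.ndarray
    :param dimension_k: key dimension d_k used for scaling
    :type dimension_k: int
    :param mask: optional array broadcastable to (n_queries, n_keys), 0 where attention is not allowed
    :type mask: np.ndarray, optional
    :param dropout: optional dropout rate; currently unused
    :type dropout: float, optional
    """
    scores = np.matmul(Q, K.T) / np.sqrt(dimension_k)
    if mask is not None:
        # Masked positions get a large negative score, i.e. a weight close to 0
        scores = np.where(mask == 0, -1e9, scores)
    return np.matmul(softmax(scores), V)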

def decoder_mask(input_matrix:np.ndarray):
    """Mask the attention in the decoder so that the model does not cheat by looking ahead in the sequence.

    :param input_matrix: the attention scores before the softmax is applied
    :type input_matrix: np.ndarray
    """    
    attn_shape = input_matrix.shape
    # True strictly above the diagonal, i.e. at all subsequent positions
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype(bool)
    # A score of -inf becomes an attention weight of 0 after the softmax
    return np.where(subsequent_mask, -np.inf, input_matrix)
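
To illustrate the effect of a look-ahead mask, here is a small self-contained sketch of causal scaled dot-product attention. The function name `causal_attention` and the random toy input are assumptions for this illustration.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal marks the future positions
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Stable softmax over each row; exp(-inf) evaluates to exactly 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))            # toy sequence of 4 tokens
_, weights = causal_attention(Q, K, V)
print(np.round(weights, 2))                    # lower-triangular weight matrix
```

The first row of the weight matrix is always `[1, 0, 0, 0]`, since the first position can only attend to itself.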
    


## Task 1.8
Implement the position-wise feed-forward network.
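
One possible starting point, not a reference solution: the position-wise feed-forward network in the original paper is two dense layers with a ReLU in between, applied to every position independently, with an inner dimension of 2048 for a model dimension of 512. The class name `PositionwiseFeedForward` and the helper `glorot_normal` below are assumptions for this sketch.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Glorot normal initialization."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

class PositionwiseFeedForward:
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row of x."""

    def __init__(self, d_model=512, d_ff=2048, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.W1 = glorot_normal(d_model, d_ff, rng)
        self.b1 = np.zeros(d_ff)
        self.W2 = glorot_normal(d_ff, d_model, rng)
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        hidden = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU
        return hidden @ self.W2 + self.b2

rng = np.random.default_rng(0)
ffn = PositionwiseFeedForward(rng=rng)
x = rng.normal(size=(6, 512))                  # stand-in for 6 token vectors
y = ffn(x)
print(y.shape)
```

Because the same weights are applied to every row, feeding a single position through the network gives the same result as the corresponding row of the full output.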

## Task 1.9
Implement the encoder attention.

## Task 1.10
Implement the encoder.
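
As a rough orientation, an encoder layer chains multi-head self-attention and a position-wise feed-forward network, each followed by a residual connection and layer normalization. The sketch below stacks two such layers with small toy dimensions. It omits dropout and the learnable gain and bias of layer normalization, and every name in it is an assumption made for this illustration rather than a prescribed design.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot(fan_in, fan_out):
    """Glorot normal initialization."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class EncoderLayer:
    def __init__(self, d_model, num_heads, d_ff):
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.W_q, self.W_k, self.W_v, self.W_o = (glorot(d_model, d_model) for _ in range(4))
        self.W1, self.b1 = glorot(d_model, d_ff), np.zeros(d_ff)
        self.W2, self.b2 = glorot(d_ff, d_model), np.zeros(d_model)

    def _split_heads(self, t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(t.shape[0], self.num_heads, self.d_head).transpose(1, 0, 2)

    def self_attention(self, x):
        q, k, v = (self._split_heads(x @ W) for W in (self.W_q, self.W_k, self.W_v))
        weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.d_head))
        heads = weights @ v
        concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)
        return concat @ self.W_o

    def __call__(self, x):
        # Sublayer 1: self-attention, then residual connection and layer norm
        x = layer_norm(x + self.self_attention(x))
        # Sublayer 2: position-wise feed-forward, then residual and layer norm
        ffn = np.maximum(0.0, x @ self.W1 + self.b1) @ self.W2 + self.b2
        return layer_norm(x + ffn)

# Toy sizes; the original paper stacks 6 layers with d_model=512, 8 heads, d_ff=2048
encoder = [EncoderLayer(d_model=64, num_heads=4, d_ff=256) for _ in range(2)]
hidden = rng.normal(size=(6, 64))              # stand-in for embedded tokens
for layer in encoder:
    hidden = layer(hidden)
print(hidden.shape)
```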

## Task 1.11
Implement the decoder attention.

## Task 1.12
Implement the encoder-decoder attention.

## Task 1.13
Implement the decoder.

## Task 1.14
Implement the transformer architecture (e.g. by creating a Transformer class that includes the steps before).

In [15]:
class Transformer():
    

SyntaxError: incomplete input (1776112302.py, line 2)

## Task 1.15
Test the forward pass using the following array.

In [None]:
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])
Transformer(forward_pass_array)

# Task 2

In the following, we want to use pretrained models. From here on, you are allowed to use any machine learning framework of your choice. Moreover, you will use [Hugging Face](https://huggingface.co/), which is compatible with TensorFlow, Keras, PyTorch and other machine learning frameworks.

We want to fine-tune a pretrained model to determine, given the text of a Yelp review, whether the review is positive or negative. The data set is available for [Tensorflow](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews) and [(py)torch](https://pytorch.org/text/stable/datasets.html#yelpreviewpolarity). It is pre-split into a training and a test set. Please use the training data to fine-tune your model and the test data to evaluate your model's performance. The goal of this exercise is not necessarily a SOTA model, but for you to use and fine-tune SOTA pretrained large language models.

Problem Setting:

Given a Yelp review $T$, predict its polarity feedback (label) $y$, which is either positive or negative. The training set is $\mathcal{D} = \{(T_1, y_1), \ldots, (T_N, y_N)\}$, where $T_i$ is review $i$ and $y_i$ is $T_i$'s polarity feedback. A suitable evaluation metric for this type of problem is $\rightarrow$ see Theory Question 1.

In the following, please solve all subtasks.

## Theory Question 1
Which threshold-independent metric is suitable for evaluating the problem setting described in Task 2? Please list the pros and cons of three different metrics that might be suitable, define an evaluation protocol, and decide which evaluation suits this problem best.
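
As an illustration of what threshold independence means, one commonly used metric of this kind is the area under the ROC curve (ROC AUC). It equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are assumptions, and the example is not meant to settle which metric suits the task best.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, np.newaxis] - neg[np.newaxis, :]
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)   # ties count half
    return correct / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]                          # toy labels
scores = [0.1, 0.4, 0.35, 0.8]                 # toy model scores
print(roc_auc(y_true, scores))                 # 0.75
```

The pairwise comparison needs memory proportional to the number of positives times the number of negatives, so for a large test set a rank-based formulation or a library routine would be the more practical choice.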

## Task 2.0
Load the Yelp Review Polarity dataset.

## Task 2.1
Decide on a suitable language model from the HuggingFace model zoo (a library providing pretrained models).

## Task 2.2

For the model to process its input the intended way, we need the tokenizer that was used during training. Luckily, Hugging Face provides both pretrained models and their tokenizers. After deciding on a language model in Task 2.1, please load the corresponding tokenizer.

## Task 2.3
Load the language model from HuggingFace.
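
A minimal sketch of how a tokenizer and a matching classification model can be loaded with the `transformers` library is shown below. The checkpoint name `distilbert-base-uncased` is only an assumed example and not a recommendation from this assignment. The helper names `load_tokenizer_and_model` and `encode_reviews` are hypothetical, and `return_tensors="pt"` assumes that PyTorch is installed.

```python
MODEL_NAME = "distilbert-base-uncased"  # assumed example checkpoint

def load_tokenizer_and_model(model_name=MODEL_NAME, num_labels=2):
    """Load a pretrained tokenizer and a model with a 2-class output head."""
    # Imported here so that defining the helper does not require the library
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model

def encode_reviews(tokenizer, texts, max_length=256):
    """Tokenize a list of review texts into padded, truncated tensors."""
    return tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=max_length,
        return_tensors="pt",  # assumption: PyTorch backend
    )
```

Calling `tokenizer, model = load_tokenizer_and_model()` downloads the checkpoint on first use. Loading both from the same checkpoint name ensures that the token ids produced by the tokenizer match the vocabulary the model was pretrained with.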

## Task 2.4
Fine-tune your model on the Yelp Review Polarity training data set. Note: If you have computational limitations consider fine-tuning only on part of the training dataset. 

## Task 2.5
Evaluate your model, following the evaluation protocol you defined in Theory Question 1, on the test part of the Yelp Review Polarity data set.

# Task 3
Visualize and interpret the attention weights of one correctly and one incorrectly classified example from the Yelp Review Polarity test data using [BertViz](https://github.com/jessevig/bertviz)'s model_view.

## Theory Question 2
In your own words, describe how the attention mechanism in a transformer works in the case of self-attention and cross-attention, identifying in each case the keys, queries, and values. Give two examples of alignment models and describe how they affect the output using a simple example. This part of the written report can be done in collaboration with your group.

# Task 4
Please describe your team's implementation of this project, including your personal contribution, in 1000-1500 characters. Each team member must explain the main aspects of the team's implementation, and may not discuss this summary with other students. You are allowed to use figures and tables to clarify. This summary constitutes a separately and individually graded piece of work.