<h1><center>
    Implementing, fine-tuning and visualizing transformer architectures. <br/>
  
    
    Project 1
</center></h1>

# Task 1
Implement the feed-forward pass of the original transformer network using only numpy, i.e. without machine learning frameworks.

Note: All subtasks are voluntary and serve as a guideline for how we would implement the forward pass. You can also implement the parts in a different order, or implement everything in one class/function. The forward pass should return a numpy array.

Please initialize the weights using Glorot initialization.

In [4]:
# You will test your implementation on a single array:
import numpy as np
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])

In [5]:
def initialise_layers(shape):
    """
    Initialise a square weight matrix for a numpy dense layer using Glorot (normal) initialization.

    :param shape: number of input units, which is also the number of output units
    :type shape: int
    :return: weight matrix of shape (shape, shape)
    :rtype: numpy.ndarray
    """
    # Glorot normal initialization: std = sqrt(2 / (fan_in + fan_out)), here fan_in == fan_out == shape.
    # A single vectorised draw is enough; re-sampling every entry in a loop is not needed.
    weights = np.random.normal(loc=0.0, scale=np.sqrt(2 / (shape + shape)), size=(shape, shape))
    return weights


In [6]:
initialise_layers(512).shape

(512, 512)

## Task 1.1
Implement the sine/cosine positional encoding used in the original paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Implement the token embedding.

In [7]:
def positional_encoding(pos:int, i:int, d_model:int):
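    """
    Sine/cosine positional encoding for a single position and a single dimension.
    Even dimensions use the sine and odd dimensions the cosine; each pair (2k, 2k+1) shares one frequency.

    :param pos: position of the token in the sequence
    :type pos: int
    :param i: index of the embedding dimension (0 <= i < d_model)
    :type i: int
    :param d_model: model (embedding) dimension
    :type d_model: int
    :return: value of the positional encoding at (pos, i)
    :rtype: float
    """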
    if i % 2 == 0:
        return np.sin(pos / (10000 ** (i / d_model)))
    else:
        return np.cos(pos / (10000 ** ((i - 1) / d_model)))

# Now we can create the positional encoding matrix
positional_encoding_matrix = np.array([[positional_encoding(pos, i, 512) for i in range(512)] for pos in range(512)])
positional_encoding_matrix


array([[ 0.00000000e+00,  1.00000000e+00,  0.00000000e+00, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 8.41470985e-01,  5.40302306e-01,  8.21856190e-01, ...,
         9.99999994e-01,  1.03663293e-04,  9.99999995e-01],
       [ 9.09297427e-01, -4.16146837e-01,  9.36414739e-01, ...,
         9.99999977e-01,  2.07326584e-04,  9.99999979e-01],
       ...,
       [ 6.19504237e-02,  9.98079228e-01,  7.98205653e-01, ...,
         9.98504463e-01,  5.27401358e-02,  9.98608271e-01],
       [ 8.73326668e-01,  4.87135024e-01,  9.49807649e-01, ...,
         9.98498582e-01,  5.28436545e-02,  9.98602798e-01],
       [ 8.81770401e-01, -4.71678874e-01,  2.83995701e-01, ...,
         9.98492690e-01,  5.29471727e-02,  9.98597315e-01]])
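
The task also asks for a token embedding. A minimal sketch of one possible approach follows: a Glorot-initialised lookup table whose rows are scaled by $\sqrt{d_{model}}$, as described in the paper, and then summed with a vectorised version of the positional encoding. The names `positional_encoding_table` and `embed_tokens` and the vocabulary size of 1000 are assumptions made for illustration; they are not given by the task.

```python
import numpy as np

def positional_encoding_table(max_len, d_model):
    """Vectorised sine/cosine positional encodings of shape (max_len, d_model)."""
    positions = np.arange(max_len)[:, np.newaxis]   # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions (2k, 2k+1) shares the exponent 2k / d_model.
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions
    table[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions
    return table

def embed_tokens(token_ids, embedding_table):
    """Look up token vectors, scale by sqrt(d_model) and add positional encodings."""
    d_model = embedding_table.shape[1]
    token_vectors = embedding_table[token_ids] * np.sqrt(d_model)
    return token_vectors + positional_encoding_table(len(token_ids), d_model)

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512   # vocab_size is an assumed value
limit = np.sqrt(6.0 / (vocab_size + d_model))   # Glorot uniform bound
embedding_table = rng.uniform(-limit, limit, size=(vocab_size, d_model))

token_ids = np.array([101, 400, 500, 600, 107, 102])
embedded = embed_tokens(token_ids, embedding_table)
print(embedded.shape)  # (6, 512)
```

The vectorised table avoids the nested Python loop over all positions and dimensions while producing the same values as the scalar function.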

## Task 1.2
Implement a dense layer with the number of hidden units as an argument.

In [12]:
def dense_layer(X, hidden_layers, activation=None, initializer="glorot_uniform"):
    """
    Computes the output of a stack of dense layers: one layer per entry in hidden_layers,
    followed by an output layer that projects back to the input dimension.

    :param X: input tensor of shape (batch_size, input_dim)
    :type X: numpy.ndarray

    :param hidden_layers: list of integers specifying the number of neurons in each hidden layer
    :type hidden_layers: list[int]

    :param activation: activation function applied after every hidden layer, but not after the output layer (default: None)
    :type activation: callable

    :param initializer: weight initialization method, "glorot_uniform" or "glorot_normal"; any other value falls back to standard normal weights (default: "glorot_uniform")
    :type initializer: str

    :return: output array of shape (batch_size, input_dim)
    :rtype: numpy.ndarray
    """
    input_dim = X.shape[1]

    # Initialize weights and biases for the hidden layers
    weights = []
    biases = []
    output_dim = input_dim  # fall back to the input size when hidden_layers is empty
    for layer_dim in hidden_layers:
        if initializer == "glorot_uniform":
            # Glorot initialization for uniform distribution
            bound = np.sqrt(6.0 / (input_dim + layer_dim))
            weight = np.random.uniform(low=-bound, high=bound, size=(input_dim, layer_dim))
        elif initializer == "glorot_normal":
            # Glorot initialization for normal distribution
            std_dev = np.sqrt(2.0 / (input_dim + layer_dim))
            weight = np.random.normal(loc=0.0, scale=std_dev, size=(input_dim, layer_dim))
        else:
            # Random initialization
            weight = np.random.randn(input_dim, layer_dim)

        bias = np.zeros(layer_dim)

        weights.append(weight)
        biases.append(bias)

        input_dim = layer_dim
        output_dim = layer_dim

    # Initialize weights and biases for the output layer
    if initializer == "glorot_uniform":
        # Glorot initialization for uniform distribution
        bound = np.sqrt(6.0 / (input_dim + X.shape[1]))
        weight = np.random.uniform(low=-bound, high=bound, size=(output_dim, X.shape[1]))
    elif initializer == "glorot_normal":
        # Glorot initialization for normal distribution
        std_dev = np.sqrt(2.0 / (input_dim + X.shape[1]))
        weight = np.random.normal(loc=0.0, scale=std_dev, size=(output_dim, X.shape[1]))
    else:
        # Random initialization
        weight = np.random.randn(output_dim, X.shape[1])

    bias = np.zeros(X.shape[1])

    weights.append(weight)
    biases.append(bias)

    # Compute the output of each layer
    output = X
    for i in range(len(weights)):
        output = np.dot(output, weights[i]) + biases[i]
        if i < len(weights) - 1 and activation is not None:
            output = activation(output)

    return output


In [14]:
# Create input tensor X of shape (batch_size, input_dim)
X = np.random.rand(32, 64)

# Define the number of neurons in each hidden layer
hidden_layers = [128, 256]

# Compute the output of the dense layers (no activation is passed here, so all layers are purely linear)
output = dense_layer(X, hidden_layers)
output.shape


(32, 64)

## Task 1.3
Implement all activation functions such that they are compatible with the dense layer.

In [9]:
# Implement all activation functions such that they are compatible with the dense layer.
def activation(hidden_layer):
    """ReLU activation, applied element-wise.

    :param hidden_layer: pre-activation output of a dense layer
    :type hidden_layer: np.array
    :return: ReLU-activated hidden layer
    :rtype: np.array
    """
    # ReLU sets negative entries to zero and keeps positive entries unchanged
    return np.maximum(0, hidden_layer)

In [10]:
test_hidden_layer = np.random.normal(loc=0, scale=1, size=[256, 256])
activation(test_hidden_layer)

array([[0.17207862, 0.20716259, 0.08312284, ..., 0.58155436, 0.50223506,
        0.        ],
       [0.04796746, 0.        , 0.        , ..., 1.17118689, 0.90378338,
        1.97967341],
       [1.79309265, 0.        , 0.52665597, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.29559844],
       [0.        , 0.43673485, 0.        , ..., 0.        , 2.14142875,
        0.        ]])

## Task 1.4
Implement the skip (residual) connections.

In [11]:
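# Implement the skip (residual) connections.
def skip_connections(hidden_layer, input_layer):
    """Skip (residual) connection: add a sub-layer's input to its output.

    :param hidden_layer: output of the sub-layer
    :type hidden_layer: np.array
    :param input_layer: input of the sub-layer, with the same shape as hidden_layer
    :type input_layer: np.array
    :return: element-wise sum of both arrays
    :rtype: np.array
    """
    return hidden_layer + input_layer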

## Task 1.5
Implement layer normalization.

In [12]:
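# Implement layer normalization.
def normalisation_layer(hidden_layer, eps=1e-8):
    """Normalise every row (token) to zero mean and unit variance over the feature axis.

    The mean and variance are computed from hidden_layer itself, so they are not arguments.
    The learnable gain and bias of layer normalization are omitted here.

    :param hidden_layer: activations of shape (sequence_length, d_model)
    :type hidden_layer: np.array
    :param eps: small constant that avoids division by zero
    :type eps: float
    :return: normalised activations with the same shape as hidden_layer
    :rtype: np.array
    """
    # Statistics are taken over the last (feature) axis, separately for each token
    mean = np.mean(hidden_layer, axis=-1, keepdims=True)
    variance = np.var(hidden_layer, axis=-1, keepdims=True)
    return (hidden_layer - mean) / np.sqrt(variance + eps)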
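
In the paper, the residual connection from Task 1.4 and layer normalization are applied together around every sub-layer as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. The following is a minimal sketch of that combination, including a learnable gain and bias. The names `layer_norm`, `add_and_norm`, `gamma` and `beta` are assumptions chosen for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalise over the feature axis, then apply gain (gamma) and bias (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(variance + eps) + beta

def add_and_norm(x, sublayer_output, gamma, beta):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output, gamma, beta)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))                # input of a sub-layer
sublayer_output = rng.normal(size=(6, 512))  # stand-in for attention or feed-forward output
gamma, beta = np.ones(512), np.zeros(512)    # identity gain and zero bias at initialisation

normed = add_and_norm(x, sublayer_output, gamma, beta)
print(normed.shape)  # (6, 512)
```

With `gamma` set to ones and `beta` to zeros, every row of the result has approximately zero mean and unit standard deviation.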

## Task 1.6
Implement dropout.

In [13]:
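# Implement dropout
def dropout(hidden_layer, dropout_rate):
    """Inverted dropout: zero each unit with probability dropout_rate and rescale the rest.

    :param hidden_layer: activations to apply dropout to
    :type hidden_layer: np.array
    :param dropout_rate: probability of dropping a unit, in [0, 1)
    :type dropout_rate: float
    :return: array of the same shape with the dropped units set to zero
    :rtype: np.array
    """
    # Sample a keep mask and divide by keep_prob so the expected activation stays unchanged
    keep_prob = 1 - dropout_rate
    mask = np.random.binomial(1, keep_prob, size=hidden_layer.shape)
    return hidden_layer * mask / keep_prob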


## Task 1.7
Implement the attention mechanism.

In [14]:
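# Implement the attention mechanism.
# numpy has no built-in softmax, so we define it ourselves.
def softmax(input_matrix):
    """Softmax over the last axis, so every row sums to 1.

    :param input_matrix: attention scores, e.g. of shape (n_queries, n_keys)
    :type input_matrix: np.array
    :return: array of the same shape with non-negative rows that sum to 1
    :rtype: np.array
    """
    # Subtract the row maximum before exponentiating to avoid overflow
    shifted = input_matrix - np.max(input_matrix, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

def attention(Q, K, V, dimension_k):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    :param Q: queries of shape (n_queries, d_k)
    :type Q: np.array
    :param K: keys of shape (n_keys, d_k)
    :type K: np.array
    :param V: values of shape (n_keys, d_v)
    :type V: np.array
    :param dimension_k: key dimension d_k used to scale the scores
    :type dimension_k: int
    :return: attended values of shape (n_queries, d_v)
    :rtype: np.array
    """
    # Scores are scaled by sqrt(d_k) before the softmax, then used to weight the values
    return np.matmul(softmax(np.matmul(Q, K.T) / np.sqrt(dimension_k)), V)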
    


## Task 1.8
Implement the position-wise feed-forward network.
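
A minimal sketch of this network, assuming the form $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ and the inner dimension $d_{ff} = 2048$ from the paper. The helper names `glorot_uniform` and `position_wise_ffn` are assumptions chosen for illustration.

```python
import numpy as np

def glorot_uniform(rng, fan_in, fan_out):
    """Glorot uniform initialization for a (fan_in, fan_out) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def position_wise_ffn(x, w1, b1, w2, b2):
    """Two dense layers with a ReLU in between, applied to every position independently."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # (seq_len, d_ff)
    return hidden @ w2 + b2                 # (seq_len, d_model)

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
w1, b1 = glorot_uniform(rng, d_model, d_ff), np.zeros(d_ff)
w2, b2 = glorot_uniform(rng, d_ff, d_model), np.zeros(d_model)

x = rng.normal(size=(6, d_model))   # six token representations
out = position_wise_ffn(x, w1, b1, w2, b2)
print(out.shape)  # (6, 512)
```

Because the same weights are applied to each row separately, feeding a single position through the network gives the same result as the corresponding row of the full output.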

## Task 1.9
Implement the encoder attention.
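
One way to approach this is multi-head self-attention, where queries, keys and values are all projections of the same input. The sketch below assumes 8 heads and $d_{model} = 512$ as in the paper. The function names and the choice to return the attention weights alongside the output are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def glorot_uniform(rng, fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def split_heads(x, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_model // num_heads)."""
    seq_len, d_model = x.shape
    return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention: queries, keys and values are all computed from x."""
    q = split_heads(x @ w_q, num_heads)
    k = split_heads(x @ w_k, num_heads)
    v = split_heads(x @ w_v, num_heads)
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq_len, seq_len)
    weights = softmax(scores)
    heads = weights @ v                                # (heads, seq_len, d_k)
    # Concatenate the heads again and apply the output projection
    concatenated = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return concatenated @ w_o, weights

rng = np.random.default_rng(0)
d_model, num_heads = 512, 8
w_q, w_k, w_v, w_o = (glorot_uniform(rng, d_model, d_model) for _ in range(4))

x = rng.normal(size=(6, d_model))
attended, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(attended.shape, weights.shape)  # (6, 512) (8, 6, 6)
```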

## Task 1.10
Implement the encoder.
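
A rough sketch of how the pieces could be stacked into an encoder: each layer applies self-attention and the feed-forward network, each wrapped in a residual connection and layer normalization. To keep the example light, it assumes small dimensions ($d_{model} = 64$, $d_{ff} = 256$, 2 layers) instead of the paper's values, omits dropout and the learnable layer normalization parameters, and stores the weights in a plain dictionary. All names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(x, p, num_heads):
    """Multi-head self-attention on a (seq_len, d_model) input."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads

    def heads(t):
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = heads(x @ p["w_q"]), heads(x @ p["w_k"]), heads(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))
    merged = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["w_o"]

def feed_forward(x, p):
    return np.maximum(0.0, x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]

def init_encoder_layer(d_model, d_ff):
    """Glorot-initialised weights and zero biases for one encoder layer."""
    params = {name: glorot(d_model, d_model) for name in ("w_q", "w_k", "w_v", "w_o")}
    params.update(w_1=glorot(d_model, d_ff), b_1=np.zeros(d_ff),
                  w_2=glorot(d_ff, d_model), b_2=np.zeros(d_model))
    return params

def encoder_layer(x, p, num_heads):
    x = layer_norm(x + self_attention(x, p, num_heads))   # sub-layer 1: attention
    return layer_norm(x + feed_forward(x, p))             # sub-layer 2: feed-forward

def encoder(x, layers, num_heads):
    for p in layers:
        x = encoder_layer(x, p, num_heads)
    return x

d_model, d_ff, num_heads, num_layers = 64, 256, 8, 2   # assumed small sizes
layers = [init_encoder_layer(d_model, d_ff) for _ in range(num_layers)]
x = rng.normal(size=(6, d_model))   # stand-in for embedded tokens
encoded = encoder(x, layers, num_heads)
print(encoded.shape)  # (6, 64)
```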

## Task 1.11
Implement the decoder attention.
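
Decoder self-attention differs from encoder self-attention in that a position must not attend to later positions. A minimal single-head sketch of this masking follows. The paper sets the masked scores to $-\infty$; the large negative constant `-1e9` used here is a common stand-in and an assumption of this sketch, as are the function names.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def causal_mask(seq_len):
    """Boolean lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_self_attention(q, k, v):
    """Scaled dot-product attention with future positions masked out."""
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # Replace scores of future positions so they receive zero weight after the softmax
    scores = np.where(causal_mask(seq_len), scores, -1e9)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 64))
k = rng.normal(size=(6, 64))
v = rng.normal(size=(6, 64))

output, weights = masked_self_attention(q, k, v)
print(output.shape)  # (6, 64)
```

The first position can only attend to itself, so the first output row equals the first value vector.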

## Task 1.12
Implement the encoder-decoder attention.
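
In encoder-decoder (cross) attention, the queries come from the decoder while the keys and values come from the encoder output, so the two sequences may have different lengths. The single-head sketch below illustrates the shapes involved. The function name, the dimensions and the random stand-in inputs are assumptions chosen for illustration.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def glorot_uniform(rng, fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    """Queries from the decoder, keys and values from the encoder output."""
    q = decoder_states @ w_q    # (target_len, d_model)
    k = encoder_output @ w_k    # (source_len, d_model)
    v = encoder_output @ w_v    # (source_len, d_model)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (target_len, source_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, source_len, target_len = 64, 6, 4   # assumed sizes
w_q, w_k, w_v = (glorot_uniform(rng, d_model, d_model) for _ in range(3))

encoder_output = rng.normal(size=(source_len, d_model))   # stand-in for the encoder result
decoder_states = rng.normal(size=(target_len, d_model))   # stand-in for decoder activations

context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(context.shape, weights.shape)  # (4, 64) (4, 6)
```

Each row of `weights` describes how one target position distributes its attention over the source positions.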

## Task 1.13
Implement the decoder.

## Task 1.14
Implement the transformer architecture (e.g. by creating a Transformer class that includes the steps before).

In [15]:
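class Transformer:
    def __init__(self, token_ids):
        # TODO: combine the embedding, encoder and decoder from Tasks 1.1-1.13 here.
        raise NotImplementedError("The Transformer forward pass is not implemented yet")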

## Task 1.15
Test the forward pass using the following array.

In [None]:
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])
Transformer(forward_pass_array)

# Task 2

In the following, we want to use pretrained models. From here on, you are allowed to use any machine learning framework of your choice. Moreover, you will use [Hugging Face](https://huggingface.co/), which is compatible with tensorflow, keras, torch and other machine learning frameworks.

We want to fine-tune a pretrained model to determine whether Yelp reviews are positive or negative. The data set is available for [Tensorflow](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews) and [(py)torch](https://pytorch.org/text/stable/datasets.html#yelpreviewpolarity). Given the text of a review, we want to determine whether the Yelp review is positive or negative. The data set is pre-split into a training and a test set. Please use the training data to fine-tune your model and the test data to evaluate your model's performance. This exercise does not necessarily have to end with a SOTA model; the goal is for you to use and fine-tune SOTA pretrained large language models.

Problem Setting:

The label $y$ of a Yelp review $T$ is either positive or negative. Given a Yelp review $T$, determine whether it is positive or negative. The training set is $\mathcal{D} = \{(T_1, y_1), \ldots, (T_N, y_N)\}$, where $T_i$ is review $i$ and $y_i$ is $T_i$'s polarity feedback. For a suitable evaluation metric for this type of problem, see Theory Question 1.

In the following, please solve all subtasks.

## Theory Question 1
Which metric is threshold-independent for evaluating the problem setting described in Task 2? Please list the pros and cons of three different metrics that might be suitable, define an evaluation protocol, and decide which evaluation suits this problem best.
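
As an illustration of what threshold independence means, the area under the ROC curve (ROC AUC) is one commonly used example of such a metric: it depends only on how the scores rank positive examples relative to negative ones. Whether it is the best choice for this task is left to the question above. The sketch below computes it by comparing all positive-negative pairs, which is only practical for small samples. The function name `roc_auc` and the toy data are assumptions.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    positive_scores = scores[labels == 1]
    negative_scores = scores[labels == 0]
    # Pairwise comparison of every positive score with every negative score
    greater = (positive_scores[:, None] > negative_scores[None, :]).sum()
    ties = (positive_scores[:, None] == negative_scores[None, :]).sum()
    return (greater + 0.5 * ties) / (len(positive_scores) * len(negative_scores))

labels = np.array([0, 0, 1, 1])           # toy polarity labels
scores = np.array([0.1, 0.4, 0.35, 0.8])  # toy predicted probabilities of "positive"
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

No classification threshold appears anywhere in the computation; applying any strictly increasing transformation to the scores leaves the value unchanged.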

## Task 2.0
Load the Yelp Review Polarity dataset.

## Task 2.1
Decide on a suitable language model from the HuggingFace model zoo (a library providing pretrained models).

## Task 2.2

For the model to process text as intended, we need the tokenizer that was used during training. Luckily, Hugging Face provides both pretrained models and tokenizers. After deciding on a language model in Task 2.1, please load the corresponding tokenizer.

## Task 2.3
Load the language model from HuggingFace.

## Task 2.4
Fine-tune your model on the Yelp Review Polarity training data set. Note: If you have computational limitations consider fine-tuning only on part of the training dataset. 

## Task 2.5
Evaluate your model, following the evaluation protocol you defined in Theory Question 1, on the test part of the Yelp Review Polarity data set.

# Task 3
Visualize and interpret the attention weights of one correctly and one incorrectly classified example from the Yelp Review Polarity test data using [BertViz](https://github.com/jessevig/bertviz)'s model_view.

## Theory Question 2
In your own words, describe how the attention mechanism in a transformer works in the case of self-attention and cross-attention, identifying in each case the keys, queries, and values. Give two examples of alignment models and describe how they affect the output using a simple example. This part of the written report can be done in collaboration with your group.

# Task 4
Please describe your team's implementation of this project, including your personal contribution, in 1000-1500 characters. Each team member must explain the main aspects of the team's implementation, and may not discuss this summary with other students. You are allowed to use figures and tables to clarify. This summary constitutes a separately and individually graded piece of work.