<h1><center>
    Implementing, fine-tuning and visualizing transformer architectures. <br/>
  
    
    Project 1
</center></h1>

# Task 1
Implement the feed-forward pass of the original transformer network using only numpy, i.e. without machine learning frameworks.

Note: All subtasks are voluntary and rather a guide-line of how we would implement the forward pass. You can also choose a different order for implementing the different parts or implement everything in one class/function. The forward pass should return an numpy array.

Please initialize the projectionss using Glorot initialization.

In [1]:
# You will test your implementation on a single array:
import numpy as np
forward_pass_array = np.array([101, 400, 500, 600, 107, 102])


In [2]:
def init_weights(y_rows, x_cols):
    """
    Initialise the weights of a layer with Glorot normal initialisation.
    
    :param input_size: The number of inputs to the layer.
    :type input_size: int
    :param output_size: The number of outputs of the layer.
    :type output_size: int
    :return: The initialised weights.
    :rtype: np.ndarray
    """
    return np.random.normal(loc=0.0, scale=np.sqrt(2.0 / (y_rows + x_cols)), size=(y_rows, x_cols, ))

In [3]:
# token embedding matrix 1000x512 random floats golorot normal
word_embeddings = init_weights(1000, 512)
word_embeddings.shape

(1000, 512)

In [4]:
word_embeddings

array([[-0.00863085, -0.04139448, -0.04344926, ...,  0.00198116,
         0.03192168,  0.02778582],
       [ 0.05211839, -0.0481381 , -0.00472216, ...,  0.0707065 ,
         0.03266225,  0.0529025 ],
       [-0.04634467,  0.09271401, -0.02840324, ...,  0.00545395,
         0.05169014, -0.00878025],
       ...,
       [-0.01998661,  0.03246816, -0.00911726, ...,  0.03126981,
         0.09910073, -0.0033792 ],
       [-0.06583972,  0.03375211, -0.01736838, ...,  0.02900458,
         0.00204968,  0.00333083],
       [-0.03945919,  0.00199318, -0.06154962, ..., -0.00854824,
        -0.01021864,  0.03121625]])

In [5]:
selected_embeddings = word_embeddings[forward_pass_array]

## Task 1.1
Implement the sinus/cosinus positional encoding used in the original paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Implement the token embedding.

In [6]:
def positional_encoding(positions:list,  d_model:int = 512):
    """
    returns a matrix
    pos is the position, embeddings is a vector and d_model is the model dimension
    we use sine and cosine functions of different frequencies

    :param pos: position aka t 
    :type pos: list
    :param embeddings: size of the embedding
    :type i: list
    :param d_model: model dimensions
    :type d_model: int
    """
    result_matrix = np.zeros((len(positions), d_model))
    for pos in range(len(positions)):
        for i in range(d_model):
            if i % 2 == 0:
                result_matrix[pos,i] = np.sin(pos / (10000 ** (i / d_model)))
            else:
                result_matrix[pos,i] = np.cos(pos / (10000 ** ((i - 1) / d_model)))
    return result_matrix

# Now we can create the positional encoding matrix


positional_encoding_matrix = positional_encoding(selected_embeddings, len(selected_embeddings[0]))

In [7]:
positional_encoding_matrix

array([[ 0.00000000e+00,  1.00000000e+00,  0.00000000e+00, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 8.41470985e-01,  5.40302306e-01,  8.21856190e-01, ...,
         9.99999994e-01,  1.03663293e-04,  9.99999995e-01],
       [ 9.09297427e-01, -4.16146837e-01,  9.36414739e-01, ...,
         9.99999977e-01,  2.07326584e-04,  9.99999979e-01],
       [ 1.41120008e-01, -9.89992497e-01,  2.45085415e-01, ...,
         9.99999948e-01,  3.10989874e-04,  9.99999952e-01],
       [-7.56802495e-01, -6.53643621e-01, -6.57166863e-01, ...,
         9.99999908e-01,  4.14653159e-04,  9.99999914e-01],
       [-9.58924275e-01,  2.83662185e-01, -9.93854779e-01, ...,
         9.99999856e-01,  5.18316441e-04,  9.99999866e-01]])

In [8]:
positional_encoding_matrix.shape

(6, 512)

## Task 1.2
Implement a dense layer with the number of hidden units as an argument.

In [9]:
def dense_layer(X:np.array, hidden_units:int):
    """
    Input and output matrices are of same dimension
    Single dense layer with a specified number of hidden units


    :param X: input matrix
    :type X: np.array
    :param hidden_units: number of hidden units
    :type hidden_units: int
    """
    input_dim = X.shape[1]
    
    # golorot normal initialization of the weights
    weight = init_weights(input_dim, hidden_units )

    bias = np.zeros(hidden_units)
    
    # we use the dot product to calculate the output of the layer
    return np.matmul(X, weight) + bias

In [10]:
dense_output = dense_layer(positional_encoding_matrix, 512)
dense_output.shape

(6, 512)

## Task 1.3
Implement all activation function such that they are compatible with the dense layer.

In [11]:
# Implement all activation function such that they are compatible with the dense layer.
def activation(hidden_layer:np.array):
    """ Relu activation

    :param hidden_layer: hidden layer for activation
    :type hidden_layer: np.array
    :param activation: ReLU activated hidden layer
    :type activation: np.array
    """
    # Implement the relu activation function
    return np.maximum(0, hidden_layer)

In [12]:
activation(dense_output)

array([[0.        , 0.        , 0.        , ..., 0.45325582, 0.        ,
        1.01786121],
       [0.        , 0.        , 0.        , ..., 0.62195124, 0.        ,
        0.85395666],
       [0.        , 0.        , 0.02367677, ..., 0.67656675, 0.        ,
        0.63294979],
       [0.        , 0.        , 0.2811811 , ..., 0.6668396 , 0.        ,
        0.36439038],
       [0.        , 0.        , 0.54658032, ..., 0.65891775, 0.        ,
        0.08400062],
       [0.        , 0.        , 0.72405789, ..., 0.69903689, 0.        ,
        0.        ]])

## Task 1.4
Implement the skip (residual) connections.

In [13]:
#Implement the skip (residual) connections.
def skip_connections(hidden_layer, input_layer):
    """Skip connection 

    :param hidden_layer: _description_
    :type hidden_layer: _type_
    :param input_layer: _description_
    :type input_layer: _type_
    """
    return hidden_layer + input_layer

## Task 1.5
Implement layer normalization.

In [14]:
# Implement layer normalization.
def normalisation_layer(hidden_layer):
    """ check the dimensionality should be normalised over all values 
    instead of columns wise normalisation used in batch normalisation

    :param hidden_layer: _description_
    :type hidden_layer: _type_
    :param mean: _description_
    :type mean: _type_
    :param variance: _description_
    :type variance: _type_
    """
    # Implement the layer normalization
    mean = np.mean(hidden_layer, axis=1, keepdims=True)
    variance = np.var(hidden_layer, axis=1, keepdims=True)
    return (hidden_layer - mean) / np.sqrt(variance + 1e-8)

## Task 1.6
Implement dropout.

In [15]:
# Implement dropout
def drop_out(hidden_layer, dropout_rate):
    """dropout 

    :param hidden_layer: _description_
    :type hidden_layer: _type_
    :param dropout_rate: _description_
    :type dropout_rate: _type_
    """
    # Implement the dropout
    return hidden_layer * np.random.binomial(1, 1 - dropout_rate, size=hidden_layer.shape)


## Task 1.7
Implement the attention mechanism.

In [16]:
def softmax(input_matrix):
    """Softmax activation

    :param input_matrix: Input matrix
    :type input_matrix: np.array
    """
    # Implement the softmax function
    return np.exp(input_matrix) / np.sum(np.exp(input_matrix), axis=1, keepdims=True)
 
def attention(query, key, value, mask=None, dropout_rate=None):
    """Compute 'Scaled Dot Product Attention'

    :param query: query matrix resulting from the dot product of the x and the query weights of shape (seq_len, d_model).
    :type query: np.array
    :param key: key matrix resulting from the dot product of the x and the key weights of shape (seq_len, d_model).
    :type key: np.array
    :param value: value matrix resulting from the dot product of the x and the value weights of shape (seq_len, d_model).
    :type value: np.array
    :param mask: boolean mask, defaults to None
    :type mask: boolen, optional
    :param dropout: boolean dropout, defaults to None
    :type dropout: boolen, optional
    :return: output of shape (seq_len, d_model).
    :rtype: np.array
    """    """"""
    # Implement the attention mechanism
    dimension_k = key.shape[1]
    scores = np.matmul(query, key.T) / np.sqrt(dimension_k)
    if mask is not None:
        scores = mask_attention_scores(scores, mask_value=-np.inf)
    predicted_attention = softmax(scores)
    if dropout_rate is not None:
        predicted_attention = drop_out(predicted_attention, dropout_rate=dropout_rate)
    return np.matmul(predicted_attention, value)

# Implement the mask function

def mask_attention_scores(scores, mask_value=-np.inf):
    """
    Apply a mask to attention scores.

    :param scores: Attention scores of shape (num_heads, seq_len, seq_len).

    :returns: Masked attention scores of shape (num_heads, seq_len, seq_len).
    """
    # Create mask of shape (seq_len, seq_len)
    attn_shape = scores.shape
    mask = np.tril(np.ones(attn_shape), k=0)

    # Set masked positions to mask_value
    masked_scores = np.where(mask == 0, mask_value, scores)
    # print('masked_scores', masked_scores.shape)
    return masked_scores




# implement a multi-head attention layer
def multihead_attention(query, key, value, query_projections, key_projections, value_projections, o_projections, mask:bool = None, dropout_rate:float = None, num_heads:int = 8):
    """
    Compute multihead self-attention given query, key, and value matrices.

    :param query: query matrix resulting from the dot product of the x and the query weights of shape (seq_len, d_model).
    :type query: np.array
    :param key: key matrix resulting from the dot product of the x and the key weights of shape (seq_len, d_model).
    :type key: np.array
    :param value: value matrix resulting from the dot product of the x and the value weights of shape (seq_len, d_model).
    :type value: np.array
    :param query_projections: Query projectionss of shape (d_model, d_model).
    :type query_projections: np.array
    :param key_projections: Key projectionss of shape (d_model, d_model).
    :type key_projections: np.array
    :param value_projections: Value projectionss of shape (d_model, d_model).
    :type value_projections: np.array
    :param o_projections: Output projectionss of shape (d_model, d_model).
    :type o_projections: np.array
    :param mask: Mask for attention, defaults to None
    :type mask: bool, optional
    :param dropout_rate: Dropout rate, defaults to None
    :type dropout_rate: float, optional
    :param num_heads: Number of heads for attention.
    :type num_heads: int, optional

    :returns: Output of multihead attention of shape (seq_len, d_model).
    """
    
    # Project query, key, and value matrices using learnable projectionss
    query_projected = np.matmul(query, query_projections)
    key_projected = np.matmul(key, key_projections)
    value_projected = np.matmul(value, value_projections)
    

    # Split matrices into multiple heads
    query_heads = np.array(np.split(query_projected, num_heads, axis=1))
    key_heads = np.array(np.split(key_projected, num_heads, axis=1))
    value_heads = np.array(np.split(value_projected, num_heads, axis=1))
        
    # print(query_heads.shape)
    

    # Compute scaled dot product attention for each head
    head_outputs = []
    for i in range(num_heads):
        query_i = query_heads[i]
        key_i = key_heads[i]
        value_i = value_heads[i]

        # Compute attention scores
        attention_scores = attention(query_i, key_i, value_i, mask, dropout_rate)

        # Append attention scores to head_outputs
        head_outputs.append(attention_scores)

    # Concatenate head outputs and project back to original dimensionality
    outputs = np.concatenate(head_outputs, axis=-1)
    
    # reproject the output
    outputs_projected = np.matmul(outputs, o_projections)

    return outputs_projected
    

In [17]:
positional_encoding_matrix.shape

(6, 512)

In [18]:
# query_weights, key_weights, value_weights, o_weights = init_weights(512, 512), init_weights(512, 512), init_weights(512, 512), init_weights(512, 512)
# query, key, value = np.matmul(positional_encoding_matrix, query_weights), np.matmul(positional_encoding_matrix, key_weights), np.matmul(positional_encoding_matrix, value_weights)

# attention_output = multihead_attention(query, key, value, query_weights, key_weights, value_weights, o_weights, mask = False, dropout_rate = 0.1, num_heads = 8)
# print(attention_output.shape)
# attention_output

In [19]:

# drop one row of the query matrix and check the output
# query_weights, key_weights, value_weights, o_weights = init_weights(512, 512), init_weights(512, 512), init_weights(512, 512), init_weights(512, 512)
# query, key, value = np.matmul(positional_encoding_matrix, init_weights(512, 512)), np.matmul(positional_encoding_matrix, init_weights(512, 512)), np.matmul(positional_encoding_matrix, init_weights(512, 512))
# attention_masked = multihead_attention(query, key, value, query_weights, key_weights, value_weights, o_weights, mask = True, dropout_rate = 0.1, num_heads = 8)
# print(attention_masked.shape)
# attention_masked

## Task 1.8
Implement the positonal feed-forward network.

In [20]:
# included in Task 1.14

## Task 1.9
Implement the encoder attention.

In [21]:
# included in Task 1.14

## Task 1.10
Implement the encoder.

In [22]:
# included in Task 1.14


## Task 1.11
Implement the decoder attention.

In [23]:
# included in Task 1.14

## Task 1.12
Implement the encoder-decoder attention.

In [24]:
# included in Task 1.14

## Task 1.13
Implement the decoder.

In [25]:
# included in Task 1.14



## Task 1.14
Implement the transformer architecture (e.g. by creating a Transformer class that includes the steps before).

In [26]:
class Transformer():
    def __init__(self, vocab_size, d_model, attention_layers, d_hidden, padding=512, dropout=0.1):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.attention_layers = attention_layers
        self.d_hidden = d_hidden
        self.padding = padding
        self.dropout = dropout

        # initialise weights for embedding
        self.W_embedding_encoder = init_weights(vocab_size, d_model)  # embedding matrix for encoder
        self.W_embedding_decoder = init_weights(vocab_size, d_model)  # embedding matrix for decoder
        
        
        list_weights = [init_weights(d_model, d_model) for i in range(attention_layers)] # template for weights for dimensionality (d_model, d_model）  
        
        # initialise weights for encoder attention
        self.W_K_encoder, self.W_Q_encoder, self.W_V_encoder = list_weights, list_weights, list_weights # list of weights for K, Q, V for all attention layers
        self.query_projections_enc, self.key_projections_enc, self.value_projections_enc, self.o_projections_enc = list_weights,list_weights,list_weights,list_weights # list of Projection weights for K, Q, V for all attention layers
        
        # initialise weights for decoder attention
        self.W_K_decoder, self.W_Q_decoder, self.W_V_decoder = list_weights, list_weights, list_weights # list of weights for K, Q, V for all attention layers
        self.query_projections_dec, self.key_projections_dec, self.value_projections_dec, self.o_projections_dec = list_weights,list_weights,list_weights,list_weights # list of Projection weights for K, Q, V for all attention layers

        # initialise weights for encoder-decoder attention
        self.W_Q_encoder_decoder, self.W_K_encoder_decoder, self.W_V_encoder_decoder = list_weights, list_weights, list_weights # list of weights for K, Q, V for all attention layers
        self.query_projections_enc_dec, self.key_projections_enc_dec, self.value_projections_enc_dec, self.o_projections_enc_dec = list_weights,list_weights,list_weights,list_weights # list of Projection weights for K, Q, V for all attention layers

        # parameter for encoder feed forward 
        self.W_1_enc = [init_weights(d_model, d_hidden) for i in range(attention_layers)]
        self.b_1_enc = [init_weights(1, d_hidden) for i in range(attention_layers)]
        self.W_2_enc = [init_weights(d_hidden, d_model) for i in range(attention_layers)]
        self.b_2_enc = [init_weights(1, d_model) for i in range(attention_layers)]
        
        # parameter for decoder feed forward 
        self.W_1_dec = [init_weights(d_model, d_hidden) for i in range(attention_layers)]
        self.b_1_dec = [init_weights(1, d_hidden) for i in range(attention_layers)]
        self.W_2_dec = [init_weights(d_hidden, d_model) for i in range(attention_layers)]
        self.b_2_dec = [init_weights(1, d_model) for i in range(attention_layers)]

        # parameter for linear layer
        #self.W, self.b = init_weights(len(forward_pass_array_output) * d_model, self.vocab_size), init_weights(1, self.vocab_size)
        self.W, self.b = init_weights(self.padding * d_model, self.vocab_size), init_weights(1, self.vocab_size)

        # parameter for layer normlisation
        self.a_ln_1_enc = np.ones(attention_layers) # first layer norm for encoder
        self.b_ln_1_enc = np.zeros(attention_layers) # first layer norm for encoder

        self.a_ln_2_enc = np.ones(attention_layers) # second layer norm for encoder
        self.b_ln_2_enc = np.zeros(attention_layers) # second layer norm for encoder

        self.a_ln_1_dec = np.ones(attention_layers) # first layer norm for decoder
        self.b_ln_1_dec = np.zeros(attention_layers) # first layer norm for decoder

        self.a_ln_2_dec = np.ones(attention_layers) # second layer norm for decoder
        self.b_ln_2_dec = np.zeros(attention_layers) # second layer norm for decoder

        self.a_ln_3_dec = np.ones(attention_layers) # second layer norm for decoder
        self.b_ln_3_dec = np.zeros(attention_layers) # second layer norm for decoder

    def forward(self, input_enc, input_dec, mask=None):
        output_enc = self.encoder(input_enc, n_layers=self.attention_layers)  
        out_prob = self.decoder(input_dec, output_enc, n_layers=self.attention_layers, mask=mask) # 
        return out_prob # output of encoder

    def encoder(self, x, n_layers=6):
        x = np.pad(x, (0, self.padding-len(x)), 'constant') # pad the input to the maximum length of the input
        x = self.W_embedding_encoder[x] # get the embedding of the padded input
        x += positional_encoding(x)  # add the positional encoding to the embedding
        x = drop_out(x, self.dropout)  # apply dropout to the sums of the embeddings and the positional encodings
    
        for i in range(n_layers):
            # get query, key, value
            query_encoder = np.matmul(x, self.W_Q_encoder[i]) # get the query
            key_encoder = np.matmul(x, self.W_K_encoder[i]) # get the key
            value_encoder = np.matmul(x, self.W_V_encoder[i]) # get the value
            self_attenion = multihead_attention(query_encoder, key_encoder, value_encoder, self.query_projections_enc[i], self.key_projections_enc[i], self.value_projections_enc[i], self.o_projections_enc[i], mask=None, dropout_rate = self.dropout, num_heads = 8) # get the self attention

            x += skip_connections(x, self_attenion)  # apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
            x = self.layer_norm(x, self.a_ln_1_enc[i], self.b_ln_1_enc[i]) # apply layer normalisation

            # feed forward
            x += skip_connections(x, drop_out(self.feed_forward(x, self.W_1_enc[i], self.b_1_enc[i], self.W_2_enc[i], self.b_2_enc[i]),dropout_rate=self.dropout))
            x = self.layer_norm(x, self.a_ln_2_enc[i], self.b_ln_2_enc[i]) # apply layer normalisation
        return x


    def decoder(self, x, y, n_layers=6, mask=None):
        """
        x: input of decoder, 
        y: output of encoder, 
        n_layers: number of layers in decoder,
        mask: mask for decoder attention
        """
        x = np.pad(x, (0, self.padding-len(x)), 'constant') # pad input to the same length
        x = self.W_embedding_decoder[x]
        x += positional_encoding(x)
        x = drop_out(x, 0.1)  # apply dropout to the sums of the embeddings and the positional encodings
    
        for i in range(n_layers):
            # get query, key, value for decoder attention
            query_decoder = np.matmul(x, self.W_Q_decoder[i]) # get the query
            key_decoder = np.matmul(x, self.W_K_decoder[i])  # get the key
            value_decoder = np.matmul(x, self.W_V_decoder[i])  # get the value
            # decoder attention
            self_attenion = multihead_attention(query_decoder, key_decoder, value_decoder, self.query_projections_dec[i], self.key_projections_dec[i], self.value_projections_dec[i], self.o_projections_dec[i], mask=mask, dropout_rate = self.dropout, num_heads = 8)
            x += skip_connections(x, self_attenion)  # apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
            x = self.layer_norm(x, self.a_ln_1_dec[i], self.b_ln_1_dec[i]) # apply layer normalisation
    
            # get query, key, value for encoder-decoder attention
            query_encoder_decoder = np.matmul(x, self.W_Q_encoder_decoder[i])  # from decoder attention
            key_encoder_decoder = np.matmul(y, self.W_K_encoder_decoder[i]) # from encoder attention
            value_encoder_decoder = np.matmul(y, self.W_V_encoder_decoder[i])  # from encoder attention
            # encoder-decoder attention
            attenion = multihead_attention(query_encoder_decoder, key_encoder_decoder, value_encoder_decoder, self.query_projections_enc_dec[i], self.key_projections_enc_dec[i], self.value_projections_enc_dec[i], self.o_projections_enc_dec[i], mask=None, dropout_rate = self.dropout, num_heads = 8)
            x += skip_connections(x, attenion)  # apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
            x = self.layer_norm(x, self.a_ln_2_dec[i], self.b_ln_2_dec[i])
    
            # feed forward
            x += skip_connections(x, drop_out(self.feed_forward(x, self.W_1_dec[i], self.b_1_dec[i], self.W_2_dec[i], self.b_2_dec[i]), dropout_rate=self.dropout))
            x = self.layer_norm(x, self.a_ln_3_dec[i], self.b_ln_3_dec[i]) # apply layer normalisation
    
        # flatten 2D matrix to 1D vector
        x = x.flatten().squeeze()
        # linear layer
        x = self.linear_layer(x)
        output = softmax(x)
    
        return output

    def layer_norm(self, hidden_layer, a, b):

        # Implement the layer normalization
        mean = np.mean(hidden_layer, axis=1, keepdims=True)
        variance = np.var(hidden_layer, axis=1, keepdims=True)
        #print(a.shape,hidden_layer.shape)
        return a * (hidden_layer - mean) / np.sqrt(variance + 1e-8) + b
    
    def linear_layer(self, x):
        x = np.matmul(x, self.W) + self.b
        x = drop_out(x, self.dropout)
        return x
    
    def feed_forward(self, x, W_1, b_1, W_2, b_2):
        x = np.matmul(x, W_1) + b_1
        x = activation(x)
        x = np.matmul(x, W_2) + b_2
        x = drop_out(x, self.dropout)
        return x


## Task 1.15
Test the forward pass using the follow array.

In [27]:
forward_pass_array_input = np.array([1, 40, 50, 60, 17, 12]) # input for encoder
forward_pass_array_output = np.array([12, 48, 50, 63]) # temporary input for decoder

transformer = Transformer(vocab_size=1000, d_model=512, attention_layers=6, d_hidden=2048, padding=512, dropout=0.1)
output = transformer.forward(forward_pass_array_input, forward_pass_array_output, mask=True)

print(output.max())
print(output.argmax())

0.03022852047096328
848


# Task 2

In the following, we want to use pretrained models. From here on, you are allowed to use any machine learning frame work of your choice. Moreover, you will use [Hugging Face](https://huggingface.co/) which is compatible with tensorflow, keras, torch and other machine learning frameworks.

We want to fine-tune a pretrained model to determine whether Yelp reviews are positive or negative. The data set is available for [Tensorflow](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews) and [(py)torch](https://pytorch.org/text/stable/datasets.html#yelpreviewpolarity). Given the text of a review, we want to determine whether the yelp review is positive and negative. The data set is pre-split into training and test set. Please use the training data to fine-tune your model, while using the test data to evaluate your models performance. This exercise does not necessarily end in having a SOTA model, the goal is for you to use and fine-tune SOTA pretrained large language models.

Problem Setting:

The label $y$ to a Yelp review $T$ is either positive or negative. Given a Yelp Review $T$ and a polarity feedback $y$ determine whether the Review $T$ is positive or negative. The training set $\mathcal{D} = \{(T_1, y_1), \ldots, (T_N, y_N)\}$, where $T_i$ is review $i$ and $y_i$ is $T_i$'s polarity feedback. A suitable evaluation metric for this type of problem is $\rightarrow$ see Theory Question 1.

In the following, please solve all subtasks.

## Theory Question 1
Which metric is threshold independent to evaluate the problem setting described in Task 2. Please list pros and cons of three different metrics that might be suitable, define an evaluation protocol and decide which evaluation suits this problem best.

In [28]:
"""
1. "Which metric is threshold independent to evaluate the problem setting described in Task 2?"
The metric that is threshold independent and suitable for evaluating the classification task is the area under the Receiver Operating Characteristic (ROC) curve (AUC-ROC).

2. "Please list pros and cons of three different metrics that might be suitable"

Accuracy: commonly used metric to evaluate classification models. It is the ratio of correctly predicted samples to the total number of samples. However, it can be misleading when the classes are imbalanced.

Precision and Recall: These are two complementary metrics that can be used to evaluate binary classification models. Precision is the ratio of true positives to the total number of predicted positives, while recall is the ratio of true positives to the total number of actual positives. However, they are threshold dependent and cannot capture the trade-off between precision and recall.

AUC-ROC: is threshold independent and measures the model's ability to distinguish between positive and negative classes. It plots the true positive rate against the false positive rate for different thresholds and calculates the area under the curve. AUC-ROC is suitable for imbalanced datasets and can capture the trade-off between true positive rate and false positive rate.

3. "define an evaluation protocol"
- Split the dataset into training and test sets.
- Train the model on the training set using the chosen pretrained language model.
- Evaluate the model on the test set using AUC-ROC.
"""

'\n1. "Which metric is threshold independent to evaluate the problem setting described in Task 2?"\nThe metric that is threshold independent and suitable for evaluating the classification task is the area under the Receiver Operating Characteristic (ROC) curve (AUC-ROC).\n\n2. "Please list pros and cons of three different metrics that might be suitable"\n\nAccuracy: commonly used metric to evaluate classification models. It is the ratio of correctly predicted samples to the total number of samples. However, it can be misleading when the classes are imbalanced.\n\nPrecision and Recall: These are two complementary metrics that can be used to evaluate binary classification models. Precision is the ratio of true positives to the total number of predicted positives, while recall is the ratio of true positives to the total number of actual positives. However, they are threshold dependent and cannot capture the trade-off between precision and recall.\n\nAUC-ROC: is threshold independent and m

## Task 2.0
Load the Yelp Review Polarity dataset.

In [29]:
from torchtext.datasets import YelpReviewPolarity
from transformers import BertModel, BertTokenizer
from torch.optim import Adam
from tqdm import tqdm
import torch.nn as nn
import time
import torch
import numpy as np
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import roc_auc_score

# Load YelpReviewPolarity dataset, split into train and test
# datatype: ShardingFilterIterDataPipe
train_data, test_data = YelpReviewPolarity(root='.data', split=('train', 'test'))

  from .autonotebook import tqdm as notebook_tqdm


## Task 2.1
Decide on a suitable language model from the HuggingFace model zoo (a library providing pretrained models).

In [30]:
# included in Task 2.4

## Task 2.2

For the model to process the intended way, we need the tokenizer that was used during training. Luckily Hugging Face  provides both pretrained models and tokenizer. After in Task 2.1 decided for a language model, please load the corresponding tokenizer.

In [31]:
# included in Task 2.4

## Task 2.3
Load the language model from HuggingFace.

In [32]:
# included in Task 2.4

## Task 2.4
Fine-tune your model on the Yelp Review Polarity training data set. Note: If you have computational limitations consider fine-tuning only on part of the training dataset. 

In [33]:
# Define a collate function that applies the tokenizer to each batch
def collate_fn(batch):
    # Convert the list of inputs and labels to separate lists
    labels, inputs = zip(*batch)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # Tokenize the inputs
    tokenized_inputs = tokenizer(inputs, padding=True, add_special_tokens=True,  truncation=True, return_tensors="pt")
    tokenizer(inputs, padding='max_length', max_length=512, add_special_tokens=True, truncation=True, return_tensors="pt")
    # Convert the labels to a PyTorch tensor
    labels = torch.tensor(labels)

    # Return the tokenized inputs and labels as a tuple
    return tokenized_inputs, labels

class BertClassifier(nn.Module):
    """BERT model for classification.This module is composed of the BERT model with a linear layer on top of the pretrained BERT model.
    """
    def __init__(self, freeze_bert=False, dropout=0.5):

        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
        # Freeze the BERT parameters if desired
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 2)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_id, mask):

        _, pooled_output, attention = self.bert(input_ids= input_id, attention_mask=mask,return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        prob = self.softmax(linear_output)
        return prob, attention

class TimingCallback(torch.nn.Module):
    def __init__(self,total_epochs ):
        super(TimingCallback, self).__init__()
        self.total_epochs = total_epochs
        self.start_time = time.time()

    def forward(self, epoch, total_epochs):
        elapsed_time = time.time() - self.start_time
        print("Epoch [{}/{}] - time: {:.2f}s".format(epoch + 1, total_epochs, elapsed_time))

    def on_epoch_end(self, epoch, logs=None):
        self.forward(epoch, self.total_epochs)

def train(model, train_data):
    train_loader = DataLoader(train_data, batch_size=batchSize, shuffle=True, collate_fn=collate_fn)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=lr)
    model = model.to(device)
    criterion = criterion.to(device)

    for epoch in range(epochs):

        true_labels = []
        predicted_probs = []
        total_loss_train = 0
        model.train()
        for i, data in tqdm(enumerate(train_loader)):
            if i == sample:
                break
            train_input, train_label = data
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output, _ = model(input_id, mask) #

            batch_loss = criterion(output, (train_label-1).long())
            total_loss_train += batch_loss.item()

            model.zero_grad()
            batch_loss.backward()
            optimizer.step()
            
            target = (train_label-1).long()
            true_labels.append(target.cpu().numpy())
            predicted_probs.append(output.argmax(dim=1).detach().cpu().numpy())

        timing_callback.on_epoch_end(epoch, epochs)

        true_labels = np.concatenate(true_labels)
        predicted_probs = np.concatenate(predicted_probs)
        roc_auc = roc_auc_score(true_labels, predicted_probs)

        print(f'Train Loss: {total_loss_train / (i*batchSize): .3f} \
                | Train AUC: {roc_auc: .3f}' )

# initialize the parameters
batchSize = 8
epochs = 20
lr = 1e-6
sample = 16 # number of batches
model = BertClassifier(freeze_bert=False) # initialize the model

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Train the model and evaluate its performance
timing_callback = TimingCallback(epochs)
train(model, train_data)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
16it [00:13,  1.18it/s]


Epoch [1/20] - time: 15.13s
Train Loss:  0.092                 | Train AUC:  0.439


16it [00:12,  1.27it/s]


Epoch [2/20] - time: 27.80s
Train Loss:  0.089                 | Train AUC:  0.511


16it [00:14,  1.10it/s]


Epoch [3/20] - time: 42.37s
Train Loss:  0.088                 | Train AUC:  0.540


16it [00:13,  1.20it/s]


Epoch [4/20] - time: 55.72s
Train Loss:  0.087                 | Train AUC:  0.481


16it [00:12,  1.24it/s]


Epoch [5/20] - time: 68.62s
Train Loss:  0.087                 | Train AUC:  0.514


16it [00:12,  1.31it/s]


Epoch [6/20] - time: 80.85s
Train Loss:  0.087                 | Train AUC:  0.530


16it [00:12,  1.27it/s]


Epoch [7/20] - time: 93.50s
Train Loss:  0.087                 | Train AUC:  0.463


16it [00:12,  1.27it/s]


Epoch [8/20] - time: 106.09s
Train Loss:  0.087                 | Train AUC:  0.539


16it [00:12,  1.24it/s]


Epoch [9/20] - time: 119.06s
Train Loss:  0.085                 | Train AUC:  0.586


16it [00:11,  1.33it/s]


Epoch [10/20] - time: 131.08s
Train Loss:  0.085                 | Train AUC:  0.535


16it [00:12,  1.26it/s]


Epoch [11/20] - time: 143.85s
Train Loss:  0.086                 | Train AUC:  0.509


16it [00:12,  1.28it/s]


Epoch [12/20] - time: 156.38s
Train Loss:  0.085                 | Train AUC:  0.564


16it [00:12,  1.24it/s]


Epoch [13/20] - time: 169.33s
Train Loss:  0.084                 | Train AUC:  0.601


16it [00:12,  1.26it/s]


Epoch [14/20] - time: 182.06s
Train Loss:  0.084                 | Train AUC:  0.605


16it [00:12,  1.30it/s]


Epoch [15/20] - time: 194.39s
Train Loss:  0.084                 | Train AUC:  0.600


16it [00:12,  1.31it/s]


Epoch [16/20] - time: 206.66s
Train Loss:  0.083                 | Train AUC:  0.684


16it [00:12,  1.25it/s]


Epoch [17/20] - time: 219.55s
Train Loss:  0.083                 | Train AUC:  0.609


16it [00:12,  1.26it/s]


Epoch [18/20] - time: 232.25s
Train Loss:  0.086                 | Train AUC:  0.555


16it [00:12,  1.28it/s]


Epoch [19/20] - time: 244.80s
Train Loss:  0.082                 | Train AUC:  0.626


16it [00:12,  1.26it/s]

Epoch [20/20] - time: 257.50s
Train Loss:  0.085                 | Train AUC:  0.569





## Task 2.5
Evaluate your model, following the evaluation protocol you defined in Theory Question 1, on the test part of the Yelp Review Polarity data set.

In [34]:
def evaluate(model, test_data):
    test_loader = DataLoader(test_data, batch_size=batchSize, shuffle=True, collate_fn=collate_fn)

    criterion = nn.CrossEntropyLoss()
    model = model.to(device)
    total_loss_test = 0
    true_labels = []
    predicted_probs = []
    with torch.no_grad():

        for i, data in tqdm(enumerate(test_loader)):
            if i == sample:
                break
            test_input, test_label = data
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output, _ = model(input_id, mask)
            batch_loss = criterion(output, (test_label-1).long())
            total_loss_test += batch_loss.item()

            target = (test_label-1).long()
            true_labels.append(target.cpu().numpy())
            predicted_probs.append(output.argmax(dim=1).detach().cpu().numpy())

        true_labels = np.concatenate(true_labels)
        predicted_probs = np.concatenate(predicted_probs)
        roc_auc = roc_auc_score(true_labels, predicted_probs)

        print(f'Test Loss: {total_loss_test / (i * batchSize): .3f} \
            | Test AUC: {roc_auc: .3f}')


#model =BertClassifier(freeze_bert=False)
#model.load_state_dict(torch.load("model.pth", map_location=device))  
evaluate(model, test_data)

16it [00:10,  1.49it/s]

Test Loss:  0.084             | Test AUC:  0.608





# Task 3
Visualize and interpret the attention weights of one correctly and one of the incorrectly classified examples of the Yelp Review Polarity test data using [BertViz](https://github.com/jessevig/bertviz)'s model_view.

In [35]:
from bertviz import model_view

correct_example = None
incorrect_example = None

# get one correctly and one incorrectly classified example
test_loader = DataLoader(test_data, batch_size=1, shuffle=True, collate_fn=collate_fn)
model = model.to(device)
for i, data in tqdm(enumerate(test_loader)):
    test_input, test_label = data
    test_label = test_label.to(device)
    mask = test_input['attention_mask'].to(device)
    input_id = test_input['input_ids'].squeeze(1).to(device)
    print(len(input_id[0]))

    output, _ = model(input_id, mask)
    
    if output[0][0] > 0.5:
        pred = 2
    else:
        pred = 1

    if pred == test_label and correct_example is None and len(input_id[0]) < 20: # make sure example is small enough to visualize
        correct_example = test_input
        print("found correct")
    elif pred != test_label and incorrect_example is None and len(input_id[0]) < 20: # make sure example is small enough to visualize
        incorrect_example = test_input
        print("found incorrect")
    
    if correct_example is not None and incorrect_example is not None:
        break

1it [00:00,  1.04it/s]

250


2it [00:01,  1.45it/s]

112


3it [00:01,  1.63it/s]

236


4it [00:02,  1.76it/s]

131


5it [00:02,  1.85it/s]

20


6it [00:03,  1.90it/s]

139


7it [00:03,  1.93it/s]

146


8it [00:04,  1.94it/s]

333


9it [00:04,  1.96it/s]

112


10it [00:05,  1.95it/s]

512


11it [00:06,  1.96it/s]

319


12it [00:06,  1.98it/s]

176


13it [00:07,  1.97it/s]

406


14it [00:07,  2.00it/s]

24


15it [00:07,  2.00it/s]

95


16it [00:08,  2.02it/s]

112


17it [00:08,  2.01it/s]

194


18it [00:09,  2.01it/s]

202


19it [00:09,  2.01it/s]

186


20it [00:10,  2.01it/s]

189


21it [00:10,  2.02it/s]

57


22it [00:11,  1.97it/s]

512


23it [00:11,  1.98it/s]

205


24it [00:12,  2.00it/s]

164


25it [00:12,  2.01it/s]

32


26it [00:13,  2.02it/s]

45


27it [00:13,  2.01it/s]

111


28it [00:14,  2.01it/s]

163


29it [00:14,  2.01it/s]

117


30it [00:15,  2.02it/s]

74


31it [00:15,  2.02it/s]

89


32it [00:16,  2.02it/s]

94


33it [00:16,  2.02it/s]

122


34it [00:17,  2.03it/s]

62


35it [00:17,  2.03it/s]

42


36it [00:18,  2.01it/s]

365


37it [00:18,  2.02it/s]

138


38it [00:19,  1.99it/s]

265


39it [00:19,  2.01it/s]

67


40it [00:20,  2.01it/s]

90


41it [00:20,  2.02it/s]

85


42it [00:21,  2.02it/s]

256


43it [00:21,  2.00it/s]

369


44it [00:22,  2.01it/s]

97


45it [00:22,  2.02it/s]

38


46it [00:23,  2.00it/s]

430


47it [00:23,  2.01it/s]

40


48it [00:24,  2.02it/s]

113


49it [00:24,  2.02it/s]

232


50it [00:25,  2.02it/s]

89


51it [00:25,  1.99it/s]

244


52it [00:26,  1.98it/s]

177


53it [00:26,  1.98it/s]

309


54it [00:27,  2.00it/s]

78


55it [00:27,  1.99it/s]

341


56it [00:28,  1.99it/s]

344


57it [00:28,  2.01it/s]

105


58it [00:29,  2.02it/s]

151


59it [00:29,  2.02it/s]

80


60it [00:30,  2.03it/s]

7
found correct


61it [00:30,  2.03it/s]

185


62it [00:31,  2.01it/s]

114


63it [00:31,  2.01it/s]

260


64it [00:32,  2.02it/s]

60


65it [00:32,  2.03it/s]

53


66it [00:33,  2.01it/s]

150


67it [00:33,  2.02it/s]

34


68it [00:34,  2.01it/s]

315


69it [00:34,  2.02it/s]

57


70it [00:35,  2.01it/s]

245


71it [00:35,  2.01it/s]

213


72it [00:36,  2.01it/s]

126


73it [00:36,  1.99it/s]

393


74it [00:37,  1.99it/s]

157


75it [00:37,  2.00it/s]

74


75it [00:38,  1.96it/s]

18
found incorrect





In [36]:
mask = correct_example['attention_mask'].to(device)
input_id = correct_example['input_ids'].squeeze(1).to(device)

# visualize correctly classified example
outputs, attention = model(input_id, mask)  # Run model
#attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(input_id[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

<IPython.core.display.Javascript object>

In [37]:
mask = incorrect_example['attention_mask'].to(device)
input_id = incorrect_example['input_ids'].squeeze(1).to(device)

# visualize incorrectly classified example
outputs, attention = model(input_id, mask)  # Run model
#attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(input_id[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

<IPython.core.display.Javascript object>

## Theory Question 2
In your own words, describe how the attention mechanism in a transformer works in the case of self-attention and cross-attention, identifying in each case the keys, queries, and values. Give two examples of alignment models and describe how they affect the output using a simple example. This part of the written report can be done in collaboration with your group.

<h4>Self-attention:</h4>
The keys, queries, and values are derived from the same input sequence. The attention function computes a weighted sum of the values based on the similarity between the queries and keys. The similarity scores are computed by taking the dot product between the queries and keys and applying a softmax function to normalize the scores. The resulting weighted sum is used as the output for that position in the sequence.

- Query: Token at current processing stage
- Keys: All tokens from the same sequence to which we pay attention
- Values: Transformed keys that are used for the weighted sum


<h4>Cross-attention:</h4> 
The keys, queries, and values come from different input sequences. For example, when translating a sentence from one language to another, the queries come from the sequence in the target language while the keys and values come from the sequence in the source language.

- Query: Token at current processing stage (in Decoder layer)
- Keys: All tokens from the output of the Encoder to which we pay attention
- Values: Transformed keys that are used for the weighted sum


<h4>Alignment models:</h4> 
Example: Translate "The man" from English to French ("l'homme")

- <b> Bahdanau/additive attention:</b>  In this case, h<sub>j</sub> is a concatenation of the forward- and backward states of a single bi-directional RNN. The vectors s<sub>t-1</sub> (preceding Decoder-state) and h<sub>j</sub> (Encoder hidden state) are then passed throught a single-layer feed-forward neural network. Considering our example, this attention mechanism would learn that while generating "l'", "man" has a high importance to generate the correct French article.
- <b>Dot product:</b> The alignment score solely depends on the similarity (represented by the dot product between two vectors) of the vectors s<sub>t-1</sub> and h<sub>j</sub>. The importance of "man" while generating "l'" therefore depends on the similarity of the encoded hidden state of "man" and the preceding decoder state while generating "l'". Instead of the attention neural network (like above) learning that this should result in a high alignment score, the Encoder/Decoder must learn to generate s<sub>t-1</sub> and h<sub>j</sub> in a way that they result in a large dot product for "man" and "l'".

# Task 4
Please describe your team's implementation of this project, including your personal contribution, in 1000-1500 characters. Each team member must explain the main aspects of the team's implementation, and may not discuss this summary with other students. You are allowed to use figures and tables to clarify. This summary constitutes a separately and individually graded piece of work.