
## Grading Criteria

**Maximum Score: 10 points**

1. **Step #1: Implementation of Attention Mechanisms (4 points)**  
   - 2 points for implementing the `additive` attention mechanism correctly.
   - 2 points for implementing the `multiplicative` attention mechanism correctly.

2. **Step #2: BERT-based Text Classification Task (4 points)**  
   - 2 points for setting up the BERT model correctly for classification.
   - 2 points for evaluating the model performance accurately.

3. **Code Quality and Comments (2 points)**  
   - 1 point for code clarity and logical structuring of functions and classes.
   - 1 point for detailed comments explaining each part of the code.

**Total: 10 points**  
Each section will be reviewed to ensure that all requirements are met and that the code is efficient and well-documented. Pay attention to using proper variable names and providing comments that describe the purpose of each function and major code sections.


## Attention & BERT

In this homework assignment, you will explore the attention mechanism by implementing several of its variants, and then revisit the text classification task, this time solving it with BERT.


In [None]:
import os
import random

import numpy as np
import pandas as pd

from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score

import torch
import torch.nn as nn
import torch.nn.functional as F

import matplotlib.pyplot as plt
from IPython.display import clear_output 
# Display inline matplotlib plots
%matplotlib inline

### Step #1. Implementation of Attention

In this task, please implement the Attention mechanism, specifically several methods for calculating attention scores. While this mechanism is already implemented in popular frameworks, you'll implement it using `numpy` to gain a better understanding.

Your task in this part: implement `additive` and `multiplicative` variants of Attention. For your convenience (and as an example), the `dot product` attention (based on scalar product) is already implemented.

Detailed descriptions of these types of Attention are available in the lecture slides.

In [None]:
decoder_hidden_state = np.array([7, 11, 4]).astype(float)[:, None]

plt.figure(figsize=(2, 5))
plt.pcolormesh(decoder_hidden_state)
plt.colorbar()
plt.title("Decoder state")

#### Dot product attention (example implementation)

Let's consider a single encoder state – a vector with dimensions `(n_hidden, 1)`, where `n_hidden = 3`:


In [None]:
single_encoder_hidden_state = np.array([1, 5, 11]).astype(float)[:, None]

plt.figure(figsize=(2, 5))
plt.pcolormesh(single_encoder_hidden_state)
plt.colorbar()

The attention score between these encoder and decoder states is simply calculated as a dot product:

In [None]:
np.dot(decoder_hidden_state.T, single_encoder_hidden_state)

In the general case, there are, of course, multiple encoder states. Attention scores are computed with each encoder state:

In [None]:
encoder_hidden_states = (
    np.array([[1, 5, 11], [7, 4, 1], [8, 12, 2], [-9, 0, 1]]).astype(float).T
)

encoder_hidden_states

Then, to calculate the dot products between a single decoder state and all encoder states, we can use the following function (which is essentially just matrix multiplication and type conversion):


In [None]:
def dot_product_attention_score(decoder_hidden_state, encoder_hidden_states):
    """
    decoder_hidden_state: np.array of shape (n_features, 1)
    encoder_hidden_states: np.array of shape (n_features, n_states)

    return: np.array of shape (1, n_states)
        Array with dot product attention scores
    """
    attention_scores = np.dot(decoder_hidden_state.T, encoder_hidden_states)
    return attention_scores

In [None]:
dot_product_attention_score(decoder_hidden_state, encoder_hidden_states)
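
As a sanity check, the scores can be worked out by hand: each score is the dot product of the decoder state with one column of the encoder matrix. The sketch below repeats the toy values from the cells above so that it runs on its own; the short names `s` and `H` are only illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

# Same toy values as in the cells above, repeated so the snippet is self-contained.
s = np.array([7, 11, 4], dtype=float)[:, None]  # decoder state, shape (3, 1)
H = np.array(
    [[1, 5, 11], [7, 4, 1], [8, 12, 2], [-9, 0, 1]], dtype=float
).T  # encoder states, shape (3, 4), one state per column

# One dot product per encoder state (per column of H).
scores_loop = np.array([[float(s[:, 0] @ H[:, i]) for i in range(H.shape[1])]])

# The same scores from a single matrix product.
scores_matmul = s.T @ H

# First score: 7*1 + 11*5 + 4*11 = 106; the others are 97, 196 and -59.
print(scores_matmul)
```

Both ways of computing the scores give a `(1, n_states)` array, which is the shape the softmax step expects.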

To calculate the "weights," we need Softmax:

In [None]:
def softmax(vector):
    """
    vector: np.array of shape (n, m)

    return: np.array of shape (n, m)
        Matrix where softmax is computed for every row independently
    """
    nice_vector = vector - vector.max()
    exp_vector = np.exp(nice_vector)
    exp_denominator = np.sum(exp_vector, axis=1)[:, np.newaxis]
    softmax_ = exp_vector / exp_denominator
    return softmax_
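
The `softmax` above subtracts the maximum before exponentiating; this does not change the result but keeps `np.exp` from overflowing. The sketch below illustrates why the shift matters. It uses a hypothetical helper `rowwise_softmax` that subtracts each row's own maximum (a common variant, not the notebook's function) and compares it with a naive computation on deliberately large toy scores.

```python
import numpy as np


def rowwise_softmax(scores):
    """Hypothetical helper: softmax over each row, shifted by that row's maximum."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


# Toy scores: the second row is large enough to overflow np.exp.
scores = np.array(
    [[106.0, 97.0, 196.0, -59.0], [1000.0, 1001.0, 1002.0, 999.0]]
)

# Naive softmax: exp(1000) overflows to inf, and inf / inf yields nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

weights = rowwise_softmax(scores)

print(np.isnan(naive[1]).all())  # the naive second row is unusable
print(weights.sum(axis=1))       # each shifted row sums to 1
```

With a single row of scores, as in this notebook, subtracting the global maximum and subtracting the row maximum are the same operation.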

In [None]:
weights_vector = softmax(
    dot_product_attention_score(decoder_hidden_state, encoder_hidden_states)
)

weights_vector

Finally, we'll use these weights and compute the final vector, as described for dot product attention.


In [None]:
attention_vector = weights_vector.dot(encoder_hidden_states.T).T
print(attention_vector)

plt.figure(figsize=(2, 5))
plt.pcolormesh(attention_vector, cmap="spring")
plt.colorbar()

This vector accumulates information from all encoder states, each weighted by its similarity to the given decoder state. Let's combine all of the above transformations into a single function:

In [None]:
def dot_product_attention(decoder_hidden_state, encoder_hidden_states):
    """
    decoder_hidden_state: np.array of shape (n_features, 1)
    encoder_hidden_states: np.array of shape (n_features, n_states)

    return: np.array of shape (n_features, 1)
        Final attention vector
    """
    softmax_vector = softmax(
        dot_product_attention_score(decoder_hidden_state, encoder_hidden_states)
    )
    attention_vector = softmax_vector.dot(encoder_hidden_states.T).T
    return attention_vector

In [None]:
# Compare floats with a tolerance rather than exact equality.
assert np.allclose(
    attention_vector,
    dot_product_attention(decoder_hidden_state, encoder_hidden_states),
)

#### Multiplicative attention

Your current task: implement multiplicative attention.

$$
e_i = \mathbf{s}^TW_{mult}\mathbf{h}_i
$$

The weight matrix `W_mult` is given below. Note that multiplicative attention can work with encoder and decoder states of different dimensions, so a new set of encoder states (with 5 features instead of 3) is defined as well:
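
To see how the shapes fit together, here is a minimal sketch of the score formula alone, on random toy data. All names (`s`, `H`, `W`) and the random values are assumptions made for illustration; they stand in for the decoder state, the encoder states and `W_mult`, and the snippet stops at the scores rather than computing the final attention vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dec, n_enc, n_states = 3, 5, 4  # toy sizes matching the shapes in this section

s = rng.normal(size=(n_dec, 1))         # decoder state
H = rng.normal(size=(n_enc, n_states))  # encoder states, one per column
W = rng.normal(size=(n_dec, n_enc))     # toy stand-in for W_mult

# e_i = s^T W h_i for every column h_i at once.
scores = s.T @ W @ H  # shape (1, n_states)

# The same numbers, computed one encoder state at a time.
scores_loop = np.array(
    [[(s.T @ W @ H[:, [i]]).item() for i in range(n_states)]]
)

print(scores.shape)
```

The matrix `W` maps between the two feature spaces, which is why the decoder and encoder dimensions no longer have to match.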


In [None]:
encoder_hidden_states_complex = (
    np.array([[1, 5, 11, 4, -4], [7, 4, 1, 2, 2], [8, 12, 2, 11, 5], [-9, 0, 1, 8, 12]])
    .astype(float)
    .T
)

W_mult = np.array(
    [
        [-0.78, -0.97, -1.09, -1.79, 0.24],
        [0.04, -0.27, -0.98, -0.49, 0.52],
        [1.08, 0.91, -0.99, 2.04, -0.15],
    ]
)

In [None]:
# your code here

Implement the attention calculation according to the formulas and create the final function `multiplicative_attention`:

In [None]:
def multiplicative_attention(decoder_hidden_state, encoder_hidden_states, W_mult):
    """
    decoder_hidden_state: np.array of shape (n_features_dec, 1)
    encoder_hidden_states: np.array of shape (n_features_enc, n_states)
    W_mult: np.array of shape (n_features_dec, n_features_enc)

    return: np.array of shape (n_features_enc, 1)
        Final attention vector
    """
    # your code here
    return attention_vector

#### Additive attention

Now you need to implement additive attention.

$$
e_i = \mathbf{v}^T \tanh (W_{\text{add-enc}} \mathbf{h}_i + W_{\text{add-dec}} \mathbf{s})
$$

The weight matrices `W_add_enc` and `W_add_dec` are provided below, as well as the weight vector `v_add`. For activation calculation, you can use `np.tanh`.
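
The formula mixes three shapes, so a small shape check may help. The sketch below evaluates only the scores on random toy data; the names `s`, `H`, `v`, `W_enc` and `W_dec` and all values are assumptions for illustration, standing in for the decoder state, the encoder states, `v_add`, `W_add_enc` and `W_add_dec`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dec, n_enc, n_int, n_states = 3, 5, 7, 4  # toy sizes matching this section

s = rng.normal(size=(n_dec, 1))         # decoder state
H = rng.normal(size=(n_enc, n_states))  # encoder states, one per column
v = rng.normal(size=(n_int, 1))         # toy stand-in for v_add
W_enc = rng.normal(size=(n_int, n_enc)) # toy stand-in for W_add_enc
W_dec = rng.normal(size=(n_int, n_dec)) # toy stand-in for W_add_dec

# W_enc @ H has shape (n_int, n_states); W_dec @ s has shape (n_int, 1)
# and is broadcast across all columns.
hidden = np.tanh(W_enc @ H + W_dec @ s)
scores = v.T @ hidden  # shape (1, n_states)

# The same numbers, computed one encoder state at a time.
scores_loop = np.array(
    [[(v.T @ np.tanh(W_enc @ H[:, [i]] + W_dec @ s)).item() for i in range(n_states)]]
)

print(scores.shape)
```

Because both projections land in the same intermediate space of size `n_int`, the encoder and decoder dimensions may differ here as well.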

In [None]:
v_add = np.array([[-0.35, -0.58, 0.07, 1.39, -0.79, -1.78, -0.35]]).T

W_add_enc = np.array(
    [
        [-1.34, -0.1, -0.38, 0.12, -0.34],
        [-1.0, 1.28, 0.49, -0.41, -0.32],
        [-0.39, -1.38, 1.26, 1.21, 0.15],
        [-0.18, 0.04, 1.36, -1.18, -0.53],
        [-0.23, 0.96, 1.02, 0.39, -1.26],
        [-1.27, 0.89, -0.85, -0.01, -1.19],
        [0.46, -0.12, -0.86, -0.93, -0.4],
    ]
)

W_add_dec = np.array(
    [
        [-1.62, -0.02, -0.39],
        [0.43, 0.61, -0.23],
        [-1.5, -0.43, -0.91],
        [-0.14, 0.03, 0.05],
        [0.85, 0.51, 0.63],
        [0.39, -0.42, 1.34],
        [-0.47, -0.31, -1.34],
    ]
)

In [None]:
# your code here

Implement the attention calculation according to the formulas and create the final function `additive_attention`:

In [None]:
def additive_attention(
    decoder_hidden_state, encoder_hidden_states, v_add, W_add_enc, W_add_dec
):
    """
    decoder_hidden_state: np.array of shape (n_features_dec, 1)
    encoder_hidden_states: np.array of shape (n_features_enc, n_states)
    v_add: np.array of shape (n_features_int, 1)
    W_add_enc: np.array of shape (n_features_int, n_features_enc)
    W_add_dec: np.array of shape (n_features_int, n_features_dec)

    return: np.array of shape (n_features_enc, 1)
        Final attention vector
    """
    # your code here
    return attention_vector

Submit the `multiplicative_attention` and `additive_attention` functions to the contest. Don’t forget to import `numpy`!

### Step #2. Text classification using a pretrained language model.

We will work with the SST-2 dataset. The code below downloads it, splits it into train and test parts, and loads a separate holdout set.

In [None]:
# do not change the code in the block below
# __________start of block__________

!wget https://raw.githubusercontent.com/girafe-ai/ml-course/msu_branch/homeworks/hw08_attention/holdout_texts08.npy
# __________end of block__________

In [None]:
# do not change the code in the block below
# __________start of block__________
df = pd.read_csv(
    "https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv",
    delimiter="\t",
    header=None,
)
texts_train = df[0].values[:5000]
y_train = df[1].values[:5000]
texts_test = df[0].values[5000:]
y_test = df[1].values[5000:]
texts_holdout = np.load("holdout_texts08.npy", allow_pickle=True)
# __________end of block__________

The rest of the code is up to you to write.
To successfully achieve the maximum score, you need to reach at least 84.5% accuracy on the test part of the dataset.
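
One way to check the threshold is to compute accuracy directly from predicted class probabilities. The sketch below is a minimal illustration on made-up toy values; `accuracy_from_probas` is a hypothetical helper, and `accuracy_score` from `sklearn.metrics` (imported at the top of the notebook) can be applied to the predicted labels in the same way.

```python
import numpy as np


def accuracy_from_probas(probas, y_true):
    """Hypothetical helper: accuracy of the argmax class against true labels."""
    predictions = probas.argmax(axis=1)  # column with the larger probability
    return float((predictions == np.asarray(y_true)).mean())


# Made-up toy values, only to show the call; not real model output.
toy_probas = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
toy_labels = np.array([0, 1, 1, 1])

toy_accuracy = accuracy_from_probas(toy_probas, toy_labels)
print(toy_accuracy, toy_accuracy >= 0.845)  # 3 of 4 correct, below the threshold
```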

In [None]:
# your beautiful experiments here

#### Submitting the Assignment in the Contest

Save the probabilities of belonging to class 0 and class 1, respectively, in the dictionary `out_dict`:
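
If your classifier outputs only the probability of class 1 (for example, a sigmoid output), it has to be expanded into two columns first. The sketch below shows one way to do that; `to_two_column_probas` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def to_two_column_probas(p_positive):
    """Hypothetical helper: turn P(class 1) into an (n, 2) array [P(0), P(1)]."""
    p_positive = np.asarray(p_positive, dtype=float).reshape(-1)
    return np.column_stack([1.0 - p_positive, p_positive])


# Made-up probabilities of class 1 for three objects.
toy = to_two_column_probas([0.1, 0.75, 0.5])
print(toy.shape)        # one row per object, one column per class
print(toy.sum(axis=1))  # every row sums to 1
```

If the model already returns two logits per object, applying a softmax over the class axis gives an array of the same shape.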



In [None]:
out_dict = {
    "train": None,  # replace with np.array of shape (5000, 2) with probas
    "test": None,  # replace with np.array of shape (1920, 2) with probas
    "holdout": None,  # replace with np.array of shape (500, 2) with probas
}

Several `assert`s to check your solution:

In [None]:
assert isinstance(out_dict["train"], np.ndarray), "Dict values should be numpy arrays"
assert out_dict["train"].shape == (
    5000,
    2,
), "The predicted probas shape does not match the train set size"
assert np.allclose(
    out_dict["train"].sum(axis=1), 1.0
), "Probas do not sum up to 1 for some of the objects"

assert isinstance(out_dict["test"], np.ndarray), "Dict values should be numpy arrays"
assert out_dict["test"].shape == (
    1920,
    2,
), "The predicted probas shape does not match the test set size"
assert np.allclose(
    out_dict["test"].sum(axis=1), 1.0
), "Probas do not sum up to 1 for some of the objects"

assert isinstance(out_dict["holdout"], np.ndarray), "Dict values should be numpy arrays"
assert out_dict["holdout"].shape == (
    500,
    2,
), "The predicted probas shape does not match the holdout set size"
assert np.allclose(
    out_dict["holdout"].sum(axis=1), 1.0
), "Probas do not sum up to 1 for some of the objects"