# CS247 Advanced Data Mining - Assignment 4
## Deadline: 11:59PM, February 16, 2023

## Instructions
Each assignment is structured as a Jupyter notebook, offering interactive tutorials that align with our lectures. You will encounter two types of problems: *write-up problems* and *coding problems*.

1. **Write-up Problems:** These problems are primarily theoretical, requiring you to demonstrate your understanding of lecture concepts and to provide mathematical proofs for key theorems. Your answers should include sufficient steps for the mathematical derivations.
2. **Coding Problems:** Here, you will be engaging with practical coding tasks. These may involve completing code segments provided in the notebooks or developing models from scratch.

To ensure clarity and consistency in your submissions, please adhere to the following guidelines:

* For write-up problems, use Markdown bullet points to format text answers. Also, express all mathematical equations using $\LaTeX$ and avoid plain text such as `x0`, `x^1`, or `R x Q` for equations.
* For coding problems, comment on your code thoroughly for readability and ensure your code is executable. Non-runnable code may lead to a loss of **all** points. Coding problems have automated grading, and altering the grading code will result in a deduction of **all** points.
* Your submission should show the entire process of data loading, preprocessing, model implementation, training, and result analysis. This can be achieved through a mix of explanatory text cells, inline comments, intermediate result displays, and experimental visualizations.

### Submission Requirements

* Submit your solutions through GradeScope in BruinLearn.
* Late submissions are allowed up to 24 hours post-deadline with a penalty factor of $\mathbf{1}(t\leq24)e^{-(\ln(2)/12)t}$.

### Collaboration and Integrity

* Collaboration is encouraged, but all final submissions must be your own work. Please acknowledge any collaboration or external sources used, including websites, papers, and GitHub repositories.
* Any suspicious cases of academic misconduct will be reported to The Office of the Dean of Students.

## Outline
* Part 1: The Transformer Model (80 points)
* Part 2: Spectral Clustering (30 points)

## Part 1: The Transformer Model (80 points)

In this assignment, you will build a Transformer model from scratch using PyTorch. This exercise aims to deepen your understanding of the Transformer architecture, as introduced by Vaswani et al. in the landmark paper [*Attention is All You Need*](https://arxiv.org/abs/1706.03762). By implementing the various components of the Transformer, you will gain hands-on experience with key concepts such as self-attention mechanisms, positional encoding, and the overall architecture of the Transformer model.

- This Jupyter Notebook contains template code and instructions for implementing various parts of the Transformer model.
- Follow the instructions and complete the code in the cells marked `# TODO`.
- Make sure to read the comments carefully to understand what each part of the code should do.
- You can refer to the original "Attention is All You Need" paper and the PyTorch documentation for guidance.
- After implementing the components, you will train your Transformer model on a sample dataset to see it in action.

### Key Components of the Transformer

<img src="transformer.svg" width="30%" />

1. **Encoder and Decoder**: The Transformer model consists of an encoder to process the input text and a decoder to generate the output text. Both the encoder and decoder are composed of multiple layers that contain self-attention and feed-forward neural network components.
2. **Multi-Head Attention**: This component allows the model to jointly attend to information from different representation subspaces at different positions. Implementing multi-head attention is a critical part of this assignment.
3. **Positional Encoding**: Since the model contains no recurrence or convolution, positional encodings are added to give the model some information about the relative or absolute position of the tokens in the sequence.
4. **Feed-Forward Networks**: Each layer of the encoder and decoder contains a feed-forward neural network which applies two linear transformations and a ReLU activation in between.

In the following sections, you will implement these components step by step.

#### GPU Support

Considering the size of the training data, it is strongly suggested to use [Google Colab](https://colab.research.google.com/) or a GPU server for this exercise. If you are using Colab, you can manually switch to a CPU device on Colab by clicking `Runtime -> Change runtime type` and selecting `GPU` under `Hardware Accelerator`.

In [1]:
import math
import torch
from torch import nn
from torch.nn import functional as F

USE_GPU = True

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
elif USE_GPU and torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

### Exercise 1: Multi-Head Attention (20 points)

The attention mechanism computes the dot product between the query and key vectors, scaled by the square root of the dimension of the key vectors. The attention weights are then used to compute a weighted sum of the value vectors.
Please implement the `scaled_dot_product_attention` function at first, which will used as a building block for the multi-head attention mechanism. This function computes the attention weights and the weighted sum of the value vectors, given the projected query, key, and value vectors.

In [2]:
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute the scaled dot product attention.

    Parameters:
    - query (torch.Tensor): Queries tensor with shape (batch_size, num_heads, seq_len_q, depth).
    - key (torch.Tensor): Keys tensor with shape (batch_size, num_heads, seq_len_k, depth).
    - value (torch.Tensor): Values tensor with shape (batch_size, num_heads, seq_len_v, depth).
    - mask (torch.Tensor, optional): Mask tensor to filter out certain positions before
      applying softmax. The mask's shape is broadcastable to (batch_size, num_heads, seq_len_q, seq_len_k).
      The mask will contain either 0 values to indicate that the corresponding token in the input sequence
      should be considered in the computations or a 1 to indicate otherwise.

    Returns:
    - torch.Tensor: The output after applying attention to the value vector. Shape is (batch_size, num_heads, seq_len_q, depth).
    - torch.Tensor: The attention weights. Shape is (batch_size, num_heads, seq_len_q, seq_len_k).
    """

    # Compute the matrix multiplication between the query and key tensors
    # The resulting tensor has shape (batch_size, num_heads, seq_len_q, seq_len_k)
    matmul_qk = torch.matmul(query, key.transpose(-2, -1))
    assert matmul_qk.size() == (query.size(0), query.size(1), query.size(2), key.size(2))

    d_k = query.size(-1)
    # Scale the attention weights by the square root of the dimension of the key
    scaled_attention_logits = matmul_qk / math.sqrt(d_k)

    # Apply the mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits = scaled_attention_logits.masked_fill(mask==0, -1e9)

    # Apply the softmax function to obtain the attention weights
    attention_weights = F.softmax(scaled_attention_logits, dim=-1)

    # Apply the attention weights to the value tensor
    output = torch.matmul(attention_weights, value)

    return output, attention_weights

You can verify your implementation by running the test cases provided.

In [3]:
batch_size = 4
num_heads = 8
seq_len_q = 10
seq_len_k = 10
seq_len_v = 10
depth = 128

query = torch.rand(batch_size, num_heads, seq_len_q, depth)
key = torch.rand(batch_size, num_heads, seq_len_k, depth)
value = torch.rand(batch_size, num_heads, seq_len_v, depth)

output, attention_weights = scaled_dot_product_attention(query, key, value)

assert output.shape == (batch_size, num_heads, seq_len_q, depth)
assert attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)

from torch.nn.functional import scaled_dot_product_attention as torch_scaled_dot_product_attention
from torch.testing import assert_close

torch_output = torch_scaled_dot_product_attention(query, key, value)
assert_close(output, torch_output, rtol=1e-6, atol=1e-6)

In Transformers, there are multiple "attention heads", each of which captures a different aspect of the input.
At first, the query and key vectors are passed through a linear layer to project them to a higher-dimensional space. Then, the scaled dot-product attention is applied to each of these projected versions of the query and key vectors. The output of the linear layer is then reshaped to split the attention heads. The attention weights are computed for each head, and the weighted sum is then concatenated and passed through another linear layer to produce the final output.
Please implement the `MultiHeadAttention` class, which contains the logic for the multi-head attention mechanism.

In [4]:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads

        self.wq = torch.nn.Linear(d_model, d_model)
        self.wk = torch.nn.Linear(d_model, d_model)
        self.wv = torch.nn.Linear(d_model, d_model)
        self.dense = torch.nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.depth).transpose(1, 2)
        return x

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Apply linear transformations to the queries, keys, and values
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        # Split the queries, keys, and values into (batch_size, num_heads, seq_len, depth) using the split_heads method
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        output, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate multiple attention heads
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        output = self.dense(output)
        return output, attention_weights

### Exercise 2: Positional Encoding (20 points)

Positional Encoding is a method used to inject some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension as the embeddings so that the two can be summed. There are many choices of positional encodings, learned and fixed. In this assignment, you will implement the fixed positional encoding as described in the following equations:

$$
\begin{align}
    PE_{(pos, 2i)} & = \sin(pos/10000^{2i/d_{model}}), \\
    PE_{(pos, 2i+1)} & = \cos(pos/10000^{2i/d_{model}}),
\end{align}
$$
where $pos$ is the word position and $i$ is the embedding dimension.

In [5]:
class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)

        # Use the formula given in the original paper to compute the positional encodings
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Apply the positional encoding to the input tensor
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return x

### Exercise 3: Encoder and Decoder (20 points)

This section would involve more detailed implementation, including the sub-layer connections, normalization, and how they are combined to form the complete encoder and decoder architecture. Specifically, you will implement the following components:
- `EncoderLayer`, which contains a multi-head attention layer and a feed-forward neural network, each followed by a residual connection and layer normalization.
- `DecoderLayer`, which contains three sub-layers: masked multi-head attention, multi-head attention, and a feed-forward neural network, each followed by a residual connection and layer normalization.

In [6]:
def pointwise_feedforward_network(d_model, dff):
    return nn.Sequential(
        nn.Linear(d_model, dff),
        nn.ReLU(),
        nn.Linear(dff, d_model)
    )

In [7]:
class EncoderLayer(torch.nn.Module):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        """
        Initialize an EncoderLayer.

        Parameters:
        - d_model (int): The dimensionality of the model.
        - num_heads (int): The number of attention heads.
        - dff (int): The dimensionality of the feed-forward network model.
        - dropout_rate (float): The dropout rate.
        """
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = pointwise_feedforward_network(d_model, dff)

        self.layernorm1 = torch.nn.LayerNorm(d_model)
        self.layernorm2 = torch.nn.LayerNorm(d_model)

        self.dropout1 = torch.nn.Dropout(dropout_rate)
        self.dropout2 = torch.nn.Dropout(dropout_rate)

    def forward(self, x, mask):
        """
        The forward pass for the EncoderLayer.

        Parameters:
        - x (Tensor): Input tensor to the encoder layer.
        - mask (Tensor, optional): The mask for padding tokens to ignore during self-attention.

        Returns:
        - Tensor: The output of the encoder layer.
        """
        # Step 1: Apply multi-head attention (with padding mask) and add & norm
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(x + attn_output)

        # Step 2: Apply the feed-forward network and add & norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2

In [8]:
class DecoderLayer(torch.nn.Module):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        """
        Initialize a DecoderLayer.

        Parameters:
        - d_model (int): The dimensionality of the model, i.e., the size of the input and output embeddings.
        - num_heads (int): The number of attention heads.
        - dff (int): The dimensionality of the feed-forward network model.
        - dropout_rate (float): The dropout rate.
        """
        super(DecoderLayer, self).__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)  # Self-attention
        self.mha2 = MultiHeadAttention(d_model, num_heads)  # Cross-attention

        self.ffn = pointwise_feedforward_network(d_model, dff)

        self.layernorm1 = torch.nn.LayerNorm(d_model)
        self.layernorm2 = torch.nn.LayerNorm(d_model)
        self.layernorm3 = torch.nn.LayerNorm(d_model)

        self.dropout1 = torch.nn.Dropout(dropout_rate)
        self.dropout2 = torch.nn.Dropout(dropout_rate)
        self.dropout3 = torch.nn.Dropout(dropout_rate)

    def forward(self, x, enc_output, look_ahead_mask=None, padding_mask=None):
        """
        The forward pass for the DecoderLayer.

        Parameters:
        - x (Tensor): Input tensor for decoder layer.
        - enc_output (Tensor): Output from the encoder (serves as Key and Value for cross attention).
        - look_ahead_mask (Tensor, optional): The mask for future tokens in a sequence within the self-attention mechanism.
        - padding_mask (Tensor, optional): The mask for padding tokens within the encoder output.
        """
        # Step 1: Self attention with look ahead mask and padding mask
        attn1, _ = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1)
        out1 = self.layernorm1(x + attn1)

        # Step 2: Cross attention where query comes from previous layer, and key, value come from encoder output
        attn2, _ = self.mha2(out1, enc_output, enc_output, padding_mask)
        attn2 = self.dropout2(attn2)
        out2 = self.layernorm2(out1 + attn2)

        # Step 3: Apply the feed forward network
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output)
        out3 = self.layernorm3(out2 + ffn_output)

        return out3

In [9]:
class Encoder(torch.nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, max_position_encoding, dropout_rate):
        super(Encoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = torch.nn.Embedding(input_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_position_encoding)

        self.enc_layers = torch.nn.ModuleList([EncoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)])

        self.dropout = torch.nn.Dropout(dropout_rate)

    def forward(self, x, mask):
        # Adding embedding and position encoding
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)

        x = self.dropout(x)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, mask)

        return x  # (batch_size, input_seq_len, d_model)

In [10]:
class Decoder(torch.nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, max_position_encoding, dropout_rate):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = torch.nn.Embedding(target_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_position_encoding)

        self.dec_layers = torch.nn.ModuleList([DecoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)])
        self.dropout = torch.nn.Dropout(dropout_rate)

    def forward(self, x, enc_output, look_ahead_mask, padding_mask):
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        for i in range(self.num_layers):
            x = self.dec_layers[i](x, enc_output, look_ahead_mask, padding_mask)

        return x  # (batch_size, target_seq_len, d_model)

### Exercise 4: Masking (20 points)

An important step in training the Transformer model is to mask the attention weights for the future tokens in the sequence.
Let's consider a simplified scenario where we have an input sequence in English "Hello World" and a hypothetical target sequence in German "Hallo Welt" during training. We will tokenize these sequences into numerical tokens (assume a simple tokenization for illustration) and show how the `generate_mask` function generates the padding and look-ahead masks for these sequences.

First, let's assign numerical tokens to our sequences. In a real scenario, these would come from a tokenizer's vocabulary:
* English (Source) Tokens: "Hello World" → [1, 2]
* German (Target) Tokens: "Hallo Welt" → [1, 2]

Assume padding token ID is 0, and both sequences are already padded to a maximum length of 4 for this example:
* Padded English Sequence: [1, 2, 0, 0]
* Padded German Sequence (with EOS token for simplicity): [1, 2, 3, 0] where 3 is the EOS token.

The source mask allows the model to ignore the padding tokens in the source sequence. It would look something like this for the example:
```Python
src_mask = [[[1, 1, 0, 0]]]  # Shape: (batch_size, 1, 1, src_seq_len)
```
This indicates that the first two tokens are valid while the last two are padding tokens that should be ignored.

The target mask is a combination of padding mask and look-ahead mask to ensure that for predicting each token, the model can only attend to previous tokens and ignores future tokens as well as padding. For our target sequence, considering both padding and look-ahead constraints, the mask might look like:
```Python
tgt_mask =
[[[[1, 0, 0, 0],
   [1, 1, 0, 0],
   [1, 1, 1, 0],
   [1, 1, 1, 0]]]]  # Shape: (batch_size, 1, tgt_seq_len, tgt_seq_len)
```
Here, the first row allows attention to the first token, the second row to the first and second tokens, and so on. The last token does not attend to future tokens (it can't see them), and since it's an EOS token, it correctly doesn't need to see beyond its position, but the model design might mask it differently based on implementation specifics.

Please refer to the comments in the code cells for more detailed instructions on how to implement these two masks.

In [11]:
def generate_mask(src, tgt):
    """
    Generates padding and look-ahead masks for source and target sequences.
    Suppose that the padding token is 0.

    Parameters:
    - src (Tensor): The source sequence tensor with shape (batch_size, src_seq_len).
    - tgt (Tensor): The target sequence tensor with shape (batch_size, tgt_seq_len).

    Returns:
    - Tensor: The padding mask for the source sequence.
    - Tensor: The combined padding and look-ahead mask for the target sequence.
    """

    # Create a mask for the source sequence padding tokens
    src_mask = (src == 0).unsqueeze(1).unsqueeze(1)

    # Create a mask for the target sequence padding tokens
    tgt_mask = (tgt == 0).unsqueeze(1).unsqueeze(3)

    # Generate a no-peek (look-ahead) mask to prevent positions from attending to subsequent positions
    seq_len = tgt.size(1)
    nopeak_mask = torch.triu(torch.ones(1, seq_len, seq_len), diagonal=1).bool()
    nopeak_mask =nopeak_mask.to(device)

    # Combine the padding mask and the look-ahead mask for the target sequence
    tgt_mask = tgt_mask | nopeak_mask

    return src_mask, tgt_mask

Putting all these components together, we get our complete Transformer model.

In [12]:
class Transformer(torch.nn.Module):
    def __init__(self, num_encoder_layers, num_decoder_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, pe_input, pe_target, dropout_rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_encoder_layers, d_model, num_heads, dff, input_vocab_size, pe_input, dropout_rate)
        self.decoder = Decoder(num_decoder_layers, d_model, num_heads, dff, target_vocab_size, pe_target, dropout_rate)
        self.final_layer = torch.nn.Linear(d_model, target_vocab_size)

    def forward(self, inp, tar):
        # Generate masks
        src_mask, tgt_mask = generate_mask(inp, tar)

        # Pass the input through the encoder, which uses src_mask
        enc_output = self.encoder(inp, src_mask)  # (batch_size, inp_seq_len, d_model)

        # Pass the encoder output and target through the decoder, which uses tgt_mask and src_mask
        dec_output = self.decoder(tar, enc_output, tgt_mask, src_mask)  # (batch_size, tar_seq_len, d_model)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output

Before you move on to experiment on a real dataset, you may verify that your Transformer model is working correctly with a synthetically generated dataset.

In [13]:
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 2
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(
    num_layers, num_layers, d_model, num_heads, d_ff,
    src_vocab_size, tgt_vocab_size,
    pe_input=max_seq_length, pe_target=max_seq_length, dropout_rate=dropout).to(device)

src_data = torch.randint(1, src_vocab_size, (64, max_seq_length)).to(device)  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length)).to(device)  # (batch_size, seq_length)

output = transformer(src_data, tgt_data)

## Exercise 5: Model Training and Evaluation (no grading)

In this section, you will train your Transformer model on a translation task using the WMT German-English dataset.
After training, you will evaluate the model on a test set using BLEU score as the evaluation metric.
Due to the large size of the WMT dataset, we will not be grading this part. However, you are encouraged to experiment with the real dataset to see how well your model performs.
Before you start training, make sure that you have installed the necessary packages and have access to a GPU for faster training.

In [14]:
%pip install datasets transformers sacrebleu

Collecting datasets
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/536.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━[0m [32m286.7/536.6 kB[0m [31m8.5 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.6/536.6 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
Collecting sacrebleu
  Downloading sacrebleu-2.4.0-py3-none-any.whl (106 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.3/106.3 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m38.3/38.3 MB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     

In [15]:
from datasets import load_dataset

dataset = load_dataset('wmt14', 'de-en', split={'train': 'train[:1%]', 'test': 'test', 'validation': 'validation'})
train_data = dataset['test']
valid_data = dataset['validation']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.97k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/9.62k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/41.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/658M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/919M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/80.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/38.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4508785 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3003 [00:00<?, ? examples/s]

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')

def tokenize_data(examples, max_length=128):
    inputs = [ex['de'] for ex in examples['translation']]
    targets = [ex['en'] for ex in examples['translation']]
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True, padding='max_length')
    labels = tokenizer(targets, max_length=max_length, truncation=True, padding='max_length')

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_train_data = train_data.map(tokenize_data, batched=True).with_format(type='torch', columns=['input_ids', 'labels'])
tokenized_valid_data = valid_data.map(tokenize_data, batched=True).with_format(type='torch', columns=['input_ids', 'labels'])

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map:   0%|          | 0/3003 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [17]:
from torch.utils.data import DataLoader

train_loader = DataLoader(tokenized_train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(tokenized_valid_data, batch_size=64)

To determine the appropriate length of the maximum sequence, you might want to inspect the length distribution of the training data. You can then set the maximum sequence length to a value that covers most of the training data.

In [18]:
import numpy as np

lengths = [len(tokenizer.tokenize(example['en'])) for example in train_data['translation']]

lengths = np.array(lengths)
mean_length = np.mean(lengths)
max_length = np.max(lengths)
median_length = np.median(lengths)
percentile_90 = np.percentile(lengths, 90)

print(f"Mean length: {mean_length}")
print(f"Max length: {max_length}")
print(f"Median length: {median_length}")
print(f"90th percentile length: {percentile_90}")

Mean length: 24.776223776223777
Max length: 116
Median length: 22.0
90th percentile length: 42.0


In [19]:
import torch.optim as optim

num_encoder_layers = 6
num_decoder_layers = 6
max_seq_length = 512
d_model = 512
num_heads = 8
dff = 2048
dropout_rate = 0.1
input_vocab_size = tokenizer.vocab_size
target_vocab_size = tokenizer.vocab_size

model = Transformer(
    num_layers, num_layers, d_model, num_heads, d_ff,
    input_vocab_size, target_vocab_size,
    pe_input=max_seq_length, pe_target=max_seq_length, dropout_rate=dropout).to(device)

optimizer = optim.Adam(model.parameters(), lr=0.0005)
criterion = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

In sequence-to-sequence (seq2seq) models, such as Transformers, the label shifting technique, often referred to as "teacher forcing" when used during training, plays a crucial role in the learning process. This technique involves shifting the labels by one position so that the model predicts the next token in the sequence given all the previous tokens up to that point.

Suppose we have the following English sentence (source) and its French translation (target):
- **English (Source)**: "Hello, world"
- **French (Target)**: "Bonjour, le monde"

Input to the Model (Encoder Input): The input sequence to the encoder would be the English sentence, tokenized and possibly including special tokens like start-of-sequence (SOS) or end-of-sequence (EOS) tokens, depending on the model architecture:
- **Encoder Input**: `[SOS] Hello, world [EOS]`

Target Sequence for Teacher Forcing (Decoder Input): The target sequence for teacher forcing (used as input to the decoder) is shifted by one token to the right, to teach the model to predict the next token in the sequence. It includes an SOS token at the beginning to indicate the start of the sequence but omits the EOS token or includes it only as part of the ground truth for the final step, ensuring that for each input token, the model learns to predict the subsequent token:
- **Decoder Input**: `[SOS] Bonjour, le monde`

The ground truth data against which the model's predictions are compared is the target sequence shifted one position to the left, excluding the SOS token and including the EOS token. This ensures that for every step of the sequence, the model is trained to predict the next token:
- **Ground Truth Data**: `Bonjour, le monde [EOS]`

To visualize the shifting, consider how each token in the decoder input is used to predict the corresponding token in the ground truth data:
- Decoder Input: `[SOS]` → Predicts → `Bonjour`
- Decoder Input: `Bonjour` → Predicts → `,`
- Decoder Input: `,` → Predicts → `le`
- Decoder Input: `le` → Predicts → `monde`
- Decoder Input: `monde` → Predicts → `[EOS]`

In [21]:
model.train()
for epoch in range(50):
    for batch in train_loader:
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)

        output = model(input_ids, labels[:, :-1])
        output_dim = output.shape[-1]  # Vocabulary size
        # Reshape output to (batch_size * seq_len, output_dim) for calculating loss
        output = output.reshape(-1, output_dim)
        labels = labels[:, 1:].reshape(-1)  # Flatten labels to align with output for CrossEntropyLoss

        loss = criterion(output, labels)
        loss.backward()
        # max_grad_norm = 4.0  # Gradient Clipping by setting a suitable max_grad_norm
        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()


    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Epoch 1, Loss: 6.202316761016846
Epoch 2, Loss: 4.652083873748779
Epoch 3, Loss: 3.952393054962158
Epoch 4, Loss: 3.0091633796691895
Epoch 5, Loss: 2.5502829551696777
Epoch 6, Loss: 1.8142486810684204
Epoch 7, Loss: 1.405409574508667
Epoch 8, Loss: 1.0702543258666992
Epoch 9, Loss: 0.7337517738342285
Epoch 10, Loss: 0.5610472559928894
Epoch 11, Loss: 0.41281720995903015
Epoch 12, Loss: 0.32359522581100464
Epoch 13, Loss: 0.24260513484477997
Epoch 14, Loss: 0.1935943067073822
Epoch 15, Loss: 0.16754066944122314
Epoch 16, Loss: 0.13239189982414246
Epoch 17, Loss: 0.12168551236391068
Epoch 18, Loss: 0.16486980020999908
Epoch 19, Loss: 0.11260131746530533
Epoch 20, Loss: 0.07621283084154129
Epoch 21, Loss: 0.10526382178068161
Epoch 22, Loss: 0.07447303086519241
Epoch 23, Loss: 0.07626289129257202
Epoch 24, Loss: 0.07833409309387207
Epoch 25, Loss: 0.06205301359295845
Epoch 26, Loss: 0.08553972840309143
Epoch 27, Loss: 0.06268931925296783
Epoch 28, Loss: 0.15206049382686615
Epoch 29, Loss: 

Hugging Face provides various metrics through its datasets library. For translation tasks, BLEU is a common metric used to evaluate the quality of the model's translations. You can use the `datasets` library to load the WMT dataset and evaluate your model using BLEU score.

In [22]:
from datasets import load_metric

bleu_metric = load_metric('sacrebleu')

model.eval()
predictions = []
references = []

for batch in valid_loader:
    input_ids = batch['input_ids'].to(device)
    labels = batch['labels'].to(device)  # Ground truth labels

    with torch.no_grad():
        outputs = model(input_ids, labels)

    # Convert model outputs to predicted tokens
    predicted_tokens = torch.argmax(outputs, dim=-1)

    # Convert tokens to texts
    predicted_texts = [tokenizer.decode(ids, skip_special_tokens=True) for ids in predicted_tokens]
    reference_texts = [[tokenizer.decode(ids, skip_special_tokens=True)] for ids in labels]

    predictions.extend(predicted_texts)
    references.extend(reference_texts)

# Compute BLEU score
results = bleu_metric.compute(predictions=predictions, references=references)
print(f"BLEU score: {results['score']}")

  bleu_metric = load_metric('sacrebleu')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

BLEU score: 16.771825813848054


In [23]:
random_idx = np.random.randint(0, len(predictions))
print(f"Reference: {references[random_idx][0]}")
print(f"Predicted: {predictions[random_idx]}")

Reference: Visitors and holidaymakers arrive mostly from the USA, Russia, France, Germany, Italy, England, and Ukraine.
Predicted: Alreadyn Diet families programme,, from the USA,954,330, Diet, Paul, Paul, and complement.,


You might notice that there are lot of repetitions of the word "the" in the decoder's output.
This is a common issue known as "repetition problem" or "degeneration problem." This usually occurs during the generation phase, where the model falls into a loop, outputting the same word or phrase repeatedly. This is mostly due to the inadequate training of the model. There are several techniques to mitigate this issue, such as using beam search, nucleus sampling, or top-k sampling during the generation phase.