# Assignment 3 (100 + 5 points)

**Name:** <br>
**Email:** <br>
**Group:** A/B <br>
**Hours spent *(optional)*:** <br>

### Question 1: Transformer model *(100 points)*

As a machine learning engineer at a tech company, you have been asked to develop a machine translation system that translates **English (source) to German (target)**. You may use existing libraries, but the model must be trained from scratch (pretrained weights are not allowed). You are free to choose any dataset for training. Hold out a small subset of the data as a validation set and report the BLEU score on it. Also provide a short description of your Transformer architecture, hyperparameters, and training procedure, including the training loss curve.
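To make the BLEU requirement concrete, here is a minimal sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). It assumes one reference per sentence, whitespace tokenization, and no smoothing; the function names are illustrative. For the reported score, a standard tool such as sacreBLEU is likely the safer choice, since tokenization details change the number.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1] with one reference per hypothesis.

    Both arguments are lists of token lists. No smoothing is applied, so the
    score is 0.0 whenever some n-gram order has no match at all.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

reference = "das ist ein kleiner test".split()
perfect = corpus_bleu([reference], [reference])
shorter = corpus_bleu(["das ist ein kleiner".split()], [reference])
```

In this toy case `shorter` has perfect n-gram precision but is one token short, so only the brevity penalty lowers its score.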

<h3> Submission </h3>

The test set **(test.txt)** will be released one week before the deadline. Submit your model's output on the test set as a separate file named **"first name_last_name_test_result.txt"**. Each line of the submission file should contain only the translation of the corresponding line in 'test.txt'.

The 'first name_last_name_test_result.txt' file will be evaluated by your instructor, and the student with the best BLEU score will receive 5 additional points.
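A minimal sketch of producing a file in this one-translation-per-line format follows. The `translate` argument is a hypothetical callable (source sentence in, German sentence out) standing in for the trained model, and the demo file names are placeholders, not the required submission name.

```python
import tempfile
from pathlib import Path

def write_submission(test_path, output_path, translate):
    """Translate test_path line by line and write one translation per line."""
    with open(test_path, encoding="utf-8") as f:
        source_lines = [line.rstrip("\n") for line in f]
    with open(output_path, "w", encoding="utf-8") as out:
        for line in source_lines:
            # Collapse whitespace so a translation can never span several lines.
            out.write(" ".join(translate(line).split()) + "\n")
    return len(source_lines)

# Demonstration with throwaway files; str.upper stands in for a real model.
with tempfile.TemporaryDirectory() as tmp:
    test_path = Path(tmp) / "test.txt"
    result_path = Path(tmp) / "demo_test_result.txt"
    test_path.write_text("Hello world.\nHow are you?\n", encoding="utf-8")
    n_written = write_submission(test_path, result_path, translate=str.upper)
    result_lines = result_path.read_text(encoding="utf-8").splitlines()
```

Writing exactly one output line per input line, even for empty inputs, keeps the file aligned with 'test.txt'.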

**Dataset**

Some suitable parallel datasets (see the Datasets and Resources file):
* Europarl Parallel corpus - https://www.statmt.org/europarl/v7/de-en.tgz
* News Commentary - https://www.statmt.org/wmt14/training-parallel-nc-v9.tgz (use DE-EN parallel data)
* Common Crawl corpus - https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz (use DE-EN parallel data)

You can also use other datasets of your choice. In the datasets above, the **'.en'** file contains the English text and the **'.de'** file contains the corresponding German translations, aligned line by line.
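Because the two files are matched by line number, it helps to verify that alignment when loading. The sketch below assumes UTF-8 files with one sentence per line; `read_parallel` and the toy file names are illustrative, not part of any dataset tooling.

```python
import tempfile
from pathlib import Path

def read_parallel(en_path, de_path):
    """Return (english, german) pairs from two line-aligned files."""
    with open(en_path, encoding="utf-8") as en_file, open(de_path, encoding="utf-8") as de_file:
        en_lines = [line.strip() for line in en_file]
        de_lines = [line.strip() for line in de_file]
    if len(en_lines) != len(de_lines):
        raise ValueError(f"line count mismatch: {len(en_lines)} vs {len(de_lines)}")
    # Skip pairs where either side is empty; they carry no training signal.
    return [(en, de) for en, de in zip(en_lines, de_lines) if en and de]

# Tiny demonstration with throwaway files (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp:
    en_path = Path(tmp) / "toy.en"
    de_path = Path(tmp) / "toy.de"
    en_path.write_text("Good morning.\n\nThank you.\n", encoding="utf-8")
    de_path.write_text("Guten Morgen.\n\nDanke.\n", encoding="utf-8")
    pairs = read_parallel(en_path, de_path)
```

A line-count mismatch usually means the two files belong to different language pairs or one of them was truncated.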

## Notes:

1) If the training dataset is large, consider using a small subset of it.
2) Training can run out of memory, so choose hyperparameters such as batch size, sequence length, and model size carefully.
3) Training is much faster on a GPU; on a CPU it may take several hours or even days. You can also use Google Colab GPUs for training: https://colab.research.google.com/
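Notes 1 and 2 can be combined in one preprocessing step: drop very long sentence pairs (they dominate memory use), sample a subset, and hold out a validation split. The sketch below is one way to do this under stated assumptions; the function name and every default (`max_pairs`, `max_tokens`, `val_fraction`, `seed`) are illustrative values, not recommendations from the assignment.

```python
import random

def subset_and_split(pairs, max_pairs=100_000, max_tokens=50, val_fraction=0.01, seed=0):
    """Filter long pairs, sample a subset, and split it into train and validation lists."""
    # Whitespace token counts are a cheap proxy for the real tokenized length.
    kept = [(en, de) for en, de in pairs
            if len(en.split()) <= max_tokens and len(de.split()) <= max_tokens]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    kept = kept[:max_pairs]
    n_val = max(1, int(len(kept) * val_fraction))
    return kept[n_val:], kept[:n_val]

toy_pairs = [(f"sentence {i}", f"satz {i}") for i in range(1000)]
toy_pairs.append(("word " * 80, "wort " * 80))  # too long, will be filtered out
train_pairs, val_pairs = subset_and_split(toy_pairs, max_pairs=200, val_fraction=0.25)
```

Shuffling before the split avoids a validation set drawn from a single contiguous stretch of the corpus.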

In [1]:
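import torch

# Select the GPU when one is available; otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.cuda.device_count())
if torch.cuda.is_available():
    # get_device_name raises an error on machines without CUDA, so guard the call.
    print(torch.cuda.get_device_name(0))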

1
NVIDIA GeForce RTX 4090 Laptop GPU


In [5]:
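import re

def clean_text(text):
    """Lowercase a sentence, drop HTML tags and unsupported characters, and collapse whitespace."""
    text = str(text).lower()
    # Remove HTML/XML tags such as <p> or <br/>.
    text = re.sub(r"<[^>]+>", "", text)
    # Replace every character outside the whitelist with a space.
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,;!?':()\[\]{}-]", " ", text)
    # Collapse whitespace runs (including the trailing newline) and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text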

In [10]:
import pandas as pd

# Both files must come from the same language pair (de-en) so that line i of one
# file is the translation of line i of the other.
with open('./train/news-commentary-v9.de-en.en', 'r', encoding="utf-8") as en_file, \
     open('./train/news-commentary-v9.de-en.de', 'r', encoding="utf-8") as de_file:
    en = pd.Series(en_file.readlines(), name='en').apply(clean_text)
    de = pd.Series(de_file.readlines(), name='de').apply(clean_text)

translation_df = pd.concat([en, de], axis=1)
translation_df.head()

| | en | de |
|---|---|---|
| 0 | 10,000 gold? | steigt gold auf 10.000 dollar? |
| 1 | san francisco it has never been easy to have a... | san francisco es war noch nie leicht, ein rati... |
| 2 | lately, with gold prices up more than 300 over... | in letzter zeit allerdings ist dies schwierige... |
| 3 | just last december, fellow economists martin f... | erst letzten dezember verfassten meine kollege... |
| 4 | wouldn t you know it? | und es kam, wie es kommen musste. |
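The preview shows two side effects of the character whitelist in `clean_text`: "wouldn t" and "300 over". These appear to come from a typographic apostrophe and a percent sign being replaced by spaces (an inference from the output, not checked against the raw file). A possible variant, sketched below, maps typographic apostrophes to ASCII first and keeps `%` and `$`; the function name and the mapping table are illustrative and deliberately incomplete.

```python
import re

# Typographic apostrophes mapped to the ASCII apostrophe (illustrative, incomplete).
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'"})

def clean_text_keep_symbols(text):
    """Like clean_text, but keeps apostrophes, percent signs, and dollar signs."""
    text = str(text).translate(APOSTROPHE_MAP).lower()
    # Remove HTML/XML tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Same whitelist as clean_text, extended with % and $.
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,;!?':()\[\]{}%$-]", " ", text)
    # Collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

contraction = clean_text_keep_symbols("Wouldn\u2019t you know it?")
percent = clean_text_keep_symbols("up more than 300% over the last decade")
price = clean_text_keep_symbols("$10,000 Gold?")
```

Whether to keep such symbols is a modeling choice: a smaller character set gives a smaller vocabulary, but the model can then never produce those symbols in its output.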


In [2]:
!pip install transformers






In [16]:
from transformers import AutoTokenizer

PRE_TRAINED_MODEL_NAME = "distilbert/distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
print(f"Vocabulary Size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.all_special_tokens}")
print(list(tokenizer.get_vocab().items())[11950:11957])

Vocabulary Size: 119547
Special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
[('##чают', 42131), ('Gogledd', 108351), ('##folgenden', 65593), ('##анд', 27780), ('##ონ', 44171), ('##ldeko', 61663), ('hoc', 29317)]


In [17]:
tokenizer(['has never', 'war noch']).input_ids
# Each sequence has 4 ids: the tokenizer adds [CLS] (101) at the start and [SEP] (102) at the end.

[[101, 10393, 14794, 102], [101, 10338, 11230, 102]]

In [None]:
# Get the embedding layer from the pre-trained model.
# NOTE: the assignment does not allow pretrained weights, and nothing below uses
# embedding_layer; this cell was left unexecuted.

from transformers import AutoModelForMaskedLM

# The same model name as for the tokenizer is used here.
pre_trained_model = AutoModelForMaskedLM.from_pretrained(PRE_TRAINED_MODEL_NAME)  # downloads the model

# Fetch the embedding layer from the pre-trained model.
embedding_layer = pre_trained_model.get_input_embeddings()

# Freeze the embedding layer so PyTorch does not update its weights during training.
# This keeps its learned representations intact while the rest of the model starts to learn.
embedding_layer = embedding_layer.requires_grad_(False)

print('Vocabulary Size :', tokenizer.vocab_size)
print('Embedding Layer :', embedding_layer)

In [18]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn import TransformerEncoder, TransformerEncoderLayer
import math
from torch.nn.utils.rnn import pad_sequence
import spacy

# Load spacy tokenizers for English and German
spacy_eng = spacy.load("en_core_web_sm")
spacy_ger = spacy.load("de_core_news_sm")

# Function to tokenize English text
def tokenize_eng(text):
    return [tok.text.lower() for tok in spacy_eng.tokenizer(text)]

# Function to tokenize German text
def tokenize_ger(text):
    return [tok.text.lower() for tok in spacy_ger.tokenizer(text)]

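# Custom dataset class
class TranslationDataset(Dataset):
    def __init__(self, src_sentences, trg_sentences, src_vocab, trg_vocab):
        self.src_sentences = src_sentences
        self.trg_sentences = trg_sentences
        self.src_vocab = src_vocab
        self.trg_vocab = trg_vocab

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, idx):
        # Map out-of-vocabulary words to <unk> instead of raising a KeyError.
        src_unk = self.src_vocab["<unk>"]
        trg_unk = self.trg_vocab["<unk>"]
        src_ids = [self.src_vocab.get(word, src_unk)
                   for word in tokenize_eng(self.src_sentences[idx].strip())]
        trg_ids = [self.trg_vocab.get(word, trg_unk)
                   for word in tokenize_ger(self.trg_sentences[idx].strip())]
        # Mark the sequence boundaries: the decoder starts from <sos> and learns to emit <eos>.
        src_ids = src_ids + [self.src_vocab["<eos>"]]
        trg_ids = [self.trg_vocab["<sos>"]] + trg_ids + [self.trg_vocab["<eos>"]]
        return torch.tensor(src_ids, dtype=torch.long), torch.tensor(trg_ids, dtype=torch.long)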

# Function to create vocabulary
def build_vocab(sentences, tokenizer):
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    idx = 4
    for sentence in sentences:
        for word in tokenizer(sentence):
            if word not in vocab:
                vocab[word] = idx
                idx += 1
    return vocab

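# Read the files (both sides must come from the same de-en language pair)
with open('./train/news-commentary-v9.de-en.en', 'r', encoding="utf-8") as f:
    english_sentences = f.readlines()

with open('./train/news-commentary-v9.de-en.de', 'r', encoding="utf-8") as f:
    german_sentences = f.readlines()

# Line i of one file must be the translation of line i of the other.
assert len(english_sentences) == len(german_sentences), "parallel files differ in length"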

# Build vocabularies
src_vocab = build_vocab(english_sentences, tokenize_eng)
trg_vocab = build_vocab(german_sentences, tokenize_ger)

# Custom collate function
def collate_fn(batch):
    src_batch, trg_batch = zip(*batch)
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=src_vocab["<pad>"])
    trg_batch = pad_sequence(trg_batch, batch_first=True, padding_value=trg_vocab["<pad>"])
    return src_batch, trg_batch

# Create dataset and dataloader
dataset = TranslationDataset(english_sentences, german_sentences, src_vocab, trg_vocab)
dataloader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)

class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

# Define the encoder-decoder Transformer model
class TransformerModel(nn.Module):

    def __init__(self, src_ntoken: int, trg_ntoken: int, d_model: int, nhead: int,
                 d_hid: int, nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.d_model = d_model
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        # Separate embeddings because source and target use different vocabularies.
        self.src_embedding = nn.Embedding(src_ntoken, d_model)
        self.trg_embedding = nn.Embedding(trg_ntoken, d_model)
        # nn.Transformer defaults to the sequence-first layout used by PositionalEncoding.
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=nlayers,
                                          num_decoder_layers=nlayers,
                                          dim_feedforward=d_hid, dropout=dropout)
        # Project decoder states onto the target vocabulary.
        self.linear = nn.Linear(d_model, trg_ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.src_embedding.weight.data.uniform_(-initrange, initrange)
        self.trg_embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: torch.Tensor, trg: torch.Tensor,
                src_key_padding_mask: torch.Tensor = None,
                trg_key_padding_mask: torch.Tensor = None) -> torch.Tensor:
        """
        Arguments:
            src: Tensor, shape ``[src_len, batch_size]``
            trg: Tensor, shape ``[trg_len, batch_size]``
            src_key_padding_mask: bool Tensor, shape ``[batch_size, src_len]``, True at padding
            trg_key_padding_mask: bool Tensor, shape ``[batch_size, trg_len]``, True at padding

        Returns:
            output Tensor of shape ``[trg_len, batch_size, trg_ntoken]``
        """
        src_emb = self.pos_encoder(self.src_embedding(src) * math.sqrt(self.d_model))
        trg_emb = self.pos_encoder(self.trg_embedding(trg) * math.sqrt(self.d_model))

        # Causal mask: target position i may only attend to positions <= i.
        trg_mask = self._generate_square_subsequent_mask(trg.size(0)).to(trg.device)

        output = self.transformer(src_emb, trg_emb, tgt_mask=trg_mask,
                                  src_key_padding_mask=src_key_padding_mask,
                                  tgt_key_padding_mask=trg_key_padding_mask,
                                  memory_key_padding_mask=src_key_padding_mask)
        return self.linear(output)

    def _generate_square_subsequent_mask(self, sz: int) -> torch.Tensor:
        # Boolean mask; True marks future positions that must not be attended to.
        return torch.triu(torch.ones(sz, sz, dtype=torch.bool), diagonal=1)

# Hyperparameters
SRC_VOCAB_SIZE = len(src_vocab)
TRG_VOCAB_SIZE = len(trg_vocab)
SRC_PAD_IDX = src_vocab["<pad>"]
TRG_PAD_IDX = trg_vocab["<pad>"]

# Initialize the model, loss function, and optimizer
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network in each Transformer layer
nlayers = 2  # number of encoder layers and of decoder layers
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
model = TransformerModel(SRC_VOCAB_SIZE, TRG_VOCAB_SIZE, emsize, nhead, d_hid, nlayers, dropout).to(device)

# Padding positions in the target do not contribute to the loss.
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
optimizer = optim.Adam(model.parameters(), lr=0.0005)

In [8]:
# Training loop
for epoch in range(10):
    model.train()
    epoch_loss = 0
    print(f"Training Epoch {epoch + 1}")
    for src, trg in dataloader:
        src = src.to(device)  # [batch_size, src_len]
        trg = trg.to(device)  # [batch_size, trg_len]

        # Teacher forcing: the decoder sees tokens up to position t and predicts token t + 1.
        trg_input = trg[:, :-1]
        trg_output = trg[:, 1:]

        # Key padding masks must be boolean with shape [batch_size, seq_len].
        src_pad_mask = src == SRC_PAD_IDX
        trg_pad_mask = trg_input == TRG_PAD_IDX

        optimizer.zero_grad()
        # The model expects sequence-first tensors, so transpose the token batches.
        output = model(src.transpose(0, 1), trg_input.transpose(0, 1),
                       src_pad_mask, trg_pad_mask)

        # Flatten to [trg_len * batch_size, vocab] and [trg_len * batch_size] for the loss.
        output_dim = output.shape[-1]
        output = output.reshape(-1, output_dim)
        trg_output = trg_output.transpose(0, 1).reshape(-1)

        loss = criterion(output, trg_output)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    print(f'Epoch {epoch + 1} Loss {epoch_loss / len(dataloader):.4f}')
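The training loop feeds `trg[:, :-1]` to the model and scores the result against `trg[:, 1:]`. The plain-Python sketch below illustrates that shift on a padded batch, assuming each target sequence is wrapped in `<sos>` and `<eos>` and using the special-token ids from `build_vocab` (0, 1, 2). The word ids 5, 7, 8, 9 and the helper names are made up for the illustration.

```python
PAD, SOS, EOS = 0, 1, 2  # ids assigned to <pad>, <sos>, <eos> by build_vocab

def pad_batch(sequences, pad_id=PAD):
    """Right-pad id lists to a common length, like pad_sequence(batch_first=True)."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]

def shift_for_teacher_forcing(batch):
    """Split a padded target batch into decoder inputs and prediction targets."""
    decoder_input = [row[:-1] for row in batch]    # drop the last position
    prediction_target = [row[1:] for row in batch]  # drop the first position (<sos>)
    return decoder_input, prediction_target

targets = pad_batch([[SOS, 7, 8, 9, EOS], [SOS, 5, EOS]])
decoder_input, prediction_target = shift_for_teacher_forcing(targets)
```

In the shorter row, the positions whose prediction target is `PAD` are exactly the ones that `CrossEntropyLoss(ignore_index=TRG_PAD_IDX)` leaves out of the loss.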