## Assigment 3: Transformers for translation 🙊


Have you ever wondered how applications like Google Translate or language translation features in social media platforms work? Behind these impressive technologies are sophisticated machine learning models that can understand and translate text between different languages. One of the most powerful and groundbreaking models used for this purpose is the Transformer model.

In this assignment, you will step into the shoes of an AI researcher and engineer to create your own Transformer model for translating text from English to French. This journey will not only enhance your understanding of machine learning and deep learning but also give you hands-on experience with state-of-the-art techniques in natural language processing.

Let's start by downloading important libraries

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

For this assignment we are using the IWSLT2017 dataset (read more about it [here](https://huggingface.co/datasets/IWSLT/iwslt2017) ). This dataset easily found in Huggingface fits perfectly for our machine translation task.

In [None]:
from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

Just to have an idea let's have a quick peak at what our dataset looks like.

In [None]:
dataset['train']['translation'][0]

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}

Since we don't want to take 8 hours training, let's trim our dataset a bit (although this might lead to underperformance, feel free to use the complete dataset if you have the computing power).

SUGESTION: start with a small dataset to debug your code and increase it gradually (the same principle applies for the number of epochs, batch size, test set size...).

In [None]:
trim_dataset= dataset['train']['translation'][:100000]

### Preprocessing


Same as our previous assignments preprocessing is an essential part of any NLP task.

In [None]:
import string
def preprocess_data(text):
  """ Method to clean text from noise and standarize text across the different classes.
      The preprocessing includes converting to joining all datapoints, lowercase, removing punctuation, and removing stopwords.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() #make everything lower case
  text = text.replace("\n", " ") #remove \n characters
  text=  text.translate(str.maketrans("", "", string.punctuation)) #remove any punctuation or special characters
  text = ''.join(filter(lambda x: not x.isdigit(), text)) #remove all numbers

  return text


For an easier training structure, it is useful to format our training and validation sets. The following function should help with this.

In [None]:
def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset from a list of text.
  Arguments
  ---------
  text : List of String
     Text from dataset
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text (assuming a preprocess_data function is available)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')

### Model Creation


Now that our data is ready, we can get started. Let's start by creating our Sequence to Sequence Transformer model.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Calculate sine for even positions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Calculate cosine for odd positions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target throught embedding layer
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)#get positional encoding and move it to device

        #get src_emb and tgt_emb by adding positional encoder
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks throught transformer
        output = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_key_padding_mask, tgt_key_padding_mask,src_key_padding_mask)

        #pass output throught linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src throught embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory,tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt throught embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)


Now that our model is ready, we still need some methods that will come in handy during training.

In [None]:
def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence to create padding mask for
  Returns
  -------
  mask : Tensor
      Padding mask
  """
  return (seq == 0).float() #float matrix that is 1 when datapoint is equal to 0

def create_triu_mask(sz):
  """ Method to create a triangular mask based on given sequence. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  seq : Tensor
     Sequence to create triangular mask for
  Returns
  -------
  mask : Tensor
      Triangular mask
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
  Returns
  -------
  tokenized_source : Tensor
      Tokenized source text
  """

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


### Training


In [None]:
from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('google-bert/bert-base-multilingual-uncased')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beggining of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.72M [00:00<?, ?B/s]

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        #TODO
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device) #creat src_mask this is basically a matrix of 0s of shape Sequence x Sequence (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_loader))


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(list(val_dataloader))

Now we can start training! Keep in mind this code is very demanding computationally, it has been set to 10 epochs (which can take up to 6-8 hours) but feel free to change this value depending on your resources, in this case the more epochs you can execute the better 😀

In [None]:
def train(model, epochs, train_loader,validation_loader ):
  for epoch in range(1, epochs+1):
        train_loss = train_epoch(model,train_loader, tokenizer)
        val_loss = evaluate(model,validation_loader)
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}"))
        torch.save(model.state_dict(), "/content/drive/MyDrive/my_model.pth")

train(model, 5, train_loader,validation_loader)

100%|██████████| 12500/12500 [41:40<00:00,  5.00it/s]
100%|██████████| 112/112 [00:06<00:00, 16.85it/s]


Epoch: 1, Train loss: 5.217, Val loss: 4.791


100%|██████████| 12500/12500 [41:44<00:00,  4.99it/s]
100%|██████████| 112/112 [00:06<00:00, 17.72it/s]


Epoch: 2, Train loss: 4.208, Val loss: 4.351


 78%|███████▊  | 9731/12500 [32:22<09:10,  5.03it/s]

The GPU runtime got disconnected at this point, so we proceed by loading the saved model with whatever epochs was completed.

In [None]:
model_path = '/content/drive/MyDrive/my_model.pth'
model.load_state_dict(torch.load(model_path))

  model.load_state_dict(torch.load(model_path))


<All keys matched successfully>

### Testing


In this assignment, we will use three different evaluation metrics to see our model's test performance: [Bert Score](https://huggingface.co/spaces/evaluate-metric/bertscore), [Meteor](https://huggingface.co/spaces/evaluate-metric/meteor) and [Rouge](https://huggingface.co/spaces/evaluate-metric/rouge). Please access their hugging face documentation to know how to implement them.

In [None]:
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Implement greedy decode as seen in class in the NLG slides.

In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder
    memory = model.encode(src, src_mask)  # This line was missing

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        memory = memory.to(device)

        # Create the triangular mask for the target sequence
        tgt_mask = create_triu_mask(ys.size(1)).to(device)  # This line was missing

        # Pass the generated sequence through the decoder
        out = model.decode(ys, memory, tgt_mask)  # This line was missing

        # Get the probabilities of the next token
        prob = model.fc(out[:, -1])

        # Get the next word with the highest probability
        _, next_word = torch.max(prob, dim=1)  # Make sure to use `.item()` for scalar value

        # Append the next word to the sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word.item())], dim=1)  # Use `.item()`

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)


In [None]:
print(translate(model, "Hello how are you today",tokenizer))

bon comment sont aujourdhui aujourdhuissss


In [None]:
# Define the path where you want to save the model
model_save_path = '/content/drive/MyDrive/my_model.pth'

# Save the model state_dict (recommended)
torch.save(model.state_dict(), model_save_path)

# Or, save the entire model (less preferred, as it's more prone to breaking with changes in the code)
torch.save(model, '/content/drive/MyDrive/my_model_full.pth')

print(f"Model saved to {model_save_path}")

In [None]:
len(test_set)


8597

In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
test(test_set, model, tokenizer, device)


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]



For the above code, we had run it with the complete test_set, but after running for a long time, runtime got disconnected and we did not get any output.

The below code was used as GPU was not getting reconnected due to collab limitations. Hence, we reduced the test set size and tried running it to get results with CPU

In [None]:
model_path = '/content/drive/MyDrive/my_model.pth'
# Load the model weights onto CPU using map_location
state_dict = torch.load(model_path, map_location=torch.device('cpu'))
model.load_state_dict(state_dict)

  state_dict = torch.load(model_path, map_location=torch.device('cpu'))


<All keys matched successfully>

After this, we once again reran the entire testing portion using CPU with the test function results below.

In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
test(test_set[:10], model, tokenizer, device)


(0.0, 0.0, 0.0, 0.0)

Next, we tried increasing the test set size to 100. But again the runtime got disconnected before any output is given with just the warning.

In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
test(test_set[:100], model, tokenizer, device)


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]



## Let's experiment!

1. Play with a hyperparameter of your choice to measure its effect on the translation.

2. Compare the results of your model with the performance of using the T5 pretrained model. This [tutorial](https://huggingface.co/docs/transformers/en/tasks/translation) on using T5 for machine translation might come in handy.

As Alejadra (POD) later suggested to change the tokenizer, the experiments were performed with the suggested changes

### Learning rate = 0.0001, tokenizer = FacebookAI/xlm-roberta-base, training set = 10000, epochs = 2, test set = 10

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

In [None]:
dataset['train']['translation'][0]

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}

In [None]:
trim_dataset= dataset['train']['translation'][:10000]

In [None]:
import string
def preprocess_data(text):
  """ Method to clean text from noise and standarize text across the different classes.
      The preprocessing includes converting to joining all datapoints, lowercase, removing punctuation, and removing stopwords.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() #make everything lower case
  text = text.replace("\n", " ") #remove \n characters
  text=  text.translate(str.maketrans("", "", string.punctuation)) #remove any punctuation or special characters
  text = ''.join(filter(lambda x: not x.isdigit(), text)) #remove all numbers

  return text


In [None]:
def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset from a list of text.
  Arguments
  ---------
  text : List of String
     Text from dataset
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text (assuming a preprocess_data function is available)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Calculate sine for even positions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Calculate cosine for odd positions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target throught embedding layer
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)#get positional encoding and move it to device

        #get src_emb and tgt_emb by adding positional encoder
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks throught transformer
        output = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_key_padding_mask, tgt_key_padding_mask,src_key_padding_mask)

        #pass output throught linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src throught embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory,tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt throught embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)


In [None]:
def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence to create padding mask for
  Returns
  -------
  mask : Tensor
      Padding mask
  """
  return (seq == 0).float() #float matrix that is 1 when datapoint is equal to 0

def create_triu_mask(sz):
  """ Method to create a triangular mask based on given sequence. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  seq : Tensor
     Sequence to create triangular mask for
  Returns
  -------
  mask : Tensor
      Triangular mask
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
  Returns
  -------
  tokenized_source : Tensor
      Tokenized source text
  """

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


In [None]:
from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beggining of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        #TODO
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device) #creat src_mask this is basically a matrix of 0s of shape Sequence x Sequence (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_loader))


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(list(val_dataloader))

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
def train(model, epochs, train_loader,validation_loader ):
  for epoch in range(1, epochs+1):
        train_loss = train_epoch(model,train_loader, tokenizer)
        val_loss = evaluate(model,validation_loader)
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}"))
        torch.save(model.state_dict(), "/content/drive/MyDrive/my_model.pth")

train(model, 2, train_loader,validation_loader)

100%|██████████| 1250/1250 [08:10<00:00,  2.55it/s]
100%|██████████| 112/112 [00:12<00:00,  9.33it/s]


Epoch: 1, Train loss: 6.681, Val loss: 6.119


100%|██████████| 1250/1250 [08:21<00:00,  2.49it/s]
100%|██████████| 112/112 [00:11<00:00,  9.42it/s]


Epoch: 2, Train loss: 5.568, Val loss: 5.760


In [None]:
model_path = '/content/drive/MyDrive/my_model.pth'
model.load_state_dict(torch.load(model_path))

In [None]:
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder
    memory = model.encode(src, src_mask)

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        memory = memory.to(device)

        # Create the triangular mask for the target sequence
        tgt_mask = create_triu_mask(ys.size(1)).to(device)  # This line was missing

        # Pass the generated sequence through the decoder
        out = model.decode(ys, memory, tgt_mask)  # This line was missing

        # Get the probabilities of the next token
        prob = model.fc(out[:, -1])

        # Get the next word with the highest probability
        _, next_word = torch.max(prob, dim=1)  # Make sure to use `.item()` for scalar value

        # Append the next word to the sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word.item())], dim=1)  # Use `.item()`

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)


In [None]:
print(translate(model, "Hello how are you today",tokenizer))

vous savez vous savez vous savez


In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
#test(test_set, model, tokenizer, device)

# Define a size for the subset you want (e.g., 1000 samples)
subset_size = 10

# Trim the test set by slicing it
trimmed_test_set = test_set[:subset_size]

# Example call to the test function
test(trimmed_test_set, model, tokenizer, device)


(0.0, 0.0, 0.0, 0.0)

### Learning rate = 0.001, tokenizer = FacebookAI/xlm-roberta-base, training set size = 10000, epochs = 1, test set size = 10

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score

from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

dataset['train']['translation'][0]

trim_dataset= dataset['train']['translation'][:10000]

import string
def preprocess_data(text):
  """ Method to clean text from noise and standarize text across the different classes.
      The preprocessing includes converting to joining all datapoints, lowercase, removing punctuation, and removing stopwords.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() #make everything lower case
  text = text.replace("\n", " ") #remove \n characters
  text=  text.translate(str.maketrans("", "", string.punctuation)) #remove any punctuation or special characters
  text = ''.join(filter(lambda x: not x.isdigit(), text)) #remove all numbers

  return text

def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset from a list of text.
  Arguments
  ---------
  text : List of String
     Text from dataset
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text (assuming a preprocess_data function is available)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Calculate sine for even positions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Calculate cosine for odd positions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target throught embedding layer
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)#get positional encoding and move it to device

        #get src_emb and tgt_emb by adding positional encoder
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks throught transformer
        output = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_key_padding_mask, tgt_key_padding_mask,src_key_padding_mask)

        #pass output throught linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src throught embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory,tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt throught embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)

def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence to create padding mask for
  Returns
  -------
  mask : Tensor
      Padding mask
  """
  return (seq == 0).float() #float matrix that is 1 when datapoint is equal to 0

def create_triu_mask(sz):
  """ Method to create a triangular mask based on given sequence. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  seq : Tensor
     Sequence to create triangular mask for
  Returns
  -------
  mask : Tensor
      Triangular mask
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
  Returns
  -------
  tokenized_source : Tensor
      Tokenized source text
  """

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')


from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beggining of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        #TODO
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device) #creat src_mask this is basically a matrix of 0s of shape Sequence x Sequence (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_loader))


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(list(val_dataloader))

In [None]:
def train(model, epochs, train_loader,validation_loader ):
  for epoch in range(1, epochs+1):
        train_loss = train_epoch(model,train_loader, tokenizer)
        val_loss = evaluate(model,validation_loader)
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}"))
        torch.save(model.state_dict(), "/content/drive/MyDrive/fb_lr001_50k.pth")

train(model, 10, train_loader,validation_loader)

100%|██████████| 6250/6250 [40:28<00:00,  2.57it/s]
100%|██████████| 112/112 [00:11<00:00,  9.60it/s]


Epoch: 1, Train loss: 5.675, Val loss: 5.220


RuntimeError: Parent directory /content/drive/MyDrive does not exist.

Google drive was not mounted, hence the above error occurred. However, the training had happened properly. Let's mount the drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Saving the model

In [None]:
torch.save(model.state_dict(), "/content/drive/MyDrive/fb_lr001_50k.pth")

In [None]:
model_path = '/content/drive/MyDrive/fb_lr001_50k.pth'
model.load_state_dict(torch.load(model_path))

  model.load_state_dict(torch.load(model_path))


<All keys matched successfully>

Testing




In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder
    memory = model.encode(src, src_mask)

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        memory = memory.to(device)

        # Create the triangular mask for the target sequence
        tgt_mask = create_triu_mask(ys.size(1)).to(device)  # This line was missing

        # Pass the generated sequence through the decoder
        out = model.decode(ys, memory, tgt_mask)  # This line was missing

        # Get the probabilities of the next token
        prob = model.fc(out[:, -1])

        # Get the next word with the highest probability
        _, next_word = torch.max(prob, dim=1)  # Make sure to use `.item()` for scalar value

        # Append the next word to the sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word.item())], dim=1)  # Use `.item()`

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)


In [None]:
print(translate(model, "Hello how are you today",tokenizer))

et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et et


In [None]:

def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
test(test_set[:10], model, tokenizer, device)


(0.0, 0.0, 0.0, 0.0)

As the translated version does not make sense, it appears that learning rate of 0.0001 is the best suited for this experiment. Let's tune it further by increasing the training size, and test it on a bigger subset too.

### Learning rate = 0.0001, tokenizer=FacebookAI/xlm-roberta-base, training set size = 50000, epochs = ~2, test set size = 100

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score

from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

dataset['train']['translation'][0]

trim_dataset= dataset['train']['translation'][:50000]

import string
def preprocess_data(text):
  """ Method to clean text from noise and standarize text across the different classes.
      The preprocessing includes converting to joining all datapoints, lowercase, removing punctuation, and removing stopwords.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() #make everything lower case
  text = text.replace("\n", " ") #remove \n characters
  text=  text.translate(str.maketrans("", "", string.punctuation)) #remove any punctuation or special characters
  text = ''.join(filter(lambda x: not x.isdigit(), text)) #remove all numbers

  return text

def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset from a list of text.
  Arguments
  ---------
  text : List of String
     Text from dataset
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text (assuming a preprocess_data function is available)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Calculate sine for even positions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Calculate cosine for odd positions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target throught embedding layer
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)#get positional encoding and move it to device

        #get src_emb and tgt_emb by adding positional encoder
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks throught transformer
        output = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_key_padding_mask, tgt_key_padding_mask,src_key_padding_mask)

        #pass output throught linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src throught embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory,tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt throught embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)

def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence to create padding mask for
  Returns
  -------
  mask : Tensor
      Padding mask
  """
  return (seq == 0).float() #float matrix that is 1 when datapoint is equal to 0

def create_triu_mask(sz):
  """ Method to create a triangular mask based on given sequence. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  seq : Tensor
     Sequence to create triangular mask for
  Returns
  -------
  mask : Tensor
      Triangular mask
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
  Returns
  -------
  tokenized_source : Tensor
      Tokenized source text
  """

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')


from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beggining of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)

from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        #TODO
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device) #creat src_mask this is basically a matrix of 0s of shape Sequence x Sequence (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_loader))


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(list(val_dataloader))

def train(model, epochs, train_loader,validation_loader ):
  for epoch in range(1, epochs+1):
        train_loss = train_epoch(model,train_loader, tokenizer)
        val_loss = evaluate(model,validation_loader)
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}"))
        torch.save(model.state_dict(), "/content/drive/MyDrive/fb_lr0001_50k.pth")

train(model, 10, train_loader,validation_loader)



Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m13.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

100%|██████████| 6250/6250 [41:17<00:00,  2.52it/s]
100%|██████████| 112/112 [00:11<00:00,  9.36it/s]


Epoch: 1, Train loss: 5.640, Val loss: 5.135


 81%|████████  | 5050/6250 [33:37<08:01,  2.49it/s]

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score

from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

dataset['train']['translation'][0]

trim_dataset= dataset['train']['translation'][:50000]

import string
def preprocess_data(text):
  """ Method to clean text from noise and standarize text across the different classes.
      The preprocessing includes converting to joining all datapoints, lowercase, removing punctuation, and removing stopwords.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() #make everything lower case
  text = text.replace("\n", " ") #remove \n characters
  text=  text.translate(str.maketrans("", "", string.punctuation)) #remove any punctuation or special characters
  text = ''.join(filter(lambda x: not x.isdigit(), text)) #remove all numbers

  return text

def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset from a list of text.
  Arguments
  ---------
  text : List of String
     Text from dataset
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text (assuming a preprocess_data function is available)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Calculate sine for even positions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Calculate cosine for odd positions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target throught embedding layer
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)#get positional encoding and move it to device

        #get src_emb and tgt_emb by adding positional encoder
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks throught transformer
        output = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_key_padding_mask, tgt_key_padding_mask,src_key_padding_mask)

        #pass output throught linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src throught embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory,tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt throught embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)

def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence to create padding mask for
  Returns
  -------
  mask : Tensor
      Padding mask
  """
  return (seq == 0).float() #float matrix that is 1 when datapoint is equal to 0

def create_triu_mask(sz):
  """ Method to create a triangular mask based on given sequence. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  seq : Tensor
     Sequence to create triangular mask for
  Returns
  -------
  mask : Tensor
      Triangular mask
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
  Returns
  -------
  tokenized_source : Tensor
      Tokenized source text
  """

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')


from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beggining of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)

from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        #TODO
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device) #creat src_mask this is basically a matrix of 0s of shape Sequence x Sequence (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_loader))


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(list(val_dataloader))

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [None]:
import torch
model_path = '/content/drive/MyDrive/fb_lr0001_50k.pth'
# Load the model weights onto CPU using map_location
state_dict = torch.load(model_path, map_location=torch.device('cpu'))
model.load_state_dict(state_dict)

  state_dict = torch.load(model_path, map_location=torch.device('cpu'))


<All keys matched successfully>

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder
    memory = model.encode(src, src_mask)  # This line was missing

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        memory = memory.to(device)

        # Create the triangular mask for the target sequence
        tgt_mask = create_triu_mask(ys.size(1)).to(device)  # This line was missing

        # Pass the generated sequence through the decoder
        out = model.decode(ys, memory, tgt_mask)  # This line was missing

        # Get the probabilities of the next token
        prob = model.fc(out[:, -1])

        # Get the next word with the highest probability
        _, next_word = torch.max(prob, dim=1)  # Make sure to use `.item()` for scalar value

        # Append the next word to the sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word.item())], dim=1)  # Use `.item()`

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)

print(translate(model, "Hello how are you today",tokenizer))


import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
#test(test_set, model, tokenizer, device)

# Define a size for the subset you want (e.g., 100 samples)
subset_size = 100

# Trim the test set by slicing it
trimmed_test_set = test_set[:subset_size]

# Example call to the test function
test(trimmed_test_set, model, tokenizer, device)



Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


comment


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]



(0.008886380815131833, 0.008965440135017502, 0.00892481924960149, 0.0)

### Learning rate = 0.0001, tokenizer = FacebookAI/xlm-roberta-base, training set size = 50000, epochs = 7, test set size = 100

Previously trained model had run for 2 epochs, so now when we reload that model and train again for 5 epochs, we are effectively training for 7 epochs

In [2]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score

from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

dataset['train']['translation'][0]

trim_dataset= dataset['train']['translation'][:50000]

import string
def preprocess_data(text):
  """ Method to clean text from noise and standarize text across the different classes.
      The preprocessing includes converting to joining all datapoints, lowercase, removing punctuation, and removing stopwords.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() #make everything lower case
  text = text.replace("\n", " ") #remove \n characters
  text=  text.translate(str.maketrans("", "", string.punctuation)) #remove any punctuation or special characters
  text = ''.join(filter(lambda x: not x.isdigit(), text)) #remove all numbers

  return text

def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset from a list of text.
  Arguments
  ---------
  text : List of String
     Text from dataset
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text (assuming a preprocess_data function is available)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Calculate sine for even positions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Calculate cosine for odd positions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        #pass source and target throught embedding layer
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)#get positional encoding and move it to device

        #get src_emb and tgt_emb by adding positional encoder
        src_emb = src + positional_encoding[:,:src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:,:tgt.shape[1], :]

        #pass src, tgt and all masks throught transformer
        output = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, src_key_padding_mask, tgt_key_padding_mask,src_key_padding_mask)

        #pass output throught linear layer
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src) #pass src throught embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device) #create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :] #get src_emb
        return self.transformer.encoder(src_emb, src_mask) #pass src_emb through transformer encoder (look pytorch documentation)


    def decode(self, tgt, memory,tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt) #pass tgt throught embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device) #create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :] #get tgt_emb
        return self.transformer.decoder(tgt_emb, memory, tgt_mask) #pass tgt_emb through transformer decoder (look pytorch documentation)

def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence to create padding mask for
  Returns
  -------
  mask : Tensor
      Padding mask
  """
  return (seq == 0).float() #float matrix that is 1 when datapoint is equal to 0

def create_triu_mask(sz):
  """ Method to create a triangular mask based on given sequence. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  seq : Tensor
     Sequence to create triangular mask for
  Returns
  -------
  mask : Tensor
      Triangular mask
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
  Returns
  -------
  tokenized_source : Tensor
      Tokenized source text
  """

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']


training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')


from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer=AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beggining of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(tokenizer.vocab_size, tokenizer.vocab_size,512, 8, 3, 3, 256,0.1).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)

from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        #TODO
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device) #creat src_mask this is basically a matrix of 0s of shape Sequence x Sequence (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_loader))


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(list(val_dataloader))


Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [3]:
model_path = '/content/drive/MyDrive/fb_lr0001_50k.pth'
model.load_state_dict(torch.load(model_path))

  model.load_state_dict(torch.load(model_path))


<All keys matched successfully>

In [4]:
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')

def train(model, epochs, train_loader,validation_loader ):
  for epoch in range(1, epochs+1):
        train_loss = train_epoch(model,train_loader, tokenizer)
        val_loss = evaluate(model,validation_loader)
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}"))
        torch.save(model.state_dict(), "/content/drive/MyDrive/fb_lr0001_50k_moreEpoch.pth")

train(model, 5, train_loader,validation_loader)


def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder
    memory = model.encode(src, src_mask)  # This line was missing

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        memory = memory.to(device)

        # Create the triangular mask for the target sequence
        tgt_mask = create_triu_mask(ys.size(1)).to(device)  # This line was missing

        # Pass the generated sequence through the decoder
        out = model.decode(ys, memory, tgt_mask)  # This line was missing

        # Get the probabilities of the next token
        prob = model.fc(out[:, -1])

        # Get the next word with the highest probability
        _, next_word = torch.max(prob, dim=1)  # Make sure to use `.item()` for scalar value

        # Append the next word to the sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word.item())], dim=1)  # Use `.item()`

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)

print(translate(model, "Hello how are you today",tokenizer))


import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
#test(test_set, model, tokenizer, device)

# Define a size for the subset you want (e.g., 100 samples)
subset_size = 100

# Trim the test set by slicing it
trimmed_test_set = test_set[:subset_size]

# Example call to the test function
test(trimmed_test_set, model, tokenizer, device)



Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.02k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
100%|██████████| 6250/6250 [41:16<00:00,  2.52it/s]
100%|██████████| 112/112 [00:11<00:00,  9.46it/s]


Epoch: 1, Train loss: 4.612, Val loss: 4.639


100%|██████████| 6250/6250 [41:23<00:00,  2.52it/s]
100%|██████████| 112/112 [00:11<00:00,  9.58it/s]


Epoch: 2, Train loss: 4.093, Val loss: 4.314


100%|██████████| 6250/6250 [41:18<00:00,  2.52it/s]
100%|██████████| 112/112 [00:11<00:00,  9.54it/s]


Epoch: 3, Train loss: 3.716, Val loss: 4.043


100%|██████████| 6250/6250 [41:19<00:00,  2.52it/s]
100%|██████████| 112/112 [00:11<00:00,  9.52it/s]


Epoch: 4, Train loss: 3.438, Val loss: 3.860


100%|██████████| 6250/6250 [41:18<00:00,  2.52it/s]
100%|██████████| 112/112 [00:11<00:00,  9.55it/s]


Epoch: 5, Train loss: 3.224, Val loss: 3.765
combien dentre vous aujourdhui


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]



(0.008893276407975773,
 0.009085463576941302,
 0.008986177471099288,
 0.00033545840922890105)