## Assignment 3: Transformers for translation 🙊


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Testing


## Let's experiment!

1. Play with a hyperparameter of your choice to measure its effect on the translation.

2. Compare the results of your model with the performance of using the T5 pretrained model. This [tutorial](https://huggingface.co/docs/transformers/en/tasks/translation) on using T5 for machine translation might come in handy.

Below we run three experiments.

For part 1, we vary the learning rate and keep everything else fixed:
1. Learning rate = 0.0001, tokenizer = FacebookAI/xlm-roberta-base, training set = 10000, epochs = 2, test set = 10
2. Learning rate = 0.01, tokenizer = FacebookAI/xlm-roberta-base, training set = 10000, epochs = 2, test set = 10

The other uploaded notebook contains the remaining hyperparameter tuning, along with the initial part of the assignment.

For part 2, we use the pretrained T5 model.

### Learning rate = 0.0001, tokenizer = FacebookAI/xlm-roberta-base, training set = 10000, epochs = 2, test set = 10

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score


In [None]:
from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

In [None]:
dataset['train']['translation'][0]

{'en': "Thank you so much, Chris. And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.",
 'fr': "Merci beaucoup, Chris. C'est vraiment un honneur de pouvoir venir sur cette scène une deuxième fois. Je suis très reconnaissant."}

In [None]:
trim_dataset= dataset['train']['translation'][:10000]

In [None]:
import string
def preprocess_data(text):
  """ Method to clean text from noise and standardize it.
      The preprocessing joins all datapoints into one string, converts it to lowercase, and removes newlines, punctuation, and digits.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() # join the datapoints and make everything lower case
  text = text.replace("\n", " ") # replace \n characters with spaces
  text = text.translate(str.maketrans("", "", string.punctuation)) # remove ASCII punctuation
  text = ''.join(filter(lambda x: not x.isdigit(), text)) # remove all digits

  return text


In [None]:
def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset of preprocessed (source, target) pairs.
  Arguments
  ---------
  dataset : List of Dict
     Translation examples, each mapping a language code to its text
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : List of Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text with preprocess_data (defined above)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Sine on even embedding dimensions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Cosine on odd embedding dimensions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        # Pass source and target through their embedding layers
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        # Build the positional encoding and move it to the inputs' device
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)

        # Add the positional encoding to the source and target embeddings
        src_emb = src + positional_encoding[:, :src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:, :tgt.shape[1], :]

        # Pass the embeddings and all masks through the transformer; the source
        # padding mask is reused as the memory padding mask
        output = self.transformer(
            src_emb,
            tgt_emb,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            memory_mask=None,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask,
        )

        # Project the decoder output onto the target vocabulary
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src)  # pass src through the embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)  # create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :]  # add it to the embeddings
        return self.transformer.encoder(src_emb, src_mask)  # run the encoder stack of nn.Transformer


    def decode(self, tgt, memory, tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt)  # pass tgt through the embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device)  # create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :]  # add it to the embeddings
        return self.transformer.decoder(tgt_emb, memory, tgt_mask)  # run the decoder stack of nn.Transformer
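
To make the `positional_encoding` method above concrete, here is a small dependency-free sketch of the same sinusoidal formula in plain Python. It is an illustration rather than the notebook's implementation: `sinusoidal_pe` is a hypothetical helper name, and the tiny `d_model=4` is chosen only so the values are easy to inspect.

```python
import math

def sinusoidal_pe(maxlen, d_model):
    """Return a maxlen x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, d_model, 2):
            # Same denominator as the method above: 10000 ** (i / d_model)
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)          # even embedding dimension
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd embedding dimension
    return table

pe = sinusoidal_pe(maxlen=3, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 (sine) and 1 (cosine); later positions rotate each sine/cosine pair at a frequency that drops as the dimension index grows.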


In [None]:
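def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence of token ids to create padding mask for
  Returns
  -------
  mask : Tensor
      Float padding mask, -inf at padding positions and 0 elsewhere
  """
  # Compare against the tokenizer's pad id (PAD_IDX is set in the tokenizer cell
  # below and looked up when the function is called). For xlm-roberta-base, id 0
  # is <s>, not <pad>. A float mask is added to the attention scores, so padding
  # positions need -inf rather than 1 to be excluded.
  mask = torch.zeros(seq.shape, dtype=torch.float, device=seq.device)
  return mask.masked_fill(seq == PAD_IDX, float('-inf'))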

def create_triu_mask(sz):
  """ Method to create a square triangular (causal) mask. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  sz : int
     Sequence length; the mask has shape (sz, sz)
  Returns
  -------
  mask : Tensor
      Triangular mask, -inf above the diagonal and 0 elsewhere
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
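  Returns
  -------
  tokenized_source : Tensor
      Token ids of the source text, padded or truncated to 120 tokens
  tokenized_targets : Tensor
      Token ids of the target text, padded or truncated to 120 tokens
  """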

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']
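
The two mask types above can be illustrated without PyTorch. The sketch below builds an additive causal mask and an additive padding mask as nested lists. The names `causal_mask` and `padding_mask`, the token ids, and the pad id `1` are hypothetical choices for this example; in the notebook the pad id comes from `tokenizer.pad_token_id`.

```python
NEG_INF = float("-inf")

def causal_mask(sz):
    """sz x sz additive mask: -inf above the diagonal, 0.0 elsewhere."""
    return [[NEG_INF if col > row else 0.0 for col in range(sz)] for row in range(sz)]

def padding_mask(token_ids, pad_id):
    """Additive mask over one sequence: -inf at padding positions, 0.0 elsewhere."""
    return [NEG_INF if tok == pad_id else 0.0 for tok in token_ids]

PAD_ID = 1  # assumed pad id, for this example only
tokens = [0, 57, 912, 2, PAD_ID, PAD_ID]  # made-up ids followed by padding

print(causal_mask(3))
print(padding_mask(tokens, PAD_ID))
```

Adding `-inf` to an attention score before the softmax gives that position zero weight, which is how both masks hide future tokens and padding.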


In [None]:
from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beginning of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(
    src_vocab_size=tokenizer.vocab_size,
    tgt_vocab_size=tokenizer.vocab_size,
    d_model=512,
    nhead=8,
    num_encoder_layers=3,
    num_decoder_layers=3,
    dim_feedforward=256,
    dropout=0.1,
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)


In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        # Source mask: a Sequence x Sequence matrix of zeros, so no source position is hidden
        # (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(train_loader)  # len() gives the batch count without iterating the loader again


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(val_dataloader)  # len() gives the batch count without iterating the loader again
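
The `tgt[:,:-1]` and `tgt[:,1:]` slices in `train_epoch` and `evaluate` implement teacher forcing: the decoder reads the target up to position t and is scored on the token at position t+1. A minimal sketch with a plain list, where the token ids are made up for illustration:

```python
# Hypothetical token ids: start token, three word tokens, end token
tgt = [0, 11, 12, 13, 2]

tgt_input = tgt[:-1]  # what the decoder sees:  [0, 11, 12, 13]
tgt_out = tgt[1:]     # what it should predict: [11, 12, 13, 2]

# At each step the decoder has seen a prefix of tgt_input and must predict the next token
for seen_up_to, expected in zip(range(1, len(tgt)), tgt_out):
    print(f"decoder input {tgt_input[:seen_up_to]} -> target {expected}")
```

Both slices have the same length, which is why the logits and `tgt_out` line up position by position in the loss.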



In [None]:
def train(model, epochs, train_loader, validation_loader):
    for epoch in range(1, epochs+1):
        train_loss = train_epoch(model, train_loader, tokenizer)
        val_loss = evaluate(model, validation_loader)
        print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}")
        # Overwrite the checkpoint on Drive after every epoch
        torch.save(model.state_dict(), "/content/drive/MyDrive/my_model.pth")

train(model, 2, train_loader,validation_loader)

100%|██████████| 1250/1250 [08:10<00:00,  2.55it/s]
100%|██████████| 112/112 [00:12<00:00,  9.33it/s]


Epoch: 1, Train loss: 6.681, Val loss: 6.119


100%|██████████| 1250/1250 [08:21<00:00,  2.49it/s]
100%|██████████| 112/112 [00:11<00:00,  9.42it/s]


Epoch: 2, Train loss: 5.568, Val loss: 5.760


In [None]:
model_path = '/content/drive/MyDrive/my_model.pth'
model.load_state_dict(torch.load(model_path, map_location=device))  # load onto the current device

In [None]:
from evaluate import load
bertscore = load("bertscore")
rouge = load('rouge')
meteor = load('meteor')



In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder once; the result is reused at every step
    memory = model.encode(src, src_mask)

    # Start the output sequence with the start symbol
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        # Create the triangular mask for the target sequence generated so far
        tgt_mask = create_triu_mask(ys.size(1)).to(device)

        # Pass the generated sequence through the decoder
        out = model.decode(ys, memory, tgt_mask)

        # Get the vocabulary logits for the last position
        logits = model.fc(out[:, -1])

        # Greedy choice: pick the token with the highest logit
        _, next_word = torch.max(logits, dim=1)
        next_word = next_word.item()

        # Append the chosen token to the sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)
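
Greedy decoding, as used in `greedy_decode`, repeatedly appends the highest-scoring next token until the end token appears or the length limit is reached. The sketch below replaces the Transformer with a hypothetical lookup table of next-token scores (`NEXT_SCORES`) so the loop can be run in isolation; the tokens and scores are invented.

```python
# Invented next-token scores, keyed by the last generated token
NEXT_SCORES = {
    "<s>": {"bonjour": 2.0, "salut": 1.5, "</s>": 0.1},
    "bonjour": {"le": 0.3, "monde": 1.2, "</s>": 0.9},
    "monde": {"bonjour": 0.2, "</s>": 3.0},
}

def greedy_decode_toy(start, end, max_len):
    """Append the argmax token at each step; stop at `end` or at max_len tokens."""
    ys = [start]
    for _ in range(max_len - 1):
        scores = NEXT_SCORES[ys[-1]]
        next_token = max(scores, key=scores.get)  # greedy choice
        ys.append(next_token)
        if next_token == end:
            break
    return ys

print(greedy_decode_toy("<s>", "</s>", max_len=10))
# → ['<s>', 'bonjour', 'monde', '</s>']
```

Because each step keeps only the single best token, greedy decoding cannot revisit an earlier choice; beam search is a common alternative when that matters.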


In [None]:
print(translate(model, "Hello how are you today",tokenizer))

vous savez vous savez vous savez


In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using BERTScore precision, recall, and F1, and the METEOR metric.
    Arguments:
        test_loader: DataLoader or list of (source, target) pairs
            Holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda'). Currently unused; translate relies on the global device.
        max_length: int
            Maximum length for generated translations. Currently unused; translate sets its own limit.
    Returns:
        tuple: Averaged BERTScore precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # A plain list of (source, target) pairs yields bare strings. Wrap them in
        # lists so the loops below iterate over sentences rather than characters,
        # which would make the length check below skip the example.
        if isinstance(src, str):
            src, target = [src], [target]

        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences that are token id lists; plain strings are used as they are
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Accumulate BERTScore precision, recall, and F1
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
#test(test_set, model, tokenizer, device)

# Define a size for the subset you want (here, 10 samples)
subset_size = 10

# Trim the test set by slicing it
trimmed_test_set = test_set[:subset_size]

# Example call to the test function
test(trimmed_test_set, model, tokenizer, device)


(0.0, 0.0, 0.0, 0.0)
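
One thing worth checking when `test` is given a plain list of `(source, target)` tuples instead of a `DataLoader`: unpacking each element yields two strings, and iterating over a string yields characters. The sketch below shows the difference with made-up sentence pairs; `batches` is a hypothetical stand-in for what a batched loader would yield.

```python
pairs = [("hello world", "bonjour le monde"), ("thank you", "merci")]

# Iterating the raw list: src is a single string, so looping over it yields characters
src, target = pairs[0]
print(type(src).__name__, list(src)[:5])

# Grouping pairs into batches: src_batch is a tuple of whole sentences
batch_size = 2
batches = [tuple(zip(*pairs[i:i + batch_size])) for i in range(0, len(pairs), batch_size)]
src_batch, target_batch = batches[0]
print(src_batch)
print(target_batch)
```

In the first case a per-sentence loop would run once per character, so predictions and references no longer correspond to sentences.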

### Learning rate = 0.01, tokenizer = FacebookAI/xlm-roberta-base, training set = 10000, epochs = 2, test set = 10

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install bert_score
!pip install rouge_score



In [None]:
from datasets import load_dataset

dataset = load_dataset("IWSLT/iwslt2017",'iwslt2017-en-fr')

trim_dataset= dataset['train']['translation'][:10000]

import string
def preprocess_data(text):
  """ Method to clean text from noise and standardize it.
      The preprocessing joins all datapoints into one string, converts it to lowercase, and removes newlines, punctuation, and digits.
  Arguments
  ---------
  text : List of String
     Text to clean
  Returns
  -------
  text : String
      Cleaned and joined text
  """

  text = " ".join(text).lower() # join the datapoints and make everything lower case
  text = text.replace("\n", " ") # replace \n characters with spaces
  text = text.translate(str.maketrans("", "", string.punctuation)) # remove ASCII punctuation
  text = ''.join(filter(lambda x: not x.isdigit(), text)) # remove all digits

  return text

def create_dataset(dataset,source_lang,target_lang):
  """ Method to create a dataset of preprocessed (source, target) pairs.
  Arguments
  ---------
  dataset : List of Dict
     Translation examples, each mapping a language code to its text
  source_lang : String
     Source language
  target_lang : String
     Target language
  Returns
  -------
  new_dataset : List of Tuple of String
      Source and target text in format (source, target)
  """
  new_dataset=[]
  for data in dataset:
    # Extract source and target translations
    source_text = data[source_lang]
    target_text = data[target_lang]

    # Preprocess the source and target text with preprocess_data (defined above)
    source_text = preprocess_data([source_text])
    target_text = preprocess_data([target_text])

    # Append the tuple of source and target text to the new dataset
    new_dataset.append((source_text, target_text))

  return new_dataset

training_set=create_dataset(trim_dataset,'en','fr')
validation_set=create_dataset(dataset['validation']['translation'],'en','fr')
test_set=create_dataset(dataset['test']['translation'],'en','fr')

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward,dropout):
        super(TransformerModel, self).__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)  # Embedding layer for source language
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)  # Embedding layer for target language
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )  # Transformer with batch_first=True
        self.fc = nn.Linear(d_model, tgt_vocab_size)  # Last linear layer

    def positional_encoding(self, d_model, maxlen = 5000):
        """Method to create a positional encoding buffer.
        Arguments
        ---------
        d_model: int
            Embedding size
        maxlen: int
            Maximum sequence length
        Returns
        -------
        PE: Tensor
            Positional encoding buffer
        """
        pos = torch.arange(0, maxlen).unsqueeze(1)
        denominator = 10000 ** (torch.arange(0, d_model, 2) / d_model)

        PE = torch.zeros((maxlen, d_model))
        PE[:, 0::2] = torch.sin(pos / denominator)  # Sine on even embedding dimensions
        PE[:, 1::2] = torch.cos(pos / denominator)  # Cosine on odd embedding dimensions

        PE = PE.unsqueeze(0)  # Add batch dimension

        return PE


    def forward(self, src, tgt, src_mask=None, tgt_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None):
        """Method to forward a batch of data through the model."""
        # Pass source and target through their embedding layers
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        # Build the positional encoding and move it to the inputs' device
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)

        # Add the positional encoding to the source and target embeddings
        src_emb = src + positional_encoding[:, :src.shape[1], :]
        tgt_emb = tgt + positional_encoding[:, :tgt.shape[1], :]

        # Pass the embeddings and all masks through the transformer; the source
        # padding mask is reused as the memory padding mask
        output = self.transformer(
            src_emb,
            tgt_emb,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            memory_mask=None,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask,
        )

        # Project the decoder output onto the target vocabulary
        output = self.fc(output)
        return output

    def encode(self, src, src_mask):
        """Method to encode a batch of data through the transformer model."""
        src = self.src_embedding(src)  # pass src through the embedding layer
        positional_encoding = self.positional_encoding(src.size(2)).to(src.device)  # create positional encoding
        src_emb = src + positional_encoding[:, :src.size(1), :]  # add it to the embeddings
        return self.transformer.encoder(src_emb, src_mask)  # run the encoder stack of nn.Transformer


    def decode(self, tgt, memory, tgt_mask):
        """Method to decode a batch of data through the transformer model."""
        tgt = self.tgt_embedding(tgt)  # pass tgt through the embedding layer
        positional_encoding = self.positional_encoding(tgt.size(2)).to(tgt.device)  # create positional encoding
        tgt_emb = tgt + positional_encoding[:, :tgt.size(1), :]  # add it to the embeddings
        return self.transformer.decoder(tgt_emb, memory, tgt_mask)  # run the decoder stack of nn.Transformer

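def create_padding_mask(seq):
  """ Method to create a padding mask based on given sequence.
  Arguments
  ---------
  seq : Tensor
     Sequence of token ids to create padding mask for
  Returns
  -------
  mask : Tensor
      Float padding mask, -inf at padding positions and 0 elsewhere
  """
  # Compare against the tokenizer's pad id (PAD_IDX is set in the tokenizer cell
  # below and looked up when the function is called). For xlm-roberta-base, id 0
  # is <s>, not <pad>. A float mask is added to the attention scores, so padding
  # positions need -inf rather than 1 to be excluded.
  mask = torch.zeros(seq.shape, dtype=torch.float, device=seq.device)
  return mask.masked_fill(seq == PAD_IDX, float('-inf'))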

def create_triu_mask(sz):
  """ Method to create a square triangular (causal) mask. This is used for the tgt mask in the Transformer model to avoid looking ahead.
  Arguments
  ---------
  sz : int
     Sequence length; the mask has shape (sz, sz)
  Returns
  -------
  mask : Tensor
      Triangular mask, -inf above the diagonal and 0 elsewhere
  """
  # Create an upper triangular matrix
  mask = torch.triu(torch.ones(sz, sz), diagonal=1)  # Upper triangular mask
  # Replace 1's with -inf and 0's with 0
  mask = mask.float().masked_fill(mask == 1, float('-inf')).masked_fill(mask == 0, float(0.0))
  return mask

def tokenize_batch(source, targets,tokenizer):
  """ Method to tokenize a batch of data given a tokenizer.
  Arguments
  ---------
  source : List of String
     Source text
  targets : List of String
     Target text
  tokenizer : Tokenizer
     Tokenizer to use for tokenization
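  Returns
  -------
  tokenized_source : Tensor
      Token ids of the source text, padded or truncated to 120 tokens
  tokenized_targets : Tensor
      Token ids of the target text, padded or truncated to 120 tokens
  """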

  tokenized_source = tokenizer(source, padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  tokenized_targets = tokenizer(targets,  padding='max_length', max_length=120, truncation=True, return_tensors='pt')

  return tokenized_source['input_ids'], tokenized_targets['input_ids']



In [None]:
from transformers import AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained('FacebookAI/xlm-roberta-base')
PAD_IDX = tokenizer.pad_token_id #for padding
BOS_IDX = tokenizer.bos_token_id #for beginning of sentence
EOS_IDX = tokenizer.eos_token_id #for end of sentence

model = TransformerModel(
    src_vocab_size=tokenizer.vocab_size,
    tgt_vocab_size=tokenizer.vocab_size,
    d_model=512,
    nhead=8,
    num_encoder_layers=3,
    num_decoder_layers=3,
    dim_feedforward=256,
    dropout=0.1,
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.98), eps=1e-9)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

train_loader = torch.utils.data.DataLoader(training_set, batch_size=8, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=8, shuffle=False)


In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm

def train_epoch(model,train_loader,tokenizer):
    model.train()
    losses = 0

    for src, tgt in tqdm(train_loader):
        src, tgt = tokenize_batch(src, tgt, tokenizer)
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:,:-1]

        # Source mask: a Sequence x Sequence matrix of zeros, so no source position is hidden
        # (see https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)
        src_mask = torch.zeros((src.size(1), src.size(1)), device=device)
        tgt_mask = create_triu_mask(tgt_input.size(1)).to(device) #create triangular mask for target

        src_padding_mask = create_padding_mask(src).to(device) #create padding mask for src
        tgt_padding_mask = create_padding_mask(tgt_input).to(device) #create padding mask for tgt

        logits = model(
            src,
            tgt_input,
            src_mask=src_mask,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        ) #pass it through model

        optimizer.zero_grad()

        tgt_out = tgt[:,1:]
        loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(train_loader)  # len() gives the batch count without iterating the loader again


def evaluate(model,val_dataloader ):
    model.eval()
    losses = 0
    with torch.no_grad():
      for src, tgt in tqdm(val_dataloader):
          src, tgt = tokenize_batch(src, tgt, tokenizer)
          src = src.to(device)
          tgt = tgt.to(device)

          tgt_input = tgt[:,:-1]

          #do the same as in Train
          # Create masks and padding masks
          src_mask = torch.zeros((src.size(1), src.size(1)), device=device)  # Source mask
          tgt_mask = create_triu_mask(tgt_input.size(1)).to(device)  # Triangular target mask

          src_padding_mask = create_padding_mask(src).to(device)  # Source padding mask
          tgt_padding_mask = create_padding_mask(tgt_input).to(device)  # Target padding mask

          # Forward pass through the model
          logits = model(
              src,
              tgt_input,
              src_mask=src_mask,
              tgt_mask=tgt_mask,
              src_key_padding_mask=src_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask
          )

          tgt_out = tgt[:,1:]
          loss = loss_function(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
          losses += loss.item()

    return losses / len(val_dataloader)  # len() gives the batch count without iterating the loader again

def train(model, epochs, train_loader, validation_loader):
    for epoch in range(1, epochs+1):
        train_loss = train_epoch(model, train_loader, tokenizer)
        val_loss = evaluate(model, validation_loader)
        print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}")
        # Overwrite the checkpoint on Drive after every epoch
        torch.save(model.state_dict(), "/content/drive/MyDrive/my_model_one.pth")

train(model, 5, train_loader,validation_loader)


 72%|███████▏  | 4508/6250 [28:41<11:01,  2.63it/s]

In [None]:
def train(model, epochs, train_loader, validation_loader):
    for epoch in range(1, epochs + 1):
        train_loss = train_epoch(model, train_loader, tokenizer)
        val_loss = evaluate(model, validation_loader)
        print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}")
        torch.save(model.state_dict(), "/content/drive/MyDrive/my_model_one.pth")

train(model, 2, train_loader, validation_loader)

100%|██████████| 1250/1250 [08:10<00:00,  2.55it/s]
100%|██████████| 112/112 [00:11<00:00, 10.08it/s]


Epoch: 1, Train loss: 6.953, Val loss: 7.035


100%|██████████| 1250/1250 [08:20<00:00,  2.50it/s]
100%|██████████| 112/112 [00:11<00:00, 10.08it/s]


Epoch: 2, Train loss: 6.856, Val loss: 7.075


In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)

    # Pass the source through the encoder
    memory = model.encode(src, src_mask)

    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)

    for i in range(max_len-1):
        memory = memory.to(device)

        # Causal (triangular) mask for the target sequence generated so far
        tgt_mask = create_triu_mask(ys.size(1)).to(device)

        # Pass the sequence generated so far through the decoder
        out = model.decode(ys, memory, tgt_mask)

        # Project the last decoder state to vocabulary logits
        prob = model.fc(out[:, -1])

        # Greedy step: pick the highest-scoring token
        _, next_word = torch.max(prob, dim=1)

        # Append the chosen token id to the running sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word.item())], dim=1)

        # If EOS token is generated, stop decoding
        if next_word == EOS_IDX:
            break

    return ys

def translate(model: torch.nn.Module, src_sentence: str, tokenizer):
    model.eval()
    src, _ = tokenize_batch(src_sentence, "", tokenizer)
    src = src.to(device)
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.float).to(device)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len= int(num_tokens * 1.2 ), start_symbol=tokenizer.cls_token_id).flatten()
    return tokenizer.decode(tgt_tokens, skip_special_tokens=True)


In [None]:
print(translate(model, "Hello how are you today",tokenizer))




In [None]:
import numpy as np
def test(test_loader, model, tokenizer, device, max_length=200):
    """
    Method to test our model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader
            DataLoader that holds the test set.
        model: nn.Module
            Trained machine translation model.
        tokenizer: Tokenizer
            Tokenizer for input/output processing.
        device: torch.device
            Device to run the model on ('cpu' or 'cuda').
        max_length: int
            Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision = 0
    recall = 0
    f1 = 0
    meteor_metric = 0

    for src, target in test_loader:
        # Use translate method to evaluate our model
        results_bert = [translate(model, src_sentence, tokenizer) for src_sentence in src]
        results_meteor = results_bert  # Using the same results for METEOR

        # Decode target sentences (if target is tokenized, we can pass directly; else, we tokenize it)
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"  # Setting French as the target language
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, f1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Run the test function
test(test_set[:10], model, tokenizer, device)


We did not get any output here: the cell above ran until the Colab session disconnected. The validation loss also rose from 7.035 to 7.075 between epochs 1 and 2 while the training loss barely moved (6.953 to 6.856), which suggests the training hyperparameters were poorly chosen.
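
One thing worth checking before rerunning the cell, as a sketch only: if `test_set` is a list of `(source, target)` string pairs (an assumption here, based on how it is built in the T5 section of this notebook), then iterating over `test_set[:10]` directly yields one pair at a time, and the inner loop `for src_sentence in src` walks over single characters instead of sentences. The toy data and the `batches` helper below are hypothetical; the helper only roughly mimics how a `DataLoader` groups string pairs.

```python
# Toy stand-in for the test set: a list of (source, target) string pairs.
toy_test_set = [("Hello there", "Bonjour"), ("Thank you", "Merci")]


def batches(pairs, batch_size):
    """Hypothetical helper: yield (tuple_of_sources, tuple_of_targets) per batch."""
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        sources, targets = zip(*chunk)
        yield sources, targets


# Iterating the raw list: `src` is ONE sentence, so looping over it gives characters.
raw_items = [list(src) for src, _ in toy_test_set]

# Iterating batches: `src` is a tuple of sentences, so looping over it gives sentences.
batched_items = [list(src) for src, _ in batches(toy_test_set, batch_size=2)]

print(raw_items[0][:5])   # ['H', 'e', 'l', 'l', 'o']
print(batched_items[0])   # ['Hello there', 'Thank you']
```

If that assumption holds, wrapping the slice in a `DataLoader` before passing it to `test` (as the T5 evaluation cell does) would make each `src` a batch of sentences.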

## T5 - training set = 50000, epochs = 5, test set = 100

In [None]:
!pip install transformers datasets torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch

# Load dataset
dataset = load_dataset("IWSLT/iwslt2017", "iwslt2017-en-fr")
train_data = dataset['train']['translation'][:100000]
val_data = dataset['validation']['translation']

# Preprocess function for T5
def preprocess_t5(dataset, source_lang, target_lang):
    inputs, targets = [], []
    for data in dataset:
        source = f"translate English to French: {data[source_lang]}"
        target = data[target_lang]
        inputs.append(source)
        targets.append(target)
    return inputs, targets

train_inputs, train_targets = preprocess_t5(train_data, 'en', 'fr')
val_inputs, val_targets = preprocess_t5(val_data, 'en', 'fr')

# Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def tokenize_data(inputs, targets, tokenizer, max_length=512):
    input_encodings = tokenizer(inputs, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    target_encodings = tokenizer(targets, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    return input_encodings.input_ids, input_encodings.attention_mask, target_encodings.input_ids

# Tokenize datasets
train_input_ids, train_attention_masks, train_labels = tokenize_data(train_inputs, train_targets, tokenizer)
val_input_ids, val_attention_masks, val_labels = tokenize_data(val_inputs, val_targets, tokenizer)

# DataLoader
train_dataset = torch.utils.data.TensorDataset(train_input_ids, train_attention_masks, train_labels)
val_dataset = torch.utils.data.TensorDataset(val_input_ids, val_attention_masks, val_labels)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)

# T5 model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Training function
def train_t5(model, train_loader, optimizer, tokenizer):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader):
        input_ids, attention_masks, labels = [x.to(device) for x in batch]
        labels[labels == tokenizer.pad_token_id] = -100  # Ignore padding in loss computation
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

# Evaluation function
def evaluate_t5(model, val_loader, tokenizer):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in tqdm(val_loader):
            input_ids, attention_masks, labels = [x.to(device) for x in batch]
            labels[labels == tokenizer.pad_token_id] = -100
            outputs = model(input_ids=input_ids, attention_mask=attention_masks, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
    return total_loss / len(val_loader)

# Training loop
epochs = 2
for epoch in range(epochs):
    train_loss = train_t5(model, train_loader, optimizer, tokenizer)
    val_loss = evaluate_t5(model, val_loader, tokenizer)
    print(f"Epoch {epoch + 1}: Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}")

!pip install nltk
from nltk.translate.bleu_score import sentence_bleu

# BLEU Evaluation
def calculate_bleu(model, inputs, targets, tokenizer):
    model.eval()
    references = []
    hypotheses = []
    for input_text, target_text in zip(inputs, targets):
        input_ids = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).input_ids.to(device)
        with torch.no_grad():
            output_ids = model.generate(input_ids)
        hypothesis = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        references.append([target_text.split()])
        hypotheses.append(hypothesis.split())
    return sum([sentence_bleu(ref, hyp) for ref, hyp in zip(references, hypotheses)]) / len(references)

# T5 BLEU
t5_bleu = calculate_bleu(model, val_inputs[:100], val_targets[:100], tokenizer)
print(f"T5 BLEU Score: {t5_bleu:.4f}")





100%|██████████| 12500/12500 [45:19<00:00,  4.60it/s]
100%|██████████| 112/112 [00:05<00:00, 19.58it/s]


Epoch 1: Train Loss = 1.1428, Val Loss = 1.0796


100%|██████████| 12500/12500 [45:22<00:00,  4.59it/s]
100%|██████████| 112/112 [00:05<00:00, 19.70it/s]


Epoch 2: Train Loss = 1.0618, Val Loss = 1.0664




T5 BLEU Score: 0.1093


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
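
The warnings above help explain the low average: unsmoothed sentence-level BLEU is a geometric mean of the 1-gram to 4-gram precisions, so a single order with zero matches sets the whole sentence score to 0. The sketch below is a simplified pure-Python illustration, not NLTK's implementation: it omits the brevity penalty, and the function names and example sentences are made up for this note.

```python
import math
from collections import Counter


def ngram_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a tokenized hypothesis against one reference."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def unsmoothed_bleu(reference, hypothesis, max_n=4):
    """Geometric mean of the n-gram precisions (brevity penalty omitted)."""
    precisions = [ngram_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0, precisions  # one empty order zeroes the whole score
    score = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return score, precisions


reference = "le chat est sur le tapis".split()
hypothesis = "le chat dort sur le tapis".split()

score4, precisions4 = unsmoothed_bleu(reference, hypothesis, max_n=4)
score2, _ = unsmoothed_bleu(reference, hypothesis, max_n=2)
print(precisions4, score4)
print(score2)
```

Here five of six words match, yet the 4-gram precision is 0 and so is the 4-gram score, while the 2-gram score is clearly non-zero. As the warning suggests, NLTK's `SmoothingFunction` or a lower n-gram order avoids these hard zeros on short sentences.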


Retraining after the session disconnected, because the model had not been saved beforehand.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from datasets import load_dataset
from evaluate import load
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np

# Load the T5 model and tokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_name = "t5-small"  # You can use "t5-base" or "t5-large" for better performance
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

# Load dataset
dataset = load_dataset("IWSLT/iwslt2017", 'iwslt2017-en-fr')
train_data = dataset['train']['translation'][:50000]
validation_data = dataset['validation']['translation']
test_data = dataset['test']['translation']

# Preprocessing function
def preprocess_translation(dataset, source_lang, target_lang):
    return [(example[source_lang], example[target_lang]) for example in dataset]

training_set = preprocess_translation(train_data, 'en', 'fr')
validation_set = preprocess_translation(validation_data, 'en', 'fr')
test_set = preprocess_translation(test_data, 'en', 'fr')

# Tokenization function
def tokenize_batch(source_texts, target_texts, tokenizer, max_length=128):
    inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    targets = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    return inputs.input_ids.to(device), inputs.attention_mask.to(device), targets.input_ids.to(device)

# Define data loader
batch_size = 8
train_loader = DataLoader(training_set, batch_size=batch_size, shuffle=True)
validation_loader = DataLoader(validation_set, batch_size=batch_size, shuffle=False)

# Optimizer (the model computes the loss itself when `labels` are passed)
optimizer = torch.optim.AdamW(t5_model.parameters(), lr=3e-5)

# Training function
def train_t5_epoch(model, dataloader, tokenizer, optimizer, device):
    model.train()
    epoch_loss = 0

    for source_texts, target_texts in tqdm(dataloader):
        input_ids, attention_mask, target_ids = tokenize_batch(source_texts, target_texts, tokenizer)
        labels = target_ids.clone()
        labels[labels == tokenizer.pad_token_id] = -100  # Ignore padding tokens in loss

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        epoch_loss += loss.item()

    return epoch_loss / len(dataloader)

# Evaluation function
def evaluate_t5(model, dataloader, tokenizer, device):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for source_texts, target_texts in tqdm(dataloader):
            input_ids, attention_mask, target_ids = tokenize_batch(source_texts, target_texts, tokenizer)
            labels = target_ids.clone()
            labels[labels == tokenizer.pad_token_id] = -100  # Ignore padding tokens in loss

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            epoch_loss += outputs.loss.item()

    return epoch_loss / len(dataloader)

# Training loop
num_epochs = 5
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train_loss = train_t5_epoch(t5_model, train_loader, t5_tokenizer, optimizer, device)
    val_loss = evaluate_t5(t5_model, validation_loader, t5_tokenizer, device)
    print(f"Training Loss: {train_loss:.4f} | Validation Loss: {val_loss:.4f}")
    torch.save(t5_model.state_dict(), "/content/drive/MyDrive/my_model_t5.pth")

Epoch 1/5


100%|██████████| 6250/6250 [10:14<00:00, 10.17it/s]
100%|██████████| 112/112 [00:03<00:00, 31.56it/s]


Training Loss: 1.2811 | Validation Loss: 1.1286
Epoch 2/5


100%|██████████| 6250/6250 [10:14<00:00, 10.16it/s]
100%|██████████| 112/112 [00:03<00:00, 34.83it/s]


Training Loss: 1.1930 | Validation Loss: 1.1116
Epoch 3/5


100%|██████████| 6250/6250 [10:14<00:00, 10.18it/s]
100%|██████████| 112/112 [00:03<00:00, 35.24it/s]


Training Loss: 1.1549 | Validation Loss: 1.1018
Epoch 4/5


100%|██████████| 6250/6250 [10:14<00:00, 10.17it/s]
100%|██████████| 112/112 [00:03<00:00, 30.07it/s]


Training Loss: 1.1258 | Validation Loss: 1.0919
Epoch 5/5


100%|██████████| 6250/6250 [10:13<00:00, 10.19it/s]
100%|██████████| 112/112 [00:03<00:00, 33.24it/s]


Training Loss: 1.1045 | Validation Loss: 1.0889


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
from evaluate import load
import torch
import numpy as np

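# Load the pretrained T5 checkpoint and tokenizer.
# Note: this reloads the original t5-small weights, not the fine-tuned `t5_model` trained above.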
model_name = "t5-small"  # You can choose larger versions like "t5-base" or "t5-large"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

# Define BERTScore and METEOR metrics
bertscore = load("bertscore")
meteor = load("meteor")

# Translation function using T5
def translate_t5(model, src_sentence, tokenizer, device, max_length=200):
    """
    Translate a single sentence using T5 model.
    Arguments:
        model: T5 model for translation.
        src_sentence: str, Source sentence to translate.
        tokenizer: Tokenizer for T5 model.
        device: torch.device, Device to run the model on.
        max_length: int, Maximum length for generated sequence.
    Returns:
        str: Translated sentence.
    """
    model.eval()
    # Preprocess input sentence for T5
    input_text = f"translate English to French: {src_sentence}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

    # Generate translation
    outputs = model.generate(input_ids, max_length=max_length, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate_t5(model, "Hello how are you today", tokenizer, device))



Bonjour, comment êtes-vous aujourd'hui


In [None]:
# Testing function for T5 model
def test_t5(test_loader, model, tokenizer, device, max_length=200):
    """
    Test T5 model using precision, recall, F1, and METEOR metrics.
    Arguments:
        test_loader: DataLoader, DataLoader for the test set.
        model: T5 model.
        tokenizer: T5 tokenizer.
        device: torch.device, Device to run the model on.
        max_length: int, Maximum length for generated translations.
    Returns:
        tuple: Averaged precision, recall, F1, and METEOR scores.
    """
    precision, recall, f1, meteor_metric = 0, 0, 0, 0

    for src, target in test_loader:
        # Generate translations for each source sentence
        results_bert = [translate_t5(model, src_sentence, tokenizer, device, max_length) for src_sentence in src]
        results_meteor = results_bert

        # Decode target sentences if necessary
        target_sentences = [tokenizer.decode(tgt, skip_special_tokens=True) if isinstance(tgt, list) else tgt for tgt in target]

        if len(results_bert) != len(target_sentences):
            continue

        # Compute BERTScore metrics
        bert_results = bertscore.compute(
            predictions=results_bert,
            references=target_sentences,
            lang="fr"
        )

        # Compute METEOR metric
        meteor_metric += np.mean([
            meteor.compute(predictions=[pred], references=[ref])["meteor"]
            for pred, ref in zip(results_meteor, target_sentences)
        ])

        # Calculate precision, recall, F1 (using BERTScore metrics)
        precision += np.mean(bert_results["precision"])
        recall += np.mean(bert_results["recall"])
        f1 += np.mean(bert_results["f1"])

    # Return averaged precision, recall, F1, and METEOR scores
    return precision / len(test_loader), recall / len(test_loader), f1 / len(test_loader), meteor_metric / len(test_loader)

# Define a subset of the test set for evaluation
subset_size = 100
trimmed_test_set = test_set[:subset_size]

# Convert test set into DataLoader
test_loader = torch.utils.data.DataLoader(trimmed_test_set, batch_size=8, shuffle=False)

# Evaluate T5 model
precision, recall, f1, meteor_score = test_t5(test_loader, model, tokenizer, device)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}, METEOR: {meteor_score:.4f}")



Precision: 0.8695, Recall: 0.8710, F1: 0.8699, METEOR: 0.5976
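
A note on how these averages are formed: `test_t5` adds up the mean of each batch and divides by the number of batches. With 100 test pairs and a batch size of 8, the last batch holds only 4 pairs, so its sentences count twice as much as the others. The sketch below uses made-up scores (not results from this notebook) to show how the mean of batch means can differ from the plain per-sentence mean.

```python
from statistics import mean

# Made-up per-sentence F1 values for illustration only:
# 96 sentences score 0.9 and the last 4 score 0.5.
scores = [0.9] * 96 + [0.5] * 4
batch_size = 8

# Split into batches of 8; the final batch holds only 4 sentences.
batched = [scores[i:i + batch_size] for i in range(0, len(scores), batch_size)]

# What test_t5 effectively does: average the per-batch means.
mean_of_batch_means = mean(mean(batch) for batch in batched)

# Plain average over all 100 sentences.
per_sentence_mean = mean(scores)

print(len(batched), len(batched[-1]))
print(round(mean_of_batch_means, 4), round(per_sentence_mean, 4))
```

The gap is small when scores are similar across batches, but collecting all per-sentence scores and averaging once removes the dependence on batch size.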
