#**Using Transformers for language modeling**

In this assignment, you will experiment with using transformers to solve two different language modeling problems: text generation and translation.

- Some packages you may need. You are free to use alternative ones, but this should make your task simpler.

In [1]:
# You only need to run this once when you load the notebook to install required packages. You can comment this cell out once you run it.

# !pip install torch
!pip install datasets
!pip install apache_beam mwparserfromhell
!pip install transformers[torch]
!pip install sentence_transformers
!pip install evaluate
!pip install accelerate -U

Collecting dill<0.3.2,>=0.3.1.1 (from apache_beam)
  Using cached dill-0.3.1.1-py3-none-any.whl
Installing collected packages: dill
  Attempting uninstall: dill
    Found existing installation: dill 0.3.7
    Uninstalling dill-0.3.7:
      Successfully uninstalled dill-0.3.7
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
multiprocess 0.70.15 requires dill>=0.3.7, but you have dill 0.3.1.1 which is incompatible.
Successfully installed dill-0.3.1.1
Collecting dill (from evaluate)
  Using cached dill-0.3.7-py3-none-any.whl (115 kB)
Installing collected packages: dill
  Attempting uninstall: dill
    Found existing installation: dill 0.3.1.1
    Uninstalling dill-0.3.1.1:
      Successfully uninstalled dill-0.3.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the

Connect to Google Drive

In [71]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


- Check if GPU is available. If so, it should print `cuda`

In [3]:
import torch

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


##**Part 1: Using a Transformer to model Wikipedia text**

You will use a GPT2 Transformer to model the [simple Wikipedia dataset](https://huggingface.co/datasets/wikipedia/viewer/20220301.simple/train). Our goal is to generate Wikipedia-sounding articles that are novel but also believable.

- Load the dataset

In [72]:
from datasets import load_dataset

wikipedia_simple_dataset = load_dataset("wikipedia", "20220301.simple")

print("dataset structure is", wikipedia_simple_dataset)

print("an example of a training sequence is: ", wikipedia_simple_dataset["train"]["text"][0])

dataset structure is DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 205328
    })
})
an example of a training sequence is:  April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.

April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.

April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.

The Month 

April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.

April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every yea

- Split the dataset into a training set (the first 300 articles) and a test set (the last 60 articles)

In [73]:
# Check the total number of rows in the dataset
total_rows = len(wikipedia_simple_dataset["train"])

# Define the indices for splitting the dataset
train_end_idx = 300  # The end index for the training set
test_start_idx = total_rows - 60  # The start index for the test set

# Ensure the dataset has at least 360 rows (300 for training + 60 for testing)
if total_rows < 360:
    raise ValueError("The dataset has fewer than 360 rows.")

# Split the dataset into training and test sets
train_dataset = wikipedia_simple_dataset["train"].select(range(train_end_idx))
test_dataset = wikipedia_simple_dataset["train"].select(range(test_start_idx, total_rows))

1. **(1 point)** Fine-tune a *pretrained* GPT2 transformer with a context of 512 tokens (with padding). You should:
  - Print the training and test losses every epoch. **(0.25 points)**
  - Save the model that performs best on the **test set** as `best_model`  **(0.25 points)**
  - Train for 10 epochs **(0.5 points)**

Step 1: Create the tokenizer and tokenize the dataset



In [74]:
context_len = 512

In [75]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments

#Tokenization
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # GPT2 has no pad token by default. Other special tokens could be used instead; the pad token only fills short texts up to the context length.

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=context_len, padding='max_length')

train_dataset_tokenized = train_dataset.map(tokenize_function)
test_dataset_tokenized = test_dataset.map(tokenize_function)

In [76]:
# Taking a look at the headers of the tokenized dataset
# We only really care about 'text' and 'input_ids', which is the tokenized text.
test_dataset_tokenized

Dataset({
    features: ['id', 'url', 'title', 'text', 'input_ids', 'attention_mask'],
    num_rows: 60
})

Step 2: Create the model

In [77]:
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

Step 3: Create a perplexity metric and a `compute_metric` function to measure the perplexity

In [78]:
#NOTE: The function below calculates perplexity for each iteration. It is not meant to be used for calculating the perplexity on the test set at the end.

from evaluate import load
import numpy as np

perplexity = load("perplexity", module_type="metric")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Decode the predictions to get the predicted text
    predicted_text = [tokenizer.decode(p) for p in predictions]

    return {"perplexity": perplexity.compute(predictions=predicted_text, model_id='gpt2')['mean_perplexity']}


Step 4: Train the model

In [79]:
#training

training_args = TrainingArguments(
    per_device_train_batch_size=2, # Setting the batch size low helps with memory issues.
    per_device_eval_batch_size=2, # Setting the batch size low helps with memory issues.
    logging_dir='./pre_trained/logs',
    evaluation_strategy="epoch", # Evaluating once per epoch instead of every few steps ("steps") speeds up training.
    logging_strategy="epoch", # Logging once per epoch instead of every few steps ("steps") keeps the log short.
    save_strategy="epoch", # Checkpoints are saved once per epoch; this must match evaluation_strategy when load_best_model_at_end=True.
    save_total_limit=1,  # Limits how many checkpoints are kept on disk; the best checkpoint is retained so it can be reloaded at the end.
    output_dir="./pre_trained/results",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    # learning_rate=5e-4,
    num_train_epochs=10, # Sets number of epochs to 10.
    load_best_model_at_end=True, # IMPORTANT: this is what loads the best model at the end of training.
    metric_for_best_model="eval_loss", # Optional as "eval_loss" is the default value, but emphasizes that we care about best on "eval" set, not train set.
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) # Batches the examples and, with mlm=False, builds causal-LM labels by copying input_ids (pad positions are set to -100 so the loss ignores them).

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset_tokenized,
    eval_dataset=test_dataset_tokenized,
    # compute_metrics=compute_metrics  # This line calls compute_metrics to compute the perplexity per epoch. However, in case you are having memory issues, it could be commented out for better efficiency.
)


In [80]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.9774,3.270028
2,2.6686,3.298128
3,2.4628,3.363056
4,2.3049,3.433841
5,2.1834,3.501757
6,2.0731,3.573758
7,1.9881,3.630997
8,1.9201,3.66801
9,1.872,3.703483
10,1.827,3.730242


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=1500, training_loss=2.2277340698242187, metrics={'train_runtime': 656.8152, 'train_samples_per_second': 4.567, 'train_steps_per_second': 2.284, 'total_flos': 783876096000000.0, 'train_loss': 2.2277340698242187, 'epoch': 10.0})
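
With `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the checkpoint reloaded at the end of training is the one with the lowest validation loss. As a minimal sketch in plain Python (the values are copied from the table above; the variable names are made up for illustration), that selection amounts to:

```python
# Validation losses copied from the training log above (epoch -> eval loss).
eval_losses = {
    1: 3.270028, 2: 3.298128, 3: 3.363056, 4: 3.433841, 5: 3.501757,
    6: 3.573758, 7: 3.630997, 8: 3.668010, 9: 3.703483, 10: 3.730242,
}

# Pick the epoch whose validation loss is smallest.
best_epoch = min(eval_losses, key=eval_losses.get)
print(best_epoch, eval_losses[best_epoch])
```

Here the minimum is reached at epoch 1, which is consistent with the test loss of about 3.27 that `trainer.evaluate()` reports later in this notebook.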

Step 5: Save the best model to your Google Drive to path `/content/drive/MyDrive/IS883_HW2/wiki_best_model`

In [81]:
saved_model_name = "/content/drive/MyDrive/IS883_HW2/wiki_best_model"

trainer.save_model(saved_model_name)

Step 6: Now load the model back and assign it to `best_model`

In [83]:
best_model = GPT2LMHeadModel.from_pretrained(saved_model_name).to(device)

2. **(1 point)** Write a function that generates text using `best_model`. This function takes the following parameters **(0.25 points)**:

  - *temperature*: has a default value 1.0.
  - *max_gen_tokens*: specifies the maximum number of tokens in the generated text. Default value is 40.
  - *prefix*: default value `tokenizer.bos_token` (i.e., beginning of sentence token).

Each time the function is called, it generates 5 possible unique texts. Also, use sampling to avoid generating identical texts. **(0.25 points)**

Use the function and generate some texts with different temperatures and prefixes. Comment on the quality of the model. **(0.5 points)**


In [67]:
def gen_text(model, temperature=1.0, max_gen_tokens=40, prefix=tokenizer.bos_token):

  # Tokenize input text
  inputs = tokenizer(prefix, return_tensors="pt", truncation=True, max_length=context_len, return_special_tokens_mask=True)

  # Move inputs to GPU
  inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}

  # Generate text using the best model
  output_ids = model.generate(inputs["input_ids"],
                              # no_repeat_ngram_size=5,
                              num_return_sequences=5, # Returns 5 texts as asked above
                              attention_mask=inputs["attention_mask"],
                              max_new_tokens=max_gen_tokens,
                              temperature=temperature,
                              pad_token_id=tokenizer.pad_token_id,
                              do_sample=True) # IMPORTANT: You need to turn sampling on. Otherwise, you will always get the same generated text.

  # print the sentences out
  for i in range(5):
    print('----')

    # Decode the generated text
    generated_text = tokenizer.decode(output_ids[i].tolist(), skip_special_tokens=True)

    print(generated_text)
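
To see why `do_sample=True` matters, here is a toy sketch in plain Python. The vocabulary and probabilities are made up for illustration and have nothing to do with GPT2's real distribution. Greedy decoding always picks the most likely token, while sampling draws from the distribution and so can differ between calls.

```python
import random

# Hypothetical next-token distribution over a tiny made-up vocabulary.
vocab = ["the", "a", "April", "music"]
probs = [0.5, 0.3, 0.15, 0.05]

def greedy_pick(vocab, probs):
    # Always returns the single most likely token.
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best_index]

def sample_pick(vocab, probs, rng):
    # Draws one token at random, weighted by its probability.
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random(0)  # seeded only to make the demo repeatable
greedy_tokens = {greedy_pick(vocab, probs) for _ in range(200)}
sampled_tokens = {sample_pick(vocab, probs, rng) for _ in range(200)}

print("greedy:", greedy_tokens)
print("sampled:", sampled_tokens)
```

The greedy set contains a single token no matter how often the function is called, which is the toy analogue of getting five identical texts when sampling is switched off.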

Call the function here to generate 5 different texts. The texts should not be identical.

In [24]:
gen_text(best_model)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


----
'The idea of this would have to have been completely different…' he told KAIT (Kabila), explaining the 'how a new person would find that same motivation in a new person
----
A year-long investigation by The New York Times found that the number of people arrested for sex crimes increased for every court-ordered misdemeanor count, up $12.5 million compared to the same period
----

The U.S. Defense Information Agency, which provides the spy agency with information about military threats, has recently been criticized for revealing sensitive information about American civilian casualties during the Syrian war, but has
----
- Updated for July 22-1922

After years of debate on how the state must comply with the State Constitution, California Supreme Court judge Barry Lee ruled last July 22 that the state did have
----
There had been several changes of the state of Arkansas last year, especially with our current government.

I know we are moving a lot of money, but if they can provide som

In [50]:
gen_text(best_model, temperature=1, max_gen_tokens=100, prefix="Robert Walpole")

----
Robert Walpole / Staff Photographer A large tent with a plastic-wrapped table rests atop the makeshift bed.

A large tent with a plastic-wrapped table rests atop the makeshift bed.

A massive tent is mounted on a large wooden crate, and there is also a smaller bed for sleeping under.

These tents have a small wooden seat up top for the tent.

A large tent is mounted on a large wooden crate and there is also a smaller bed for sleeping under.

----
Robert Walpole, author of "The End of the World": Obama and Globalism by Stephen Walt Jr.

H/T Daily Beast
----
Robert Walpole/Getty Images

The Seahawks' offense played solid but didn't quite come close to reaching the point of being a true playoff contender at least in the way Washington's defense played the previous month. The Seahawks won a game by 13 points in a 20-point loss to the New Vikings earlier in the day. However, they didn't have the opportunity they had in the NFC East to make an impact. They simply lost to the Vikings wit

In [43]:
gen_text(best_model, temperature=0.2, max_gen_tokens=100, prefix="Robert Walpole")

----
Robert Walpole, a former president of the American Civil Liberties Union, said the bill would "make it harder for people to get jobs."

"It's going to make it harder for people to get jobs," he said. "It's going to make it harder for people to get jobs."

The bill would also allow employers to deny benefits to people who have been convicted of a felony, and would require them to pay a penalty of up to $500 for each conviction.

The bill
----
Robert Walpole, a former U.S. ambassador to the United Nations, said the U.S. was "deeply concerned" about the situation in Syria.

"We are concerned about the situation in Syria and we are concerned about the situation in Iraq," Walpole said. "We are concerned about the situation in Syria and we are concerned about the situation in Iraq."

The U.S. has been sending troops to Syria to help fight the Islamic State, which has seized large sw
----
Robert Walpole, a former U.S. ambassador to the United Nations, said that the U.S. was "not going to

In [40]:
gen_text(best_model, temperature=1.7, max_gen_tokens=100)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


----
, the state where it was used for public works

and construction. (It started its own project in 2012 after state officials failed and it ended this summer, with a $50 million purchase. That investment has yielded some new and "supervised labor." Now you got some bad bad ones here. And it takes years-term fixes as engineers add features and improvements before its day starts getting bigger and grayer.) Many cities around New England got a lot right. Other companies saw better pay—
----
Boom is just the latest example of many of the challenges associated with tackling "Big Blue Hearts." First and foremost, it helps ensure every child feels safe for the summer as this practice makes for great games to help kids be involved in activities which are culturally accepted, nurtured and encouraged in daily life. Bingo for young children from kindergarten to senior.

According to Keesmiller, and likely to get more traction these days than in past year so it'll also work much like other
----

**Answer:**

- Remember that the model is pre-trained. It has already seen text other than Wikipedia articles and can produce sensible text even without fine-tuning.
- When generating text from a prefix that appeared in the training data, the model does not simply repeat the training data (i.e., it has not memorized the training set; the model kept is the epoch-1 checkpoint, which is still close to the pretrained weights).
- Notice that the test loss goes up while the training loss keeps going down, i.e., the model overfits as training continues. After all, the training set is quite small and cannot be expected to capture what a "Wikipedia article" really looks like. So, the best model on the *test set* is the one from epoch 1. Later checkpoints would be better at producing text that matches the training set, but that is not the same as generating generic Wikipedia text.
- The impact of *temperature*, *prefix* and generation length (`max_gen_tokens`) can be seen from the experiments above: at temperature 0.2 the text is fluent but repetitive, while at 1.7 it becomes rambling and loosely connected.
- Note that even if training were perfect, the generated text would not be guaranteed to be factual. Sounding like Wikipedia and being factual are not the same thing.
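
The effect of temperature can be sketched in a few lines of plain Python. The logits below are made up; the point is only that dividing the logits by a temperature below 1 concentrates probability on the top token, while a temperature above 1 flattens the distribution. That matches the repetitive output at 0.2 and the rambling output at 1.7 above.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four tokens

for temperature in (0.2, 1.0, 1.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```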


3. **(1 point)** Calculate the perplexity of `best_model` on the test set **(0.5 point)**.

Generally, a perplexity lower than 30 is desired. Have you been able to achieve it? If not, would you expect more hyper-parameter tuning to solve the issue? Elaborate and reflect on your answers. **(0.5 point)**.


In [84]:
# Since the question leaves it open, there are multiple ways to calculate the perplexity.

# The simplest and easiest way to calculate the perplexity is to exponentiate the loss. Several places provide code for this through a Google search:
# https://discuss.huggingface.co/t/guide-the-best-way-to-calculate-the-perplexity-of-fixed-length-models/193/8
# https://github.com/huggingface/transformers/blob/0baa9246cb1ddac355db1df7824a521426599eb7/docs/source/en/perplexity.md?plain=1#L122

import numpy as np

results = trainer.evaluate()
print(f"Test Loss: {results['eval_loss']}")
print(f"Perplexity (method 1): {np.e ** results['eval_loss']}")


# Other, more technically complex ways exist. But, since they are iterative in nature, they may provide a different answer (i.e., different implementations may lead to different numbers):
# https://huggingface.co/docs/transformers/perplexity
# https://github.com/huggingface/transformers/blob/0baa9246cb1ddac355db1df7824a521426599eb7/docs/source/en/perplexity.md?plain=1#L122

from torch.nn import functional as F
import torch

def calculate_perplexity(model, test_dataset, device='cuda'):
    model = model.to(device)
    model.eval()

    total_loss = 0.0
    total_count = 0

    with torch.no_grad():
        for batch in test_dataset:
            inputs = torch.tensor(batch['input_ids']).to(device)
            targets = torch.tensor(batch['input_ids']).to(device)
            mask = torch.tensor(batch['attention_mask']).to(device) if 'attention_mask' in batch else None

            outputs = model(inputs, attention_mask=mask)
            logits = outputs.logits

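            # Shift the logits and labels to compute loss
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = targets[..., 1:].contiguous()

            # NOTE: padded positions are not excluded here, so predictions on padding
            # tokens count toward the loss. trainer.evaluate() ignores them (labels of -100),
            # which is one likely reason the two methods give very different numbers.
            loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))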
            total_loss += loss.item() * inputs.size(0)
            total_count += inputs.size(0)

    return torch.exp(torch.tensor(total_loss / total_count)).item()

print("Perplexity (method 2): ", calculate_perplexity(best_model, test_dataset_tokenized))






Test Loss: 3.2700276374816895
Perplexity (method 1): 26.312066532474006
Perplexity (method 2):  1394.543701171875
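
The gap between the two numbers is easier to reason about with the definition in hand: perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below uses plain Python and made-up per-token losses (not values from this notebook) to show how averaging over padded positions, rather than only over real tokens, changes the result. Whether this fully explains the gap seen here is an assumption and has not been verified in this notebook.

```python
import math

def perplexity(token_nlls, mask=None):
    # token_nlls: negative log-likelihood of each target token (natural log).
    # mask: 1 for real tokens, 0 for padding; None means "count everything".
    if mask is None:
        mask = [1] * len(token_nlls)
    kept = [nll for nll, m in zip(token_nlls, mask) if m == 1]
    return math.exp(sum(kept) / len(kept))

# Hypothetical losses: three real tokens followed by two padded positions
# on which the model happens to be badly wrong.
token_nlls = [3.0, 3.5, 3.3, 9.0, 9.5]
mask = [1, 1, 1, 0, 0]

print("masked:  ", perplexity(token_nlls, mask))
print("unmasked:", perplexity(token_nlls))

# Sanity check against the notebook's method 1: exp(test loss).
print(math.exp(3.2700276374816895))
```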


4. **(1.5 point)** Now, train a new GPT2. This model `model_from_scratch` is identical to `best_model`, except that it is trained **from scratch**. **(0.5 point)**
Once done:

  - Calculate the perplexity on the test set. **(0.25 point)**
  - Generate some texts. **(0.25 point)**
  - Which model is better `best_model` or `model_from_scratch`? Justify and reflect on your answers. **(0.5 point)**

Create the model and train.

In [85]:
from transformers import GPT2Config, GPT2LMHeadModel

# IMPORTANT: Initializing a GPT2 configuration from scratch
configuration = GPT2Config(
    vocab_size=len(tokenizer),  # vocabulary size is the same used in the previous model
    n_ctx=context_len, # Setting the same context length.
)

# Initializing a model (with random weights) from the configuration
model = GPT2LMHeadModel(configuration).to(device)

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_dir='./from_scratch/logs',
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    output_dir="./from_scratch/results",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    num_train_epochs=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer2 = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset_tokenized,
    eval_dataset=test_dataset_tokenized,
)


trainer2.train()

Epoch,Training Loss,Validation Loss
1,7.6995,7.806784
2,6.4594,7.6607
3,6.1385,7.612943
4,5.8934,7.550513
5,5.6717,7.553468
6,5.4728,7.539848
7,5.2992,7.539625
8,5.1578,7.54989
9,5.0194,7.583177
10,4.9345,7.580673


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=1500, training_loss=5.774607055664062, metrics={'train_runtime': 641.6236, 'train_samples_per_second': 4.676, 'train_steps_per_second': 2.338, 'total_flos': 783876096000000.0, 'train_loss': 5.774607055664062, 'epoch': 10.0})

In [86]:
model_from_scratch = trainer2.model

In [87]:
saved_model_name2 = "/content/drive/MyDrive/IS883_HW2/wiki_from_scratch"

trainer2.save_model(saved_model_name2)
model_from_scratch = GPT2LMHeadModel.from_pretrained(saved_model_name2).to(device)

Calculate perplexity

In [88]:
results = trainer2.evaluate()  # Evaluates on eval_dataset, i.e., test_dataset_tokenized
print(f"Test Loss: {results['eval_loss']}")
print(f"Perplexity (method 1): {np.e ** results['eval_loss']}")

print("Perplexity (method 2): ", calculate_perplexity(model_from_scratch, test_dataset_tokenized))

Test Loss: 7.53962516784668
Perplexity (method 1): 1881.124786943412
Perplexity (method 2):  99111.6796875


Generate Texts

In [53]:
gen_text(model_from_scratch, temperature=1, max_gen_tokens=100, prefix="Robert Walpole")

----
Robert Walpole is a part of living in the word in the Kingdom. When people can be a person's world.

History 


|-the people who have always called a number, a computer into two in the same as a type of different, it as a "to-speaking part of these countries.


Ath century
The word "the person's part of the United people have a lot of a planet, a person when a lot of the same as a "
----
Robert Walpole is the make of the first. It is a country in the world's is the same day of the planet. It is the other called country in the Latin of the year. It was a one of the country in the same day of the city of the most of a "which or it was an country to all the other people who have a country in the following by the first day of the world. In the week as the same day of the same day of the week as the same day
----
Robert Walpole is a kind of the study of people. It is, the language. 

Ar example, a long time, or a person does not be in other person or any country or group of the person 

**Answer:**
- The fine-tuned model is clearly better based on the loss and perplexity. You can also see that the trained-from-scratch model produces nonsensical text. It has little linguistic knowledge because it has only ever seen the 300 training articles, whereas the pretrained model already had linguistic knowledge before we fine-tuned it on Wikipedia text.
- Regardless of which perplexity implementation you used, the pretrained model should yield a much smaller perplexity than the from-scratch model.

Delete your model and clear `cuda` cache for next experiment.

In [52]:
# While optional, cleaning things up after using them always helps with memory issues

del model_from_scratch
del best_model
del model
del wikipedia_simple_dataset
del train_dataset
del test_dataset

import gc
gc.collect()

30

In [53]:
# Clear GPU cache
torch.cuda.empty_cache()

##**Part 2: Using a language model for translation**

Here, you will use an *appropriate* language model of your choice and train it on a dataset that has English-to-French song translations.

In [4]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Nicolas-BZRD/Original_Songs_Lyrics_with_French_Translation")

# Define a function that keeps only rows where both 'original_version' and 'french_version' are present
def filter_rows(example):
    return example['original_version'] is not None and example['french_version'] is not None

# Filter the dataset
dataset = dataset.filter(filter_rows)

print("An example row from this dataset")
dataset['train'][0]

An example row from this dataset


{'artist_name': 'The Beatles',
 'album_name': 'Beatles For Sale',
 'year': 1964,
 'title': 'Rock and Roll Music',
 'number': 4,
 'original_version': "chorus\nJust let me hear some of that rock and roll music\nAny old way you choose it\nIt's got a back beat, you can't lose it\nAny old time you use it\nIt's gotta be rock and roll music\nIf you wanna dance with me\nIf you wanna dance with me\nI've got no kick against modern jazz\nUnless they try to play it too darn fast\nAnd lose the beauty of the melody\nUntil they sound just like a symphony\nThat's why I go for that that rock and roll music\nAny old way you choose it\nIt's got a back beat, you can't lose it\nAny old time you use it\nIt's gotta be rock and roll music\nIf you wanna dance with me\nIf you wanna dance with me\nI took my loved one over across the tracks\nSo she can hear my man awail a sax\nI must admit they have a rocking band\nMan, they were blowing like a hurricane\nThat's why I go for that that rock and roll music\nAny old

  - Split the dataset into a training set (the first 300 songs) and a test set (the last 60 songs).


In [5]:
# Check the total number of rows in the dataset
total_rows = len(dataset["train"])

# Ensure the dataset has at least 360 rows (300 for training + 60 for testing)
if total_rows < 360:
    raise ValueError("The dataset has fewer than 360 rows.")

# Define the indices for splitting the dataset
train_end_idx = 300  # The end index for the training set
test_start_idx = total_rows - 60  # The start index for the test set

# Split the dataset into training and test sets
train_dataset = dataset["train"].select(range(train_end_idx))
test_dataset = dataset["train"].select(range(test_start_idx, total_rows))

# Print the number of rows in training and test sets
print(f"Number of rows in training set: {len(train_dataset)}")
print(f"Number of rows in test set: {len(test_dataset)}")


Number of rows in training set: 300
Number of rows in test set: 60


In [6]:
train_dataset

Dataset({
    features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language'],
    num_rows: 300
})

1. **(1.5 point)** Choose a good **pre-trained** model for this task **(0.5 point)**. Explain your criteria for choosing this model. **(0.5 point)** It is highly recommended to select one from [HuggingFace official pre-trained models](https://huggingface.co/docs/transformers/index) or [HuggingFace user pre-trained models](https://huggingface.co/models)

Create the tokenizer. Use a `max_length` of 512. Remove all columns unnecessary for the translation.

In [10]:
max_length=512

In [11]:
from transformers import BartTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import BartConfig, BartForConditionalGeneration
from datasets import load_dataset


# Define the tokenizer
model_name = 'facebook/bart-base' # BART, a widely used encoder-decoder model; encoder-decoder architectures suit sequence-to-sequence tasks such as translation.
tokenizer = BartTokenizer.from_pretrained(model_name)

# Tokenize the dataset
def tokenize_data(example):
    input_text = example['original_version']  # Assuming this is the source sequence
    target_text = example['french_version']  # Assuming this is the target sequence

    # Tokenize the source text (i.e., English)
    inputs = tokenizer(
        input_text,
        max_length=max_length,
        truncation=True,
        return_tensors='pt',
        padding='max_length',
        return_attention_mask=True
    )

    # Tokenize the target text (i.e., French)
    with tokenizer.as_target_tokenizer():
        targets = tokenizer(
            target_text,
            max_length=max_length,
            truncation=True,
            return_tensors='pt',
            padding='max_length',
            return_attention_mask=True
        )

    # Ensure the attention masks are 2D tensors
    inputs['attention_mask'] = inputs['attention_mask'].squeeze()
    targets['attention_mask'] = targets['attention_mask'].squeeze()

    # Prepare the model inputs
    inputs['labels'] = targets['input_ids'].squeeze() # The labels for us are the tokens of the target language (French)
    inputs['input_ids'] = inputs['input_ids'].squeeze()

    return inputs


# Tokenize the datasets
tokenized_train_dataset = train_dataset.map(tokenize_data, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_data, batched=True)






In [12]:
tokenized_train_dataset

Dataset({
    features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 300
})
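
One detail worth knowing about the tokenized `labels`: with `padding='max_length'`, they contain pad token ids, and unless those positions are replaced they count toward the training loss. A common convention is to set them to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below is plain Python on made-up ids; `PAD_ID` and `mask_padding` are hypothetical names, and this step is not part of the notebook's original pipeline.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_ID = 1           # hypothetical pad id; in practice read tokenizer.pad_token_id

def mask_padding(label_ids, pad_id=PAD_ID):
    # Replace every pad id so the loss skips those positions.
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]

labels = [0, 3684, 2156, 2, 1, 1, 1]  # made-up ids, right-padded
print(mask_padding(labels))  # [0, 3684, 2156, 2, -100, -100, -100]
```

Applying such a step would change the reported training and validation losses, so the numbers in this notebook should be read as including padded positions.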

Create the model

In [13]:
model = BartForConditionalGeneration.from_pretrained(model_name).to(device)

**Answer:** Here, we chose BART, an encoder-decoder model, because encoder-decoder architectures are well suited to sequence-to-sequence tasks such as translation. A decoder-only model would also have worked.

Train the model. **(0.5 point)**

  2. **(0.5 point)** You might find that your notebook runs out of memory or takes too long to train. What hyper-parameter could you change to address that?

**Answer:** Generally, the two most effective ways to reduce memory use are to
- decrease the batch size.
- choose a smaller model.

There are other hyper-parameters that could help, such as [gradient accumulation](https://towardsdatascience.com/what-is-gradient-accumulation-in-deep-learning-ec034122cfa).
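
Gradient accumulation trades time for memory: several small micro-batches are processed one after another and their gradients are averaged before a single optimizer step, so the update matches that of one larger batch. In `TrainingArguments` this is controlled by `gradient_accumulation_steps`. The toy sketch below (plain Python, a one-parameter least-squares model with made-up data) checks that equivalence for equal-sized micro-batches.

```python
def mse_gradient(w, batch):
    # Gradient of mean((w*x - y)^2) with respect to w over one batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
w = 0.5

# One big batch of 6 examples.
full_gradient = mse_gradient(w, data)

# Three micro-batches of 2 examples, gradients averaged before the update.
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated = sum(mse_gradient(w, mb) for mb in micro_batches) / len(micro_batches)

print(full_gradient, accumulated)
```

In this notebook's setting, `per_device_train_batch_size=2` with `gradient_accumulation_steps=3` would give the same effective batch size as the `per_device_train_batch_size=6` used below, at lower peak memory.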

In [20]:

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    logging_dir='./translation/logs',
    output_dir='./translation/results',
    save_total_limit=1,
    num_train_epochs=10,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,
)


# Define trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.0086,2.154631
2,1.8063,2.03986
3,1.6444,1.991508
4,1.5322,1.902242
5,1.4316,1.845371
6,1.351,1.821475
7,1.2973,1.778309
8,1.2495,1.773595
9,1.2117,1.760024
10,1.194,1.76048


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=500, training_loss=1.4726555862426758, metrics={'train_runtime': 666.4027, 'train_samples_per_second': 4.502, 'train_steps_per_second': 0.75, 'total_flos': 914604687360000.0, 'train_loss': 1.4726555862426758, 'epoch': 10.0})
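
The `global_step=500` in the output above follows directly from the dataset and batch sizes. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (the helper name `total_steps` is made up for illustration):

```python
import math

def total_steps(num_examples, batch_size, num_epochs):
    # Optimizer steps per epoch (a final partial batch still counts), times epochs.
    return math.ceil(num_examples / batch_size) * num_epochs

print(total_steps(300, 6, 10))  # translation run, batch size 6 -> 500
print(total_steps(300, 2, 10))  # Part 1 runs, batch size 2 -> 1500
```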

In [21]:
best_model_q2 = trainer.model

3. **(1 point)** Translate the following two sentences **(0.5 point)**. Would your model make a good English-to-French translator? Justify your answer **(0.5 point)**.

  - "Just let me hear some of that rock and roll music"
  - "If you wanna dance with me\nI've got no kick against modern jazz"

In [22]:
import torch

def translate_sentence(sentence, model, tokenizer):
    # Ensure the model is in evaluation mode
    model.eval()

    # Tokenize the input sentence
    inputs = tokenizer(
        text=sentence,  # Source sequence
        return_tensors='pt',  # Return PyTorch tensors
        max_length=max_length,  # Max length for the source sequence
        truncation=True,  # Truncate the sequence if it's too long
        padding=False,
    )

    # Move tensors to the same device as the model (works on both CPU and GPU)
    inputs = {name: t.to(model.device) for name, t in inputs.items()}

    # Generate translation using the model
    with torch.no_grad():  # No need to track the gradients
        outputs = model.generate(**inputs, max_length=100, num_beams=4, early_stopping=True)

    # Decode the generated IDs to get the translated sentence
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(sentence, '->', translated_sentence)


In [29]:
translate_sentence("Just let me hear some of that rock and roll music", best_model_q2, tokenizer)

Just let me hear some of that rock and roll music -> Just let me hear some of that rock and roll music


In [25]:
translate_sentence("If you wanna dance with me\nI've got no kick against modern jazz", best_model_q2, tokenizer)

If you wanna dance with me
I've got no kick against modern jazz -> Si tu veux dance avec moi
Je n'ai jamais quelque chose de modern jazz


**Answer:** The translation is not fully working yet: the first sentence comes back untranslated and the second is only partially translated. Let's train a bit more.

In [31]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.2211,1.75835
2,1.1359,1.732679
3,1.065,1.712574
4,0.9948,1.693031
5,0.9357,1.698196
6,0.8876,1.692624
7,0.8462,1.685015
8,0.819,1.691711
9,0.7866,1.686779
10,0.7726,1.69182


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=500, training_loss=0.9464482650756836, metrics={'train_runtime': 689.429, 'train_samples_per_second': 4.351, 'train_steps_per_second': 0.725, 'total_flos': 914604687360000.0, 'train_loss': 0.9464482650756836, 'epoch': 10.0})

In [32]:
translate_sentence("Just let me hear some of that rock and roll music", best_model_q2, tokenizer)

Just let me hear some of that rock and roll music -> Juste laissez-moi entendre cette musique rock et roll.


In [33]:
translate_sentence("If you wanna dance with me\nI've got no kick against modern jazz", best_model_q2, tokenizer)

If you wanna dance with me
I've got no kick against modern jazz -> Si tu veux danser avec moi
Je n'ai plus kick de modern jazz


**Answer:** Looks great, doesn't it? Let's try some other sentences.

In [37]:
translate_sentence("Drink some water.", best_model_q2, tokenizer)
translate_sentence("Got to work.", best_model_q2, tokenizer)
translate_sentence("Sleep early.", best_model_q2, tokenizer)

Drink some water. -> Drink some water.
Got to work. -> Got to work.
Sleep early. -> Sleep early.


**Answer:**
- The model translates the first two sentences fairly well (you can check against Google Translate). However, it fails to translate other simple sentences and returns them unchanged.
- If we look closely at the first two sentences, they were extracted from the training set (you can find them in the example song at the beginning of Q2).
- So, our model is not actually good at translation. Rather, it has "memorized" the training songs.
- The model does not work well on other sentences because it was not pre-trained for translation and has only seen the provided training translations.
- However, if the model had been pre-trained for translation, it would give reasonable translations either way, though the returned translations would differ from the training lyrics.
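One way to check the memorization argument is to test whether an input sentence occurs verbatim in the English side of the training data. The sketch below is illustrative only: `normalize`, `seen_in_training`, and `toy_training_sources` are hypothetical names, and the toy list merely stands in for the real training songs using the two lyric lines quoted in this question.

```python
def normalize(text):
    """Lowercase and collapse all whitespace (including newlines) to single spaces."""
    return " ".join(text.lower().split())


def seen_in_training(sentence, training_sources):
    """Return True if the sentence occurs verbatim inside any training source text."""
    needle = normalize(sentence)
    return any(needle in normalize(source) for source in training_sources)


# Hypothetical stand-in for the English lyrics in the training set.
toy_training_sources = [
    "Just let me hear some of that rock and roll music",
    "If you wanna dance with me\nI've got no kick against modern jazz",
]

test_sentences = [
    "Just let me hear some of that rock and roll music",
    "Drink some water.",
]
for sentence in test_sentences:
    seen = seen_in_training(sentence, toy_training_sources)
    print(repr(sentence), "-> seen in training:", seen)
```

A fair evaluation would only use sentences for which this check returns `False`.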

4. **(0.5 point)** What would be a good metric for measuring the performance of this model? Could you calculate it for this pair of model and dataset? If yes, show your results and discuss them. If no, elaborate on the reason and how you would go about solving it.

**Answer:** One could use metrics such as [BLEU or ROUGE](https://stackoverflow.com/questions/38045290/text-summarization-evaluation-bleu-vs-rouge). However, these metrics require *human reference translations* as part of the dataset, which we do not have here. So, while you could certainly produce such translations and calculate the metrics ([see this example](https://huggingface.co/spaces/evaluate-metric/bleu)), doing so is beyond the scope of this assignment.
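To make the BLEU idea concrete, here is a simplified, sentence-level sketch in pure Python: clipped n-gram precision combined with a brevity penalty. It is not the implementation behind the `evaluate` BLEU metric (no smoothing, whitespace tokenization, a single reference), and `simple_bleu` is a hypothetical helper. Since no human references are available, the example compares the model's first-run output with its second-run output for the same line, purely to show the mechanics.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # No smoothing: any zero precision gives a zero score.
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Illustration only: the second-run output stands in for a reference translation.
first_run = "Si tu veux dance avec moi"
second_run = "Si tu veux danser avec moi"
print("BLEU-2:", round(simple_bleu(first_run, second_run, max_n=2), 4))
print("BLEU-4:", simple_bleu(first_run, second_run, max_n=4))
```

The drop between the 2-gram and 4-gram scores shows how harsh unsmoothed sentence-level BLEU is on short sentences, which is one reason BLEU is normally reported at corpus level over many reference pairs.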