<a href="https://colab.research.google.com/github/elhamod/IS883/blob/main/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Using Transformers for language modeling**

In this assignment, you will experiment with using transformers to solve two different language modeling problems: text generation and translation.

- Some packages you may need. You are free to use alternative ones, but this should make your task simpler.

In [None]:
# You only need to run this once when you load the notebook to install required packages. You can comment this cell out once you run it.

# !pip install torch
!pip install datasets
!pip install apache_beam mwparserfromhell
!pip install transformers[torch]
!pip install sentence_transformers
!pip install evaluate
!pip install accelerate -U

 - Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

- Check whether a GPU is available. If so, the cell below should print `cuda`.

In [None]:
import torch

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

##**Part 1: Using a Transformer to model Wikipedia text**

You will use a GPT2 Transformer to model the [simple Wikipedia dataset](https://huggingface.co/datasets/wikipedia/viewer/20220301.simple/train). Our goal is to generate Wikipedia-style articles that are novel but also believable.

- Load the dataset

In [None]:
from datasets import load_dataset

wikipedia_simple_dataset = load_dataset("wikipedia", "20220301.simple")

print("dataset structure is", wikipedia_simple_dataset)

# Index the row first so only one article is read, not the whole "text" column
print("an example of a training sequence is", wikipedia_simple_dataset["train"][0]["text"])

- Split the dataset into a training set (the first 300 articles) and a test set (the last 60 articles)

In [None]:
# Check the total number of rows in the dataset
total_rows = len(wikipedia_simple_dataset["train"])

# Define the indices for splitting the dataset
train_end_idx = 300  # The end index for the training set
test_start_idx = total_rows - 60  # The start index for the test set

# Ensure the dataset has at least 360 rows (300 for training + 60 for testing)
if total_rows < 360:
    raise ValueError("The dataset has fewer than 360 rows.")

# Split the dataset into training and test sets
train_dataset = wikipedia_simple_dataset["train"].select(range(train_end_idx))
test_dataset = wikipedia_simple_dataset["train"].select(range(test_start_idx, total_rows))

1. **(1 point)** Fine-tune a *pretrained* GPT2 transformer with a context length of 512 tokens (with padding). While training:
  - Print the training and test losses every epoch.
  - Save the model that performs best on the **test set** as `best_model`
  - Train for 10 epochs
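
The bookkeeping behind "save the model that performs best on the test set" can be sketched independently of any training library. The snippet below is only an illustration, assuming the per-epoch test losses are already available. The helper `track_best_epoch` and the loss values are hypothetical; a real run would save a checkpoint whenever the best loss improves.

```python
def track_best_epoch(test_losses):
    """Return (best_epoch, best_loss) for a list of per-epoch test losses.

    Hypothetical helper: it only remembers which epoch was best. In a real
    training loop you would also save the model at that point.
    """
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(test_losses, start=1):
        print(f"epoch {epoch}: test loss = {loss:.3f}")
        if loss < best_loss:
            # A lower test loss makes this epoch the new best candidate.
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss


# Made-up losses for illustration only (not results from this assignment).
example_losses = [3.9, 3.6, 3.5, 3.55, 3.7]
best_epoch, best_loss = track_best_epoch(example_losses)
```

Note that the test loss can start rising again in later epochs, which is why the best epoch is not necessarily the last one.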

Step 1: Create the tokenizer and tokenize the dataset
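
To make "a context of 512 tokens with padding" concrete, here is a library-free sketch of what padding and truncation do to a sequence of token ids. The function `pad_or_truncate`, the toy ids, and the padding id are all hypothetical. Keep in mind that GPT2's tokenizer does not define a padding token by default, so you will have to choose one (reusing the end-of-text token is a common choice).

```python
def pad_or_truncate(token_ids, context_len, pad_id):
    """Return (input_ids, attention_mask), each exactly `context_len` long."""
    ids = list(token_ids[:context_len])   # truncate sequences that are too long
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = context_len - len(ids)
    ids += [pad_id] * n_pad               # right-pad sequences that are too short
    mask += [0] * n_pad                   # 0 marks padding the model should ignore
    return ids, mask


# Toy example with a context of 5 instead of 512, and a hypothetical pad id of 0.
short_ids, short_mask = pad_or_truncate([11, 12, 13], context_len=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], context_len=5, pad_id=0)
```

The attention mask matters because padded positions should not contribute to the loss or to the attention computation.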

In [None]:
context_len = 512

Step 2: Create the model

Step 3: Create a perplexity metric and a `compute_metric` function to measure the perplexity.
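
As a reminder of the definition, perplexity is the exponential of the average per-token negative log-likelihood (the cross-entropy loss, measured with the natural logarithm). The sketch below shows only that relationship on toy numbers. The function name `perplexity_from_token_losses` is hypothetical, and this is not the `compute_metric` function you are asked to write.

```python
import math


def perplexity_from_token_losses(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that assigns probability 1/4 to every correct token
# is "as confused as" a uniform choice among 4 options.
uniform_nll = -math.log(1 / 4)
ppl = perplexity_from_token_losses([uniform_nll] * 10)
```

Here `ppl` comes out as 4 (up to floating-point error), which matches the intuition that perplexity is an effective number of equally likely choices per token.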

Step 4: Train the model

Step 5: Save the best model to your Google Drive to path `/content/drive/MyDrive/IS883_HW2/best_model_wiki`



Step 6: Now load the model back and assign it to `best_model`.

2. **(1 point)** Write a function that generates text using `best_model`. This function takes the following parameters:

  - *temperature*: has a default value 1.0.
  - *max_gen_tokens*: specifies the maximum number of tokens in the generated text. Default value is 40.
  - *prefix*: default value `tokenizer.bos_token` (i.e., beginning of sentence token).

Each time the function is called, it should generate 5 unique texts. Use sampling so that the generated texts are not identical.
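
To build intuition for the *temperature* parameter, the sketch below applies temperature scaling to a toy set of logits and samples from the result. It uses only the Python standard library. The names `softmax_with_temperature` and `sample_token` and the logit values are hypothetical, and this is not the generation function you are asked to write.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Sample one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]                          # made-up scores for 3 tokens
sharp = softmax_with_temperature(toy_logits, 0.5)     # low temperature: more peaked
flat = softmax_with_temperature(toy_logits, 2.0)      # high temperature: more uniform
token = sample_token(toy_logits, temperature=1.0, rng=random.Random(0))
```

A lower temperature concentrates probability on the top-scoring token, so samples become more repetitive; a higher temperature flattens the distribution, so samples become more varied but less coherent.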

Use the function and generate some texts with different temperatures and prefixes. Comment on the quality of the model.


Call the function here to generate 5 different texts. The texts should not be identical.

3. **(1 point)** Calculate the perplexity of `best_model` on the test set.

Generally, a perplexity lower than 30 is desired. Have you been able to achieve it? If not, would you expect more hyper-parameter tuning to solve the issue? Elaborate and reflect on your answers.

4. **(1.5 point)** Now, train a new GPT2 model, `model_from_scratch`. It is identical to `best_model`, except that it is trained **from scratch** (i.e., starting from randomly initialized weights rather than pretrained ones).
Once done:

  - Calculate the perplexity on the test set.
  - Generate some texts.
  - Which model is better, `best_model` or `model_from_scratch`? Justify and reflect on your answers.

Create the model and train.

Calculate perplexity

Generate texts

Delete your model and clear `cuda` cache for next experiment.




In [None]:
# After deleting your model variables with `del`, release the cached GPU memory
import gc
gc.collect()
torch.cuda.empty_cache()

##**Part 2: Using a language model for translation**

Here, you will use an *appropriate* language model of your choice and train it on a dataset that has English-to-French song translations.

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Nicolas-BZRD/Original_Songs_Lyrics_with_French_Translation")

# Keep only rows where both 'original_version' and 'french_version' are present (not None)
def filter_rows(example):
    return example['original_version'] is not None and example['french_version'] is not None

# Filter the dataset
dataset = dataset.filter(filter_rows)

print("An example row from this dataset")
dataset['train'][0]

  - Split the dataset into a training set (the first 300 songs) and a test set (the last 60 songs).



In [None]:
# Check the total number of rows in the dataset
total_rows = len(dataset["train"])

# Ensure the dataset has at least 360 rows (300 for training + 60 for testing)
if total_rows < 360:
    raise ValueError("The dataset has fewer than 360 rows.")

# Define the indices for splitting the dataset
train_end_idx = 300  # The end index for the training set
test_start_idx = total_rows - 60  # The start index for the test set

# Split the dataset into training and test sets
train_dataset = dataset["train"].select(range(train_end_idx))
test_dataset = dataset["train"].select(range(test_start_idx, total_rows))

# Print the number of rows in training and test sets
print(f"Number of rows in training set: {len(train_dataset)}")
print(f"Number of rows in test set: {len(test_dataset)}")



1. **(1.5 point)** Choose a good **pre-trained** model for this task. Explain your criteria for choosing this model. It is highly recommended to select one from [HuggingFace official pre-trained models](https://huggingface.co/docs/transformers/index) or [HuggingFace user pre-trained models](https://huggingface.co/models)

Create the tokenizer. Use a `max_length` of 512. Remove all columns unnecessary for the translation.


In [None]:
max_length = 512

Create the model

Train the model.

  2. **(0.5 point)** You might find that your notebook runs out of memory or takes too long to train. What hyper-parameter could you change to address that?

3. **(1 point)** Translate the following two sentences. Would your model make a good English-to-French translator? Justify your answer.

  - "Just let me hear some of that rock and roll music"
  - "If you wanna dance with me\nI've got no kick against modern jazz"

4. **(0.5 point)** What would be a good metric for measuring the performance of this model? Could you calculate it for this pair of model and dataset? If yes, show your results and discuss them. If no, elaborate on the reason and how you would go about solving it.