# Pre-trained Language Models: SubTask B
## [6 Marks]

In this assignment, you will work on the [ComVE](https://competitions.codalab.org/competitions/21080) shared task that was part of SemEval-2020. The task evaluates whether a system can distinguish whether a natural language statement makes sense to humans or not, and provide a reason. **ComVE** includes three subtasks that require models to acquire and apply commonsense knowledge. In this notebook you will focus on **SubTask B**:

- Given a statement that does not make sense and three possible reasons, select which reason explains why the given statement is against common sense. For example, for the following nonsensical statement the correct answer is *Reason A*:

     *Statement*: He put an elephant into the fridge.  
     *Reason A*: An elephant is much bigger than a fridge.  
     *Reason B*: Elephants are usually white while fridges are usually white.  
     *Reason C*: An elephant cannot eat a fridge.
     
     This subtask can be approached as a Multiple Choice problem where the input is the nonsensical statement and the three possible explanations, and the output is a label indicating which of the reasons is the correct one.

You will fine-tune a Pre-trained Language Model with the [Transformers](https://huggingface.co/docs/transformers/index) library, which provides a set of tools for fine-tuning and deploying a wide variety of Pre-trained Language Models. The [Hugging Face Hub](https://huggingface.co/models) allows you to explore all the models supported by **Transformers** and even share your own models with the community. In this assignment, you will work with [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta), a model that uses **BERT**'s architecture but has been pre-trained with more data and a more carefully selected set of hyperparameters.

Fine-tuning a Pre-trained Language Model usually requires a great amount of time and computational resources. Your personal computer will probably not be enough. To complete the assignment, you can work with a reduced version of the dataset and the base version of **RoBERTa**:

In [1]:
shrink_dataset = False
base_model = False
colab = True

Although the values of these variables do not affect the tests that will evaluate your code, the output examples distributed throughout this notebook are based on the `shrink_dataset` and `base_model` variables set to `True` and the `colab` variable set to `False`.

If you want to perform a full training of the model to obtain its real performance, you can use a cloud service like [Google Colab](https://colab.research.google.com/). **Colab** is a **Jupyter** notebook environment that supports both GPU and TPU instances, which makes it possible to train large-scale Deep Learning models. Set the `shrink_dataset` and `base_model` variables to `False`, the `colab` variable to `True`, and follow the instructions provided to you to run the notebook in **Colab**.

> **Note!** To run this notebook in **Colab** you will need to upload the `datacollator.py` file included in the repository of the assignment.

In [4]:
if colab:
    ! pip install transformers datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data


In [3]:
!pip install transformers




You will use the following objects and functions:

In [2]:
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMultipleChoice,
                          TrainingArguments, Trainer,
                          enable_full_determinism)


When working with Neural Networks, there are a large number of random operations such as initializing the weights of the network, shuffling the data for training, or choosing samples. As a result, different training runs of the same model can lead to different results. To ensure reproducibility, i.e. obtaining the same results across runs, the random number generator must be initialized with a fixed value known as the seed. In Transformers, this can be done as follows:

In [3]:
enable_full_determinism(seed=42)

> **Note!** With models as complex as Neural Networks, reproducibility is susceptible to factors such as software versions or the hardware on which the models are run. Even with seed initialization, there may be slight differences in the results.
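The effect of a fixed seed can be seen in isolation with Python's standard `random` module. This is only a minimal sketch of the idea, not what `enable_full_determinism` does internally; the helper name `shuffled_indices` is hypothetical and introduced here for illustration.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 shuffled by a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, the global state is untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


run_1 = shuffled_indices(10, seed=42)
run_2 = shuffled_indices(10, seed=42)

# Two runs with the same seed visit the examples in exactly the same order.
print(run_1 == run_2)
```

Changing the seed would generally produce a different order, which is one reason why two unseeded training runs rarely give identical results.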

Working with Neural Networks also involves defining a number of hyperparameters that set the configuration of the model. Finding the appropriate hyperparameter values requires training the model with different combinations and testing them on the development set. This hyperparameter tuning is a costly process that needs multiple rounds of experimentation. However, for this assignment, you will use the following values:

In [4]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 50  # Maximum length of the input sequence
output_dir = "modelB"  # The output directory where the model will be written to

## Loading the Pre-trained Model - [1 Mark]

The first step you must perform in this assignment is to load the model and its corresponding tokenizer. **Transformers** provides support for a wide variety of pre-trained models via specific classes. However, the library can also retrieve a model automatically given just its name or path using [AutoClasses](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/auto). To fine-tune a pre-trained model for a downstream task, it is necessary to replace the original top layer of the model with a new task-specific output layer. **AutoClasses** also allow you to do this automatically for various types of Natural Language Processing tasks. For instance, `AutoModelForMultipleChoice` instantiates the model with a top layer for Multiple Choice.

You must complete the code for the `load_model` function. This function takes the name of the pre-trained model and should load and return both the model, initialized for Multiple Choice, and its corresponding tokenizer. You can get some tips from the [Transformers documentation](https://huggingface.co/docs/transformers/autoclass_tutorial).

In [5]:
from transformers import AutoTokenizer, AutoModelForMultipleChoice

def load_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)
    return model, tokenizer

In [24]:
model_name = "roberta-base" if base_model else "roberta-large"
model, tokenizer = load_model(model_name)

Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data Pre-processing - [1 Mark]

The **ComVE** dataset consists of 9997 nonsensical statements, each with 3 possible reasons, for the train set, 997 statements for development and 1000 for test. Each nonsensical statement is annotated with an `A`, `B` or `C` label depending on which reason is the correct one. The dataset can be loaded into three `DataFrames` as follows:

In [7]:
def load_data(data_csv, answers_csv, labels):
    data = pd.read_csv(data_csv).dropna()
    answers = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
    answers["label"] = answers["label"].apply(lambda x: labels.index(x))
    return pd.merge(data, answers, on="id")

In [8]:
labels = ["A", "B", "C"]
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskB_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskB_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv, labels)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskB_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskB_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv, labels)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskB_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskB_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv, labels)
if shrink_dataset:
    train_data = train_data.sample(n=100, random_state=42)
    dev_data = dev_data.sample(n=100, random_state=42)
    test_data = test_data.sample(n=100, random_state=42)
train_data

| | id | FalseSent | OptionA | OptionB | OptionC | label |
|---|---|---|---|---|---|---|
| 0 | 0 | He poured orange juice on his cereal. | Orange juice is usually bright orange. | Orange juice doesn't taste good on cereal. | Orange juice is sticky if you spill it on the ... | 1 |
| 1 | 1 | He drinks apple. | Apple juice are very tasty and milk too | Apple can not be drunk | Apple cannot eat a human | 1 |
| 2 | 2 | Jeff ran 100,000 miles today | 100,000 miles is way to long for one person to... | Jeff is a four letter name and 100,000 has six... | 100,000 miles is longer than 100,000 km. | 0 |
| 3 | 3 | I sting a mosquito | A human is a mammal | A human is omnivorous | A human has not stings | 2 |
| 4 | 4 | A giraffe is a person. | Giraffes can drink water from a lake. | A giraffe is not a human being. | .Giraffes usually eat leaves. | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 9992 | 9995 | Mark ate a big bitter cherry pie | Mark is bad at making cherry pie | a cherry pie should be big | a cherry pie should be sweet | 2 |
| 9993 | 9996 | Gloria wears a cat on her head | a hat cannot be worn on a cat's head | a cat cannot be worn on a person's head | the cat is too heavy to be worn on her head | 1 |
| 9994 | 9997 | Harry went to the barbershop to have his glass... | a barbershop usually don't provide the service... | a barbershop usually repairs computers instead... | the barbershop lacked the necessary tools to r... | 0 |
| 9995 | 9998 | Reilly is sleeping on the window | the window is open and a person cannot lay on it | the window is too cold to sleep on it | a person cannot sleep on a window | 2 |


Notice that the `load_data` function translates the labels into their corresponding numerical index: `0`, `1` and `2`.
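The label translation can be sketched on its own as follows. `encode_label` mirrors the `labels.index(x)` call inside `load_data`, while `decode_label` is a hypothetical helper (not part of the notebook) that could be used to turn a predicted index back into its letter.

```python
labels = ["A", "B", "C"]


def encode_label(letter):
    """Map a letter label to its numerical index, as load_data does."""
    return labels.index(letter)


def decode_label(index):
    """Hypothetical inverse mapping: numerical index back to letter."""
    return labels[index]


encoded = [encode_label(letter) for letter in ["B", "A", "C"]]
decoded = [decode_label(index) for index in encoded]

print(encoded)  # [1, 0, 2]
print(decoded)  # ['B', 'A', 'C']
```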

[Datasets](https://huggingface.co/docs/datasets/index) is a library for dataset management that provides a set of tools to manipulate data in an easy and efficient way. Since it is fully integrated with **Transformers**, it is very convenient to use both libraries together. **Datasets** allows accessing and sharing datasets through the [Hugging Face Hub](https://huggingface.co/datasets). The core component of this library is the [Dataset](https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.Dataset) class that implements an [Apache Arrow table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). Similar to a **pandas** `DataFrame`, a `Dataset` object stores a table where each row corresponds to an example of the dataset and each column contains a different type of data. There are different ways to load the data into a `Dataset`, for example, from a `DataFrame`:

In [9]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
train_dataset[0]


{'id': 0,
 'FalseSent': 'He poured orange juice on his cereal.',
 'OptionA': 'Orange juice is usually bright orange.',
 'OptionB': "Orange juice doesn't taste good on cereal.",
 'OptionC': 'Orange juice is sticky if you spill it on the table.',
 'label': 1}

One of the most powerful **Datasets** tools is the [map](https://huggingface.co/docs/datasets/v2.10.0/en/nlp_process#map) function, which allows pre-processing the dataset in batches. The function takes another callable as argument and applies it to every row in the `Dataset` (or to batches of rows when `batched=True`). The goal of the next exercise is to implement a function that tokenizes the statement-reason pairs and will be passed to the `map` function.

You must complete the code for the `preprocess_data` function. This function takes a batch of examples from a `Dataset`, the tokenizer returned by `load_model` and the `max_length` hyperparameter. The function should make three copies of each statement in the `FalseSent` field and pair them with each of the possible reasons in `OptionA`, `OptionB` and `OptionC`. Then, the statement-reason pairs must be tokenized jointly. The tokenizer must pad and truncate the sequences to the `max_length` value. You can use the [Preprocessing](https://huggingface.co/docs/transformers/v4.27.2/en/preprocessing) and the [Tokenizer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer) documentation as reference.

The `tokenizer` should return a [BatchEncoding](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer#transformers.BatchEncoding) object with two fields for each data example:
* *input_ids*: A list of token indices that will be used as the input of the model.
* *attention_mask*: A list of indices masking out which tokens the model should not attend to.

After running the tokenizer, `preprocess_data` should unflatten the `input_ids` and `attention_mask` corresponding to the same statement, i.e., for each example, the value of `input_ids` should be a list of three lists of token indices and, similarly, the value of `attention_mask` should be a list of three lists of masking indices. The **Transformers** documentation provides a [guide for Multiple Choice](https://huggingface.co/docs/transformers/tasks/multiple_choice) problems that you can use as reference. The `preprocess_data` function should return the output of the unflattening step.
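The pairing, flattening and unflattening steps can be sketched without a tokenizer, using plain strings in place of token indices. The helper names `pair_and_flatten` and `unflatten` are hypothetical, and the batch is assumed to be a dictionary of columns, which is the form a batched `map` call passes to its function.

```python
def pair_and_flatten(batch):
    """Build flat lists of statements and reasons, three entries per example."""
    first, second = [], []
    for statement, a, b, c in zip(batch["FalseSent"], batch["OptionA"],
                                  batch["OptionB"], batch["OptionC"]):
        first.extend([statement] * 3)  # one copy of the statement per reason
        second.extend([a, b, c])
    return first, second


def unflatten(flat, group_size=3):
    """Regroup a flat list into consecutive chunks of `group_size` items."""
    return [flat[i:i + group_size] for i in range(0, len(flat), group_size)]


batch = {
    "FalseSent": ["He drinks apple.", "I sting a mosquito"],
    "OptionA": ["Apple juice are very tasty and milk too", "A human is a mammal"],
    "OptionB": ["Apple can not be drunk", "A human is omnivorous"],
    "OptionC": ["Apple cannot eat a human", "A human has not stings"],
}

first, second = pair_and_flatten(batch)
pairs = list(zip(first, second))  # stand-in for the tokenizer output
grouped = unflatten(pairs)

print(len(first), len(second))  # 6 6
print(len(grouped), [len(group) for group in grouped])  # 2 [3, 3]
```

In the real function the tokenizer is applied to the two flat lists, and the unflattening is applied separately to its `input_ids` and `attention_mask` outputs.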

The `map` function takes the `input_ids` and `attention_mask` fields and inserts them into the `Dataset` as two new columns. For example, the result for the first row in the `Dataset` should look like:

> <pre>
{'id': 4122, 'FalseSent': 'You are likely to find a computer in the bathroom', 'OptionA': 'The computer needs to take a shower in the bathroom', 'OptionB': 'The computer may be broken in the bathroom', 'OptionC': "The computer won't walk into the bathroom", 'label': 1, '__index_level_0__': 4122, 'input_ids': [[0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 782, 7, 185, 10, 9310, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 189, 28, 3187, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 351, 75, 1656, 88, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
</pre>

The `input_ids` field contains three lists, one for each statement-reason pair. Each value in each list of `input_ids` represents a sub-word of the `tokenizer` vocabulary. For the example above, `input_ids` corresponds to the following thee sequences of sub-words:

> <pre>
> ['&lt;s&gt;', 'You', 'ƒ†are', 'ƒ†likely', 'ƒ†to', 'ƒ†find', 'ƒ†a', 'ƒ†computer', 'ƒ†in', 'ƒ†the', 'ƒ†bathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'ƒ†computer', 'ƒ†needs', 'ƒ†to', 'ƒ†take', 'ƒ†a', 'ƒ†shower', 'ƒ†in', 'ƒ†the', 'ƒ†bathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
>
> ['&lt;s&gt;', 'You', 'ƒ†are', 'ƒ†likely', 'ƒ†to', 'ƒ†find', 'ƒ†a', 'ƒ†computer', 'ƒ†in', 'ƒ†the', 'ƒ†bathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'ƒ†computer', 'ƒ†may', 'ƒ†be', 'ƒ†broken', 'ƒ†in', 'ƒ†the', 'ƒ†bathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
>
> ['&lt;s&gt;', 'You', 'ƒ†are', 'ƒ†likely', 'ƒ†to', 'ƒ†find', 'ƒ†a', 'ƒ†computer', 'ƒ†in', 'ƒ†the', 'ƒ†bathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'ƒ†computer', 'ƒ†won', "'t", 'ƒ†walk', 'ƒ†into', 'ƒ†the', 'ƒ†bathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
</pre>


Notice that the **Hugging Face** implementation of **RoBERTa**'s tokenizer uses the `<s>` token equivalently to **BERT**'s `[CLS]` token and the `</s>` token to mark both the end and the separation of the sentences. The `ƒ†` character indicates when there is a blank space before the token in the original text. This helps to know which tokens are the first sub-words of the words.
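The layout of one padded statement-reason pair can be reproduced by hand from the indices shown in the example output. The sketch below assumes the ids `0`, `2` and `1` for `<s>`, `</s>` and `<pad>`, as they appear in that output; `encode_pair` is a hypothetical helper, not the tokenizer's actual implementation, and it does not handle truncation.

```python
BOS_ID, EOS_ID, PAD_ID = 0, 2, 1  # ids of <s>, </s> and <pad> in the example output


def encode_pair(first_ids, second_ids, max_length):
    """Join two id sequences as <s> A </s></s> B </s> and pad to max_length."""
    ids = [BOS_ID] + first_ids + [EOS_ID, EOS_ID] + second_ids + [EOS_ID]
    if len(ids) > max_length:
        raise ValueError("pair is longer than max_length; truncation is not sketched here")
    padding = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * padding  # 1 = real token, 0 = padding
    return ids + [PAD_ID] * padding, attention_mask


# Sub-word ids of the statement and of OptionB, copied from the example output
statement = [1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080]
reason_b = [133, 3034, 189, 28, 3187, 11, 5, 8080]

input_ids, attention_mask = encode_pair(statement, reason_b, max_length=50)
print(len(input_ids), sum(attention_mask))  # 50 22
```

The 22 real tokens are followed by 28 padding positions, which matches the second list of `input_ids` and `attention_mask` in the example.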

In [15]:
def preprocess_data(examples, tokenizer, max_length):
    # Create statement-reason pairs
    first_sentences = [[example] * 3 for example in examples["FalseSent"]]
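    # Build lists (not tuples) so that sum(..., []) below can concatenate them
    second_sentences = [list(options) for options in zip(examples["OptionA"], examples["OptionB"], examples["OptionC"])]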

    # Flatten the lists to prepare for tokenization
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize the pairs
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True, padding='max_length', max_length=max_length)

    # Un-flatten the lists to restore the multiple choice structure
    input_ids = [tokenized_examples["input_ids"][i:i + 3] for i in range(0, len(tokenized_examples["input_ids"]), 3)]
    attention_mask = [tokenized_examples["attention_mask"][i:i + 3] for i in range(0, len(tokenized_examples["attention_mask"]), 3)]

    # Return the tokenized inputs as a dictionary
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask
    }

In [10]:
def preprocess_data(examples, tokenizer, max_length):
    # Create statement-reason pairs
    first_sentences = [[statement] * 3 for statement in examples["FalseSent"]]
    second_sentences = [[examples["OptionA"][i], examples["OptionB"][i], examples["OptionC"][i]] for i in range(len(examples["FalseSent"]))]

    # Flatten the lists to tokenize them
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize the sequences
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True, padding='max_length', max_length=max_length)

    # Unflatten the sequences
    input_ids = [tokenized_examples['input_ids'][i:i + 3] for i in range(0, len(tokenized_examples['input_ids']), 3)]
    attention_mask = [tokenized_examples['attention_mask'][i:i + 3] for i in range(0, len(tokenized_examples['attention_mask']), 3)]

    return {'input_ids': input_ids, 'attention_mask': attention_mask}

In [11]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
print(train_dataset[0])
print("")
for seq in train_dataset[0]["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(seq))
    print("")


{'id': 0, 'FalseSent': 'He poured orange juice on his cereal.', 'OptionA': 'Orange juice is usually bright orange.', 'OptionB': "Orange juice doesn't taste good on cereal.", 'OptionC': 'Orange juice is sticky if you spill it on the table.', 'label': 1, 'input_ids': [[0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 37264, 10580, 16, 2333, 4520, 8978, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 37264, 10580, 630, 75, 5840, 205, 15, 25629, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 37264, 10580, 16, 25247, 114, 47, 10923, 24, 15, 5, 2103, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## Fine-tuning - [4 Marks]

Although it is possible to write customized training loops for the **Transformers** models using **Keras** or **PyTorch**, **Transformers** provides a [Trainer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer) API that allows fine-tuning efficiently with a few simple steps. The training is highly customizable through a wide range of options and hyperparameters that are handled by the [TrainingArguments](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments) class. Your next goal is to create both the `TrainingArguments` and `Trainer` objects that will be used to fine-tune **RoBERTa**. See the [documentation](https://huggingface.co/docs/transformers/training) for an introduction on how to perform these steps.

You must complete the code for the `create_training_arguments` function. This function takes as arguments the `epochs`, `train_batch_size` and `learning_rate` hyperparameters along with the `output_dir`. The function should use these arguments to create and return a `TrainingArguments` object. During the training, the model must be evaluated on the development set after every epoch. `TrainingArguments` should include this strategy.

> **Important!** By default, `Trainer` saves a checkpoint of the model every 500 training steps. For this assignment, avoid this behavior by setting `save_strategy="no"` when creating the `TrainingArguments`.

In [None]:
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    pass

In [18]:
!pip uninstall transformers accelerate -y
!pip install transformers[torch]
!pip install accelerate



In [12]:
from transformers import TrainingArguments

def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):
    return TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        learning_rate=learning_rate,
        evaluation_strategy="epoch",
        save_strategy="no",
        logging_dir=f"{output_dir}/logs",
        logging_steps=10,
    )

In [13]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)



Next, you will create a `Trainer` object with the training arguments. When the input format of a task has some special characteristic, the `Trainer` must be created with a data collator that can handle the batches of examples accordingly during the training. This is the case of Multiple Choice problems since the input of each example is a list of sequences. **Transformers** provides a set of [DataCollator](https://huggingface.co/docs/transformers/main_classes/data_collator) objects for different tasks, but not for Multiple Choice. However, a `DataCollatorForMultipleChoice` is provided along with this notebook.
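To give an idea of what a collator for Multiple Choice has to do, the sketch below groups a few toy examples into one batch of nested lists. It is an illustration under stated assumptions, not the `DataCollatorForMultipleChoice` shipped in `datacollator.py`: the function name `collate_multiple_choice` and the padding id `1` are assumptions, and a real collator would also convert the lists to tensors.

```python
def collate_multiple_choice(features, pad_id=1):
    """Batch examples whose input_ids / attention_mask are lists of choices."""
    num_choices = len(features[0]["input_ids"])

    # Flatten: one sequence per (example, choice) pair
    flat_ids = [ids for f in features for ids in f["input_ids"]]
    flat_mask = [mask for f in features for mask in f["attention_mask"]]

    # Pad every sequence to the longest one in the batch
    longest = max(len(ids) for ids in flat_ids)
    flat_ids = [ids + [pad_id] * (longest - len(ids)) for ids in flat_ids]
    flat_mask = [mask + [0] * (longest - len(mask)) for mask in flat_mask]

    def regroup(flat):
        # Un-flatten back to (batch_size, num_choices, sequence_length)
        return [flat[i:i + num_choices] for i in range(0, len(flat), num_choices)]

    return {
        "input_ids": regroup(flat_ids),
        "attention_mask": regroup(flat_mask),
        "labels": [f["label"] for f in features],
    }


features = [
    {"input_ids": [[0, 5, 2], [0, 6, 7, 2], [0, 8, 2]],
     "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]], "label": 1},
    {"input_ids": [[0, 9, 2], [0, 3, 2], [0, 4, 4, 4, 2]],
     "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]], "label": 2},
]

batch = collate_multiple_choice(features)
print(len(batch["input_ids"]), len(batch["input_ids"][0]), len(batch["input_ids"][0][0]))  # 2 3 5
```

Since `preprocess_data` already pads every sequence to `max_length`, the padding step in this sketch would leave the notebook's inputs unchanged.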

You must complete the code for the `create_trainer` function. The function takes as input the model returned by `load_model`, the `TrainingArguments` created by `create_training_arguments` and the train and development `Datasets`. The function also takes the `tokenizer` returned by `load_model` that is required to initialize `DataCollatorForMultipleChoice`. The `create_trainer` function must create and return a `Trainer` object with the model, the training arguments and a `DataCollatorForMultipleChoice` object. The `Trainer` must be set up so that the train `Dataset` is used for training and the development `Dataset` is used to evaluate the model during the training.

In [None]:
def create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    pass

In [16]:
!pip install transformers




In [21]:
!pip show transformers
!pip show accelerate


Name: transformers
Version: 4.42.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
Name: accelerate
Version: 0.31.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: 


In [30]:
from transformers import Trainer
from datacollator import DataCollatorForMultipleChoice

def create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer):
    # The batch size is not a sequence length; the inputs are already padded in preprocess_data
    data_collator = DataCollatorForMultipleChoice(tokenizer)

    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    return trainer

In [31]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer)

The `trainer` object created by `create_trainer` is ready to fine-tune the model by just running:

In [35]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,0.2919,0.341576
2,0.169,0.337499
3,0.1314,0.379803


TrainOutput(global_step=3750, training_loss=0.2385765739383797, metrics={'train_runtime': 3335.7905, 'train_samples_per_second': 8.991, 'train_steps_per_second': 1.124, 'total_flos': 8188318090403100.0, 'train_loss': 0.2385765739383797, 'epoch': 3.0})
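The `global_step` value in the output can be checked with a short calculation. This sketch assumes training on a single device, without gradient accumulation, and with the last incomplete batch of each epoch kept; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of gradient updates: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Full train set: 9997 examples, batch size 8, 3 epochs
print(total_training_steps(9997, 8, 3))  # 3750

# Reduced train set used when shrink_dataset is True: 100 examples
print(total_training_steps(100, 8, 3))  # 39
```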

After training, the model can be used to make predictions on unlabeled data using the [predict](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer.predict) method of the `Trainer` class.

You must complete the code for the `make_predictions` function. The function takes as input the `Trainer` object and the test `Dataset`. The function must run the `predict` method on the input data. The `predict` method will return a `NamedTuple` including a **numpy** array with the predictions. For each statement in the input, the array contains a vector with the logits (the values used as input of the softmax) predicted for every label corresponding to a possible reason. The output of `make_predictions` must include only the index of the label with the highest logit value. For example, if the prediction for one statement is `[-0.856213458, 1.39899943, -0.703246286]`, the output for that example should be `1`. For this, you can apply the [argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) method along the last axis of the **numpy** array.
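Because the softmax preserves the ordering of its inputs, the label with the highest logit is also the label with the highest probability, so the softmax itself never needs to be computed to obtain the prediction. The plain-Python sketch below illustrates this for the example logits; the `softmax` and `argmax` helpers are written here for illustration and stand in for the **numpy** operations.

```python
import math


def softmax(logits):
    """Turn logits into probabilities (shifted by the maximum for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [-0.856213458, 1.39899943, -0.703246286]
probabilities = softmax(logits)

print(argmax(logits))         # 1
print(argmax(probabilities))  # 1
```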

In [36]:
import numpy as np
import evaluate

def make_predictions(trainer, test_dataset):
    predictions = trainer.predict(test_dataset)
    logits = predictions.predictions
    predicted_labels = np.argmax(logits, axis=-1)
    return predicted_labels

In [37]:
predictions = make_predictions(trainer, test_dataset)
test_data["prediction"] = predictions
test_data

| | id | FalseSent | OptionA | OptionB | OptionC | label | prediction |
|---|---|---|---|---|---|---|---|
| 0 | 1175 | He loves to stroll at the park with his bed | A bed is too heavy to carry with when strollin... | walking at a park is good for health | Some beds are big while some are smaller | 0 | 0 |
| 1 | 452 | The inverter was able to power the continent. | An inverter is smaller than a car | An inverter is incapable of powering an entire... | An inverter is rechargeable. | 1 | 1 |
| 2 | 275 | The chef put extra lemons on the pizza. | Many types of lemons are to sour to eat. | Lemons and pizzas are both usually round. | Lemons are not a pizza topping. | 2 | 2 |
| 3 | 869 | sugar is used to make coffee sour | sugar is white while coffee is brown | sugar can dissolve in the coffee | sugar usually is used as a sweetener | 2 | 2 |
| 4 | 50 | There are beautiful planes here and there in t... | A plane flies upon the garden | You can have a small garden in your private plane | A plane can never be seen in garden | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1114 | If it is a sunny day, you would got wet. | Usually a sunny day don't cause to wet. | People prefer to walk during sunny day. | People feel mess if they are wet. | 0 | 0 |
| 996 | 8 | ice hockey is a financial institution | Children's playing ice hockey requires financi... | Playing ice hockey well can bring you money | There are no relationships between ice hockey ... | 2 | 2 |
| 997 | 1945 | He put water without a container in the freeze... | Water and containers are two different element... | water cannot be in the freezer without a conta... | Water more deep in a container cannot always b... | 1 | 1 |
| 998 | 1053 | The desert has sand that you can drink. | Water is not the same color as the sand. | Sand is solid and inedible. | The desert has lots of sand. | 1 | 1 |


**SubTask B** of **ComVE** is evaluated using accuracy. The [evaluate](https://huggingface.co/docs/evaluate/index) library provides support to apply this and other metrics. The `evaluate_prediction` function takes the test `DataFrame` and calculates the accuracy by comparing the `prediction` and `label` columns. With `shrink_dataset` and `base_model` set to `True`, the model is not able to learn the task, so the expected score is only *0.51*. With a full training run, i.e. with `shrink_dataset` and `base_model` set to `False`, the score should be around *0.928*.

In [38]:
def evaluate_prediction(test_data):
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=test_data["prediction"].values, references=test_data["label"].values)
evaluate_prediction(test_data)

{'accuracy': 0.925}
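Accuracy is simply the fraction of examples whose predicted label matches the gold label. The following plain-Python sketch shows the computation on toy values; the `accuracy` function is written here for illustration and is not the implementation used by the `evaluate` library.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)


toy_predictions = [0, 1, 2, 2, 1]
toy_labels = [0, 1, 2, 0, 1]

print(accuracy(toy_predictions, toy_labels))  # 0.8
```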