# Introduction to Natural Language Processing 2 Lab02

## HuggingFace Transformers

Follow the [HuggingFace transformers course](https://huggingface.co/course/chapter0), from chapter 0 to 4.

## **(11 points)** Sentence classification

Use the HuggingFace transformers library to fine-tune a model on the IMDB dataset and then evaluate it on the test set. As you do not necessarily have access to a good GPU, and Google Colab does not always provide one, you do not have to fine-tune the model for more than one epoch. There is a fine-tuned model available for steps 2 onward.

Go through the following steps.

1. (5 points) Fine-tune the model on the training data.
   * Again, at least for one epoch, to make sure your code works.
   * If you want to make sure your model is loaded on GPU, after creating the `Trainer` object (see HuggingFace [course](https://huggingface.co/course/chapter3/3?fw=pt)), you can look at `model.device`. It should tell you it's on a `cuda` device.
   * We recommend using [distilbert](https://huggingface.co/distilbert-base-uncased) as the pre-trained model, as it is light and fine-tunes fast. **Beware:** some models have already been fine-tuned on the IMDB dataset, so make sure you do not use any of those (the data used to train/fine-tune a model should be listed on the model's card). Other potential models are:
      * [BERT](https://huggingface.co/bert-base-uncased)
      * [RoBERTa](https://huggingface.co/roberta-base)
      * [DeBERTa](https://huggingface.co/microsoft/deberta-base)
   * Provide the notebook used to train your model with adequate comments.
   * You can save your model on the HuggingFace model hub (totally optional). If you do, please fill in the model's card.
   
For what follows, you can either use a model you fully fine-tuned, or [this one](https://huggingface.co/mvonwyl/distilbert-base-uncased-imdb).

2. (2 points) Evaluate the model in terms of accuracy on the test data.
3. (2 points) For at least 2 samples which have been wrongly classified in the test set, try explaining why the model could have been wrong.
4. (2 points) What are the advantages and drawbacks of using this model in production compared to the naive Bayes we implemented in the first part of the course?
5. **\[Bonus\]** Fine-tune your model using the accuracy as evaluation instead of the loss (default). You can use the base `Trainer` class, create your own custom trainer class, or even not use `Trainer` at all. Return the model with the best results on the validation set instead of the last one. (many points)

## Evaluation

The assignment will be evaluated on the following criteria

* A report answering the questions above, describing your technical choices, and analysing your results.
* The quality of your code (modularity, efficiency, comments, coding standards).

For coding standards, please respect the following guidelines
* Use [docstring](https://www.programiz.com/python-programming/docstrings) format to describe your functions and their arguments
* Use typing
* Have clear and descriptive variable names (not x, x1, x2, xx, another_x, ...)
* Make your results reproducible (fix the random seed values)
* Don't hesitate to comment in detail on the parts of the code you consider complex or hard to read
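
As an illustration of these guidelines, here is a minimal sketch of a documented, typed helper that fixes the random seeds. The name `set_random_seeds` is hypothetical and not part of the assignment; the helper only seeds the libraries that happen to be installed. The `transformers` library also ships a ready-made `set_seed` function that can be used instead.

```python
import random


def set_random_seeds(seed: int) -> None:
    """Seed the random number generators that are available.

    Args:
        seed: Value used to initialise every generator.
    """
    random.seed(seed)
    try:
        import numpy

        numpy.random.seed(seed)
    except ImportError:
        pass  # numpy is optional in this sketch
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional in this sketch


# Seeding twice with the same value gives the same draw.
set_random_seeds(42)
first_draw = random.random()
set_random_seeds(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Calling the helper once at the top of the notebook, before loading the data and the model, is usually enough.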

Provide a `README.md` file with 
* A short description of the project
* A description of the file/module architecture

This part provides 9 points + 3 points on coding standards: naming, typing, comments, and docstring. You can earn extra points by answering the bonus questions, and by packaging your code in extra python files. At the end of the module, all project points are summed and projected on a grade between 0 and 16. The last 4 points can be earned by answering the bonus questions, and presenting a language.

All projects have to be sent to `marc.von-wyl` at `epita` dot `fr` before Thursday 17th of November 2022 at midnight. It is advised, though, to send them progressively.

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
import numpy
from datasets import load_dataset
import torch

SEED=42

In [None]:
# Data loading
dataset = load_dataset("imdb").shuffle(seed=SEED)
# We do not need the "unsupervised" split.
dataset.pop("unsupervised")

Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})

## Train/validation split

In [None]:
# Splitting the train split into train and validation

train_valid_split = dataset["train"].train_test_split(train_size=0.8, seed=SEED, stratify_by_column="label")
dataset["validation"] = train_valid_split["test"]
dataset["train"] = train_valid_split["train"]

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [None]:
# Check here that the dataset is stratified.

train_positive_review_rate = len(train_valid_split["train"].filter(lambda x : x["label"] == 1)) / len(train_valid_split["train"])
validation_positive_review_rate = len(train_valid_split["test"].filter(lambda x : x["label"] == 1)) / len(train_valid_split["test"])
print("Positive review rate for 'train' split:", train_positive_review_rate)
print("Positive review rate for 'validation' split:", validation_positive_review_rate)

if (train_positive_review_rate == validation_positive_review_rate):
  print("Splits stratified !")

Positive review rate for 'train' split: 0.5
Positive review rate for 'validation' split: 0.5
Splits stratified !
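
The exact equality test works here because both splits contain exactly 50% of positive reviews. A slightly more tolerant variant of the same check could look like the sketch below; the helper name `positive_rate`, the toy labels and the tolerance of 0.01 are assumptions made for the illustration.

```python
import math
from collections import Counter
from typing import Sequence


def positive_rate(labels: Sequence[int]) -> float:
    """Return the share of labels equal to 1."""
    return Counter(labels)[1] / len(labels)


# Toy labels standing in for the real splits (hypothetical values).
train_labels = [0, 1] * 8
validation_labels = [1, 0] * 2

train_rate = positive_rate(train_labels)
validation_rate = positive_rate(validation_labels)

# Compare with a tolerance instead of relying on exact float equality.
is_stratified = math.isclose(train_rate, validation_rate, abs_tol=0.01)
print(train_rate, validation_rate, is_stratified)
```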


## Loading pretrained model

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

# Define which checkpoint to use (a DistilBERT already fine-tuned on SST-2 sentiment data, not on IMDB).
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# Loading model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Loading model's weights and architecture.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

## Dataset encoding

In [None]:
# Tokenizing the dataset.
tokenized_dataset = dataset.map(lambda x: tokenizer(x['text'], truncation=True, padding='max_length'), batched=True)
tokenized_dataset.column_names

{'train': ['text', 'label', 'input_ids', 'attention_mask'],
 'test': ['text', 'label', 'input_ids', 'attention_mask'],
 'validation': ['text', 'label', 'input_ids', 'attention_mask']}
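
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a simplified sketch of what happens to a single sequence of token ids. The function `pad_or_truncate` and the toy ids are hypothetical; the sketch assumes a padding id of 0 and ignores special tokens, which the real tokenizer additionally handles when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_token_id: int = 0) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then pad it up to max_length."""
    kept_ids = token_ids[:max_length]
    padding_size = max_length - len(kept_ids)
    return {
        "input_ids": kept_ids + [pad_token_id] * padding_size,
        # 1 marks a real token, 0 marks padding that the model must ignore.
        "attention_mask": [1] * len(kept_ids) + [0] * padding_size,
    }


# A short sequence gets padded, a long one gets truncated (toy ids).
short_example = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_example = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_example)
print(long_example)
```

Every encoded review therefore has the same length, which is why the `attention_mask` column appears next to `input_ids` in the output above.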

## 1. (5 points) Fine-tune the model on the training data.

### Creation of the Trainer

In [None]:
from transformers import Trainer
from transformers import TrainingArguments

# Defining trainer's directory, number of epochs and setting the seed to make training reproducible.
training_args = TrainingArguments("trainer", num_train_epochs=1, seed=SEED)

# Instantiation of the trainer.
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
)

### Evaluation of the model before the fine-tuning

In [None]:
from sklearn.metrics import accuracy_score
import numpy as np

# Compute prediction
predictions = trainer.predict(tokenized_dataset["test"])
preds = np.argmax(predictions.predictions, axis=-1)
print("Accuracy:", accuracy_score(tokenized_dataset["test"]["label"], preds))

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Accuracy: 0.89072


In [None]:
# Fine-tuning
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 20000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2500
  Number of trainable parameters = 66955010


| Step | Training Loss |
|------|---------------|
| 500 | 0.3565 |
| 1000 | 0.3116 |
| 1500 | 0.2782 |
| 2000 | 0.2587 |
| 2500 | 0.2268 |


TrainOutput(global_step=2500, training_loss=0.2863808288574219, metrics={'train_runtime': 981.1461, 'train_samples_per_second': 20.384, 'train_steps_per_second': 2.548, 'total_flos': 2649347973120000.0, 'train_loss': 0.2863808288574219, 'epoch': 1.0})

## 2. (2 points) Evaluate the model in terms of accuracy on the test data.

In [None]:
predictions = trainer.predict(tokenized_dataset["test"])
preds = np.argmax(predictions.predictions, axis=-1)
print("Accuracy:", accuracy_score(tokenized_dataset["test"]["label"], preds))

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8


Accuracy: 0.92792


As we can see, the model gained about 3.7 percentage points of accuracy (from 0.89072 to 0.92792) with only one epoch of fine-tuning. The accuracy is now very good and well above that of all the models we built previously.
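
The two accuracies printed above can be turned into error counts to make the improvement easier to read. The sketch below only redoes the arithmetic on the reported figures; the variable names are chosen for the illustration.

```python
TEST_SET_SIZE = 25_000
accuracy_before = 0.89072  # before fine-tuning
accuracy_after = 0.92792   # after one epoch of fine-tuning

errors_before = round(TEST_SET_SIZE * (1 - accuracy_before))
errors_after = round(TEST_SET_SIZE * (1 - accuracy_after))
gain_in_points = (accuracy_after - accuracy_before) * 100
relative_error_reduction = (errors_before - errors_after) / errors_before

print(f"Misclassified reviews: {errors_before} -> {errors_after}")
print(f"Accuracy gain: {gain_in_points:.2f} percentage points")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

In other words, roughly one third of the reviews that were misclassified before fine-tuning are now classified correctly.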

## 3. (2 points) For at least 2 samples which have been wrongly classified in the test set, try explaining why the model could have been wrong.

In [None]:
# Fetching classification errors on the test set (`preds` holds the test predictions).
test_labels = np.array(tokenized_dataset["test"]["label"])
error_indices = np.flatnonzero(test_labels != preds)

# Printing the first two classification errors.
print("Wrong classification:")
for error_index in error_indices[:2]:
  sample = dataset["test"][int(error_index)]
  print("Label:", sample["label"], "| Text:", sample["text"])

  


Output:

Wrong classification:

Label: 0 (predicted: 1)

Text: I had a lot of hopes for this movie and so watched it with a lot of expectations; basically because of Kamal Hassan. He is an amazing actor who has marked his foot steps in the sands of time forever. But this movie proved to be one of the worst movies i have ever seen. After watching this the movie the brutality and violence in tenebra and clockwork orange looks far better. <br />The Protagonist, Raghavan, is a very daring police officer. Who is assigned to a investigate brutal serial murders. Raghavan efficiently finds the connecting thread in this case and is close to solve the murders and put the psycho killers, two psychologically disturbed but brilliant medical students, behind bars but they escape and again get into a killing spree. Finally Raghavan kills them both after sparing many innocent lives.<br />THese two psycho-killers are the ones who are going to keep the audiences from going to the theaters. The murders and sexual harassments and rapes are shown very explicitly, which the movie could have survived without.<br />To even imagine that teenagers and kids are going to be watching this movie in the theater and kind of picture it is bound to paint in their minds are certainly not pretty. The director, Gautham, should realize that he also has some obligation to the society and his audience.Certainly i am never going to the movies looking like Gautham's name on the production list.

Label: 1 (predicted: 0)

Text: Another fun, witty, frothy RKO musical with Astaire and Rogers, FOLLOW THE FLEET is a charming film. While it lacks the stand-out great tunes of SWNG TIME (although the final number "Face the music and dance" is one of the team's best and rightfully so), it is hugely enjoyable, as these two could virtually do no wrong together in the 1930's. Once again, the plot is lightweight and forgettable, but watching Fred and Ginger dance together is sheer heaven. These two conveyed more romance and magic in dance than many couples in films do in a huge love scene. While there are better Fred and Ginger musicals, this is certainly a must-see.

The first misclassified review is clearly negative, as shown by the sentence "this movie proved to be one of the worst movies i have ever seen.", but the model predicted it as positive. This might be because a large portion of the review summarizes the movie rather than giving the author's opinion. Furthermore, the author's compliment about the actor might have confused the model.

The second review is clearly positive, but the phrases "the plot is lightweight and forgettable" and "While it lacks the stand-out great tunes of SWNG TIME" might have confused the model because, taken alone, they read as negative.
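
One way to choose which errors to analyse is to rank the misclassified samples by how confident the model was, since confident mistakes are often the most instructive. The sketch below uses toy logits and labels; in the notebook, the logits would be `predictions.predictions`. The helper names `softmax` and `confident_errors` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exponentials = [math.exp(value - highest) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def confident_errors(all_logits: Sequence[Sequence[float]], labels: Sequence[int]) -> List[Tuple[int, float]]:
    """Return (index, confidence) of wrong predictions, most confident first."""
    errors = []
    for index, (logits, label) in enumerate(zip(all_logits, labels)):
        probabilities = softmax(logits)
        predicted_label = probabilities.index(max(probabilities))
        if predicted_label != label:
            errors.append((index, probabilities[predicted_label]))
    return sorted(errors, key=lambda item: item[1], reverse=True)


# Toy logits and labels (hypothetical values, not taken from the run above).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-3.0, 3.0], [1.5, 0.5]]
toy_labels = [0, 0, 0, 1]
ranked_errors = confident_errors(toy_logits, toy_labels)
print(ranked_errors)
```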




## 4. (2 points) What are the advantages and drawbacks of using this model in production compared to the naive Bayes we implemented in the first part of the course?

This model takes context into account and is far more accurate than the naive Bayes, but it is much larger (about 67 million parameters) and needs more time and computational resources. Using it is therefore more expensive than using the naive Bayes. Furthermore, if we need real-time predictions, or many predictions in a short amount of time, this model can be overwhelmed more easily than the naive Bayes. However, if accuracy matters more than prediction speed, this model should be preferred.
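
To give an order of magnitude for the cost argument, the memory needed just to hold the weights can be estimated from the number of trainable parameters reported during training. This is a rough sketch that assumes the weights are stored as 32-bit floats; activations and optimizer state are not counted.

```python
PARAMETER_COUNT = 66_955_010  # "Number of trainable parameters" reported by the Trainer
BYTES_PER_FLOAT32 = 4         # assumption: weights stored as 32-bit floats

weights_size_in_megabytes = PARAMETER_COUNT * BYTES_PER_FLOAT32 / 1e6
print(f"{weights_size_in_megabytes:.1f} MB")
```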

## 5. **\[Bonus\]** Fine-tune your model using the accuracy as evaluation instead of the loss (default). You can use the base `Trainer` class, create your own custom trainer class, or even not use `Trainer` at all. Return the model with the best results on the validation set instead of the last one. (many points)
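
The selection rule asked for here, keeping the checkpoint with the best validation accuracy rather than the last one, can be sketched independently of any library. The accuracies below are hypothetical values invented for the illustration, and `select_best_checkpoint` is a hypothetical helper. With the `Trainer`, comparable bookkeeping is provided by `load_best_model_at_end` together with `metric_for_best_model`.

```python
from typing import Dict


def select_best_checkpoint(validation_accuracy_by_step: Dict[int, float]) -> int:
    """Return the step whose checkpoint reached the highest validation accuracy."""
    return max(validation_accuracy_by_step, key=validation_accuracy_by_step.get)


# Hypothetical validation accuracies, one per saved checkpoint.
toy_accuracies = {500: 0.905, 1000: 0.921, 1500: 0.918, 2000: 0.926, 2500: 0.924}
best_step = select_best_checkpoint(toy_accuracies)
print(f"trainer/checkpoint-{best_step}")
```

In this toy run the last checkpoint is not the best one, which is exactly the case the bonus question is about.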

In [None]:
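import numpy as np
from sklearn.metrics import accuracy_score
from transformers import EvalPrediction


def compute_accuracy(eval_prediction: EvalPrediction) -> dict:
    """Compute the accuracy from the logits and labels collected by the Trainer."""
    predicted_labels = np.argmax(eval_prediction.predictions, axis=-1)
    return {"accuracy": accuracy_score(eval_prediction.label_ids, predicted_labels)}


# Reloading the pre-trained weights so that this run does not start from the model fine-tuned in step 1.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Defining trainer's directory, number of epochs and setting the seed to make training reproducible.
# Evaluation and checkpointing both happen every 500 steps (the default), so that the checkpoint
# with the best validation accuracy can be reloaded at the end of training.
# Note: recent transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments(
    "trainer",
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    num_train_epochs=1,
    seed=SEED
    )

# Instantiation of the trainer.
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_accuracy,
)

# Fine-tuning
trainer.train()

# Evaluation
predictions = trainer.predict(tokenized_dataset["test"])
preds = np.argmax(predictions.predictions, axis=-1)
print("Accuracy:", accuracy_score(tokenized_dataset["test"]["label"], preds))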

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 20000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2500
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a m

| Step | Training Loss |
|------|---------------|
| 500 | 0.3499 |
| 1000 | 0.3109 |
| 1500 | 0.2898 |
| 2000 | 0.268 |
| 2500 | 0.2423 |


NameError: ignored