#  Assignment 2 - Transfer Learning and Data Augmentation 💬

Welcome to the **second assignment** for the **CS-552: Modern NLP course**!

> - 😀 Name: **Aziz Laadhar**
> - ✉️ Email: **aziz.laadhar@epfl.ch**
> - 🪪 SCIPER: **315196**

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;color:#424242;">

## **Assignment Description**
- In the first part of this assignment, you will need to implement training (finetuning) and evaluation of a pre-trained language model ([RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)) on a **Sentiment Analysis (SA)** task, which aims to determine whether a product review's emotional tone is positive or negative.

- For part-2, following the first finetuning task, you will need to identify the shortcuts (i.e. some salient or toxic features) that the model learnt for the specific task.

- For part-3, you will annotate 80 randomly assigned new datapoints with ground-truth labels. One or two other annotators will cross-annotate the same data, and you will learn how to calculate agreement statistics, an important indicator of the quality of a collected dataset.

- For part-4, since human annotation is time-consuming and laborious, automatic labeling is often used to obtain silver labels and enlarge the dataset, e.g., by paraphrasing each text input in different words without changing its meaning. You will use a [T5](https://huggingface.co/docs/transformers/en/model_doc/t5) paraphrase model to expand the sentiment analysis training data, and evaluate the improvement brought by data augmentation.

For Part 1 and Part 2, you will need to complete the code in the corresponding `.py` files (`sa.py` for Part 1, `shortcut.py` for Part 2). You will be provided with function descriptions and detailed instructions about the code snippets you need to write.


### Table of Contents
- **PART 1: Sentiment Analysis (33 pts)**
    - 1.1 Dataset Processing (10 pts)
    - 1.2 Model Training and Evaluation (18 pts)
    - 1.3 Fine-Grained Validation (5 pts)
- **PART 2: Identify Model Shortcuts (22 pts)**
    - 2.1 N-gram Pattern Extraction (6 pts)
    - 2.2 Distill Potentially Useful Patterns (8 pts)
    - 2.3 Case Study (8 pts)
- **PART 3: Annotate New Data (25 pts)**
    - 3.1 Write an Annotation Guideline (5 pts)
    - 3.2 Annotate Your Datapoints with Partner(s) (8 pts)
    - 3.3 Agreement Measure (12 pts)
- **PART 4: Data Augmentation (20 pts)**
    - 4.1 Data Augmentation with Paraphrasing (15 pts)
    - 4.2 Retrain RoBERTa Model with Data Augmentation (5 pts)
    
### Deliverables

- ✅ This jupyter notebook: `assignment2.ipynb`
- ✅ `sa.py` and `shortcut.py` files
- ✅ Checkpoints for RoBERTa models finetuned on original and augmented SA training data (Part 1 and Part 4), including:
    - `models/lr1e-05-warmup0.3/`
    - `models/lr2e-05-warmup0.3/`
    - `models/augmented/lr1e-05-warmup0.3/`
- ✅ Model prediction results on each domain data (Part 1.3 Fine-Grained Validation): `predictions/`
- ✅ Cross-annotated new SA data (Part 3), including:
    - `data/<your_assigned_dataset_id>-<your_sciper_number>.jsonl`
    - `data/<your_assigned_dataset_id>-<your_partner_sciper_number>.jsonl`
    - (for group of 3) `data/<your_assigned_dataset_id>-<your_second_partner_sciper_number>.jsonl`
- ✅ Paraphrase-augmented SA training data (Part 4), including:
    - `data/augmented_train_sa.jsonl`
- ✅ `./tensorboard` directory with logs for all trained/finetuned models, including:
    - `tensorboard/part1_lr1e-05/`
    - `tensorboard/part1_lr2e-05/`
    - `tensorboard/part4_lr1e-05/`

### How to implement this assignment

Please read the following points carefully. All the information on how to read, implement and submit your assignment is explained in detail below:

1. For this assignment, you will need to implement and fill in the missing code snippets for both the **Jupyter Notebook `assignment2.ipynb`** and the **`sa.py`**, **`shortcut.py`** python files.

2. Along with the above files, you need to additionally upload model files under the **`models/`** dir for the following models:
    - finetuned RoBERTa models on original SA training data (PART 1)  
    - finetuned RoBERTa model on augmented SA training data (PART 4)

3. You also need to upload model prediction results in Part 1.3 Fine-Grained Validation, saved in **`predictions/`**.

4. You also need to upload new data files under the **`data/`** dir (along with our already provided data), including:
    - new SA data with your and your partner's annotations (Part 3)
    - paraphrase-augmented SA training data (Part 4)

5. Finally, you will need to log your training using Tensorboard. Please follow the instructions in the `README.md` of the **``tensorboard/``** directory.

**Note**: Large files such as model checkpoints and logs should be pushed to the repository with Git LFS. Training the models on a GPU can speed up the process; we recommend using Colab's free GPU service for this. A tutorial on how to use Git LFS and Colab can be found [here](https://github.com/epfl-nlp/cs-552-modern-nlp/blob/main/Exercises/tutorials.md).
    
</div>

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Environment Setup**

### **Option 1: creating your own environment**

```
conda create --name mnlp-a2 python=3.10
conda activate mnlp-a2
pip install -r requirements.txt
```

**Note**: If some package versions in our suggested environment do not work, feel free to try other package versions suitable for your computer, but remember to update ``requirements.txt`` and explain the environment changes in your notebook (no penalty for this if necessary).

### **Option 2: using Google Colab**
If you are using Google Colab notebook for this assignment, you will need to run a few commands to set up our environment on Google Colab, as shown below:
    
</div>

In [1]:
# This cell makes sure modules are auto-loaded when you change external python files
%load_ext autoreload
%autoreload 2

In [None]:
# # If you are working in Colab, then consider mounting your assignment folder to your drive
# from google.colab import drive
# drive.mount('/content/drive')

# # Direct to your assignment folder.
# %cd /content/drive/MyDrive/path-to-your-assignment-folder

Install packages that are not included in the Colab base environment:

In [3]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # limiting to one GPU

# Install dependencies
!pip install -r requirements.txt



In [1]:
import numpy as np
import jsonlines
import random
import os
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# TODO: Enter your Sciper number
SCIPER = '315196'
seed = int(SCIPER)
torch.backends.cudnn.deterministic = True

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)



<torch._C.Generator at 0x10d14f1f0>

In [2]:
# Check the availability of a GPU (proceed only if it returns True!)
if torch.backends.mps.is_available():
  print('Good to go!')
else:
  print('Please set GPU via Edit -> Notebook Settings.')

Good to go!


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">
    
# PART 1: Sentiment Analysis (33 pts)

In this part, we will finetune a pretrained language model (RoBERTa) on a sentiment analysis (SA) task.

> Specifically, we will focus on a binary sentiment classification task for multi-domain product reviews. It requires the model to **classify a given paragraph of review by its sentiment polarity (positive or negative)**. 

</div>

### Load Training Dataset (`train_sa.jsonl`) 

**You can run the following cell to take a first look at your data**. Each data sample is a Python dictionary with the following components:
- input review (*'review'*): a natural language sentence or paragraph commenting on a product.
- domain (*'domain'*): the type of product being reviewed.
- label of sentiment (*'label'*): whether the review expresses a positive or negative view of the product.

In [3]:
data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
with jsonlines.open(data_train_path, "r") as reader:
    for sid, sample in enumerate(reader.iter()):
        if sid % 200 == 0:
            print(sample)

{'review': "THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life", 'domain': 'books', 'label': 'negative'}
{'review': 'Sphere by Michael Crichton is an excellant novel. This was certainly the hardest to put down of all of the Crichton novels that I have read. The story revolves around a man named Norman Johnson. Johnson is a phycologist. He travels with 4 other civilans to a remote location in the Pacific Ocean to help the Navy in a top secret misssion. They quickly learn that under the ocean is a half mile long sp

In [4]:
# We use the following pretrained tokenizer and model
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 🎯 Q1.1: **Dataset Processing (10 pts)**

Our first step is to construct a PyTorch Dataset for the SA task. Specifically, we will need to implement **tokenization** and **padding** using a HuggingFace pre-trained tokenizer.
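
As a rough illustration of what padding involves, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. The helper name `pad_or_truncate`, the `max_length` value and the toy ids are assumptions made for illustration; the pad id of 1 corresponds to the `<pad>` token in the RoBERTa vocabulary. The actual `SADataset` in `sa.py` may organize this differently, and a real tokenizer also keeps the closing special token when truncating.

```python
from typing import List, Tuple

PAD_ID = 1  # id of the <pad> token in the RoBERTa vocabulary


def pad_or_truncate(input_ids: List[int], max_length: int,
                    pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]      # naive truncation of long inputs
    mask = [1] * len(ids)             # 1 marks real tokens
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad      # pad short inputs on the right
    mask = mask + [0] * n_pad         # 0 marks padding positions
    return ids, mask


ids, mask = pad_or_truncate([0, 713, 1040, 2], max_length=6)
print(ids)   # [0, 713, 1040, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Passing the mask to the model alongside the ids lets it ignore the padded positions.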

**TODO🔻: Complete `SADataset` class following the instructions in `sa.py`, and test by running the following cell.**

In [5]:
from sa import SADataset
model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
dataset = SADataset("data/train_sa.jsonl", tokenizer)

Building SA Dataset...


1600it [00:00, 2034.90it/s]


In [6]:
from testA2 import test_SADataset
test_SADataset(dataset)

SADataset test correct ✅


[nltk_data] Downloading package stopwords to /Users/aziz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 🎯 Q1.2: **Model Training and Evaluation (18 pts)**

Next, we will implement the training and evaluation process to finetune the model. 

- For training: you will need to calculate the **loss** and update the model weights using the **Adam optimizer**. Additionally, we add a **learning rate scheduler** to adapt the learning rate over the course of training.

- For evaluation: you will need to compute the **confusion matrix** and **F1 scores** to assess the model performance.
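
The evaluation metrics can be sketched in plain Python as follows. This is a minimal sketch, not the `compute_metrics()` expected by `sa.py`: the function names are hypothetical, and the matrix layout (rows indexed by gold label, columns by predicted label) is an assumption.

```python
from typing import List


def confusion_matrix(gold: List[int], pred: List[int]) -> List[List[int]]:
    """2x2 matrix where entry [g][p] counts samples with gold g and prediction p."""
    matrix = [[0, 0], [0, 0]]
    for g, p in zip(gold, pred):
        matrix[g][p] += 1
    return matrix


def f1_for_class(matrix: List[List[int]], c: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for class c."""
    other = 1 - c
    tp = matrix[c][c]
    fp = matrix[other][c]   # predicted c, but gold is the other class
    fn = matrix[c][other]   # gold is c, but predicted the other class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


gold = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 0, 0, 0, 1]
cm = confusion_matrix(gold, pred)
f1_0, f1_1 = f1_for_class(cm, 0), f1_for_class(cm, 1)
macro_f1 = (f1_0 + f1_1) / 2   # unweighted mean of the per-class F1 scores
print(cm)              # [[3, 1], [3, 1]]
print(round(f1_0, 2))  # 0.6
```

Macro-F1 weights both classes equally, which is why it is a common summary for a balanced binary task like this one.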

**TODO🔻: Complete the `compute_metrics()`, `train()` and `evaluate()` functions following the instructions in the `sa.py` file. You can test `compute_metrics()` by running the following cell.**

In [7]:
from sa import compute_metrics, train, evaluate

from testA2 import test_compute_metrics
test_compute_metrics(compute_metrics)

[[3. 1.]
 [3. 1.]]
0.6
compute_metric test correct ✅


#### **Start Training and Validation!**

TODO🔻: (1) [coding question] Train the model with the following two different learning rates (other hyperparameters should be kept consistent). 

> A. learning_rate = 1e-5

> B. learning_rate = 2e-5

**Note:** *Each training will take ~7-10 minutes using a T4 Colab GPU.*
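
The `warmup_percent` setting used in this part controls how much of training is spent ramping the learning rate up. A common shape is linear warmup followed by linear decay; whether `train()` in `sa.py` uses exactly this shape is an assumption, and the function name below is hypothetical. The sketch uses 800 total steps, i.e. 4 epochs of 200 batches as in the training logs.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_percent: float) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    warmup_steps = round(total_steps * warmup_percent)
    if step < warmup_steps:
        # Ramp linearly from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 4 * 200   # epochs * batches per epoch
base_lr = 1e-5
for step in (0, 120, 240, 520, 800):
    lr = base_lr * linear_warmup_decay(step, total_steps, warmup_percent=0.3)
    print(step, lr)
```

With a 30% warmup the learning rate peaks at step 240, so the nominal value (1e-5 or 2e-5) is only reached partway through the second epoch.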

In [11]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)


batch_size = 8
epochs = 4
max_grad_norm = 1.0
warmup_percent = 0.3

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
learning_rate = 1e-5  # play around with this hyperparameter

dev_dataset = SADataset("data/test_sa.jsonl", tokenizer)
train(train_dataset=dataset,dev_dataset=dev_dataset, model=model, device=device, batch_size=batch_size, epochs= epochs, learning_rate=learning_rate, warmup_percent=warmup_percent, max_grad_norm=max_grad_norm,
      model_save_root='models/', tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Building SA Dataset...


207it [00:00, 2059.06it/s]

6400it [00:02, 2918.48it/s]
Training: 100%|██████████| 200/200 [01:10<00:00,  2.83it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.40it/s]


Epoch: 0 | Training Loss: 0.054 | Validation Loss: 0.889
Epoch 0 SA Validation:
Confusion Matrix:
[[2806.  394.]
 [ 401. 2799.]]
F1: (87.59%, 87.56%) | Macro-F1: 87.58%
Model Saved!


Training: 100%|██████████| 200/200 [01:07<00:00,  2.96it/s]
Evaluation: 100%|██████████| 800/800 [00:56<00:00, 14.29it/s]


Epoch: 1 | Training Loss: 0.054 | Validation Loss: 0.710
Epoch 1 SA Validation:
Confusion Matrix:
[[2811.  389.]
 [ 245. 2955.]]
F1: (89.87%, 90.31%) | Macro-F1: 90.09%
Model Saved!


Training: 100%|██████████| 200/200 [01:07<00:00,  2.95it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.34it/s]


Epoch: 2 | Training Loss: 0.024 | Validation Loss: 1.041
Epoch 2 SA Validation:
Confusion Matrix:
[[2474.  726.]
 [ 118. 3082.]]
F1: (85.43%, 87.96%) | Macro-F1: 86.69%


Training: 100%|██████████| 200/200 [01:07<00:00,  2.96it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.47it/s]

Epoch: 3 | Training Loss: 0.038 | Validation Loss: 0.686
Epoch 3 SA Validation:
Confusion Matrix:
[[2832.  368.]
 [ 279. 2921.]]
F1: (89.75%, 90.03%) | Macro-F1: 89.89%





In [49]:
learning_rate = 2e-5  # play around with this hyperparameter

dev_dataset = SADataset("data/test_sa.jsonl", tokenizer)
train(train_dataset=dataset,dev_dataset=dev_dataset, model=model, device=device, batch_size=batch_size, epochs= epochs, learning_rate=learning_rate, warmup_percent=warmup_percent, max_grad_norm=max_grad_norm,
      model_save_root='models/', tensorboard_path="./tensorboard/part1_lr{}".format(learning_rate))

Building SA Dataset...


0it [00:00, ?it/s]

6400it [00:02, 3045.47it/s]
Training: 100%|██████████| 200/200 [01:10<00:00,  2.85it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.38it/s]


Epoch: 0 | Training Loss: 0.163 | Validation Loss: 0.583
Epoch 0 SA Validation:
Confusion Matrix:
[[2929.  271.]
 [ 571. 2629.]]
F1: (87.43%, 86.20%) | Macro-F1: 86.81%
Model Saved!


Training: 100%|██████████| 200/200 [01:07<00:00,  2.97it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.32it/s]


Epoch: 1 | Training Loss: 0.126 | Validation Loss: 0.481
Epoch 1 SA Validation:
Confusion Matrix:
[[2780.  420.]
 [ 393. 2807.]]
F1: (87.24%, 87.35%) | Macro-F1: 87.30%
Model Saved!


Training: 100%|██████████| 200/200 [01:07<00:00,  2.96it/s]
Evaluation: 100%|██████████| 800/800 [00:56<00:00, 14.15it/s]


Epoch: 2 | Training Loss: 0.121 | Validation Loss: 0.976
Epoch 2 SA Validation:
Confusion Matrix:
[[2311.  889.]
 [ 117. 3083.]]
F1: (82.13%, 85.97%) | Macro-F1: 84.05%


Training: 100%|██████████| 200/200 [01:07<00:00,  2.97it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.33it/s]


Epoch: 3 | Training Loss: 0.118 | Validation Loss: 0.760
Epoch 3 SA Validation:
Confusion Matrix:
[[2716.  484.]
 [ 244. 2956.]]
F1: (88.18%, 89.04%) | Macro-F1: 88.61%
Model Saved!


TODO🔻: (2) [textual question] compare and discuss the results. 

- Which learning rate is better? Explain your answers.

The first learning rate (1e-5) is better: it achieves a higher Macro-F1 than the second (2e-5) at every epoch, and its best checkpoint reaches 90.09% (epoch 1) compared with 88.61% (epoch 3) for 2e-5.

## 🎯 Q1.3: **Fine-Grained Validation (5 pts)**

TODO🔻: (1) [coding question] Using the model checkpoint trained with the first learning_rate setting (lr=1e-5), check the model performance on each domain subset of the validation set. You should report **the validation loss**, **confusion matrix**, **F1 scores** and **Macro-F1 on each domain**. 

In [8]:
# Split the test sets into subsets with different domains
# Save the subsets under 'data/'
# Replace "..." with your code
domain_data = {}

# split to subsets
with jsonlines.open("data/test_sa.jsonl", mode="r") as reader:
    for sample in reader:
        domain = sample["domain"]
        if domain not in domain_data:
            domain_data[domain] = []
        domain_data[domain].append(sample)

for domain, samples in domain_data.items():
    with jsonlines.open("data/test_sa_"+domain+".jsonl", mode="w") as writer:
        for sd in samples:
            writer.write(sd)

In [13]:
learning_rate = 1e-5  # matches the checkpoint loaded below
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained(model_name)

model = RobertaForSequenceClassification.from_pretrained('models/lr1e-05-warmup0.3')
model.to(device)

results_save_dir = 'predictions/'

# Evaluate and save prediction results in each domain
# Replace "..." with your code
for domain in ['books', 'dvd', 'electronics', 'housewares']:
    
    test_dataset = SADataset("data/test_sa_"+domain+".jsonl", tokenizer)
    dev_loss, confusion, f1_pos, f1_neg = evaluate(test_dataset, model, device, batch_size=batch_size,
                                                   result_save_file='predictions/test_'+domain+'.jsonl')
    macro_f1 = (f1_pos + f1_neg) / 2

    print(f'Domain: {domain}')
    print(f'Validation Loss: {dev_loss:.3f}')
    print(f'Confusion Matrix:')
    print(confusion)
    print(f'F1: ({f1_pos*100:.2f}%, {f1_neg*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

Building SA Dataset...


125it [00:00, 1243.69it/s]

1600it [00:01, 1508.97it/s]
Evaluation:   0%|          | 0/200 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Evaluation: 100%|██████████| 200/200 [00:14<00:00, 13.95it/s]


Domain: books
Validation Loss: 0.520
Confusion Matrix:
[[739.  61.]
 [102. 698.]]
F1: (90.07%, 89.54%) | Macro-F1: 89.81%
Building SA Dataset...


1600it [00:00, 2075.67it/s]
Evaluation: 100%|██████████| 200/200 [00:14<00:00, 13.90it/s]


Domain: dvd
Validation Loss: 0.633
Confusion Matrix:
[[715.  85.]
 [112. 688.]]
F1: (87.89%, 87.48%) | Macro-F1: 87.68%
Building SA Dataset...


1600it [00:00, 3334.37it/s]
Evaluation: 100%|██████████| 200/200 [00:14<00:00, 13.62it/s]


Domain: electronics
Validation Loss: 0.548
Confusion Matrix:
[[723.  77.]
 [ 92. 708.]]
F1: (89.54%, 89.34%) | Macro-F1: 89.44%
Building SA Dataset...


1600it [00:00, 3926.11it/s]
Evaluation: 100%|██████████| 200/200 [00:14<00:00, 13.56it/s]

Domain: housewares
Validation Loss: 0.438
Confusion Matrix:
[[746.  54.]
 [ 77. 723.]]
F1: (91.93%, 91.69%) | Macro-F1: 91.81%





TODO🔻: (2) [textual question] compare and discuss the results. 

**Questions:**
- On which domain does the model perform the best? the worst?
- Give some possible explanations of why the model's best-performed domain is easier, and why the model's worst-performed domain is more challenging. Use some examples to support your explanations.

**Note:** To find examples for supporting your discussion, save the model prediction results on each domain under the `predictions/` folder, by specifying the `result_save_file` parameter in the *evaluate* function.

**The model performs best on housewares (Macro-F1 91.81%) and worst on dvd (Macro-F1 87.68%).**
**A possible explanation is that movie reviews cover a broader spectrum of aspects than housewares reviews: housewares reviews are generally concise and straight to the point, whereas movie reviews often express mixed feelings about several aspects of the film.**
{"review": "The flight into New York had been a long one; but I was gently revived when Kirsten Dunst met me at the airport.  She was to be my escort that night at the Lincoln Center where the President would present me the Distinguished Writer's Cross-the highest award this nation can bestow on an author.  I quickly slipped my only suitcase in the backseat and sat next to her as she drove her Mercedes S class sedan to the Waldorf Astoria downtown. \"I have to admit that I didn't you who you were until I read STARLESS GRASSLANDS.  I don't mean to embarrass you; but that book changed my life.\" I tried to be polite and gracious toward the compliment.  For the remainder of the trip we had a nice conversation while I approved the fluid motion of her hands and legs as she maneuvered the controls of the automobile.  I very much approved of her legs. As I departed the car, Kirsten leaned toward the open door to catch my eyes. \"Don't forget.  I'll be by at seven so we can go to the ceremony together.  Please be ready.\" I expressed gratitude for her generosity and promised to be waiting for her seven sharp.  The hotel staff greeted me and soon I entered the elevator for a quiet ride to the eighty sixth floor.  As I entered the suite, the steward took my request to have my suit blushed, prepared and returned in a few hours. Finally, a bit of peace.  I poured a small measure of whiskey into a chilled tumbler and then walked over to the large window looking out over the city below.  For a few moments I contemplated the remarks I was to make at this evening's formal ceremonies, then I heard the hushed movement of stocking feet behind me across the room. I turned around and out of the bedroom to my left in walked Sandy Bullock dressed in a black sheer see through cat suit. \"Well, Crabby, aren't you glad to see me?\" Before I could recover from my surprise another set of feet wisped across the floor from the sitting room on the right.  
It was Jennifer Aniston dressed in the same black sheet cat suit. \"Sweetheart, don't look at her.  Come here to me.\" \"Get out!\" screamed Sandy. \"No, you get out.  It's not fair.  I left Brad for him so I get him!\" The two of them stood in angry silence staring at each other for a tense moment.  Then Sandy leapt at Jennifer and instantly they were in a bitter fight hissing and scratching each other.  It was a ferocious sight as pulled hair, shrieks of pain, and nylon tearing could be heard.  Then almost a quickly as it began, the fighting pair disappeared into the sitting room.  Several sharp punches were heard then a definite thump as a body fell to the floor. Sandy marched back into the room victorious.  I ran to the door of the sitting room only to see Jennifer laid out on the floor out cold, spread eagle and naked.  Before I could think of what to do Sandy tapped me on my shoulder.  I turned around and there Sandy stood.  A little worse for wear.  Trying to catch her breath.  Nude with her torn cat suit draped across her feet. \"Forget her.  She won't bother us now.\" Sandy brushed the stray hairs out of her face, smiled and stepped out of her nylon costume.  She posed standing and anticipating my admiration. \"Now I'm all sweaty and warm.  I need a shower.\"  Sandy turned her lithe body and walked into the bedroom and into the bathroom beyond. I turned around and wondered what I should do for poor Jennifer still laying unconscious on the floor.  Then I heard Sandy calling. \"Crabby, why don't you come and help me?  You could soap up my......\" OOPS!!!!  Sorry!!!  Err...just a little fantasy of mine.  I meant to honestly review this movie but my more realistic nature took over.  I swear my pitiful fantasy is actually much better than the story told here.  You see, the cover of this DVD displays Ms. Bullock prominently with her name in big font letters.  But in truth this movie focuses on Tate Donovan as the main character and only on Ms. 
Bullock as his unlikely love interest. As much as I \"love\" Sandra Bullock, I have to admit her filmography has not been kind to her.  Her best films have been WHILE YOU WERE SLEEPING, WRESTLING ERNEST HEMINGWAY and the much-unappreciated HOPE FLOATS.  Unfortunately for each of these gems are several terrible movies like FIRE ON THE AMAZON, HANGMAN, SPEED 2 and this movie. The plot revolves around two \"losers in love\" who are unappealing both in looks and personality.  Donovan comes across a love potion from a gypsy woman (an unrecognizable Anne Bancroft) and soon both Donovan and Bullock have the most desirable members of the opposite sex at their feet.  All goes well until others find out their secret and Donovan and Bullock discover that their true love is actually between each other. The story is predictable and unfortunately not very funny or all that sweet.  Chances are you can find a copy of this DVD fairly cheap and it is not a bad way to kill some time while you're waiting for your own boyfriend or girlfriend to get off work-or your friends to wake up from their naps-so you can really do something worth wild.", "domain": "dvd", "label": "negative", "prediction": "positive"}

**This review touches on several aspects of the movie and recounts the reviewer's whole experience, in contrast to the following review:**

{"review": "First the plastic top broke off at 6 months.  Then it stopped working at 9. I have one word: Junk", "domain": "housewares", "label": "negative", "prediction": "negative"}

**which is direct and straight to the point.**

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

# PART 2: Identify Model Shortcuts (22 pts)

In this part, we aim to find the shortcut features learnt by the sentiment analysis model we trained in Part 1. We will use the model checkpoint trained with `learning rate=1e-5`.

</div>

## 🎯 Q2.1: **N-gram Pattern Extraction (6 pts)**
We hypothesize that `n-gram`s could be the potential shortcut features learnt by the SA model. An `n-gram` is defined as a sequence of n consecutive words appearing in a natural language sentence or paragraph. 

Thus, we aim to extract n-grams that appear in a review and may serve as key indicators of the polarity of the review's sentiment, for example:

>- **Review 1**: This book was **horrible**. If it was possible to rate it **lower than one star** I would have.
>- **Review 2**: **Excellent** book, **highly recommended**. Helps to put a realistic perspective on millionaires.

For Review 1, the `1-gram "horrible"` and the `4-gram "lower than one star"` serve as two key indicators of negative sentiment. While for Review 2, the `1-gram "excellent"` and the `2-gram "highly recommended"` obviously indicate positive sentiment.
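
The definition above amounts to sliding a window of n tokens over the text. A minimal sketch, assuming the review has already been split into word tokens (the helper name `extract_ngrams` is hypothetical and is not the `ngram_extraction()` required in `shortcut.py`):

```python
from typing import List


def extract_ngrams(tokens: List[str], n: int) -> List[str]:
    """All runs of n consecutive tokens, each joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "excellent book highly recommended".split()
print(extract_ngrams(tokens, 1))  # ['excellent', 'book', 'highly', 'recommended']
print(extract_ngrams(tokens, 2))  # ['excellent book', 'book highly', 'highly recommended']
```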

TODO🔻: (1) [coding question] Complete `ngram_extraction()` function in `shortcut.py` file.

The returned *ngrams* is a **list** of dictionaries. The `n-th` **dictionary** corresponds to the `n-grams` (n=1, 2, 3, 4).

The keys of each dictionary should be the **unique n-gram strings** appearing in reviews, and the value for each n-gram key records the frequency of positive/negative predictions **made by the model** when the n-gram appears in the review, i.e., `[#positive_predictions, #negative_predictions]`.

> Example: **`ngrams`[0]['horrible'][0]** should return the number of positive predictions made by the model when the 1-gram token 'horrible' appears in the given review, i.e., the first entry of \[#positive_predictions, #negative_predictions\].

**Note:** (1) Any sequence that contains punctuation should NOT be counted as an n-gram (e.g. `it is great .` is NOT a 4-gram, but `it is great` is a 3-gram); (2) Stop-words should NOT be counted as 1-grams, but can appear in longer n-gram sequences (e.g. `is` is NOT a 1-gram token, but `it is great` can be a 3-gram token).
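
The two filtering rules and the `[#positive_predictions, #negative_predictions]` layout can be sketched as below. This is only an illustration under stated assumptions: the toy stop-word list, the punctuation test based on `string.punctuation`, the choice to count every occurrence of an n-gram, and the function names are all hypothetical, and the required `ngram_extraction()` in `shortcut.py` may differ on each point.

```python
import string
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

STOP_WORDS: Set[str] = {"it", "is", "this", "was", "the", "a"}  # toy list (assumption)


def has_punctuation(token: str) -> bool:
    """True if the token contains any ASCII punctuation character."""
    return any(ch in string.punctuation for ch in token)


def count_ngrams(samples: Iterable[Tuple[List[str], str]],
                 max_n: int = 4) -> List[Dict[str, List[int]]]:
    """samples yields (tokens, prediction), prediction being 'positive' or 'negative'."""
    ngrams = [defaultdict(lambda: [0, 0]) for _ in range(max_n)]
    for tokens, prediction in samples:
        slot = 0 if prediction == "positive" else 1
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                window = tokens[i:i + n]
                if any(has_punctuation(tok) for tok in window):
                    continue  # rule (1): no punctuation inside an n-gram
                if n == 1 and window[0] in STOP_WORDS:
                    continue  # rule (2): stop-words are not counted as 1-grams
                ngrams[n - 1][" ".join(window)][slot] += 1
    return ngrams


samples = [("it is great .".split(), "positive"),
           ("it is horrible".split(), "negative")]
counts = count_ngrams(samples)
print(counts[0]["great"])        # [1, 0]
print(counts[1]["it is"])        # [1, 1]
print(counts[2]["it is great"])  # [1, 0]
```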

## 🎯 Q2.2: **Distill Potentially Useful Patterns (8 pts)**

TODO🔻: (2) [coding question] For each group of n-grams (n=1,2,3,4), find and **print** the **top-100 n-gram sequences** with the **greatest frequency of appearance**, which could contain frequent semantic features and would be used as our feature list.
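
One reading of "greatest frequency of appearance" is the total count, positive plus negative. A minimal sketch of that ranking follows; the helper name and toy counts are hypothetical. Note that comparing the `[pos, neg]` lists directly would order them lexicographically, that is, by positive count first, which is a different ranking.

```python
from typing import Dict, List, Tuple


def top_k_by_total(counts: Dict[str, List[int]],
                   k: int = 100) -> List[Tuple[str, List[int]]]:
    """Rank n-grams by positive + negative count, highest first."""
    return sorted(counts.items(), key=lambda item: sum(item[1]), reverse=True)[:k]


toy_counts = {"great": [5, 1], "would": [3, 6], "book": [4, 4]}
print(top_k_by_total(toy_counts, k=2))  # [('would', [3, 6]), ('book', [4, 4])]
```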

In [9]:
from shortcut import ngram_extraction

In [10]:
# all your saved model prediction results from 1.3 Fine-Grained Validation
prediction_files = ['predictions/test_books.jsonl', 'predictions/test_dvd.jsonl', 'predictions/test_electronics.jsonl', 'predictions/test_housewares.jsonl']

# TODO: Define your tokenizer
tokenizer = RobertaTokenizer.from_pretrained(model_name)
ngrams = ngram_extraction(prediction_files, tokenizer)

top_100 = {}
for n, counts in enumerate(ngrams):
    # TODO: find top-100 n-grams (n=1,2,3 or 4) associated with the greatest frequency of appearance
    top_100_freq = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:100]

    print(f'Top-100 most frequent {n+1}-grams:')
    print(top_100_freq)

    top_100[n] = top_100_freq

100%|██████████| 1600/1600 [00:01<00:00, 927.43it/s] 
100%|██████████| 1600/1600 [00:01<00:00, 970.44it/s] 
100%|██████████| 1600/1600 [00:01<00:00, 1557.93it/s]
100%|██████████| 1600/1600 [00:00<00:00, 2369.79it/s]


Top-100 most frequent 1-grams:
[("'s", [3842, 3078]), ("'t", [2309, 2687]), ('one', [2143, 1844]), ('book', [1974, 1777]), ('great', [1347, 527]), ('like', [1293, 1269]), ('good', [1176, 954]), ('well', [1118, 601]), ('time', [1011, 896]), ('movie', [943, 978]), ('would', [932, 1286]), ('use', [925, 601]), ('get', [903, 1004]), ('also', [863, 521]), ('film', [789, 631]), ('j', [766, 514]), ('much', [752, 751]), ('even', [722, 870]), ('first', [719, 687]), ('read', [715, 645]), ('really', [712, 700]), ('k', [687, 373]), ('love', [685, 262]), ('many', [639, 459]), ("'ve", [618, 473]), (').', [601, 461]), ('best', [601, 260]), ('l', [597, 381]), ('way', [585, 561]), ('little', [579, 394]), ('b', [572, 453]), ('new', [571, 465]), ('work', [562, 583]), ('people', [544, 478]), ('see', [543, 414]), ('two', [540, 494]), ('r', [537, 411]), ('...', [535, 641]), ('g', [533, 404]), ('man', [532, 386]), ('us', [528, 339]), ('vd', [526, 422]), ('story', [520, 442]), ('better', [507, 559]), ('make', 

**Among each type of top-100 frequent n-grams above**, we aim to further find out the n-grams which **most likely** lead to *positive*/*negative* predictions (positive/negative shortcut features). 

TODO🔻: (3) [coding&text question] Design **two different methods to re-rank** the top-100 n-grams to extract shortcut features. For each method, you should extract **1** feature in each n-gram group (n=1, 2, 3, 4) for positive and negative predictions (1\*4\*2=8 features in total per method).

Explain each of your design choices in natural language, and compare which method finds more reasonable patterns.


**METHOD1** This method ranks n-grams by positivity, computed as the ratio of positive to negative prediction counts (with 1 added to each count to avoid division by zero).

In [11]:
# Method 1

def positive_negative_words_ratio(top_100):
    top_100_reranked = {}
    top_100_tmp = top_100.copy()
    for n, counts in top_100_tmp.items():
        # to avoid division by zero add 1 to every count
        counts = [(ngram, (pos + 1, neg + 1)) for ngram, (pos, neg) in counts]
        top_100_reranked[n] = sorted(counts, key=lambda x: x[1][0] / x[1][1], reverse=True)

    return top_100_reranked


top_100_reranked1 = positive_negative_words_ratio(top_100)
for n, counts in top_100_reranked1.items():
    print(f'Top-1 most positive {n+1}-grams:')
    print(counts[:1])

for n, counts in top_100_reranked1.items():
    print(f'Top-1 most negative {n+1}-grams:')
    print(counts[-1:])

Top-1 most positive 1-grams:
[('easy', (496, 121))]
Top-1 most positive 2-grams:
[('easy to', (309, 68))]
Top-1 most positive 3-grams:
[('highly recommend this', (56, 2))]
Top-1 most positive 4-grams:
[('this is an excellent', (21, 1))]
Top-1 most negative 1-grams:
[('would', (933, 1287))]
Top-1 most negative 2-grams:
[("didn 't", (247, 352))]
Top-1 most negative 3-grams:
[('i had to', (55, 105))]
Top-1 most negative 4-grams:
[("i 'm going to", (19, 35))]


**METHOD2** This method scores each n-gram by the difference between its positive and negative counts, divided by 2 raised to its total count: `(pos - neg) / 2**(pos + neg)`. The numerator rewards n-grams that occur more often with positive than with negative predictions, so an n-gram with somewhat fewer positives but far fewer negatives can outrank one that has many positives and also many negatives.
The exponential denominator damps very high counts, so that n-grams with large counts do not dominate the ranking.

In [31]:
# Method 2 
import math
def pos_neg_nuanced(top_100):
    top_100_reranked = {}
    top_100_tmp = top_100.copy()
    
    for n, counts in top_100_tmp.items():
        # score = (pos - neg) / 2**(pos + neg): count difference damped by an exponential in the total count
        counts = [(ngram, (pos, neg)) for ngram, (pos, neg) in counts]
        top_100_reranked[n] = sorted(counts, key=lambda x: (x[1][0]-x[1][1]) / 2**(x[1][0] + x[1][1])  , reverse=True)

    return top_100_reranked

top_100_reranked2 = pos_neg_nuanced(top_100)
for n, counts in top_100_reranked2.items():
    print(f'Top-1 most positive {n+1}-grams:')
    print(counts[:1])
for n, counts in top_100_reranked2.items():
    print(f'Top-1 most negative {n+1}-grams:')
    print(counts[-1:])
    

Top-1 most positive 1-grams:
[('price', (353, 173))]
Top-1 most positive 2-grams:
[('i love', (256, 76))]
Top-1 most positive 3-grams:
[('i love this', (51, 4))]
Top-1 most positive 4-grams:
[('this is a wonderful', (14, 0))]
Top-1 most negative 1-grams:
[('end', (326, 350))]
Top-1 most negative 2-grams:
[('so i', (226, 246))]
Top-1 most negative 3-grams:
[('i am a', (47, 52))]
Top-1 most negative 4-grams:
[('the back of the', (13, 14))]


TODO🔻: Compare and discuss the results from two methods above.

**first method**

The first method performs well: the features it extracts are meaningful for the sentiment of a review. Words like "easy", "recommend" and "excellent" show appreciation and appear far more often in positive reviews (for example, "easy" has 496 positive counts against 121 negative ones). Similarly, "had to", "need to" and "'t" show frustration and appear more often in negative reviews.

**second method**

Method 2 performs very well on the positive n-grams, since 'i love', 'i love this' and 'this is a wonderful' are among the most reliable indicators of a positive review. On the negative n-grams, however, it performs very poorly: 'end', 'so i', 'i am a' and 'the back of the' carry no clear sentiment, and their positive and negative counts are almost equal.

**OVERALL COMPARISON**

Even though the second method performed better on the positive n-grams, the first one is the better choice overall.
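
One possible explanation for the weak negative n-grams of method 2, offered as an assumption rather than a verified diagnosis: the exponential denominator shrinks the score of frequent n-grams so strongly that total count matters more than the imbalance between positive and negative counts. The sketch below recomputes the method 2 score exactly with `fractions.Fraction` (a choice made here to avoid floating-point underflow, not part of the original code) for two 1-grams whose counts are copied from the outputs above.

```python
from fractions import Fraction


def method2_score(pos, neg):
    """Exact value of (pos - neg) / 2**(pos + neg)."""
    return Fraction(pos - neg, 2 ** (pos + neg))


# (pos, neg) counts copied from the printed outputs above
counts = {"would": (932, 1286), "end": (326, 350)}
scores = {ngram: method2_score(pos, neg) for ngram, (pos, neg) in counts.items()}

# Same ordering as in method 2: highest score first, lowest score last
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Under this score, "would" (354 more negative than positive counts) ends up closer to zero than "end" (only 24 more), because its total count is much larger. This is consistent with "end" being reported as the most negative 1-gram.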

## 🎯 Q2.3: **Case Study (8 pts)**

TODO🔻: Among the shortcut features you found in 2.1, find out **4 representative** cases (pair of `[review, n-gram feature]`) where the shortcut feature **will lead to a wrong prediction**. 

For example, the 1-gram feature "excellent" has been considered as a shortcut for *positive* sentiment, while the ground-truth label of the given review containing "excellent" is *negative*.

**Questions:**
- Based on your case study, do you detect any limitations of the n-gram patterns?
- Which type of n-gram (1/2/3/4-gram) pattern is more robust to be used for sentiment prediction shortcut and why?

In [35]:
# TODO: you can fill your code for finding cases here
positive_ngrams = []
negative_ngrams = []

for n, counts in top_100_reranked1.items():
    positive_ngrams.append(counts[:1])
    negative_ngrams.append(counts[-1:])

for n, counts in top_100_reranked2.items():
    positive_ngrams.append(counts[:1])
    negative_ngrams.append(counts[-1:])

indexes_pos = [0, 4]
indexes_neg = [0, 4]
# Indices 0 and 4 select the top 1-gram of method 1 and of method 2.
# For each of them, print one review that contains the 1-gram and whose
# prediction matches the polarity of the shortcut. Whether that prediction is
# actually wrong still has to be checked against the review itself.
for pred_file in prediction_files:

    with jsonlines.open(pred_file, mode="r") as reader:
        preds = [pr for pr in reader.iter()]
        for pred in preds:

            review_words = [word.strip("Ġ") for word in tokenizer.tokenize(pred["review"].lower()) if word.strip("Ġ")]
            original_review = pred["review"]

            # Iterate over a copy, since matched indices are removed from the list
            for idx in list(indexes_pos):
                if positive_ngrams[idx][0][0] in review_words and pred["prediction"] == "positive":
                    print("word :", positive_ngrams[idx][0][0], "|Negative :", original_review, "|Positive :")
                    indexes_pos.remove(idx)

            for idx in list(indexes_neg):
                if negative_ngrams[idx][0][0] in review_words and pred["prediction"] == "negative":
                    print("word :", negative_ngrams[idx][0][0], "|Positive :", original_review, "|Negative :")
                    indexes_neg.remove(idx)


word : would |Positive : Having read Clive Cussler adventures for over 10 years, i was totally disgusted with the redneck and narrow minded attitudes expressed in the book. The father, who does not stay with his son, decides to revenge his son fighting a war trying to kill people in someone else's country. That is perfectly justified! The book mentions Hindu mercenaries who have not been seen anywhere in the world in any century much less this one. This is a direct insult to all Hindus as being one I am astonished at the insensitivity of the author. How come "Hickman" is not a "Christian" mercenary or for that matter the Corporation as they seem to be in this for the money. The justifications are ridiculous. India and Hindus of India have been terrorized, jailed, colonized and robbed by Christian and Muslims alike over the last 1000 plus years. You don't find us calling people by religion and we have all major religions in our secular country. We have grown spiritually to accept everyo

TODO🔻: (Write your case study discussions and answers to the questions here.)
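Three of the 4 words that led to misclassification are generic words that can carry both positive and negative meanings ("would", "end", "price"). This is the main limitation of n-gram patterns: a short n-gram ignores its context, so the same word can appear in reviews of either sentiment. Hence we can conclude that the higher the n, the better the n-gram captures the intended meaning and the more reliably it can be classified.

-> The most robust is the 4-gram.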
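
The search for such cases can be extended to n-grams with n > 1, as sketched below. This is a minimal illustration, not the code used for the case study above: it assumes each record has `"review"` and `"label"` keys, it uses plain whitespace tokenization instead of the RoBERTa tokenizer, and the function names and toy records are invented for the example.

```python
def contains_ngram(tokens, ngram_tokens):
    """Return True if ngram_tokens occurs as a contiguous run inside tokens."""
    n = len(ngram_tokens)
    return any(tokens[i:i + n] == ngram_tokens for i in range(len(tokens) - n + 1))


def find_conflicting_cases(records, ngram, shortcut_label):
    """Return reviews that contain `ngram` but whose gold label differs from `shortcut_label`."""
    ngram_tokens = ngram.lower().split()
    cases = []
    for record in records:
        tokens = record["review"].lower().split()  # assumption: whitespace tokenization
        if contains_ngram(tokens, ngram_tokens) and record["label"] != shortcut_label:
            cases.append(record["review"])
    return cases


# Toy records, invented for illustration only
records = [
    {"review": "I love this book", "label": "positive"},
    {"review": "I love this cover but the story is dull", "label": "negative"},
    {"review": "Not for me", "label": "negative"},
]
print(find_conflicting_cases(records, "i love this", "positive"))
```

Comparing against the gold label, rather than the prediction, directly yields reviews where following the shortcut would give the wrong answer.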

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Part 3: Annotate New Data (25 pts)**

In this part, you will **annotate** the gold labels of some **new** SA data samples, and measure the degree of **agreement** between your and **one or two partners'** annotations.
    
</div>

## 🎯 Q3.1: **Write an Annotation Guideline (5 pts)**

TODO🔻: Imagine that you are going to assign this annotation task to a crowdsourcing worker who is not at all familiar with computer science and NLP. Think about how you would explain this annotation task to them in order to guide them to do a decent job. Write an annotation guideline for such a worker who is going to do this task for you.

**Note:** You should come up with your own guideline without the help of your partner(s) in later Part 3.2

You will be given a list of texts (each text is called an "instance"). Your task is to read each instance carefully and decide whether the sentiment expressed is **positive** or **negative**, that is, whether the writer is happy or frustrated with the item being reviewed. To do so, follow these guidelines:
- Read the text carefully, especially the **beginning and the ending**, since they might contain the reviewer's overall feeling or a summary of the reviewer's point of view.
- Look for positive adjectives and adverbs (e.g., "great", "beautifully") or negative ones (e.g., "awful", "terrible").
- Look for recommendations or complaints (e.g., "I highly recommend this product" or "Don't buy this product").
- Use your judgement to decide which sentiment dominates, especially at the end of the text.

## 🎯 Q3.2: **Annotate Your Datapoints with Partner(s) (8 pts)**

TODO🔻: Annotate 80 datapoints (20 in each domain of "books", "dvd", "electronics" and "housewares") assigned to you and your partner(s), by editing the value of the key **"label"** in each datapoint. You and your partner(s) should annotate **independently of each other**, i.e., each of you provide your own 80 annotations.

Please find your assigned annotation dataset **ID** and **your partner(s)** according to this [list](https://docs.google.com/spreadsheets/d/1hOwBUb8XE8fitYa4hlAwq8mARZe3ZsL4/edit?usp=sharing&ouid=108194779329215429936&rtpof=true&sd=true). Your annotation dataset can be found [here](https://drive.google.com/drive/folders/1IHXU_v3PDGbZG6r9T5LdjKJkHQ351Mb4?usp=sharing).

**Name your annotated file as `<your_assigned_dataset_id>-<your_sciper_number>.jsonl`.**

**You should also submit your partner's annotated file `<assigned_dataset_id>-<your_partner_sciper_number>.jsonl`.**

## 🎯 Q3.3: **Agreement Measure (12 pts)**

TODO🔻: Based on your and your partner's annotations in 3.2, calculate the [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-kappa) or [Krippendorff's Alpha](https://github.com/pln-fing-udelar/fast-krippendorff) (if you are in a group of three students) between the annotators on **each domain** and **across all domains**.

**Note:** Cohen's Kappa or Krippendorff's Alpha interpretation

0: No Agreement

0 ~ 0.2: Slight Agreement

0.2 ~ 0.4: Fair Agreement

0.4 ~ 0.6: Moderate Agreement

0.6 ~ 0.8: Substantial Agreement

0.8 ~ 1.0: Near Perfect Agreement

1.0: Perfect Agreement
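
The scale above can be turned into a small lookup, sketched here under two assumptions that the scale itself does not settle: a boundary value such as 0.2 is assigned to the lower band, and negative scores are reported as "No Agreement". The function name `interpret_kappa` is made up for this example.

```python
def interpret_kappa(score):
    """Map an agreement score to the verbal label of the scale above."""
    if score <= 0:
        return "No Agreement"  # assumption: negative scores are treated like 0
    if score >= 1.0:
        return "Perfect Agreement"
    bands = [
        (0.2, "Slight Agreement"),
        (0.4, "Fair Agreement"),
        (0.6, "Moderate Agreement"),
        (0.8, "Substantial Agreement"),
    ]
    for upper_bound, label in bands:
        if score <= upper_bound:  # assumption: boundary values fall in the lower band
            return label
    return "Near Perfect Agreement"


print(interpret_kappa(0.65))
```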

**Questions:**
- What is the overall degree of agreement between you and your partner(s) according to the above interpretation of score ranges?
- In which domains do disagreements between you and your partner(s) happen most and least frequently? Give some examples to explain why that is the case.
- Are there possible ways to address the disagreements between annotators?

In [41]:
# Fill your code for calculating agreement scores here.

from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score

prediction_files = ['data/74-315196.jsonl', 'data/74-353333.jsonl']

def calculate_agreement_scores(prediction_files):
    """Cohen's Kappa between the two annotators across all domains."""

    with jsonlines.open(prediction_files[0], mode="r") as reader:
        preds1 = [pr for pr in reader.iter()]
        predictions1 = [pr["label"] for pr in preds1]

    with jsonlines.open(prediction_files[1], mode="r") as reader:
        preds2 = [pr for pr in reader.iter()]
        predictions2 = [pr["label"] for pr in preds2]

    # cohens kappa score
    kappa = cohen_kappa_score(predictions1, predictions2)

    return kappa


agreements = calculate_agreement_scores(prediction_files)
print(f'Score: {agreements}')

Score: 0.7761194029850746
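
To make the score easier to interpret, Cohen's Kappa can also be computed by hand as `(p_o - p_e) / (1 - p_e)`, where `p_o` is the observed agreement and `p_e` the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration with invented toy labels. It is not the scikit-learn implementation used above, and returning NaN in the degenerate case is a choice made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies, summed over labels
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy annotations, invented for illustration only
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]
print(cohen_kappa(annotator_1, annotator_2))
```

In the toy example the annotators agree on 3 of 4 items (`p_o = 0.75`) while chance alone would give `p_e = 0.5`, so the score is 0.5.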


In [40]:
def calculate_agreement_scores_per_domain(prediction_files):
    """Cohen's Kappa between the two annotators for each of the four domains."""

    with jsonlines.open(prediction_files[0], mode="r") as reader:
        preds1 = [pr for pr in reader.iter()]
        predictions1 = [pr["label"] for pr in preds1]

    with jsonlines.open(prediction_files[1], mode="r") as reader:
        preds2 = [pr for pr in reader.iter()]
        predictions2 = [pr["label"] for pr in preds2]

    # Cohen's kappa per domain; this assumes the 80 datapoints are stored in blocks of 20
    # per domain, in the order books, dvd, electronics, housewares
    kappa1 = cohen_kappa_score(predictions1[:20], predictions2[:20])
    kappa2 = cohen_kappa_score(predictions1[20:40], predictions2[20:40])
    kappa3 = cohen_kappa_score(predictions1[40:60], predictions2[40:60])
    kappa4 = cohen_kappa_score(predictions1[60:80], predictions2[60:80])

    return kappa1, kappa2, kappa3, kappa4

scores = calculate_agreement_scores_per_domain(prediction_files)
print(f'Agreement Scores per domain: {scores}')

Agreement Scores per domain: (0.7333333333333334, 0.7727272727272727, 0.6428571428571428, 0.7938144329896907)
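
Slicing by position only works if the file is ordered by domain. A more defensive variant, sketched below, groups the labels by a `"domain"` key instead. It assumes that both annotation files list the same datapoints in the same order and that each datapoint carries a `"domain"` field, which is not confirmed for the annotation files. The helper names and toy records are hypothetical, and `raw_agreement` stands in for the metric so the example has no dependencies; `cohen_kappa_score` could be passed in its place.

```python
from collections import defaultdict


def agreement_per_domain(records_a, records_b, metric):
    """Apply `metric(labels_a, labels_b)` separately to each domain."""
    grouped = defaultdict(lambda: ([], []))
    for rec_a, rec_b in zip(records_a, records_b):
        labels_a, labels_b = grouped[rec_a["domain"]]
        labels_a.append(rec_a["label"])
        labels_b.append(rec_b["label"])
    return {domain: metric(a, b) for domain, (a, b) in grouped.items()}


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Toy annotations, invented for illustration only
annotator_1 = [
    {"domain": "books", "label": "positive"},
    {"domain": "dvd", "label": "negative"},
    {"domain": "dvd", "label": "positive"},
]
annotator_2 = [
    {"domain": "books", "label": "positive"},
    {"domain": "dvd", "label": "negative"},
    {"domain": "dvd", "label": "negative"},
]
print(agreement_per_domain(annotator_1, annotator_2, raw_agreement))
```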


- The overall Cohen's Kappa is about 0.78, which falls in the 0.6 ~ 0.8 range of the interpretation above, i.e. substantial agreement.

- The third domain, 'electronics', had the lowest agreement score, while 'housewares' had the highest. So disagreements were most frequent in electronics and least frequent in housewares. This might be because housewares reviews are straight to the point, whereas electronics reviews tend to be longer and less clearly binary.

- There are ways to address disagreements, for example an adjudication process in which a third annotator or a program reviews the instances with disagreements and makes a final decision.
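
A first step of such an adjudication process could be to list the datapoints on which the two annotators disagree, as in the sketch below. The function name and toy records are hypothetical, and the sketch assumes both annotation lists are in the same order.

```python
def find_disagreements(records_a, records_b):
    """Return (index, label_a, label_b) for every datapoint labelled differently."""
    return [
        (index, rec_a["label"], rec_b["label"])
        for index, (rec_a, rec_b) in enumerate(zip(records_a, records_b))
        if rec_a["label"] != rec_b["label"]
    ]


# Toy annotations, invented for illustration only
annotator_1 = [{"label": "positive"}, {"label": "negative"}, {"label": "positive"}]
annotator_2 = [{"label": "positive"}, {"label": "positive"}, {"label": "positive"}]
print(find_disagreements(annotator_1, annotator_2))
```

Only the listed datapoints would then need to be shown to the adjudicator.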

<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

## **Part 4: Data Augmentation (20 pts)**

We only used 20% of the whole dataset for training, which might limit the model performance. In the final part, we will try to enlarge the training set by **data augmentation**.  

Specifically, we will **`Rephrase`** some current training samples using a pretrained paraphraser, so that the paraphrased synthetic samples remain semantically similar to the originals while changing the surface form.

You can use the pretrained T5 paraphraser [here](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base).

</div>

In [10]:
device="mps" if torch.backends.mps.is_available() else "cpu"

## 🎯 Q4.1: **Data Augmentation with Paraphrasing (15 pts)**
TODO🔻: Implement functions named `get_paraphrase_batch` and `get_paraphrase_dataset` with the details in the below two blocks. 

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm
# get the given pretrained paraphrase model and the corresponding tokenizer (https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base)
paraphrase_tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def get_paraphrase_batch(
    model,
    tokenizer,
    input_samples,
    n,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=256,
    device='mps'):
    '''
    Input
      model: paraphraser
      tokenizer: paraphrase tokenizer
      input_samples: a batch (list) of real samples to be paraphrased
      n: number of paraphrases to get for each input sample
      for other parameters, please refer to:
          https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig
    Output:
      synthetic_samples: a list of paraphrased samples
    '''

    # Paraphrase each sample of the batch and collect the results
    synthetic_samples = []
    # Set the model to evaluation mode and move it to the specified device
    model.eval()
    model.to(device)

    # Process each input sample
    for sample in input_samples:
        # Truncate long reviews to max_length characters (not tokens),
        # working on a local copy so that the input sample is not modified
        review = sample["review"][:max_length]
        input_ids = tokenizer.encode("paraphrase: " + review, return_tensors="pt").to(device)

        # Generate n paraphrases with diverse beam search (n must not exceed num_beams).
        # temperature, top_k and top_p only apply to sampling and have no effect with do_sample=False.
        paraphrases = model.generate(
            input_ids,
            max_length=max_length,
            num_return_sequences=n,
            no_repeat_ngram_size=no_repeat_ngram_size,
            repetition_penalty=repetition_penalty,
            temperature=temperature,
            diversity_penalty=diversity_penalty,
            do_sample=False,
            top_k=50,
            top_p=0.95,
            early_stopping=True,
            num_beams=2,
            num_beam_groups=2
        )

        # Decode the generated ids to text and keep the domain and label of the original sample
        for paraphrase in paraphrases:
            synthetic_samples.append({
                "review": tokenizer.decode(paraphrase, skip_special_tokens=True),
                "domain": sample["domain"],
                "label": sample["label"],
            })

    return synthetic_samples



In [11]:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

data_dir = 'data'
data_train_path = os.path.join(data_dir, 'train_sa.jsonl')
BATCH_SIZE = 8
N_PARAPHRASE = 2

def get_paraphrase_dataset(model, tokenizer, data_path, batch_size, n_paraphrase):
    '''
    Input
      model: paraphrase model
      tokenizer: paraphrase tokenizer
      data_path: path to the `jsonl` file of training data
      batch_size: number of input samples to be paraphrases in one batch
      n_paraphrase: number of paraphrased sequences for each sample
    Output:
      paraphrase_dataset: a list of all paraphrase samples. Do not include the original training data.
    '''
    
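    paraphrase_dataset = []
    batch = []
    with jsonlines.open(data_path, mode='r') as reader:
        for obj in tqdm(reader):
            batch.append({"review": obj['review'], "domain": obj['domain'], "label": obj['label']})
            if len(batch) == batch_size:
                paraphrases = get_paraphrase_batch(
                    model=model,
                    tokenizer=tokenizer,
                    input_samples=batch,
                    n=n_paraphrase,
                    device=device
                )
                paraphrase_dataset.extend(paraphrases)
                batch = []

    # Paraphrase the last, incomplete batch when the dataset size is not a multiple of batch_size
    if batch:
        paraphrases = get_paraphrase_batch(
            model=model,
            tokenizer=tokenizer,
            input_samples=batch,
            n=n_paraphrase,
            device=device
        )
        paraphrase_dataset.extend(paraphrases)

    return paraphrase_dataset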
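
The batching logic of `get_paraphrase_dataset` could also be factored into a small generator, sketched below. The name `iter_batches` is hypothetical and not part of the assignment code. The generator yields the final partial batch as well, so no sample is skipped when the dataset size is not a multiple of the batch size.

```python
def iter_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items, including the final partial batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


print(list(iter_batches(range(5), 2)))
```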


**Note:** Running the paraphrasing takes ~20-30 minutes on a T4 Colab GPU, but the running time depends on the implementation.

In [12]:
paraphrase_dataset = get_paraphrase_dataset(paraphrase_model, paraphrase_tokenizer, data_train_path, BATCH_SIZE, N_PARAPHRASE)

0it [00:00, ?it/s]

1600it [46:27,  1.74s/it]


In [13]:
# Original training dataset
with jsonlines.open(data_train_path, "r") as reader:
    origin_data = [dt for dt in reader.iter()]

all_data = origin_data + paraphrase_dataset

# Write all the original and paraphrased data samples into training dataset
augmented_data_train_path = os.path.join(data_dir, 'augmented_train_sa.jsonl')
with jsonlines.open(augmented_data_train_path, "w") as writer:
    writer.write_all(all_data)

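# Sizes of the original, paraphrased and augmented datasets
print(len(origin_data), len(paraphrase_dataset), len(all_data))
assert len(all_data) == 3 * len(origin_data)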

1600 3200 4800


## 🎯 Q4.2: **Retrain RoBERTa Model with Data Augmentation (5 pts)** 
TODO🔻: Retrain the sentiment analysis model with the augmented (original+paraphrased), larger dataset :)

**Note:** *Training on the augmented data will take about 15 minutes using a T4 Colab GPU.*

In [12]:
# Re-train a RoBERTa SA model on the augmented training dataset
learning_rate = 1e-5
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model_name = "FacebookAI/roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

train(train_dataset=SADataset("data/augmented_train_sa.jsonl", tokenizer=tokenizer),
      dev_dataset=SADataset("data/test_sa.jsonl", tokenizer=tokenizer),
      model=model, device=device,
      batch_size=8, epochs=4, learning_rate=learning_rate, warmup_percent=0.3, max_grad_norm=1.0,
      model_save_root='models/augmented/', tensorboard_path="./tensorboard/part4_lr{}".format(learning_rate))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Building SA Dataset...


4800it [00:01, 4371.72it/s]


Building SA Dataset...


6400it [00:02, 2596.00it/s]
Training:   0%|          | 0/600 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Training: 100%|██████████| 600/600 [03:24<00:00,  2.93it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.48it/s]


Epoch: 0 | Training Loss: 0.554 | Validation Loss: 0.319
Epoch 0 SA Validation:
Confusion Matrix:
[[2881.  319.]
 [ 344. 2856.]]
F1: (89.68%, 89.60%) | Macro-F1: 89.64%
Model Saved!


Training: 100%|██████████| 600/600 [03:12<00:00,  3.11it/s]
Evaluation: 100%|██████████| 800/800 [00:54<00:00, 14.55it/s]


Epoch: 1 | Training Loss: 0.369 | Validation Loss: 0.286
Epoch 1 SA Validation:
Confusion Matrix:
[[2818.  382.]
 [ 248. 2952.]]
F1: (89.95%, 90.36%) | Macro-F1: 90.15%
Model Saved!


Training: 100%|██████████| 600/600 [03:10<00:00,  3.15it/s]
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.53it/s]


Epoch: 2 | Training Loss: 0.192 | Validation Loss: 0.631
Epoch 2 SA Validation:
Confusion Matrix:
[[3012.  188.]
 [ 530. 2670.]]
F1: (89.35%, 88.15%) | Macro-F1: 88.75%


Training: 100%|██████████| 600/600 [07:36<00:00,  1.31it/s]  
Evaluation: 100%|██████████| 800/800 [00:55<00:00, 14.52it/s]

Epoch: 3 | Training Loss: 0.085 | Validation Loss: 0.652
Epoch 3 SA Validation:
Confusion Matrix:
[[2782.  418.]
 [ 236. 2964.]]
F1: (89.48%, 90.06%) | Macro-F1: 89.77%
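
The F1 values in the log can be cross-checked from the confusion matrices, as sketched below for epoch 1. The function is an illustration written for this check and not the evaluation code of `sa.py`. It uses the identity F1 = 2·TP / (2·TP + FP + FN), which gives the same result whether rows hold the gold labels or the predictions.

```python
def f1_from_confusion(matrix):
    """Per-class F1 and macro-F1 from a 2x2 confusion matrix given as nested lists."""
    f1_scores = []
    for k in range(2):
        true_positives = matrix[k][k]
        row_total = sum(matrix[k])
        column_total = matrix[0][k] + matrix[1][k]
        # row_total + column_total equals 2*TP + FP + FN for class k
        denominator = row_total + column_total
        f1_scores.append(2 * true_positives / denominator if denominator else 0.0)
    return f1_scores, sum(f1_scores) / len(f1_scores)


# Confusion matrix of epoch 1, copied from the log above
f1_scores, macro_f1 = f1_from_confusion([[2818, 382], [248, 2952]])
print([f"{100 * score:.2f}%" for score in f1_scores], f"{100 * macro_f1:.2f}%")
```

For epoch 1 this reproduces the logged values of 89.95%, 90.36% and 90.15%.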





TODO🔻: Discuss your results by answering the following questions

- Compare the performances of models in Part 1 and Part 4. Does the data augmentation help with the performance and why (give possible reasons)?
- No matter whether the data augmentation helps or not, list **three** possible ways to improve our current data augmentation method.

- The performance of the model in Part 4 improved only slightly compared to the one in Part 1: the F1 score increased by approximately 0.06%, which is insignificant. We can therefore conclude that data augmentation did not significantly help the model in our case.

- Ways to improve the data augmentation method:

• Replace words in the text with their synonyms to create a new sentence with a similar meaning. 

• Translate the text into another language and then back to English (back-translation). This introduces linguistic variation and can significantly change the sentence structure.

• Use the previous model to generate new sentences that contrast with the original text in terms of sentiment.
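
As an illustration of the first idea, synonym replacement can be sketched in a few lines. The synonym table below is hand-written and hypothetical; a real setup would need a proper lexical resource and some care to keep the sentence grammatical and the label unchanged.

```python
import random


def synonym_replace(text, synonyms, rng=None):
    """Replace every word that has an entry in `synonyms` with a randomly chosen synonym."""
    rng = rng or random.Random(0)
    words = []
    for word in text.split():
        options = synonyms.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)


# Hypothetical, hand-written synonym table for illustration only
synonyms = {"great": ["excellent"], "movie": ["film"]}
print(synonym_replace("This great movie works", synonyms))
```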


<div style="padding:15px 20px 20px 20px;border-left:3px solid orange;background-color:#fff5d6;border-radius: 20px;color:#424242;">

### **5 Upload Your Notebook, Data and Models**

Please upload your filled jupyter notebook in your GitHub Classroom repository, **with all cells run and output results shown**.

**Note:** We are **not** responsible for re-running the cells in your notebook.

Please also submit all your **datasets** **(annotated and augmented)**, as well as **all your trained models** in Part 1 and Part 4, in your GitHub Classroom repository.
    
</div>