In [None]:
! pip install transformers datasets
! pip install evaluate
! pip install sentence-transformers


# Part A: Fine-tune a pretrained model

Language models are trained in two stages:

1. **Pre-training on large unlabelled datasets**:

   Pre-training is computationally very expensive, which is why in practice we do not repeat it each time we want to apply a model to a new dataset. We can think of pre-training as the process of learning linguistic rules and concepts that can later be reused for various purposes.

2. **Fine-tuning on smaller labelled datasets**:

   Fine-tuning essentially leverages the properties of transfer learning to transfer the "knowledge" that has been stored in the language model during pre-training to specific tasks. Each task is served by targeted datasets. For example, some datasets relate to text classification, others contain questions that need to be answered (question answering), and many others.

Some classic natural language processing tasks include:
- Text classification  
- Question answering  
- Natural language inference  
- Fill mask  
- Semantic similarity  

You can browse models for these and other tasks under the Natural Language Processing category at: [https://huggingface.co/models](https://huggingface.co/models)

In the first part of this lab exercise, we will use the pre-training/fine-tuning scenario to classify reviews.


## Pipelines

Using the **text-classification pipeline**, we can run language models for classification tasks.

The natural language inference (NLI) task is a classification task, since the relevant model (in this case, `roberta-large-mnli`) is required to classify a text into one of three categories: **[entailment / neutral / contradiction]**.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model = "roberta-large-mnli")
classifier("A soccer game with multiple males playing. Some men are playing a sport.")
# [{'label': 'ENTAILMENT', 'score': 0.98}]

```

Another classification task involves evaluating whether a sentence is grammatically correct (acceptable) or not (unacceptable):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model = "textattack/distilbert-base-uncased-CoLA")
classifier("I will walk to home when I went through the bus.")
# [{'label': 'unacceptable', 'score': 0.95}]
```
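To make the `label` and `score` fields in these outputs less opaque, here is a minimal sketch of the post-processing step a single-label classifier typically applies: a softmax over the raw logits followed by picking the most probable class. The logits and the `id2label` mapping below are invented for illustration and are not taken from `roberta-large-mnli`.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits and label mapping for a 3-class NLI model
id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}
result = top_label([-2.0, 0.5, 4.0], id2label)
print(result["label"], round(result["score"], 2))
```

The score is therefore a probability relative to the other classes of the same model, not an absolute measure of confidence.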

## Yelp Polarity Dataset

We download the [Yelp Polarity](https://huggingface.co/datasets/yelp_polarity) dataset, which contains customer reviews of businesses such as restaurants.

The Yelp dataset was constructed by considering 1- and 2-star reviews as negative, and 3- and 4-star reviews as positive. In the original release, negative polarity corresponds to category 1 and positive polarity to category 2; the Hugging Face version loaded below encodes them as labels 0 and 1, respectively. Our goal is to classify new reviews into the correct category.




In [None]:
from datasets import load_dataset

# insert your code here

dataset = load_dataset("yelp_polarity")
# print the first example of the dataset
print(dataset['train'][0])


{'text': "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.", 'label': 0}


Since the Yelp Polarity dataset contains many samples, in order to speed up the fine-tuning process, we recommend keeping 300 samples from the training set and 300 samples from the test set.

Check the number of categories that exist in the training and test sets, and ensure a balanced number of samples per category when selecting the 300 samples for each set.


In [None]:
# insert your code here

from collections import Counter
from datasets import concatenate_datasets
# Count the examples per label in the train and test sets
# (reading the label column directly avoids building every example dict)
print(Counter(dataset["train"]["label"]))
print(Counter(dataset["test"]["label"]))

# Randomly choose 300 balanced examples (150 with label 0, 150 with label 1) from each split
train_label_0 = dataset["train"].filter(lambda example: example["label"] == 0)
train_label_1 = dataset["train"].filter(lambda example: example["label"] == 1)
# Shuffle with a fixed seed (reproducible across runs), then keep the first 150
train_label_0 = train_label_0.shuffle(seed=42).select(range(150))
train_label_1 = train_label_1.shuffle(seed=42).select(range(150))
# Merge the two classes and shuffle again so the labels are interleaved
train_set = concatenate_datasets([train_label_0, train_label_1]).shuffle(seed=42)

test_label_0 = dataset["test"].filter(lambda example: example["label"] == 0)
test_label_1 = dataset["test"].filter(lambda example: example["label"] == 1)
test_label_0 = test_label_0.shuffle(seed=42).select(range(150))
test_label_1 = test_label_1.shuffle(seed=42).select(range(150))
test_set = concatenate_datasets([test_label_0, test_label_1]).shuffle(seed=42)


Counter({0: 280000, 1: 280000})
Counter({1: 19000, 0: 19000})


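The balanced selection can also be illustrated without the `datasets` library. The following is a minimal pure-Python sketch of the same idea (group indices by label, draw the same number from each group with a fixed seed, then shuffle); `balanced_sample` and `toy_labels` are hypothetical names used only for this illustration.

```python
import random
from collections import Counter

def balanced_sample(labels, per_class, seed=42):
    """Return indices of `per_class` randomly chosen examples for each label."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(chosen)  # interleave the classes
    return chosen

# Toy label column standing in for the real label column of a split
toy_labels = [0, 1] * 500
indices = balanced_sample(toy_labels, per_class=150)
print(Counter(toy_labels[i] for i in indices))
```

Using a fixed seed means the same 300 examples are selected on every run, which keeps the later hyperparameter comparisons fair.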

# Language Models

Text preprocessing is performed before feeding input into language models.

This process is carried out using **Tokenizers**, which split the input text into tokens and map them to the appropriate IDs from the pretraining vocabulary, thus transforming text into a format that can be processed by a Transformer model. The Hugging Face library provides simple, high-level implementations of tokenization, which we recommend you follow.

Specifically, **we initialize the tokenization process using AutoTokenizer**. By calling the **from_pretrained** method, we obtain a tokenizer that matches the architecture of the model we want to use, ensuring compatible tokenization.

More information about AutoTokenizer can be found here:  
https://huggingface.co/docs/transformers/model_doc/auto
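As a rough illustration of what "mapping tokens to IDs" means, here is a toy sketch with a hypothetical eight-entry vocabulary. Real tokenizers such as the one for BERT use subword splitting and vocabularies with tens of thousands of entries, so this is only a simplified picture.

```python
# Hypothetical miniature vocabulary, invented for this example
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "food": 5, "was": 6, "great": 7}

def encode(text):
    """Lowercase, split on whitespace, and map each token to its vocabulary ID."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the [UNK] ID
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = encode("The food was great")
print(ids)      # [2, 4, 5, 6, 7, 3]
unknown = encode("The sushi was great")
print(unknown)  # [2, 4, 1, 6, 7, 3]
```

Because the IDs only make sense relative to one vocabulary, the tokenizer and the model must come from the same checkpoint.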

Regarding the BERT model, you can see the [tokenization and model initialization process here](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer):

```python
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
```

As part of this exercise, you are required to perform the above procedure using *another model of your choice from Hugging Face* that supports AutoTokenizer. The pre-trained model you choose must have an implementation with a sequence classification head (similar to the BertForSequenceClassification class).

In the next cell, load the selected model along with its corresponding tokenizer.

(You can ignore possible warnings such as: "Some weights of the model checkpoint at xxx were not used when initializing...")


In [None]:
# insert your code here

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We provide you with the function that performs tokenization by calling the tokenizer you selected. Apply it to both the training and the test set.


In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# insert your code here

train_tokenized = train_set.map(tokenize_function, batched=True)
test_tokenized = test_set.map(tokenize_function, batched=True)

print(train_tokenized[0])


{'text': 'Seriously one of my favorite places to eat at any one of the locations around the valley. I think I would eat here weekly if I could. They have awesome food, hummus and tabbouleh never cease to satisfy. I love their salads and the chicken schwarma rocks. The service is always good and the atmosphere is always fun with changing art pieces. The newest tapas/ happy hour is awesome and worth taking advantage of. Simply delicious!', 'label': 1, 'input_ids': [101, 5667, 2028, 1997, 2026, 5440, 3182, 2000, 4521, 2012, 2151, 2028, 1997, 1996, 5269, 2105, 1996, 3028, 1012, 1045, 2228, 1045, 2052, 4521, 2182, 4882, 2065, 1045, 2071, 1012, 2027, 2031, 12476, 2833, 1010, 14910, 7606, 1998, 21628, 5092, 9307, 2232, 2196, 13236, 2000, 13225, 1012, 1045, 2293, 2037, 16521, 2015, 1998, 1996, 7975, 8040, 18663, 17830, 5749, 1012, 1996, 2326, 2003, 2467, 2204, 1998, 1996, 7224, 2003, 2467, 4569, 2007, 5278, 2396, 4109, 1012, 1996, 14751, 11112, 3022, 1013, 3407, 3178, 2003, 12476, 1998, 4276, 

By printing the training or test set, you will see two additional fields: 'input_ids' and 'attention_mask'. Make sure they are present, which indicates that tokenization has been successfully performed.
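The relationship between the two fields can be sketched in a few lines. The helper below mimics, in simplified form, what `padding="max_length"` together with `truncation=True` does to one sequence; `pad_or_truncate` is a hypothetical helper, and unlike a real tokenizer it does not preserve the closing special token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut or right-pad one sequence to exactly `max_length` IDs."""
    ids = input_ids[:max_length]                     # truncation
    attention_mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding                   # right-padding
    attention_mask = attention_mask + [0] * padding  # 0 marks padding to be ignored
    return {"input_ids": ids, "attention_mask": attention_mask}

short = pad_or_truncate([101, 2307, 2833, 102], max_length=6)
long_seq = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long_seq)
```

Every example ends up with the same length, and the attention mask tells the model which positions carry actual text.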


In [None]:
train_tokenized

## Using PyTorch Trainer for Fine-Tuning

The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class is provided by Hugging Face and is optimized for Transformers models, offering many conveniences and requiring less manual work. We recommend using it as an alternative to writing your own training loop.

Since the Trainer does not automatically evaluate the performance of the model during training, we provide an appropriate function to measure the model's accuracy at each epoch.


In [None]:
import numpy as np
import evaluate
import torch
from tqdm import tqdm
from transformers import pipeline

metric = evaluate.load("accuracy")

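def compute_metrics(eval_pred):
    # eval_pred holds the model outputs (logits) and the true labels of the evaluation set
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    # Returns a dict such as {"accuracy": 0.92}
    return metric.compute(predictions=predictions, references=labels)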
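For intuition, the same computation can be written without NumPy or `evaluate`. This is a minimal sketch on invented logits, not a replacement for the function above.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Invented logits for four reviews and two classes
toy_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```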



The [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) class contains all the hyperparameters you can experiment with during the fine-tuning process.

You are asked to experiment with different hyperparameters such as learning rate, batch size, etc., and also to define an optimizer and scheduler for fine-tuning. We recommend performing fine-tuning for a small number of epochs (since the model is already pretrained).

1. Provide a markdown table listing the different hyperparameters you tested and the accuracy achieved in the final epoch.

2. Based on your experiments, how do different hyperparameters such as learning rate and batch size affect the fine-tuning of the model you selected? Comment and analyze.
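When choosing a scheduler, it can help to see how a linear schedule changes the learning rate over training. The sketch below follows the usual linear warmup and decay rule; the step count of 95 assumes 300 training samples, batch size 16 (19 batches per epoch) and 5 epochs on a single device, and `linear_schedule_lr` is a hypothetical helper rather than a library function.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: shrink linearly from base_lr to 0 at the last step
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

total_steps = 95  # assumed: 19 batches per epoch x 5 epochs
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 3e-5, total_steps))
```

The schedule is expressed in optimizer steps, so changing the batch size or the number of epochs changes how quickly the learning rate decays.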


## Run 1 - Testing 5 epochs

In [None]:
# Set the seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_1",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    learning_rate=3e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
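# optimizer
# Note: the Trainer only uses a custom optimizer and scheduler if they are passed
# through its `optimizers=(optimizer, lr_scheduler)` argument; otherwise it
# builds its own from `args`.
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler
# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs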

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




In [None]:
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)



Next, fine-tune your model by calling the [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) method:


In [None]:
trained_model=trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6915,0.593918,0.753333
2,0.5734,0.363843,0.866667
3,0.2343,0.254499,0.906667
4,0.0968,0.24207,0.92
5,0.0605,0.219978,0.913333


## Run 2 - Testing 10 epochs

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_2",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    learning_rate=3e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

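# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs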

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.6914,0.598312,0.733333
2,0.5786,0.328404,0.886667
3,0.1857,0.22627,0.92
4,0.0548,0.244534,0.92
5,0.0132,0.273481,0.923333
6,0.0061,0.291253,0.916667
7,0.0043,0.30128,0.93
8,0.0035,0.308632,0.923333
9,0.0031,0.312469,0.923333
10,0.003,0.313718,0.923333


## Run 3 - Testing 15 epochs

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_3",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=15,
    learning_rate=3e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

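# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs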

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.6914,0.596783,0.733333
2,0.5768,0.316368,0.883333
3,0.1758,0.228427,0.91
4,0.053,0.25826,0.923333
5,0.0137,0.289813,0.923333
6,0.0046,0.307757,0.923333
7,0.003,0.319569,0.93
8,0.0023,0.33035,0.926667
9,0.0018,0.338779,0.93
10,0.0017,0.345355,0.93


## Conclusion after running with different numbers of epochs

We observe that 15 epochs give a slightly better result than 10, so we choose this number of epochs.

## Run 4 - 15 epochs / 1e-5 lr

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_4",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=15,
    learning_rate=1e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

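# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs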

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.6925,0.663827,0.676667
2,0.6486,0.595768,0.766667
3,0.531,0.469254,0.823333
4,0.3626,0.350146,0.883333
5,0.2193,0.271967,0.9
6,0.13,0.23878,0.906667
7,0.0667,0.252115,0.91
8,0.0343,0.248502,0.92
9,0.025,0.268263,0.9
10,0.0201,0.264947,0.926667


## Run 5 - 15 epochs / 5e-5 lr

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_5",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=15,
    learning_rate=5e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

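# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs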

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.6899,0.484768,0.79
2,0.3424,0.255017,0.89
3,0.1593,0.214587,0.94
4,0.0246,0.427566,0.89
5,0.0219,0.363982,0.916667
6,0.0026,0.350904,0.933333
7,0.0015,0.496183,0.896667
8,0.0011,0.421333,0.923333
9,0.0008,0.459117,0.92
10,0.0008,0.530174,0.903333


## Conclusion from Testing with Learning Rate

We observe that 1e-5 converges slowly but steadily and reaches the lowest peak accuracy of the three values. The 3e-5 value offers a better balance between speed and quality. With 5e-5 the model learns very quickly, but the validation loss starts to increase from epoch 4 onward while the training loss continues to decrease, which suggests overfitting; the validation accuracy also fluctuates more from epoch to epoch.

We choose to keep 3e-5 because it achieves high accuracy, overfits less sharply than 5e-5, and gives more stable accuracy across epochs.
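The overfitting pattern can be read off the logs mechanically. The sketch below takes the per-epoch values printed for the 5e-5 run above and reports the epoch with the lowest validation loss, plus the epochs where the validation loss rose although the training loss did not; the helper names are hypothetical and the rule is only one simple heuristic.

```python
# (epoch, training loss, validation loss, accuracy) copied from the 5e-5 run
log = [
    (1, 0.6899, 0.484768, 0.79),
    (2, 0.3424, 0.255017, 0.89),
    (3, 0.1593, 0.214587, 0.94),
    (4, 0.0246, 0.427566, 0.89),
    (5, 0.0219, 0.363982, 0.916667),
    (6, 0.0026, 0.350904, 0.933333),
    (7, 0.0015, 0.496183, 0.896667),
    (8, 0.0011, 0.421333, 0.923333),
    (9, 0.0008, 0.459117, 0.92),
    (10, 0.0008, 0.530174, 0.903333),
]

def best_epoch_by_val_loss(rows):
    """Epoch with the lowest validation loss."""
    return min(rows, key=lambda r: r[2])[0]

def epochs_with_rising_val_loss(rows):
    """Epochs whose validation loss rose while the training loss did not increase."""
    flagged = []
    for prev, cur in zip(rows, rows[1:]):
        if cur[2] > prev[2] and cur[1] <= prev[1]:
            flagged.append(cur[0])
    return flagged

best_epoch = best_epoch_by_val_loss(log)
flagged = epochs_with_rising_val_loss(log)
print(best_epoch, flagged)  # 3 [4, 7, 9, 10]
```

A check like this is the basis of early stopping: training could be halted, or the checkpoint restored, at the epoch with the lowest validation loss.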


## Run 6 - 15 epochs / 3e-5 lr / 8 batch size

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_6",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=15,
    learning_rate=3e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

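# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs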

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.5568,0.362134,0.85
2,0.3831,0.38973,0.856667
3,0.177,0.29208,0.91
4,0.0082,0.415439,0.91
5,0.0046,0.365307,0.923333
6,0.0014,0.370375,0.916667
7,0.0009,0.383831,0.913333
8,0.0007,0.393776,0.913333
9,0.0006,0.402756,0.913333
10,0.0005,0.410155,0.913333


## Run 7 - 15 epochs / 3e-5 lr / 32 batch size

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_7",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=15,
    learning_rate=3e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

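# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs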

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# Re-create the model so each run starts from the pretrained checkpoint instead of the previously fine-tuned weights
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.6897,0.630204,0.733333
2,0.5618,0.438858,0.843333
3,0.3129,0.299704,0.9
4,0.1384,0.269099,0.903333
5,0.0501,0.252751,0.926667
6,0.0191,0.331939,0.913333
7,0.0108,0.306852,0.926667
8,0.0066,0.30806,0.926667
9,0.0051,0.318217,0.923333
10,0.0044,0.326667,0.923333


## Conclusions from the batch size tests
We observe that the largest batch size gives slightly worse accuracy than the medium one but better accuracy than the smallest one, while overfitting less. Batch size 16 performs better than the smaller and larger values, probably because it offers the best trade-off between gradient noise and smooth model updates.
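One concrete difference between the three settings is how many weight updates each run performs. The following sketch counts them for the 300-sample training set and 15 epochs, assuming a single device and no gradient accumulation; `count_updates` is a hypothetical helper.

```python
import math

def count_updates(num_samples, batch_size, epochs):
    """Return (updates per epoch, total updates) for one training run."""
    per_epoch = math.ceil(num_samples / batch_size)  # the last batch may be smaller
    return per_epoch, per_epoch * epochs

updates = {bs: count_updates(300, bs, 15) for bs in (8, 16, 32)}
for bs, (per_epoch, total) in updates.items():
    print(f"batch size {bs}: {per_epoch} updates per epoch, {total} in total")
```

With batch size 8 the model receives almost four times as many (noisier) updates as with batch size 32 at the same learning rate, which is one plausible reason the runs behave differently.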

## Run 8 - 15 epochs / 5e-5 lr / 32 batch size

In [None]:
from transformers import TrainingArguments, Trainer



# insert your code here
# seed
import random
import numpy as np
import torch
from transformers import set_seed

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)

from torch.optim import AdamW
from transformers import get_scheduler
from google.colab import drive

drive.mount('/content/drive')

args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NN_Lab2/fine_tuning_8",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=15,
    learning_rate=5e-5,
    weight_decay=0.01,   # prevent overfitting
    logging_steps=10,
    report_to="none"
)
# optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate)

# scheduler

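# The schedule counts optimizer steps (batches per epoch x epochs), not samples
steps_per_epoch = -(-len(train_tokenized) // args.per_device_train_batch_size)  # ceiling division
num_training_steps = steps_per_epoch * args.num_train_epochs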

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)



# etc
# We put this here so the model is clean and doesnt build on top of the old one
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

trained_model=trainer.train()


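| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.6761        | 0.616839        | 0.596667 |
| 2     | 0.5051        | 0.376036        | 0.846667 |
| 3     | 0.2211        | 0.404798        | 0.843333 |
| 4     | 0.1743        | 0.246007        | 0.923333 |
| 5     | 0.0276        | 0.280213        | 0.916667 |
| 6     | 0.0087        | 0.305           | 0.906667 |
| 7     | 0.0045        | 0.35659         | 0.913333 |
| 8     | 0.0041        | 0.41315         | 0.913333 |
| 9     | 0.0023        | 0.49413         | 0.9      |
| 10    | 0.002         | 0.462493        | 0.906667 |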


# **Overall Conclusion**

| Experiment | Epochs | Learning Rate | Batch Size | Final Accuracy | Observations |
|------------|--------|---------------|------------|----------------|--------------|
| 1          | 5      | 3e-5          | 16         | 91.33%         | Lower accuracy, short training |
| 2          | 10     | 3e-5          | 16         | ~92.33%        | Steady improvement |
| 3          | 15     | 3e-5          | 16         | ~92.67–93%     | Best performance, plateau after 10 epochs |
| 4          | 15     | 1e-5          | 16         | ~92.33%        | Slow convergence, similar to baseline |
| 5          | 15     | 5e-5          | 16         | ~93%           | Very fast learning, signs of overfitting |
| 6          | 15     | 3e-5          | 8          | ~91.33%        | Smaller batch, more instability |
| 7          | 15     | 3e-5          | 32         | ~92%           | Larger batch, smooth convergence but slightly lower performance |

### Analysis of Hyperparameter Impact on Fine-Tuning

Based on the experiments conducted, we studied the impact of the **learning rate** and **batch size** on the accuracy and overall performance of the `distilbert-base-uncased` model during fine-tuning.

---

#### 1. Learning Rate

- **Low Learning Rate (1e-5):**  
  Led to steady but slow learning. Final accuracy reached 92.33% with mild improvements from epoch to epoch. Convergence was stable but not impressive.

- **Medium Learning Rate (3e-5):**  
  Showed the best balance. The model learned at a fast yet steady pace and reached up to 93% accuracy. Validation loss stabilized with no signs of overfitting.

- **High Learning Rate (5e-5):**  
  Very fast convergence, reaching up to 94% accuracy early in training. However, validation loss increased in later epochs, indicating **overfitting** as training loss continued to decrease.

**Conclusion:** Learning rate directly affects the stability and speed of training. The value **3e-5** appears to be the optimal choice for this task.
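
The learning rates compared above are initial values: with a `linear` schedule and zero warmup, as requested from `get_scheduler` in the training cell, the rate decays from that value to zero over training. A minimal pure-Python sketch of that shape follows; the helper `linear_lr` and the step count are hypothetical and only for illustration, not part of `transformers`.

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 1000  # assumed number of optimizer steps, for illustration only
for base in (1e-5, 3e-5, 5e-5):
    # Learning rate at the start, the midpoint, and the end of training
    print(base, [linear_lr(s, total, base) for s in (0, 500, 1000)])
```

Because every schedule ends at zero, a higher initial value mainly changes how aggressively the model is updated in the early epochs.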

---

#### 2. Batch Size

- **Small Batch Size (8):**  
  Training was noisier, with noticeable fluctuations in validation loss and lower final accuracy (~91.33%). The instability likely stems from noisier gradient estimates computed on fewer examples per step.

- **Medium Batch Size (16):**  
  The best overall combination. Stable training, high accuracy (~93%), and a good balance between stability and generalization. It was the most effective batch size.

- **Large Batch Size (32):**  
  Produced stable gradients and smooth training, but with slightly lower accuracy (~92%). Generalization may have been reduced because less gradient noise provides less implicit regularization.

**Conclusion:** A **batch size of 16** seems to offer the best performance and stability for fine-tuning, balancing noise and computational efficiency.
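
The link between batch size and gradient noise can be illustrated with a small simulation. This is a sketch on synthetic data, not a measurement from the DistilBERT runs: the per-example "gradients" are random numbers, and `minibatch_grad_std` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-example "gradients": true mean 1.0, unit variance (illustrative only).
per_example_grads = rng.normal(loc=1.0, scale=1.0, size=100_000)


def minibatch_grad_std(batch_size, n_batches=2000):
    """Empirical std of the mini-batch mean gradient for a given batch size."""
    batches = rng.choice(per_example_grads, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()


noise = {bs: minibatch_grad_std(bs) for bs in (8, 16, 32)}
for bs, std in noise.items():
    print(f"batch_size={bs:2d}  std of gradient estimate ~ {std:.3f}")
```

The spread of the mini-batch estimate shrinks roughly as 1/sqrt(batch size), so doubling the batch size reduces the noise by a factor of about 1.4 rather than 2.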

---

### Final Conclusion

The hyperparameters **learning rate** and **batch size** significantly influence the quality and speed of fine-tuning. A **learning rate of 3e-5** and a **batch size of 16** proved to be the optimal values in this experiment, leading to stable convergence and high final accuracy without overfitting.

Careful selection of hyperparameters is crucial for successful fine-tuning, especially when working with limited dataset sizes.
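
When comparing runs that differ by one or two percentage points, it helps to estimate how precise each accuracy figure is. The sketch below computes a binomial standard error. It assumes an evaluation set of 300 examples, which is inferred from the logged accuracies being multiples of 1/300 and is not stated explicitly in the notebook; `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Assumption: 300 evaluation examples, inferred from the logged accuracies
# being multiples of 1/300 (e.g. 0.903333); adjust if the real size differs.
n_eval = 300
final_accuracies = {"batch size 8": 0.9133, "batch size 16": 0.93, "batch size 32": 0.92}

for name, acc in final_accuracies.items():
    se = accuracy_standard_error(acc, n_eval)
    print(f"{name}: {acc:.2%} +/- {se:.2%} (one standard error)")
```

Under this assumption one standard error is roughly 1.5 percentage points, which is about the size of the gaps between the experiments in the table, so repeating the runs with several seeds would give a firmer ranking.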




# Part B: Using Fine-Tuned Models for New Tasks

In this part of the assignment, you do not need to train language models. Instead, we will leverage the capabilities of transfer learning to tackle more complex language tasks by reducing them to classic tasks such as text classification, natural language inference, question answering, and others.

For example, fine-tuned models for [text classification](https://huggingface.co/tasks/text-classification) can be used for tasks like:

- Are two sentences paraphrases of each other? [Paraphrase / No Paraphrase]  
- Does sentence X entail sentence Y? [Entail / Neutral / Contradict]  
- Is the given sentence grammatically correct? [Acceptable / Unacceptable]


## B1. Piqa Dataset

The [PIQA dataset](https://huggingface.co/datasets/piqa) (Physical Interaction QA) is designed to evaluate how much physical commonsense knowledge language models have. Specifically, it consists of prompts and possible endings that require commonsense reasoning to be completed correctly.

For example, given the prompt:  
"When boiling butter, when it's ready, you can"  
there are two candidate endings:
- "Pour it onto a plate"
- "Pour it into a jar"

A human can infer that the second ending is more appropriate, since melted butter is a liquid, making a jar a more suitable container than a plate.

To speed things up, select a random subset of 100 samples from the Piqa dataset.


In [None]:
# insert your code here (load dataset)
from datasets import load_dataset

# trust_remote_code=True skips the interactive prompt for the dataset's loading script
dataset_dict = load_dataset("piqa", trust_remote_code=True)
# Select a random subset of 100 examples from the validation split
dataset = dataset_dict["validation"]
subset = dataset.shuffle(seed=42).select(range(100))
print(subset[0])

The repository for piqa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/piqa.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/16113 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3084 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1838 [00:00<?, ? examples/s]

{'goal': 'How to make Strawberry Kiwi Sauce at home.', 'sol1': 'Boil 1 cup Kiwi (chopped), 1 cup chopped strawberries  with 3/4 cup water and  1 cup Olive pits for 30 min., stirring to keep from scorching over med. heat on the stove top.', 'sol2': 'Boil 1 cup Kiwi (chopped), 1 cup chopped strawberries  with 3/4 cup water and  1 cup sugar for 30 min., stirring to keep from scorching over med. heat on the stove top.', 'label': 1}


We can consider the above scenario as a multiple-choice problem, where there are two possible alternatives for the sentence ending. Therefore, by leveraging appropriate models, we can solve the task of selecting the correct ending given the prompt.

You are asked to record the accuracy of ending predictions for each prompt using language models. For comparison purposes, use at least 5 suitable models.
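
The evaluation described above has the same shape for every model: score each candidate ending, pick the highest-scoring one, and compare it with the label. A minimal sketch of that loop follows. The names `evaluate_multiple_choice` and `word_overlap_score` are hypothetical, and the word-overlap scorer is only a toy stand-in for a real language model.

```python
def evaluate_multiple_choice(examples, score_fn):
    """Accuracy when picking, for each example, the candidate with the highest score.

    examples: iterable of dicts with "goal", "candidates" (list of str), "label" (int index).
    score_fn: callable (goal, candidate) -> float; higher means more plausible.
    """
    correct = 0
    total = 0
    for ex in examples:
        scores = [score_fn(ex["goal"], cand) for cand in ex["candidates"]]
        pred = max(range(len(scores)), key=scores.__getitem__)  # argmax, first index on ties
        correct += int(pred == ex["label"])
        total += 1
    return correct / total if total else 0.0


def word_overlap_score(goal, candidate):
    """Toy stand-in scorer: fraction of goal words that reappear in the candidate."""
    goal_words = set(goal.lower().split())
    cand_words = set(candidate.lower().split())
    return len(goal_words & cand_words) / max(1, len(goal_words))


toy_examples = [
    {"goal": "store melted butter",
     "candidates": ["pour it onto a plate", "pour the melted butter into a jar"],
     "label": 1},
    {"goal": "dry wet hands",
     "candidates": ["wipe the wet hands on a towel", "hold them under the tap"],
     "label": 0},
]
print(evaluate_multiple_choice(toy_examples, word_overlap_score))
```

Swapping `word_overlap_score` for a model-based scorer (for example an entailment probability) leaves the loop unchanged, and the same function works for more than two candidates.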


## **Model 1 - roberta-large-mnli**

In [None]:
# insert your code here (models)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "roberta-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout, since we only predict and do not train

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-23): 24 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=Tru

In [None]:
# insert your code here (models)
import torch.nn.functional as F

def get_entailment_score(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
        probs = F.softmax(logits, dim=-1)  # softmax over the last (class) dimension

    # For this checkpoint index 2 is the entailment class (see model.config.id2label)
    return probs[0][2].item()  # .item() converts the tensor element to a Python float

'''
example = subset[0]
goal = example["goal"]
sol1 = example["sol1"]
sol2 = example["sol2"]
label = example["label"]
score1 = get_entailment_score(goal, sol1)
score2 = get_entailment_score(goal, sol2)

print("Entailment score for sol1:", score1)
print("Entailment score for sol2:", score2)
print("True label:", label)
'''

def predict_choice(goal, sol1, sol2):
    score1 = get_entailment_score(goal, sol1)
    score2 = get_entailment_score(goal, sol2)
    return 0 if score1 > score2 else 1  # 0 = sol1, 1 = sol2

correct = 0
total = len(subset)

for example in subset:
    pred = predict_choice(example["goal"], example["sol1"], example["sol2"])
    if pred == example["label"]:
        correct += 1

accuracy = correct / total
print(f"Accuracy: {accuracy:.2%}")



Accuracy: 54.00%


## **Model 2 - facebook/bart-large-mnli**

In [None]:
# insert your code here (models)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "facebook/bart-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

BartForSequenceClassification(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50265, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
   

In [None]:
# insert your code here (models)
import torch.nn.functional as F

def get_entailment_score(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
        probs = F.softmax(logits, dim=-1)  # softmax over the last (class) dimension

    # For this checkpoint index 2 is the entailment class (see model.config.id2label)
    return probs[0][2].item()  # .item() converts the tensor element to a Python float

'''
example = subset[0]
goal = example["goal"]
sol1 = example["sol1"]
sol2 = example["sol2"]
label = example["label"]
score1 = get_entailment_score(goal, sol1)
score2 = get_entailment_score(goal, sol2)

print("Entailment score for sol1:", score1)
print("Entailment score for sol2:", score2)
print("True label:", label)
'''
def predict_choice(goal, sol1, sol2):
    score1 = get_entailment_score(goal, sol1)
    score2 = get_entailment_score(goal, sol2)
    return 0 if score1 > score2 else 1  # 0 = sol1, 1 = sol2


correct = 0
total = len(subset)

for example in subset:
    pred = predict_choice(example["goal"], example["sol1"], example["sol2"])
    if pred == example["label"]:
        correct += 1

accuracy = correct / total
print(f"Accuracy: {accuracy:.2%}")


Accuracy: 47.00%


## **Model 3 - google/flan-t5-large**

In [None]:
# insert your code here (models)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
       

In [None]:
# insert your code here (models)
def predict_with_flan_t5(goal, sol1, sol2):
    prompt = (
        f"Goal: {goal}\n"
        f"Option 1: {sol1}\n"
        f"Option 2: {sol2}\n"
        f"Which option makes more sense? Respond with '1' or '2'."
    )

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=5)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # Map the generated answer to a label: 0 = sol1, 1 = sol2
    if "1" in response:
        return 0
    elif "2" in response:
        return 1
    else:
        return -1  # fallback: the model produced neither '1' nor '2'


correct = 0
skipped = 0
total = len(subset)

for example in subset:
    pred = predict_with_flan_t5(example["goal"], example["sol1"], example["sol2"])

    if pred == -1:
        skipped += 1
        continue

    if pred == example["label"]:
        correct += 1

# Accuracy is computed over answered examples only; skipped ones are reported separately
answered = total - skipped
accuracy = correct / answered if answered else 0.0
print(f"Accuracy: {accuracy:.2%} (Skipped: {skipped})")


Accuracy: 80.00% (Skipped: 0)


## **Model 4 - allenai/unifiedqa-t5-large**

In [None]:
# insert your code here (models)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "allenai/unifiedqa-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=1024, out_features=4096, bias=False)
              (wo): Linear(in_features=4096, out_features=1024, bias=False)
              (d

In [None]:
# insert your code here (models)
import re

def predict_with_unifiedqa(goal, sol1, sol2):
    # strip() removes leading and trailing whitespace from each field
    prompt = (
        f"Question: {goal.strip()}\n"
        f"Options:\n"
        f"(A) {sol1.strip()}\n"
        f"(B) {sol2.strip()}\n"
        f"Choose A or B."
    )

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=5)  # a short answer (A or B) is enough

    # Decode token IDs back to text, dropping special tokens such as <pad> and </s>
    response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip().lower()

    # Accept the answer only as a standalone token. A plain substring test such as
    # `"a" in response` would fire on almost any English text (e.g. "a jar").
    match = re.fullmatch(r"\(?([ab12])\)?[.:]?", response)
    if match is None:
        return -1  # fallback: no unambiguous choice in the output
    return 0 if match.group(1) in ("a", "1") else 1

correct = 0
skipped = 0
total = len(subset)

for example in subset:
    pred = predict_with_unifiedqa(example["goal"], example["sol1"], example["sol2"])

    if pred == -1:
        skipped += 1
        continue

    if pred == example["label"]:
        correct += 1

accuracy = correct / (total - skipped)
print(f"Accuracy: {accuracy:.2%} (Skipped: {skipped})")



Accuracy: 43.48% (Skipped: 8)


## **Model 5 - cross-encoder/nli-deberta-v3-large**

In [None]:
# insert your code here (models)
model_name = "cross-encoder/nli-deberta-v3-large"
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1024, padding_idx=0)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (pos_dropout): Dropout(p=0.1, inplace=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNo

In [None]:
# insert your code here (models)
import torch.nn.functional as F

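# Label order differs between NLI checkpoints, so read the entailment index from the
# model config; hard-coding 2 is only valid when id2label[2] is the entailment class.
ENTAILMENT_IDX = next(
    idx for idx, label in model.config.id2label.items() if label.lower() == "entailment"
)

def get_entailment_score_deberta(goal, hypothesis):
    inputs = tokenizer(goal, hypothesis, return_tensors="pt", truncation=True)

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = F.softmax(logits, dim=-1)

    return probs[0][ENTAILMENT_IDX].item()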

def predict_choice_deberta(goal, sol1, sol2):
    score1 = get_entailment_score_deberta(goal, sol1)
    score2 = get_entailment_score_deberta(goal, sol2)
    return 0 if score1 > score2 else 1

correct = 0
total = len(subset)

for example in subset:
    pred = predict_choice_deberta(example["goal"], example["sol1"], example["sol2"])
    if pred == example["label"]:
        correct += 1

accuracy = correct / total
print(f"DeBERTa NLI Accuracy: {accuracy:.2%}")



DeBERTa NLI Accuracy: 67.00%


## Model Comparison for Commonsense Reasoning on the PIQA Dataset

A subset of 100 samples from the **PIQA dataset** was used, which evaluates the ability of language models to resolve simple everyday situations based on commonsense knowledge.

For each prompt, the task was to choose the most appropriate ending from two alternatives. The task was implemented as a **zero-shot multiple choice** problem, and 5 pretrained models were evaluated.

### Results Table

| Model                                | Model Type                            | Accuracy                | Skipped |
|--------------------------------------|---------------------------------------|-------------------------|---------|
| `roberta-large-mnli`                 | NLI classifier (zero-shot)            | 54.00%                  | 0       |
| `facebook/bart-large-mnli`           | NLI classifier (zero-shot)            | 47.00%                  | 0       |
| `google/flan-t5-large`               | Instruction-tuned seq2seq (zero-shot) | **80.00%**              | 0       |
| `allenai/unifiedqa-t5-large`         | QA seq2seq                            | 43.48% (of 92 answered) | 8       |
| `cross-encoder/nli-deberta-v3-large` | NLI classifier (zero-shot)            | 67.00%                  | 0       |

---

### Analysis & Conclusions

1. The MNLI-trained models **RoBERTa-large-MNLI** (54%) and **BART-large-MNLI** (47%) performed at roughly chance level, which is 50% for two candidate endings.

2. **UnifiedQA-T5-Large**, although designed for question answering, struggled to provide clear answers in many cases: 8 of the 100 predictions were skipped, and accuracy on the remaining 92 was only 43.48%.

3. **DeBERTa-v3-Large (NLI fine-tuned)** scored clearly above chance (67%), suggesting that stronger NLI models transfer better to commonsense reasoning, although it remained well below Flan-T5.

4. **Flan-T5-Large** was by far the most effective model, achieving **80% accuracy with no failures or skipped cases**. This is likely attributable to:
   - Its training on multiple instruction-based datasets
   - Its ability to generalize and reason without additional fine-tuning

### Conclusion:

Using **instruction-tuned language models like Flan-T5-Large** appears to be the most effective approach for commonsense reasoning in zero-shot settings, outperforming both traditional NLI models and QA-specific ones.


## B2. Truthful QA

### Sentence Transformers

**Sentence transformers** are used to generate **sentence embeddings**, i.e., vector representations of sentences in a vector space. Thanks to the way they have been pre-trained, they are capable of placing semantically similar sentences close to one another, while distancing semantically unrelated sentences. Therefore, using the representations provided by sentence embeddings, we can evaluate how semantically close or distant two sentences are.

The comparison of these vector representations is typically performed using methods like cosine similarity, where higher values between vectors indicate greater similarity, and thus more semantically similar sentences. For this purpose, we provide a function for computing cosine similarity.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
def get_cosine_similarity(feature_vec_1, feature_vec_2):
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]
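
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, so for arbitrary vectors it lies between -1 and 1. The sketch below reproduces it with plain NumPy; `cosine_similarity_np` is a hypothetical name used here for illustration and is not part of scikit-learn.

```python
import numpy as np


def cosine_similarity_np(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity_np(a, 2 * a)        # scaling does not change the angle
orthogonal = cosine_similarity_np([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_np(a, -a)
print(same_direction, orthogonal, opposite)
```

The three cases cover the extremes: vectors pointing the same way, unrelated (perpendicular) vectors, and vectors pointing in opposite directions.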

For example, run the following cell, which returns the cosine similarity of two sentences ("This is an example sentence", "Each sentence is converted"). In general, cosine similarity lies in the range [-1, 1].  
You can also try running the same cell with different sentence pairs of your choice, whether similar or very different, and observe how the cosine similarity values change.


In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)

get_cosine_similarity(embeddings[0], embeddings[1])

np.float32(0.4048847)

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["It's a sunny day", "There is sun outside"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)

get_cosine_similarity(embeddings[0], embeddings[1])

np.float32(0.87338126)

For the next part of the exercise, you are required to select at least 6 different [models for semantic similarity](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads) from the sentence transformers.


### Can Question Answering Models Distinguish Between True and False Statements?

This is the question we will address in this part of the exercise. For this purpose, we load the [Truthful QA generation](https://huggingface.co/datasets/truthful_qa/viewer/generation/validation) dataset, which provides the following kinds of answers for each question:

- best answer  
- correct answer  
- incorrect answer  

Often, the best answer and the correct answer are either the same or very close in meaning. This is where we will use semantic similarity to evaluate how closely they match.

We filter the dataset to contain a total of 100 samples for faster processing. Each sample should contain at least 2 correct answers. That gives us 4 candidate options:

1st option: best answer  
2nd option: 1st correct answer  
3rd option: 2nd correct answer  
4th option: incorrect answer  

These options, along with the question, are passed to a multiple-choice model—similar to those used in question B1. You can use the same models and extend them to handle 4 candidate answers.

Semantic similarity will influence what we consider an optimally correct answer, and therefore affect accuracy. Specifically, we will generate vector representations (embeddings) for the best answer and the two correct answers using a semantic similarity model. If the multiple-choice model predicts one of the correct answers, and the cosine similarity between that answer and the best answer is above a predefined similarity threshold, the answer is considered optimally correct. We set the similarity threshold at **0.95**.

For example, suppose the multiple-choice model chooses the second option (i.e., the 1st correct answer), and the cosine similarity between its embedding and that of the best answer is greater than 0.95. In that case, we consider the prediction optimally correct, and count it positively toward the accuracy.

You are required to write a function that computes the accuracy of finding optimally correct answers among the candidate options. You should evaluate at least 6 semantic similarity models, as well as the multiple-choice models you used in question B1.
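
The scoring rule described above can be sketched as follows. This is one possible reading of the task, not a reference solution: it assumes that choosing the best answer itself (option 0) always counts, since its similarity with itself is 1. The names `optimal_accuracy`, `cosine` and `toy_vectors` are hypothetical, and the hand-made 2-D vectors stand in for real sentence embeddings.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def optimal_accuracy(samples, predictions, embed, threshold=0.95):
    """Fraction of predictions that are 'optimally correct'.

    samples: list of dicts with "options" = [best, correct_1, correct_2, incorrect].
    predictions: predicted option index (0..3) for each sample.
    embed: callable str -> 1-D vector (a sentence-embedding model in practice).
    """
    hits = 0
    for sample, pred in zip(samples, predictions):
        best = sample["options"][0]
        if pred == 0:
            hits += 1  # assumption: the best answer itself always counts
        elif pred in (1, 2):
            sim = cosine(embed(best), embed(sample["options"][pred]))
            hits += int(sim > threshold)  # correct answer close enough to the best one
        # pred == 3 is the incorrect answer and never counts
    return hits / len(samples) if samples else 0.0


# Toy embeddings (illustrative only); a real run would use model.encode(...)
toy_vectors = {
    "declined": np.array([1.0, 0.0]),
    "gone down": np.array([0.99, 0.05]),
    "changed": np.array([0.6, 0.8]),
    "increased": np.array([-1.0, 0.0]),
}
samples = [{"options": ["declined", "gone down", "changed", "increased"]}] * 4
predictions = [0, 1, 2, 3]
print(optimal_accuracy(samples, predictions, toy_vectors.__getitem__))
```

Passing the embedding function as an argument makes it straightforward to repeat the computation for each semantic similarity model while keeping the multiple-choice predictions fixed.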


In [None]:
# insert your code here (load dataset)
from datasets import load_dataset
import random

dataset_dict = load_dataset("truthful_qa", "generation")
# We draw the random subset of 100 from the validation split
dataset = dataset_dict["validation"]
# Keep only examples with at least 2 correct answers and 1 incorrect answer
filtered = [
    ex for ex in dataset
    if len(ex["correct_answers"]) >= 2 and len(ex["incorrect_answers"]) >= 1
]

random.seed(42)  # fixed seed so the same 100 examples are drawn on every run
subset = random.sample(filtered, 100)

print(subset[0]["question"])
print("Best:", subset[0]["best_answer"])
print("Corrects:", subset[0]["correct_answers"][:2])
print("Incorrect:", subset[0]["incorrect_answers"][0])

Generating validation split:   0%|          | 0/817 [00:00<?, ? examples/s]

How has the rate of smoking changed over time among Americans?
Best: The rate of smoking has declined over time among Americans
Corrects: ['The rate of smoking has declined over time among Americans', 'Smoking has gone down over time']
Incorrect: The rate of smoking has increased over time among Americans


In [None]:
# insert your code here (load models for semantic similarity and QA)
from sentence_transformers import SentenceTransformer
# for QA
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM
import torch

similarity_models = {
    "all-mpnet-base-v2": SentenceTransformer("sentence-transformers/all-mpnet-base-v2"),
    "all-MiniLM-L6-v2": SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2"),
    "paraphrase-MiniLM-L3-v2": SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L3-v2"),
    "multi-qa-MiniLM-L6-cos-v1": SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1"),
    "paraphrase-multilingual-MiniLM-L12-v2": SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    "all-distilroberta-v1": SentenceTransformer("sentence-transformers/all-distilroberta-v1"),
}

qa_models = {}

# RoBERTa-large-MNLI
qa_models["roberta"] = {
    "name": "roberta-large-mnli",
    "tokenizer": AutoTokenizer.from_pretrained("roberta-large-mnli"),
    "model": AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()
}

# BART-large-MNLI
qa_models["bart"] = {
    "name": "facebook/bart-large-mnli",
    "tokenizer": AutoTokenizer.from_pretrained("facebook/bart-large-mnli"),
    "model": AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli").eval()
}

# UnifiedQA-T5-Large
qa_models["unifiedqa"] = {
    "name": "allenai/unifiedqa-t5-large",
    "tokenizer": AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-large"),
    "model": AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-large").eval()
}

# Flan-T5-Large
qa_models["flan"] = {
    "name": "google/flan-t5-large",
    "tokenizer": AutoTokenizer.from_pretrained("google/flan-t5-large"),
    "model": AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").eval()
}

# DeBERTa NLI
qa_models["deberta"] = {
    "name": "cross-encoder/nli-deberta-v3-large",
    "tokenizer": AutoTokenizer.from_pretrained("cross-encoder/nli-deberta-v3-large"),
    "model": AutoModelForSequenceClassification.from_pretrained("cross-encoder/nli-deberta-v3-large").eval()
}



Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import torch
import torch.nn.functional as F
import numpy as np

# Cosine similarity function
def get_cosine_similarity(vec1, vec2):
    return cosine_similarity(vec1.reshape(1, -1), vec2.reshape(1, -1))[0][0]

# QA prediction functions per category
# for RoBERTa, BART, DeBERTa (NLI models)
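def predict_answer_nli(model, tokenizer, question, options):
    # Look up the entailment class from the checkpoint's own label map: the
    # index differs between checkpoints (2 for roberta-large-mnli, but not
    # for every NLI model). Fall back to 2 if no such label is found.
    ent_idx = next(
        (int(i) for i, name in model.config.id2label.items()
         if str(name).lower() == "entailment"),
        2,
    )
    scores = []
    for opt in options:
        inputs = tokenizer(question, opt, return_tensors="pt", truncation=True).to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = F.softmax(logits, dim=-1)
        scores.append(probs[0][ent_idx].item())  # entailment probability
    return int(torch.argmax(torch.tensor(scores)))

# for Flan & UnifiedQA (seq2seq models)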
def predict_answer_seq2seq(model, tokenizer, question, options):
    prompt = f"Question: {question.strip()}\n"
    for i, opt in enumerate(options, 1):
        prompt += f"Option {i}: {opt.strip()}\n"
    prompt += "Which option is best? Respond with the option number."

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=5)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    for i in range(1, 5):
        if str(i) in response:
            return i - 1
    return -1

# Matching prediction function
qa_predict_functions = {
    "roberta": predict_answer_nli,
    "bart": predict_answer_nli,
    "deberta": predict_answer_nli,
    "flan": predict_answer_seq2seq,
    "unifiedqa": predict_answer_seq2seq,
}

# Final evaluation function semantic similarity accuracy
def evaluate_semantic_accuracy(subset, qa_entry, predict_fn, similarity_model, threshold=0.95):
    tokenizer = qa_entry["tokenizer"]
    model = qa_entry["model"]

    correct = 0
    total = 0

    for ex in subset:
        question = ex["question"]
        options = [
            ex["best_answer"],
            ex["correct_answers"][0],
            ex["correct_answers"][1],
            ex["incorrect_answers"][0],
        ]

        pred_idx = predict_fn(model, tokenizer, question, options)
        if pred_idx == -1:
            continue

        predicted = options[pred_idx]
        if pred_idx in [1, 2]:  # correct positions
            vec_pred = similarity_model.encode(predicted)
            vec_best = similarity_model.encode(ex["best_answer"])
            sim = get_cosine_similarity(vec_pred, vec_best)

            if sim >= threshold:
                correct += 1
        total += 1

    accuracy = correct / total if total > 0 else 0
    return round(accuracy * 100, 2)


In [None]:
results = []

for qa_name, qa_entry in qa_models.items():
    predict_fn = qa_predict_functions[qa_name]
    print(f"\n🔍 QA Model: {qa_name.upper()}")

    for sim_name, sim_model in similarity_models.items():
        acc = evaluate_semantic_accuracy(
            subset=subset,
            qa_entry=qa_entry,
            predict_fn=predict_fn,
            similarity_model=sim_model,
            threshold=0.7  # relaxed from the 0.95 stated in the task
        )
        results.append((qa_name, sim_name, acc))
        print(f" - {sim_name}: {acc}%")



🔍 QA Model: ROBERTA
 - all-mpnet-base-v2: 19.0%
 - all-MiniLM-L6-v2: 16.0%
 - paraphrase-MiniLM-L3-v2: 15.0%
 - multi-qa-MiniLM-L6-cos-v1: 19.0%
 - paraphrase-multilingual-MiniLM-L12-v2: 13.0%
 - all-distilroberta-v1: 14.0%

🔍 QA Model: BART
 - all-mpnet-base-v2: 24.0%
 - all-MiniLM-L6-v2: 22.0%
 - paraphrase-MiniLM-L3-v2: 20.0%
 - multi-qa-MiniLM-L6-cos-v1: 24.0%
 - paraphrase-multilingual-MiniLM-L12-v2: 20.0%
 - all-distilroberta-v1: 22.0%

🔍 QA Model: UNIFIEDQA
 - all-mpnet-base-v2: 25.0%
 - all-MiniLM-L6-v2: 0.0%
 - paraphrase-MiniLM-L3-v2: 0.0%
 - multi-qa-MiniLM-L6-cos-v1: 0.0%
 - paraphrase-multilingual-MiniLM-L12-v2: 25.0%
 - all-distilroberta-v1: 25.0%

🔍 QA Model: FLAN
 - all-mpnet-base-v2: 31.11%
 - all-MiniLM-L6-v2: 31.11%
 - paraphrase-MiniLM-L3-v2: 30.0%
 - multi-qa-MiniLM-L6-cos-v1: 32.22%
 - paraphrase-multilingual-MiniLM-L12-v2: 30.0%
 - all-distilroberta-v1: 30.0%

🔍 QA Model: DEBERTA
 - all-mpnet-base-v2: 17.0%
 - all-MiniLM-L6-v2: 18.0%
 - paraphrase-MiniLM-L3-v2: 

## Comparison of QA Models and Semantic Similarity Models (TruthfulQA)

For the 100-sample subset of the **TruthfulQA (generation)** dataset, five QA models (from question B1) were evaluated on their ability to identify **optimally correct answers** among four choices, based on semantic similarity (cosine similarity ≥ 0.70) with the "best answer". Note that this run uses a threshold of 0.70 rather than the 0.95 given in the task description.

Six sentence-transformer models were used to assess similarity.
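Since a QA model's choice for a question does not depend on which similarity model scores it afterwards, the prediction can be computed once per QA model and reused for all six similarity models. The sketch below shows that structure with stand-ins: `evaluate_all`, `toy_predict`, the `"options"` and `"pick"` fields, and the two toy similarity functions are all hypothetical and replace the real models so that the example runs on its own.

```python
def evaluate_all(subset, predict_fn, similarity_fns, threshold=0.7):
    """Return {similarity_name: accuracy %} using one QA prediction per example."""
    # Run the (expensive) QA model exactly once per example.
    preds = [predict_fn(ex) for ex in subset]
    results = {}
    for name, sim_fn in similarity_fns.items():
        correct = total = 0
        for ex, idx in zip(subset, preds):
            if idx == -1:          # unparseable reply: skipped, as in the notebook
                continue
            total += 1
            options = ex["options"]  # assumed order: best, correct, correct, incorrect
            if idx in (1, 2) and sim_fn(options[idx], options[0]) >= threshold:
                correct += 1
        results[name] = round(100 * correct / total, 2) if total else 0.0
    return results

# Stand-in QA model that records how often it is called
calls = []
def toy_predict(ex):
    calls.append(ex["question"])
    return ex["pick"]

toy_subset = [
    {"question": "q1", "options": ["a", "a!", "b", "x"], "pick": 1},
    {"question": "q2", "options": ["c", "c", "d", "y"], "pick": 3},
    {"question": "q3", "options": ["e", "e", "f", "z"], "pick": -1},
]
# Stand-in similarity functions (a real run would compare embeddings)
toy_similarity = {
    "exact": lambda s, t: 1.0 if s == t else 0.0,
    "lenient": lambda s, t: 0.9,
}

scores = evaluate_all(toy_subset, toy_predict, toy_similarity)
print(scores, "QA calls:", len(calls))
```

In this layout the QA model runs once per example regardless of how many similarity models are compared.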

### Results Table (Accuracy %)

| QA Model       | all-mpnet-base-v2 | all-MiniLM-L6-v2 | paraphrase-MiniLM-L3-v2 | multi-qa-MiniLM-L6-cos-v1 | paraphrase-multilingual-MiniLM-L12-v2 | all-distilroberta-v1 |
|----------------|-------------------|------------------|---------------------------|-----------------------------|-----------------------------------------|------------------------|
| **RoBERTa**     | 19.0%             | 16.0%            | 15.0%                     | 19.0%                       | 13.0%                                   | 14.0%                  |
| **BART**        | 24.0%             | 22.0%            | 20.0%                     | 24.0%                       | 20.0%                                   | 22.0%                  |
| **UnifiedQA**   | 25.0%             | 0.0%             | 0.0%                      | 0.0%                        | 25.0%                                   | 25.0%                  |
| **Flan-T5**     | **31.11%**        | **31.11%**       | 30.0%                     | **32.22%**                  | 30.0%                                   | 30.0%                  |
| **DeBERTa**     | 17.0%             | 18.0%            | 17.0%                     | 18.0%                       | 17.0%                                   | 16.0%                  |

---

### Observations

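- **Flan-T5-Large** scored highest for every similarity model (30.0% to 32.22%). Its fractional percentages indicate a denominator below 100, which is consistent with the evaluation skipping replies that contain no option number.
- **UnifiedQA** alternates between 0.0% and 25.0%. Since the QA prediction does not depend on the similarity model, this pattern more likely reflects a very small number of parseable replies (for example, 1 hit out of 4) than genuinely strong or weak performance with particular embeddings.
- **RoBERTa**, **BART**, and **DeBERTa** are NLI models, scored here by the entailment probability of (question, option) pairs, an input format they were not trained on; RoBERTa and DeBERTa stayed below 20%.
- Within this setup, the **instruction-tuned Flan-T5** handled the zero-shot multiple-choice prompt better than the NLI models, though on 100 samples the gaps should be read as indicative only.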
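The evaluation function skips an example whenever the seq2seq reply contains no option number, so the reported accuracy is computed over parsed replies only. The sketch below shows how much that choice can matter; `accuracy_report` is a hypothetical helper and the counts are invented for illustration, not taken from the run above.

```python
def accuracy_report(pred_indices, hits):
    """pred_indices: predicted option index per example, -1 if unparseable.
    hits: booleans, True where the prediction counted as optimally correct."""
    parsed_hits = [h for p, h in zip(pred_indices, hits) if p != -1]
    n_all = len(pred_indices)
    n_parsed = len(parsed_hits)
    n_hits = sum(parsed_hits)
    return {
        "parsed": n_parsed,
        # Accuracy over parsed replies only (what the notebook reports)
        "acc_parsed": round(100 * n_hits / n_parsed, 2) if n_parsed else 0.0,
        # Accuracy if unparseable replies are counted as wrong
        "acc_all": round(100 * n_hits / n_all, 2) if n_all else 0.0,
    }

# Hypothetical run: 100 questions, 4 parseable replies, 1 optimally correct
preds = [1, 0, 3, 2] + [-1] * 96
hits = [True, False, False, False] + [False] * 96
report = accuracy_report(preds, hits)
print(report)
```

Reporting the number of parsed replies next to each accuracy makes values such as 0% or 25% easier to interpret.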

---

### Conclusion

Combining QA models with semantic similarity checks gives a workable way to assess "optimally correct" answers. In this run, **Flan-T5-Large combined with sentence embeddings such as multi-qa-MiniLM** gave the highest accuracy (32.22%).


## B3. Winogrande Dataset

The [Winogrande dataset](https://huggingface.co/datasets/winogrande) consists of sentences with one missing word, and two possible options are given to fill in the blank. For example, given the sentence:  
"John moved the couch from the garage to the backyard to create space. The _ is small."  
there are two possible choices:

- "garage"  
- "backyard"  

The challenge lies in the fact that both words are mentioned in the sentence, so the model needs strong language understanding capabilities to select the semantically correct completion.

To speed things up, select a random subset of 100 samples from the training set of Winogrande.


In [None]:
# My code
# checking the versions of the dataset
from datasets import get_dataset_config_names
get_dataset_config_names("winogrande")


['winogrande_xs',
 'winogrande_s',
 'winogrande_m',
 'winogrande_l',
 'winogrande_xl',
 'winogrande_debiased']

In [None]:
# insert your code here (load dataset)
from datasets import load_dataset
import random
# load Winogrande XL
dataset_dict = load_dataset("winogrande", "winogrande_xl")
# Select random subset of 100 (we choose from the training set)
dataset = dataset_dict["train"]
subset = dataset.shuffle(seed=42).select(range(100))

# Sample
example = subset[0]
print("Sentence:", example["sentence"])
print("Option 1:", example["option1"])
print("Option 2:", example["option2"])
print("Answer:", example["answer"])

Sentence: The phone of Donald is a lot better than Adam's because _ paid extra for his phone.
Option 1: Donald
Option 2: Adam
Answer: 1


With an appropriate [transformation](https://huggingface.co/DeepPavlov/roberta-large-winogrande) of the above input (a sentence with a blank and two filling options), you are asked to record the accuracy of relevant models that solve the problem by comparing the predicted label with the true label (1: first option, 2: second option). Essentially, you will need to frame the above problem as a more classic natural language processing task.
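One simple way to frame the problem as a scoring task is to build one candidate sentence per option and keep the option whose sentence scores higher. The sketch below shows only that framing: `fill_blank`, `pick_option`, and `toy_score` are hypothetical names, and `toy_score` is a hard-coded stand-in for a real model. The exact transformation expected by a given checkpoint may differ and should be taken from its model card.

```python
def fill_blank(sentence, option1, option2, blank="_"):
    """Return the two candidate sentences, one per option."""
    return (sentence.replace(blank, option1, 1),
            sentence.replace(blank, option2, 1))

def pick_option(sentence, option1, option2, score_fn):
    """Return "1" or "2" (strings, like the dataset's answer field)."""
    cand1, cand2 = fill_blank(sentence, option1, option2)
    return "1" if score_fn(cand1) > score_fn(cand2) else "2"

# Stand-in scorer for illustration only; a real scorer would call a model.
def toy_score(text):
    return 1.0 if "The garage is small" in text else 0.0

sentence = ("John moved the couch from the garage to the backyard "
            "to create space. The _ is small.")
cand1, cand2 = fill_blank(sentence, "garage", "backyard")
print(cand1)
print(cand2)
print("Picked option:", pick_option(sentence, "garage", "backyard", toy_score))
```

Keeping the scorer as a parameter makes it easy to swap in different models while the transformation stays the same.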

Try at least 3 suitable models from Hugging Face to approach the Winogrande problem. We recommend using pipelines for your convenience.


In [None]:
# insert your code here (load models)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# DeepPavlov RoBERTa trained for Winogrande
model_name_1 = "DeepPavlov/roberta-large-winogrande"
tokenizer_1 = AutoTokenizer.from_pretrained(model_name_1)
model_1 = AutoModelForSequenceClassification.from_pretrained(model_name_1)

# facebook/bart-large-mnli (zero-shot model)
model_name_2 = "facebook/bart-large-mnli"
tokenizer_2 = AutoTokenizer.from_pretrained(model_name_2)
model_2 = AutoModelForSequenceClassification.from_pretrained(model_name_2)

# roberta-large-mnli (NLI baseline)
model_name_3 = "roberta-large-mnli"
tokenizer_3 = AutoTokenizer.from_pretrained(model_name_3)
model_3 = AutoModelForSequenceClassification.from_pretrained(model_name_3)



In [None]:
# insert your code here (create pipelines)
# Load pipelines for Text Classification
from transformers import pipeline

pipe_winogrande = pipeline("text-classification", model=model_1, tokenizer=tokenizer_1)
pipe_bart = pipeline("text-classification", model=model_2, tokenizer=tokenizer_2)
pipe_roberta = pipeline("text-classification", model=model_3, tokenizer=tokenizer_3)

Device set to use cpu
Device set to use cpu
Device set to use cpu


In [None]:
# insert your code here (function for predicting best fill)

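# Prediction function using pipeline
def predict_winogrande_answer(pipe, sentence, option1, option2):
    # Replace "_" with each choice
    sent1 = sentence.replace("_", option1)
    sent2 = sentence.replace("_", option2)

    # Ask for predictions. By default the pipeline returns only the top-ranked
    # label, so "score" is that label's confidence, whichever label it is.
    pred1 = pipe(sent1)[0]["score"]
    pred2 = pipe(sent2)[0]["score"]

    # Labels are strings, matching the dataset's "answer" field
    return "1" if pred1 > pred2 else "2"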

# Evaluating the prediction
def evaluate_winogrande(pipe, subset):
    correct = 0
    total = 0

    for ex in subset:
        pred = predict_winogrande_answer(pipe, ex["sentence"], ex["option1"], ex["option2"])
        if pred == ex["answer"]:
            correct += 1
        total += 1

    accuracy = correct / total if total > 0 else 0
    return round(accuracy * 100, 2)

print("DeepPavlov:", evaluate_winogrande(pipe_winogrande, subset), "%")
print("BART-MNLI:", evaluate_winogrande(pipe_bart, subset), "%")
print("RoBERTa-MNLI:", evaluate_winogrande(pipe_roberta, subset), "%")

DeepPavlov: 55.0 %
BART-MNLI: 77.0 %
RoBERTa-MNLI: 62.0 %


## Comments

The **Winogrande dataset** consists of sentences with a blank and two possible words to fill it in. The difficulty lies in the fact that both options are syntactically valid, but only one is semantically correct, so the task requires commonsense reasoning and comprehension abilities.

For each sentence, prediction was approached by transforming it into a **choice between two complete sentences**, where the blank is replaced with each option, and the model evaluates the “plausibility” of the resulting sentence.
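When comparing two candidate sentences, it matters which score is compared. The sketch below contrasts comparing the top-ranked label's confidence with comparing the score of one fixed label. The outputs are mocked lists of `{"label", "score"}` dictionaries, assumed to have the shape a text-classification pipeline returns when all labels are requested; `score_for_label` is a hypothetical helper, and the choice of "ENTAILMENT" as a plausibility proxy is itself an assumption.

```python
def score_for_label(all_scores, label):
    """Pick the score of one named label from a list of {"label", "score"} dicts."""
    for item in all_scores:
        if item["label"].lower() == label.lower():
            return item["score"]
    raise KeyError(label)

# Mocked per-label outputs for the two filled-in sentences (invented numbers)
out1 = [{"label": "ENTAILMENT", "score": 0.20},
        {"label": "NEUTRAL", "score": 0.70},
        {"label": "CONTRADICTION", "score": 0.10}]
out2 = [{"label": "ENTAILMENT", "score": 0.35},
        {"label": "NEUTRAL", "score": 0.40},
        {"label": "CONTRADICTION", "score": 0.25}]

# Rule A: compare the confidence of whichever label ranks first
top1 = max(out1, key=lambda d: d["score"])["score"]
top2 = max(out2, key=lambda d: d["score"])["score"]
by_top_label = "1" if top1 > top2 else "2"

# Rule B: compare the same label across both sentences
by_fixed_label = ("1" if score_for_label(out1, "ENTAILMENT")
                  > score_for_label(out2, "ENTAILMENT") else "2")

print("top-label rule:", by_top_label)
print("fixed-label rule:", by_fixed_label)
```

The two rules can disagree, as they do on these mocked numbers, so the comparison rule is worth stating alongside any reported accuracy.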

### The following models were tested:

- `DeepPavlov/roberta-large-winogrande`: specifically fine-tuned for Winogrande  
- `facebook/bart-large-mnli`: zero-shot NLI model with strong inference performance  
- `roberta-large-mnli`: baseline zero-shot NLI model  

### Results

| Model                                  | Accuracy  |
|----------------------------------------|-----------|
| `DeepPavlov/roberta-large-winogrande`  | 55.0%     |
| `facebook/bart-large-mnli`             | **77.0%** |
| `roberta-large-mnli`                   | 62.0%     |

### Conclusions

- `facebook/bart-large-mnli` scored highest, with **77% accuracy** on this 100-sample subset.
- The specialized `DeepPavlov` model reached only 55%, close to the 50% expected from guessing between two options. Besides possible overfitting or limited generalization, one likely factor is that the plain filled-in sentence used here may not match the input transformation described on the model card.
- `roberta-large-mnli` landed in between (62%), outperforming the specialized model but falling short of BART.

This analysis suggests that, at least under this scoring scheme, **general-purpose zero-shot NLI models can outperform a dedicated fine-tuned model**. Because the score being compared is the confidence of each sentence's top-ranked label, and the sample has only 100 sentences, the ranking should be treated as tentative.
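With only 100 samples, each accuracy carries noticeable sampling uncertainty. A rough way to quantify it is a normal-approximation 95% confidence interval for a proportion, sketched below. The helper `normal_ci` is a hypothetical name, and the approximation is an assumption of this sketch (exact binomial intervals would differ slightly).

```python
import math

def normal_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the estimate
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

# Accuracies reported above, each measured on n = 100 sentences
reported = [("DeepPavlov", 0.55), ("BART-MNLI", 0.77), ("RoBERTa-MNLI", 0.62)]

for name, acc in reported:
    lo, hi = normal_ci(acc, 100)
    print(f"{name}: {acc:.0%} (approx. 95% CI {lo:.1%} to {hi:.1%})")
```

Under this approximation the interval around 55% includes the 50% chance level, while the interval around 77% lies clearly above it.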
