<a href="https://colab.research.google.com/github/benzionchen/transformer_NLP_research/blob/main/transformer_research_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal is for you to practice:

- Picking a dataset (and possibly trying more than one).
- Selecting different pretrained models from the Hugging Face Hub.
- Measuring performance (accuracy, F1, etc.) and comparing your results.
- Doing a brief error analysis to see where each model struggles.

Already installed:

!pip install datasets
!pip install transformers
!pip install evaluate
!pip install accelerate
!pip install -U bitsandbytes
!pip install torch
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [None]:
import torch

In [None]:
# sanity check: confirm a GPU is visible before asking for its name
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

## 1. Choose a Dataset
You can pick one of these (or try others you find interesting):
- IMDB Movie Reviews (sentiment labels: positive/negative).
- Yelp Reviews (sentiment labels: star ratings or binary positive/negative).

For now, complete the rest of the steps (2-4) below with the above two datasets. Come back to do the following task after you're done with the above (time permitting):

We are going to use the Amazon Product Reviews dataset (various categories, which can be collapsed into positive/negative). You are free to decide how to collapse multiple categories into one, and you can compare different ways of doing so.

Feel free to use the datasets library (e.g., load_dataset("imdb")).


In [1]:
from datasets import load_dataset

imdb = load_dataset("imdb")
yelp = load_dataset("yelp_polarity") # the name of the dataset is not 'yelp', it's 'yelp_polarity'

imdb_texts = imdb["test"]["text"]
imdb_labels = imdb["test"]["label"]

yelp_texts = yelp["test"]["text"]
yelp_labels = yelp["test"]["label"]

print(imdb)
print(imdb_texts[0])
print(imdb_labels[0])
print(yelp)
print(yelp_texts[0])
print(yelp_labels[0])

# /usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
# The secret `HF_TOKEN` does not exist in your Colab secrets.
# To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
# You will be able to reuse this secret in all of your notebooks.

# what does this mean? API key?
# the dataset is quite big so we should limit it to maybe 1000

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, 

In [2]:
imdb_texts = imdb["test"]["text"][:1000]
imdb_labels = imdb["test"]["label"][:1000]

yelp_texts = yelp["test"]["text"][:1000]
yelp_labels = yelp["test"]["label"][:1000]

# take the first 1000 examples because the full test sets are too big
# 1000 is large enough to be meaningful, keeps runtime short, and avoids GPU out-of-memory errors

## 2. Select Two (or More) Pretrained Models

Pick at least two from the Hugging Face Hub and compare them:

DistilBERT (e.g., distilbert-base-uncased-finetuned-sst-2-english)

BERT (e.g., bert-base-uncased-finetuned-sst-2-english)

RoBERTa (e.g., cardiffnlp/twitter-roberta-base-sentiment-latest or roberta-base-openai-detector)

Feel free to explore the Hugging Face Model Hub if you find something else interesting!

In [3]:
import transformers
from transformers import pipeline

In [4]:
# load models
model1 = pipeline(
    'sentiment-analysis', 
    model = 'distilbert-base-uncased-finetuned-sst-2-english', 
    truncation = True # added truncation because runtime error
    )

model2 = pipeline(
    'sentiment-analysis', 
    model = 'cardiffnlp/twitter-roberta-base-sentiment-latest',
    truncation = True, # added truncation because runtime error
    padding = True, # added padding due to error with different lengths 
    max_length = 512 # cap sequences at 512 tokens (added while trying DeepSeek, kept for RoBERTa)
    )

# migrated from colab -> jupyter notebook
# bert-base-uncased-finetuned-sst-2-english is not available on HF? neither is 
# "bert-base-uncased-finetuned-sst-2-english"

# truncation: if False, nothing is cut, so the batch is as wide as its longest text (the number of columns in the matrix)
# padding: adds a padding token so every row in a batch has the same length; depending on how the model was
# trained, padding can affect the output, so it should only be used when batching needs it

Device set to use cuda:0
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
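
The truncation and padding notes in the cell above can be illustrated without loading a model. This is a minimal sketch, assuming token IDs are plain Python lists and that 0 is the padding ID (the real ID depends on the tokenizer). `pad_and_truncate` is a hypothetical helper written for this illustration, not a transformers function.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Return a rectangular batch plus an attention mask.

    Sequences longer than max_length are cut; shorter ones are padded
    up to the longest (post-truncation) sequence in the batch.
    """
    truncated = [seq[:max_length] for seq in batch]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Toy token IDs (made up for the example)
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 1, 2, 3, 4, 5, 6, 102]]
ids, mask = pad_and_truncate(toy_batch, max_length=6)
for row, m in zip(ids, mask):
    print(row, m)
```

Every row ends up with the same width, which is what lets a batch be stacked into a single (batch_size, sequence_length) matrix. Real tokenizers usually also keep the closing special token when truncating, which this sketch does not attempt.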


#### method 2 (how chat GPT suggested to load the model)
model1 = pipeline(
    'sentiment-analysis', 
    model='distilbert-base-uncased-finetuned-sst-2-english',
    tokenizer='distilbert-base-uncased-finetuned-sst-2-english', # tokenizer to load; if omitted, the pipeline uses the model's own
    device=0 # run on the first CUDA GPU (device=-1 selects the CPU)
)

model2 = pipeline(
    'sentiment-analysis',
    model='cardiffnlp/twitter-roberta-base-sentiment-latest',
    tokenizer='cardiffnlp/twitter-roberta-base-sentiment-latest',
    device=0
)


In [5]:
# testing it's loaded
print(model1)
print(model2)
print(model1("i love dogs"))
print(model1("i hate you"))
print(model2("i love dogs"))
print(model2("i hate you"))

<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x0000029404E6E040>
<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x0000029404FC3340>
[{'label': 'POSITIVE', 'score': 0.999713122844696}]
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
[{'label': 'positive', 'score': 0.9173767566680908}]
[{'label': 'negative', 'score': 0.7887632846832275}]


In [6]:
from tqdm import tqdm
# import a progress bar

# use DistilBERT (model1) and RoBERTa (model2) to predict sentiment for both the IMDB and Yelp datasets
imdb_prediction1 = [1 if r["label"] == "POSITIVE"
    else 0 for r in tqdm(model1(imdb_texts, batch_size=32, truncation=True, padding=True))]

yelp_prediction1 = [1 if r["label"] == "POSITIVE"
    else 0 for r in tqdm(model1(yelp_texts, batch_size=32, truncation=True, padding=True))]

# RuntimeError: The size of tensor a (532) must match the size of tensor b (512) at non-singleton dimension 1, need to add truncation for model1 (DistilBERT)
# this takes forever if texts are processed one by one; batch by 32 instead, or 16 if memory runs out

imdb_prediction2 = [1 if r["label"] == "positive"
    else 0 for r in tqdm(model2(imdb_texts, batch_size=32, truncation=True, padding=True))]

yelp_prediction2 = [1 if r["label"] == "positive"
    else 0 for r in tqdm(model2(yelp_texts, batch_size=32, truncation=True, padding=True))]

# batch_size=32 means 32 texts are classified at once; it does not change the model, it is just parallelism
# one batch is a matrix of shape (batch_size, sequence_length), i.e. (rows, cols)
# a bigger batch needs more GPU memory (with enough memory, all predictions could run in a single batch)
# batch norm is something different: a training-time technique that normalizes (centers and rescales)
# activations within a batch

100%|█████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2002054.42it/s]
100%|█████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1999191.61it/s]
100%|█████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2004925.43it/s]
100%|█████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2006844.02it/s]
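
The progress bars above finish almost instantly because `tqdm` wraps the list that the pipeline call has already returned, so it only times the loop over finished results. The sketch below shows one way to split the work into batches so that progress could be tracked while inference runs. `batched` and `fake_classifier` are hypothetical stand-ins for this illustration; the real pipeline handles batching internally through `batch_size`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_classifier(texts):
    # Stand-in for a pipeline call: labels a text POSITIVE if it contains "good".
    return [{"label": "POSITIVE" if "good" in t else "NEGATIVE"} for t in texts]


texts = ["good film", "bad film", "good food", "slow service", "good value"]
predictions = []
batches_done = 0
for chunk in batched(texts, batch_size=2):
    predictions.extend(fake_classifier(chunk))  # the slow step happens inside the loop
    batches_done += 1

print(batches_done, [p["label"] for p in predictions])
```

Wrapping `batched(...)` in `tqdm` here would advance the bar once per batch as the classifier runs, rather than after everything has finished.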


In [7]:
print(imdb_prediction1)
print(yelp_prediction1)
print('\n')
print(len(imdb_prediction1))

# 1000 predictions, one per review (0 = negative, 1 = positive)

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 

In [8]:
# check if yelp_prediction1 = yelp_prediction2, if not, compare the differences 
print(imdb_prediction2)
print(yelp_prediction2)
print('\n')
print(len(imdb_prediction2))

[0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 

In [9]:
def count_mismatches(array1, array2):
    if len(array1) != len(array2):
        raise ValueError("arrays must have the same length")

    # count mismatches
    mismatch_count = sum(1 for a, b in zip(array1, array2) if a != b)

    return mismatch_count

In [10]:
print(count_mismatches(imdb_prediction1, imdb_prediction2))
print(count_mismatches(imdb_prediction1, imdb_prediction2) / len(imdb_prediction1))
# 162 mismatches
# 16.2 percent of mismatches

162
0.162


## 3. Measure Performance
Implement an evaluation method on a test or validation split. At minimum:
- Accuracy: The fraction of examples predicted correctly.

- F1 Score: Combination of precision and recall (explained below, after the instructions).

You can use the Hugging Face evaluate or datasets library, or write your own small function for computing these metrics.

### F1 Score Explanation

Imagine you’re trying to detect “positive” cases—for example, emails that are spam. Your model’s predictions might fall into these categories:
- True Positive (TP): Predicted spam, actually spam
- False Positive (FP): Predicted spam, but it’s not spam
- True Negative (TN): Predicted not spam, actually not spam
- False Negative (FN): Predicted not spam, but it was spam

Two important measures come from this:
- Precision: Out of the emails you labeled spam, how many were actually spam? Precision = TP / (TP + FP)
- Recall: Out of the emails that were actually spam, how many did you catch? Recall = TP / (TP + FN)

But often, focusing on just Precision or just Recall is not enough. The F1 score combines both in a single number. It’s defined as the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This way, if either Precision or Recall is low, the F1 score will also be relatively low.

Example with a Small Confusion Matrix
Suppose your model had these results:

- TP = 4
- FP = 2
- FN = 1
- TN = 3

Then:

- Precision = 4 / (4 + 2) ≈ 0.67 (about 67%)
- Recall = 4 / (4 + 1) = 0.80 (80%)

So,
F1 = 2 × (0.67 × 0.80) / (0.67 + 0.80) ≈ 0.73
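
The same arithmetic can be checked directly from the confusion-matrix counts. This is a small sketch using only the formulas above; `precision_recall_f1` is a hypothetical helper name, and the code keeps unrounded intermediate values.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the small confusion matrix above (TN is not needed for F1)
precision, recall, f1 = precision_recall_f1(tp=4, fp=2, fn=1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.800 f1=0.727
```

Note that TN never enters the calculation: F1 only looks at how the positive class is handled.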

Quick Python Example

Below is a short snippet using sklearn (you don't have to use this; Hugging Face also has an F1 function) to calculate the F1 score from some example predictions:


```
from sklearn.metrics import f1_score

# True labels and model predictions
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Calculate F1 score
score = f1_score(y_true, y_pred)
print("F1 Score:", score)
```

If you run this, you'll see a single value that summarizes how well your predictions identify positives, taking into account both how often your positive predictions are correct (precision) and how many of the actual positives you caught (recall).

Key Takeaways

- F1 score balances Precision and Recall into one number.
- If you need a single metric to judge performance in situations where both false positives and false negatives matter, F1 is often a good choice.
- In Python, `sklearn.metrics.f1_score` makes it easy to compute.

In [11]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

def calculate_performance(predictions, labels):
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)  
    return accuracy, f1

distilbert_imdb_accuracy, distilbert_imdb_f1 = calculate_performance(imdb_prediction1, imdb_labels)
distilbert_yelp_accuracy, distilbert_yelp_f1 = calculate_performance(yelp_prediction1, yelp_labels)
roberta_imdb_accuracy, roberta_imdb_f1 = calculate_performance(imdb_prediction2, imdb_labels)
roberta_yelp_accuracy, roberta_yelp_f1 = calculate_performance(yelp_prediction2, yelp_labels)

print(f"DistilBERT IMDB Accuracy: {distilbert_imdb_accuracy}, F1: {distilbert_imdb_f1}")
print(f"DistilBERT Yelp Accuracy: {distilbert_yelp_accuracy}, F1: {distilbert_yelp_f1}")
print(f"RoBERTa IMDB Accuracy: {roberta_imdb_accuracy}, F1: {roberta_imdb_f1}")
print(f"RoBERTa Yelp Accuracy: {roberta_yelp_accuracy}, F1: {roberta_yelp_f1}")

# F1 is producing 0? why? 

DistilBERT IMDB Accuracy: 0.908, F1: 0.0
DistilBERT Yelp Accuracy: 0.891, F1: 0.883919062832801
RoBERTa IMDB Accuracy: 0.888, F1: 0.0
RoBERTa Yelp Accuracy: 0.858, F1: 0.8582834331337326


DistilBERT IMDB: 
- the model performs well on the negative class (precision: 1.00, recall: 0.91, F1: 0.95), but recall and F1 for the positive class are zero, which at first glance looks like DistilBERT failing on positive instances

RoBERTa IMDB: 
- performs very well on the negative class (precision: 1.00, recall: 0.89, F1: 0.94), but the positive class again scores F1: 0.0


Need to find out why positive instances are not being scored on the IMDB dataset


DistilBERT Yelp:  
- the model does better on Yelp: both the positive and negative classes show reasonable performance (precision, recall, and F1 between roughly 0.86 and 0.92), so it identifies both positive and negative reviews

RoBERTa Yelp: 
- lower performance on Yelp (accuracy: 0.86, F1: 0.86); not as strong as DistilBERT on this dataset, but still decent

In [12]:
# raw predictions from DistilBERT on the first 10 IMDB reviews
distilbert_raw_output = model1(imdb_texts[:10])

# raw output to inspect the probabilities
print("DistilBERT Raw Output (first 10 examples):")
for i, output in enumerate(distilbert_raw_output):
    print(f"Text: {imdb_texts[i]}")
    print('\n')
    print(f"Prediction: {output['label']}, Confidence: {output['score']}")
    print('\n')


print('\n')
print ('----- line break -----')
print('\n')

# raw predictions from RoBERTa on the first 10 IMDB reviews
roberta_raw_output = model2(imdb_texts[:10])

# raw output to inspect the probabilities
print("\nRoBERTa Raw Output (first 10 examples):")
for i, output in enumerate(roberta_raw_output):
    print(f"Text: {imdb_texts[i]}")
    print('\n')
    print(f"Prediction: {output['label']}, Confidence: {output['score']}")
    print('\n')
    
# the point of this is to check that these models are: 
# 1) actually outputting something
# 2) if the output is upper/lower case
# 3) how confident they are (a minimum acceptable confidence threshold could be set in the future)

# we see that these raw outputs clearly exist so there must be some problem with function logic or 
# something potentially wrong with the model being trained for identifying positive instances?

DistilBERT Raw Output (first 10 examples):
Text: I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers
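
The confidence threshold mentioned in the cell above could look something like the following. This is only a sketch: `apply_threshold` is a hypothetical helper, the 0.75 cutoff is an arbitrary assumption, and the sample outputs are illustrative dictionaries in the same shape the pipelines return.

```python
def apply_threshold(outputs, threshold=0.75):
    """Map pipeline-style outputs to 1/0, or None when the model is unsure.

    A prediction is kept only if its label is positive/negative (any case)
    and its score is at least `threshold`.
    """
    mapped = []
    for out in outputs:
        label = out["label"].lower()
        if out["score"] < threshold or label not in ("positive", "negative"):
            mapped.append(None)
        else:
            mapped.append(1 if label == "positive" else 0)
    return mapped


# Illustrative outputs, not real model results
sample_outputs = [
    {"label": "POSITIVE", "score": 0.9997},
    {"label": "NEGATIVE", "score": 0.6171},
    {"label": "neutral", "score": 0.8100},
    {"label": "negative", "score": 0.7888},
]
print(apply_threshold(sample_outputs))
# [1, None, None, 0]
```

Anything mapped to `None` would then need to be excluded from scoring or reported separately, since a stricter threshold trades coverage for confidence.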

In [None]:
# for i, (x, y) in enumerate(zip([1,2,3], ['a','b','c'])):
    # print(f"Index {i}: {x} -> {y}")

# Index 0: 1 -> a
# Index 1: 2 -> b
# Index 2: 3 -> c

# demonstration of enumerate()

In [13]:
# maybe the label check is tripping over "POSITIVE" vs "positive", so normalize with .lower() before comparing
# run prediction

imdb_prediction1 = [
    1 if r["label"].lower() == "positive" 
    else 0 if r["label"].lower() == "negative" 
    else None for r in tqdm(model1(imdb_texts, batch_size=32, truncation=True, padding=True))
]

yelp_prediction1 = [
    1 if r["label"].lower() == "positive" 
    else 0 if r["label"].lower() == "negative" 
    else None for r in tqdm(model1(yelp_texts, batch_size=32, truncation=True, padding=True))
]

imdb_prediction2 = [1 if r["label"].lower() == "positive" else 0 for r in model2(imdb_texts)]
yelp_prediction2 = [1 if r["label"].lower() == "positive" else 0 for r in model2(yelp_texts)]

100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 855980.41it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<?, ?it/s]


In [14]:
# copy paste code from above and try again to calculate performance 
def calculate_performance(predictions, true_labels):
    # filter out None values (for neutral predictions) and align with actual labels
    valid_predictions = [p for p in predictions if p is not None]
    valid_labels = [true_labels[i] for i in range(len(predictions)) if predictions[i] is not None]
    
    accuracy = accuracy_score(valid_labels, valid_predictions)
    f1 = f1_score(valid_labels, valid_predictions, zero_division=1)  # avoid undefined metric errors
    return accuracy, f1

distilbert_imdb_accuracy, distilbert_imdb_f1 = calculate_performance(imdb_prediction1, imdb_labels)
distilbert_yelp_accuracy, distilbert_yelp_f1 = calculate_performance(yelp_prediction1, yelp_labels)
roberta_imdb_accuracy, roberta_imdb_f1 = calculate_performance(imdb_prediction2, imdb_labels)
roberta_yelp_accuracy, roberta_yelp_f1 = calculate_performance(yelp_prediction2, yelp_labels)

print(f"DistilBERT IMDB Accuracy: {distilbert_imdb_accuracy}, F1: {distilbert_imdb_f1}")
print(f"DistilBERT Yelp Accuracy: {distilbert_yelp_accuracy}, F1: {distilbert_yelp_f1}")
print(f"RoBERTa IMDB Accuracy: {roberta_imdb_accuracy}, F1: {roberta_imdb_f1}")
print(f"RoBERTa Yelp Accuracy: {roberta_yelp_accuracy}, F1: {roberta_yelp_f1}")

# f1 still producing 0, so it can't be that the model is not recognizing the lower/upper case classifications

DistilBERT IMDB Accuracy: 0.908, F1: 0.0
DistilBERT Yelp Accuracy: 0.891, F1: 0.883919062832801
RoBERTa IMDB Accuracy: 0.888, F1: 0.0
RoBERTa Yelp Accuracy: 0.858, F1: 0.8582834331337326


In [15]:
print("DistilBERT Classification Report (IMDB):")
print(classification_report(imdb_labels, imdb_prediction1))

print("DistilBERT Classification Report (Yelp):")
print(classification_report(yelp_labels, yelp_prediction1))

print("RoBERTa Classification Report (IMDB):")
print(classification_report(imdb_labels, imdb_prediction2))

print("RoBERTa Classification Report (Yelp):")
print(classification_report(yelp_labels, yelp_prediction2))

# for IMDB: 
# the classification report is showing that "1" is not being detected and that f1 is defaulting to 0.0 

DistilBERT Classification Report (IMDB):
              precision    recall  f1-score   support

           0       1.00      0.91      0.95      1000
           1       0.00      0.00      0.00         0

    accuracy                           0.91      1000
   macro avg       0.50      0.45      0.48      1000
weighted avg       1.00      0.91      0.95      1000

DistilBERT Classification Report (Yelp):
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       518
           1       0.91      0.86      0.88       482

    accuracy                           0.89      1000
   macro avg       0.89      0.89      0.89      1000
weighted avg       0.89      0.89      0.89      1000

RoBERTa Classification Report (IMDB):
              precision    recall  f1-score   support

           0       1.00      0.89      0.94      1000
           1       0.00      0.00      0.00         0

    accuracy                           0.89      1000
   macro



#### Error message:
C:\Users\raiu\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

In [16]:
# try balancing the dataset: the warning says recall and F-score are ill-defined for labels with no true samples,
# so the 1000-example IMDB slice may not contain any positive reviews at all

# balance each dataset: 500 positive + 500 negative examples (1000 samples in total)
imdb_pos = [x for x, y in zip(imdb["test"]["text"], imdb["test"]["label"]) if y == 1][:500]
imdb_neg = [x for x, y in zip(imdb["test"]["text"], imdb["test"]["label"]) if y == 0][:500]
imdb_texts_balanced = imdb_pos + imdb_neg
imdb_labels_balanced = [1] * 500 + [0] * 500

yelp_pos = [x for x, y in zip(yelp["test"]["text"], yelp["test"]["label"]) if y == 1][:500]
yelp_neg = [x for x, y in zip(yelp["test"]["text"], yelp["test"]["label"]) if y == 0][:500]
yelp_texts_balanced = yelp_pos + yelp_neg
yelp_labels_balanced = [1] * 500 + [0] * 500

# confirm that the final sizes after balance are 1000 still
print(f"IMDB Size: {len(imdb_texts_balanced)}")
print(f"Yelp Size: {len(yelp_texts_balanced)}")


IMDB Size: 1000
Yelp Size: 1000
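
The cell above takes the first 500 examples of each class. An alternative, shown here only as a sketch, is to draw a seeded random sample per class so the subset is not tied to the order in which the split is stored. `sample_balanced` is a hypothetical helper, and the toy data stands in for the real texts and labels.

```python
import random


def sample_balanced(texts, labels, n_per_class, seed=0):
    """Randomly pick n_per_class examples from each label, then shuffle them."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    picked = []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        picked.extend(rng.sample(indices, n_per_class))  # ValueError if a class is too small
    rng.shuffle(picked)
    return [texts[i] for i in picked], [labels[i] for i in picked]


# Toy data stored sorted by label, like a split with all negatives first
toy_labels = [0] * 10 + [1] * 10
toy_texts = [f"review {i}" for i in range(20)]
texts_bal, labels_bal = sample_balanced(toy_texts, toy_labels, n_per_class=3, seed=42)
print(labels_bal)
```

Whether random sampling changes the metrics noticeably on IMDB or Yelp is not something this notebook measures; the sketch only shows the mechanics.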


In [17]:
# helper to get predictions
def get_predictions(model, texts):
    preds = model(texts, batch_size=32, truncation=True, padding=True, max_length=512)
    mapped = []
    for r in preds:
        label = r["label"].lower()
        if label == "positive":
            mapped.append(1)
        elif label == "negative":
            mapped.append(0)
        else:
            mapped.append(None)  # neutral or unknown
    return mapped

# get predictions for DistilBERT
imdb_prediction1 = get_predictions(model1, imdb_texts_balanced)
yelp_prediction1 = get_predictions(model1, yelp_texts_balanced)

# predictions for RoBERTa
imdb_prediction2 = get_predictions(model2, imdb_texts_balanced)
yelp_prediction2 = get_predictions(model2, yelp_texts_balanced)

# "You seem to be using the pipelines sequentially on GPU. 
# In order to maximize efficiency please use a dataset"
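
Because `get_predictions` maps any label other than positive or negative to `None`, and the evaluation drops those entries, the number of examples actually scored can differ between models. A minimal sketch of a coverage check follows; `coverage` is a hypothetical helper and the toy list is made up.

```python
def coverage(predictions):
    """Fraction of predictions that are not None (i.e. mapped to 0 or 1)."""
    if not predictions:
        return 0.0
    return sum(p is not None for p in predictions) / len(predictions)


# Toy predictions: None marks a neutral or unknown label
toy_predictions = [1, 0, None, 1, None, 0, 0, 1]
print(f"coverage: {coverage(toy_predictions):.3f}")
# coverage: 0.750
```

Reporting coverage next to accuracy and F1 makes it clear when a model's metrics describe only part of the sample.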

In [18]:
# evaluation function
def calculate_performance(predictions, labels, model_name, dataset_name):    
    valid_predictions = [p for p in predictions if p is not None]
    valid_labels = [labels[i] for i in range(len(predictions)) if predictions[i] is not None]

    accuracy = accuracy_score(valid_labels, valid_predictions)
    f1 = f1_score(valid_labels, valid_predictions, zero_division=1)
    
    print(f"\n{model_name} on {dataset_name} | Accuracy: {accuracy:.3f} | F1: {f1:.3f}")
    print(classification_report(valid_labels, valid_predictions, digits=3))
    
    return accuracy, f1

# report metrics for each model/dataset pair
calculate_performance(imdb_prediction1, imdb_labels_balanced, "DistilBERT", "IMDB")
calculate_performance(yelp_prediction1, yelp_labels_balanced, "DistilBERT", "Yelp")
print('\n')
calculate_performance(imdb_prediction2, imdb_labels_balanced, "RoBERTa", "IMDB")
calculate_performance(yelp_prediction2, yelp_labels_balanced, "RoBERTa", "Yelp")


DistilBERT on IMDB | Accuracy: 0.881 | F1: 0.878
              precision    recall  f1-score   support

           0      0.863     0.906     0.884       500
           1      0.901     0.856     0.878       500

    accuracy                          0.881      1000
   macro avg      0.882     0.881     0.881      1000
weighted avg      0.882     0.881     0.881      1000


DistilBERT on Yelp | Accuracy: 0.891 | F1: 0.888
              precision    recall  f1-score   support

           0      0.870     0.920     0.894       500
           1      0.915     0.862     0.888       500

    accuracy                          0.891      1000
   macro avg      0.892     0.891     0.891      1000
weighted avg      0.892     0.891     0.891      1000




RoBERTa on IMDB | Accuracy: 0.874 | F1: 0.872
              precision    recall  f1-score   support

           0      0.898     0.853     0.875       402
           1      0.850     0.896     0.872       374

    accuracy                     

(0.8845726970033296, 0.8951612903225807)

In [None]:
# originally: 

# DistilBERT IMDB Accuracy: 0.908, F1: 0.0
# DistilBERT Yelp Accuracy: 0.891, F1: 0.883919062832801
# RoBERTa IMDB Accuracy: 0.888, F1: 0.0
# RoBERTa Yelp Accuracy: 0.858, F1: 0.8582834331337326

# previously, the sample contained no positive examples; binary F1 is computed for the positive class (label 1),
# so with zero true positives it comes out as 0
# once the dataset is balanced with 500 positive and 500 negative examples, the classifier is
# evaluated on both classes, so precision, recall, and F1 are all computable

# f1 = 2 * (precision * recall) / (precision + recall)
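
A quick way to catch this kind of problem early is to count the labels in a slice before running any model. The sketch below uses a toy label list that is stored sorted by class, which is an assumption made for illustration; `label_counts` is a hypothetical helper.

```python
from collections import Counter


def label_counts(labels):
    """Count how many examples carry each label."""
    return dict(Counter(labels))


# Toy stand-in for a split that is stored sorted by label
sorted_labels = [0] * 1500 + [1] * 1500

print(label_counts(sorted_labels[:1000]))  # head slice: {0: 1000}
print(label_counts(sorted_labels))         # full split: {0: 1500, 1: 1500}
```

Running the same check on `imdb_labels` would show directly whether a slice contains both classes.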


## 4. Compare Models & Do a Short Error Analysis
After running inference on your test set:
- Compare Metrics: Which model is most accurate overall? Does one have higher F1?
- Identify Edge Cases:

  - Look at ~5 examples that were misclassified by at least one model.
  - What patterns do you see? (e.g., tricky wording, sarcasm, short text, etc.) If you don't see any pattern, that's fine, but make sure you've looked hard enough (maybe you need more than 5 examples?)
  - What examples do all models make mistakes on? What mistakes are unique to a particular model? (Again, if you don't find a pattern, that's fine, but make sure you've tried a lot of things.)
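
One way to find the examples these questions ask about is to bucket indices by which model got them wrong. This is a sketch with toy data; `error_buckets` is a hypothetical helper, and it treats a `None` prediction as wrong, which is a design choice rather than something the assignment specifies.

```python
def error_buckets(labels, preds_a, preds_b):
    """Group example indices by which model misclassified them."""
    if not (len(labels) == len(preds_a) == len(preds_b)):
        raise ValueError("all inputs must have the same length")
    buckets = {"both_wrong": [], "only_a_wrong": [], "only_b_wrong": []}
    for i, (y, a, b) in enumerate(zip(labels, preds_a, preds_b)):
        a_wrong, b_wrong = a != y, b != y  # None never equals a label, so it counts as wrong
        if a_wrong and b_wrong:
            buckets["both_wrong"].append(i)
        elif a_wrong:
            buckets["only_a_wrong"].append(i)
        elif b_wrong:
            buckets["only_b_wrong"].append(i)
    return buckets


# Toy labels and predictions for two models
toy_true = [1, 0, 1, 0, 1]
toy_a = [1, 1, 0, 0, 1]
toy_b = [1, 0, 0, 1, None]
print(error_buckets(toy_true, toy_a, toy_b))
# {'both_wrong': [2], 'only_a_wrong': [1], 'only_b_wrong': [3, 4]}
```

The indices in each bucket can then be used to print the corresponding review texts and read them side by side.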

####  Write a short paragraph or make a small table summarizing your findings:

- Which model performed best overall?
- Any surprising differences?
- How might you improve performance further?

#### Deliverables
- Code: A Python script or Jupyter notebook showing how you:
  - Load data (and potentially preprocess it).
  - Instantiate the Hugging Face pipelines.
  - Run predictions and calculate metrics.

- Short text on findings  (in a text box in your Colab notebook):
  - Which dataset(s) you chose and why.
  - The models you compared and a table/plot of accuracy or F1 (use matplotlib for graphs).
  - Example misclassified cases and your hypothesis for why they failed.

In [None]:
# calculate metrics
metrics = {}

metrics["DistilBERT-IMDB"] = calculate_performance(imdb_prediction1, imdb_labels_balanced, "DistilBERT", "IMDB")
metrics["DistilBERT-Yelp"] = calculate_performance(yelp_prediction1, yelp_labels_balanced, "DistilBERT", "Yelp")
metrics["RoBERTa-IMDB"]   = calculate_performance(imdb_prediction2, imdb_labels_balanced, "RoBERTa", "IMDB")
metrics["RoBERTa-Yelp"]   = calculate_performance(yelp_prediction2, yelp_labels_balanced, "RoBERTa", "Yelp")

In [None]:
# auto-updating Plot
import matplotlib.pyplot as plt

labels_plot = list(metrics.keys())
accuracies = [metrics[k][0] for k in labels_plot]
f1_scores  = [metrics[k][1] for k in labels_plot]

x = range(len(labels_plot))
width = 0.35

plt.bar([i - width/2 for i in x], accuracies, width=width, label="Accuracy")
plt.bar([i + width/2 for i in x], f1_scores, width=width, label="F1 Score")

for i, (a, f) in enumerate(zip(accuracies, f1_scores)):
    plt.text(i - width/2, a + 0.01, f"{a:.3f}", ha='center')
    plt.text(i + width/2, f + 0.01, f"{f:.3f}", ha='center')
    
plt.xticks(x, labels_plot, rotation=30)
plt.ylabel("Scores")
plt.title("Model Performance")
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.25), ncol=2, frameon=False) # place the legend below the plot, centered
plt.tight_layout()
plt.show()

## Reflection

F1 = 2 × (Precision × Recall) / (Precision + Recall). Accuracy = correct predictions / total predictions. Accuracy is a fine measure when the classes are balanced (in my case, I had to balance the positives and negatives myself) and when false positives and false negatives are equally costly. It can be misleading otherwise: if the dataset is imbalanced, say 90% negative and 10% positive, a model that always predicts negative reaches 90% accuracy, and what you actually care about is either catching all the positives (Recall) or making sure your predicted positives really are positive (Precision). F1 is commonly used because sentiment data is often imbalanced, although here I worked with a small, balanced sample.
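
The 90/10 case can be made concrete with a classifier that always predicts negative. This is a toy sketch in plain Python; `accuracy_and_f1` is a hypothetical helper that scores the positive class (label 1), and the labels are synthetic.

```python
def accuracy_and_f1(labels, predictions):
    """Return accuracy and positive-class F1 for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Synthetic data: 90% negative, 10% positive
imbalanced_labels = [0] * 90 + [1] * 10
always_negative = [0] * 100

acc, f1 = accuracy_and_f1(imbalanced_labels, always_negative)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
# accuracy=0.90 f1=0.00
```

The accuracy looks strong even though the classifier never finds a single positive, which is exactly the situation F1 is meant to expose.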

One pattern I see is that the models struggle with "not bad"-style reviews, where the writer says something along the lines of "bang for buck, but not the greatest movie in the world". DistilBERT classifies this one as negative: "Text: Worth the entertainment value of a rental, especially if you like action movies [...] The only thing I found really annoying was the constant cuts to VDs daughter during the last fight scene.<br /><br />Not bad. Not good. Passable 4. Prediction: NEGATIVE, Confidence: 0.6170620322227478". For the same text, RoBERTa returned "Prediction: positive, Confidence: 0.6511966586112976". Both confidences are low, so both models are unsure. This review is one of the first ten IMDB test reviews, which the classification reports above show are all labeled negative, so DistilBERT's prediction matches the label here and RoBERTa's is the error.

This is a model-specific mistake.

I think it's naive to say that the model with the highest F1 score is the best performing. There's nuance in evaluation. When the classes are balanced it is easier to default to F1, but there are cases where that still would not tell the full story.

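But in our case, both RoBERTa and DistilBERT are evaluated on the same balanced samples, so we can compare how they perform on the same two datasets. One caveat: RoBERTa's neutral predictions are dropped before scoring (its IMDB report covers 402 + 374 = 776 of the 1000 examples), so its metrics describe only the reviews it was willing to call positive or negative.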

RoBERTa has the higher F1 on Yelp (0.895 versus 0.888 for DistilBERT), although its accuracy (0.885) is slightly lower than DistilBERT's (0.891). RoBERTa's Yelp F1 is higher than its own accuracy, which happens when a model has more true positives than true negatives. That suggests RoBERTa leans toward the positive class on the examples it scores, not that it makes fewer errors overall.

Both models do better on Yelp than on IMDB. This is probably because Yelp reviews contain less sarcasm and humor and fewer unusual speech patterns, and because IMDB reviews tend to be lengthier discussions of the film.

For sentiment tasks, we should use F1 to judge performance, and RoBERTa had the highest F1 (0.895 on Yelp). RoBERTa is a bigger model, so you would expect it to perform better across the board, but DistilBERT actually performed better on IMDB (F1 0.878 versus 0.872). This may mean that IMDB reviews are more similar to the data DistilBERT was fine-tuned on (SST-2, which is built from movie-review sentences), whereas RoBERTa might not transfer easily from Twitter text to long movie reviews.

It's also a little strange that the F1 score ended up higher than accuracy in one instance, but this is possible: as long as there is at least one error, F1 exceeds accuracy exactly when there are more true positives than true negatives. RoBERTa might be the more consistently positive-leaning, "glass-half-full" type of model, whereas DistilBERT might be more "glass-half-empty".
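
The relationship between F1 and accuracy can be checked with two made-up confusion matrices that have the same number of errors but a different split of correct predictions. The counts below are invented for illustration and are not taken from the notebook's results; both helper functions are hypothetical.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_from_counts(tp, fp, fn):
    """Positive-class F1 written as 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Same 12 errors in each case; only the TP/TN split differs
positive_leaning = {"tp": 48, "tn": 40, "fp": 8, "fn": 4}
negative_leaning = {"tp": 40, "tn": 48, "fp": 4, "fn": 8}

for name, c in [("positive-leaning", positive_leaning), ("negative-leaning", negative_leaning)]:
    acc = accuracy_from_counts(c["tp"], c["tn"], c["fp"], c["fn"])
    f1 = f1_from_counts(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy={acc:.3f} f1={f1:.3f}")
# positive-leaning: accuracy=0.880 f1=0.889
# negative-leaning: accuracy=0.880 f1=0.870
```

Both toy models are equally accurate, yet the one with more true positives than true negatives has F1 above its accuracy and the other has F1 below it.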