<a href="https://colab.research.google.com/github/benzionchen/transformer_NLP_research/blob/main/transformer_research_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal is for you to practice:

- Picking a dataset (and possibly trying more than one).
- Selecting different pretrained models from the Hugging Face Hub.
- Measuring performance (accuracy, F1, etc.) and comparing your results.
- Doing a brief error analysis to see where each model struggles.

Already installed:

!pip install datasets
!pip install transformers
!pip install evaluate
!pip install accelerate
!pip install -U bitsandbytes
!pip install torch
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [1]:
import torch

In [2]:
# test
print(torch.cuda.get_device_name(0))
print(torch.cuda.is_available())

NVIDIA GeForce RTX 3080 Ti
True


## 1. Choose a Dataset
You can pick from the following (or try others you find interesting):
- IMDB Movie Reviews (sentiment labels: positive/negative).
- Yelp Reviews (sentiment labels: star ratings or binary positive/negative).

For now, complete the rest of the steps (2-4) below with the two datasets above. Time permitting, come back to the following task once you are done:

We are going to use the Amazon Product Reviews dataset (various categories, which can be collapsed into positive/negative). You are free to decide how to collapse multiple categories into one, and you can compare different approaches.
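
One possible way to collapse star ratings into binary labels is sketched below. It assumes ratings are integers from 1 to 5 and treats 3 stars as ambiguous; the thresholds and the helper name `collapse_rating` are illustrative choices, not something the assignment prescribes.

```python
def collapse_rating(stars, drop_neutral=True):
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (dropped)."""
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    # 3 stars: either drop the review or count it as negative
    return None if drop_neutral else 0

ratings = [5, 1, 3, 4, 2]  # hypothetical star ratings
labels = [collapse_rating(s) for s in ratings]
print(labels)  # [1, 0, None, 1, 0]
```

Running the evaluation once with `drop_neutral=True` and once with `drop_neutral=False` is one way to compare collapsing approaches.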

Feel free to use the datasets library (e.g., load_dataset("imdb")).


In [3]:
from datasets import load_dataset

imdb = load_dataset("imdb")
yelp = load_dataset("yelp_polarity") # the name of the dataset is not 'yelp', it's 'yelp_polarity'

imdb_texts = imdb["test"]["text"]
imdb_labels = imdb["test"]["label"]

yelp_texts = yelp["test"]["text"]
yelp_labels = yelp["test"]["label"]

print(imdb)
print(imdb_texts[0])
print(imdb_labels[0])
print(yelp)
print(yelp_texts[0])
print(yelp_labels[0])

# /usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
# The secret `HF_TOKEN` does not exist in your Colab secrets.
# To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
# You will be able to reuse this secret in all of your notebooks.

# what does this mean? API key?
# the dataset is quite big so we should limit it to maybe 1000

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, 

In [4]:
imdb_texts = imdb["test"]["text"][:1000]
imdb_labels = imdb["test"]["label"][:1000]

yelp_texts = yelp["test"]["text"][:1000]
yelp_labels = yelp["test"]["label"][:1000]

# sample 1000 examples because the full datasets are too big
# 1000 is large enough to be useful while keeping runtime short and avoiding GPU memory overflow
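
A head slice like `[:1000]` is only representative if the split is shuffled, so it may be worth counting the labels in the slice before trusting it. The sketch below uses a synthetic label list that is sorted by class (an assumption made purely for illustration) to show how a head slice can end up with a single class while a seeded random sample keeps both.

```python
import random
from collections import Counter

# Synthetic stand-in for a label column that happens to be sorted by class.
labels = [0] * 12500 + [1] * 12500

# Head slice: takes only the first rows, so it sees a single class here.
head = labels[:1000]
print(Counter(head))  # Counter({0: 1000})

# Seeded random sample of indices: reproducible and drawn from the whole split.
rng = random.Random(42)
idx = rng.sample(range(len(labels)), 1000)
sampled = [labels[i] for i in idx]
print(Counter(sampled))
```

On the real data the same check is `Counter(imdb_labels)`. If the slice turns out to be one-sided, `imdb["test"].shuffle(seed=42).select(range(1000))` is one way to draw a shuffled subset with the `datasets` library.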

## 2. Select Two (or More) Pretrained Models

Pick at least two from the Hugging Face Hub and compare them:

DistilBERT (e.g., distilbert-base-uncased-finetuned-sst-2-english)

BERT (e.g., bert-base-uncased-finetuned-sst-2-english)

RoBERTa (e.g., cardiffnlp/twitter-roberta-base-sentiment-latest or roberta-base-openai-detector)

Feel free to explore the Hugging Face Model Hub if you find something else interesting!

In [5]:
import transformers
from transformers import pipeline

In [6]:
# testing two different ways to load the models

# method 1
model1 = pipeline(
    'sentiment-analysis', 
    model = 'distilbert-base-uncased-finetuned-sst-2-english', 
    truncation = True # added truncation because runtime error
    )

model2 = pipeline(
    'sentiment-analysis', 
    model = 'cardiffnlp/twitter-roberta-base-sentiment-latest',
    truncation = True, # added truncation because runtime error
    padding = True, # added padding due to error with different lengths 
    max_length = 512 # cap sequences at 512 tokens (was getting an error w/ Deepseek)
    )

# migrated from colab -> jupyter notebook
# bert-base-uncased-finetuned-sst-2-english is not available on HF? neither is "bert-base-uncased-finetuned-sst-2-english"

Device set to use cuda:0
Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [None]:
# method 2 (chat GPT suggested)
model1 = pipeline(
    'sentiment-analysis', 
    model='distilbert-base-uncased-finetuned-sst-2-english',
    tokenizer='distilbert-base-uncased-finetuned-sst-2-english',
    device=0
)

model2 = pipeline(
    'sentiment-analysis',
    model='cardiffnlp/twitter-roberta-base-sentiment-latest',
    tokenizer='cardiffnlp/twitter-roberta-base-sentiment-latest',
    device=0
)


In [7]:
# testing it 
print(model1)
print(model2)
print(model1("i love dogs"))
print(model1("i hate you"))
print(model2("i love dogs"))
print(model2("i hate you"))

<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x000001B542766280>
<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x000001B542764550>
[{'label': 'POSITIVE', 'score': 0.999713122844696}]
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
[{'label': 'positive', 'score': 0.9173767566680908}]
[{'label': 'negative', 'score': 0.7887632846832275}]


In [None]:
# loading the models with method 1 or method 2 yields the same result:

# <transformers.pipelines.text_classification.TextClassificationPipeline object at 0x0000014F60FA6760>
# <transformers.pipelines.text_classification.TextClassificationPipeline object at 0x0000014F0013AF70>
# [{'label': 'POSITIVE', 'score': 0.999713122844696}]
# [{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
# [{'label': 'positive', 'score': 0.9173767566680908}]
# [{'label': 'negative', 'score': 0.7887632846832275}]

In [None]:
from tqdm import tqdm  # progress bar

# use DistilBERT (model1) to predict sentiment for both the imdb and yelp datasets
imdb_prediction1 = [1 if r["label"] == "POSITIVE"
    else 0 for r in tqdm(model1(imdb_texts, batch_size=32, truncation=True, padding=True))]

yelp_prediction1 = [1 if r["label"] == "POSITIVE"
    else 0 for r in tqdm(model1(yelp_texts, batch_size=32, truncation=True, padding=True))]

# RuntimeError: The size of tensor a (532) must match the size of tensor b (512) at non-singleton dimension 1
# -> inputs longer than 512 tokens need truncation for model1 (DistilBERT)
# running the texts one by one takes forever, so batch by 32 (drop to 16 if memory becomes a problem)

imdb_prediction2 = [1 if r["label"] == "positive"
    else 0 for r in tqdm(model2(imdb_texts, batch_size=32, truncation=True, padding=True))]

yelp_prediction2 = [1 if r["label"] == "positive"
    else 0 for r in tqdm(model2(yelp_texts, batch_size=32, truncation=True, padding=True))]
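
One caveat with the mapping above: `cardiffnlp/twitter-roberta-base-sentiment-latest` is a three-class model (negative, neutral, positive), so `else 0` silently counts neutral predictions as negative. A small helper can make that choice explicit. The sketch below works on plain dictionaries shaped like the pipeline output; the helper name `to_binary` and the sample outputs are hypothetical.

```python
def to_binary(result):
    """Map a pipeline-style result dict to 1, 0, or None for any other label."""
    label = result["label"].lower()  # handles both 'POSITIVE' and 'positive'
    if label == "positive":
        return 1
    if label == "negative":
        return 0
    return None  # e.g. 'neutral'

# Hypothetical outputs in the same shape the pipelines return.
outputs = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "negative", "score": 0.79},
    {"label": "neutral", "score": 0.55},
]
preds = [to_binary(r) for r in outputs]
print(preds)  # [1, 0, None]
```

Counting the `None` values shows how often the model abstains, which helps when deciding whether to drop those examples or count them as errors.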

In [None]:
print(imdb_prediction1)
print(yelp_prediction1)
print('\n')
print(len(imdb_prediction1))

# 1000 predictions, one per sampled review

In [None]:
# check if yelp_prediction1 = yelp_prediction2, if not, compare the differences 
print(imdb_prediction2)
print(yelp_prediction2)
print('\n')
print(len(imdb_prediction2))

In [None]:
def count_mismatches(array1, array2):
    if len(array1) != len(array2):
        raise ValueError("arrays must have the same length")

    # count mismatches
    mismatch_count = sum(1 for a, b in zip(array1, array2) if a != b)

    return mismatch_count

In [None]:
mismatches = count_mismatches(imdb_prediction1, imdb_prediction2)
print(mismatches)
print(mismatches / len(imdb_prediction1))  # fraction of predictions that differ
# 162 mismatches
# 16.2 percent of predictions differ between the two models

## 3. Measure Performance
Implement an evaluation method on a test or validation split. At minimum:
- Accuracy: The fraction of examples predicted correctly.

- F1 Score: Combination of precision and recall (explained below, after the instructions).

You can use the Hugging Face evaluate or datasets library or write your own small function for computing these metrics.

### F1 Score Explanation

Imagine you’re trying to detect “positive” cases—for example, emails that are spam. Your model’s predictions might fall into these categories:
- True Positive (TP): Predicted spam, actually spam
- False Positive (FP): Predicted spam, but it’s not spam
- True Negative (TN): Predicted not spam, actually not spam
- False Negative (FN): Predicted not spam, but it was spam

Two important measures come from this:
- Precision: Out of the emails you labeled spam, how many were actually spam? Precision = TP / (TP + FP)
- Recall: Out of the emails that were actually spam, how many did you catch? Recall = TP / (TP + FN)

But often, focusing on just Precision or just Recall is not enough. The F1 score combines both in a single number. It’s defined as the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This way, if either Precision or Recall is low, the F1 score will also be relatively low.

Example with a Small Confusion Matrix
Suppose your model had these results:

- TP = 4
- FP = 2
- FN = 1
- TN = 3

Then:

- Precision = 4/(4+2) ≈ 0.667 (about 67%)
- Recall = 4/(4+1) = 0.80 (80%)

So,
F1 = 2 × (0.667 × 0.80) / (0.667 + 0.80) ≈ 0.727
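
The arithmetic above can be checked in a few lines of plain Python, using the counts from the example (TP = 4, FP = 2, FN = 1).

```python
tp, fp, fn = 4, 2, 1  # counts from the small confusion matrix above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.800 f1=0.727
```

Rounding the intermediate values before combining them shifts the last digit slightly, so small differences from a hand calculation are expected.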

Quick Python Example

Below is a short snippet using sklearn (you don't have to use this; Hugging Face also has an F1 function) to calculate the F1 score from some example predictions:


```
from sklearn.metrics import f1_score

# True labels and model predictions
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Calculate F1 score
score = f1_score(y_true, y_pred)
print("F1 Score:", score)
```

If you run this, you’ll see a single value that summarizes how good your predictions are at correctly identifying positives (with both “how often you’re correct” in positives, and “how many positives you caught” taken into account).

Key Takeaways

- F1 score balances Precision and Recall into one number.
- If you need a single metric to judge performance in situations where both false positives and false negatives matter, F1 is often a good choice.
- In Python, `sklearn.metrics.f1_score` makes it easy to compute.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

def calculate_performance(predictions, labels):
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)  
    return accuracy, f1

distilbert_imdb_accuracy, distilbert_imdb_f1 = calculate_performance(imdb_prediction1, imdb_labels)
distilbert_yelp_accuracy, distilbert_yelp_f1 = calculate_performance(yelp_prediction1, yelp_labels)
roberta_imdb_accuracy, roberta_imdb_f1 = calculate_performance(imdb_prediction2, imdb_labels)
roberta_yelp_accuracy, roberta_yelp_f1 = calculate_performance(yelp_prediction2, yelp_labels)

print(f"DistilBERT IMDB Accuracy: {distilbert_imdb_accuracy}, F1: {distilbert_imdb_f1}")
print(f"DistilBERT Yelp Accuracy: {distilbert_yelp_accuracy}, F1: {distilbert_yelp_f1}")
print(f"RoBERTa IMDB Accuracy: {roberta_imdb_accuracy}, F1: {roberta_imdb_f1}")
print(f"RoBERTa Yelp Accuracy: {roberta_yelp_accuracy}, F1: {roberta_yelp_f1}")

# F1 is producing 0? why? 

DistilBERT IMDB: 
- the model performs well on the negative class (precision: 1.00, recall: 0.91, F1: 0.95), but recall and F1 for the positive class are zero. Negative-class precision of 1.00 combined with zero positive-class recall suggests the sample may contain no positive labels at all, rather than DistilBERT failing to predict positives

RoBERTa IMDB: 
- performs very well on the negative class (precision: 1.00, recall: 0.89, F1: 0.94), but positive-class F1 is 0.0, the same pattern as DistilBERT

Need to find out why positive instances are not being scored in the IMDB evaluation

DistilBERT Yelp:  
- the model does better on Yelp: both classes show reasonable performance (precision, recall, and F1 around 0.88 to 0.91), so it identifies both positive and negative reviews

RoBERTa Yelp: 
- lower performance on Yelp (accuracy: 0.86, F1: 0.86): not as strong as DistilBERT on this dataset, but still decent

In [None]:
# raw predictions from DistilBERT on the first 10 IMDB reviews
distilbert_raw_output = model1(imdb_texts[:10])

# raw output to inspect the probabilities
print("DistilBERT Raw Output (first 10 examples):")
for i, output in enumerate(distilbert_raw_output):
    print(f"Text: {imdb_texts[i]}")
    print('\n')
    print(f"Prediction: {output['label']}, Confidence: {output['score']}")
    print('\n')


print('\n')
print ('----- line break -----')
print('\n')

# raw predictions from RoBERTa on the first 10 IMDB reviews
roberta_raw_output = model2(imdb_texts[:10])

# raw output to inspect the probabilities
print("\nRoBERTa Raw Output (first 10 examples):")
for i, output in enumerate(roberta_raw_output):
    print(f"Text: {imdb_texts[i]}")
    print('\n')
    print(f"Prediction: {output['label']}, Confidence: {output['score']}")
    print('\n')
    
# the point of this is to check that these models are: 
# 1) actually outputting something
# 2) if the output is upper/lower case
# 3) how much confidence there is (potentially set an acceptable confidence threshold in the future)

# we see that these raw outputs clearly exist so there must be some problem with function logic or 
# something potentially wrong with the model being trained for identifying positive instances?
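
The confidence-threshold idea mentioned above could look roughly like this. The sketch operates on dictionaries shaped like the pipeline output; the threshold of 0.8 and the helper name `apply_threshold` are arbitrary illustrative choices.

```python
def apply_threshold(results, threshold=0.8):
    """Keep a (lowercased) label only when its score reaches the threshold."""
    return [
        r["label"].lower() if r["score"] >= threshold else None
        for r in results
    ]

# Hypothetical outputs in the same shape the pipelines return.
results = [
    {"label": "POSITIVE", "score": 0.9997},
    {"label": "negative", "score": 0.7888},
]
print(apply_threshold(results))  # ['positive', None]
```

Predictions that come back as `None` can then be reviewed by hand or excluded, at the cost of evaluating on fewer examples.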

In [None]:
# maybe the label comparison is failing on case ("POSITIVE" vs "positive"), so normalize with .lower() before comparing
imdb_prediction1 = [
    1 if r["label"].lower() == "positive" 
    else 0 if r["label"].lower() == "negative" 
    else None for r in tqdm(model1(imdb_texts, batch_size=32, truncation=True, padding=True))
]

yelp_prediction1 = [
    1 if r["label"].lower() == "positive" 
    else 0 if r["label"].lower() == "negative" 
    else None for r in tqdm(model1(yelp_texts, batch_size=32, truncation=True, padding=True))
]

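# note: model2 can also return "neutral", which this mapping counts as 0 (negative)
imdb_prediction2 = [1 if r["label"].lower() == "positive" else 0
    for r in model2(imdb_texts, batch_size=32, truncation=True, padding=True)]
yelp_prediction2 = [1 if r["label"].lower() == "positive" else 0
    for r in model2(yelp_texts, batch_size=32, truncation=True, padding=True)]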

In [None]:
# copy paste code from above and try again
def calculate_performance(predictions, true_labels):
    # filter out None values (for neutral predictions) and align with actual labels
    valid_predictions = [p for p in predictions if p is not None]
    valid_labels = [true_labels[i] for i in range(len(predictions)) if predictions[i] is not None]
    
    accuracy = accuracy_score(valid_labels, valid_predictions)
    f1 = f1_score(valid_labels, valid_predictions, zero_division=1)  # avoid undefined metric errors
    return accuracy, f1

distilbert_imdb_accuracy, distilbert_imdb_f1 = calculate_performance(imdb_prediction1, imdb_labels)
distilbert_yelp_accuracy, distilbert_yelp_f1 = calculate_performance(yelp_prediction1, yelp_labels)
roberta_imdb_accuracy, roberta_imdb_f1 = calculate_performance(imdb_prediction2, imdb_labels)
roberta_yelp_accuracy, roberta_yelp_f1 = calculate_performance(yelp_prediction2, yelp_labels)

print(f"DistilBERT IMDB Accuracy: {distilbert_imdb_accuracy}, F1: {distilbert_imdb_f1}")
print(f"DistilBERT Yelp Accuracy: {distilbert_yelp_accuracy}, F1: {distilbert_yelp_f1}")
print(f"RoBERTa IMDB Accuracy: {roberta_imdb_accuracy}, F1: {roberta_imdb_f1}")
print(f"RoBERTa Yelp Accuracy: {roberta_yelp_accuracy}, F1: {roberta_yelp_f1}")

# f1 still producing 0, so it can't be that the model is not recognizing the lower/upper case classifications

In [None]:
from sklearn.metrics import classification_report

print("DistilBERT Classification Report (IMDB):")
print(classification_report(imdb_labels, imdb_prediction1))

print("RoBERTa Classification Report (IMDB):")
print(classification_report(imdb_labels, imdb_prediction2))

print("DistilBERT Classification Report (Yelp):")
print(classification_report(yelp_labels, yelp_prediction1))

print("RoBERTa Classification Report (Yelp):")
print(classification_report(yelp_labels, yelp_prediction2))

# the classification report is showing that "1" is not being detected and that f1 is defaulting to 0.0 

In [None]:
# try balancing the evaluation sample: the sliced subset may contain only one class
# (the models are not being trained here, so this is not an overfitting issue)
# take 500 positive + 500 negative examples (1000 samples total per dataset)
imdb_pos = [x for x, y in zip(imdb["test"]["text"], imdb["test"]["label"]) if y == 1][:500]
imdb_neg = [x for x, y in zip(imdb["test"]["text"], imdb["test"]["label"]) if y == 0][:500]
imdb_texts_balanced = imdb_pos + imdb_neg
imdb_labels_balanced = [1] * 500 + [0] * 500

yelp_pos = [x for x, y in zip(yelp["test"]["text"], yelp["test"]["label"]) if y == 1][:500]
yelp_neg = [x for x, y in zip(yelp["test"]["text"], yelp["test"]["label"]) if y == 0][:500]
yelp_texts_balanced = yelp_pos + yelp_neg
yelp_labels_balanced = [1] * 500 + [0] * 500

# confirm that the final sizes after balance are 1000 still
print(f"IMDB Size: {len(imdb_texts_balanced)}")
print(f"Yelp Size: {len(yelp_texts_balanced)}")
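
Taking the first 500 rows of each class works, but it always picks the same reviews from the top of each class. A seeded random sample per class is one alternative. The sketch below uses synthetic data; `balanced_sample` is a hypothetical helper, not part of the `datasets` library.

```python
import random

def balanced_sample(texts, labels, per_class, seed=0):
    """Return texts and labels with `per_class` random examples of each label."""
    rng = random.Random(seed)
    out_texts, out_labels = [], []
    for cls in sorted(set(labels)):
        # indices of every example belonging to this class
        idx = [i for i, y in enumerate(labels) if y == cls]
        for i in rng.sample(idx, per_class):
            out_texts.append(texts[i])
            out_labels.append(cls)
    return out_texts, out_labels

# Synthetic stand-in data: 20 negative reviews followed by 20 positive ones.
texts = [f"review {i}" for i in range(40)]
labels = [0] * 20 + [1] * 20

sample_texts, sample_labels = balanced_sample(texts, labels, per_class=5)
print(sample_labels.count(0), sample_labels.count(1))  # 5 5
```

Fixing the seed keeps the sample reproducible, so both models are scored on exactly the same reviews.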


In [None]:
# load both models
model1 = pipeline(
    'sentiment-analysis', 
    model='distilbert-base-uncased-finetuned-sst-2-english',
    tokenizer='distilbert-base-uncased-finetuned-sst-2-english',
    device=0
)

model2 = pipeline(
    'sentiment-analysis',
    model='cardiffnlp/twitter-roberta-base-sentiment-latest',
    tokenizer='cardiffnlp/twitter-roberta-base-sentiment-latest',
    device=0
)



In [None]:
# helper to get predictions
def get_predictions(model, texts):
    preds = model(texts, batch_size=32, truncation=True, padding=True, max_length=512)
    mapped = []
    for r in preds:
        label = r["label"].lower()
        if label == "positive":
            mapped.append(1)
        elif label == "negative":
            mapped.append(0)
        else:
            mapped.append(None)  # neutral or unknown
    return mapped

# get predictions
imdb_prediction1 = get_predictions(model1, imdb_texts_balanced)
yelp_prediction1 = get_predictions(model1, yelp_texts_balanced)

imdb_prediction2 = get_predictions(model2, imdb_texts_balanced)
yelp_prediction2 = get_predictions(model2, yelp_texts_balanced)

# "You seem to be using the pipelines sequentially on GPU. 
# In order to maximize efficiency please use a dataset"

In [None]:
# evaluation function: prints a report and returns (accuracy, f1)
def calculate_performance(predictions, labels, model_name, dataset_name):
    # drop examples where the model gave no usable label (e.g. neutral)
    valid = [(p, l) for p, l in zip(predictions, labels) if p is not None]
    if not valid:
        print(f"{model_name} {dataset_name} | No valid predictions.")
        return None, None
    valid_preds, valid_labels = zip(*valid)
    acc = accuracy_score(valid_labels, valid_preds)
    f1 = f1_score(valid_labels, valid_preds)
    print(f"{model_name} {dataset_name} | Accuracy: {acc:.3f} | F1: {f1:.3f}")
    print(classification_report(valid_labels, valid_preds, digits=3))
    return acc, f1

# results (stored so later cells use the balanced-sample metrics)
distilbert_imdb_accuracy, distilbert_imdb_f1 = calculate_performance(imdb_prediction1, imdb_labels_balanced, "DistilBERT", "IMDB")
distilbert_yelp_accuracy, distilbert_yelp_f1 = calculate_performance(yelp_prediction1, yelp_labels_balanced, "DistilBERT", "Yelp")
roberta_imdb_accuracy, roberta_imdb_f1 = calculate_performance(imdb_prediction2, imdb_labels_balanced, "RoBERTa", "IMDB")
roberta_yelp_accuracy, roberta_yelp_f1 = calculate_performance(yelp_prediction2, yelp_labels_balanced, "RoBERTa", "Yelp")

In [None]:
# originally: 

# DistilBERT IMDB Accuracy: 0.908, F1: 0.0
# DistilBERT Yelp Accuracy: 0.891, F1: 0.883919062832801
# RoBERTa IMDB Accuracy: 0.888, F1: 0.0
# RoBERTa Yelp Accuracy: 0.858, F1: 0.8582834331337326

# previously, the sampled IMDB labels contained no positive examples; f1_score scores the
# positive class (label 1) by default, so with zero true positives F1 comes out as 0
# with a balanced sample of 500 positive and 500 negative examples, both classes are
# present during eval, so precision, recall, and f1 are all meaningful

# f1 = 2 * precision * recall / (precision + recall)
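
The zero F1 can be reproduced without any model. The sketch below uses synthetic labels with no positive examples and synthetic predictions chosen so that accuracy comes out at 0.908, the DistilBERT IMDB figure above. It computes the positive-class metrics by hand rather than calling sklearn.

```python
labels = [0] * 1000                 # synthetic: no positive examples at all
predictions = [1] * 92 + [0] * 908  # synthetic: 92 false positives

# Confusion counts for the positive class (label 1).
tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# F1 = 2TP / (2TP + FP + FN); guard the division for the fully degenerate case.
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

print(accuracy, f1)  # 0.908 0.0
```

Accuracy stays high because most negatives are classified correctly, while F1 for the positive class is zero simply because there is nothing positive to find.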


## 4. Compare Models & Do a Short Error Analysis
After running inference on your test set:
- Compare Metrics: Which model is most accurate overall? Does one have higher F1?
- Identify Edge Cases:

  - Look at ~5 examples that were misclassified by at least one model.
  - What patterns do you see? (e.g., tricky wording, sarcasm, short text, etc.) If you don't see any pattern, that's fine, but make sure you've looked hard enough (maybe you need more than 5 examples?)
  - What examples do all models make mistakes on? What mistakes are unique to a particular model? (Again, if you don't find a pattern that's fine, but make sure you've tried a lot of things.)
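
One way to organize this comparison is to split the test indices by which models got them wrong. The sketch below uses made-up labels and predictions; `partition_errors` is a hypothetical helper.

```python
def partition_errors(labels, preds_a, preds_b):
    """Group example indices by which of two models misclassified them."""
    groups = {"both_wrong": [], "only_a_wrong": [], "only_b_wrong": []}
    for i, (y, a, b) in enumerate(zip(labels, preds_a, preds_b)):
        if a != y and b != y:
            groups["both_wrong"].append(i)
        elif a != y:
            groups["only_a_wrong"].append(i)
        elif b != y:
            groups["only_b_wrong"].append(i)
    return groups

# Made-up labels and predictions for two models.
labels  = [1, 0, 1, 0, 1]
preds_a = [1, 1, 0, 0, 0]
preds_b = [1, 0, 0, 1, 1]
print(partition_errors(labels, preds_a, preds_b))
# {'both_wrong': [2], 'only_a_wrong': [1, 4], 'only_b_wrong': [3]}
```

With the notebook's variables this could be called as `partition_errors(imdb_labels_balanced, imdb_prediction1, imdb_prediction2)`, after which the texts at a handful of indices from each group can be read by hand. Note that a `None` prediction counts as wrong under this comparison.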

####  Write a short paragraph or make a small table summarizing your findings:

- Which model performed best overall?
- Any surprising differences?
- How might you improve performance further?

#### Deliverables
- Code: A Python script or Jupyter notebook showing how you:
  - Load data (and potentially preprocess it).
  - Instantiate the Hugging Face pipelines.
  - Run predictions and calculate metrics.

- Short text on findings (in a text box in your Colab notebook):
  - Which dataset(s) you chose and why.
  - The models you compared and a table/plot of accuracy or F1 (use matplotlib for graphs).
  - Example misclassified cases and your hypothesis for why they failed.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

models = ['DistilBERT', 'RoBERTa']
x = np.arange(len(models))  # one group of bars per model
width = 0.35                # width of each bar within a group

plt.figure(figsize=(10, 5))

# accuracy plot: offset the two datasets so the bars sit side by side instead of overlapping
plt.subplot(1, 2, 1)
plt.bar(x - width / 2, [distilbert_imdb_accuracy, roberta_imdb_accuracy], width, alpha=0.7, label="IMDB Accuracy")
plt.bar(x + width / 2, [distilbert_yelp_accuracy, roberta_yelp_accuracy], width, alpha=0.7, label="Yelp Accuracy")
plt.xticks(x, models)
plt.ylabel("Accuracy")
plt.title("Model Comparison: Accuracy")
plt.legend()

# f1 score plot
plt.subplot(1, 2, 2)
plt.bar(x - width / 2, [distilbert_imdb_f1, roberta_imdb_f1], width, alpha=0.7, label="IMDB F1")
plt.bar(x + width / 2, [distilbert_yelp_f1, roberta_yelp_f1], width, alpha=0.7, label="Yelp F1")
plt.xticks(x, models)
plt.ylabel("F1 Score")
plt.title("Model Comparison: F1 Score")
plt.legend()

# show plots
plt.tight_layout()
plt.show()