<a href="https://colab.research.google.com/github/benzionchen/transformer_NLP_research/blob/main/transformer_research_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal is for you to practice:

- Picking a dataset (and possibly trying more than one).
- Selecting different pretrained models from the Hugging Face Hub.
- Measuring performance (accuracy, F1, etc.) and comparing your results.
- Doing a brief error analysis to see where each model struggles.

In [1]:
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install accelerate
!pip install -U bitsandbytes
!pip install torch



## 1. Choose a Dataset
You can pick one of the following (or try others you find interesting):
- IMDB Movie Reviews (sentiment labels: positive/negative).
- Yelp Reviews (sentiment labels: star ratings or binary positive/negative).

For now, complete steps 2-4 below with the two datasets above. Time permitting, come back afterwards for the following task:

Use the Amazon Product Reviews dataset (various categories, which can be collapsed into positive/negative). You are free to decide how to collapse multiple categories into one, and you can also compare different ways of doing so.
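One possible way to collapse ratings into binary sentiment is sketched below. It assumes the ratings are integers from 1 to 5, which should be checked against whichever Amazon dataset you load. The names `collapse_stars` and `collapse_reviews` and the thresholds are illustrative choices, not part of the assignment.

```python
def collapse_stars(stars, neutral="drop"):
    """Map a 1-5 star rating to 0 (negative), 1 (positive), or None (dropped)."""
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    # 3-star reviews are ambiguous: drop them or fold them into one side.
    if neutral == "drop":
        return None
    if neutral == "negative":
        return 0
    if neutral == "positive":
        return 1
    raise ValueError(f"unknown neutral policy: {neutral!r}")


def collapse_reviews(texts, ratings, neutral="drop"):
    """Return (texts, labels) with ratings collapsed and dropped rows removed."""
    kept_texts, kept_labels = [], []
    for text, stars in zip(texts, ratings):
        label = collapse_stars(stars, neutral)
        if label is not None:
            kept_texts.append(text)
            kept_labels.append(label)
    return kept_texts, kept_labels
```

Running the evaluation once per `neutral` policy is one way to compare collapsing approaches.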

Feel free to use the datasets library (e.g., load_dataset("imdb")).


In [2]:
from datasets import load_dataset

imdb = load_dataset("imdb")
yelp = load_dataset("yelp_polarity") # the name of the dataset is not 'yelp', it's 'yelp_polarity'

imdb_texts = imdb["test"]["text"]
imdb_labels = imdb["test"]["label"]

yelp_texts = yelp["test"]["text"]
yelp_labels = yelp["test"]["label"]

print(imdb)
print(imdb_texts[0])
print(imdb_labels[0])
print(yelp)
print(yelp_texts[0])
print(yelp_labels[0])

# Note: Colab warns that the secret `HF_TOKEN` does not exist. This is a Hugging Face access token
# (created at https://huggingface.co/settings/tokens); authentication is recommended but optional
# for public models and datasets, so the warning can be ignored here.

# The test splits are large (25,000 rows for IMDB), so the next cell limits each dataset to 1000 examples.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, 

In [4]:
imdb_texts = imdb["test"]["text"][:1000]
imdb_labels = imdb["test"]["label"][:1000]

yelp_texts = yelp["test"]["text"][:1000]
yelp_labels = yelp["test"]["label"][:1000]

# 1000 examples per dataset: large enough to be informative, small enough to keep runtime short and avoid GPU out-of-memory errors
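A caveat on taking the first 1,000 rows: if a split happens to be stored sorted by label, the slice can contain a single class, which makes F1 uninformative. The sketch below draws a shuffled, class-balanced sample from plain Python lists. `balanced_sample` is a hypothetical helper, and the seed value is arbitrary.

```python
import random


def balanced_sample(texts, labels, n_per_class, seed=0):
    """Pick up to n_per_class examples of each label, in shuffled order."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group example indices by their label.
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)

    # Sample the same number of indices from each class.
    chosen = []
    for y, idxs in sorted(by_label.items()):
        chosen.extend(rng.sample(idxs, min(n_per_class, len(idxs))))

    rng.shuffle(chosen)  # avoid leaving the classes in contiguous blocks
    return [texts[i] for i in chosen], [labels[i] for i in chosen]
```

For example, `balanced_sample(list(imdb["test"]["text"]), list(imdb["test"]["label"]), 500)` would give 1,000 examples with both classes represented.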

## 2. Select Two (or More) Pretrained Models

Pick at least two from the Hugging Face Hub and compare them:

- DistilBERT (e.g., `distilbert-base-uncased-finetuned-sst-2-english`)
- BERT (e.g., `bert-base-uncased-finetuned-sst-2-english`)
- RoBERTa (e.g., `cardiffnlp/twitter-roberta-base-sentiment-latest` or `roberta-base-openai-detector`)

Feel free to explore the Hugging Face Model Hub if you find something else interesting!

In [3]:
import transformers
from transformers import pipeline

model1 = pipeline('sentiment-analysis', model = 'distilbert-base-uncased-finetuned-sst-2-english')
model2 = pipeline('sentiment-analysis', model = 'bert-base-uncased-finetuned-sst-2-english')

# Loading started on CPU (forgot to switch the runtime to a T4 GPU).
# 'bert-base-uncased-finetuned-sst-2-english' is not a valid model identifier on the Hugging Face Hub (see the OSError below).

Device set to use cuda:0


OSError: bert-base-uncased-finetuned-sst-2-english is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

In [5]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model1 = pipeline(
    'sentiment-analysis',
    model = 'distilbert-base-uncased-finetuned-sst-2-english',
    truncation = True # truncate long reviews to the model's 512-token limit; without this, inference raises a RuntimeError
    )

# Load a DeepSeek model and its tokenizer.
# 'deepseek-ai/DeepSeek-R1' is too big for this GPU, so use the smaller 7B base model with 4-bit quantization instead.

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = "float16", # weights are stored in 4-bit; computations run in fp16
)

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base", trust_remote_code=True)
model2 = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-llm-7b-base",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# "Loading checkpoint shards": the 7B weights are too big for a single file, so they are split across multiple files (2 here) that are loaded one by one.

Device set to use cuda:0


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
print(model1)

<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x786de623a190>


In [7]:
print(model2)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(102400, 4096)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
      )
    )
    (norm): LlamaRMSNorm((409

In [8]:
# testing it works
model1("bert perform sentiment analysis")

[{'label': 'POSITIVE', 'score': 0.9784988760948181}]

In [9]:
from tqdm import tqdm
# import a progress bar

# use DistilBERT (model1) to predict sentiment for both the imdb and yelp datasets

imdb_prediction1 = [1 if r["label"] == "POSITIVE"
    else 0 for r in tqdm(model1(imdb_texts, batch_size=32))]

yelp_prediction1 = [1 if r["label"] == "POSITIVE"
    else 0 for r in tqdm(model1(yelp_texts, batch_size=32))]

# Without truncation this raised: RuntimeError: The size of tensor a (532) must match the size of tensor b (512) at non-singleton dimension 1,
# so truncation=True was added when creating model1 (DistilBERT accepts at most 512 tokens).
# Processing examples one by one takes forever; batch_size=32 speeds this up (drop to 16 if GPU memory runs out).


100%|██████████| 1000/1000 [00:00<00:00, 1256532.06it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1477908.39it/s]
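The progress bars above finish instantly because `tqdm` wraps the list the pipeline has already returned, so inference is complete before the bar starts counting. The sketch below advances the bar once per batch instead. `predict_in_batches` is a hypothetical helper, and `classify` stands for any callable that maps a list of texts to a list of `{"label": ...}` dicts, as `model1` does.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a no-op wrapper if tqdm is not installed.
    def tqdm(iterable, **kwargs):
        return iterable


def predict_in_batches(classify, texts, batch_size=32):
    """Run `classify` batch by batch and return 1/0 labels (1 = POSITIVE)."""
    preds = []
    starts = range(0, len(texts), batch_size)
    for start in tqdm(starts, desc="batches"):
        batch = texts[start:start + batch_size]
        results = classify(batch)  # inference happens here, inside the loop
        preds.extend(1 if r["label"] == "POSITIVE" else 0 for r in results)
    return preds
```

With the notebook's objects this could be called as `predict_in_batches(lambda b: model1(b, batch_size=32), imdb_texts)`.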


In [12]:
print(imdb_prediction1)
print(yelp_prediction1)

print(len(imdb_prediction1))

# 1000 predictions per dataset (0 = negative, 1 = positive)

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 

In [10]:
model2("deepseek perform sentiment analysis")

# model2 is a raw causal LM, not a pipeline: it expects token-ID tensors, so the text must be tokenized first

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not str

In [None]:
def deepseek_prediction(text):
  # Prompt = instruction + the review text passed into the function.
  prompt = f"classify the sentiment as negative or positive: {text}"
  # Tokenize the prompt into input IDs and an attention mask.
  # return_tensors='pt' returns PyTorch tensors; .to(model2.device) moves them to the same device as the model.
  inputs = tokenizer(prompt, return_tensors='pt', truncation = True, max_length = 512).to(model2.device)
  # Generate a continuation of the prompt, limited to 10 new tokens.
  outputs = model2.generate(**inputs, max_new_tokens=10)
  # generate() returns prompt + continuation; keep only the new tokens so the words
  # "negative or positive" in the prompt itself cannot be matched below.
  new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
  # Decode token IDs -> string; skip_special_tokens drops markers such as <pad> and <eos> (end of sequence).
  decoded = tokenizer.decode(new_tokens, skip_special_tokens=True).lower()

  if "positive" in decoded:
    return 1
  elif "negative" in decoded:
    return 0
  else:
    return -1 # no label found in the model's answer
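Because `deepseek_prediction` can return -1 when no label is found, those cases need a policy before metrics are computed. The sketch below shows one option: count every undecided answer as an error by replacing it with the opposite of the true label, and report how often the model abstained. `resolve_undecided` is a hypothetical helper, and this policy is an assumption rather than a requirement of the assignment.

```python
def resolve_undecided(y_true, y_pred, undecided=-1):
    """Turn undecided predictions into guaranteed errors and report their rate."""
    # For binary labels 0/1, (1 - true_label) is always the wrong answer.
    resolved = [(1 - t) if p == undecided else p for t, p in zip(y_true, y_pred)]
    n_undecided = sum(p == undecided for p in y_pred)
    rate = n_undecided / len(y_pred) if y_pred else 0.0
    return resolved, rate
```

Reporting the abstention rate next to accuracy and F1 keeps the comparison with the pipeline model honest, since that model always returns a label.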


## 3. Measure Performance
Implement an evaluation method on a test or validation split. At minimum, report:
- Accuracy: The fraction of examples predicted correctly.
- F1 Score: A combination of precision and recall (explained below, after the instructions).

You can use the Hugging Face `evaluate` or `datasets` library, or write your own small function to compute these metrics.

### F1 Score Explanation

Imagine you’re trying to detect “positive” cases—for example, emails that are spam. Your model’s predictions might fall into these categories:
- True Positive (TP): Predicted spam, actually spam
- False Positive (FP): Predicted spam, but it’s not spam
- True Negative (TN): Predicted not spam, actually not spam
- False Negative (FN): Predicted not spam, but it was spam

Two important measures come from this:
- Precision: Out of the emails you labeled spam, how many were actually spam? Precision = TP / (TP + FP)
- Recall: Out of the emails that were actually spam, how many did you catch? Recall = TP / (TP + FN)

But often, focusing on just Precision or just Recall is not enough. The F1 score combines both in a single number. It’s defined as the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This way, if either Precision or Recall is low, the F1 score will also be relatively low.

#### Example with a Small Confusion Matrix

Suppose your model had these results:

- TP = 4
- FP = 2
- FN = 1
- TN = 3

Then:

- Precision = 4 / (4 + 2) ≈ 0.67 (about 67%)
- Recall = 4 / (4 + 1) = 0.80 (80%)

So,

F1 = 2 × (0.67 × 0.80) / (0.67 + 0.80) ≈ 0.73
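The arithmetic can be checked in a few lines. This sketch computes the metrics directly from the four confusion-matrix counts without intermediate rounding, which gives F1 = 8/11 ≈ 0.727. The function name `metrics_from_counts` is illustrative.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


m = metrics_from_counts(tp=4, fp=2, fn=1, tn=3)
print({name: round(value, 2) for name, value in m.items()})
# {'accuracy': 0.7, 'precision': 0.67, 'recall': 0.8, 'f1': 0.73}
```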

#### Quick Python Example

Below is a short snippet using sklearn (you don't have to use this; Hugging Face also has an F1 function) to calculate the F1 score from some example predictions:


```
from sklearn.metrics import f1_score

# True labels and model predictions
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Calculate F1 score
score = f1_score(y_true, y_pred)
print("F1 Score:", score)
```

If you run this, you’ll see a single value that summarizes how good your predictions are at correctly identifying positives (with both “how often you’re correct” in positives, and “how many positives you caught” taken into account).

#### Key Takeaways

- F1 score balances Precision and Recall into one number.
- If you need a single metric to judge performance in situations where both false positives and false negatives matter, F1 is often a good choice.
- In Python, `sklearn.metrics.f1_score` makes it easy to compute.

## 4. Compare Models & Do a Short Error Analysis
After running inference on your test set:
- Compare Metrics: Which model is most accurate overall? Does one have higher F1?
- Identify Edge Cases:

  - Look at ~5 examples that were misclassified by at least one model.
  - What patterns do you see? (e.g., tricky wording, sarcasm, short text, etc.) If you don't see any pattern, that's fine, but make sure you've looked hard enough (maybe you need more than 5 examples?)
  - What examples do all models make mistakes on? What mistakes are unique to a particular model? (Again, if you don't find a pattern, that's fine, but make sure you've tried a lot of things.)
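One way to organize this comparison is sketched below: collect the indices each model gets wrong, then split them into errors shared by all models and errors unique to one model. The helper name `error_sets` and the dict-of-predictions layout are assumptions made for illustration.

```python
def error_sets(y_true, preds_by_model):
    """Return (wrong, shared, unique) sets of misclassified example indices."""
    # Indices each model misclassified.
    wrong = {
        name: {i for i, (t, p) in enumerate(zip(y_true, preds)) if t != p}
        for name, preds in preds_by_model.items()
    }
    # Examples every model gets wrong.
    shared = set.intersection(*wrong.values()) if wrong else set()
    # Examples only one particular model gets wrong.
    unique = {}
    for name, errs in wrong.items():
        others = set().union(*(e for other, e in wrong.items() if other != name))
        unique[name] = errs - others
    return wrong, shared, unique
```

The returned indices can be used to print the corresponding review texts, for example `[imdb_texts[i] for i in sorted(shared)][:5]`, and read them for patterns.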

Write a short paragraph or make a small table summarizing your findings:

- Which model performed best overall?
- Any surprising differences?
- How might you improve performance further?
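For the summary table, a small formatter like the one below may help. It is a sketch only: `results_table` and the layout of the `results` dictionary are hypothetical, and the metric values must come from your own runs.

```python
def results_table(results):
    """Render {(model, dataset): {"accuracy": x, "f1": y}} as a markdown table."""
    lines = [
        "| Model | Dataset | Accuracy | F1 |",
        "|---|---|---|---|",
    ]
    # Sort so that rows for the same model appear together.
    for (model, dataset), metrics in sorted(results.items()):
        lines.append(
            f"| {model} | {dataset} | {metrics['accuracy']:.3f} | {metrics['f1']:.3f} |"
        )
    return "\n".join(lines)
```

The returned string can be pasted into a text cell of the notebook.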

### Deliverables
- Code: A Python script or Jupyter notebook showing how you:
  - Load data (and potentially preprocess it).
  - Instantiate the Hugging Face pipelines.
  - Run predictions and calculate metrics.

- Short text on findings (in a text cell in your Colab notebook):
  - Which dataset(s) you chose and why.
  - The models you compared and a table/plot of accuracy or F1 (use matplotlib for graphs).
  - Example misclassified cases and your hypothesis for why they failed.

If something is unclear, feel free to reach out to me. Try to get as far as you can, and the next time we meet I'll ask you to explain your findings and the choices you made.