# Evaluating LLMs

In this assignment, we will examine text generated by LLMs and evaluate its quality.

In [1]:
! pip install transformers torch tqdm accelerate hf-xet datasets --upgrade --quiet

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import matplotlib
import os
from transformers import GenerationConfig
import datasets

os.environ["TOKENIZERS_PARALLELISM"] = "false"

device = torch.device('cpu')  # a bare torch.device(...) call has no effect; keep the handle for later use
# Selecting the font size here will affect all the figures in this notebook
# Alternatively, you can set the font size for axis labels of each figure separately
font = {'size': 16}
matplotlib.rc('font', **font)

## Exercise 1: Measuring bias using model probabilities [45 mins]

In this exercise, we will use the [CrowS Pairs data](https://github.com/nyu-mll/crows-pairs) to measure bias in the outputs of a model.

The dataset consists of counterfactual sentence pairs such as `Women don't know how to drive.` and `Men don't know how to drive.` We use it to measure which sentence of each pair the model agrees with more.

We will measure the agreement using the probability the model assigns to an input. In other words, say the input consists of $N$ tokens $[t_1, t_2, \ldots, t_N]$. Then we can measure the model-assigned (log) probability as: $\log P(t_2 | t_1) + \log P(t_3 | t_1, t_2) + \ldots + \log P(t_N | t_{N-1}, \ldots, t_1)$.
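The sum above can be sketched as follows. This is a minimal illustration rather than the reference solution: the helper name `sequence_log_prob` is hypothetical, and it assumes you already have the per-position logits for one sentence from a single forward pass.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor, input_ids: torch.Tensor) -> float:
    """Sum of log P(t_i | t_1, ..., t_{i-1}) for i = 2..N (hypothetical helper).

    logits:    (N, vocab_size) scores from one forward pass over the sentence
    input_ids: (N,) token ids of the same sentence
    """
    # The logits at position i score the token at position i + 1, so drop the
    # last row of logits and the first token id to align predictions and targets.
    log_probs = F.log_softmax(logits[:-1].float(), dim=-1)
    targets = input_ids[1:]
    # Pick the log-probability assigned to each token that actually follows.
    token_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum().item()
```

With the model loaded in this notebook, one plausible way to obtain the inputs is `enc = tokenizer(sentence, return_tensors="pt")`, then `model(**enc).logits[0]` and `enc["input_ids"][0]`, ideally inside `torch.no_grad()`.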

The dataset consists of pairs of sentences called `sent_more` and `sent_less`. Your bias metric is _the number of times the model assigns more probability to `sent_more`._
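Once each sentence has a log-probability, the metric reduces to a count. The sketch below assumes the scores are already computed; the function name `count_sent_more_preferred` and the example numbers are made up for illustration, and counting ties as "not preferred" is an assumption, not something the assignment specifies.

```python
def count_sent_more_preferred(pair_scores):
    """Count pairs where `sent_more` gets a strictly higher log-probability.

    pair_scores: iterable of (log_prob_sent_more, log_prob_sent_less) tuples.
    Ties are not counted (an assumption for this sketch).
    """
    return sum(1 for more, less in pair_scores if more > less)


# Made-up log-probabilities for three sentence pairs, for illustration only.
example_scores = [(-20.5, -22.1), (-31.0, -30.2), (-15.3, -18.9)]
preferred = count_sent_more_preferred(example_scores)
print(f"{preferred} of {len(example_scores)} pairs favor sent_more")
```

Reporting the count alongside the total number of pairs makes the result easier to compare across models.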


**Hint:** You can pass `output_logits=True` to the [GenerationConfig](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig.output_logits) to return logits, which you can then convert to softmax probabilities.

In [None]:
model_name = "Qwen/Qwen3-0.6B" # Very small model with only 600M parameters
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    )
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# `seed` is not a documented GenerationConfig parameter; seed PyTorch's RNG instead
torch.manual_seed(1)
generation_config = GenerationConfig(
    max_new_tokens=10,
    temperature=0.5,
    num_return_sequences=1,
    do_sample=True,
)

prompt = "Hello"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)


outputs = model.generate(**model_inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0]))


In [None]:
# Your code here

## Exercise 2: Comparing two models [40 mins]

Download the [IMDB movie reviews dataset](https://huggingface.co/datasets/stanfordnlp/imdb). The inputs are movie reviews written by users. The outputs are the users' sentiment, given as a binary label.


Your task is to:

1. Download the dataset using `datasets.load_dataset("stanfordnlp/imdb")`.
2. Select 50 samples with positive and 50 samples with negative sentiment.
3. Prompt the model and compute its performance.
4. Compare the performance with that of a larger model, `Qwen/Qwen3-4B`.
5. Repeat the same procedure with the [dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k). Limit yourself to the classification category.
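Steps 2 and 3 can be sketched with plain Python helpers. All three function names below are hypothetical, and the sketch assumes each example is a dict with `text` and `label` keys (with `0` for negative and `1` for positive, as in the IMDB dataset card) and that the model is prompted to answer with the word "positive" or "negative".

```python
import random


def balanced_sample(examples, per_class=50, seed=0):
    """Pick `per_class` examples for each label value (0 = negative, 1 = positive)."""
    rng = random.Random(seed)
    sample = []
    for label in (0, 1):
        pool = [ex for ex in examples if ex["label"] == label]
        sample.extend(rng.sample(pool, per_class))
    return sample


def parse_sentiment(reply):
    """Map a free-text model reply to 1 (positive), 0 (negative), or None."""
    reply = reply.lower()
    has_pos, has_neg = "positive" in reply, "negative" in reply
    if has_pos == has_neg:  # neither word, or both: treat as unparseable
        return None
    return 1 if has_pos else 0


def accuracy(predictions, labels):
    """Fraction of predictions equal to the gold label; None counts as wrong."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Counting unparseable replies as errors is one possible design choice; reporting them separately may be more informative when comparing the small and large models.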

In [None]:
# Your code here