Assignment 5: Behavioral evaluation of LLMs
=====
As discussed in the lecture slides for Class 13, LLM linguistic competence is often assessed using *minimal pairs* -- minimally different sentences that contrast in grammatical acceptability relative to specific linguistic properties.

For example, the following is a minimal pair for the morphological property of number agreement:

- John and Mary go to the store.
- John and Mary goes to the store.

A good source of evaluation data for linguistic tasks is [BLiMP](https://aclanthology.org/2020.tacl-1.25/). Overall, BLiMP has 67 individual datasets, each containing 1,000 minimal pairs. BLiMP-based evaluation involves assessing whether a given LLM assigns a higher probability to the acceptable sentence in each minimal pair. The entire benchmark can be found [here](https://huggingface.co/datasets/nyu-mll/blimp).
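
The comparison described above can be sketched as follows. This is a minimal illustration rather than the assignment's solution code: `minimal_pair_accuracy` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a real model's sentence log-probability. The keys `sentence_good` and `sentence_bad` follow the field names used in the BLiMP dataset.

```python
from typing import Callable, Dict, List


def minimal_pair_accuracy(
    pairs: List[Dict[str, str]], score: Callable[[str], float]
) -> float:
    """Fraction of pairs whose acceptable sentence scores strictly higher."""
    correct = sum(
        score(pair["sentence_good"]) > score(pair["sentence_bad"])
        for pair in pairs
    )
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical scorer: penalizes a few hand-picked tokens.

    A real evaluation would return the model's log-probability instead.
    """
    penalties = {"goes": -2.0, "himself.": -1.5}  # made-up values
    return sum(penalties.get(token, 0.0) for token in sentence.split())


pairs = [
    {
        "sentence_good": "John and Mary go to the store.",
        "sentence_bad": "John and Mary goes to the store.",
    },
    {
        "sentence_good": "Katherine can't help herself.",
        "sentence_bad": "Katherine can't help himself.",
    },
    {
        "sentence_good": "Who should Derek hug after shocking Richard?",
        "sentence_bad": "Who should Derek hug Richard after shocking?",
    },
]

# The toy scorer cannot tell the third pair apart, so a tie counts as incorrect.
accuracy = minimal_pair_accuracy(pairs, toy_score)
print(f"accuracy = {accuracy:.3f}")
```

Counting ties as incorrect is a design choice in this sketch: the model only gets credit when it strictly prefers the acceptable sentence.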





Your task is to evaluate the small [Qwen LLM](https://huggingface.co/KingNish/Qwen2.5-0.5b-Test-ft) that we've been using for our assignments on two BLiMP tasks, one assessing morphology and the other assessing syntax.

> 1. Plot the model's accuracy on each grammatical phenomenon, where each phenomenon is represented by its own test suite.
> 2. Calculate the average accuracy and its confidence interval for each linguistic property, and report your results.
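
For the confidence intervals in step 2, one common choice for a proportion such as accuracy is the Wilson score interval. The sketch below computes it by hand so the formula is visible. `wilson_interval` is a hypothetical helper name, the counts are placeholders rather than real results, and `z = 1.96` assumes a 95% interval. The `proportion_confint` function from statsmodels, which the notebook imports later, offers the same interval via `method='wilson'`.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Placeholder counts for illustration only, not measured results.
low, high = wilson_interval(correct=900, total=1000)
print(f"accuracy = {900 / 1000:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] even when accuracy is close to 0 or 1, which can happen on easy test suites.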

To complete the assignment, you'll need to fill in the missing code, which is marked with `### YOUR CODE HERE ###`.
Upload your completed notebook by **December 23rd**.

(Note: on a CPU, the evaluation tasks take approximately 25 minutes to complete.)

In [1]:
#!pip install minicons
#!pip install -U datasets huggingface_hub fsspec
#!pip install transformers -U

In [None]:
#!pip install "huggingface-hub>=0.34.0"

In [None]:
from datasets import load_dataset
import torch
from minicons import scorer
import numpy as np
import pandas as pd

In [None]:
# pick the fastest available backend: CUDA GPU, then Apple Silicon (MPS), then CPU
import torch
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

In [None]:
# get the test suites
dataset = load_dataset('nyu-mll/blimp', 'adjunct_island')
# inspect the dataset
dataset["train"][0]

{'sentence_good': 'Who should Derek hug after shocking Richard?',
 'sentence_bad': 'Who should Derek hug Richard after shocking?',
 'field': 'syntax',
 'linguistics_term': 'island_effects',
 'UID': 'adjunct_island',
 'simple_LM_method': True,
 'one_prefix_method': False,
 'two_prefix_method': False,
 'lexically_identical': True,
 'pair_id': 0}

In [None]:
# choose one morphology test suite and one syntax test suite
test_suites = {
    'anaphor_gender_agreement': 'morphology',
    'animate_subject_passive': 'syntax'
}
# Load datasets
datasets = {name: load_dataset("nyu-mll/blimp", name) for name in test_suites.keys()}

# inspect the dataset
datasets['anaphor_gender_agreement']["train"][0]

{'sentence_good': "Katherine can't help herself.",
 'sentence_bad': "Katherine can't help himself.",
 'field': 'morphology',
 'linguistics_term': 'anaphor_agreement',
 'UID': 'anaphor_gender_agreement',
 'simple_LM_method': True,
 'one_prefix_method': True,
 'two_prefix_method': False,
 'lexically_identical': False,
 'pair_id': 0}

In [None]:
# set up the model as a minicons scorer
lm_scorer = scorer.IncrementalLMScorer(
    "KingNish/Qwen2.5-0.5b-Test-ft", device=device
)

def cal_accuracy(dataset, lm_scorer):
    # dataset: a BLiMP test suite as returned by load_dataset (pairs are in dataset["train"])
    # lm_scorer: the minicons scorer used to score each sentence
    correct_predictions = 0
    total_predictions = 0

        # compare the good and bad sentences
        ### YOUR CODE HERE ###
        answer_scores = lm_scorer.conditional_score(
        # score both sentences of the minimal pair in a single call
            [good_sentence, bad_sentence],
            ["",""]
        )
        ### YOUR CODE HERE ###
    return accuracy, correct_predictions, total_predictions

In [None]:
# calculate the performance by test suite
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportion_confint
### YOUR CODE HERE ###
# Calculate performance by test suite
results = {name: cal_accuracy(dataset, lm_scorer) for name, dataset in datasets.items()}
### YOUR CODE HERE ###
# Calculate performance by category
categories = ['morphology', 'syntax']
category_results = {category: {'correct': 0, 'total': 0} for category in categories}

# Calculate accuracy and confidence intervals for each category
### YOUR CODE HERE ###

Did 0 minimal pair(s) of 1000 minimal pairs.
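
The per-category bookkeeping in the last cell can be sketched as below. This is one possible approach under stated assumptions, not the reference solution: `aggregate_by_category` is a hypothetical helper, it assumes each entry of `results` is the `(accuracy, correct_predictions, total_predictions)` tuple returned by `cal_accuracy`, and the counts in the usage lines are placeholders, not measured results.

```python
from typing import Dict, Tuple


def aggregate_by_category(
    results: Dict[str, Tuple[float, int, int]],
    suite_to_category: Dict[str, str],
) -> Dict[str, Dict[str, float]]:
    """Pool correct/total counts of all test suites that share a category."""
    totals: Dict[str, Dict[str, float]] = {}
    for suite, (_, correct, total) in results.items():
        category = suite_to_category[suite]
        bucket = totals.setdefault(category, {"correct": 0, "total": 0})
        bucket["correct"] += correct
        bucket["total"] += total
    for bucket in totals.values():
        # Pooled accuracy weights every minimal pair equally.
        bucket["accuracy"] = (
            bucket["correct"] / bucket["total"] if bucket["total"] else float("nan")
        )
    return totals


# Same mapping as in the notebook; the counts below are placeholders.
test_suites = {
    "anaphor_gender_agreement": "morphology",
    "animate_subject_passive": "syntax",
}
placeholder_results = {
    "anaphor_gender_agreement": (0.95, 950, 1000),
    "animate_subject_passive": (0.80, 800, 1000),
}
category_results = aggregate_by_category(placeholder_results, test_suites)
print(category_results)
```

Pooling counts instead of averaging per-suite accuracies keeps the `correct` and `total` values available, which is what a binomial confidence interval needs as input.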
