In [1]:
%load_ext autoreload
%autoreload 2

# HellaSwag Example

This example compares two models, `apple/OpenELM-270M` and `apple/OpenELM-270M-Instruct`, on the HellaSwag dataset.

## Loading Models

In [2]:
from perspectival.model import Transformer

model = Transformer('apple/OpenELM-270M', trust_remote_code=True)
model2 = Transformer('apple/OpenELM-270M-Instruct', trust_remote_code=True)

# Note: You can use LazyTransformer instead if you prefer to load each model
# only during the steps that use it for computation.

## Set up an Experiment

In [3]:
from perspectival.loader import load_hellaswag
from perspectival.experiment import Experiment

dataset, features = load_hellaswag()
experiment = Experiment(dataset=dataset, name='HellaSwag Example', features=features)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [4]:
# Optional: Select a random subset of the dataset for quicker processing
experiment = experiment.sample(num=100)

## Computing Features

In [5]:
experiment.compute_correctness(models=[model, model2])
experiment.compute_disagreement(models=[model, model2])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
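As a rough illustration of what features like these capture, the sketch below derives model choices, correctness, and a simple disagreement rate from per-option log-likelihoods. It uses plain NumPy with made-up numbers and does not call the library; the arrays and the choice-mismatch notion of disagreement are assumptions for illustration, and the library's `LogDisagreement` score may be defined differently.

```python
import numpy as np

# Hypothetical log-likelihoods: one row per item, one column per answer option.
loglik_a = np.array([[-12.1, -10.4, -11.0, -13.2],
                     [-9.8, -9.9, -8.7, -10.5],
                     [-14.0, -13.1, -13.5, -12.9]])
loglik_b = np.array([[-11.9, -10.8, -10.6, -13.0],
                     [-9.5, -9.7, -8.9, -10.1],
                     [-13.8, -13.3, -13.6, -12.7]])
ground_truth = np.array([1, 2, 0])  # made-up correct option per item

# Each model "chooses" the option it assigns the highest log-likelihood.
choices_a = loglik_a.argmax(axis=1)
choices_b = loglik_b.argmax(axis=1)

# Correctness: does the chosen option match the ground truth?
accuracy_a = np.mean(choices_a == ground_truth)
accuracy_b = np.mean(choices_b == ground_truth)

# Simple disagreement (an assumption, not the library's definition):
# the fraction of items where the two models pick different options.
disagreement_rate = np.mean(choices_a != choices_b)

print(f"Accuracy A: {accuracy_a:.2f}, accuracy B: {accuracy_b:.2f}")
print(f"Disagreement rate: {disagreement_rate:.2f}")
```

Two models can have the same choice but different confidence, so a score based on the full log-likelihood vectors carries more information than this mismatch rate.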


## Exploring Results

In [6]:
# Some general output statistics
import numpy as np
from collections import Counter

scores = experiment.get_feature('LogDisagreement', models=(model.name, model2.name)).values
# Fraction of items with a positive LogDisagreement score
disagreement_rate = np.mean(scores > 0)
print(f"Overall disagreement rate: {disagreement_rate:.2f}\n")

for m in [model, model2]:
    print(m.name)
    accuracy = np.mean(experiment.get_feature('PredictionCorrectness', model=m.name).values)
    print(f"- Accuracy: {accuracy:.2f}")
    print("- Output:", Counter(experiment.get_feature('ModelChoices', model=m.name).values).most_common())

Overall disagreement rate: 0.15

apple/OpenELM-270M
- Accuracy: 0.45
- Output: [(1, 30), (3, 28), (0, 25), (2, 17)]
apple/OpenELM-270M-Instruct
- Accuracy: 0.54
- Output: [(1, 29), (3, 28), (0, 25), (2, 18)]
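
With only 100 sampled items, these accuracies carry noticeable sampling noise. The sketch below estimates it with the normal approximation to a binomial proportion, using the accuracies printed above. The helper name `accuracy_standard_error` is hypothetical and not part of the library.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Standard error of a proportion under the normal approximation."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


n_items = 100  # size of the random subset sampled earlier
reported = {
    "apple/OpenELM-270M": 0.45,
    "apple/OpenELM-270M-Instruct": 0.54,
}

for name, acc in reported.items():
    se = accuracy_standard_error(acc, n_items)
    # 1.96 standard errors gives an approximate 95% interval.
    print(f"{name}: {acc:.2f} +/- {1.96 * se:.2f}")
```

Because both models are scored on the same items, a paired comparison would be more sensitive than comparing these two intervals. Running on the full dataset rather than a 100-item sample also narrows them.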


In [7]:
# Show the two items with the highest disagreement scores
scores = experiment.get_feature('LogDisagreement', models=(model.name, model2.name)).values
samples = experiment.sample(num=2, sampling_method='last', ordering_scores=scores)
samples.display_items()

ITEM (train_30757)
"""[header] How to make honey orange glazed chicken [title] Preheat the oven to 375f (190c, gas mark 5) before you begin the rest of the preparations. [title] Coat a baking dish with a very thin layer of extra virgin olive oil. [title] Place breasts or chicken in the baking dish."""
Options: ['[step] Cover with pan and cook uncovered until the chicken turns a light golden colour. [title] Remove and reserve the cooked chicken for garnish.', '[step] Arrange chicken breasts or chicken with skin between them on a baking dish. [title] Place carrots or squash in the bottom of each pan.', '[step] Turn both pieces of chicken inside out. [title] Score each breast in an upward motion into 5 small to 1 inch (2.5 to 3.8 cm) thick slices of chicken.', '[title] Splash with liquid smoke to cover chicken lightly, then rub on some more olive oil. [title] Grind peppercorn medley and sea salt lightly over the chicken.']

FEATURES
GroundTruth 3
OptionLogLikelihood apple/OpenELM-270M [-1