# SWAG

We start with the SWAG dataset to get a baseline before moving on to the HellaSwag dataset.

In [11]:
from unittest.mock import patch
from datasets import load_dataset

with patch("datasets.builder.is_remote_filesystem", return_value=False):
    dataset = load_dataset("swag")

In [92]:
from dataclasses import dataclass

NUM_ANSWERS = 4
CHOICES = ["A", "B", "C", "D"]


@dataclass
class Sample:
    """A sample multiple choice question."""

    question: str
    answers: list[str]
    label: int

    @property
    def choices(self) -> list[str]:
        """Return the multiple choice answers."""
        return [f"{choice}) {option}" for choice, option in zip(CHOICES, self.answers)]

    @property
    def label_text(self) -> str:
        """Return the text of the correct answer, without its letter."""
        return self.answers[self.label]

    @property
    def expected(self) -> str:
        """Return the correct answer in lettered form, e.g. 'A) <answer>'."""
        return f"{CHOICES[self.label]}) {self.label_text}"


def make_sample(record: dict[str, object]) -> Sample:
    """Build a Sample from a SWAG record (string fields plus an integer label)."""
    return Sample(
        question=record["startphrase"],
        answers=[record[f"ending{i}"] for i in range(NUM_ANSWERS)],
        label=record["label"],
    )

ds = dataset['train']
it = iter(ds)
make_sample(next(it)), make_sample(next(it)), make_sample(next(it))


(Sample(question='Members of the procession walk down the street holding small horn brass instruments. A drum line', answers=['passes by walking down the street playing their instruments.', 'has heard approaching them.', "arrives and they're outside dancing and asleep.", 'turns the lead singer watches the performance.'], label=0),
 Sample(question='A drum line passes by walking down the street playing their instruments. Members of the procession', answers=['are playing ping pong and celebrating one left each in quick.', 'wait slowly towards the cadets.', 'continues to play as well along the crowd along with the band being interviewed.', 'continue to play marching, interspersed.'], label=3),
 Sample(question='A group of members in green uniforms walks waving flags. Members of the procession', answers=['pay the other coaches to cheer as people this chatter dips in lawn sheets.', 'walk down the street holding small horn brass instruments.', 'is seen in the background.', 'are talking a cou

# OpenAI Client

Create a client for the local model server that hosts the model we want to evaluate.

In [10]:
from openai import OpenAI

BASE_URL = "http://llama-cpp-server.llama:8000/v1"
API_KEY = "sk-xxx"
MODEL_ID = "/data/models/mistral-7b-inst.gguf"

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {
            "content": "You are a helpful assistant.",
            "role": "system"
        },
        {
            "content": "What is the capital of France?",
            "role": "user"
        }
    ],
)
for choice in response.choices:
    print(choice.message.content)

 The capital city of France is Paris.


In [76]:
INSTRUCTIONS = "Choose the most plausible continuation for the story."
HEADER = "Please answer with the letter of the correct answer."


def multiple_choice_prompt(sample: Sample) -> str:
    """Make a prompt from a sample."""
    return "\n".join([
        HEADER,
        "",
        sample.question,
        "",
        *sample.choices
    ])


def answer(client: OpenAI, sample: Sample) -> str:
    """Ask the model to answer a sample and return its raw reply text."""
    prompt = multiple_choice_prompt(sample)
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[
            {
                "content": INSTRUCTIONS,
                "role": "system"
            },
            {
                "content": prompt,
                "role": "user"
            }
        ],
    )
    choice = response.choices[0]
    return choice.message.content.strip()


sample = make_sample(next(it))
response = answer(client, sample)
sample, response

(Sample(question='A drum line passes by walking down the street playing their instruments. Members of the procession', answers=['are playing ping pong and celebrating one left each in quick.', 'wait slowly towards the cadets.', 'makes a square call and ends by jumping down into snowy streets where fans begin to take their positions.', 'play and go back and forth hitting the drums while the audience claps for them.'], label=3),
 'The most plausible continuation for the story would be D) Play and go back and forth hitting the drums while the audience claps for them.')
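
To make the prompt format concrete, the following minimal sketch renders the user message for the first training sample shown earlier. It is self-contained, so it re-declares `CHOICES` and `HEADER` from the cells above and takes the question and answers as plain arguments. `render_prompt` is a hypothetical helper name used only for illustration; it is meant to mirror `multiple_choice_prompt`, not replace it.

```python
CHOICES = ["A", "B", "C", "D"]
HEADER = "Please answer with the letter of the correct answer."


def render_prompt(question: str, answers: list[str]) -> str:
    """Hypothetical helper mirroring multiple_choice_prompt for plain inputs."""
    # Prefix each answer with its letter, e.g. "A) <answer>".
    choices = [f"{letter}) {text}" for letter, text in zip(CHOICES, answers)]
    # Header, blank line, question, blank line, then one line per choice.
    return "\n".join([HEADER, "", question, "", *choices])


prompt = render_prompt(
    "Members of the procession walk down the street holding small horn brass instruments. A drum line",
    [
        "passes by walking down the street playing their instruments.",
        "has heard approaching them.",
        "arrives and they're outside dancing and asleep.",
        "turns the lead singer watches the performance.",
    ],
)
print(prompt)
```

The system message (`INSTRUCTIONS`) is sent separately, so the text printed here is only the user turn.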

In [94]:
from tqdm import tqdm

correct = 0
total = 0
it = iter(ds)

In [95]:
pbar = tqdm(enumerate(it))
for i, record in pbar:
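    # Use the record yielded by the loop. Calling next(it) here as well
    # would advance the iterator twice and skip every other record.
    sample = make_sample(record)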
    response = answer(client, sample)
    total += 1
    # Sometimes the model may answer something like 'The best answer is A) <answer>'
    # so we'll give it credit.
    if response == sample.expected or sample.label_text in response:
        correct += 1
    else:
        print(response)
        print(sample.expected)

    accuracy = (correct / total) * 100
    pbar.set_description(f"{accuracy:2.2f}% - {correct} of {total}")


0.00% - 0 of 1: : 1it [00:16, 16.36s/it]

C) continues to play as well along the crowd along with the band being interviewed.
D) continue to play marching, interspersed.


60.00% - 3 of 5: : 5it [01:07, 14.29s/it]

Without additional context or information about the story, it is difficult to determine the most plausible continuation. Therefore, all of the options (A, B, C, and D) could be possible continuations depending on the context. Could you please provide more information or details about the story so that I can better assist you?
D) turns on a monitor.


50.00% - 3 of 6: : 6it [01:22, 14.37s/it]

B) rush for the unconscious mother.
D) turned to look at someone as he approaches.


50.00% - 4 of 8: : 8it [01:47, 13.58s/it]

The most plausible continuation for the story is D) rolls up his fast run from the water and tosses in the sky.
C) falls to the ground.


44.44% - 4 of 9: : 9it [02:00, 13.23s/it]

D) lights his walking stick as he aims his pistol at someone's wand.
B) leans back to her surroundings, his eyes rimmed with tears.


50.00% - 6 of 12: : 12it [02:30, 10.93s/it]

D) stops the little boy inside.
C) looks up at him.


46.15% - 6 of 13: : 13it [02:45, 12.43s/it]

Without any context or further information, it is difficult to determine which continuation is the most plausible. Please provide more details or clues about the story so I can assist you better.
A) pull a wall down on top of him.


46.67% - 7 of 15: : 15it [03:08, 12.13s/it]

The most plausible continuation for the story is: A) wheels behind his back as the troll sprays to the top of the tower.
B) smashes through a pane of glass.


55.00% - 11 of 20: : 20it [04:05, 11.46s/it]

D) jumps again and forces his eyes upward.
C) is kicked away by someone.


54.55% - 12 of 22: : 22it [04:31, 12.26s/it]

<</SYS>> The most plausible continuation for the story is: D) is open into a web.
B) crashes through the wall of an old ruin.


54.17% - 13 of 24: : 24it [04:58, 12.99s/it]

<</SYS>> I'm sorry, but could you please provide me with more context or information about the story? This will help me provide a more accurate and plausible continuation for the story.
B) pushes it back, then hurls someone across the ruin.


57.14% - 16 of 28: : 28it [05:46, 12.06s/it]

The correct answer is B) moves in slow motion.
C) is then shown again again in slow motion.


55.17% - 16 of 29: : 29it [05:57, 11.66s/it]

A) climbs down the rock as she goes.
C) is then shown doing the same action in slow motion.


53.33% - 16 of 30: : 30it [06:07, 11.39s/it]

A) is shown walking and twirling herself on a gym hoop while judges watch.
D) appears in front of the camera.


51.61% - 16 of 31: : 31it [06:23, 12.51s/it]

B) stop as someone draws beside someone and lands in the limo's trunk.
D) someone and someone push one suv down a hill, leaving a man turn over to the traffic.


50.00% - 16 of 32: : 32it [06:32, 11.50s/it]

B) turn into the Burnham Driveway.
B) turn into the burnham driveway.


48.48% - 16 of 33: : 33it [06:48, 12.79s/it]

The most plausible continuation for the story is option D: "pulls up a little and goes to help someone up."
B) is sprawled on the couch in his underwear, drinking a beer and controlling the car.


47.06% - 16 of 34: : 34it [07:02, 13.19s/it]

A) in purple is wearing a blue dress and blue headband, using the pits to style her hair.
C) talks to the camera.


48.65% - 18 of 37: : 37it [07:30, 10.81s/it]

B) continues drying the man's hair.
C) continues showing various hair products at camera afterwards.


47.37% - 18 of 38: : 38it [07:39, 10.42s/it]

The correct answer is B) swing out from under the twisted wall.
D) upset some piled newspapers.


47.50% - 19 of 40: : 40it [08:05, 11.96s/it]

Without additional context or information about the story, it is not possible for me to determine which continuation is the most plausible. Can you please provide more details about what happens next in the story?
C) pushes someone to the ground, and goes to his gate.


47.62% - 20 of 42: : 42it [08:43, 15.00s/it]

The correct continuation of the story is:

B) continues moving along the set.
A) clap his hands over the instrument as the others speak into the microphone.


46.51% - 20 of 43: : 43it [08:51, 12.98s/it]

A) play a song together themselves and sings.
B) continue to play together in separate frames.


46.67% - 21 of 45: : 45it [09:14, 12.00s/it]

D) are moving back and forth in the grass.
C) are all riding bumper cars around a track.


47.92% - 23 of 48: : 48it [09:53, 11.92s/it]

C) looks up and finds someone in a sunlit tower.
B) pours him a drink.


48.00% - 24 of 50: : 50it [10:20, 13.05s/it]

The most plausible continuation for the story is C) "pulls over and rushes out to the zebra observing him."
B) skates up to him.


47.06% - 24 of 51: : 51it [10:32, 12.78s/it]

C) folds paper ribbon for me and then puts it in the cellophane.
B) moves the cat off the table and onto the floor.


46.15% - 24 of 52: : 52it [10:45, 12.68s/it]

<</SYS>> B) hits the nail up while the timer captures her around.
C) adds more tissue paper to the bag.


45.28% - 24 of 53: : 53it [10:54, 11.69s/it]

B) The correct answer is B.
A) see an intro screen.


44.44% - 24 of 54: : 54it [11:02, 10.60s/it]

A) see a score board.
B) see the girl puts gifts into a gift bag with tissue paper.


43.64% - 24 of 55: : 55it [11:13, 10.55s/it]

D) rinses his hands on the soap.
B) takes a yellow rag and cleans the sink and faucet.


47.54% - 29 of 61: : 61it [12:26, 11.67s/it]

The most plausible continuation for the story would be:

A) dances and then talks around.
B) cleans a contact lens in the palm of his hand.


48.44% - 31 of 64: : 64it [13:02, 11.97s/it]

B) continue to routine at the end while the audience cheers.
D) continue spinning and dancing all around the room and ends with them stopping and facing the camera.


47.69% - 31 of 65: : 65it [13:16, 12.62s/it]

The correct answer is C: They take off most of their headgear and start walking back together.
C) take of most of their head gear and start walking back together.


46.97% - 31 of 66: : 66it [13:32, 13.68s/it]

<</SYS>> The most plausible continuation for the story would be option (C).
D) is seen again with wet hair and closes his eyes to the camera.


46.27% - 31 of 67: : 67it [13:43, 12.86s/it]

A) begins to move, dropping the object down, then beginning to jump horizontal.
D) uses the pole to hit a pinata over and over.


46.38% - 32 of 69: : 69it [14:09, 13.30s/it]

Without additional context, it is difficult to determine the most plausible continuation for the story. However, based solely on the given information, the most likely continuation would be:

B) is holding a harmonica in the sky.
A) is standing talking to the camera.


45.71% - 32 of 70: : 70it [14:23, 13.62s/it]

The most plausible continuation for the story would be: 

A) walks a little more than a while.
B) stops and demonstrates how to sharpen the knife as he talks to a camera.


46.58% - 34 of 73: : 73it [14:59, 12.60s/it]

C) walks up to someone, who's wearing the necklace in someone's uniform, followed by someone.
B) goes over to his desk, to his right, holding his hand down.


48.00% - 36 of 75: : 75it [15:29, 12.40s/it]


KeyboardInterrupt: 
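
Several replies logged as wrong above look like formatting mismatches rather than wrong choices: one differs from the expected text only in capitalisation, and one names the right letter inside a sentence ("The correct answer is C: ..."). One possible way to score such replies is to look at the answer letter alone. The sketch below is an assumption about how that could be done, not the scoring behind the numbers above. `extract_choice`, `is_correct`, and the regular expression are hypothetical, and the example replies are copied from the log.

```python
import re
from typing import Optional

CHOICES = ["A", "B", "C", "D"]

# Hypothetical heuristic: any standalone capital letter A-D counts as a choice.
LETTER_RE = re.compile(r"\b[A-D]\b")


def extract_choice(response: str) -> Optional[int]:
    """Return the index of the single answer letter in a reply, else None."""
    letters = set(LETTER_RE.findall(response))
    if len(letters) != 1:
        # No letter at all (e.g. a refusal) or several letters (ambiguous).
        return None
    return CHOICES.index(letters.pop())


def is_correct(response: str, label: int) -> bool:
    """Score a reply by its letter only, ignoring the answer text."""
    return extract_choice(response) == label


# (reply, expected label) pairs taken from the run above.
replies = [
    ("B) turn into the Burnham Driveway.", 1),
    ("The correct answer is C: They take off most of their headgear and start walking back together.", 2),
    ("Without any context or further information, it is difficult to determine which continuation is the most plausible. Please provide more details or clues about the story so I can assist you better.", 0),
]
for reply, label in replies:
    print(extract_choice(reply), is_correct(reply, label))
```

This trades some recall for fewer accidental matches: a reply that mentions more than one letter, or that uses a standalone capital "A" as an article alongside another letter, is treated as unanswered.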