# A Guide to LLM Benchmarks

Every time a new breakthrough LLM is released to the public, the announcement touts several impressive-sounding benchmark scores. If we're honest with ourselves, though, most of us don't know what those numbers actually mean. The goal of this guide is to give you a high-level overview of what each LLM benchmark measures. (Stretch goal: get code working for all of them.)

(COME BACK FOR MORE OF AN INTRODUCTION.)

(LIST OF BENCHMARKS? CATEGORIES OF BENCHMARKS?)

## Notebook Setup

In [1]:
# Importing the necessary Python libraries
import os
import yaml
from datasets import load_dataset

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI


## (Category 1) Benchmarks

### MMLU

In [2]:
# (Down)loading the MMLU dataset from HuggingFace
mmlu_dataset = load_dataset(path = 'cais/mmlu',
                            name = 'all',
                            cache_dir = '../data/')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [3]:
df_mmlu = mmlu_dataset['auxiliary_train'].to_pandas()

In [4]:
mmlu_dataset

DatasetDict({
    auxiliary_train: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 99842
    })
    test: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 1531
    })
    dev: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 285
    })
})

In [5]:
df_mmlu_dev = mmlu_dataset['dev'].to_pandas()

In [6]:
df_mmlu_dev

| | question | subject | choices | answer |
|---|---|---|---|---|
| 0 | Box a nongovernmental not-for-profit organizat... | professional_accounting | [$70,000, $75,000, $80,000, 100000] | 3 |
| 1 | One hundred years ago, your great-great-grandm... | professional_accounting | [$13,000, $600, $15,000, $28,000] | 0 |
| 2 | Krete is an unmarried taxpayer with income exc... | professional_accounting | [$0, $500, $1,650, $16,500] | 0 |
| 3 | On January 1, year 1, Alpha Co. signed an annu... | professional_accounting | [$5,000, $13,500, $16,000, $20,000] | 1 |
| 4 | An auditor traces the serial numbers on equipm... | professional_accounting | [Valuation and allocation, Completeness, Right... | 1 |
| ... | ... | ... | ... | ... |
| 280 | Which of the following conditions will ensure ... | high_school_physics | [I and II only, I and III only, II and III onl... | 3 |
| 281 | A pipe full of air is closed at one end. A sta... | high_school_physics | [The pressure is at a node, but the particle d... | 1 |
| 282 | A photocell of work function ϕ = 2eV is connec... | high_school_physics | [2:00 AM, 6:00 AM, 12:00 AM, 24 A] | 3 |
| 283 | A microwave oven is connected to an outlet, 12... | high_school_physics | [10 W, 30 W, 60 W, 240 W] | 3 |
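Each row pairs a `question` with a list of four `choices`, and the integer in `answer` is the position of the correct choice (0 through 3, usually shown as A through D). As a rough sketch of how one row might be turned into a multiple-choice prompt, the snippet below uses two hypothetical helpers, `format_mmlu_prompt` and `answer_letter`, and a made-up toy row rather than a real MMLU question. The exact prompt wording is an assumption, not a standard.

```python
# Hypothetical helper names; the toy row below is invented for illustration
# and is NOT taken from the MMLU dataset.
LETTERS = "ABCD"


def format_mmlu_prompt(question, choices):
    """Render one question and its four choices as a lettered prompt."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer_letter(answer_index):
    """Map the integer in the `answer` column (0-3) to its letter (A-D)."""
    return LETTERS[int(answer_index)]


toy_row = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}

prompt = format_mmlu_prompt(toy_row["question"], toy_row["choices"])
gold = answer_letter(toy_row["answer"])

print(prompt)
print("Gold answer:", gold)
```

The same two helpers should work on a row of `df_mmlu_dev`, since `zip` accepts the array that `to_pandas()` produces for `choices`.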


In [7]:
# Loading the Perplexity AI API key from a sensitive file (NOT pushed to GitHub)
# and storing it as OPENAI_API_KEY, the environment variable that ChatOpenAI reads
with open('../keys/api_keys.yaml') as f:
    os.environ['OPENAI_API_KEY'] = yaml.safe_load(f)['PERPLEXITY_API_KEY']

In [9]:
# Pointing the OpenAI-style chat client at Perplexity's endpoint via base_url
chat = ChatOpenAI(base_url = 'https://api.perplexity.ai', model = 'mistral-7b-instruct')

messages = [
    SystemMessage(content = "You are a helpful assistant."),
    HumanMessage(content = "What is the capital of Illinois?")
]

chat.invoke(messages)

AIMessage(content="The capital city of Illinois is Springfield. It's located in the central part of the state and is home to several notable attractions, including the Abraham Lincoln Presidential Library and Museum. Springfield has been the capital of Illinois since the state was admitted to the Union in 1818.")
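A free-form reply like the one above cannot be compared directly with the integer in MMLU's `answer` column, so some parsing step is needed before scoring. The snippet below is a minimal sketch of one way to do that, assuming the model was asked to reply with a single letter. `extract_choice` and `accuracy` are hypothetical helper names, the replies are invented, and the regular expression is a simple heuristic rather than an established MMLU scoring rule.

```python
import re

# Accept a reply only if it starts with A-D (optionally in parentheses) and the
# letter is followed by '.', ':', ')' or the end of the text. This avoids
# reading the article "A" in "A microwave ..." as an answer.
CHOICE_PATTERN = re.compile(r"\s*\(?([A-D])\)?(?:[.:)]|$)")


def extract_choice(reply):
    """Return the leading answer letter in a reply, or None if there is none."""
    match = CHOICE_PATTERN.match(reply.strip())
    return match.group(1) if match else None


def accuracy(replies, gold_letters):
    """Fraction of replies whose extracted letter equals the gold letter."""
    if not gold_letters:
        raise ValueError("gold_letters must not be empty")
    correct = sum(extract_choice(r) == g for r, g in zip(replies, gold_letters))
    return correct / len(gold_letters)


# Invented replies and gold letters, for illustration only
replies = ["B", "(C)", "D) 240 W", "I am not sure."]
gold_letters = ["B", "C", "A", "D"]

score = accuracy(replies, gold_letters)
print(f"Accuracy: {score:.2f}")
```

Replies that do not start with a recognizable letter count as wrong here. A stricter or more lenient parser would change the reported score, which is one reason published benchmark numbers are hard to compare.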