# MMLU - Massive Multitask Language Understanding

MMLU stands for **Massive Multitask Language Understanding**, and it is perhaps the most popular metric used across model cards to demonstrate a model's performance in terms of knowledge breadth. This benchmark contains a series of scenarios and questions for the LLM to answer across 57 different domains. These domains include STEM, humanities, social sciences, and more. Within each of these domains, there include questions that range from more generalized areas, like history of the topic, and then there are questions that are more specialized in nature or ask "harder" questions, like ethical implications.

Originally conceived by a team a UC Berkeley in ?, MMLU has evolved into many different flavors, each taking variance on things like prompting style, evaluation codes, or even using a subset of all the questions asked. HuggingFace has [a really great write up](https://github.com/huggingface/blog/blob/main/evaluating-mmlu-leaderboard.md) on all these different variations, and while they all can produce a wide range of differences, the same goal remains: assessing the LLM's breadth of knowledge.

## Notebook Setup

In [1]:
# Importing the necessary Python libraries
import os
import json
import yaml
import numpy as np
import pandas as pd
from datasets import load_dataset
from langchain_openai import ChatOpenAI
from langchain.prompts import HumanMessagePromptTemplate

In [2]:
# Loading API keys from file (NOT pushed up to GitHub)
with open('../keys/api_keys.yaml') as f:
    API_KEYS = yaml.safe_load(f)

# Statically setting the choices available
MMLU_CHOICES = ['A', 'B', 'C', 'D']

In [3]:
def softmax(x):
    z = x - max(x)
    numerator = np.exp(z)
    denominator = np.sum(numerator)
    softmax = numerator/denominator
    return softmax

## LLM & LangChain Setup

In [4]:
# Instantiating the OpenAI client with LangChain
llm = ChatOpenAI(
    api_key = API_KEYS['OPENAI_API_KEY'],
    model_name = 'gpt-4',
    temperature = 0,
    max_tokens = 1,
    model_kwargs = {
        'logprobs': True,
        'top_logprobs': 5
    }
)

In [5]:
# Establishing the original prompt
ORIGINAL_MMLU_PROMPT_TEMPLATE = '''The following are questions (with answers) about {formatted_subject}.

{question}
{choices}
'''

# Setting a template around the prompt
mmlu_prompt_template = HumanMessagePromptTemplate.from_template(ORIGINAL_MMLU_PROMPT_TEMPLATE)

## Loading the MMLU Dataset

In [6]:
# (Down)loading the MMLU dataset from HuggingFace
mmlu_dataset = load_dataset(path = 'cais/mmlu',
                            name = 'all',
                            cache_dir = '../data/',
                            trust_remote_code = True,
                            split = 'dev')

# Loading the dataset as a Pandas dataframe
df_mmlu = mmlu_dataset.to_pandas()

## Starting With One Topic

In [7]:
# Getting a list of all the subjects
subjects = sorted(df_mmlu['subject'].value_counts().keys())

# Extracting a single sample subject
subject = subjects[42]

# Formatting the subject by removing the underscores
formatted_subject = subject.replace('_', ' ')

print(f'Subject: {subject}')
print(f'Formatted Subject: {formatted_subject}')

Subject: moral_disputes
Formatted Subject: moral disputes


In [8]:
# Filtering down the full dataset to only include the currently analyzed subject
df_subject = df_mmlu[df_mmlu['subject'] == subject]
df_subject

Unnamed: 0,question,subject,choices,answer
155,"According to Metz, what is wrong with conseque...",moral_disputes,[It is unclear as of yet whether or not capita...,0
156,A fertilized ovum is also known as,moral_disputes,"[a zygote., an embryo., a viability., a blasto...",0
157,"Which of the following is an intrinsic good, a...",moral_disputes,"[being in the upper class, the ability to lear...",1
158,Pence compares six different cases of reproduc...,moral_disputes,[SCNT is not a different kind of reproduction ...,0
159,Baron admits that the versions of the ticking ...,moral_disputes,[the stupidity of most traditional philosophic...,3


In [16]:
# Instantiating containers to hold all boolean "is correct" values and probabilities for this subject
subject_is_corrects = []
subject_probs = []

# Iterating over all the rows of the DataFrame
for index, row in df_subject.iterrows():
    
    # Extracting the question from the row
    question = row['question']

    # Extracting the choices from the row
    choices = ''
    all_choices = row['choices']
    for index, choice in enumerate(all_choices):
        choices += f'{MMLU_CHOICES[index]}. {choice}\n'

    # Extracting the answer from the row
    number_answer = row['answer']
    ground_truth = MMLU_CHOICES[number_answer]

    # Setting the variables within the template
    messages = mmlu_prompt_template.format_messages(
        formatted_subject = formatted_subject,
        question = question,
        choices = choices
    )

    # Invoking the LLM to generate a response
    llm_response = json.loads(llm.generate(messages = [messages]).json())

    # Extracting the prediction from the LLM response
    prediction = llm_response['generations'][0][0]['text']

    # Extracting the log probs from the LLM response
    logprobs = llm_response['generations'][0][0]['generation_info']['logprobs']['content'][0]['top_logprobs']

    # Instantiating a dictionary to a direct key-value pair between token name and that token's logprob
    token_logprobs = {}

    # Iterating over the "raw" log probs
    for logprob in logprobs:
        token_name = logprob['token']
        token_logprob = logprob['logprob']
        token_logprobs[token_name] = token_logprob

    # Instantiating a dictionary to hold the log probs for each of the respective ABCD answers
    abcd_logprobs = {}

    # Iterating over each of the MMLU choices
    for choice in MMLU_CHOICES:

        # Checking to see if each ABCD answer is in the current logprobs
        if choice in token_logprobs.keys():
            
            # Adding the appropriate logprob to abcd_logprobs
            abcd_logprobs[choice] = token_logprobs[choice]

        else:

            # Adding a fake penalty logprob since the answer isn't present
            abcd_logprobs[choice] = -100

    # Setting the softmax values for the ABCD logprobs
    abcd_prob_values = softmax(np.array(list(abcd_logprobs.values())))

    # Creating a boolean value to check if the prediction matches the ground truth
    is_true = ground_truth == prediction
    
    # Appending the final values
    subject_is_corrects.append(is_true)
    subject_probs.append(abcd_prob_values)


## Running the Full MMLU Benchmark

In [20]:
# Getting a list of all the subjects
subjects = sorted(df_mmlu['subject'].value_counts().keys())

all_is_corrects = []
all_probs = []

# Iterating over all the subjects
for subject in subjects:

    # Formatting the subject by removing the underscores
    formatted_subject = subject.replace('_', ' ')

    # Instantiating containers to hold all boolean "is correct" values and probabilities for this subject
    subject_is_corrects = []
    subject_probs = []

    # Iterating over all the rows of the DataFrame
    for index, row in df_subject.iterrows():
        
        # Extracting the question from the row
        question = row['question']

        # Extracting the choices from the row
        choices = ''
        all_choices = row['choices']
        for index, choice in enumerate(all_choices):
            choices += f'{MMLU_CHOICES[index]}. {choice}\n'

        # Extracting the answer from the row
        number_answer = row['answer']
        ground_truth = MMLU_CHOICES[number_answer]

        # Setting the variables within the template
        messages = mmlu_prompt_template.format_messages(
            formatted_subject = formatted_subject,
            question = question,
            choices = choices
        )

        # Invoking the LLM to generate a response
        llm_response = json.loads(llm.generate(messages = [messages]).json())

        # Extracting the prediction from the LLM response
        prediction = llm_response['generations'][0][0]['text']

        # Extracting the log probs from the LLM response
        logprobs = llm_response['generations'][0][0]['generation_info']['logprobs']['content'][0]['top_logprobs']

        # Instantiating a dictionary to a direct key-value pair between token name and that token's logprob
        token_logprobs = {}

        # Iterating over the "raw" log probs
        for logprob in logprobs:
            token_name = logprob['token']
            token_logprob = logprob['logprob']
            token_logprobs[token_name] = token_logprob

        # Instantiating a dictionary to hold the log probs for each of the respective ABCD answers
        abcd_logprobs = {}

        # Iterating over each of the MMLU choices
        for choice in MMLU_CHOICES:

            # Checking to see if each ABCD answer is in the current logprobs
            if choice in token_logprobs.keys():
                
                # Adding the appropriate logprob to abcd_logprobs
                abcd_logprobs[choice] = token_logprobs[choice]

            else:

                # Adding a fake penalty logprob since the answer isn't present
                abcd_logprobs[choice] = -100

        # Setting the softmax values for the ABCD logprobs
        abcd_prob_values = softmax(np.array(list(abcd_logprobs.values())))

        # Creating a boolean value to check if the prediction matches the ground truth
        is_true = ground_truth == prediction
        
        # Appending the final values
        subject_is_corrects.append(is_true)
        subject_probs.append(abcd_prob_values)

    # Appending the mean of all the correct subject answers to the full list of correct answers
    all_is_corrects.append(np.array(np.mean(subject_is_corrects)))

    # Appending all the correlations together
    all_probs.append(subject_probs)

In [25]:
weighted_acc = np.mean(all_is_corrects)
print(weighted_acc)

0.5543859649122806


In [19]:
subject_probs

[array([3.46639708e-03, 3.73683562e-44, 9.76271730e-01, 2.02618732e-02]),
 array([1.00000000e+00, 3.73643148e-44, 3.73643148e-44, 3.73643148e-44]),
 array([3.73869673e-44, 1.00000000e+00, 3.73869673e-44, 3.73869673e-44]),
 array([9.99958993e-01, 3.76226296e-44, 2.50712080e-05, 1.59362294e-05]),
 array([2.03314051e-02, 3.81365273e-44, 4.66091520e-05, 9.79621986e-01])]

In [None]:
llm_response['generations'][0][0]['text']

In [None]:
formatted_subject = 'business ethics'
question = "Three contrasting tactics that CSO's can engage in to meet their aims are ________ which typically involves research and communication, ________, which may involve physically attacking a company's operations or ________, often involving some form of _______."
choices = '''A. Non-violent direct action, Violent direct action, Indirect action, Boycott
B. Indirect action, Instrumental action, Non-violent direct action, Information campaign
C. Indirect action, Violent direct action, Non-violent direct-action Boycott
D. Non-violent direct action, Instrumental action, Indirect action, Information campaign'''

messages = mmlu_prompt_template.format_messages(
    formatted_subject = formatted_subject,
    question = question,
    choices = choices
)

In [None]:
print(messages)

In [None]:
# Iterating over all the rows of the DataFrame
for index, row in df_subject.iterrows():
    
    # Extracting the question from the row
    question = row['question']
    print(question)

    # Extracting the choices from the row
    choices = ''
    all_choices = row['choices']
    for index, choice in enumerate(all_choices):
        choices += f'{MMLU_CHOICES[index]}. {choice}\n'

    # Extracting the answer from the row
    number_answer = row['answer']
    answer = MMLU_CHOICES[number_answer]

    # Invoking the LLM to generate a response

    break


Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.

A. Unsafe practices, Wants, Fear, Trivial

B. Unsafe practices, Distress, Fear, Serious

C. Safe practices, Wants, Jealousy, Trivial

D. Safe practices, Distress, Jealousy, Serious

Answer: B

In [None]:
number_answer = df_subject.iloc[0]['answer']
answer = CHOICES[number_answer]
answer

In [None]:
choices = ''
for index, choice in enumerate(all_choices):
    choices += f'{CHOICES[index]}. {choice}\n'
print(choices)

In [None]:
from openai import OpenAI

client = OpenAI(api_key = API_KEYS['OPENAI_API_KEY'])

test_prompt = '''The following are questions (with answers) about business ethics.

Three contrasting tactics that CSO's can engage in to meet their aims are ________ which typically involves research and communication, ________, which may involve physically attacking a company's operations or ________, often involving some form of _______.
A. Non-violent direct action, Violent direct action, Indirect action, Boycott
B. Indirect action, Instrumental action, Non-violent direct action, Information campaign
C. Indirect action, Violent direct action, Non-violent direct-action Boycott
D. Non-violent direct action, Instrumental action, Indirect action, Information campaign
'''

print(test_prompt)

In [None]:
completion = client.chat.completions.create(
    model = 'gpt-3.5-turbo-1106',
    logprobs = True,
    top_logprobs = 5,
    max_tokens = 1,
    temperature = 0,
    messages = [
        {'role': 'user', 'content': test_prompt}
    ]
)

In [None]:
completion_json = json.loads(completion.model_dump_json())
test_probs = completion_json['choices'][0]['logprobs']['content'][0]['top_logprobs']

In [None]:
generated_logprobs = {}
for prob in test_probs:
    token_name = prob['token']
    token_logprob = prob['logprob']
    generated_logprobs[token_name] = token_logprob
generated_logprobs

In [None]:
logprobs = list(generated_logprobs.values())
print(logprobs)

def softmax(x):
    z = x - max(x)
    numerator = np.exp(z)
    denominator = np.sum(numerator)
    softmax = numerator/denominator
    return softmax

softmax(np.array(list(token_logprobs.values())))

In [None]:
log_probs = []
for choice in MMLU_CHOICES:
    

    completion_json['choices'][0]['logprobs']['content'][0]['top_logprobs']['token']

In [None]:
completion = client.completions.create(
    model = 'gpt-3.5-turbo-instruct',
    prompt = test_prompt,
    logprobs = 5,
    max_tokens = 1,
    temperature = 0
)

completion

In [None]:
print(completion.choices)

In [None]:
result = llm.generate(messages = [messages])

In [None]:
json.loads(result.json())