# MMLU - Massive Multitask Language Understanding

MMLU stands for **Massive Multitask Language Understanding**, and it is perhaps the most widely reported benchmark on model cards for demonstrating a model's breadth of knowledge. The benchmark consists of multiple-choice questions for the LLM to answer across 57 different subjects, spanning STEM, the humanities, the social sciences, and more. Within each of these areas, the questions range from the more general, like the history of the topic, to the more specialized or "harder", like ethical implications.

Originally introduced by a team at UC Berkeley in 2020, MMLU has since been run in many different flavors, each varying things like the prompting style, the evaluation code, or even the subset of questions used. Hugging Face has [a really great write-up](https://github.com/huggingface/blog/blob/main/evaluating-mmlu-leaderboard.md) on these variations. While they can produce a wide range of scores for the same model, the goal remains the same: assessing the LLM's breadth of knowledge.

In [1]:
# Importing the necessary Python libraries
import os
import yaml
from datasets import load_dataset

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

In [2]:
# Loading API keys from file (NOT pushed up to GitHub)
with open('../keys/api_keys.yaml') as f:
    API_KEYS = yaml.safe_load(f)

In [3]:
SUBCATEGORIES = {
    "abstract_algebra": ["math"],
    "anatomy": ["health"],
    "astronomy": ["physics"],
    "business_ethics": ["business"],
    "clinical_knowledge": ["health"],
    "college_biology": ["biology"],
    "college_chemistry": ["chemistry"],
    "college_computer_science": ["computer science"],
    "college_mathematics": ["math"],
    "college_medicine": ["health"],
    "college_physics": ["physics"],
    "computer_security": ["computer science"],
    "conceptual_physics": ["physics"],
    "econometrics": ["economics"],
    "electrical_engineering": ["engineering"],
    "elementary_mathematics": ["math"],
    "formal_logic": ["philosophy"],
    "global_facts": ["other"],
    "high_school_biology": ["biology"],
    "high_school_chemistry": ["chemistry"],
    "high_school_computer_science": ["computer science"],
    "high_school_european_history": ["history"],
    "high_school_geography": ["geography"],
    "high_school_government_and_politics": ["politics"],
    "high_school_macroeconomics": ["economics"],
    "high_school_mathematics": ["math"],
    "high_school_microeconomics": ["economics"],
    "high_school_physics": ["physics"],
    "high_school_psychology": ["psychology"],
    "high_school_statistics": ["math"],
    "high_school_us_history": ["history"],
    "high_school_world_history": ["history"],
    "human_aging": ["health"],
    "human_sexuality": ["culture"],
    "international_law": ["law"],
    "jurisprudence": ["law"],
    "logical_fallacies": ["philosophy"],
    "machine_learning": ["computer science"],
    "management": ["business"],
    "marketing": ["business"],
    "medical_genetics": ["health"],
    "miscellaneous": ["other"],
    "moral_disputes": ["philosophy"],
    "moral_scenarios": ["philosophy"],
    "nutrition": ["health"],
    "philosophy": ["philosophy"],
    "prehistory": ["history"],
    "professional_accounting": ["other"],
    "professional_law": ["law"],
    "professional_medicine": ["health"],
    "professional_psychology": ["psychology"],
    "public_relations": ["politics"],
    "security_studies": ["politics"],
    "sociology": ["culture"],
    "us_foreign_policy": ["politics"],
    "virology": ["health"],
    "world_religions": ["philosophy"],
}

CATEGORIES = {
    "STEM": ["physics", "chemistry", "biology", "computer science", "math", "engineering"],
    "humanities": ["history", "philosophy", "law"],
    "social sciences": ["politics", "culture", "economics", "geography", "psychology"],
    "other (business, health, misc.)": ["other", "business", "health"],
}

CHOICES = ['A', 'B', 'C', 'D']
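
The two mappings above form a two-step lookup: a subject points to a subcategory, and each subcategory belongs to exactly one top-level category. The snippet below is a minimal sketch of that roll-up, which can be handy when reporting scores per category rather than per subject. The helper `subject_to_category` and the `_EXCERPT` dictionaries are hypothetical names introduced here for illustration; the excerpts simply repeat a few entries from the cell above so the snippet runs on its own.

```python
# Excerpts of the mappings defined above, repeated so this snippet is self-contained.
SUBCATEGORIES_EXCERPT = {
    "astronomy": ["physics"],
    "business_ethics": ["business"],
    "jurisprudence": ["law"],
    "sociology": ["culture"],
}

CATEGORIES_EXCERPT = {
    "STEM": ["physics", "chemistry", "biology", "computer science", "math", "engineering"],
    "humanities": ["history", "philosophy", "law"],
    "social sciences": ["politics", "culture", "economics", "geography", "psychology"],
    "other (business, health, misc.)": ["other", "business", "health"],
}


def subject_to_category(subject,
                        subcategories=SUBCATEGORIES_EXCERPT,
                        categories=CATEGORIES_EXCERPT):
    """Hypothetical helper: return the top-level category for an MMLU subject."""
    # Each subject maps to a one-element list holding its subcategory
    subcategory = subcategories[subject][0]

    # Find the top-level category whose member list contains that subcategory
    for category, members in categories.items():
        if subcategory in members:
            return category

    raise KeyError(f"No category lists the subcategory {subcategory!r}")


print(subject_to_category("business_ethics"))
```

With the full dictionaries from the cell above, the same function would cover all 57 subjects.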

In [4]:
# (Down)loading the MMLU dataset from HuggingFace
mmlu_dataset = load_dataset(path = 'cais/mmlu',
                            name = 'all',
                            cache_dir = '../data/',
                            trust_remote_code = True,
                            split = 'dev')

# Loading the dataset as a Pandas dataframe
df_mmlu = mmlu_dataset.to_pandas()

In [44]:
# Getting a list of all the subjects
subjects = sorted(df_mmlu['subject'].value_counts().keys())
subject = subjects[3]
formatted_subject = subject.replace('_', ' ')
subject

'business_ethics'

In [45]:
formatted_subject

'business ethics'

In [33]:
df_subject = df_mmlu[df_mmlu['subject'] == subject]

In [46]:
PROMPT_START = "The following are questions (with answers) about {}.\n\n"

In [60]:
ORIGINAL_MMLU_PROMPT = '''The following are questions (with answers) about {}.

{}
{}
Answer: {}
'''

In [61]:
print(ORIGINAL_MMLU_PROMPT.format(formatted_subject, 'question', 'choices', 'my_answer'))

The following are questions (with answers) about business ethics.

question
choices
Answer: my_answer



In [78]:
# Iterating over all the rows of the DataFrame
for row in range(len(df_subject)):
    
    # Extracting the question from the row
    question = df_subject.iloc[row]['question']

    # Extracting the choices from the row
    choices = ''
    all_choices = df_subject.iloc[row]['choices']
    for index, choice in enumerate(all_choices):
        choices += f'{CHOICES[index]}. {choice}\n'

    # Extracting the answer from the row
    number_answer = df_subject.iloc[row]['answer']
    answer = CHOICES[number_answer]

    # Formatting the prompt with the derived information
    prompt = ORIGINAL_MMLU_PROMPT.format(formatted_subject, question, choices, answer)

    print(prompt)
    print('\n\n\n')


The following are questions (with answers) about business ethics.

Beyond the business case for engaging in CSR there are a number of moral arguments relating to: negative _______, the _______that corporations possess and the ________ of business and society.
A. Externalities, Power, Independence
B. Publicity, Insubstantial resources, Mutual dependence
C. Publicity, Power, Independence
D. Externalities, Power, Mutual dependence

Answer: D





The following are questions (with answers) about business ethics.

_______ is the direct attempt to formally or informally manage ethical issues or problems, through specific policies, practices and programmes.
A. Corporate social responsibility
B. Business ethics management
C. Sustainability
D. Environmental management

Answer: B





The following are questions (with answers) about business ethics.

To ensure the independence of the non-executive board members, they are a number of steps which can be taken, which include non-executives being dr

Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.
A. Unsafe practices, Wants, Fear, Trivial
B. Unsafe practices, Distress, Fear, Serious
C. Safe practices, Wants, Jealousy, Trivial
D. Safe practices, Distress, Jealousy, Serious

Answer: B
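
The loop above prints one standalone prompt per question, repeating the header each time. In a few-shot setup, the header usually appears once, the solved examples follow with their answers, and the final question ends with a blank `Answer:` for the model to complete. The snippet below is a minimal sketch of that assembly under those assumptions, not the exact code of any particular harness. The function names `format_example` and `build_few_shot_prompt` are hypothetical, and the questions are made-up toy data rather than MMLU rows.

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]


def format_example(question, choices, answer_index=None):
    """Format one question; leave the answer blank when answer_index is None."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, choices)]
    answer = "" if answer_index is None else f" {CHOICE_LETTERS[answer_index]}"
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_few_shot_prompt(subject, solved_examples, target):
    """Join a single header, the solved examples, and one unanswered target question."""
    header = f"The following are questions (with answers) about {subject}.\n\n"
    blocks = [format_example(q, c, a) for q, c, a in solved_examples]
    blocks.append(format_example(target[0], target[1]))
    return header + "\n\n".join(blocks)


# Toy data (not taken from MMLU): (question, choices, index of the correct choice)
solved = [("What is 2 + 2?", ["3", "4", "5", "6"], 1)]
target = ("What is 3 + 3?", ["5", "6", "7", "8"])

prompt = build_few_shot_prompt("elementary mathematics", solved, target)
print(prompt)
```

In this notebook, the rows of `df_subject` could play the role of `solved_examples`, with the target question drawn from a different split.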

In [76]:
number_answer = df_subject.iloc[0]['answer']
answer = CHOICES[number_answer]
answer

'D'
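
Converting the numeric answer to a letter, as in the cell above, is what allows a model's reply to be compared against the gold label. The snippet below is a rough sketch of one way to score replies, assuming the model answers with an uppercase letter somewhere in its text. The helpers `extract_choice` and `accuracy` are hypothetical, the regular expression is a simple heuristic (it would misread a reply that begins with the article "A"), and the replies are made-up strings, not real model output.

```python
import re

CHOICE_LETTERS = ["A", "B", "C", "D"]


def extract_choice(response):
    """Return the first standalone uppercase A-D letter in a reply, or None."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None


def accuracy(responses, gold_indices):
    """Fraction of replies whose extracted letter matches the gold answer index."""
    correct = sum(
        extract_choice(response) == CHOICE_LETTERS[gold]
        for response, gold in zip(responses, gold_indices)
    )
    return correct / len(gold_indices)


# Made-up replies and gold answer indices, for illustration only
replies = ["D", "Answer: B", "I am not sure."]
gold = [3, 1, 0]

print(accuracy(replies, gold))
```

A reply with no recognizable letter counts as incorrect here; other evaluation variants handle such cases differently, which is one source of the score differences mentioned at the top of this notebook.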

In [71]:
choices = ''
for index, choice in enumerate(all_choices):
    choices += f'{CHOICES[index]}. {choice}\n'
print(choices)

A. Buycotts, Boycotts, Blockchain technology, Charitable donations
B. Buycotts, Boycotts, Digital technology, Increased Sales
C. Boycotts, Buyalls, Blockchain technology, Charitable donations
D. Boycotts, Buycotts, Digital technology, Increased Sales

