# MMLU - Massive Multitask Language Understanding

MMLU stands for **Massive Multitask Language Understanding**, and it is perhaps the most widely reported benchmark on model cards for demonstrating a model's breadth of knowledge. The benchmark poses a series of scenarios and multiple-choice questions for the LLM to answer across 57 different subjects, grouped into broad areas such as STEM, the humanities, the social sciences, and more. Within each subject, the questions range from general topics, like the history of the field, to more specialized or "harder" ones, like ethical implications.

Originally introduced by a team at UC Berkeley in 2020, MMLU has since evolved into many different flavors, each varying things like the prompting style, the evaluation code, or even the subset of questions used. Hugging Face has [a really great write-up](https://github.com/huggingface/blog/blob/main/evaluating-mmlu-leaderboard.md) on these variations, and while they can produce a wide range of scores for the same model, the goal remains the same: assessing the LLM's breadth of knowledge.

In [28]:
# Importing the necessary Python libraries
import os
import yaml
from datasets import load_dataset

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

In [30]:
# Loading API keys from file (NOT pushed up to GitHub)
with open('../keys/api_keys.yaml') as f:
    API_KEYS = yaml.safe_load(f)

In [26]:
SUBCATEGORIES = {
    "abstract_algebra": ["math"],
    "anatomy": ["health"],
    "astronomy": ["physics"],
    "business_ethics": ["business"],
    "clinical_knowledge": ["health"],
    "college_biology": ["biology"],
    "college_chemistry": ["chemistry"],
    "college_computer_science": ["computer science"],
    "college_mathematics": ["math"],
    "college_medicine": ["health"],
    "college_physics": ["physics"],
    "computer_security": ["computer science"],
    "conceptual_physics": ["physics"],
    "econometrics": ["economics"],
    "electrical_engineering": ["engineering"],
    "elementary_mathematics": ["math"],
    "formal_logic": ["philosophy"],
    "global_facts": ["other"],
    "high_school_biology": ["biology"],
    "high_school_chemistry": ["chemistry"],
    "high_school_computer_science": ["computer science"],
    "high_school_european_history": ["history"],
    "high_school_geography": ["geography"],
    "high_school_government_and_politics": ["politics"],
    "high_school_macroeconomics": ["economics"],
    "high_school_mathematics": ["math"],
    "high_school_microeconomics": ["economics"],
    "high_school_physics": ["physics"],
    "high_school_psychology": ["psychology"],
    "high_school_statistics": ["math"],
    "high_school_us_history": ["history"],
    "high_school_world_history": ["history"],
    "human_aging": ["health"],
    "human_sexuality": ["culture"],
    "international_law": ["law"],
    "jurisprudence": ["law"],
    "logical_fallacies": ["philosophy"],
    "machine_learning": ["computer science"],
    "management": ["business"],
    "marketing": ["business"],
    "medical_genetics": ["health"],
    "miscellaneous": ["other"],
    "moral_disputes": ["philosophy"],
    "moral_scenarios": ["philosophy"],
    "nutrition": ["health"],
    "philosophy": ["philosophy"],
    "prehistory": ["history"],
    "professional_accounting": ["other"],
    "professional_law": ["law"],
    "professional_medicine": ["health"],
    "professional_psychology": ["psychology"],
    "public_relations": ["politics"],
    "security_studies": ["politics"],
    "sociology": ["culture"],
    "us_foreign_policy": ["politics"],
    "virology": ["health"],
    "world_religions": ["philosophy"],
}

CATEGORIES = {
    "STEM": ["physics", "chemistry", "biology", "computer science", "math", "engineering"],
    "humanities": ["history", "philosophy", "law"],
    "social sciences": ["politics", "culture", "economics", "geography", "psychology"],
    "other (business, health, misc.)": ["other", "business", "health"],
}

CHOICES = ['A', 'B', 'C', 'D']
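
The two dictionaries above form a two-level lookup: each subject maps to a subcategory, and each subcategory belongs to one top-level category. A minimal sketch of that lookup follows; the helper name `subject_to_category` and the `*_EXCERPT` dictionaries are hypothetical, introduced only so the example runs on its own.

```python
# Small excerpts of the notebook's mappings, repeated so the sketch is self-contained
SUBCATEGORIES_EXCERPT = {
    "abstract_algebra": ["math"],
    "anatomy": ["health"],
    "jurisprudence": ["law"],
    "sociology": ["culture"],
}
CATEGORIES_EXCERPT = {
    "STEM": ["physics", "chemistry", "biology", "computer science", "math", "engineering"],
    "humanities": ["history", "philosophy", "law"],
    "social sciences": ["politics", "culture", "economics", "geography", "psychology"],
    "other (business, health, misc.)": ["other", "business", "health"],
}


def subject_to_category(subject, subcategories, categories):
    """Return the top-level category whose list contains the subject's subcategory."""
    subcategory = subcategories[subject][0]
    for category, members in categories.items():
        if subcategory in members:
            return category
    raise KeyError(f"No category lists the subcategory {subcategory!r}")


print(subject_to_category("jurisprudence", SUBCATEGORIES_EXCERPT, CATEGORIES_EXCERPT))
# prints: humanities
```

Passing the full `SUBCATEGORIES` and `CATEGORIES` dictionaries instead of the excerpts should work the same way.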

In [22]:
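# (Down)loading the MMLU dataset from HuggingFace
# NOTE: 'dev' holds the 5 few-shot examples per subject; the 14,042-question evaluation set is the 'test' split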
mmlu_dataset = load_dataset(path = 'cais/mmlu',
                            name = 'all',
                            cache_dir = '../data/',
                            trust_remote_code = True,
                            split = 'dev')

# Loading the dataset as a Pandas dataframe
df_mmlu = mmlu_dataset.to_pandas()

In [35]:
# Getting a list of all the subjects
subjects = sorted(df_mmlu['subject'].unique())

In [20]:
df_mmlu.head()

| | question | subject | choices | answer |
|---|---|---|---|---|
| 0 | Which of the following best describes the bala... | high_school_government_and_politics | [Freedom of speech is protected except in cert... | 3 |
| 1 | Which of the following statements does NOT acc... | high_school_government_and_politics | [Registered voters between the ages of 35 and ... | 1 |
| 2 | Which of the following plays the most signific... | high_school_government_and_politics | [The geographical area in which the child grow... | 1 |
| 3 | What power was granted to the states by the Ar... | high_school_government_and_politics | [Coining money, Authorizing constitutional ame... | 0 |
| 4 | The primary function of political action commi... | high_school_government_and_politics | [contribute money to candidates for election, ... | 0 |
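
As the preview shows, `answer` is stored as an integer index into `choices` rather than as a letter. The sketch below shows one way a row could be turned into a lettered multiple-choice prompt. The helper `format_question`, the exact layout, and the example row are assumptions for illustration; as noted earlier, prompt wording differs between MMLU implementations.

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]


def format_question(question, choices, answer=None):
    """Render a question with lettered options.

    If `answer` (an integer index) is given, the answer letter is appended,
    which is the form a few-shot example would take. Otherwise the prompt
    ends with a bare "Answer:" for the model to complete.
    """
    lines = [question]
    for letter, choice in zip(CHOICE_LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    answer_line = "Answer:"
    if answer is not None:
        answer_line += f" {CHOICE_LETTERS[answer]}"
    lines.append(answer_line)
    return "\n".join(lines)


# Hypothetical row for illustration; it is not taken from the dataset
example = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}
prompt = format_question(example["question"], example["choices"], example["answer"])
print(prompt)
```

Omitting the `answer` argument yields the same text ending in `Answer:`, which is the form one would send for the question actually being evaluated.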


In [17]:
mmlu_dataset

Dataset({
    features: ['question', 'subject', 'choices', 'answer'],
    num_rows: 14042
})
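
Once a model has answered each question, scoring comes down to comparing the predicted letter with the letter implied by the `answer` index. The following is a rough sketch under stated assumptions: the helpers `extract_letter` and `accuracy_by_subject` are hypothetical, the responses are made up, and the letter extraction is deliberately naive (it takes the first standalone A-D letter, so a response beginning with the article "A" would be misread).

```python
import re
from collections import defaultdict

CHOICE_LETTERS = ["A", "B", "C", "D"]


def extract_letter(response):
    """Return the first standalone A-D letter in a response, or None if absent."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None


def accuracy_by_subject(records):
    """Compute per-subject accuracy.

    `records` is an iterable of (subject, answer_index, response_text) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, answer_index, response in records:
        total[subject] += 1
        if extract_letter(response) == CHOICE_LETTERS[answer_index]:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Hypothetical model responses, for illustration only
records = [
    ("anatomy", 0, "A"),
    ("anatomy", 2, "Answer: B"),
    ("astronomy", 3, "D."),
]
scores = accuracy_by_subject(records)
print(scores)
# prints: {'anatomy': 0.5, 'astronomy': 1.0}
```

Per-subject scores like these could then be averaged within each entry of `CATEGORIES` to report results at the category level.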