# Evaluating LLMs

| **Category**                     | **Tool**                                         | **Description**                                                                                                                                                 | **Tested LLMs**                 |
|----------------------------------|--------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|
| **General Evaluation Frameworks** | [Evals by OpenAI](https://github.com/openai/evals) | A framework by OpenAI designed for evaluating the performance of LLMs on custom datasets, using quantitative metrics like accuracy and F1.                       | GPT-3, GPT-3.5, GPT-4, GPT-4o   |
|                                  | [LangChain](https://github.com/hwchase17/langchain) | A library that supports evaluation by chaining together different components, making it easier to benchmark LLMs on diverse tasks.                               | GPT-3, GPT-3.5, GPT-4           |
| **Benchmarking Suites**          | [EleutherAI's LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) | A suite of tools for evaluating LLMs on a variety of NLP benchmarks, covering tasks like question answering, text classification, and more.                      | GPT-3, GPT-4                    |
|                                  | [BIG-bench](https://github.com/google/BIG-bench)  | A large-scale benchmark from Google for testing LLMs on over 200 diverse tasks. It’s designed to stress-test models on open-ended, complex tasks.                | GPT-3, GPT-4, GPT-4o            |
| **Advanced Reasoning and Knowledge** | [TruthfulQA](https://github.com/sylinrl/TruthfulQA) | Assesses a model’s ability to generate truthful responses and avoid misinformation.                                       | GPT-3.5, GPT-4, GPT-4o          |
|                                  | [MMLU](https://github.com/hendrycks/test)        | Tests multitask knowledge across varied domains such as humanities, sciences, and technical fields.                                                              | GPT-3, GPT-4, GPT-4o            |
|                                  | [HellaSwag](https://rowanzellers.com/hellaswag/) | Measures commonsense reasoning by asking a model to choose the most plausible continuation of a sentence.                                                      | GPT-3, GPT-3.5, GPT-4           |
|                                  | [BBH](https://arxiv.org/abs/2210.09261)          | BIG-Bench Hard: a subset of 23 challenging BIG-bench tasks, including arithmetic, logical reasoning, and date understanding.                                     | GPT-4, GPT-4o                    |
|                                  | [ANLI](https://github.com/facebookresearch/anli) | An adversarially constructed Natural Language Inference dataset that tests complex reasoning on challenging cases.                                               | GPT-3.5, GPT-4                  |
|                                  | [ARC](https://allenai.org/data/arc)                      | Science questions requiring sophisticated reasoning and background knowledge, aimed at school-level science exams. | GPT-3, GPT-4                                |
| **Coding and Mathematical Reasoning** | [HumanEval](https://github.com/openai/human-eval) | Evaluates models' programming skills by generating code in response to prompts.                                            | GPT-3, GPT-3.5, GPT-4, GPT-4o   |
|                                  | [MATH](https://github.com/hendrycks/math)        | Tests the model's ability to solve competition-level high school math problems.                                                                                 | GPT-4, GPT-4o                    |
|                                  | [GSM8K](https://github.com/openai/grade-school-math) | Focuses on elementary-level math tasks, testing arithmetic and basic multi-step reasoning skills.                                                                 | GPT-4, GPT-4o                    |
|                                  | [Codeforces](https://codeforces.com/)            | A competitive programming platform to test advanced coding and problem-solving skills.                                                                           | GPT-4, GPT-4o                    |
| **Task-Specific Evaluation Benchmarks** | [CoQA](https://stanfordnlp.github.io/coqa/) | A conversational question answering dataset that evaluates a model's ability to answer questions in a multi-turn dialogue, scored with F1.                       | GPT-2, GPT-3                     |
|                                  | [MNLI](https://cims.nyu.edu/~sbowman/multinli/)   | The Multi-Genre Natural Language Inference benchmark, which tests a model’s ability to classify a sentence pair as entailment, contradiction, or neutral.        | GPT-2, GPT-3, GPT-4              |
|                                  | [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | The Stanford Question Answering Dataset used for evaluating a model’s reading comprehension by answering questions based on Wikipedia articles.                  | GPT-2, GPT-3, GPT-4              |
|                                  | [GLUE](https://gluebenchmark.com/)                | The General Language Understanding Evaluation benchmark suite that assesses LLMs on a variety of NLP tasks, such as sentiment analysis, paraphrase detection, and more. | GPT-2, GPT-3, GPT-4              |
|                                  | [SuperGLUE](https://super.gluebenchmark.com/)     | An enhanced version of GLUE that includes more challenging NLP tasks, designed to evaluate cutting-edge language models on general language understanding.        | GPT-3, GPT-4                      |
|                                  | [Winograd Schema Challenge](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | A task for evaluating a model's common sense reasoning ability by resolving ambiguities in pronoun references.                                                   | GPT-3, GPT-4                      |
|                                  | [WikiText-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) | A Wikipedia-based dataset with a large vocabulary, used for testing general language modeling abilities.                                                          | GPT-2, GPT-3                     |
|                                  | [PTB (Penn Treebank)](https://catalog.ldc.upenn.edu/LDC99T42) | A language modeling dataset based on Wall Street Journal articles, known for its small vocabulary.                                                               | GPT-2, GPT-3                     |
|                                  | [CBT (Children's Book Test)](https://research.fb.com/downloads/babi/) | Tests story comprehension by requiring models to fill in missing words in children's book passages.                                                              | GPT-2, GPT-3                     |
|                                  | [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) | A larger version of WikiText-2, with a greater variety of topics and vocabulary.                                                                                | GPT-2, GPT-3                     |
|                                  | [BookCorpus](https://yknzhu.wixsite.com/mbweb)     | A dataset of free books by unpublished authors, providing narrative-rich text that challenges models with diverse styles.                                        | GPT-2, GPT-3                     |
|                                  | [1BW (One Billion Word)](http://www.statmt.org/lm-benchmark/) | Consists of sentences from news articles, used for large-scale language modeling.                                                                               | GPT-2, GPT-3                     |
|                                  | [enwik8 (Hutter Prize)](http://prize.hutter1.net/) | The first 100 MB of an English Wikipedia dump, commonly used for text compression and character-level language modeling benchmarks.                              | GPT-2, GPT-3                     |
|                                  | [OpenWebText](https://github.com/skeskinen/OpenWebText) | An open-source alternative to WebText, designed to replicate the dataset methodology using high-quality web content shared on Reddit.                            | GPT-2, GPT-3, GPT-4              |
|  | [BoolQ](https://github.com/google-research-datasets/boolean-questions) | Binary (yes/no) question-answering based on factual content.                                               | GPT-3.5, GPT-4                              |
|                                  | [OpenBookQA](https://allenai.org/data/open-book-qa)      | Open-book science QA on elementary-level science, requiring factual recall and knowledge application.             | GPT-4                                       |
|                                  | [PIQA](https://yonatanbisk.com/piqa/)                    | Physical commonsense reasoning, focusing on real-world physical interactions and understanding.                   | GPT-3.5, GPT-4                              |
| **Evaluation Metrics**           | [ROUGE](https://pypi.org/project/rouge-score/)   | A set of metrics for evaluating the quality of text summaries, based on overlapping n-grams, word sequences, and word pairs.                                    | GPT-3, GPT-4                    |
|                                  | [BLEU](https://pypi.org/project/bleu/)           | Metric for evaluating text generation tasks, often used in machine translation by measuring the overlap between generated and reference translations.            | GPT-3, GPT-4                    |
|                                  | [Perplexity](https://en.wikipedia.org/wiki/Perplexity) | Measures how well a language model predicts a sample, with lower values indicating better predictive performance. Commonly used for language modeling tasks.     | GPT-2, GPT-3, GPT-4, GPT-4o     |
|                                  | [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) | A basic metric for classification tasks, measuring the ratio of correctly predicted instances to the total instances.                                            | All LLMs                        |
|                                  | [TREC](https://trec.nist.gov/)                            | Text retrieval benchmark for evaluating search and information retrieval tasks.                                   | Used for GPT-4 with retrieval capabilities  |
| **Multilingual and Sentiment Analysis** | [MARC](https://registry.opendata.aws/amazon-reviews-ml/)   | Multilingual Amazon Reviews corpus, testing sentiment analysis across various languages.                        | GPT-4 with multilingual capabilities         |
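
Several of the metrics listed above (accuracy, perplexity, the F1 score used by QA benchmarks such as CoQA and SQuAD, and the n-gram overlap behind BLEU and ROUGE) are simple enough to sketch directly. The following is a minimal illustration, not the implementation used by any of the tools in the table: the function names are hypothetical, tokenization is plain whitespace splitting, and the F1 and n-gram functions omit the text normalization, brevity penalty, and multi-reference handling that official scorers apply.

```python
import math
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token.

    Lower values mean the model assigned higher probability to the text.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def token_f1(prediction, reference):
    """Token-overlap F1 on lowercased whitespace tokens (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the building block of BLEU.

    Assumption: single reference, no brevity penalty, one n-gram order.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


if __name__ == "__main__":
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
    # A model that assigns probability 0.25 to every token.
    print(perplexity([math.log(0.25)] * 4))
    print(token_f1("the cat sat", "the cat sat down"))
    print(ngram_precision("the cat sat on the mat", "the cat is on the mat"))
```

For reported results, a maintained package such as `rouge-score` or a benchmark's official scoring script is the safer choice, since small differences in tokenization and normalization change the numbers.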

# Exams

| **Exam Category**       | **Exam Name**                 | **Description**                                                                                                  | **Typical LLM Performance**                |
|-------------------------|-------------------------------|------------------------------------------------------------------------------------------------------------------|--------------------------------------------|
| **Advanced Placement (AP)** | [AP Calculus BC](https://apstudents.collegeboard.org/courses/ap-calculus-bc)         | Covers advanced calculus concepts, including differential and integral calculus.                                  | GPT-4: 3, GPT-4o: 3                        |
|                         | [AP English Literature](https://apstudents.collegeboard.org/courses/ap-english-literature-and-composition) | Exam assessing reading comprehension and literary analysis.                                                   | GPT-4: 4                                   |
|                         | [AP English Language](https://apstudents.collegeboard.org/courses/ap-english-language-and-composition) | Exam covering English composition and rhetorical analysis.                                                  | GPT-4: 3.5, GPT-4o: 3                      |
|                         | [AP Chemistry](https://apstudents.collegeboard.org/courses/ap-chemistry)               | Covers general chemistry principles.                                                                              | GPT-4: 2, GPT-4o: 2.5                      |
|                         | [AP Physics 2](https://apstudents.collegeboard.org/courses/ap-physics-2-algebra-based)  | Covers topics in fluid dynamics, thermodynamics, and electromagnetism.                                           | GPT-4: 2, GPT-4o: 2                        |
|                         | [AP Macroeconomics](https://apstudents.collegeboard.org/courses/ap-macroeconomics)     | Principles of macroeconomics, including economic indicators and policy.                                          | GPT-4: 3.5, GPT-4o: 3                      |
|                         | [AP Microeconomics](https://apstudents.collegeboard.org/courses/ap-microeconomics)     | Covers microeconomic principles, such as market structures and consumer behavior.                                | GPT-4: 3, GPT-4o: 3                        |
|                         | [AP Biology](https://apstudents.collegeboard.org/courses/ap-biology)                   | Covers topics in genetics, ecology, and cellular biology.                                                        | GPT-4: 3, GPT-4o: 3                        |
|                         | [AP World History](https://apstudents.collegeboard.org/courses/ap-world-history-modern) | Covers global historical events and civilizations.                                                               | GPT-4: 3                                   |
|                         | [AP US History](https://apstudents.collegeboard.org/courses/ap-united-states-history)   | Covers U.S. history from pre-Columbian times to present.                                                         | GPT-4: 3.5                                 |
|                         | [AP US Government](https://apstudents.collegeboard.org/courses/ap-united-states-government-and-politics) | Exam covering U.S. government structures and political systems.                                                | GPT-4: 3                                   |
|                         | [AP Psychology](https://apstudents.collegeboard.org/courses/ap-psychology)             | Covers psychological theories and concepts, including cognition and behavior.                                    | GPT-4: 4, GPT-4o: 3.5                      |
|                         | [AP Art History](https://apstudents.collegeboard.org/courses/ap-art-history)           | Exam covering art movements, techniques, and historical pieces.                                                 | GPT-4: 3                                   |
|                         | [AP Environmental Science](https://apstudents.collegeboard.org/courses/ap-environmental-science) | Covers human impact on the environment and sustainability.                                                       | GPT-4: 2.5                                 |
|                         | [AP Statistics](https://apstudents.collegeboard.org/courses/ap-statistics)             | Exam covering probability, data interpretation, and statistical analysis.                                        | GPT-4: 2.5, GPT-4o: 2                      |
| **Standardized Tests**  | [SAT Math](https://collegereadiness.collegeboard.org/sat)                             | Math section covering algebra, geometry, and trigonometry.                                                       | GPT-4: 700/800, GPT-4o: 680/800            |
|                         | [SAT EBRW](https://collegereadiness.collegeboard.org/sat)                              | Reading and writing section covering grammar, vocabulary, and comprehension.                                     | GPT-4: 740/800, GPT-4o: 720/800            |
|                         | [GRE Quantitative](https://www.ets.org/gre.html)                                       | Quantitative section covering math skills like algebra, geometry, and data analysis.                             | GPT-4: 153/170                             |
|                         | [GRE Verbal](https://www.ets.org/gre.html)                                             | Verbal section covering vocabulary, reading comprehension, and critical thinking.                                | GPT-4: 165/170                             |
|                         | [GRE Writing](https://www.ets.org/gre.html)                                            | Analytical writing section assessing argument analysis and writing skills.                                       | GPT-4: 3.5/6                               |
| **Math Competitions**   | [AMC 12](https://www.maa.org/math-competitions/amc-1012)                               | Covers advanced high school math topics, including algebra, geometry, and combinatorics.                         | GPT-4: 50% average performance             |
|                         | [AMC 10](https://www.maa.org/math-competitions/amc-1012)                               | High school math competition with introductory algebra and geometry.                                             | GPT-4: 60% average performance             |
| **Law Exams**           | [Uniform Bar Exam](https://www.ncbex.org/exams/ube/)                                   | A general legal knowledge exam covering various legal topics required for law licensure in the U.S.              | GPT-4: 276/400, GPT-4o: 270/400            |
|                         | [LSAT](https://www.lsac.org/lsat)                                                      | Law school admissions test covering logical and analytical reasoning and reading comprehension.                  | GPT-4: 150/180                             |
| **Competitive Coding**  | [Codeforces Rating](https://codeforces.com/)                                           | Competitive programming platform with a rating system based on problem-solving skills.                           | GPT-4: 1300-1400                           |
| **Science Competitions**| [USABO Semifinal 2020](https://www.usabo-trc.org/)                                     | Biology competition covering advanced biology topics.                                                            | GPT-4: 45% average performance             |
