LLM Math Evaluation Harness

A unified, precise, and extensible toolkit to benchmark LLMs on various mathematical tasks 🧮✨.

🔴🚀 Important Notice: We have observed discrepancies of more than 5% between results reported by different math evaluation frameworks. To ensure fair and standardized comparisons across research, this toolkit aims to harmonize evaluation methods and promote consistent, reliable math evaluation.

🌟 In Practice: Esteemed projects like ToRA (ICLR'24) and DeepSeek-Coder have leveraged this suite!

Features:

  • Models: Seamless compatibility with models from Hugging Face 🤗 and with vLLM inference.

  • Datasets: An extensive array of datasets including minerva_math, math, math_oai, gsm8k, gsm_hard, svamp, asdiv, mawps, tabmwp, finqa, theorem_qa, bbh, mmlu_stem, sat_math, mathqa, hungarian_exam.

  • Prompts: Diverse prompting paradigms, from Direct to Chain-of-Thought (CoT), Program-of-Thought (PoT/PAL), and Tool-Integrated Reasoning (ToRA); see the brief sketch after this list.
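
To make these paradigms concrete, here is an illustrative Python sketch of how the same grade-school question might be rendered under the direct, CoT, and PoT styles. These are not the harness's actual few-shot templates (those live in the repository's prompt code); the question string and wording below are made up purely to show the shape of each paradigm.

# Illustrative only: rough shapes of the prompting paradigms listed above.
# The real few-shot templates used by the harness live in the repository itself.
QUESTION = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did she sell altogether?"
)

# Direct: the model is asked for the final answer only.
direct_prompt = f"Question: {QUESTION}\nAnswer:"

# Chain-of-Thought: the model reasons step by step in natural language.
cot_prompt = f"Question: {QUESTION}\nAnswer: Let's think step by step."

# Program-of-Thought / PAL: the model writes a program whose output is the answer.
pot_prompt = (
    f"Question: {QUESTION}\n"
    "Write a Python program that computes and prints the final answer."
)

for name, prompt in [("direct", direct_prompt), ("cot", cot_prompt), ("pot", pot_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")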

🚀 Getting Started

⚙️ Environment Setup

Option 1: Conda

conda create -n math_eval python=3.10
conda activate math_eval

Option 2: Docker

We suggest using the vLLM Docker image directly:

docker run --network host --cap-add=SYS_ADMIN --privileged -d \
    --entrypoint '' --name vllm \
    --runtime nvidia --gpus all \
    --security-opt apparmor:unconfined \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt:/mnt \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    sleep infinity

Install

git clone https://github.com/ZubinGou/math-evaluation-harness.git
cd math-evaluation-harness
pip install -r requirements.txt

⚖️ Evaluation

  1. Configure the model and data settings in scripts/run_eval.sh, and set the PROMPT_TYPE variable accordingly:

    • For base models, choose from: direct, cot, pal, or tool-integrated.
    • For SFT models, options include: tora, wizard_zs, deepseek-math, etc.
      • To add a new model, update the construct_prompt function in utils.py with your prompt template (a sketch follows step 2 below).
  2. Run the script:

bash scripts/run_eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH
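
For the note in step 1 about adding new models, the sketch below shows what a new branch in construct_prompt might look like. Only the function name and file (utils.py) come from this README; the signature, the example fields, and the chat template are assumptions for illustration, so adapt them to the actual code in the repository.

# Hypothetical sketch of adding a new prompt template to construct_prompt in
# utils.py. The real function's signature and existing branches may differ;
# only the function name is taken from this README.
def construct_prompt(example: dict, prompt_type: str) -> str:
    question = example["question"]
    if prompt_type == "cot":
        return f"Question: {question}\nAnswer: Let's think step by step."
    if prompt_type == "my_sft_model":  # the new PROMPT_TYPE value passed on the command line
        # Chat-style template expected by your fine-tuned model (placeholder tokens).
        return f"<|user|>\n{question}\n<|assistant|>\n"
    raise ValueError(f"Unknown prompt_type: {prompt_type}")

After adding the branch, run the script with the new name, e.g. bash scripts/run_eval.sh my_sft_model /path/to/your/model.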

📊 Results

Base Models (CoT)

PROMPT_TYPE=cot

| Model | Size | Data | Uniq. Tokens | Train Tokens | GSM8K | MATH¹ | SVAMP | ASDiv | MAWPS | TAB² | MQA | MMLU STEM | SAT | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-2B Base Models | | | | | | | | | | | | | | |
| Tinyllama | 1.1B | - | - | - | 2.9 | 3.2 | 11.0 | 18.1 | 20.4 | 12.5 | 14.6 | 16.1 | 21.9 | 13.4 |
| Phi-1.5 | 1.3B | - | - | - | 32.4 | 4.2 | 43.4 | 53.1 | 66.2 | 24.4 | 14.3 | 21.8 | 18.8 | 31.0 |
| Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 48.5 | 63.6 | 79.0 | 29.2 | 25.1 | 31.3 | 40.6 | 40.0 |
| Gemma | 2.0B | - | - | - | 18.8 | 11.4 | 38.0 | 56.6 | 72.5 | 36.9 | 26.8 | 34.4 | 50.0 | 38.4 |
| DeepSeekLLM | 1.3B | OWM | 14B | 150B | 11.5 | 8.9 | - | - | - | - | - | 29.6 | 31.3 | - |
| DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | - | - | - | - | - | 33.1 | 56.3 | - |
| Rho-Math | 1.1B | OWM | 14B | 30B | 36.2 | 15.6 | 52.1 | 67.0 | 83.9 | 29.0 | 32.5 | 23.3 | 28.1 | 40.9 |
| >= 7B Base Models | | | | | | | | | | | | | | |
| LLaMA-2 | 7B | - | - | - | 14.0 | 3.6 | 39.5 | 51.7 | 63.5 | 30.9 | 12.4 | 32.7 | 34.4 | 31.4 |
| Mistral | 7B | - | - | - | 41.2 | 11.6 | 64.7 | 68.5 | 87.5 | 52.9 | 33.0 | 49.5 | 59.4 | 52.0 |
| Minerva | 8B | - | 39B | 164B | 16.2 | 14.1 | - | - | - | - | - | 35.6 | - | - |
| Minerva | 62B | - | 39B | 109B | 52.4 | 27.6 | - | - | - | - | - | 53.9 | - | - |
| Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | - | - | - | - | - | 63.9 | - | - |
| LLemma | 7B | PPile | 55B | 200B | 38.8 | 17.2 | 56.1 | 69.1 | 82.4 | 48.7 | 41.0 | 45.4 | 59.4 | 50.9 |
| LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 67.9 | 75.7 | 90.1 | 57.0 | 49.8 | 54.7 | 68.8 | 60.1 |
| Intern-Math | 7B | - | 31B | 125B | 41.8 | 14.4 | 61.6 | 66.8 | 83.7 | 50.0 | 57.3 | 24.8 | 37.5 | 48.7 |
| Intern-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 75.7 | 79.3 | 94.0 | 50.9 | 38.5 | 53.1 | 71.9 | 62.1 |
| DeepSeekMath | 7B | - | 120B | 500B | 64.1 | 34.2 | 74.0 | 83.9 | 92.4 | 63.4 | 62.4 | 56.4 | 84.4 | 68.4 |
| Rho-Math | 7B | OWM | 14B | 10.5B | 66.9 | 31.0 | 77.8 | 79.0 | 93.9 | 49.9 | 58.7 | 54.6 | 84.4 | 66.2 |

SFT Models (Code Interpreter)

PROMPT_TYPE=tora

| Model | Size | SFT Data | GSM8K | MATH | SVAMP | ASDiv | MAWPS | TAB | GSM-Hard | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT4-early (PAL) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
| MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
| ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
| ToRA | 70B | ToRA-69k | 84.3 | 49.7 | 82.7 | 86.8 | 93.8 | 74.0 | 67.2 | 76.9 |
| DeepSeekMath | 7B | ToRA-69k | 79.8 | 52.0 | 80.1 | 87.1 | 93.8 | 85.8 | 63.1 | 77.4 |
| Rho-Math | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
| Rho-Math | 7B | ToRA-69k | 81.3 | 51.8 | 80.8 | 85.5 | 94.5 | 70.1 | 63.1 | 75.3 |

SFT Models (CoT)

PROMPT_TYPE=deepseek-math

| Model | Size | GSM8K | MATH | SVAMP | ASDiv | MAWPS | AVG |
|---|---|---|---|---|---|---|---|
| DeepSeek-Math-Instruct | 7B | 82.4 | 45.8 | 83.5 | 90.1 | 95.7 | 79.5 |
| DeepSeek-Math-RL | 7B | 88.3 | 50.0 | 87.2 | 92.0 | 95.5 | 82.6 |

🍀 Contributing

This project is still under active development. We welcome any contributions, including bug reports, feature requests, and pull requests.

Footnotes

  1. We suggest using the OpenAI test subset for evaluating MATH performance, since the original MATH test set is already included in public training sets such as PRM800K. We use the minerva_math prompt.

  2. Abbreviations: TAB = tabmwp, MQA = mathqa, SAT = sat_math.
