math401-llm

Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks?

Full evaluation of all size models.

Dataset

MATH 401 = 1 Euler Equation + 16 group * 25 problems

Euler Equation.
Add & Subtract of two integers within 10.
Add & Subtract of two integers within 100.
Add & Subtract of two integers within 1,000.
Add & Subtract of two integers within 1,000,000,000,000.
Add & Subtract of two integers within -10~10.
Add & Subtract of two decimal numbers within -100~100.
Multiply two integers within 100.
Multiply two decimal numbers within 10.
Multiply two integers within 100,000.
Division of two integers within 100.
Exponentiation of with integer base within 10 and integer exponent within 2~4.
Exponentiation of with a decimal number within 10 as the base and a decimal number within 2~4 as the exponent.
Add, Subtract & Multiply with one integer within 10 and a common irrational number (i.e. $e$ or $\pi$).
Long arithmetic expressions with brackets, involved integers are all within 100 and operators contain add, subtract, multiply, and division.
Trigonometry functions including $\sin$, $\cos$, and $\tan$. Inputs can be in the format of degrees and radians ($\pi$ can also appear in the inputs).
Logarithm of integers within 1000 of different bases: $2,e,10$.

Metric

Accuracy

If the difference between the decoded number and the target number is less than $1e-3$, we consider it a correct prediction. Accuracy is calculated based on correct prediction counts.

Relative error

We denote decoded number is $\hat{y}$ and target is $y$. We calculate relative error by:

$RE = \min(10, \frac{|\hat{y}-y|}{\max(|y|, 1)})$

If LLM does not decode any number, we consider $RE=10$. We truncate the relative error to 10 to prevent that one big mistake dominate the average relative error.

Non-number ratio

If decoded content does not contain any numbers, we consider it a failure. We calculate the non-number ratio based on it.

Citation

@misc{yuan2023large,
      title={How well do Large Language Models perform in Arithmetic tasks?}, 
      author={Zheng Yuan and Hongyi Yuan and Chuanqi Tan and Wei Wang and Songfang Huang},
      year={2023},
      eprint={2304.02015},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README.md		README.md
math401.json		math401.json
math401_1.png		math401_1.png
math401_2.png		math401_2.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

math401.json

math401.json

math401_1.png

math401_1.png

math401_2.png

math401_2.png

Repository files navigation

math401-llm

Dataset

Metric

Accuracy

Relative error

Non-number ratio

Citation

About

Releases

Packages

GanjinZero/math401-llm

Folders and files

Latest commit

History

Repository files navigation

math401-llm

Dataset

Metric

Accuracy

Relative error

Non-number ratio

Citation

About

Resources

Stars

Watchers

Forks