This is the official repository for the paper "StatsChartMWP: A Dataset for Evaluating Multimodal Mathematical Reasoning Abilities on Math Word Problems with Statistical Charts". The paper link is coming soon.
The leaderboard is continuously being updated. If you have any new results to contribute, please feel free to reach out to us.
| # | Model | Method | Date | ALL | Bar | Hist | Line | Line-f | Scatter | D-axis | P-bar | Pie | Table | Comp | Radar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | o3 | LMM | 2025-09-08 | 82.75 | 81.73 | 77.71 | 76.96 | 71.97 | 83.12 | 82.81 | 90.91 | 88.10 | 93.23 | 83.98 | 33.33 |
| 2 | Qwen2.5-VL-72B | LMM | 2025-09-08 | 71.12 | 78.45 | 59.51 | 68.45 | 56.90 | 54.37 | 65.62 | 63.64 | 65.78 | 85.89 | 61.07 | 41.67 |
| 3 | Qwen2-VL-72B | LMM | 2025-02-23 | 59.33 | 69.91 | 39.29 | 60.03 | 46.44 | 43.75 | 62.50 | 59.09 | 65.78 | 77.12 | 50.39 | 62.50 |
| 4 | GPT-4o | LMM | 2025-02-23 | 57.05 | 66.51 | 26.38 | 58.76 | 42.26 | 45.62 | 68.75 | 54.55 | 72.57 | 81.54 | 49.50 | 45.83 |
| 5 | InternVL2_5-78B | LMM | 2025-02-23 | 55.25 | 70.93 | 29.26 | 56.12 | 40.59 | 48.75 | 57.81 | 54.55 | 57.01 | 74.27 | 51.84 | 37.04 |
| 6 | GPT4 (GPT-4o) | LLM | 2025-02-23 | 46.95 | 59.98 | 13.30 | 52.72 | 35.98 | 27.50 | 45.31 | 27.27 | 59.19 | 71.85 | 38.82 | 20.83 |
| 7 | InternVL2-Llama3-76B | LMM | 2025-02-23 | 45.02 | 58.81 | 24.58 | 50.43 | 35.98 | 43.12 | 42.19 | 13.64 | 48.08 | 57.38 | 35.37 | 29.17 |
| 8 | Qwen2-VL-7B | LMM | 2025-02-23 | 37.46 | 45.67 | 20.16 | 39.29 | 30.96 | 31.25 | 65.62 | 36.36 | 44.54 | 51.25 | 25.70 | 62.50 |
| 9 | GPT-4V | LMM | 2025-02-23 | 34.28 | 38.57 | 12.10 | 40.48 | 28.87 | 30.00 | 39.06 | 18.18 | 38.25 | 55.67 | 27.89 | 33.33 |
| 10 | LLaVA-OV-72B | LMM | 2025-02-23 | 32.39 | 38.33 | 15.26 | 39.80 | 30.54 | 35.62 | 42.19 | 31.82 | 34.32 | 45.97 | 22.91 | 16.67 |
| 11 | GPT4 (GPT-4V) | LLM | 2025-02-23 | 31.47 | 38.11 | 8.61 | 39.12 | 22.18 | 20.62 | 35.94 | 4.55 | 34.71 | 52.46 | 24.36 | 20.83 |
| 12 | Qwen-VL-MAX | LLM | 2025-02-23 | 30.24 | 37.40 | 10.19 | 29.51 | 19.25 | 20.00 | 29.69 | 18.18 | 37.86 | 54.74 | 16.91 | 33.33 |
| 13 | IXC-2.5-7B | LMM | 2025-02-23 | 22.55 | 31.10 | 7.36 | 29.25 | 17.99 | 18.75 | 43.75 | 18.18 | 24.88 | 29.72 | 15.02 | 41.67 |
| 14 | Cambrian-34B | LMM | 2025-02-23 | 18.15 | 22.03 | 8.77 | 27.89 | 14.23 | 18.75 | 46.88 | 22.73 | 16.52 | 20.24 | 14.02 | 41.67 |
| 15 | LLaVA-NeXT-34B | LMM | 2025-02-23 | 15.67 | 20.96 | 5.45 | 23.13 | 13.39 | 20.00 | 25.00 | 4.55 | 14.06 | 19.24 | 12.44 | 20.83 |
| 16 | DeepSeek-VL-7B | LMM | 2025-02-23 | 13.20 | 16.06 | 4.63 | 21.43 | 11.72 | 12.50 | 28.12 | 4.55 | 14.16 | 15.47 | 9.78 | 8.33 |
| 17 | HPT-1.0 | LMM | 2025-02-23 | 10.10 | 9.91 | 5.07 | 17.77 | 9.62 | 10.62 | 26.56 | 9.09 | 7.18 | 10.62 | 11.56 | 29.17 |
The StatsChartMWP dataset is a benchmark for developing AI models that can understand the multimodal information present in math word problems (MWPs) with statistical charts. It covers a wide variety of chart forms, spanning a broad visual spectrum and a range of mathematical knowledge competencies, and every item originates from a real-world educational context, including problems written by mathematics educators, genuine student inquiries, and historical examination questions. StatsChartMWP contains 8,514 unique MWPs with statistical charts across 11 chart types: bar, line, line-function, dual-axis, pie, composite, radar, histogram, scatter, percentage-bar, and table. A comparison between our dataset and ChartQA and FigureQA is shown below; R-Steps denotes the average number of reasoning steps per problem.
The StatsChartMWP dataset JSON file and images are provided in [data].
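A minimal loading sketch is shown below. The file and field names are assumptions for illustration only; refer to the released files in [data] for the actual schema.

```python
import json
from PIL import Image

# NOTE: the file path and field names below are hypothetical placeholders,
# not the guaranteed schema of the released JSON.
with open("data/statschartmwp.json", "r", encoding="utf-8") as f:
    problems = json.load(f)

sample = problems[0]
print(sample["question"])       # problem text (assumed field name)
print(sample["chart_type"])     # e.g. "bar", "histogram" (assumed field name)

# Open the associated statistical chart image (assumed field name and layout).
chart = Image.open(f'data/images/{sample["image"]}')
chart.show()
```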
We introduce CoTAR, a data augmentation strategy that uses CoT-augmented reasoning to alleviate the cross-modal alignment gap between representations of artificial figures and of technical language and equations. Specifically, instead of directly using the concise textual solutions of the MWPs, we use a state-of-the-art LLM to convert them into detailed step-by-step explanations in a CoT-like format, improving their logical clarity. Each step consists of a short step summary that explicitly states the purpose of the step, followed by a concrete reasoning response. The step summary serves as a guiding directive for the logical analysis or computation required in the current step, while the concrete reasoning response provides a detailed explanation of the process undertaken in response to the step summary. The architecture of our method is illustrated below:
An illustration of CoTAR. (a) The original MWP with a statistical chart. (b) The corresponding original solution. (c) The CoTAR solution. The bold words are the step summaries, and the sentences that follow are the reasoning responses.
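A minimal sketch of the augmentation step, assuming an OpenAI-compatible LLM endpoint. The instruction text, model name, and function below are placeholders, not the released CoTAR prompt; the actual prompt lives in the prompts directory and the full pipeline is in main.py.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

# Placeholder instruction illustrating the idea: each step must pair a short
# step summary (its purpose) with a concrete reasoning response.
COTAR_INSTRUCTION = (
    "Rewrite the concise solution as numbered steps. Each step must begin "
    "with a short bold step summary stating its purpose, followed by a "
    "detailed reasoning response that carries out that step."
)

def cotar_augment(problem: str, concise_solution: str, model: str = "gpt-4o") -> str:
    """Convert a concise textual solution into a CoT-style step-by-step solution."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": COTAR_INSTRUCTION},
            {"role": "user",
             "content": f"Problem:\n{problem}\n\nConcise solution:\n{concise_solution}"},
        ],
    )
    return response.choices[0].message.content
```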
We fine-tuned Qwen2-VL-7B on our proprietary training dataset using both problem-original solution pairs and problem-augmented solution pairs, achieving an 8.76% improvement in accuracy.
To fine-tune Qwen2-VL-7B, see the official Qwen2-VL GitHub repository.
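A minimal sketch of assembling the mixed training set (original plus CoTAR-augmented solution pairs) as conversation-style JSONL. The exact format expected by the Qwen2-VL fine-tuning scripts may differ, so treat the field names below as assumptions.

```python
import json

def to_record(image_path: str, question: str, solution: str) -> dict:
    # Conversation-style record; field names are illustrative, not the exact
    # schema required by the official Qwen2-VL fine-tuning code.
    return {
        "image": image_path,
        "conversations": [
            {"from": "user", "value": f"<image>\n{question}"},
            {"from": "assistant", "value": solution},
        ],
    }

def build_training_file(problems, out_path="train_mixed.jsonl"):
    """Write one record per original solution and one per CoTAR solution."""
    with open(out_path, "w", encoding="utf-8") as f:
        for p in problems:  # each p: dict with image / question / solution / cotar_solution
            f.write(json.dumps(to_record(p["image"], p["question"], p["solution"])) + "\n")
            f.write(json.dumps(to_record(p["image"], p["question"], p["cotar_solution"])) + "\n")
```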
The CoTAR prompt is provided in prompts. Run the main script to generate the CoTAR solution data:
`python main.py`

This work is marked with CC0 1.0.
Explore additional related research on vision-language large models, focusing on multimodal LLMs and mathematical reasoning:
- [ChartQA] ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
- [TABMWP] Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
- [MathVista] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [MATH-Vision] Measuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset
- [OlympiadBench] OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
- [InternVL] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- [LLaVA] LLaVA: Large Language and Vision Assistant
