# More Evaluation Results

| Model | HellaSwag | PIQA | WinoGrande | RACE-Middle | RACE-High | TriviaQA | NaturalQuestions | MMLU | MMLU (LM) | ARC-Easy | ARC-Challenge | GSM8K | HumanEval | MBPP | DROP (EM) | DROP (F1) | OpenBookQA | Pile-test | Pile-test (BPB) | BBH | AGIEval | CLUEWSC | CHID | CEval | CMMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2 7B | 75.6 | 78.0 | 69.6 | 60.7 | 45.8 | 63.8 | 25.5 | 45.8 | 44.5 | 69.1 | 49.0 | 15.5 | 14.6 | 21.8 | 34.0 | 39.8 | 57.4 | 1.739 | 0.764 | 38.5 | 22.8 | 64.0 | 37.9 | 33.9 | 32.6 |
| Qwen 7B v2 | 73.7 | 77.5 | 67.3 | 57.0 | 43.1 | 59.6 | 32.3 | 57.3 | 40.9 | 56.5 | 41.5 | 52.1 | 32.3 | 37.2 | 43.4 | 51.7 | 48.6 | 2.025 | 0.756 | 47.3 | 29.3 | 76.5 | 86.6 | 62.3 | 62.6 |
| Baichuan2 7B | 67.9 | 73.6 | 60.2 | 59.8 | 45.1 | 59.1 | 21.3 | 53.4 | 35.5 | 44.6 | 36.8 | 23.4 | 22.0 | 26.0 | 31.6 | 37.1 | 34.8 | 1.842 | 0.781 | 41.6 | 42.7 | 69.6 | 80.4 | 54.2 | 56.2 |
| DeepSeek 7B Base | 75.4 | 79.2 | 70.5 | 63.2 | 46.5 | 59.7 | 22.2 | 48.2 | 42.9 | 67.9 | 48.1 | 17.4 | 26.2 | 39.0 | 34.9 | 41.0 | 55.8 | 1.871 | 0.746 | 39.5 | 26.4 | 73.1 | 89.3 | 45.0 | 47.2 |
| DeepSeek 7B Chat | 68.5 | 77.6 | 66.9 | 65.2 | 50.8 | 57.9 | 32.5 | 49.4 | 42.3 | 71.0 | 49.4 | 62.6 | 48.2 | 35.2 | 37.5 | 49.1 | 54.8 | / | / | 42.3 | 19.3 | 71.9 | 64.9 | 47.0 | 49.7 |
| LLaMA2 70B | 84.0 | 82.0 | 80.4 | 70.1 | 54.3 | 79.5 | 36.1 | 69.0 | 53.5 | 76.5 | 59.5 | 58.4 | 28.7 | 45.6 | 63.6 | 69.2 | 60.4 | 1.526 | 0.671 | 62.9 | 37.2 | 76.5 | 55.5 | 51.4 | 53.1 |
| DeepSeek 67B Base | 84.0 | 83.6 | 79.8 | 69.9 | 50.7 | 78.9 | 36.6 | 71.3 | 54.1 | 76.9 | 59.0 | 63.4 | 42.7 | 57.4 | 61.0 | 67.9 | 60.2 | 1.660 | 0.662 | 68.7 | 41.3 | 81.0 | 92.1 | 66.1 | 70.8 |
| DeepSeek 67B Chat | 75.7 | 82.6 | 76.0 | 70.9 | 56.0 | 81.5 | 47.0 | 71.1 | 55.0 | 81.6 | 64.1 | 84.1 | 73.8 | 61.4 | 59.4 | 71.9 | 63.2 | / | / | 71.7 | 46.4 | 60.0 | 72.6 | 65.2 | 67.8 |
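
The Pile-test column appears to report per-token cross-entropy on the Pile test set, while Pile-test (BPB) normalizes it to bits-per-byte, which makes models with different tokenizers directly comparable. Below is a minimal sketch of that conversion, assuming the loss is measured in nats per token; the token and byte counts used are hypothetical, and only their ratio matters.

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Rescale per-token cross-entropy to bits-per-byte.

    Assumption: the Pile-test column above is mean cross-entropy in
    nats per token. BPB multiplies it by the tokens-to-bytes ratio of
    the evaluated text, then converts nats to bits via ln(2).
    """
    return loss_nats_per_token * (n_tokens / n_bytes) / math.log(2)

# Hypothetical counts (only the ratio matters): ~0.3045 tokens per byte
# reproduces the LLaMA2 7B row above: 1.739 nats/token -> ~0.764 BPB.
print(round(bits_per_byte(1.739, 3_045, 10_000), 3))  # 0.764
```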

## Math Evaluation Results of DeepSeek LLM 67B Chat

| Inference | GSM8K | MATH | MGSM-zh | CMATH | Gaokao-MathCloze | Gaokao-MathQA |
| --- | --- | --- | --- | --- | --- | --- |
| CoT | 84.1% | 32.6% | 74.0% | 80.3% | 16.9% | 20.2% |
| Tool-Integrated Reasoning | 86.7% | 51.1% | 76.4% | 85.4% | 21.2% | 28.2% |

## Never Seen Before Exam

| Model | DeepSeek LLM 67B Chat | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Grok-1 | Claude 2 | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hungarian National High-School Exam | 58 | 36.5 | 32 | 19.5 | 39 | 41 | 59 | 55 | 68 |

| Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | PaLM2 Small | DeepSeek LLM 67B Chat | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Prompt-level Instruction Following | 48.9 | 35.0 | 51.0 | 51.2 | 46.9 | 59.1 | 79.3 |

| Model | Qwen-14B-Chat | ChatGLM3-6B | Baichuan2-Chat-13B | Yi-Chat-34B | GPT-3.5-Turbo | Phind-CodeLlama-34B-v2 | DeepSeek LLM 67B Chat | DeepSeek Coder 33B | GPT-4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LeetCode Weekly Contest | 11.1 | 2.38 | 1.58 | 7.9 | 20.6 | 12.6 | 17.5 | 31.7 | 48.4 |