🌐 Website • 🤗 Hugging Face • 📃 Paper
E-Eval is a Chinese K-12 educational assessment benchmark for large language models, covering 4,391 multiple-choice questions across 11 subjects at three difficulty levels. More details can be found in our paper.
- Leaderboard
- Data
- How to Evaluate on E-Eval
- How to submit
- Citation
- Acknowledgement
- TODO
## Leaderboard

The following lists the zero-shot and five-shot accuracy of the models evaluated in our initial release. Please visit our official Leaderboard for the latest models and their detailed results in each subject. We have noticed that for many instruction-tuned models, the zero-shot results are better than the few-shot ones.
Model | 0-shot answer-only | 5-shot answer-only | 5-shot CoT | Average |
---|---|---|---|---|
Qwen-72b | 89.0 | 88.7 | 88.8 | 88.8 |
Ernie-Bot 4.0 | 86.7 | 85.2 | 84.6 | 85.5 |
Yi-34b-chat | 72.5 | 81.4 | 76.6 | 76.8 |
Ernie-Bot | 76.1 | 75.7 | 75.7 | 75.8 |
GPT-4 | 70.5 | 73.8 | 67.4 | 70.6 |
Yi-6b-chat | 68.8 | 71.2 | 66.5 | 68.8 |
chatglm3-6b | 72.9 | 59.3 | 65.0 | 65.7 |
Qwen-7b | 58.7 | 60.4 | 60.4 | 59.9 |
baichuan2-13b-chat | 56.1 | 60.9 | 56.1 | 57.7 |
baichuan2-7b-chat | 55.2 | 56.2 | 52.9 | 54.8 |
GPT-3.5 | 54.5 | 56.9 | 52.3 | 54.6 |
Chinese-Alpaca-2-13B | 44.8 | 46.2 | 38.7 | 43.2 |
Educhat-base-002-13b | 37.1 | 40.6 | 36.1 | 37.9 |
Educhat-sft-002-13b | 33.2 | 39.4 | 36.1 | 36.2 |
Chinese-LLaMA-2-13B | 35.7 | 38.9 | 33.2 | 35.9 |
Educhat-sft-002-13b-baichuan | 54.0 | 14.4 | 38.1 | 35.5 |
Educhat-base-002-7b | 30.4 | 27.9 | 29.3 | 29.2 |
## How to Evaluate on E-Eval

Normally you can directly take the model's generation and extract the answer token (i.e., A, B, C, D) from it with simple regular expressions. In few-shot evaluation the model usually follows the given template, so this is straightforward. Sometimes, however, especially in zero-shot evaluation of models that have not undergone instruction tuning, the model may not follow the instructions well enough to produce a well-formatted generation. In that case we recommend computing the probability of "A", "B", "C", and "D" and taking the most likely one as the answer; this is a constrained decoding approach and was used in the official MMLU test code. Such a probability-based approach is not applicable to chain-of-thought settings.
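As a rough sketch of both strategies (this is illustrative rather than our official evaluation code; it assumes a HuggingFace-style causal LM, and all function names here are ours):

```python
import re

import torch

def extract_choice(generation: str) -> str | None:
    """Pull the first answer letter out of a free-form generation.

    In practice you may want a stricter pattern (e.g. anchored after
    「答案」) so letters inside the question text are not picked up.
    """
    match = re.search(r"[ABCD]", generation)
    return match.group(0) if match else None

def most_likely_choice(model, tokenizer, prompt: str) -> str:
    """Constrained decoding: score only "A"/"B"/"C"/"D" as the next token
    and return the most likely one. Not applicable to chain-of-thought."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # (vocab_size,)
    option_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in "ABCD"]
    return "ABCD"[int(torch.argmax(next_token_logits[option_ids]))]
```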
We use the following prompts when evaluating the models in our first release. For answer-only evaluation (the first line translates to "The following are single-answer multiple-choice questions from China's {subject} exam; please select the correct answer."):
```
以下是中国关于{科目}考试的单项选择题,请选出其中的正确答案。
{题目1}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:A
[k-shot demo, note that k is 0 in the zero-shot case]
{测试题目}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:
```
For chain-of-thought evaluation:

```
以下是中国关于{科目}考试的单项选择题,请选出其中的正确答案。
{题目1}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:让我们一步一步思考,
1. {解析过程步骤1}
2. {解析过程步骤2}
3. {解析过程步骤3}
所以答案是A。
[k-shot demo, note that k is 0 in the zero-shot case]
{测试题目}
A. {选项A}
B. {选项B}
C. {选项C}
D. {选项D}
答案:让我们一步一步思考,
1.
```
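For reference, a k-shot answer-only prompt in this template could be assembled along the following lines. This is only a sketch: the dict keys ("question", "A"–"D", "answer") are our assumption about the data layout, not the official loader.

```python
def build_answer_only_prompt(subject: str, dev_examples: list[dict],
                             test_example: dict, k: int = 5) -> str:
    """Assemble a k-shot answer-only prompt following the template above.

    Assumes each example dict carries the keys "question", "A"-"D", and
    (for dev examples) "answer"; adjust to the actual data files.
    """
    lines = [f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。"]
    for ex in dev_examples[:k]:
        lines.append(ex["question"])
        lines.extend(f"{opt}. {ex[opt]}" for opt in "ABCD")
        lines.append(f"答案:{ex['answer']}")
    lines.append(test_example["question"])
    lines.extend(f"{opt}. {test_example[opt]}" for opt in "ABCD")
    lines.append("答案:")  # the model continues from here
    return "\n".join(lines)
```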
## How to submit

You first need to prepare a UTF-8 encoded JSON file in the following format; please refer to submission_example.json for details.
```
## key within each subject is the "id" field from the dataset
{
    "high_school_biology": {
        "0": "A",
        "1": "B",
        "2": "B",
        ...
    },
    "subject_name": {
        "0": "ans_1",
        "1": "ans_2",
        ...
    }
    ...
}
```
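As a minimal sketch, such a file can be produced with the standard json module, assuming predictions were collected into a nested dict keyed by subject and question id:

```python
import json

# predictions[subject][question_id] -> answer letter, e.g. "A"
predictions = {
    "high_school_biology": {"0": "A", "1": "B", "2": "B"},
}

# ensure_ascii=False plus explicit UTF-8 keeps any non-ASCII content readable
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```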
Then you can submit the prepared JSON file here; note that you need to log in first to access the submission page.
## Citation

```bibtex
@article{hou2024eeval,
  title={E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models},
  author={Jinchang Hou and Chang Ao and Haihong Wu and Xiangtao Kong and Zhigang Zheng and Daijia Tang and Chengming Li and Xiping Hu and Ruifeng Xu and Shiwen Ni and Min Yang},
  journal={arXiv preprint arXiv:2401.15927},
  year={2024}
}
```
## Acknowledgement

Thanks to UNION INFORMATION for their support of this work.
## TODO

- add zero-shot results
- incorporate into OpenAI evals