# PATCH: Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency
This is the code repository for our paper *Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency*.
The four evaluation datasets, based on TIMSS 2008 and 2011, can be downloaded separately here.
- Download and unzip the evaluation datasets in the `data` folder.
- You will also need to create a separate CSV file denoting whether each test item is multiple choice (MC), in the following format:
| Item | MC |
|---------|----|
| M032064 | 0 |
| M032094 | 1 |
| M032166 | 1 |
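The item-info file above can be generated with a few lines of Python. This is a minimal sketch: the item IDs are the examples from the table, and the output file name `item_info.csv` is a placeholder (use whatever name you configure in `utils.py`).

```python
import csv

# Hypothetical item IDs and MC flags; the real IDs come from the
# TIMSS-based evaluation datasets.
items = {"M032064": 0, "M032094": 1, "M032166": 1}

# "item_info.csv" is a placeholder file name.
with open("item_info.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Item", "MC"])  # header row matching the table above
    for item_id, is_mc in sorted(items.items()):
        writer.writerow([item_id, is_mc])
```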
- Define the following variables in `utils.py`:

```python
folder_path = "insert folder path"
latex_file_name = "insert the name of the latex file of the evaluation dataset"
test_item_info_file_name = "insert the name of the info file for test items (whether they are multiple choice)"
open_ai_api_key = "insert OpenAI API key"
google_api_key = "insert Google API key"
```
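For reference, here is a sketch of how a sampling script might consume the path settings; the exact usage inside `gemini.py` and `gpt4.py` may differ, and the folder path, file name, and contents below are placeholders for illustration only.

```python
import csv
import os

# Placeholders standing in for the variables defined in utils.py.
folder_path = "."
test_item_info_file_name = "item_info.csv"

info_path = os.path.join(folder_path, test_item_info_file_name)

# For illustration only: write a tiny item-info file in the format shown above.
with open(info_path, "w", newline="") as f:
    f.write("Item,MC\nM032064,0\nM032094,1\nM032166,1\n")

# Load the MC flag for each test item,
# e.g. {"M032064": False, "M032094": True, ...}
with open(info_path, newline="") as f:
    is_multiple_choice = {row["Item"]: row["MC"] == "1"
                          for row in csv.DictReader(f)}
```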
- To sample responses from Gemini and GPT-4, run:

```shell
python gemini.py
python gpt4.py
```
- Grade the responses from Gemini and GPT-4 manually.
- To analyse the graded responses, use the `analysis.R` script under the `analysis` folder.
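If you want a quick sanity check on the graded responses before running the full analysis in R, a simple per-item accuracy can be computed in Python. This is a hedged sketch: the `Item`/`Correct` record layout below is hypothetical and may not match the repo's actual graded-response format.

```python
from collections import defaultdict

# Hypothetical graded rows; the real graded-response format may differ.
graded = [
    {"Item": "M032064", "Correct": 1},
    {"Item": "M032064", "Correct": 0},
    {"Item": "M032094", "Correct": 1},
]

totals = defaultdict(int)
correct = defaultdict(int)
for row in graded:
    totals[row["Item"]] += 1
    correct[row["Item"]] += row["Correct"]

# Fraction of sampled responses graded correct, per item.
accuracy = {item: correct[item] / totals[item] for item in totals}
```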