# PATCH: Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency
This is the code repository for our paper *Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency*.
The four evaluation datasets, based on TIMSS 2008 and 2011, can be downloaded separately here.
- Download and unzip the evaluation datasets in the `data` folder.
- You will also need to create a separate CSV file denoting whether each test item is multiple choice (MC), in the following format:
| Item | MC |
|---------|----|
| M032064 | 0 |
| M032094 | 1 |
| M032166 | 1 |
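The item-info file above can be generated with a few lines of Python. This is a minimal sketch: the item IDs are the examples from the table, and the output file name `item_info.csv` is a placeholder (use whatever name you configure in `utils.py`).

```python
import csv

# Hypothetical item IDs and MC flags; the real IDs come from the
# TIMSS-based evaluation datasets.
items = {"M032064": 0, "M032094": 1, "M032166": 1}

# "item_info.csv" is a placeholder file name.
with open("item_info.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Item", "MC"])  # header row matching the table above
    for item_id, is_mc in sorted(items.items()):
        writer.writerow([item_id, is_mc])
```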
- Define the following variables in `utils.py`:

```python
folder_path = "insert folder path"
latex_file_name = "insert the name of the latex file of the evaluation dataset"
test_item_info_file_name = "insert the name of the info file for test items (whether they are multiple choice)"
open_ai_api_key = "insert OpenAI API key"
google_api_key = "insert Google API key"
```
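For reference, here is a sketch of how a sampling script might consume the path settings; the exact usage inside `gemini.py` and `gpt4.py` may differ, and the folder path, file name, and contents below are placeholders for illustration only.

```python
import csv
import os

# Placeholders standing in for the variables defined in utils.py.
folder_path = "."
test_item_info_file_name = "item_info.csv"

info_path = os.path.join(folder_path, test_item_info_file_name)

# For illustration only: write a tiny item-info file in the format shown above.
with open(info_path, "w", newline="") as f:
    f.write("Item,MC\nM032064,0\nM032094,1\nM032166,1\n")

# Load the MC flag for each test item,
# e.g. {"M032064": False, "M032094": True, ...}
with open(info_path, newline="") as f:
    is_multiple_choice = {row["Item"]: row["MC"] == "1"
                          for row in csv.DictReader(f)}
```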
- To sample responses from Gemini and GPT-4, run:

```shell
python gemini.py
python gpt4.py
```
- Grade the responses from Gemini and GPT-4 manually.
- To analyse the graded responses, use the `analysis.R` script under the `analysis` folder.
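If you want a quick sanity check on the graded responses before running the full analysis in R, a simple per-item accuracy can be computed in Python. This is a hedged sketch: the `Item`/`Correct` record layout below is hypothetical and may not match the repo's actual graded-response format.

```python
from collections import defaultdict

# Hypothetical graded rows; the real graded-response format may differ.
graded = [
    {"Item": "M032064", "Correct": 1},
    {"Item": "M032064", "Correct": 0},
    {"Item": "M032094", "Correct": 1},
]

totals = defaultdict(int)
correct = defaultdict(int)
for row in graded:
    totals[row["Item"]] += 1
    correct[row["Item"]] += row["Correct"]

# Fraction of sampled responses graded correct, per item.
accuracy = {item: correct[item] / totals[item] for item in totals}
```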