
PATCH! Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency

This is the code repository for our paper *Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency*.

Four evaluation datasets based on TIMSS 2008 and 2011 can be downloaded separately here.

Instructions

  1. Download and unzip the evaluation datasets into the data folder.
  2. Create a separate CSV file denoting whether each test item is multiple choice (MC), in the following format:
| Item    | MC |
|---------|----|
| M032064 | 0  |
| M032094 | 1  |
| M032166 | 1  |
  3. Define the following variables in utils.py:

```python
folder_path = "insert folder path"
latex_file_name = "insert the name of the latex file of the evaluation dataset"
test_item_info_file_name = "insert name of the info file for test items (whether they are multiple choice)"
open_ai_api_key = "insert open ai api key"
google_api_key = "insert google api key"
```
  4. To sample responses from Gemini and GPT-4, run:

```shell
python gemini.py
python gpt4.py
```
  5. Grade the responses from Gemini and GPT-4 manually.

  6. To analyse the graded responses, use the analysis.R script in the analysis folder.
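As a quick sanity check before running the sampling scripts, you can verify that the test-item info file from step 2 parses as expected. The sketch below is illustrative only: `load_mc_flags` is a hypothetical helper, not part of this repo, and it assumes the file is a plain comma-separated CSV with the `Item` and `MC` columns shown above.

```python
import csv
import io

def load_mc_flags(csv_file):
    """Map each test-item ID to True if it is multiple choice (MC == 1).

    Hypothetical helper; assumes a comma-separated file with the
    header row ``Item,MC`` as described in step 2.
    """
    reader = csv.DictReader(csv_file)
    return {row["Item"]: row["MC"] == "1" for row in reader}

# Example using the three items from the table above.
sample = io.StringIO("Item,MC\nM032064,0\nM032094,1\nM032166,1\n")
flags = load_mc_flags(sample)
print(flags)  # {'M032064': False, 'M032094': True, 'M032166': True}
```

In practice you would open the file named by `test_item_info_file_name` from `utils.py` instead of the in-memory example.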
