Add ability to evaluate multiple choice tasks #5047

Merged
6 commits merged into master on Jan 21, 2024

Conversation

ikawrakow
Contributor

@ikawrakow ikawrakow commented Jan 20, 2024

Summary

This PR adds the ability to run multiple-choice-single-correct-answer type of LLM benchmarks with the perplexity tool.

Details

Commonly used LLM benchmarks of this type are

  • ARC
  • HellaSwag
  • MMLU
  • Truthful QA

Although the HellaSwag test asks the model to continue a given context with one of 4 possible endings, it is basically the same as finding the highest-probability answer (as per the LLM-predicted logits) to a multiple-choice question. Hence, it can be done with the exact same evaluation approach as ARC, MMLU, and Truthful QA (and the multiple_choice_score() function added by the PR achieves the exact same HellaSwag score as the existing implementation in hellaswag_score()).

I have posted validation datasets for all of the above benchmarks in this Huggingface repository. A very simple binary format is used for these datasets and in the implementation in this PR.

The results I'm getting with this implementation are not quite the same as we find on the Huggingface Open LLM Leaderboard (HFLB), see table below. This can be due to several reasons:

  • The validation datasets being different from the test datasets being used by HFLB
  • The evaluation technique implemented here not being quite the same as in the Eleuther AI Language Model Evaluation Harness utilized by HFLB
  • The way I'm combining the question with the possible answers not being the best
  • These are 0-shot evaluations, while HFLB tends to contain few-shot results (and few-shot is typically better than 0-shot)

Nevertheless, I think it is useful to add this capability to llama.cpp in its present form to allow experimentation. Perhaps this will lead to better approaches and to scores that match HFLB more closely.

The following table summarizes results for Mistral-7B-Instruct-v0.2. The full fp16 model is used, and the calculations are run on an RTX-4080 GPU.

| Benchmark | Score | HFLB score | Tasks | Time (s) |
|---|---|---|---|---|
| ARC-Easy | 64.56 ± 2.01 | - | 570 | 4.5 |
| ARC-Challenge | 50.17 ± 2.90 | - | 299 | 2.9 |
| ARC-Mix | 59.61 ± 1.66 | 63.14 | 869 | 7.4 |
| HellaSwag | 84.42 ± 0.36 | 84.88 | 10042 | 301 |
| MMLU-test | 42.01 ± 0.42 | 60.78 | 13943 | 269 |
| Truthful QA | 55.20 ± 1.74 | 68.26 | 817 | 9.7 |

Note: I'm assuming that ARC on HFLB is an even mix of ARC-Easy and ARC-Challenge.
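As a quick consistency check (my arithmetic, not a claim from the PR): the task-weighted combination of the ARC-Easy and ARC-Challenge rows, (570 × 64.56 + 299 × 50.17) / 869 ≈ 59.6, reproduces the ARC-Mix score in the table above.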

In this implementation, the prompts passed for evaluation for ARC, MMLU and TruthfulQA are in the form

Question: "question_body" Answer: answer_body

and the probability for each answer is computed from the tokens in answer_body. I did experiment with several variations, but the above gave the highest score. Somewhat surprisingly, I did not get a higher score for Mistral-7B-Instruct-v0.2 using

[INST]  question_body [/INST] answer_body

Given this, and the fact that the former is LLM/tuning agnostic, I have prepared the datasets in this form. Obviously one could instead store the bare questions/answers in the dataset and add a question prefix/suffix via command line arguments.
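To make the selection rule concrete, here is a minimal self-contained C++ sketch of the idea (not the PR's actual code; the prompts and log-probability values are placeholders): the per-token log-probabilities of each answer_body, obtained after evaluating Question: "question_body" Answer: answer_body, are summed, and the answer with the highest total is picked.

```cpp
// Sketch only: pick the answer whose tokens get the highest total log-probability.
// In the real tool the per-token log-probs come from the model's logits; here they are given.
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>

struct ScoredAnswer {
    std::string        text;
    std::vector<float> token_logprobs; // log p(token | question + preceding answer tokens)
};

// Index of the most likely answer under the summed log-probability rule.
static size_t pick_answer(const std::vector<ScoredAnswer> & answers) {
    size_t best    = 0;
    float  best_lp = -1e30f;
    for (size_t i = 0; i < answers.size(); ++i) {
        const float lp = std::accumulate(answers[i].token_logprobs.begin(),
                                         answers[i].token_logprobs.end(), 0.0f);
        if (lp > best_lp) { best_lp = lp; best = i; }
    }
    return best;
}

int main() {
    // Hypothetical numbers for a 4-way question.
    const std::vector<ScoredAnswer> answers = {
        {"Paris",  {-0.3f, -0.1f}},
        {"London", {-2.5f, -0.8f}},
        {"Berlin", {-3.1f, -0.4f}},
        {"Madrid", {-2.9f, -1.2f}},
    };
    std::printf("selected answer: %s\n", answers[pick_answer(answers)].text.c_str());
    return 0;
}
```

Whether to length-normalize the summed log-probability is a separate design choice; the sketch simply uses the plain sum.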

Usage

./perplexity -m <some_model> -bf <some_data_file> --multiple-choice [--multiple-choice-tasks N] [other GPT parameters]

Without the --multiple-choice-tasks argument, or with N = 0 or N >= the number of tasks, all tasks in <some_data_file> will be run consecutively; otherwise a random sample of N tasks will be selected.
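For illustration, the task selection described above amounts to something like the following sketch (assumed behaviour, not the PR's actual code):

```cpp
// Sketch: pick which task indices to run for --multiple-choice-tasks N.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<size_t> select_tasks(size_t n_total, size_t n_requested) {
    std::vector<size_t> idx(n_total);
    std::iota(idx.begin(), idx.end(), 0);
    if (n_requested == 0 || n_requested >= n_total) {
        return idx;                        // run all tasks consecutively
    }
    std::mt19937 rng(std::random_device{}());
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(n_requested);               // random sample of N tasks
    return idx;
}
```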

It works, but the scores I'm getting are much lower than on the HFLB leaderboard:

  • 29.1% vs 42.2% on HFLB for Mistral-7B
  • 55.2% vs 68.3% on HFLB for Mistral-7B-Instruct-v0.2

I know the implementation is correct because the same function that is used to evaluate the TruthfulQA score can also be used for HellaSwag if one converts the HellaSwag dataset to the binary format I use for TruthfulQA, and I get the exact same HellaSwag score as the existing HellaSwag implementation.

The implementation uses the same batched evaluation now used for HellaSwag and Winogrande, and needs just 9 seconds to process the 817 tasks of the TruthfulQA validation dataset.

I'm combining the question and each answer as Question: "question goes here" Answer: answer goes here. This is probably not the best way, and I didn't find a variation that works better (i.e., produces higher scores), but it definitely works better than simply concatenating the question with each multiple-choice answer.

Why a binary format for this test? Because, unlike HellaSwag's line-oriented text data, it can handle multiple-choice questions with an arbitrary number of answers and with one or several correct answers, without adding a massive dependency on Parquet (the format used on Huggingface) or JSON parsing libraries to llama.cpp.
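To illustrate the kind of format meant here, a minimal length-prefixed layout along the following lines would cover an arbitrary number of answers and one or more correct answers; this is a sketch of the general idea, not the exact format used by the PR or the posted datasets:

```cpp
// Sketch of a length-prefixed binary layout for one multiple-choice task:
//   u32 question length + bytes, u32 answer count, (u32 length + bytes) per answer,
//   u32 count of correct answers, then their u32 indices.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct MCTask {
    std::string              question;
    std::vector<std::string> answers;
    std::vector<uint32_t>    correct;  // indices into answers; one or more entries
};

static void write_u32(std::ofstream & out, uint32_t v) {
    out.write(reinterpret_cast<const char *>(&v), sizeof(v));
}

static void write_str(std::ofstream & out, const std::string & s) {
    write_u32(out, (uint32_t)s.size());
    out.write(s.data(), (std::streamsize)s.size());
}

void serialize_task(std::ofstream & out, const MCTask & task) {
    write_str(out, task.question);
    write_u32(out, (uint32_t)task.answers.size());
    for (const auto & a : task.answers) {
        write_str(out, a);
    }
    write_u32(out, (uint32_t)task.correct.size());
    for (uint32_t i : task.correct) {
        write_u32(out, i);
    }
}
```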

@ikawrakow ikawrakow changed the title Add TruthfulQA benchmark - multiple choice variiant with single correct answer Add TruthfulQA benchmark - multiple choice variant with single correct answer Jan 20, 2024
@cmp-nct
Contributor

cmp-nct commented Jan 20, 2024

Nice work

I think JSON would be the best choice for all those kinds of benchmarks.
We already have nlohmann in use in quite a few examples, and the library is a charm to deal with in C++.
Converting other formats to "our" JSON format would also be possible with a simple script, given how accessible JSON is.

I'd also not include it in llama.cpp directly; it could be an optional dependency, added behind a #define flag, so that only the actual benchmarking tool(s) include it.

@ikawrakow
Contributor Author

I think JSON would be the best choice for all those kinds of benchmarks.

I knew someone would bring up JSON :-)

The code that reads the binary data is 24 lines in 3 functions. In comparison, nlohmann's json.hpp is 24,596 LOC (and I find it only in the server example). But let's say we agreed that it needs to be JSON because binary files are scary. This is not my repo, so @ggerganov should express his opinion, but if it were mine, I would definitely not want to have N > 1 copies of json.hpp in it, which would make the whole project have this as a dependency. Would I add a 25k LOC dependency to my project to replace 24 LOC reading binary data that is very unlikely to change? Hmm, not sure.

The way it is now, it handles ARC, MMLU, TruthfulQA, and HellaSwag. I have posted test/validation datasets in https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp. It could theoretically also handle Winogrande, but I would leave that one to a separate implementation due to the slightly different probability evaluation (it uses partial evaluation). Converting to JSON is trivial: just copy/paste the 24 LOC into a .cpp file and use json.hpp to dump the data.
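For reference, dumping such a task with nlohmann::json really does take only a handful of lines; a sketch, with purely illustrative field names:

```cpp
// Sketch: dump one multiple-choice task to JSON with nlohmann::json.
// Field names are illustrative only.
#include <fstream>
#include <string>
#include <vector>

#include <nlohmann/json.hpp>

void dump_task_to_json(const std::string & question,
                       const std::vector<std::string> & answers,
                       const std::vector<int> & correct,
                       const std::string & path) {
    nlohmann::json j;
    j["question"] = question;   // implicit conversion from std::string
    j["answers"]  = answers;    // implicit conversion from std::vector<std::string>
    j["correct"]  = correct;    // indices of the correct answer(s)

    std::ofstream out(path);
    out << j.dump(2) << '\n';   // pretty-print with 2-space indentation
}
```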

@ikawrakow ikawrakow changed the title Add TruthfulQA benchmark - multiple choice variant with single correct answer Add ability to evaluate multiple choice tasks Jan 21, 2024
@ikawrakow ikawrakow marked this pull request as ready for review January 21, 2024 09:53
Owner

@ggerganov ggerganov left a comment

The binary format is great, no need for json

examples/perplexity/perplexity.cpp (outdated review comment, resolved)
@ikawrakow ikawrakow merged commit 7dcbe39 into master Jan 21, 2024
43 of 47 checks passed
@ikawrakow ikawrakow deleted the ik/truthfull_qa branch January 21, 2024 12:42
@cmp-nct
Contributor

cmp-nct commented Jan 21, 2024

It was no complaint.
I would not make JSON a dependency for llama.cpp itself, just for the benchmark tool. Given that we already have the server with JSON support, the library is already available in the repo, and it can be conditionally included behind a #define.

My reasoning is that a binary format cannot be viewed; it needs a special viewer someone has to write.
It cannot be edited; it needs a compiler/converter to produce it, and people have to learn how to find and use that tool.
The benchmarks are human-readable text, so keeping them human readable removes a barrier to their use.

People might want to use their own custom benchmarks, for example. JSON means anyone can just start working on it; bin means you have to have a much higher level of competence (currently no converter available?).

@Nexesenex
Contributor

Could you make and upload an ARC-Mix .bin please, @ikawrakow?

@ikawrakow
Contributor Author

Could you make and upload an ARC-Mix .bin please, @ikawrakow?

Done.

It cannot be edited;

This is a feature, not a bug :-)

People might want to use their own custom benchmarks, for example. JSON means anyone can just start working on it; bin means you have to have a much higher level of competence (currently no converter available?).

I have added a simple demo program to the repository that uses nlohmann::json to convert to JSON. Simply:

g++ -o convert convert.cpp
./convert some_file.bin some_file.json

@Nexesenex
Contributor

Nexesenex commented Jan 28, 2024

  • The MMLU test might have some trouble at iteration 1160 (crash after 1159) and at 314 (the test also stopped a few times there).
  • ARC-Mix is not working, unlike ARC-Challenge and ARC-Easy.

Also, a noob question: is it possible to chain several benchmark runs (or even perplexity calculations at different context sizes) via several commands without reloading the model?

@cmp-nct
Contributor

cmp-nct commented Jan 29, 2024

Now it's a feature :-)
nice to see

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr's available
inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Feb 3, 2024

@ikawrakow, @cmp-nct

Just want to see if I'm missing anything - I would like to do exactly as mentioned above, by making a custom benchmark. Is the above only to convert .bin -> JSON, and if so, is there a way to go back to .bin?

My use-case is simply that many public benchmarks have contaminated training data, and a custom benchmark could be more relevantly tailored to my usages.

@ikawrakow
Contributor Author

@strawberrymelonpanda

Somebody needs to modify the tool to be able to load JSON, store the data into the MultiplChoice struct, and then use the provided serialize function to store it in the binary format. Not sure that someone will be me, as I genuinely dislike working with JSON.
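For anyone who wants to attempt this, the JSON side could look roughly like the sketch below; the struct and field names here are placeholders, and the real code would fill whatever struct the tool actually uses and then call its serialize function:

```cpp
// Sketch: load tasks from a JSON array of {question, answers, correct} objects.
// The struct and field names are placeholders, not the tool's actual ones.
#include <fstream>
#include <string>
#include <vector>

#include <nlohmann/json.hpp>

struct Task {
    std::string              question;
    std::vector<std::string> answers;
    std::vector<int>         correct;
};

std::vector<Task> load_tasks(const std::string & path) {
    std::ifstream  in(path);
    nlohmann::json j;
    in >> j;                     // parse the whole file

    std::vector<Task> tasks;
    for (const auto & e : j) {   // iterate over the top-level array
        Task t;
        t.question = e.at("question").get<std::string>();
        t.answers  = e.at("answers").get<std::vector<std::string>>();
        t.correct  = e.at("correct").get<std::vector<int>>();
        tasks.push_back(std::move(t));
    }
    return tasks;
}
```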

@strawberrymelonpanda
Contributor

Understood, and thanks.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 16, 2024

For anyone who comes across this topic, I've managed to put together a JSON->bin encoder. I've tested it both with my own multi-choice questions, and by decoding and re-encoding a .bin file and getting the same results back.

If you're interested in making your own multiple-choice benchmarks, please see the gist here.

Note: As of now, to include a test with fewer than 100 questions, you must make a small edit to llama.cpp/examples/perplexity/perplexity.cpp and change line 1428 to int n_dot = std::max((int) n_task/100, 1);, or you'll get a floating point error.
Edit: PR fix merged!

I've also included a simple text -> JSON formatter to make the process easier.

Q:] QUESTION TEXT
A1:] CORRECT ANSWER TEXT
A2:] ANSWER
A3:] ANSWER
A4:] ANSWER

Q:] QUESTION TEXT
A1:] CORRECT ANSWER TEXT
A2:] ANSWER
A3:] ANSWER
A4:] ANSWER
python tojson.py custom-test.txt custom-test.json
./encode custom-test.json custom-test.bin
./perplexity -m model -bf custom-test.bin --multiple-choice

@ikawrakow Please feel free to include either script in the Readme of your repo, the repo itself, or even as part of convert.cpp, if you wish to.
