Add ability to evaluate multiple choice tasks #5047

Merged
6 commits merged into master on Jan 21, 2024

Conversation

ikawrakow
Contributor

@ikawrakow ikawrakow commented Jan 20, 2024

Summary

This PR adds the ability to run multiple-choice-single-correct-answer type of LLM benchmarks with the perplexity tool.

Details

Commonly used LLM benchmarks of this type are

  • ARC
  • HellaSwag
  • MMLU
  • Truthful QA

Although the HellaSwag test asks the model to continue a given context with one of 4 possible endings, it is basically the same as finding the highest-probability answer (as per the LLM-predicted logits) to a multiple-choice question. Hence, it can be done with the exact same evaluation approach as ARC, MMLU, and Truthful QA (and the multiple_choice_score() function added by the PR achieves the exact same HellaSwag score as the existing implementation in hellaswag_score()).

I have posted validation datasets for all of the above benchmarks in this Huggingface repository. A very simple binary format is used for these datasets and in the implementation in this PR.

The results I'm getting with this implementation are not quite the same as we find on the Huggingface Open LLM Leaderboard (HFLB), see table below. This can be due to several reasons:

  • The validation datasets being different from the test datasets being used by HFLB
  • The evaluation technique implemented here not being quite the same as in the Eleuther AI Language Model Evaluation Harness utilized by HFLB
  • The way I'm combining the question with the possible answers not being the best
  • These are 0-shot evaluations, while HFLB tends to contain few-shot results (and few-shot is typically better than 0-shot)

Nevertheless, I think it is useful to add this capability to llama.cpp in its present form to allow experimentation. Perhaps this will lead to better approaches and to scores that match HFLB more closely.

The following table summarizes results for Mistral-7B-Instruct-v0.2. The full fp16 model is used, and the calculations are run on an RTX-4080 GPU.

| Benchmark | Score | HFLB score | Tasks | Time (s) |
|---|---|---|---|---|
| ARC-Easy | 64.56 ± 2.01 | - | 570 | 4.5 |
| ARC-Challenge | 50.17 ± 2.90 | - | 299 | 2.9 |
| ARC-Mix | 59.61 ± 1.66 | 63.14 | 869 | 7.4 |
| HellaSwag | 84.42 ± 0.36 | 84.88 | 10042 | 301 |
| MMLU-test | 42.01 ± 0.42 | 60.78 | 13943 | 269 |
| Truthful QA | 55.20 ± 1.74 | 68.26 | 817 | 9.7 |

Note: I'm assuming that ARC on HFLB is an even mix of ARC-Easy and ARC-Challenge.
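As a quick consistency check (my arithmetic, not a claim from the PR): the task-weighted combination of the ARC-Easy and ARC-Challenge rows, (570 × 64.56 + 299 × 50.17) / 869 ≈ 59.6, reproduces the ARC-Mix score in the table above.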

In this implementation, the prompts passed for evaluation for ARC, MMLU and TruthfulQA are in the form

Question: "question_body" Answer: answer_body

and the probability for each answer is computed from the tokens in answer_body. I did experiment with several variations, but the above gave the highest score. Somewhat surprisingly, I did not get a higher score for Mistral-7B-Instruct-v0.2 using

[INST]  question_body [/INST] answer_body

Given this, and the fact that the former is LLM/tuning agnostic, I have prepared the datasets in this form. Obviously one could instead store the bare questions/answers in the dataset and add a question prefix/suffix via command line arguments.
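To make the selection rule concrete, here is a minimal self-contained C++ sketch of the idea (not the PR's actual code; the prompts and log-probability values are placeholders): the per-token log-probabilities of each answer_body, obtained after evaluating Question: "question_body" Answer: answer_body, are summed, and the answer with the highest total is picked.

```cpp
// Sketch only: pick the answer whose tokens get the highest total log-probability.
// In the real tool the per-token log-probs come from the model's logits; here they are given.
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>

struct ScoredAnswer {
    std::string        text;
    std::vector<float> token_logprobs; // log p(token | question + preceding answer tokens)
};

// Index of the most likely answer under the summed log-probability rule.
static size_t pick_answer(const std::vector<ScoredAnswer> & answers) {
    size_t best    = 0;
    float  best_lp = -1e30f;
    for (size_t i = 0; i < answers.size(); ++i) {
        const float lp = std::accumulate(answers[i].token_logprobs.begin(),
                                         answers[i].token_logprobs.end(), 0.0f);
        if (lp > best_lp) { best_lp = lp; best = i; }
    }
    return best;
}

int main() {
    // Hypothetical numbers for a 4-way question.
    const std::vector<ScoredAnswer> answers = {
        {"Paris",  {-0.3f, -0.1f}},
        {"London", {-2.5f, -0.8f}},
        {"Berlin", {-3.1f, -0.4f}},
        {"Madrid", {-2.9f, -1.2f}},
    };
    std::printf("selected answer: %s\n", answers[pick_answer(answers)].text.c_str());
    return 0;
}
```

Whether to length-normalize the summed log-probability is a separate design choice; the sketch simply uses the plain sum.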

Usage

./perplexity -m <some_model> -bf <some_data_file> --multiple-choice [--multiple-choice-tasks N] [other GPT parameters]

Without the --multiple-choice-tasks argument, or with N = 0 or N >= the number of tasks, all tasks in <some_data_file> will be run consecutively; otherwise a random sample of N tasks will be selected.
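For illustration, the task selection described above amounts to something like the following sketch (assumed behaviour, not the PR's actual code):

```cpp
// Sketch: pick which task indices to run for --multiple-choice-tasks N.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<size_t> select_tasks(size_t n_total, size_t n_requested) {
    std::vector<size_t> idx(n_total);
    std::iota(idx.begin(), idx.end(), 0);
    if (n_requested == 0 || n_requested >= n_total) {
        return idx;                        // run all tasks consecutively
    }
    std::mt19937 rng(std::random_device{}());
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(n_requested);               // random sample of N tasks
    return idx;
}
```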

It works, but the scores I'm getting are much lower than on the HFLB leaderboard:

  • 29.1% vs 42.2% on HFLB for Mistral-7B
  • 55.2% vs 68.3% on HFLB for Mistral-7B-Instruct-v0.2

I know the implementation is correct because the same function that is used to evaluate the TruthfulQA score can also be used for HellaSwag if one converts the HellaSwag dataset to the binary format I use for TruthfulQA, and I get the exact same HellaSwag score as the existing HellaSwag implementation.

The implementation uses the same batched evaluation now used for HellaSwag and Winogrande, and needs just 9 seconds to process the 817 tasks of the TruthfulQA validation dataset.

I'm combining the question and each answer as Question: "question goes here" Answer: answer goes here. This is probably not the best way, and I didn't find a variation that works better (i.e., produces higher scores), but it definitely works better than simply concatenating the question with each multiple-choice answer.

Why a binary format for this test? Because, unlike HellaSwag's line-oriented text data, it can handle multiple-choice questions with an arbitrary number of answers and with one or several correct answers, without adding a massive dependency on Parquet (the format used on Huggingface) or JSON parsing libraries to llama.cpp.
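To illustrate the kind of format meant here, a minimal length-prefixed layout along the following lines would cover an arbitrary number of answers and one or more correct answers; this is a sketch of the general idea, not the exact format used by the PR or the posted datasets:

```cpp
// Sketch of a length-prefixed binary layout for one multiple-choice task:
//   u32 question length + bytes, u32 answer count, (u32 length + bytes) per answer,
//   u32 count of correct answers, then their u32 indices.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct MCTask {
    std::string              question;
    std::vector<std::string> answers;
    std::vector<uint32_t>    correct;  // indices into answers; one or more entries
};

static void write_u32(std::ofstream & out, uint32_t v) {
    out.write(reinterpret_cast<const char *>(&v), sizeof(v));
}

static void write_str(std::ofstream & out, const std::string & s) {
    write_u32(out, (uint32_t)s.size());
    out.write(s.data(), (std::streamsize)s.size());
}

void serialize_task(std::ofstream & out, const MCTask & task) {
    write_str(out, task.question);
    write_u32(out, (uint32_t)task.answers.size());
    for (const auto & a : task.answers) {
        write_str(out, a);
    }
    write_u32(out, (uint32_t)task.correct.size());
    for (uint32_t i : task.correct) {
        write_u32(out, i);
    }
}
```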

@ikawrakow ikawrakow changed the title Add TruthfulQA benchmark - multiple choice variiant with single correct answer Add TruthfulQA benchmark - multiple choice variant with single correct answer Jan 20, 2024
@cmp-nct
Contributor

cmp-nct commented Jan 20, 2024

Nice work

I think JSON would be the best choice for all those kinds of benchmarks.
We already have nlohmann in use in quite a few examples, and the library is a charm to deal with in C++.
Converting other formats to "our" JSON format would also be possible with a simple script, given how accessible JSON is.

I'd also not include it in llama.cpp directly; it could be an optional dependency, added behind a #define flag, so that only the actual benchmarking tool(s) include it.

@ikawrakow
Contributor Author

I think JSON would be the best choice for all those kinds of benchmarks.

I knew someone would bring up JSON :-)

The code that reads the binary data is 24 lines in 3 functions. In comparison, nlohmann's json.hpp is 24,596 LOC (and I find it only in the server example). But let's say we agreed that it needs to be JSON because binary files are scary. This is not my repo, so @ggerganov should express his opinion, but if it were mine, I would definitely not want to have N > 1 copies of json.hpp in it, which would make the whole project have this as a dependency. Would I add a 25k LOC dependency to my project to replace 24 LOC reading binary data that is very unlikely to change? Hmm, not sure.

The way it is now, it handles ARC, MMLU, TruthfulQA, and HellaSwag. I have posted test/validation datasets in https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp. It could theoretically also handle Winogrande, but I would leave that one to a separate implementation due to the slightly different probability evaluation (it uses partial evaluation). Converting to JSON is trivial: just copy/paste the 24 LOC into a .cpp file and use json.hpp to dump the data.
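For reference, dumping such a task with nlohmann::json really does take only a handful of lines; a sketch, with purely illustrative field names:

```cpp
// Sketch: dump one multiple-choice task to JSON with nlohmann::json.
// Field names are illustrative only.
#include <fstream>
#include <string>
#include <vector>

#include <nlohmann/json.hpp>

void dump_task_to_json(const std::string & question,
                       const std::vector<std::string> & answers,
                       const std::vector<int> & correct,
                       const std::string & path) {
    nlohmann::json j;
    j["question"] = question;   // implicit conversion from std::string
    j["answers"]  = answers;    // implicit conversion from std::vector<std::string>
    j["correct"]  = correct;    // indices of the correct answer(s)

    std::ofstream out(path);
    out << j.dump(2) << '\n';   // pretty-print with 2-space indentation
}
```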

@ikawrakow ikawrakow changed the title Add TruthfulQA benchmark - multiple choice variant with single correct answer Add ability to evaluate multiple choice tasks Jan 21, 2024
@ikawrakow ikawrakow marked this pull request as ready for review January 21, 2024 09:53
Owner

@ggerganov ggerganov left a comment

The binary format is great, no need for json

examples/perplexity/perplexity.cpp (outdated review comment, resolved)
@ikawrakow ikawrakow merged commit 7dcbe39 into master Jan 21, 2024
43 of 47 checks passed
@ikawrakow ikawrakow deleted the ik/truthfull_qa branch January 21, 2024 12:42
@cmp-nct
Contributor

cmp-nct commented Jan 21, 2024

It was no complaint.
I would not make JSON a dependency for llama.cpp itself, just for the benchmark tool. Given that we already have the server with JSON support, the library is already available in the repo, and it can be conditionally included behind a #define.

My reasoning is that a binary format cannot be viewed; it needs a special viewer someone has to write.
It cannot be edited; it needs a compiler/converter to produce it, and people have to learn how to find and use that tool.
The benchmarks are human-readable text, so keeping them human readable removes a barrier to their use.

People might want to use their own custom benchmarks, for example. JSON means anyone can just start working on it; bin means you have to have a much higher level of competence (currently no converter available?).

@Nexesenex
Contributor

Could you make and upload an ARC-Mix .bin please, @ikawrakow?

@ikawrakow
Contributor Author

Could you make and upload an ARC-Mix .bin please, @ikawrakow?

Done.

It cannot be edited;

This is a feature, not a bug :-)

People might want to use their own custom benchmarks, for example. JSON means anyone can just start working on it; bin means you have to have a much higher level of competence (currently no converter available?).

I have added a simple demo program to the repository that uses nlohmann::json to convert to JSON. Simply:

g++ -o convert convert.cpp
./convert some_file.bin some_file.json

@Nexesenex
Contributor

Nexesenex commented Jan 28, 2024

  • The MMLU test might have some trouble at iteration 1160 (crash after 1159) and at 314 (the test also stopped a few times there).
  • ARC-Mix is not working, unlike ARC-Challenge and ARC-Easy.

Also, a noob question: is it possible to chain several benchmark runs (or even perplexity calculations at different context sizes) via several commands without reloading the model?

@cmp-nct
Contributor

cmp-nct commented Jan 29, 2024

Now it's a feature :-)
nice to see

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr's available
inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Feb 3, 2024

@ikawrakow, @cmp-nct

Just want to see if I'm missing anything - I would like to do exactly as mentioned above, by making a custom benchmark. Is the above only to convert .bin -> JSON, and if so, is there a way to go back to .bin?

My use-case is simply that many public benchmarks have contaminated training data, and a custom benchmark could be more relevantly tailored to my usages.

@ikawrakow
Contributor Author

@strawberrymelonpanda

Somebody needs to modify the tool to be able to load JSON, store the data into the MultiplChoice struct, and then use the provided serialize function to store it in the binary format. Not sure that someone will be me, as I genuinely dislike working with JSON.
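For anyone who wants to attempt this, the JSON side could look roughly like the sketch below; the struct and field names here are placeholders, and the real code would fill whatever struct the tool actually uses and then call its serialize function:

```cpp
// Sketch: load tasks from a JSON array of {question, answers, correct} objects.
// The struct and field names are placeholders, not the tool's actual ones.
#include <fstream>
#include <string>
#include <vector>

#include <nlohmann/json.hpp>

struct Task {
    std::string              question;
    std::vector<std::string> answers;
    std::vector<int>         correct;
};

std::vector<Task> load_tasks(const std::string & path) {
    std::ifstream  in(path);
    nlohmann::json j;
    in >> j;                     // parse the whole file

    std::vector<Task> tasks;
    for (const auto & e : j) {   // iterate over the top-level array
        Task t;
        t.question = e.at("question").get<std::string>();
        t.answers  = e.at("answers").get<std::vector<std::string>>();
        t.correct  = e.at("correct").get<std::vector<int>>();
        tasks.push_back(std::move(t));
    }
    return tasks;
}
```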

@strawberrymelonpanda
Contributor

Understood, and thanks.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 16, 2024

For anyone who comes across this topic, I've managed to put together a JSON->bin encoder. I've tested it both with my own multi-choice questions, and by decoding and re-encoding a .bin file and getting the same results back.

If you're interested in making your own multiple-choice benchmarks, please see the gist here.

Note: As of now, to include a test with fewer than 100 questions, you must make a small edit to llama.cpp/examples/perplexity/perplexity.cpp and change line 1428 to int n_dot = std::max((int) n_task/100, 1);, or you'll get a floating point error.
Edit: PR fix merged!

I've also included a simple text -> JSON formatter to make the process easier.

Q:] QUESTION TEXT
A1:] CORRECT ANSWER TEXT
A2:] ANSWER
A3:] ANSWER
A4:] ANSWER

Q:] QUESTION TEXT
A1:] CORRECT ANSWER TEXT
A2:] ANSWER
A3:] ANSWER
A4:] ANSWER
python tojson.py custom-test.txt custom-test.json
./encode custom-test.json custom-test.bin
./perplexity -m model -bf custom-test.bin --multiple-choice

@ikawrakow Please feel free to include either script in the Readme of your repo, the repo itself, or even as part of convert.cpp, if you wish to.
