Evaluation / benchmark mode #2

Closed
iiLaurens opened this issue Apr 19, 2023 · 1 comment · Fixed by #5

Comments

@iiLaurens

This is great stuff! I was wondering if you could also add an evaluation mode, for example to calculate perplexity for language models. Sometimes GPTQ doesn't work well and performance is hurt significantly, so there is a risk that someone uses a badly quantized model without knowing it. Some kind of method to calculate metrics that compare the original model with the quantized model would be a great help.
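A minimal sketch of the kind of check being asked for, assuming Hugging Face `transformers` causal LMs and a plain list of evaluation texts; the model names, texts, and helper function below are illustrative placeholders, not part of this project's API:

```python
# Hypothetical sketch: compare perplexity of an original and a quantized model.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, tokenizer, texts, max_length=1024, device="cuda"):
    """Token-weighted average perplexity of `model` over raw text strings."""
    model.to(device).eval()
    nll_sum, token_count = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length)
            input_ids = enc.input_ids.to(device)
            if input_ids.size(1) < 2:
                continue
            # Passing labels=input_ids makes the model return the mean
            # cross-entropy loss over the shifted tokens.
            loss = model(input_ids, labels=input_ids).loss
            n_tokens = input_ids.size(1) - 1
            nll_sum += loss.item() * n_tokens
            token_count += n_tokens
    return math.exp(nll_sum / token_count)


# Usage (placeholders): lower perplexity is better; a large gap between the
# two numbers flags a badly quantized model.
# tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
# fp16_model = AutoModelForCausalLM.from_pretrained(
#     "facebook/opt-125m", torch_dtype=torch.float16)
# print("fp16 ppl:", perplexity(fp16_model, tokenizer, eval_texts))
# print("gptq ppl:", perplexity(quantized_model, tokenizer, eval_texts))
```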

@PanQiWei
Collaborator

PanQiWei commented Apr 19, 2023

Hi @iiLaurens, thanks for your suggestion! I will add an evaluation mode to my todo list and perhaps add it to this project this weekend.
I think the diversity of the samples used to quantize a model is key to the final quality, and I have also found that the number of samples has some influence. Thus, in practice I usually use a combination of an instruction-following dataset and a chat dataset, with thousands of samples.
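A rough sketch of assembling such a mixed calibration set, assuming the quantizer accepts tokenized examples as dicts of `input_ids` and `attention_mask`; the tokenizer name, sample counts, and the exact example format expected by the quantizer are assumptions for illustration only:

```python
# Hypothetical sketch: mix a few thousand samples from an instruction-following
# corpus and a chat corpus to build a diverse calibration set.
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")


def to_examples(texts, n_samples, max_length=2048, seed=0):
    """Sample raw strings and tokenize them into calibration examples."""
    sampled = random.Random(seed).sample(texts, min(n_samples, len(texts)))
    examples = []
    for text in sampled:
        enc = tokenizer(text, truncation=True, max_length=max_length,
                        return_tensors="pt")
        examples.append({"input_ids": enc.input_ids,
                         "attention_mask": enc.attention_mask})
    return examples


# instruction_texts / chat_texts are lists of raw strings loaded elsewhere.
# calibration = to_examples(instruction_texts, 2000) + to_examples(chat_texts, 2000)
# random.shuffle(calibration)  # then pass the examples to the quantizer
```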

PanQiWei linked a pull request Apr 22, 2023 that will close this issue