
Benchmark HyDE for Open Source LLMs #21

Open
3 tasks done
AndreSlavescu opened this issue Nov 28, 2023 · 1 comment

Comments


AndreSlavescu commented Nov 28, 2023

Goals

  • collect data on HyDE effectiveness with open source LLMs

Tested LLMs

  • Mistral 7B instruct v0.2
  • Llama2 7B chat hf
  • Zephyr-7B beta
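For context on the `iters` settings in the results below: HyDE builds the search vector by averaging the query embedding with the embeddings of N LLM-generated hypothetical documents, and `iters` corresponds to that N. A minimal sketch of the averaging step, where `generate` and `encode` are stand-ins (assumptions, not real APIs) for one of the LLMs above and a dense encoder such as Contriever:

```python
import numpy as np

def hyde_query_vector(query, generate, encode, n_generations=8):
    """Average the query embedding with embeddings of N hypothetical documents.

    `generate` (LLM) and `encode` (dense encoder, e.g. Contriever) are
    placeholders for this sketch; n_generations plays the role of `iters`.
    """
    docs = [generate(query) for _ in range(n_generations)]
    vectors = [encode(text) for text in [query] + docs]
    return np.mean(vectors, axis=0)

# Toy stand-ins so the sketch runs without loading a model:
fake_generate = lambda q: q + " hypothetical answer"
fake_encode = lambda t: np.array([len(t), t.count(" ")], dtype=float)

vec = hyde_query_vector("what is hyde", fake_generate, fake_encode, n_generations=2)
```

With a real encoder, the averaged vector is then used for nearest-neighbor search in place of the plain query embedding.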

Benchmark Results

Output Quality (Contriever)

  • Llama2 7B chat hf results (iters = 1):
    map all 0.3118
    ndcg_cut_10 all 0.4728
    recall_1000 all 0.7900

  • Llama2 7B chat hf results (iters = 8):
    map all 0.3722
    ndcg_cut_10 all 0.5561
    recall_1000 all 0.8185

  • Mistral 7B instruct v0.2 (iters = 1):
    map all 0.3201
    ndcg_cut_10 all 0.4918
    recall_1000 all 0.8021

  • Mistral 7B instruct v0.2 (iters = 8):
    map all 0.3725
    ndcg_cut_10 all 0.5578
    recall_1000 all 0.8319

  • Zephyr-7B beta (iters = 1):
    map all 0.2368
    ndcg_cut_10 all 0.3935
    recall_1000 all 0.7286

  • Zephyr-7B beta (iters = 8):
    map all 0.3613
    ndcg_cut_10 all 0.5231
    recall_1000 all 0.8196

Output Quality (BM25)

  • Llama2 7B chat hf results (iters = 1):
    map all 0.3291
    ndcg_cut_10 all 0.5293
    recall_1000 all 0.8022

  • Llama2 7B chat hf results (iters = 8):
    map all 0.3463
    ndcg_cut_10 all 0.5554
    recall_1000 all 0.8272

  • Mistral 7B instruct v0.2 (iters = 1):
    map all 0.3602
    ndcg_cut_10 all 0.5238
    recall_1000 all 0.8401

  • Mistral 7B instruct v0.2 (iters = 8):
    map all 0.3678
    ndcg_cut_10 all 0.5601
    recall_1000 all 0.8380

3-shot HyDE Output Quality (Contriever)

Only run on Mistral 7B instruct v0.2.

  • Mistral 7B instruct v0.2 (iters = 1):
    map all 0.0328
    ndcg_cut_10 all 0.0506
    recall_1000 all 0.1050

  • Mistral 7B instruct v0.2 (iters = 8):
    map all 0.0942
    ndcg_cut_10 all 0.1612
    recall_1000 all 0.2576
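The metric lines above follow trec_eval's `measure query value` output format. A small sketch of parsing such output into a dict, keeping only the `all` aggregate rows (the sample string reuses the Mistral iters = 8 Contriever numbers from above):

```python
def parse_trec_eval(text):
    """Parse `measure  query  value` lines as emitted by trec_eval."""
    metrics = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 3:
            measure, qid, value = parts
            if qid == "all":  # keep only the aggregate over all queries
                metrics[measure] = float(value)
    return metrics

sample = """\
map all 0.3725
ndcg_cut_10 all 0.5578
recall_1000 all 0.8319
"""
results = parse_trec_eval(sample)
```

This makes it easy to collect the per-model runs into one structure when comparing configurations.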

Potential Next Steps

HyDE + AlphaGeometry performance

  • can we beat the score of 25 achieved by DeepMind?
@MXueguang (Member)

| Model | Retriever | N generation | DL19 nDCG@10 | DL19 R@1000 |
|---|---|---|---|---|
| LLaMA2-7B-Chat | Contriever | 1 | | |
| | Contriever | 8 | | |
| | BM25 | 1 | | |
| | BM25 | 8 | | |
| Mistral-7B-instruct-v0.2 | Contriever | 1 | | |
| | Contriever | 8 | | |
| | BM25 | 1 | | |
| | BM25 | 8 | | |
| Zephyr-7B | | | | |
| LLaMA2-13B-Chat | | | | |
| LLaMA2-70B-Chat | | | | |
| (our baseline) | | | | |
| (our methods) | | | | |

We need to fill this table to start.
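One way to start filling it: keep each run in a dict keyed by (model, retriever, N generations) and render the markdown rows programmatically. A sketch seeded with the LLaMA2-7B-Chat numbers posted above (whether those runs were on DL19 is an assumption, and the helper name `render_rows` is hypothetical):

```python
runs = {
    # (model, retriever, n_generations): (nDCG@10, R@1000)
    # Values taken from the runs reported earlier in this issue; treating
    # them as DL19 numbers is an assumption of this sketch.
    ("LLaMA2-7B-Chat", "Contriever", 1): (0.4728, 0.7900),
    ("LLaMA2-7B-Chat", "Contriever", 8): (0.5561, 0.8185),
    ("LLaMA2-7B-Chat", "BM25", 1): (0.5293, 0.8022),
    ("LLaMA2-7B-Chat", "BM25", 8): (0.5554, 0.8272),
}

def render_rows(runs):
    """Render the runs dict as a markdown table matching the columns above."""
    lines = [
        "| Model | Retriever | N generation | DL19 nDCG@10 | DL19 R@1000 |",
        "|---|---|---|---|---|",
    ]
    for (model, retriever, n), (ndcg, recall) in runs.items():
        lines.append(f"| {model} | {retriever} | {n} | {ndcg:.4f} | {recall:.4f} |")
    return "\n".join(lines)

table = render_rows(runs)
```

As the remaining models are benchmarked, their tuples can be added to `runs` and the table regenerated.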
