
Benchmark HyDE for Open Source LLMs #21

Open
3 tasks done
AndreSlavescu opened this issue Nov 28, 2023 · 1 comment

Comments


AndreSlavescu commented Nov 28, 2023

Goals

  • collect data on HyDE effectiveness with open source LLMs

Tested LLMs

  • Mistral 7B instruct v0.2
  • Llama2 7B chat hf
  • Zephyr-7B beta
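For context on the `iters` settings in the results below: HyDE builds the search vector by averaging the query embedding with the embeddings of N LLM-generated hypothetical documents, and `iters` corresponds to that N. A minimal sketch of the averaging step, where `generate` and `encode` are stand-ins (assumptions, not real APIs) for one of the LLMs above and a dense encoder such as Contriever:

```python
import numpy as np

def hyde_query_vector(query, generate, encode, n_generations=8):
    """Average the query embedding with embeddings of N hypothetical documents.

    `generate` (LLM) and `encode` (dense encoder, e.g. Contriever) are
    placeholders for this sketch; n_generations plays the role of `iters`.
    """
    docs = [generate(query) for _ in range(n_generations)]
    vectors = [encode(text) for text in [query] + docs]
    return np.mean(vectors, axis=0)

# Toy stand-ins so the sketch runs without loading a model:
fake_generate = lambda q: q + " hypothetical answer"
fake_encode = lambda t: np.array([len(t), t.count(" ")], dtype=float)

vec = hyde_query_vector("what is hyde", fake_generate, fake_encode, n_generations=2)
```

With a real encoder, the averaged vector is then used for nearest-neighbor search in place of the plain query embedding.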

Benchmark Results

Output Quality (Contriever)

  • Llama2 7B chat hf results (iters = 1):
    map all 0.3118
    ndcg_cut_10 all 0.4728
    recall_1000 all 0.7900

  • Llama2 7B chat hf results (iters = 8):
    map all 0.3722
    ndcg_cut_10 all 0.5561
    recall_1000 all 0.8185

  • Mistral 7B instruct v0.2 (iters = 1):
    map all 0.3201
    ndcg_cut_10 all 0.4918
    recall_1000 all 0.8021

  • Mistral 7B instruct v0.2 (iters = 8):
    map all 0.3725
    ndcg_cut_10 all 0.5578
    recall_1000 all 0.8319

  • Zephyr-7B beta (iters = 1):
    map all 0.2368
    ndcg_cut_10 all 0.3935
    recall_1000 all 0.7286

  • Zephyr-7B beta (iters = 8):
    map all 0.3613
    ndcg_cut_10 all 0.5231
    recall_1000 all 0.8196

Output Quality (BM25)

  • Llama2 7B chat hf results (iters = 1):
    map all 0.3291
    ndcg_cut_10 all 0.5293
    recall_1000 all 0.8022

  • Llama2 7B chat hf results (iters = 8):
    map all 0.3463
    ndcg_cut_10 all 0.5554
    recall_1000 all 0.8272

  • Mistral 7B instruct v0.2 (iters = 1):
    map all 0.3602
    ndcg_cut_10 all 0.5238
    recall_1000 all 0.8401

  • Mistral 7B instruct v0.2 (iters = 8):
    map all 0.3678
    ndcg_cut_10 all 0.5601
    recall_1000 all 0.8380

3-shot HyDE Output Quality (Contriever)

Only run on Mistral 7B instruct v0.2.

  • Mistral 7B instruct v0.2 (iters = 1):
    map all 0.0328
    ndcg_cut_10 all 0.0506
    recall_1000 all 0.1050

  • Mistral 7B instruct v0.2 (iters = 8):
    map all 0.0942
    ndcg_cut_10 all 0.1612
    recall_1000 all 0.2576
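The metric lines above follow trec_eval's `measure query value` output format. A small sketch of parsing such output into a dict, keeping only the `all` aggregate rows (the sample string reuses the Mistral iters = 8 Contriever numbers from above):

```python
def parse_trec_eval(text):
    """Parse `measure  query  value` lines as emitted by trec_eval."""
    metrics = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 3:
            measure, qid, value = parts
            if qid == "all":  # keep only the aggregate over all queries
                metrics[measure] = float(value)
    return metrics

sample = """\
map all 0.3725
ndcg_cut_10 all 0.5578
recall_1000 all 0.8319
"""
results = parse_trec_eval(sample)
```

This makes it easy to collect the per-model runs into one structure when comparing configurations.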

Potential Next Steps

HyDE + AlphaGeometry performance

  • can we beat the score of 25 achieved by DeepMind?
@MXueguang (Member)

| Model | Retriever | N generation | DL19 nDCG@10 | DL19 R@1000 |
|---|---|---|---|---|
| LLaMA2-7B-Chat | Contriever | 1 | | |
| | Contriever | 8 | | |
| | BM25 | 1 | | |
| | BM25 | 8 | | |
| Mistral-7B-instruct-v0.2 | Contriever | 1 | | |
| | Contriever | 8 | | |
| | BM25 | 1 | | |
| | BM25 | 8 | | |
| Zephyr-7B | | | | |
| LLaMA2-13B-Chat | | | | |
| LLaMA2-70B-Chat | | | | |
| (our baseline) | | | | |
| (our methods) | | | | |

We need to fill this table to start.
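One way to start filling it: keep each run in a dict keyed by (model, retriever, N generations) and render the markdown rows programmatically. A sketch seeded with the LLaMA2-7B-Chat numbers posted above (whether those runs were on DL19 is an assumption, and the helper name `render_rows` is hypothetical):

```python
runs = {
    # (model, retriever, n_generations): (nDCG@10, R@1000)
    # Values taken from the runs reported earlier in this issue; treating
    # them as DL19 numbers is an assumption of this sketch.
    ("LLaMA2-7B-Chat", "Contriever", 1): (0.4728, 0.7900),
    ("LLaMA2-7B-Chat", "Contriever", 8): (0.5561, 0.8185),
    ("LLaMA2-7B-Chat", "BM25", 1): (0.5293, 0.8022),
    ("LLaMA2-7B-Chat", "BM25", 8): (0.5554, 0.8272),
}

def render_rows(runs):
    """Render the runs dict as a markdown table matching the columns above."""
    lines = [
        "| Model | Retriever | N generation | DL19 nDCG@10 | DL19 R@1000 |",
        "|---|---|---|---|---|",
    ]
    for (model, retriever, n), (ndcg, recall) in runs.items():
        lines.append(f"| {model} | {retriever} | {n} | {ndcg:.4f} | {recall:.4f} |")
    return "\n".join(lines)

table = render_rows(runs)
```

As the remaining models are benchmarked, their tuples can be added to `runs` and the table regenerated.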
