Discrepancies in Model Evaluation Results for Mistral-7B-Instruct-v0.2 and phi-3-mini-128-instruct #41
This one I don't have an immediate clue because I actually have not used HF to produce results so far. It could be a bug in the HF code.
One potential reason I can guess for why the new results are better:
@JialeTomTian Let's optimize the post-processing here: repoqa/repoqa/compute_score.py Line 64 in 1e97925
@zyzzzz-123 Would be nice if you could compare our results. We store all model outputs at: https://github.com/evalplus/repoqa/releases/download/dev-results/ntoken_16384-output.zip If they share the same prefix, then #41 (comment) should be the direction. Otherwise, it can be a problem of configurations.
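The prefix check suggested above can be sketched as below. This is only an illustration of the idea (compare raw generations from both runs and see how far they agree); the example strings and any file layout are assumptions, not RepoQA's actual output schema:

```python
# Sketch: measure the shared prefix of two model outputs. If the raw
# generations agree and only the tail differs, the discrepancy likely
# comes from post-processing/scoring rather than from generation.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

# Hypothetical outputs from two runs of the same model:
ours = "def find(x):\n    return x</s> extra tokens..."
theirs = "def find(x):\n    return x</s>"
print(shared_prefix_len(ours, theirs))  # length of the agreeing prefix
```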
Could you possibly run any of these models on the HF backend? This would help me determine whether the issue is with the HF-backend code or if it's related to a configuration issue on my end.
I'll proceed with this and update you as soon as I can.
Applied post-processing by cutting the response at the single-turn EOF. The scores remained the same.
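The post-processing described above can be sketched as follows. The stop-marker list here is an assumption for illustration; the actual markers depend on each model's chat template (e.g. `</s>` for Mistral, `<|end|>` for Phi-3), and this is not RepoQA's actual implementation:

```python
# Sketch: truncate a generated response at the first end-of-turn marker
# so that trailing chat-template text does not affect scoring.

def cut_at_single_turn_eof(
    response: str,
    stop_markers=("</s>", "<|end|>", "<|endoftext|>"),  # assumed markers
) -> str:
    cut = len(response)
    for marker in stop_markers:
        idx = response.find(marker)
        if idx != -1:
            cut = min(cut, idx)  # cut at the earliest marker found
    return response[:cut]

print(cut_at_single_turn_eof("def add(a, b):\n    return a + b</s>junk"))
```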
@ganler Could you provide more details about the configuration issue? Are there any steps to diagnose it?
@zyzzzz-123 I used to run these models using:
It might be due to an update of the pre-processing cache. Let me re-run your models to confirm. But in general, I think the higher scores are the correct ones (bugs should make things worse).
I am re-running phi-3 and mistral-7b-instruct-v0.2.
Many thanks for your prompt reply! It would be great if you could run a model on the Hugging Face backend as well, because I got bad results on all the models I have tried, including
@zyzzzz-123 Thanks! Yeah, if you got bad results on HF, that must be a bug, so please stick with the vLLM backend for now. :) I created #42 to track it, but it's gonna take a while to fix. Any contributions are welcome; the HF code is available at: https://github.com/evalplus/repoqa/blob/main/repoqa/provider/hf.py
Using local vllm 0.4.2 on 2xA6000 |
@zyzzzz-123 Do you want to compare the model outputs between yours and ours, as I suggested in #41 (comment)? Any insights from the diff might be interesting.
BTW I am also using the latest commit which you can install via:
Globally updated the leaderboard. https://evalplus.github.io/repoqa.html |
Closing this issue, as the Hugging Face backend has been fixed in #42.
Hi,
I recently explored your benchmark RepoQA, which I found to be excellent for evaluating the long-context code understanding capabilities of LLMs. Eager to test it myself, I ran evaluations with the Mistral-7B-Instruct-v0.2 and phi-3-mini-128-instruct models. However, my results differed from those reported on your website, and I noticed significant discrepancies between the HF backend and the vLLM backend.
Here are my results:
phi-3-mini-128-instruct:
Mistral-7B-Instruct-v0.2 :
I just used the following commands to install the environment and run the experiments:
repoqa.search_needle_function --model "Mistral-7B-Instruct-v0.2" --backend vllm --trust-remote-code
Could you provide any insights into why there might be such differences in the results? Is there a possibility of environmental or configuration differences that could influence the outcomes?
Thank you for your assistance.