
Discrepancies in Model Evaluation Results for Mistral-7B-Instruct-v0.2 and phi-3-mini-128-instruct #41

Closed
zyzzzz-123 opened this issue May 11, 2024 · 19 comments

Comments

@zyzzzz-123

Hi,

I recently explored your benchmark RepoQA, which I found to be an excellent way to evaluate the long-context code understanding capabilities of LLMs. Eager to test it myself, I ran evaluations with the Mistral-7B-Instruct-v0.2 and phi-3-mini-128-instruct models. However, my results differed from those reported on your website, and I noticed significant discrepancies between the hf backend and the vllm backend.

Here are my results:

phi-3-mini-128-instruct:
[screenshot of evaluation results]
Mistral-7B-Instruct-v0.2:
[screenshot of evaluation results]

I used the following commands to install the environment and run the experiments.

# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"

repoqa.search_needle_function --model "Mistral-7B-Instruct-v0.2" --backend vllm --trust-remote-code

Could you provide any insights into why there might be such differences in the results? Is there a possibility of environmental or configuration differences that could influence the outcomes?

Thank you for your assistance.

@ganler
Member

ganler commented May 11, 2024

hf results are worse than vLLM results

I don't have an immediate clue about this one, because I actually have not used HF to produce results so far. It could be a bug in the HF code.

new vLLM results are better than the reported ones

One potential reason why the new results are better:

  1. f0395c0 cuts the model's response (only the next model response is used as the output, whereas initially the output could include a multi-turn response).
  2. The post-processing of the LLM response is still not perfect, and multi-turn results lead to worse scores.

@JialeTomTian Let's optimize the post-processing here:

def sanitize_output(model_output: str, lang: str) -> str:

  1. Based on the model name, cut the response at the single-turn EOF (use hacky_assistant_stop_seq); see the sketch below.
  2. Recompute the scores of the results.
  3. Check whether the new scores are better; if so, make a PR.
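
For reference, here is a minimal sketch of step 1 (not the actual repoqa implementation; it assumes hacky_assistant_stop_seq yields the model's single-turn stop string, and the helper name cut_single_turn is hypothetical):

```python
def cut_single_turn(model_output: str, stop_seq: str) -> str:
    # Keep only the text before the first stop sequence, so any extra
    # multi-turn chatter after the first response is dropped.
    if stop_seq and stop_seq in model_output:
        return model_output.split(stop_seq, 1)[0]
    return model_output

# Hypothetical usage:
#   stop_seq = hacky_assistant_stop_seq(tokenizer)  # e.g. "</s>" or a chat-template marker
#   cleaned = cut_single_turn(raw_output, stop_seq)
```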

@ganler
Member

ganler commented May 11, 2024

@zyzzzz-123 It would be nice if you could compare your outputs with ours.

We store all model outputs at: https://github.com/evalplus/repoqa/releases/download/dev-results/ntoken_16384-output.zip

If they share the same prefix, then #41 (comment) should be the right direction.

Otherwise, it may be a configuration problem.
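
For anyone doing the comparison, a rough sketch along these lines could work, assuming the downloaded outputs and the local outputs are text files with one model response per line (the actual archive layout may differ):

```python
def common_prefix_ratio(a: str, b: str) -> float:
    # Fraction of the shorter string that matches as a shared prefix.
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i / n if n else 1.0

def compare_outputs(mine: str, theirs: str) -> None:
    # Flag responses whose prefixes diverge early; those are the interesting diffs.
    with open(mine) as f1, open(theirs) as f2:
        for idx, (l1, l2) in enumerate(zip(f1, f2)):
            ratio = common_prefix_ratio(l1, l2)
            if ratio < 0.9:
                print(f"response {idx}: shared prefix only {ratio:.0%}")
```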

@zyzzzz-123
Author

I don't have an immediate clue about this one, because I actually have not used HF to produce results so far. It could be a bug in the HF code.

Could you possibly run any of these models on the HF backend? This would help me determine whether the issue is with the HF-backend code or if it's related to a configuration issue on my end.

It would be nice if you could compare your outputs with ours.
If they share the same prefix, then #41 (comment) should be the right direction.

I'll proceed with this and update you as soon as I can.

@JialeTomTian
Collaborator

One potential reason why the new results are better:

f0395c0 cuts the model's response (only the next model response is used as the output, whereas initially the output could include a multi-turn response).
The post-processing of the LLM response is still not perfect, and multi-turn results lead to worse scores.

I applied the post-processing that cuts the response at the single-turn EOF; the scores remained the same.

@zyzzzz-123
Author

@ganler Could you provide more details about the configuration issue? Are there any steps to diagnose it?

@ganler
Member

ganler commented May 13, 2024

@zyzzzz-123 I previously ran these models using:

  1. Local vllm, versions v0.4.1 and v0.4.0
  2. vllm's Docker image for the OpenAI-compatible service, versions v0.4.1 and v0.4.0

@ganler
Member

ganler commented May 13, 2024

It might be because of an update to the pre-processing cache. Let me re-run your models to confirm. In general, though, I think the higher scores are the correct ones (bugs should make things worse).

@ganler
Member

ganler commented May 13, 2024

I am re-running phi-3 and mistral-7b-instruct-v0.2.

@zyzzzz-123
Author

Many thanks for your prompt reply!
I will also rerun the experiments with vllm v0.4.0 to see whether that is the reason.

It would also be very helpful if you could run any model on the huggingface backend, because I got bad results on all the models I tried, including phi-3, mistral-7b-instruct-v0.2, CodeLlama-7b-Instruct-hf, and llama3.

@ganler
Member

ganler commented May 13, 2024

@zyzzzz-123 Thanks! Yeah, if you got bad results on HF, that must be a bug, so please stick with the vLLM backend for now. :)

I created #42 to track it, but it will take a while to fix. Contributions are welcome; the HF code is available at: https://github.com/evalplus/repoqa/blob/main/repoqa/provider/hf.py

@ganler
Member

ganler commented May 13, 2024

Updated Phi-3 results:

[screenshot of updated Phi-3 results]

@ganler
Member

ganler commented May 13, 2024

Updated Mistral-7B-Instruct-v0.2 results:

[screenshot of updated Mistral-7B-Instruct-v0.2 results]

@ganler
Member

ganler commented May 13, 2024

Using local vllm 0.4.2 on 2xA6000

@ganler
Member

ganler commented May 13, 2024

@zyzzzz-123 Do you want to compare your model outputs with ours, as I suggested here? #41 (comment)

Any insights from the diff might be interesting.

@ganler
Member

ganler commented May 13, 2024

BTW, I am also using the latest commit, which you can install via:

pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main"

@ganler
Member

ganler commented May 13, 2024

I found out why the phi3 score is a bit low at the HEAD commit. In ab0ba02 the regex pattern requires the output lines to start with "```", but phi3 likes adding a space ahead of it.

So I fixed this by doing a model_output.strip(), and the phi3 score looks much better:

[screenshot of improved phi3 results]

966ff16
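
As a rough illustration of the strip fix described above (not the exact repoqa code; the function name and pattern here are illustrative), stripping the raw output lets a line-anchored fence pattern match even when the model prepends a space:

```python
import re

# A line-anchored fence pattern similar in spirit to the one described above:
# the opening "```" must start at the beginning of a line.
FENCE_RE = re.compile(r"^```[a-zA-Z0-9_+-]*\n(.*?)\n```", re.DOTALL | re.MULTILINE)

def extract_fenced_code(model_output: str) -> str:
    # phi3/mistral under vLLM sometimes emit " ```python" with a leading space,
    # which a line-anchored pattern misses, so strip the raw output first.
    cleaned = model_output.strip()
    match = FENCE_RE.search(cleaned)
    return match.group(1) if match else cleaned
```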

@ganler
Member

ganler commented May 13, 2024

Same pattern for mistral models. They seem to produce spaces before "```" in vLLM for some reason.

[screenshot of updated mistral results]

@ganler
Member

ganler commented May 13, 2024

Globally updated the leaderboard. https://evalplus.github.io/repoqa.html

@ganler
Member

ganler commented May 19, 2024

Closing this issue as the Huggingface backend has been fixed in #42.

@ganler ganler closed this as completed May 19, 2024