
Discrepancies in Model Evaluation Results for Mistral-7B-Instruct-v0.2 and phi-3-mini-128-instruct #41

Closed
zyzzzz-123 opened this issue May 11, 2024 · 19 comments

Comments

@zyzzzz-123

Hi,

I recently explored your benchmark RepoQA, which I found to be an excellent way to evaluate the long-context code understanding capabilities of LLMs. Eager to test it myself, I ran evaluations with the Mistral-7B-Instruct-v0.2 and phi-3-mini-128-instruct models. However, my results differed from those reported on your website, and I noticed significant discrepancies between the hf backend and the vllm backend.

Here are my results:

phi-3-mini-128-instruct:
[screenshot of evaluation results]
Mistral-7B-Instruct-v0.2:
[screenshot of evaluation results]

I used the following commands to install the environment and run the experiments.

# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"

repoqa.search_needle_function --model "Mistral-7B-Instruct-v0.2" --backend vllm --trust-remote-code

Could you provide any insights into why there might be such differences in the results? Is there a possibility of environmental or configuration differences that could influence the outcomes?

Thank you for your assistance.

@ganler
Member

ganler commented May 11, 2024

hf results are worse than vLLM results

I don't have an immediate clue about this one, because I actually have not used HF to produce results so far. It could be a bug in the HF code.

new vLLM results are better than the reported ones

One potential reason why the new results are better:

  1. f0395c0 cuts the model's response (only the next model response is used as the output, whereas initially the output could include a multi-turn response).
  2. The post-processing of the LLM response is still not perfect, and multi-turn results lead to worse scores.

@JialeTomTian Let's optimize the post-processing here:

def sanitize_output(model_output: str, lang: str) -> str:

  1. Based on the model name, cut the response at the single-turn EOF (use hacky_assistant_stop_seq); see the sketch below.
  2. Recompute the scores of the results.
  3. Check whether the new scores are better; if so, make a PR.
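
For reference, here is a minimal sketch of step 1 (not the actual repoqa implementation; it assumes hacky_assistant_stop_seq yields the model's single-turn stop string, and the helper name cut_single_turn is hypothetical):

```python
def cut_single_turn(model_output: str, stop_seq: str) -> str:
    # Keep only the text before the first stop sequence, so any extra
    # multi-turn chatter after the first response is dropped.
    if stop_seq and stop_seq in model_output:
        return model_output.split(stop_seq, 1)[0]
    return model_output

# Hypothetical usage:
#   stop_seq = hacky_assistant_stop_seq(tokenizer)  # e.g. "</s>" or a chat-template marker
#   cleaned = cut_single_turn(raw_output, stop_seq)
```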

@ganler
Member

ganler commented May 11, 2024

@zyzzzz-123 It would be nice if you could compare your outputs with ours.

We store all model outputs at: https://github.com/evalplus/repoqa/releases/download/dev-results/ntoken_16384-output.zip

If they share the same prefix, then #41 (comment) should be the right direction.

Otherwise, it may be a configuration problem.
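
For anyone doing the comparison, a rough sketch along these lines could work, assuming the downloaded outputs and the local outputs are text files with one model response per line (the actual archive layout may differ):

```python
def common_prefix_ratio(a: str, b: str) -> float:
    # Fraction of the shorter string that matches as a shared prefix.
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i / n if n else 1.0

def compare_outputs(mine: str, theirs: str) -> None:
    # Flag responses whose prefixes diverge early; those are the interesting diffs.
    with open(mine) as f1, open(theirs) as f2:
        for idx, (l1, l2) in enumerate(zip(f1, f2)):
            ratio = common_prefix_ratio(l1, l2)
            if ratio < 0.9:
                print(f"response {idx}: shared prefix only {ratio:.0%}")
```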

@zyzzzz-123
Author

I don't have an immediate clue about this one, because I actually have not used HF to produce results so far. It could be a bug in the HF code.

Could you possibly run any of these models on the HF backend? This would help me determine whether the issue is with the HF-backend code or if it's related to a configuration issue on my end.

It would be nice if you could compare your outputs with ours.
If they share the same prefix, then #41 (comment) should be the right direction.

I'll proceed with this and update you as soon as I can.

@JialeTomTian
Collaborator

One potential reason why the new results are better:

f0395c0 cuts the model's response (only the next model response is used as the output, whereas initially the output could include a multi-turn response).
The post-processing of the LLM response is still not perfect, and multi-turn results lead to worse scores.

I applied the post-processing that cuts the response at the single-turn EOF; the scores remained the same.

@zyzzzz-123
Author

@ganler Could you provide more details about the configuration issue? Are there any steps to diagnose it?

@ganler
Member

ganler commented May 13, 2024

@zyzzzz-123 I previously ran these models using:

  1. Local vllm, versions v0.4.1 and v0.4.0
  2. vllm's Docker image for the OpenAI-compatible service, versions v0.4.1 and v0.4.0

@ganler
Member

ganler commented May 13, 2024

It might be because of an update to the pre-processing cache. Let me re-run your models to confirm. In general, though, I think the higher scores are the correct ones (bugs should make things worse).

@ganler
Member

ganler commented May 13, 2024

I am re-running phi-3 and mistral-7b-instruct-v0.2.

@zyzzzz-123
Author

Many thanks for your prompt reply!
I will also rerun the experiments with vllm v0.4.0 to see whether that is the reason.

It would also be very helpful if you could run any model on the huggingface backend, because I got bad results on all the models I tried, including phi-3, mistral-7b-instruct-v0.2, CodeLlama-7b-Instruct-hf, and llama3.

@ganler
Member

ganler commented May 13, 2024

@zyzzzz-123 Thanks! Yeah, if you got bad results on HF, that must be a bug, so please stick with the vLLM backend for now. :)

I created #42 to track it, but it will take a while to fix. Contributions are welcome; the HF code is available at: https://github.com/evalplus/repoqa/blob/main/repoqa/provider/hf.py

@ganler
Member

ganler commented May 13, 2024

Updated Phi-3 results:

[screenshot of updated Phi-3 results]

@ganler
Member

ganler commented May 13, 2024

Updated Mistral-7B-Instruct-v0.2 results:

[screenshot of updated Mistral-7B-Instruct-v0.2 results]

@ganler
Member

ganler commented May 13, 2024

Using local vllm 0.4.2 on 2xA6000

@ganler
Member

ganler commented May 13, 2024

@zyzzzz-123 Do you want to compare your model outputs with ours, as I suggested here? #41 (comment)

Any insights from the diff might be interesting.

@ganler
Member

ganler commented May 13, 2024

BTW, I am also using the latest commit, which you can install via:

pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main"

@ganler
Member

ganler commented May 13, 2024

I found out why the phi3 score is a bit low at the HEAD commit. In ab0ba02 the regex pattern requires the output lines to start with "```", but phi3 likes adding a space ahead of it.

So I fixed this by doing a model_output.strip(), and the phi3 score looks much better:

[screenshot of improved phi3 results]

966ff16
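
As a rough illustration of the strip fix described above (not the exact repoqa code; the function name and pattern here are illustrative), stripping the raw output lets a line-anchored fence pattern match even when the model prepends a space:

```python
import re

# A line-anchored fence pattern similar in spirit to the one described above:
# the opening "```" must start at the beginning of a line.
FENCE_RE = re.compile(r"^```[a-zA-Z0-9_+-]*\n(.*?)\n```", re.DOTALL | re.MULTILINE)

def extract_fenced_code(model_output: str) -> str:
    # phi3/mistral under vLLM sometimes emit " ```python" with a leading space,
    # which a line-anchored pattern misses, so strip the raw output first.
    cleaned = model_output.strip()
    match = FENCE_RE.search(cleaned)
    return match.group(1) if match else cleaned
```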

@ganler
Member

ganler commented May 13, 2024

Same pattern for mistral models. They seem to produce spaces before "```" in vLLM for some reason.

[screenshot of updated mistral results]

@ganler
Member

ganler commented May 13, 2024

Globally updated the leaderboard. https://evalplus.github.io/repoqa.html

@ganler
Member

ganler commented May 19, 2024

Closing this issue as the Huggingface backend has been fixed in #42.

@ganler ganler closed this as completed May 19, 2024