
Inconsistencies in results in Mistral, CodeLlama and some strange behavior. #165

Closed
Anindyadeep opened this issue Nov 13, 2023 · 11 comments

@Anindyadeep commented Nov 13, 2023

I came across some inconsistencies in results, and while preparing this issue I also found #142. I was trying to reproduce the results for CodeLlama-7B and Mistral-7B. Let's discuss the HumanEval benchmark (the same happens for benchmarks like MBPP and MultiPL-E too).

I would like to discuss these inconsistencies in two parts:

Part 1: Inconsistencies in paper results.

Here are the reported HumanEval results for CodeLlama-7B and Mistral-7B:

CodeLlama-7B (CodeLlama paper): 33.5%
CodeLlama-7B (Mistral paper): 31.1%
CodeLlama-7B (BigCode Leaderboard): 29.9%

Mistral-7B (Mistral paper): 30.5%

When I ran the experiments with the required greedy configuration, I got a score of 29.8%, which matches the leaderboard. However, I still wondered why there is a difference of ~2-3% with respect to the paper. Although #142 (comment) clarifies a part of my doubt.

However, I also want to know: is there a reason why these models' repos (like mistral-src or the codellama repo) do not provide any reproducibility script?

Part 2: Mistral giving a higher result with int-4 (bnb) than with fp16.

I also evaluated the Mistral model, and there I saw a very strange phenomenon. First, here are the evaluation results on HumanEval:

Mistral-7B fp16: 28.6% (deviating by 1.89% from the paper)
Mistral-7B int-4: 29.8%

Now, I made some slight code changes: instead of passing the model precision from the command line, I loaded the model with a custom BitsAndBytes config and passed it to the evaluation engine. Here is the sample code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "filipealmeida/Mistral-7B-v0.1-sharded"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    truncation_side="left",
    padding_side="right",
)

So, instead of using the --load_in_4bit flag, I passed the model loaded with the above changes, since --load_in_4bit just adds this under model_kwargs. The question is: just doing this increases the evaluation score by a large amount, and I wonder why that is the case.
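To make the comparison concrete, here is a minimal sketch of the two loading paths side by side (not the harness's exact code; the flag path is my assumption of what forwarding load_in_4bit through model_kwargs roughly amounts to, and as far as I can tell the bare transformers default is fp4 quantization without double quantization and fp32 compute, unlike the nf4/bf16 config above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "filipealmeida/Mistral-7B-v0.1-sharded"

# Path 1 (flag): roughly what --load_in_4bit amounts to, i.e. forwarding
# load_in_4bit=True as a model kwarg; transformers then falls back to its
# default 4-bit settings (fp4, no double quantization, fp32 compute).
model_flag = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",
)

# Path 2 (explicit config): nf4 with double quantization and bf16 compute,
# exactly as in the snippet above. The two paths quantize and compute
# differently, which is where I suspect the score gap might come from.
# (Loading both models at once is only for illustration.)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_cfg = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)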

Here are the results of both models when evaluated with the BitsAndBytes config above:

Mistral-7B with BitsAndBytes int-4 loading: 30.49%
CodeLlama-7B with BitsAndBytes int-4 loading: 31.71%

What is super strange is that the models now give results very close to those reported in the papers. In all cases I set do_sample to False.

I would love to discuss this and understand what might be going on.

@loubnabnl (Collaborator)

When I ran the experiments with the required greedy configuration, I got a score of 29.8%, which matches the leaderboard. However, I still wondered why there is a difference of ~2-3% with respect to the paper. Although #142 (comment) clarifies a part of my doubt.

Yes, as explained in my comment, it might be due to small differences in post-processing or inference precision. When comparing models it should be fine as long as you use the same framework and pipeline, which is what the leaderboard is intended for, instead of just comparing scores across research papers that might use different implementations.

However, I also want to know: is there a reason why these models' repos (like mistral-src or the codellama repo) do not provide any reproducibility script?

That's up to the Mistral and CodeLLaMa authors to answer ^^

Regarding the differences you're observing with respect to quantization: greedy evaluation might have more noise than top-p sampling with a large number of solutions, e.g. 50, where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

@Anindyadeep (Author)

where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

That's something worth trying out; I will do that. Thanks for clearing up some of my doubts.

@phqtuyen

May I also ask what decoding strategy is applied by default? Does it affect the observed performance? Much appreciated.

@phqtuyen

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

@Anindyadeep (Author)

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

Yes, you are right. But for pass@1 we want greedy decoding (which should give a deterministic response), so the temperature is kept at 0 and top_p is not required, so None should work.

@loubnabnl (Collaborator)

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem to compute pass@1.

You can also use greedy decoding, which is deterministic for the model under the same setup and gives results within 1 or 2 points of top-p sampling with 50 samples. However, if you want to reduce noise, sampling with a large number of samples (e.g. 50 or 100) might be less noisy than greedy, since you give the model more than one chance to solve each problem.
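As a rough sketch of what the two settings correspond to in plain transformers generate kwargs (the harness wraps this differently, so treat the names below as illustrative rather than its exact interface):

# Assumes `model` and `tokenizer` are loaded as in the snippet earlier in
# this thread; `prompt` is a toy stand-in for a HumanEval problem.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy: one deterministic completion per problem.
greedy_out = model.generate(
    **inputs,
    do_sample=False,               # temperature/top_p are ignored here
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)

# Leaderboard-style: nucleus sampling, 50 completions per problem; pass@1
# is then estimated from how many of the 50 pass the unit tests.
sampled_out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=50,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)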

@Anindyadeep (Author) commented Nov 16, 2023

Oh, I see. Actually, I was following the convention from the CodeLlama paper. However, this makes sense; I would like to do that for my evaluation.

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem to compute pass@1.

And I believe the temperature is 0.2, right?

@loubnabnl (Collaborator)

yes

@Anindyadeep (Author) commented Nov 16, 2023

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem to compute pass@1.

I am a bit confused here: doesn't pass@1 by definition mean using only one generation from the model? Generating 50 samples would mean I am getting pass@1 and pass@10, but yeah...

@loubnabnl (Collaborator)

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator
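For reference, a minimal numpy sketch of that unbiased estimator (following the formula quoted above, written in the numerically stable product form; the example numbers are made up):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n - c, k) / C(n, k).

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # too few failing samples to draw k failures
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Made-up example: 50 samples per problem, 15 of them correct.
print(pass_at_k(50, 15, 1))   # 0.3  (same as c / n for k = 1)
print(pass_at_k(50, 15, 10))  # ~0.98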

@Anindyadeep (Author)

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator

Ahh I see, got it. Thanks for this reference, appreciate it :)
