
Inconsistencies in results in Mistral, CodeLlama and some strange behavior. #165

Closed
Anindyadeep opened this issue Nov 13, 2023 · 11 comments

@Anindyadeep commented Nov 13, 2023

I came across some inconsistencies in results, and while preparing this issue I also found #142. I was trying to reproduce the results for CodeLlama-7B and Mistral-7B. Let's discuss the HumanEval benchmark (the same happens for benchmarks like MBPP and MultiPL-E too).

I would like to discuss these inconsistencies in two parts:

Part 1: Inconsistencies in paper results.

Here are the reported HumanEval results for CodeLlama-7B and Mistral-7B:

CodeLlama-7B (CodeLlama paper): 33.5%
CodeLlama-7B (Mistral paper): 31.1%
CodeLlama-7B (BigCode Leaderboard): 29.9%

Mistral-7B (Mistral paper): 30.5%

When I ran the experiments with the required greedy configuration, I got a score of 29.8%, which matches the leaderboard. However, I still wondered why there is a difference of ~2-3% with respect to the paper. Although #142 (comment) clarifies a part of my doubt.

However, I also want to know: is there a reason why these models' repos (like mistral-src or the codellama repo) do not provide any reproducibility script?

Part 2: Mistral giving a higher result with int-4 (bnb) than with fp16.

I also evaluated the Mistral model, and there I saw a very strange phenomenon. First, here are the evaluation results on HumanEval:

Mistral-7B fp16: 28.6% (deviating by 1.89% from the paper)
Mistral-7B int-4: 29.8%

Now, I made some slight code changes: instead of passing the model precision from the command line, I loaded the model with a custom BitsAndBytes config and passed it to the evaluation engine. Here is the sample code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "filipealmeida/Mistral-7B-v0.1-sharded"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    truncation_side="left",
    padding_side="right",
)

So, instead of using the --load_in_4bit flag, I passed the model loaded with the above changes, since --load_in_4bit just adds this under model_kwargs. The question is: just doing this increases the evaluation score by a large amount, and I wonder why that is the case.
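To make the comparison concrete, here is a minimal sketch of the two loading paths side by side (not the harness's exact code; the flag path is my assumption of what forwarding load_in_4bit through model_kwargs roughly amounts to, and as far as I can tell the bare transformers default is fp4 quantization without double quantization and fp32 compute, unlike the nf4/bf16 config above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "filipealmeida/Mistral-7B-v0.1-sharded"

# Path 1 (flag): roughly what --load_in_4bit amounts to, i.e. forwarding
# load_in_4bit=True as a model kwarg; transformers then falls back to its
# default 4-bit settings (fp4, no double quantization, fp32 compute).
model_flag = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",
)

# Path 2 (explicit config): nf4 with double quantization and bf16 compute,
# exactly as in the snippet above. The two paths quantize and compute
# differently, which is where I suspect the score gap might come from.
# (Loading both models at once is only for illustration.)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_cfg = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)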

Here are the results of both models when evaluated with the BitsAndBytes config above:

Mistral-7B with BitsAndBytes int-4 loading: 30.49%
CodeLlama-7B with BitsAndBytes int-4 loading: 31.71%

What is super strange is that the models now give results very close to those reported in the papers. In all cases I set do_sample to False.

I would love to discuss this and understand what might be going on.

@loubnabnl (Collaborator)

When I ran the experiments with the required greedy configuration, I got a score of 29.8%, which matches the leaderboard. However, I still wondered why there is a difference of ~2-3% with respect to the paper. Although #142 (comment) clarifies a part of my doubt.

Yes, as explained in my comment, it might be due to small differences in post-processing or inference precision. When comparing models it should be fine as long as you use the same framework and pipeline, which is what the leaderboard is intended for, instead of just comparing scores across research papers that might use different implementations.

However, I also want to know: is there a reason why these models' repos (like mistral-src or the codellama repo) do not provide any reproducibility script?

That's up to the Mistral and CodeLLaMa authors to answer ^^

Regarding the differences you're observing with respect to quantization: greedy evaluation might have more noise than top-p sampling with a large number of solutions, e.g. 50, where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

@Anindyadeep (Author)

where you give the model 50 chances to solve each problem instead of one. Maybe try the evaluation again with this setup?

That's something worth trying out; I will do that. Thanks for clearing up some of my doubts.

@phqtuyen

May I also ask what decoding strategy is applied by default? Does it affect the observed performance? Much appreciated.

@phqtuyen

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

@Anindyadeep (Author)

After re-reading the leaderboard About section and the Hugging Face document (https://huggingface.co/blog/how-to-generate), I assume the decoding strategy used is nucleus sampling with temperature=0.2 and top_p=0.95? Thanks.

Yes, you are right. But for pass@1 we want greedy decoding (which should give a deterministic response), so the temperature is kept at 0 and top_p is not required, so None should work.

@loubnabnl (Collaborator)

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem to compute pass@1.

You can also use greedy decoding, which is deterministic for the model under the same setup and gives results within 1 or 2 points of top-p sampling with 50 samples. However, if you want to reduce noise, sampling with a large number of samples (e.g. 50 or 100) might be less noisy than greedy, since you give the model more than one chance to solve each problem.
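As a rough sketch of what the two settings correspond to in plain transformers generate kwargs (the harness wraps this differently, so treat the names below as illustrative rather than its exact interface):

# Assumes `model` and `tokenizer` are loaded as in the snippet earlier in
# this thread; `prompt` is a toy stand-in for a HumanEval problem.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy: one deterministic completion per problem.
greedy_out = model.generate(
    **inputs,
    do_sample=False,               # temperature/top_p are ignored here
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)

# Leaderboard-style: nucleus sampling, 50 completions per problem; pass@1
# is then estimated from how many of the 50 pass the unit tests.
sampled_out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=50,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)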

@Anindyadeep (Author) commented Nov 16, 2023

Oh, I see. Actually, I was following the convention from the CodeLlama paper. However, this makes sense; I would like to do that for my evaluation.

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem to compute pass@1.

And I believe the temperature is 0.2, right?

@loubnabnl (Collaborator)

yes

@Anindyadeep (Author) commented Nov 16, 2023

The leaderboard uses top-p sampling with top_p=0.95 to generate 50 samples per problem to compute pass@1.

I am a bit confused here: doesn't pass@1 by definition mean using only one generation from the model? Generating 50 samples would mean I am getting pass@1 and pass@10, but yeah...

@loubnabnl (Collaborator)

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator
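For reference, a minimal numpy sketch of that unbiased estimator (following the formula quoted above, written in the numerically stable product form; the example numbers are made up):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n - c, k) / C(n, k).

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # too few failing samples to draw k failures
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Made-up example: 50 samples per problem, 15 of them correct.
print(pass_at_k(50, 15, 1))   # 0.3  (same as c / n for k = 1)
print(pass_at_k(50, 15, 10))  # ~0.98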

@Anindyadeep (Author)

From the Codex paper:

pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. However, computing pass@k in this way can have high variance. Instead, to evaluate pass@k, we generate n ≥ k samples per task (in this paper, we use n = 200 and k ≤ 100), count the number of correct samples c ≤ n which pass unit tests, and calculate the unbiased estimator

Ahh I see, got it. Thanks for this reference, appreciate it :)
