[ggml-quants] Add memsets and other fixes for IQ quants#19861

Merged
ggerganov merged 4 commits into ggml-org:master from bartowski1182:master on Mar 6, 2026

Conversation

@bartowski1182
Contributor

@bartowski1182 bartowski1182 commented Feb 24, 2026

While trying to stop my Qwen3.5 quants from producing a flood of "Oops: found point X not on grid ..." errors, I (with Claude's help) came across a potentially significant issue.

Using gdb, it seems that L is often backed by non-zeroed memory, so when it's read it contains garbage data that sends the quantization awry whenever the search finds no candidates.

With this change I no longer saw ANY "Oops: found point.." errors with Qwen3.5, and the PPL is as expected.

This affected IQ2_XXS, IQ2_XS, IQ3_XXS, and IQ3_S. It seems that it's most often caused when the imatrix data contains a full block of 0s (as is common in MoE models with a large number of experts) or when the block itself is all 0s (for a particularly sparse model)

Now, memset is called on L before it is used.

Additional changes:

L -> Laux: basically a no-op; we only use the returned scale, so either L or Laux works here, since L is immediately memset right after. This change just makes it more obvious that this is a scratch buffer, but I don't mind leaving this line out.

eff_max <= 0 check: guards against dividing by 0 when the scale is 0; we set the scales to 0 and exit early, the same as when max < GROUP_MAX_EPS is triggered.

PPL tested on Qwen3-Coder-Next:

PPL IQ2_XS before:

Final estimate: PPL = 9.3646 +/- 0.16983

PPL IQ2_XS after:

Final estimate: PPL = 9.3703 +/- 0.17014

Difference: +0.0057

IQ2_XXS is now possible to make (with my imatrix at least) whereas before I got:

Oops: found point 16384 not on grid: 0 0 0 0 0 0 0 1
[   9/ 843]           blk.0.ffn_gate_exps.weight - [ 2048,   512,   512,     1], type =   bf16, converting to iq2_xs .. /llama.cpp/ggml/src/ggml-quants.c:3355: fatal error
libggml-base.so.0(+0x1848b)[0x70bdefd5648b]
libggml-base.so.0(ggml_print_backtrace+0x21f)[0x70bdefd568ef]
libggml-base.so.0(ggml_abort+0x152)[0x70bdefd56ac2]
libggml-base.so.0(+0x46b8c)[0x70bdefd84b8c]
libggml-base.so.0(quantize_iq2_xs+0x81)[0x70bdefda3d21]

For sanity, PPL IQ2_XXS:

Final estimate: PPL = 10.4755 +/- 0.19246

For IQ1_S and IQ1_M:

Similar to the above, the assert fires when we have all-zero weights, meaning the search loop failed to find a candidate. We now treat it as a zero block and populate the scales and shifts to avoid uninitialized memory.

This was the assert I was seeing:

blk.7.ffn_down_exps.weight - [ 1024,  4096,   512,     1], type =   bf16, converting to iq1_m .. /llama.cpp/ggml/src/ggml-quants.c:4534: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed

PPL IQ1_S before:

Final estimate: PPL = 15.6349 +/- 0.31023

PPL IQ1_S after:

Final estimate: PPL = 15.6509 +/- 0.31119

Difference: +0.016

ALL PPL calculations done with:

./llama-perplexity -m /models/Qwen3-Coder-Next.gguf --flash-attn on -f /training_data/wiki.test.raw --chunks 100 --batch-size 4096 --ubatch-size 4096

I used Qwen3-Coder-Next because I had existing "before" quants to compare against, and IQ2_XXS previously failed with the "Oops" crash.

Note: I think the PPL change comes from changes to the model conversion; I ran PPL against the full bf16 model.

Conversion done with old release b7936:

Final estimate: PPL = 8.0452 +/- 0.15022

Conversion from master:

Final estimate: PPL = 8.0508 +/- 0.15029

Difference: +0.0056

(in hindsight, wikitext for a coding model wasn't the best dataset... willing to redo the PPLs against a different dataset if necessary)

Disclaimer:

I don't fully understand this area of the code yet. The memset fix is obvious and had immediate, tangible improvements, but the rest is outside my wheelhouse. I'd really appreciate another set of experienced eyes on this; I've provided my test results to show that nothing has been broken, and I have reviewed the change to the best of my ability.

If any additional tests should be run before people are willing to review, I'll happily run them.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 24, 2026
@bartowski1182 bartowski1182 changed the title [WIP] Add memsets and other fixes for IQ quants - NEED REVIEW [ggml-quants] Add memsets and other fixes for IQ quants - NEED REVIEW Feb 24, 2026
@bartowski1182 bartowski1182 changed the title [ggml-quants] Add memsets and other fixes for IQ quants - NEED REVIEW [ggml-quants] Add memsets and other fixes for IQ quants Feb 24, 2026
@compilade compilade self-requested a review February 24, 2026 20:24
@bartowski1182
Contributor Author

bartowski1182 commented Feb 25, 2026

So strangely after that change, IQ2_XXS PPL went down:

~~Final estimate: PPL = 10.2589 +/- 0.18574~~

Difference: -0.2166

but IQ2_XS PPL went up (pretty substantially):

~~Final estimate: PPL = 9.6630 +/- 0.17577~~

Difference: +0.2927

Not sure what to take from that..

Wait, I think I may have used a wrong branch versus the original ones I was testing, ignore this for now

@bartowski1182
Contributor Author

Okay so with the change, IQ2_XS PPL:

Final estimate: PPL = 9.3747 +/- 0.17023

Difference: +0.0044

IQ2_XXS PPL:

10.4916 +/- 0.19312

Difference: +0.0161

So strangely it's worse but only just barely 🤷

@bartowski1182
Contributor Author

bartowski1182 commented Feb 25, 2026

Okay so after compiling again with -O0 to avoid optimizations introducing random noise, the PPL from my initial commit compared to my most recent is identical, so I think the PPL differences (from iterating) can be safely ignored.

I think clearing the L value after make_qp_quants was also effectively a no-op

If scale is negative, as @TheAIBot noted, eff_max will be negative and we'll exit early, so scale is guaranteed to be positive

Then, scale is updated right before Laux is copied onto L

L is then updated if scale > 0, and the only time that couldn't happen is if this made scale negative:

                if (sumq2 > 0 && sumqx*sumqx > best*sumq2) {
                    scale = sumqx/sumq2; best = scale*sumqx;
                    memcpy(L, Laux, 32);
                }

But then as I mentioned, we memcpy onto L anyways

So in summary: the current state is clearer than it was, but the result is effectively identical :)

@bartowski1182 bartowski1182 marked this pull request as ready for review February 25, 2026 18:11
@bartowski1182
Contributor Author

@ggerganov ready to review, I can provide more of a TLDR if you don't want to dig through my larger posts

@bartowski1182
Contributor Author

@ggerganov bump in case you missed it :)

@ggerganov
Member

For normal models that do not have degenerate blocks full of 0s does the PPL match exactly before and after?

@bartowski1182
Contributor Author

bartowski1182 commented Mar 5, 2026

For normal models that do not have degenerate blocks full of 0s does the PPL match exactly before and after?

Tested on Qwen3.5-35B-A3B IQ2_XXS and actually the sha256sum is identical with and without this change (compiled with -O0)

PPL also matches (which I went to test before trying the sha256sum)

edit: Will also test IQ1_M/S since that change is different

@bartowski1182
Contributor Author

Decided to test all just to be sure @ggerganov

IQ1_M sha256sum didn't match, but the PPL did:

Final estimate: PPL = 13.4711 +/- 0.09659

Manually setting shifts[ib] = 0; in the max < GROUP_MAX_EPS_IQ1_M path must result in a tiny but irrelevant change

IQ1_S can't differ, because the only case where it could is one that previously triggered the assert; that said, I ran the comparisons anyway

IQ1_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ2_S, IQ2_XS, and IQ2_XXS all have identical sha256sum before and after this PR on Qwen3.5-35B-A3B

@ggerganov ggerganov merged commit 649f064 into ggml-org:master Mar 6, 2026
78 checks passed