[ggml-quants] Add memsets and other fixes for IQ quants #19861
ggerganov merged 4 commits into ggml-org:master
Conversation
---
Wait, I think I may have used the wrong branch versus the original ones I was testing, ignore this for now.

---
Okay, so with the change:

Difference: +0.0044
Difference: +0.0161

So strangely it's worse, but only just barely 🤷

---
Okay, so after compiling again with ..., I think clearing the ... if ... then ...
But then as I mentioned, ... So in summation: the current state is better and clearer than it was, but it's effectively the exact same result :)

---
@ggerganov ready to review. I can provide more of a TL;DR if you don't want to dig through my larger posts.

---
@ggerganov bump in case you missed it :)

---
For normal models that do not have degenerate blocks full of 0s, does the PPL match exactly before and after?
Tested on Qwen3.5-35B-A3B; PPL also matches (which I went to test before trying the sha256sum).

edit: Will also test IQ1_M/IQ1_S since that change is different.

---
Decided to test all just to be sure @ggerganov, manually setting ...

---
While trying to stop my Qwen3.5 quants from producing a ton of "Oops: found point X not on grid ..." messages, I (and Claude) came across a potentially big issue.

Using gdb, it seems that `L` is often initialized to non-zero memory, so when it's read it contains garbage data that sends the quantization awry whenever no candidates are found during the search. With this change, on Qwen3.5, I no longer saw ANY "Oops: found point ..." errors, and the PPL seems totally as expected.

This affected `IQ2_XXS`, `IQ2_XS`, `IQ3_XXS`, and `IQ3_S`. It seems to be triggered most often when the imatrix data contains a full block of 0s (as is common in MoE models with a large number of experts) or when the block itself is all 0s (for a particularly sparse model).

Now, `memset` is called on `L` before it is used.
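To illustrate the failure mode (a minimal sketch of the bug class, not the actual ggml-quants code — `assign_indices` and its logic are hypothetical stand-ins): a search loop that only writes `L[i]` when a candidate is found leaves stale stack memory in the misses, unless the buffer is cleared up front.

```c
#include <string.h>
#include <stdint.h>

#define QK 32  // illustrative block size, not the real constant

// Hypothetical stand-in for the candidate search: L[i] is written only
// when a "candidate" is found. Without the memset, blocks where the
// search never succeeds would read back whatever garbage was on the
// stack; with it, a miss deterministically leaves index 0.
static void assign_indices(const float *x, uint8_t *L) {
    memset(L, 0, QK);  // the fix: defined contents before the search
    for (int i = 0; i < QK; ++i) {
        if (x[i] > 0.0f) { // stand-in for "grid candidate found"
            L[i] = 1;      // stand-in for the chosen grid index
        }
        // else: L[i] stays 0 thanks to the memset above
    }
}
```

Without the `memset`, an all-zero block (the degenerate case described above) would leave every `L[i]` uninitialized.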
Additional changes:

- `L` -> `Laux`: basically a no-op; we only use the returned scale, so we can use `L` or `Laux` here, since `L` is memset immediately after. This change makes it more obvious that this is a buffer, but I don't care if we leave this line out.
- `eff_max <= 0` check: guards against dividing by 0 when `scale` is 0; we set the scales to 0 and exit early, like when `max < GROUP_MAX_EPS` is triggered.

PPL tested on Qwen3-Coder-Next:
PPL `IQ2_XS` before:
PPL `IQ2_XS` after:
Difference: +0.0057
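The `eff_max <= 0` guard mentioned above follows the same early-exit pattern as the existing `max < GROUP_MAX_EPS` path; a minimal sketch (hypothetical names like `scale_block` and `eff_max`, not the real ggml code):

```c
#include <string.h>

// Hypothetical sketch of the early-exit guard: when the effective
// maximum is zero (or negative), dividing by it to derive a scale
// would produce inf/NaN, so the block's scales are zeroed instead
// and the function exits early.
static float scale_block(float eff_max, float *scales, int nscales) {
    if (eff_max <= 0.0f) {
        memset(scales, 0, (size_t)nscales * sizeof(float)); // zero block
        return 0.0f;
    }
    float scale = 1.0f / eff_max; // safe: eff_max > 0 here
    for (int i = 0; i < nscales; ++i) scales[i] *= scale;
    return scale;
}
```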
`IQ2_XXS` is now possible to make (with my imatrix at least), whereas before I got:

For sanity, PPL `IQ2_XXS`:

For IQ1_S and IQ1_M:
Similar to the above, the assert happens when we have all-zero weights, meaning the search loop failed to find a candidate. Now we instead treat it as a zero block and populate the scales and shifts to avoid uninitialized memory.
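The zero-block fallback can be sketched like this (hypothetical `encode_block` helper and field names, not the real IQ1_S/IQ1_M code): when every weight in the block is zero, the grid search can never succeed, so instead of asserting we emit a fully defined zero block.

```c
#include <stdbool.h>
#include <string.h>
#include <stdint.h>

// Hypothetical sketch of the IQ1 fallback described above: detect the
// all-zero block, then write defined indices, scale, and shift rather
// than leaving them uninitialized (which previously tripped an assert).
static bool encode_block(const float *w, int n, uint8_t *idx,
                         float *scale, int *shift) {
    bool all_zero = true;
    for (int i = 0; i < n; ++i) {
        if (w[i] != 0.0f) { all_zero = false; break; }
    }
    if (all_zero) {
        memset(idx, 0, (size_t)n); // defined indices instead of garbage
        *scale = 0.0f;             // zero scale => block decodes to zeros
        *shift = 0;
        return false;              // signal "zero block" to the caller
    }
    // ... the real code would run the candidate grid search here ...
    *scale = 1.0f;
    *shift = 1;
    return true;
}
```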
This was the assert I was seeing:
PPL `IQ1_S` before:
PPL `IQ1_S` after:
Difference: +0.016
ALL PPL calculations done with:

I used Qwen3-Coder-Next because I had existing "before" quants to compare against, and IQ2_XXS previously failed with the "Oops" crash.

Note: I think the PPL change comes from changes to the model conversion; I ran PPL against the full bf16 model.
Conversion done with old release b7936:
Conversion from master:
Difference: +0.0056
(in hindsight, wikitext for a coding model wasn't the best dataset... willing to redo the PPLs against a different dataset if necessary)
Disclaimer:

I don't fully understand this area of the code yet, so while the `memset` stuff is obvious and had immediate tangible improvements, the rest is outside my wheelhouse. I would really appreciate another set of experienced eyes looking at this; I've provided my test results in an attempt to show nothing has been broken, and have reviewed this to the best of my ability.

If any additional tests should be run before people are willing to review, I'll more than happily run them.