DARE TIES merge of mixed fp16 and bf16 models cannot be exl2 quanted #204
Can mergekit operate correctly on a mix of fp16 and bf16 models by automatically converting to the target dtype?
It does convert everything to the target dtype if you have one specified. I do know that DARE TIES with lower density values can result in an unusual distribution of parameter magnitudes: the rescaling step tends to introduce large outliers that are critical to the behavior of the model. I saw this happen more with densities around 0.1, but it could be happening here too. Maybe this is throwing exl2 for a loop? I'm not sure. @turboderp, any thoughts?
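To illustrate why low densities produce outliers, here is a toy sketch of DARE's drop-and-rescale step (a hypothetical helper, not mergekit's actual implementation): each delta weight is dropped with probability `(1 - density)`, and survivors are rescaled by `1 / density` so the expected sum of deltas is preserved. At density 0.1 every surviving weight is amplified 10x, which is where the large-magnitude outliers come from.

```python
import random

def dare_prune(delta, density, seed=0):
    """Toy sketch of DARE drop-and-rescale (hypothetical helper, not
    mergekit's code). Each weight is kept with probability `density`;
    survivors are rescaled by 1 / density, the rest are zeroed."""
    rng = random.Random(seed)
    out = []
    for w in delta:
        if rng.random() < density:   # keep with probability `density`
            out.append(w / density)  # rescale survivor: 10x at density 0.1
        else:
            out.append(0.0)
    return out

deltas = [0.01] * 1000
pruned = dare_prune(deltas, density=0.1)
survivors = [w for w in pruned if w != 0.0]
print(len(survivors))  # roughly 100 of 1000 weights survive
print(max(pruned))     # surviving weights are 10x the original 0.01
```

The sum of the pruned deltas stays comparable to the original sum in expectation, but the individual surviving magnitudes are much larger, skewing the per-tensor distribution that a quantizer sees.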
ExLlama converts everything to FP16 when loading or quantizing, and the difference in dynamic range compared to BF16 could conceivably be an issue, but I've never seen it in practice. I tried downloading that model and converting it here, and it seems to work fine? (Both in 4bpw h6 and 8bpw h8.)
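The dynamic-range difference mentioned here is concrete: fp16 has 5 exponent bits (largest finite value 65504), while bf16 keeps fp32's 8 exponent bits (max around 3.4e38), so a bf16 weight larger than 65504 has no finite fp16 representation. A minimal stdlib-only check, using Python's half-precision `'e'` struct format:

```python
import struct

def fits_in_fp16(x: float) -> bool:
    """Check whether x packs into IEEE 754 half precision ('e' format).
    struct raises OverflowError when the value exceeds fp16's range."""
    try:
        struct.pack('<e', x)
        return True
    except OverflowError:
        return False

print(fits_in_fp16(60000.0))  # within fp16's finite range
print(fits_in_fp16(1.0e5))    # representable in bf16, overflows fp16
```

This is the failure mode that could in principle bite a BF16-to-FP16 cast, though as noted above it rarely shows up with real model weights.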
I updated to the current version and retried. The 8.0bpw h8 exl2 quant completed, but the truncation length that got read in was only 2048 when I loaded it with ooba. The Q8_0 GGUF truncation length remained at 8192. I forgot to mention that I ran this under Windows 11, on a Ryzen 3 2200G, if that matters.
I wonder if this is a bug in TGW. From v0.0.15, EXL2 adds a "quantization_config" key to config.json, which is the only place a length of 2048 would be mentioned in the model. It's only under that key, though, as the calibration length. The model itself still lists "max_position_embeddings" of 8192.
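A quick way to see which length a model directory actually advertises is to read both keys from config.json. This is a sketch based on the key names mentioned in this thread; the exact nesting of the calibration length under "quantization_config" is an assumption and may differ across exllamav2 versions:

```python
import json

def report_lengths(config_path: str) -> dict:
    """Read a model's config.json and report the inference context length
    versus the calibration length stored under quantization_config.
    The "calibration" -> "length" path is assumed, not guaranteed."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "max_position_embeddings": cfg.get("max_position_embeddings"),
        "calibration_length": cfg.get("quantization_config", {})
                                 .get("calibration", {})
                                 .get("length"),
    }
```

A frontend that picks up 2048 from the calibration metadata instead of "max_position_embeddings" would show exactly the truncation-length symptom described above.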
I looked at the config.json in the generated exl2 quant. I'm not sure why the length under calibration ended up at 2048. Manually adjusting it to 8192 resulted in ooba reporting the truncation length as 8192 rather than 2048. 2048 does not show up at all in the config.json of the merged result output from mergekit, so I should probably take this issue up with exllamav2 at this point.
The length listed under "quantization_config" has nothing to do with inference. It's just metadata for troubleshooting purposes. You'd have to ask ooba what config option they're (not) reading to arrive at 2048 as the default.
I did a DARE TIES merge of three fp16 models and one bf16 model, outputting fp16 or bf16. The result can be converted to a Q8_0 GGUF, but there are runtime errors when converting to 8.0bpw h8 exl2. This has happened for two different YAML formulas. An example formula for bf16 output is on this model card: https://huggingface.co/grimjim/kunoichi-lemon-royale-7B
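For readers unfamiliar with mergekit formulas, a merge of this shape would look roughly like the following. This is a hypothetical sketch, not the actual recipe from the model card; the model names are placeholders:

```yaml
# Hypothetical DARE TIES merge mixing fp16 and bf16 inputs.
# Model names are placeholders, not the thread's actual recipe.
merge_method: dare_ties
base_model: base-model-fp16
models:
  - model: donor-a-fp16
    parameters:
      weight: 0.33
      density: 0.5
  - model: donor-b-bf16
    parameters:
      weight: 0.33
      density: 0.5
dtype: bfloat16   # mergekit casts all inputs to this output dtype
```

The `dtype` field at the bottom is what drives the fp16/bf16 conversion discussed at the top of this thread.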