convert: Add compressed-tensors NVFP4 conversion #21095
michaelw9999 wants to merge 2 commits into ggml-org:master
Conversation
This version does not fail, but it produced an 11 MB gguf out of this repo.
Check your /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/ folder? Are all the shards and files in there? Can you show me a
Yep, that was a problem with git lfs; fixed, converted fine, it runs now. Performance seems worse than Q4; benchmarking now.
So here's what I'm getting:
Qwen3.5-122B-A10B-MXFP4_MOE:
Qwen3.5-122B-A10B-NVFP4:
Great! Are you running that with PR #21074 or still on the baseline? Either way, I have not yet posted the real Blackwell kernel, so that is not at all surprising. I do not have enough VRAM to run these for testing or optimizing, so we'll have to see how it goes on those.
I can test those as well. I'm also converting this - it's still going, but so far it's good.
This comment was marked as off-topic.
Uploaded the 122B quants here: https://huggingface.co/DrRos/Qwen3.5-122B-A10B-NVFP4-GGUF/tree/main
Force-pushed from ff285d8 to ea499e9
Also uploaded the 397B quants - https://huggingface.co/DrRos/Qwen3.5-397B-A17B-NVFP4-GGUF/tree/main - if someone wants to test.
```python
if nvfp4_compressed_tensors:
    # Convert compressed-tensors 'global' scales into the reciprocal
    def inverse_scale(gen):
        def load():
            scale = LazyTorchTensor.to_eager(gen()).float()
            return torch.where(torch.isfinite(scale) & (scale > 0), 1.0 / scale, torch.ones_like(scale))
        return load
    # Change the compressed-tensors names to the ModelOpt names for handling consistently later
    for name in list(self.model_tensors.keys()):
        if name.endswith(".weight_packed"):
            if name.removesuffix("_packed") not in self.model_tensors:
                self.model_tensors[name.removesuffix("_packed")] = self.model_tensors.pop(name)
        elif name.endswith(".weight_global_scale"):
            scale2_name = name.replace(".weight_global_scale", ".weight_scale_2")
            if scale2_name not in self.model_tensors:
                self.model_tensors[scale2_name] = inverse_scale(self.model_tensors.pop(name))
        elif name.endswith(".input_global_scale"):
            input_scale_name = name.replace(".input_global_scale", ".input_scale")
            if input_scale_name not in self.model_tensors:
                self.model_tensors[input_scale_name] = inverse_scale(self.model_tensors.pop(name))
```
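For reference, the guard in `inverse_scale` only inverts finite, positive scales and substitutes a neutral 1.0 otherwise. A minimal torch-free sketch of that logic (the standalone function here is illustrative, not code from the PR):

```python
import math

def inverse_scale(scale: float) -> float:
    # Only invert finite, positive scales; degenerate values (zero,
    # negative, NaN, inf) fall back to a neutral scale of 1.0, matching
    # the torch.where(...) guard in the diff above.
    if math.isfinite(scale) and scale > 0:
        return 1.0 / scale
    return 1.0

print(inverse_scale(2.0))           # 0.5
print(inverse_scale(0.0))           # 1.0 (degenerate)
print(inverse_scale(float("inf")))  # 1.0 (degenerate)
```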
Are there no 1D .weight_scale tensors, like the ones handled here?
llama.cpp/convert_hf_to_gguf.py, lines 489 to 501 at ea499e9
@CISC I went and verified in `@BaseCompressor.register(name=CompressionFormat.nvfp4_pack_quantized.value)` from compressed-tensors: there are no 1D weight scales, so no fallback is needed for nvfp4-pack-quantized.
The only thing I noticed going through it again that might warrant a further adjustment to this PR is the distinction between NVFP4A16 and NVFP4. Right now the script will still accept NVFP4A16, just call it NVFP4, and set the input_scale to 1.0f if it's absent. With Q8 as the default it will not be doing NVFP4 / 16-bit activations.
For the Blackwell MMA/MMVQ kernels I have W4A4 (NVFP4 x NVFP4) as the default and, at the moment, the only option.
I realize I haven't written any code yet to check whether input_scale is 1.0f and then compute an appropriate input scale, so that will be on my to-do list before I post that PR.
So if you think we need it, we could add metadata to retain the recipe used, and/or label it as NVFP4A16 in the model name. I do not think it's needed though; the user could do this themselves with args. If input_scale == 1.0f we will know on the code side what to do.
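The `input_scale == 1.0f` check being discussed could be sketched like this; this is a hypothetical illustration of the idea only (the dict-of-floats shape, the function name, and the tensor names are assumptions, not code from llama.cpp):

```python
def is_weight_only_nvfp4(tensors: dict) -> bool:
    # If every *.input_scale the converter emitted is exactly 1.0, the
    # source checkpoint was presumably NVFP4A16 (weight-only), and the
    # kernel side would need to derive its own activation scale.
    input_scales = [v for k, v in tensors.items() if k.endswith(".input_scale")]
    return bool(input_scales) and all(s == 1.0 for s in input_scales)

print(is_weight_only_nvfp4({"blk.0.ffn_up.input_scale": 1.0}))   # True
print(is_weight_only_nvfp4({"blk.0.ffn_up.input_scale": 0.02}))  # False
```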
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@michaelw9999 @CISC do I need to requantize the model after the recent changes?
No.
This update expands the `convert_hf_to_gguf.py` script to support converting Hugging Face NVFP4 models quantized with compressed-tensors. Previously, only ModelOpt-quantized models were compatible and an error was raised. The script finds the values and names used by compressed-tensors (e.g., `weight_global_scale` instead of `weight_scale_2` for the tensor scale) and renames them to the ModelOpt equivalents so that the rest of the conversion remains identical. This keeps the update small. The weights themselves do not need any adaptation; the only other difference is that the scales are stored as reciprocal values.
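In sketch form, the renaming amounts to a suffix mapping over the tensor dict. A simplified stand-alone version (the real code wraps the scale tensors in lazy loaders and inverts them, which is omitted here; the example tensor names are hypothetical):

```python
def rename_compressed_tensors(tensors: dict) -> dict:
    """Map compressed-tensors NVFP4 names onto their ModelOpt equivalents."""
    out = {}
    for name, value in tensors.items():
        if name.endswith(".weight_packed"):
            out[name.removesuffix("_packed")] = value
        elif name.endswith(".weight_global_scale"):
            out[name.replace(".weight_global_scale", ".weight_scale_2")] = value
        elif name.endswith(".input_global_scale"):
            out[name.replace(".input_global_scale", ".input_scale")] = value
        else:
            out[name] = value
    return out

renamed = rename_compressed_tensors({
    "model.layers.0.mlp.up_proj.weight_packed": b"...",
    "model.layers.0.mlp.up_proj.weight_global_scale": 448.0,
    "model.layers.0.mlp.up_proj.input_global_scale": 6.0,
})
print(sorted(renamed))
```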