
convert: Add compressed-tensors NVFP4 conversion#21095

Open
michaelw9999 wants to merge 2 commits into ggml-org:master from michaelw9999:nvfp4-hf-comptens

Conversation

@michaelw9999
Contributor

This update extends the convert_hf_to_gguf script to support converting Hugging Face NVFP4 models quantized with compressed-tensors. Previously, only ModelOpt-quantized models were compatible and an error was raised for anything else.

It finds the values and names used by compressed-tensors (e.g., weight_global_scale instead of weight_scale_2 for the tensor scale) and renames them to their ModelOpt equivalents so that the rest of the conversion remains identical, which keeps the update small. The weights themselves need no adaptation; the only other difference is that the scales are stored as reciprocals.
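As a standalone sketch of the mapping described above (plain Python for illustration; the real conversion operates on lazy torch tensors, and the helper name and suffix map here are my own):

```python
import math

# Suffix renames from compressed-tensors names to their ModelOpt
# equivalents, assumed from the PR description:
SUFFIX_MAP = {
    ".weight_packed":       ".weight",
    ".weight_global_scale": ".weight_scale_2",
    ".input_global_scale":  ".input_scale",
}

def global_scale_to_modelopt(scale: float) -> float:
    # compressed-tensors stores the reciprocal of the ModelOpt scale;
    # invert it, guarding against zero or non-finite values.
    return 1.0 / scale if math.isfinite(scale) and scale > 0 else 1.0
```

With this, a compressed-tensors checkpoint only needs renames plus one reciprocal per scale; the packed FP4 weights pass through unchanged.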

@drrros
Contributor

drrros commented Mar 28, 2026

This version does not fail, but it produced an 11 MB GGUF from this repo.
Full logs:

(llama.cpp) drros@epyc-ws:~/llama.cpp$ ./convert_hf_to_gguf.py --verbose --outfile ../Qwen3.5-122B-A10B-NVFP4.gguf /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/
INFO:hf-to-gguf:Loading model: Qwen3.5-122B-A10B-NVFP4
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
INFO:hf-to-gguf:heuristics unable to detect tensor dtype, defaulting to --outtype f16
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 262144
INFO:hf-to-gguf:gguf: embedding length = 3072
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 2
WARNING:hf-to-gguf:Unknown RoPE type: default
INFO:hf-to-gguf:gguf: rope scaling type = NONE
INFO:hf-to-gguf:gguf: mrope sections: [11, 11, 10, 0]
INFO:hf-to-gguf:gguf: rope theta = 10000000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: expert count = 256
INFO:hf-to-gguf:gguf: experts used count = 8
INFO:hf-to-gguf:gguf: file type = 39
INFO:hf-to-gguf:gguf: expert feed forward length = 1024
INFO:hf-to-gguf:gguf: expert shared feed forward length = 1024
INFO:hf-to-gguf:Set model quantization version
INFO:hf-to-gguf:Set model tokenizer
DEBUG:hf-to-gguf:chktok: [198, 4558, 14305, 63288, 7599, 1517, 2228, 75981, 10628, 9008, 248, 222, 318, 7994, 8, 25677, 114, 373, 235, 9008, 234, 104, 29545, 318, 34493, 95637, 94170, 8, 189551, 10838, 99, 247, 9008, 99, 247, 220, 18, 220, 18, 18, 220, 18, 18, 18, 220, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 18, 220, 18, 18, 18, 18, 18, 18, 18, 18, 220, 18, 13, 18, 220, 18, 486, 18, 220, 18, 1076, 18, 220, 151346, 157035, 152107, 158785, 154980, 158842, 152108, 154892, 158855, 154892, 237401, 169130, 72790, 223, 907, 108543, 95772, 21680, 96069, 16, 18, 16, 19, 16, 20, 16, 95913, 20565, 53598, 51448, 232742, 12696, 61326, 154824, 76306, 3246, 4456, 4456, 13475, 13475, 71093, 2918, 2918, 27235, 16582, 2834, 25758, 7403, 353, 2908, 978, 359, 83, 787, 551, 579, 1017, 11, 359, 762, 488, 2617, 30, 359, 44, 524, 2617, 353, 3172, 1236, 424, 11, 359, 35, 488, 1040, 1010, 14799, 30, 1165, 6, 41197, 264, 61709, 43]
DEBUG:hf-to-gguf:chkhsh: d30d75d9059f1aa2c19359de71047b3ae408c70875e8a3ccf8c5fba56c9d8af4
DEBUG:hf-to-gguf:tokenizer.ggml.pre: 'qwen35'
DEBUG:hf-to-gguf:chkhsh: d30d75d9059f1aa2c19359de71047b3ae408c70875e8a3ccf8c5fba56c9d8af4
INFO:gguf.vocab:Adding 247587 merge(s).
INFO:gguf.vocab:Setting special token type eos to 248046
INFO:gguf.vocab:Setting special token type pad to 248044
INFO:gguf.vocab:Setting chat_template to {%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- macro render_content(content, do_vision_count, is_system_content=false) %}
    {%- if content is string %}
        {{- content }}
    {%- elif content is iterable and content is not mapping %}
        {%- for item in content %}
            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
                {%- if is_system_content %}
                    {{- raise_exception('System message cannot contain images.') }}
                {%- endif %}
                {%- if do_vision_count %}
                    {%- set image_count.value = image_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}
                    {{- 'Picture ' ~ image_count.value ~ ': ' }}
                {%- endif %}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
            {%- elif 'video' in item or item.type == 'video' %}
                {%- if is_system_content %}
                    {{- raise_exception('System message cannot contain videos.') }}
                {%- endif %}
                {%- if do_vision_count %}
                    {%- set video_count.value = video_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}
                    {{- 'Video ' ~ video_count.value ~ ': ' }}
                {%- endif %}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
            {%- elif 'text' in item %}
                {{- item.text }}
            {%- else %}
                {{- raise_exception('Unexpected item type in content.') }}
            {%- endif %}
        {%- endfor %}
    {%- elif content is none or content is undefined %}
        {{- '' }}
    {%- else %}
        {{- raise_exception('Unexpected content type.') }}
    {%- endif %}
{%- endmacro %}
{%- if not messages %}
    {{- raise_exception('No messages provided.') }}
{%- endif %}
{%- if tools and tools is iterable and tools is not mapping %}
    {{- '<|im_start|>system\n' }}
    {{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>" }}
    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
    {%- if messages[0].role == 'system' %}
        {%- set content = render_content(messages[0].content, false, true)|trim %}
        {%- if content %}
            {{- '\n\n' + content }}
        {%- endif %}
    {%- endif %}
    {{- '<|im_end|>\n' }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {%- set content = render_content(messages[0].content, false, true)|trim %}
        {{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" %}
        {%- set content = render_content(message.content, false)|trim %}
        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
            {%- set ns.multi_step_tool = false %}
            {%- set ns.last_query_index = index %}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if ns.multi_step_tool %}
    {{- raise_exception('No user query found in messages.') }}
{%- endif %}
{%- for message in messages %}
    {%- set content = render_content(message.content, true)|trim %}
    {%- if message.role == "system" %}
        {%- if not loop.first %}
            {{- raise_exception('System message must be at the beginning.') }}
        {%- endif %}
    {%- elif message.role == "user" %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- set reasoning_content = reasoning_content|trim %}
        {%- if loop.index0 > ns.last_query_index %}
            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
            {%- for tool_call in message.tool_calls %}
                {%- if tool_call.function is defined %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {%- if loop.first %}
                    {%- if content|trim %}
                        {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
                    {%- else %}
                        {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
                    {%- endif %}
                {%- else %}
                    {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
                {%- endif %}
                {%- if tool_call.arguments is defined %}
                    {%- for args_name, args_value in tool_call.arguments|items %}
                        {{- '<parameter=' + args_name + '>\n' }}
                        {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
                        {{- args_value }}
                        {{- '\n</parameter>\n' }}
                    {%- endfor %}
                {%- endif %}
                {{- '</function>\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.previtem and loop.previtem.role != "tool" %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if not loop.last and loop.nextitem.role != "tool" %}
            {{- '<|im_end|>\n' }}
        {%- elif loop.last %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- else %}
        {{- raise_exception('Unexpected message role.') }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- else %}
        {{- '<think>\n' }}
    {%- endif %}
{%- endif %}
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:../Qwen3.5-122B-A10B-NVFP4.gguf: n_tensors = 0, total_size = negligible - metadata only
Writing: 0.00byte [00:00, ?byte/s]
INFO:hf-to-gguf:Model successfully exported to ../Qwen3.5-122B-A10B-NVFP4.gguf
(llama.cpp) drros@epyc-ws:~/llama.cpp$ ls -alh ../Qwen3.5-122B-A10B-NVFP4.gguf 
-rw-rw-r-- 1 drros drros 11M Mar 28 09:30 ../Qwen3.5-122B-A10B-NVFP4.gguf

@michaelw9999
Contributor Author

> This version does not fail, but it produced 11Mb gguf out of this repo. Full logs: […]

Check your /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/ folder? Are all the shards and files in there? Can you show me an `ls -lR`?

@drrros
Contributor

drrros commented Mar 28, 2026

> Check your /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/ folder ? Are all the shards and files in there? Can you show me a ls -lR ?

Yep, that was a problem with git lfs; fixed, it converted fine and runs now. Performance seems worse than Q4; benchmarking now.

@drrros
Contributor

drrros commented Mar 28, 2026

So here's what I'm getting:
Qwen3.5-122B-A10B-UD-Q4_K_XL:

drros@epyc-ws:~/llama.cpp$ ./build/bin/llama-bench -m /mnt/ds1nfs/codellamaweights/qwen3.5-122b-ud-q4-k-xl/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf  -ts 35/20/20 -ncmoe 13 -p 8096 -ub 2048
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
  Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
  Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
  Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | n_ubatch | ts           |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -------: | ------------ | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | CUDA       |  99 |         13 |     2048 | 35.00/20.00/20.00 |          pp8096 |       1072.58 ± 3.54 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.73 GiB |   122.11 B | CUDA       |  99 |         13 |     2048 | 35.00/20.00/20.00 |           tg128 |         48.92 ± 0.07 |

build: 59d840209 (8559)

Qwen3.5-122B-A10B-MXFP4_MOE:

drros@epyc-ws:~/llama.cpp$ ./build/bin/llama-bench -m /mnt/ds1nfs/codellamaweights/qwen3.5-122b-mxfp4/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00003.gguf -ts 35/20/20 -ncmoe 13 -p 8096 -ub 2048
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
  Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
  Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
  Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | n_ubatch | ts           |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -------: | ------------ | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  69.53 GiB |   122.11 B | CUDA       |  99 |         13 |     2048 | 35.00/20.00/20.00 |          pp8096 |       1153.86 ± 3.50 |
| qwen35moe 122B.A10B Q4_K - Medium |  69.53 GiB |   122.11 B | CUDA       |  99 |         13 |     2048 | 35.00/20.00/20.00 |           tg128 |         48.30 ± 0.07 |

build: 59d840209 (8559)

Qwen3.5-122B-A10B-NVFP4:

drros@epyc-ws:~/llama.cpp$ ./build/bin/llama-bench -m ../Qwen3.5-122B-A10B-NVFP4.gguf -ts 35/20/20 -ncmoe 13 -p 8096 -ub 2048
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
  Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
  Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
  Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | n_ubatch | ts           |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -------: | ------------ | --------------: | -------------------: |
| qwen35moe 122B.A10B NVFP4      |  70.41 GiB |   122.11 B | CUDA       |  99 |         13 |     2048 | 35.00/20.00/20.00 |          pp8096 |        650.37 ± 2.85 |
| qwen35moe 122B.A10B NVFP4      |  70.41 GiB |   122.11 B | CUDA       |  99 |         13 |     2048 | 35.00/20.00/20.00 |           tg128 |         17.10 ± 0.16 |

build: 59d840209 (8559)

@michaelw9999
Contributor Author

> Check your /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/ folder ? Are all the shards and files in there? Can you show me a ls -lR ?
>
> Yep, that was problem with git lfs, fixed, converted fine, it runs now. Seems like performance is worse than Q4, benchmarking now.

Great! Are you running that with PR #21074 or still on the baseline? Either way, I have not yet posted the real Blackwell kernel, so that is not at all surprising. I do not have enough VRAM to run these models for testing or optimizing, so we'll have to see how it goes on those cards.

@drrros
Contributor

drrros commented Mar 28, 2026

> Are you running that with PR #21074

Yes, this is on PR #21074.

@drrros
Contributor

drrros commented Mar 28, 2026

> I have not yet posted the real Blackwell kernel

I can test those as well.

I'm also converting this; it's still going, but so far it looks good.

@michaelw9999

This comment was marked as off-topic.

@drrros
Contributor

drrros commented Mar 28, 2026

@drrros
Contributor

drrros commented Mar 30, 2026

Also uploaded 397B - https://huggingface.co/DrRos/Qwen3.5-397B-A17B-NVFP4-GGUF/tree/main - if someone wants to test.

Comment thread: convert_hf_to_gguf.py (outdated)
Comment thread: convert_hf_to_gguf.py
Comment on lines +775 to +794
if nvfp4_compressed_tensors:
    # Convert compressed-tensors 'global' scales into their reciprocals
    def inverse_scale(gen):
        def load():
            scale = LazyTorchTensor.to_eager(gen()).float()
            return torch.where(torch.isfinite(scale) & (scale > 0), 1.0 / scale, torch.ones_like(scale))
        return load

    # Rename the compressed-tensors names to the ModelOpt names so they are handled consistently later
    for name in list(self.model_tensors.keys()):
        if name.endswith(".weight_packed"):
            if name.removesuffix("_packed") not in self.model_tensors:
                self.model_tensors[name.removesuffix("_packed")] = self.model_tensors.pop(name)
        elif name.endswith(".weight_global_scale"):
            scale2_name = name.replace(".weight_global_scale", ".weight_scale_2")
            if scale2_name not in self.model_tensors:
                self.model_tensors[scale2_name] = inverse_scale(self.model_tensors.pop(name))
        elif name.endswith(".input_global_scale"):
            input_scale_name = name.replace(".input_global_scale", ".input_scale")
            if input_scale_name not in self.model_tensors:
                self.model_tensors[input_scale_name] = inverse_scale(self.model_tensors.pop(name))
Member


Are there no 1D .weight_scale tensors, like the ones handled here?

elif quant_method == "modelopt":
    # Mixed-precision ModelOpt models: NVFP4 tensors are handled by
    # _generate_nvfp4_tensors; FP8 tensors have 1D weight_scale and
    # are dequantized here. k/v scale tensors are unused.
    for name in self.model_tensors.keys():
        if name.endswith(".weight_scale"):
            weight_name = name.removesuffix("_scale")
            w = self.model_tensors[weight_name]
            s = self.model_tensors[name]
            self.model_tensors[weight_name] = lambda w=w, s=s: dequant_simple(w(), s(), None)
            tensors_to_remove.append(name)
        if name.endswith((".k_scale", ".v_scale")):
            tensors_to_remove.append(name)
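For context, the 1D-weight_scale handling quoted above boils down to a per-output-row multiply. A plain-Python stand-in for what a `dequant_simple`-style call does (the real code works on torch tensors; this scalar version is only an illustration):

```python
def dequant_rowwise(w, s):
    # w: rows of quantized weight values (out_features x in_features)
    # s: one scale per output row (the "1D weight_scale")
    return [[x * scale for x in row] for row, scale in zip(w, s)]

# e.g. dequant_rowwise([[1.0, 2.0], [3.0, 4.0]], [2.0, 0.5])
# -> [[2.0, 4.0], [1.5, 2.0]]
```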

Contributor Author


@CISC I went and verified in @BaseCompressor.register(name=CompressionFormat.nvfp4_pack_quantized.value) from compressed-tensors: there are no 1D weight scales, so no fallback is needed for nvfp4-pack-quantized.

The only thing I noticed going through it again that might warrant an adjustment to this PR is the distinction between NVFP4A16 and NVFP4. Right now the script will still accept NVFP4A16, simply call it NVFP4, and set input_scale to 1.0f if it is absent. With Q8 as the default, it will not be doing NVFP4 with 16-bit activations.

For the Blackwell MMA/MMVQ kernels I have W4A4 (NVFP4 x NVFP4) as the default and, at the moment, the only option. I realize I haven't written any code yet to check whether input_scale is 1.0f and then derive an appropriate input scale, so that will be on my to-do list before I post that PR.

If you think we need it, we could add metadata to retain the recipe used, and/or label it as NVFP4A16 in the model name. I do not think that is needed, though; the user could do it themselves with args, and if input_scale == 1.0f we will know on the code side what to do.
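The input_scale == 1.0f check described above could look like this (the helper name and tolerance are assumptions; the kernel-side handling is not yet posted):

```python
def needs_derived_input_scale(input_scale: float, tol: float = 1e-6) -> bool:
    # An input_scale of exactly 1.0f signals the scale was absent in the
    # checkpoint (NVFP4A16), so a real activation scale must be derived.
    return abs(input_scale - 1.0) < tol
```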

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@drrros
Contributor

drrros commented Apr 2, 2026

@michaelw9999 @CISC do I need to requantize the model after the recent changes?

@CISC
Member

CISC commented Apr 2, 2026

> @michaelw9999 @CISC do I need to requantize model after recent changes?

No.
