convert: Add compressed-tensors NVFP4 conversion #21095
michaelw9999 wants to merge 2 commits into ggml-org:master
Conversation
This version does not fail, but it produced an 11 MB gguf out of this repo.
Check your /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/ folder? Are all the shards and files in there? Can you show me a
Yep, that was a problem with git lfs; fixed, converted fine, it runs now. Performance seems worse than Q4; benchmarking now.
So here's what I'm getting:
Qwen3.5-122B-A10B-MXFP4_MOE:
Qwen3.5-122B-A10B-NVFP4:
Great! Are you running that with PR #21074 or still on the baseline? Either way, I have not yet posted the real Blackwell kernel, so that is not at all surprising. I do not have enough VRAM to run these for testing or optimizing, so we'll have to see how it goes on those.
I can test those as well. I'm also converting this - it's still going, but so far it's good.
This comment was marked as off-topic.
Uploaded the 122B quants here: https://huggingface.co/DrRos/Qwen3.5-122B-A10B-NVFP4-GGUF/tree/main
Force-pushed from ff285d8 to ea499e9
Also uploaded the 397B quants - https://huggingface.co/DrRos/Qwen3.5-397B-A17B-NVFP4-GGUF/tree/main - if someone wants to test.
```python
if nvfp4_compressed_tensors:
    # Convert compressed-tensors 'global' scales into the reciprocal
    def inverse_scale(gen):
        def load():
            scale = LazyTorchTensor.to_eager(gen()).float()
            return torch.where(torch.isfinite(scale) & (scale > 0), 1.0 / scale, torch.ones_like(scale))
        return load
    # Change the compressed-tensors names to the ModelOpt names for handling consistently later
    for name in list(self.model_tensors.keys()):
        if name.endswith(".weight_packed"):
            if name.removesuffix("_packed") not in self.model_tensors:
                self.model_tensors[name.removesuffix("_packed")] = self.model_tensors.pop(name)
        elif name.endswith(".weight_global_scale"):
            scale2_name = name.replace(".weight_global_scale", ".weight_scale_2")
            if scale2_name not in self.model_tensors:
                self.model_tensors[scale2_name] = inverse_scale(self.model_tensors.pop(name))
        elif name.endswith(".input_global_scale"):
            input_scale_name = name.replace(".input_global_scale", ".input_scale")
            if input_scale_name not in self.model_tensors:
                self.model_tensors[input_scale_name] = inverse_scale(self.model_tensors.pop(name))
```
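For reference, the guard in `inverse_scale` only inverts finite, positive scales and substitutes a neutral 1.0 otherwise. A minimal torch-free sketch of that logic (the standalone function here is illustrative, not code from the PR):

```python
import math

def inverse_scale(scale: float) -> float:
    # Only invert finite, positive scales; degenerate values (zero,
    # negative, NaN, inf) fall back to a neutral scale of 1.0, matching
    # the torch.where(...) guard in the diff above.
    if math.isfinite(scale) and scale > 0:
        return 1.0 / scale
    return 1.0

print(inverse_scale(2.0))           # 0.5
print(inverse_scale(0.0))           # 1.0 (degenerate)
print(inverse_scale(float("inf")))  # 1.0 (degenerate)
```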
Are there no 1D .weight_scale tensors, like the ones handled here?
llama.cpp/convert_hf_to_gguf.py, lines 489 to 501 at ea499e9
@CISC I went and verified in `@BaseCompressor.register(name=CompressionFormat.nvfp4_pack_quantized.value)` from compressed-tensors: there are no 1D weight scales, so no fallback is needed for nvfp4-pack-quantized.
The only thing I noticed going through it again that might warrant a further adjustment to this PR is the distinction between NVFP4A16 and NVFP4. Right now the script will still accept NVFP4A16, just call it NVFP4, and set the input_scale to 1.0f if it's absent. With Q8 as the default it will not be doing NVFP4 / 16-bit activations.
For the Blackwell MMA/MMVQ kernels I have W4A4 (NVFP4 x NVFP4) as the default and, at the moment, the only option.
I realize I haven't written any code yet to check whether input_scale is 1.0f and then compute an appropriate input scale, so that will be on my to-do list before I post that PR.
So if you think we need it, we could add metadata to retain the recipe used, and/or label it as NVFP4A16 in the model name. I do not think it's needed though; the user could do this themselves with args. If input_scale == 1.0f we will know on the code side what to do.
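The `input_scale == 1.0f` check being discussed could be sketched like this; this is a hypothetical illustration of the idea only (the dict-of-floats shape, the function name, and the tensor names are assumptions, not code from llama.cpp):

```python
def is_weight_only_nvfp4(tensors: dict) -> bool:
    # If every *.input_scale the converter emitted is exactly 1.0, the
    # source checkpoint was presumably NVFP4A16 (weight-only), and the
    # kernel side would need to derive its own activation scale.
    input_scales = [v for k, v in tensors.items() if k.endswith(".input_scale")]
    return bool(input_scales) and all(s == 1.0 for s in input_scales)

print(is_weight_only_nvfp4({"blk.0.ffn_up.input_scale": 1.0}))   # True
print(is_weight_only_nvfp4({"blk.0.ffn_up.input_scale": 0.02}))  # False
```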
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@michaelw9999 @CISC do I need to requantize the model after the recent changes?
No.
This update expands the `convert_hf_to_gguf.py` script to support converting Hugging Face NVFP4 models quantized with compressed-tensors. Previously, only ModelOpt-quantized models were compatible and an error was raised. The script finds the values and names used by compressed-tensors (e.g., `weight_global_scale` instead of `weight_scale_2` for the tensor scale) and renames them to the ModelOpt equivalents so that the rest of the conversion remains identical. This keeps the update small. The weights themselves do not need any adaptation; the only other difference is that the scales are stored as reciprocal values.
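In sketch form, the renaming amounts to a suffix mapping over the tensor dict. A simplified stand-alone version (the real code wraps the scale tensors in lazy loaders and inverts them, which is omitted here; the example tensor names are hypothetical):

```python
def rename_compressed_tensors(tensors: dict) -> dict:
    """Map compressed-tensors NVFP4 names onto their ModelOpt equivalents."""
    out = {}
    for name, value in tensors.items():
        if name.endswith(".weight_packed"):
            out[name.removesuffix("_packed")] = value
        elif name.endswith(".weight_global_scale"):
            out[name.replace(".weight_global_scale", ".weight_scale_2")] = value
        elif name.endswith(".input_global_scale"):
            out[name.replace(".input_global_scale", ".input_scale")] = value
        else:
            out[name] = value
    return out

renamed = rename_compressed_tensors({
    "model.layers.0.mlp.up_proj.weight_packed": b"...",
    "model.layers.0.mlp.up_proj.weight_global_scale": 448.0,
    "model.layers.0.mlp.up_proj.input_global_scale": 6.0,
})
print(sorted(renamed))
```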