Tooling for converting BF16/FP16 HuggingFace models to NVFP4 (NVIDIA's 4-bit float format with native Blackwell tensor-core support).
~/nvfp4_conversion/
├── venv/ # dedicated venv (modelopt 0.43, torch 2.11+cu130)
├── scripts/
│ └── convert_to_nvfp4.py # the conversion CLI
├── logs/ # one log per conversion run
└── README.md # this file
Source model must be a local HF directory (BF16/FP16, NOT pre-quantized).
~/nvfp4_conversion/venv/bin/python ~/nvfp4_conversion/scripts/convert_to_nvfp4.py \
--source ~/vLLM_Servers/models_awq/Qwen3-4B \
--output ~/vLLM_Servers/models_awq/Qwen3-4B-NVFP4 \
--calib-samples 256
Pre-flight checks fail loudly if the source is already quantized, the output already exists, the disk is too full, or GPU memory is too tight.
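To make the pre-flight idea concrete, here is a minimal sketch of three of those checks. It is not the code in convert_to_nvfp4.py: the function name `preflight`, the `min_free_gb` threshold, and the use of a `quantization_config` key in config.json as the "already quantized" signal are all assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path


def preflight(source: Path, output: Path, min_free_gb: float = 20.0) -> None:
    """Raise a descriptive error if a conversion should not start (hypothetical helper)."""
    config_path = source / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"{source} has no config.json; not a local HF model directory")

    # Assumption: a pre-quantized HF checkpoint records this key in config.json.
    config = json.loads(config_path.read_text())
    if "quantization_config" in config:
        raise ValueError(f"{source} is already quantized; expected BF16/FP16 weights")

    if output.exists():
        raise FileExistsError(f"{output} already exists; refusing to overwrite")

    # Free space on the filesystem that will hold the output (parent must exist).
    free_gb = shutil.disk_usage(output.parent).free / 1e9
    if free_gb < min_free_gb:
        raise OSError(f"only {free_gb:.1f} GB free under {output.parent}; need {min_free_gb} GB")
```

A GPU-memory check is left out to keep the sketch free of third-party imports; with torch installed, `torch.cuda.mem_get_info()` returns free and total bytes for the current device and could be compared against a threshold in the same way.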
Use the start-nvfp4.sh wrapper (in ~/bin/), which bakes in the curand
header path and the Blackwell PCIe TP=2 fixes:
~/bin/start-nvfp4.sh ~/vLLM_Servers/models_awq/Qwen3-4B-NVFP4
Defaults: port 8011, TP=1, util 0.30, max_len 4096. Override via flags.
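For readers who want to see roughly what such a wrapper has to assemble, the sketch below builds an equivalent launch in Python. It is an illustration, not a transcription of start-nvfp4.sh: the helper name `build_launch` is hypothetical, the mapping of the defaults onto `vllm serve` flags is an assumption, the include directory is assumed to sit under the active environment's site-packages, and the TP=2 fixes are not reproduced because their details are not given here.

```python
import os
import sysconfig
from pathlib import Path


def build_launch(model_dir: str, port: int = 8011, tp: int = 1,
                 util: float = 0.30, max_len: int = 4096):
    """Return (argv, env) for serving an NVFP4 model with the defaults listed above."""
    # Assumed location of the pip-installed CUDA headers in the active environment.
    include_dir = Path(sysconfig.get_paths()["purelib"]) / "nvidia" / "cu13" / "include"

    env = dict(os.environ)
    # nvcc prepends these flags to every invocation, so JIT builds can find curand_kernel.h.
    env["NVCC_PREPEND_FLAGS"] = f"-I{include_dir}"

    argv = [
        "vllm", "serve", os.path.expanduser(model_dir),
        "--port", str(port),
        "--tensor-parallel-size", str(tp),
        "--gpu-memory-utilization", str(util),
        "--max-model-len", str(max_len),
    ]
    return argv, env
```

The pair can be handed to `subprocess.run(argv, env=env, check=True)`. Returning the command instead of running it keeps the helper easy to inspect and test.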
modelopt pulls torch 2.11+cu130, transformers 4.57, deepspeed, and
~5 GB of dependencies. Keeping it isolated from ~/vLLM_Servers/vllm_env
prevents accidental upgrades that would break the production vLLM.
- flashinfer JIT compile fails on curand_kernel.h: No such file. System CUDA 13 install is missing curand-dev headers. The wrapper script start-nvfp4.sh works around it via NVCC_PREPEND_FLAGS pointing at the pip-installed nvidia/cu13/include/.
- First vLLM launch on any NVFP4 model takes ~2 min for the JIT compile of FP4 GEMM kernels for sm_120. Cached after first run.
- Cosmetic teardown error at end of conversion (Fatal Python error: PyGILState_Release). Harmless: it happens after the export has already succeeded. Output is intact.
- "torch_dtype is deprecated" warning during model load: newer transformers wants dtype= instead. Cosmetic, will fix in a future pass.
- Architectures supported: Qwen2 / Qwen2.5 / Qwen3 families, plus LlamaForCausalLM / MistralForCausalLM (for the latter two the code path is written but untested; testing them is phase 2 work).
- Calibration dataset: cnn_dailymail (open, no auth needed). Override via --calib-dataset.
- Default 256 calibration samples (decent quality, ~30-60s extra).
- Phase 2: test 14B-class models, add architecture auto-detection.
- Phase 3: validate quantized output against original (perplexity / MMLU diff to catch regressions).
- Phase 4: web UI in the Control Center for click-to-convert.
- Phase 5: handle larger models that need TP=2 just to load BF16 weights (e.g., Qwen2.5-72B BF16 = 144 GB → split across both GPUs).
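The 144 GB figure for a 72B model follows from BF16 using 2 bytes per parameter. A small sketch of that arithmetic, and of the tensor-parallel degree it implies, is below. The function names are hypothetical, the 96 GB per-GPU capacity in the usage note is an assumed value, and the estimate covers weights only (no activations, calibration buffers, or KV cache).

```python
import math


def bf16_weight_gb(params_billion: float) -> float:
    """BF16 stores 2 bytes per parameter, so N billion parameters need about 2*N GB."""
    return params_billion * 2.0


def min_tensor_parallel(params_billion: float, gpu_mem_gb: float) -> int:
    """Smallest power-of-two TP degree whose per-GPU weight shard fits in gpu_mem_gb."""
    shards = max(1, math.ceil(bf16_weight_gb(params_billion) / gpu_mem_gb))
    # Round up to the next power of two, since TP degrees are usually 1, 2, 4, 8.
    return 1 << (shards - 1).bit_length()
```

With an assumed 96 GB per GPU, `min_tensor_parallel(72, 96)` gives 2, while a 4B model (about 8 GB of weights) stays at 1. Real headroom requirements are higher than this weights-only bound.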