
Unable to set vllm serve parameters using hf_overrides #912

@devdev-automation

Description


Parameters set using --hf_overrides appear to be ignored by the vLLM backend, at least when certain parameters are supplied (e.g. tensor_parallel_size or max_model_len).

Is there any other way to provide runtime arguments to the vLLM backend other than --hf_overrides? vLLM has dedicated flags, --tensor-parallel-size and --max-model-len, that could be passed directly to the serve command (see the sketch below).
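
For reference, this is roughly what the underlying invocation would look like if those values were passed as vLLM engine arguments instead of through --hf-overrides (a sketch only: the model path is a placeholder, and whether Docker Model Runner offers a supported way to inject these flags into the serve command it builds is exactly the open question here):

  vllm serve /path/to/model --tensor-parallel-size 4 --max-model-len 32768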

Steps to reproduce:

Purposefully using a large model that will not fit in an average GPU's memory.

  1. docker model pull huggingface.co/qwen/qwen3-coder-next
  2. docker model configure --hf_overrides '{"tensor_parallel_size": 4, "max_model_len": 32768}' huggingface.co/qwen/qwen3-coder-next
  3. Monitor logs as configure starts the model: docker model logs -f 2>&1 | grep tensor_parallel_size

Alternatively, run the model and monitor the logs directly:

  • docker model run huggingface.co/qwen/qwen3-coder-next

Error Logs:

The first two log lines show the hf_overrides being parsed correctly; the third line shows both options being ignored by the engine (tensor_parallel_size=1 and max_seq_len=262144 in the resulting config).

time=2026-05-12T17:02:55.284Z level=INFO msg="backend args" backend=vLLM args="[serve /models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model --uds inference-runner-0.sock --chat-template /models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model/chat_template.jinja --max-model-len 262144 --hf-overrides {\"max_model_len\":32768,\"tensor_parallel_size\":4} --served-model-name sha256:f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33 huggingface.co/qwen/qwen3-coder-next]"
time=2026-05-12T17:03:02.298Z level=INFO msg="(APIServer pid=4743) INFO 05-12 17:03:02 [utils.py:233] non-default args: {'model_tag': '/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', 'chat_template': '/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model/chat_template.jinja', 'uds': 'inference-runner-0.sock', 'model': '/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', 'max_model_len': 262144, 'served_model_name': ['sha256:f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33', 'huggingface.co/qwen/qwen3-coder-next'], 'hf_overrides': {'max_model_len': 32768, 'tensor_parallel_size': 4}}"
time=2026-05-12T17:03:09.121Z level=INFO msg="(EngineCore pid=4912) INFO 05-12 17:03:09 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', speculative_config=None, tokenizer='/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=sha256:f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': 
<DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')"
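
For anyone reproducing this, the effective settings are easiest to confirm from the EngineCore config line (the third log line above) rather than the APIServer args line; a grep along these lines (same approach as step 3, but matching the engine-side field names) should print tensor_parallel_size=1 and max_seq_len=262144 while the overrides are being ignored:

  docker model logs -f 2>&1 | grep -E 'tensor_parallel_size=|max_seq_len='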
