It appears that parameters set using `--hf_overrides` are ignored by the vllm backend, at least when certain parameters are supplied (i.e. `tensor_parallel_size` or `max_model_len`).

Is there any other way to provide runtime arguments to the vllm backend other than `--hf_overrides`? vllm includes separate `--tensor-parallel-size` and `--max-model-len` parameters that could be passed directly to the serve command.
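For reference, a direct invocation along these lines is what I have in mind; `--tensor-parallel-size` and `--max-model-len` are native `vllm serve` flags, and the model path below is just the bundle path copied from the backend args in the logs further down:

```sh
# Hypothetical direct invocation: pass the options as native vllm serve
# flags rather than tunnelling them through --hf_overrides.
vllm serve /models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model \
  --tensor-parallel-size 4 \
  --max-model-len 32768
```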
Steps to reproduce:

Purposefully using a large model that will not fit in an average GPU's memory.
docker model pull huggingface.co/qwen/qwen3-coder-next
docker model configure --hf_overrides '{"tensor_parallel_size": 4, "max_model_len": 32768}' huggingface.co/qwen/qwen3-coder-next
docker model logs -f 2>&1 | grep tensor_parallel_size

Alternatively, run the model and monitor the logs directly.

Error Logs:
time=2026-05-12T17:02:55.284Z level=INFO msg="backend args" backend=vLLM args="[serve /models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model --uds inference-runner-0.sock --chat-template /models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model/chat_template.jinja --max-model-len 262144 --hf-overrides {\"max_model_len\":32768,\"tensor_parallel_size\":4} --served-model-name sha256:f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33 huggingface.co/qwen/qwen3-coder-next]"
time=2026-05-12T17:03:02.298Z level=INFO msg="(APIServer pid=4743) INFO 05-12 17:03:02 [utils.py:233] non-default args: {'model_tag': '/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', 'chat_template': '/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model/chat_template.jinja', 'uds': 'inference-runner-0.sock', 'model': '/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', 'max_model_len': 262144, 'served_model_name': ['sha256:f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33', 'huggingface.co/qwen/qwen3-coder-next'], 'hf_overrides': {'max_model_len': 32768, 'tensor_parallel_size': 4}}"
time=2026-05-12T17:03:09.121Z level=INFO msg="(EngineCore pid=4912) INFO 05-12 17:03:09 [core.py:109] Initializing a V1 LLM engine (v0.20.0) with config: model='/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', speculative_config=None, tokenizer='/models/bundles/sha256/f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=sha256:f374f0d45337a1d329ec4d81e70e3adb72beb2affff0080181e33b939c654e33, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': 
<DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')"
Lines 1/2 show the `hf_overrides` being properly parsed, and line 3 shows both options being ignored (the engine config reports `tensor_parallel_size=1` and `max_seq_len=262144`).
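As a quick way to see the mismatch in one place, the same `docker model logs` command from the repro steps with a broader grep (field names taken from the log lines above) surfaces both the parsed overrides and the effective engine settings:

```sh
# Grep both the echoed hf_overrides (lines 1/2) and the resulting engine
# config (line 3) so the ignored options are visible side by side.
docker model logs -f 2>&1 | grep -E "hf_overrides|tensor_parallel_size|max_seq_len"
```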