
Commit e38137d (parents: 860c924 + 7051ce6)

Committed by 白永斌

Merge branch 'model_register' of https://github.com/dsxsteven/vllm_splitPR into model_register

* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...

File tree: 526 files changed (+20833 / −29953 lines)


.buildkite/nightly-benchmarks/nightly-descriptions.md — 1 addition, 1 deletion

```diff
@@ -8,7 +8,7 @@ This benchmark aims to:
 
 Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
 
-Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
+Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
 
 ## Setup
```

.buildkite/release-pipeline.yaml — 4 additions, 12 deletions

```diff
@@ -1,24 +1,22 @@
 steps:
   # aarch64 + CUDA builds. PyTorch 2.8 aarch64 + CUDA wheel is only available on CUDA 12.9
   - label: "Build arm64 wheel - CUDA 12.9"
+    depends_on: ~
     id: build-wheel-arm64-cuda-12-9
     agents:
       queue: arm64_cpu_queue_postmerge
     commands:
       # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
       # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg VLLM_MAIN_CUDA_VERSION=12.9 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
       - "mkdir artifacts"
       - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
       - "bash .buildkite/scripts/upload-wheels.sh"
     env:
       DOCKER_BUILDKIT: "1"
 
-  - block: "Build CUDA 12.8 wheel"
-    key: block-build-cu128-wheel
-
   - label: "Build wheel - CUDA 12.8"
-    depends_on: block-build-cu128-wheel
+    depends_on: ~
     id: build-wheel-cuda-12-8
     agents:
       queue: cpu_queue_postmerge
@@ -30,12 +28,8 @@ steps:
     env:
       DOCKER_BUILDKIT: "1"
 
-  - block: "Build CUDA 12.6 wheel"
-    key: block-build-cu126-wheel
-    depends_on: ~
-
   - label: "Build wheel - CUDA 12.6"
-    depends_on: block-build-cu126-wheel
+    depends_on: ~
     id: build-wheel-cuda-12-6
     agents:
       queue: cpu_queue_postmerge
@@ -102,8 +96,6 @@ steps:
     depends_on:
       - create-multi-arch-manifest
       - build-wheel-cuda-12-8
-      - build-wheel-cuda-12-6
-      - build-wheel-cuda-12-9
     id: annotate-release-workflow
     agents:
       queue: cpu_queue_postmerge
```

.buildkite/scripts/annotate-release.sh — 22 additions, 7 deletions

```diff
@@ -14,18 +14,33 @@ buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
 To download the wheel:
 \`\`\`
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
+aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
+
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu126/vllm-${RELEASE_VERSION}+cu126-cp38-abi3-manylinux1_x86_64.whl .
-aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu118/vllm-${RELEASE_VERSION}+cu118-cp38-abi3-manylinux1_x86_64.whl .
+aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu129/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
 \`\`\`
 
 To download and upload the image:
 
 \`\`\`
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT} vllm/vllm-openai
-docker tag vllm/vllm-openai vllm/vllm-openai:latest
-docker tag vllm/vllm-openai vllm/vllm-openai:v${RELEASE_VERSION}
-docker push vllm/vllm-openai:latest
-docker push vllm/vllm-openai:v${RELEASE_VERSION}
+docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
+docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
+
+docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
+docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
+docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
+docker push vllm/vllm-openai:latest-x86_64
+docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
+
+docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
+docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
+docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
+docker push vllm/vllm-openai:latest-aarch64
+docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
+
+docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64 --amend
+docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 --amend
+docker manifest push vllm/vllm-openai:latest
+docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
 \`\`\`
 EOF
```
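The annotation now documents publishing one image per architecture and stitching the per-arch tags into multi-arch manifest lists with `docker manifest create`/`push`. A minimal sketch of sanity-checking the result after the pushes above (not part of the script; assumes a Docker CLI with manifest support):

```bash
# List the architectures advertised by the manifest list; both amd64
# (x86_64) and arm64 (aarch64) entries should appear.
docker manifest inspect vllm/vllm-openai:latest |
    grep -E '"architecture": "(amd64|arm64)"'

# Pulling the multi-arch tag then resolves to the image matching the
# host's architecture, so one tag serves both platforms.
docker pull vllm/vllm-openai:latest
```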

.buildkite/scripts/hardware_ci/run-cpu-test.sh — 0 additions, 1 deletion

```diff
@@ -66,7 +66,6 @@ function cpu_tests() {
 
   pytest -x -v -s tests/models/language/pooling -m cpu_model
   pytest -x -v -s tests/models/multimodal/generation \
-    --ignore=tests/models/multimodal/generation/test_mllama.py \
     --ignore=tests/models/multimodal/generation/test_pixtral.py \
     -m cpu_model"
```

.buildkite/test-pipeline.yaml — 31 additions, 35 deletions

```diff
@@ -46,24 +46,18 @@ steps:
   mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - vllm/
-  - tests/mq_llm_engine
-  - tests/async_engine
   - tests/test_inputs.py
   - tests/test_outputs.py
   - tests/multimodal
   - tests/utils_
-  - tests/worker
   - tests/standalone_tests/lazy_imports.py
   - tests/transformers_utils
   commands:
   - python3 standalone_tests/lazy_imports.py
-  - pytest -v -s mq_llm_engine # MQLLMEngine
-  - pytest -v -s async_engine # AsyncLLMEngine
   - pytest -v -s test_inputs.py
   - pytest -v -s test_outputs.py
   - pytest -v -s multimodal
   - pytest -v -s utils_ # Utils
-  - pytest -v -s worker # Worker
   - pytest -v -s transformers_utils # transformers_utils
 
 - label: Python-only Installation Test # 10min
@@ -84,25 +78,12 @@ steps:
   - vllm/
   - tests/basic_correctness/test_basic_correctness
   - tests/basic_correctness/test_cpu_offload
-  - tests/basic_correctness/test_preemption
   - tests/basic_correctness/test_cumem.py
   commands:
   - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -v -s basic_correctness/test_cumem.py
   - pytest -v -s basic_correctness/test_basic_correctness.py
   - pytest -v -s basic_correctness/test_cpu_offload.py
-  - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
-
-- label: Core Test # 22min
-  timeout_in_minutes: 35
-  mirror_hardwares: [amdexperimental]
-  fast_check: true
-  source_file_dependencies:
-  - vllm/core
-  - vllm/distributed
-  - tests/core
-  commands:
-  - pytest -v -s core
 
 - label: Entrypoints Unit Tests # 5min
   timeout_in_minutes: 10
@@ -230,16 +211,14 @@ steps:
   num_gpus: 2
   source_file_dependencies:
   - vllm/
-  - tests/metrics
   - tests/v1/tracing
   commands:
-  - pytest -v -s metrics
   - "pip install \
     'opentelemetry-sdk>=1.26.0' \
     'opentelemetry-api>=1.26.0' \
     'opentelemetry-exporter-otlp>=1.26.0' \
     'opentelemetry-semantic-conventions-ai>=0.4.1'"
-  - pytest -v -s tracing
+  - pytest -v -s v1/tracing
 
 ##### fast check tests #####
 ##### 1 GPU test #####
@@ -394,6 +373,7 @@ steps:
   - pytest -v -s compile/test_async_tp.py
   - pytest -v -s compile/test_fusion_all_reduce.py
   - pytest -v -s compile/test_decorator.py
+  - pytest -v -s compile/test_noop_elimination.py
 
 - label: PyTorch Fullgraph Smoke Test # 15min
   timeout_in_minutes: 30
@@ -548,15 +528,6 @@ steps:
   commands: # LMEval+Transcription WER check
   - pytest -s entrypoints/openai/correctness/
 
-- label: Encoder Decoder tests # 12min
-  timeout_in_minutes: 20
-  mirror_hardwares: [amdexperimental]
-  source_file_dependencies:
-  - vllm/
-  - tests/encoder_decoder
-  commands:
-  - pytest -v -s encoder_decoder
-
 - label: OpenAI-Compatible Tool Use # 23 min
   timeout_in_minutes: 35
   mirror_hardwares: [amdexperimental]
@@ -817,7 +788,7 @@ steps:
   # Quantization
   - pytest -v -s tests/kernels/quantization/test_cutlass_scaled_mm.py -k 'fp8'
   - pytest -v -s tests/kernels/quantization/test_nvfp4_quant.py
-  - pytest -v -s tests/kernels/quantization/test_silu_nvfp4_quant_fusion.py
+  - pytest -v -s tests/kernels/quantization/test_silu_mul_nvfp4_quant.py
   - pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
   - pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
   - pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
@@ -829,6 +800,20 @@ steps:
   - pytest -v -s tests/kernels/moe/test_flashinfer.py
   - pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
 
+- label: GPT-OSS Eval (Blackwell)
+  timeout_in_minutes: 60
+  working_dir: "/vllm-workspace/"
+  gpu: b200
+  optional: true # disable while debugging
+  source_file_dependencies:
+  - tests/evals/gpt_oss
+  - vllm/model_executor/models/gpt_oss.py
+  - vllm/model_executor/layers/quantization/mxfp4.py
+  - vllm/v1/attention/backends/flashinfer.py
+  commands:
+  - uv pip install --system 'gpt-oss[eval]==0.0.5'
+  - pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py --model openai/gpt-oss-20b --metric 0.58 --server-args '--tensor-parallel-size 2'
+
 ##### 1 GPU test #####
 ##### multi gpus test #####
 
@@ -954,7 +939,6 @@ steps:
   commands:
   - pytest -v -s distributed/test_pp_cudagraph.py
   - pytest -v -s distributed/test_pipeline_parallel.py
-  # - pytest -v -s distributed/test_context_parallel.py # TODO: enable it on Hopper runners or add triton MLA support
 
 - label: LoRA TP Test (Distributed) # 17 min
   timeout_in_minutes: 30
@@ -1028,9 +1012,21 @@ steps:
   - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
 
-- label: Qwen MoE EP Test # optional
+##### H200 test #####
+- label: Distrubted Tests (H200) # optional
   gpu: h200
   optional: true
+  working_dir: "/vllm-workspace/"
+  num_gpus: 2
+  commands:
+  - pytest -v -s tests/distributed/test_context_parallel.py
+  - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
+
+##### B200 test #####
+- label: Distributed Tests (B200) # optional
+  gpu: b200
+  optional: true
+  working_dir: "/vllm-workspace/"
   num_gpus: 2
   commands:
-  - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 /vllm-workspace/examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
+  - pytest -v -s tests/distributed/test_context_parallel.py
```
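One move above that affects local workflows: the tracing suite now lives under `tests/v1/tracing`, with its OpenTelemetry dependencies installed inline by the step. A minimal sketch of reproducing the relocated step locally (pins copied from the pipeline; running from the repository's `tests/` directory mirrors the CI invocation, which is an assumption about the step's working directory):

```bash
# Install the OpenTelemetry packages the CI step pins, then run the
# relocated tracing tests.
pip install 'opentelemetry-sdk>=1.26.0' 'opentelemetry-api>=1.26.0' \
    'opentelemetry-exporter-otlp>=1.26.0' \
    'opentelemetry-semantic-conventions-ai>=0.4.1'
pytest -v -s v1/tracing
```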

.coveragerc — 32 additions, 0 deletions

```diff
@@ -0,0 +1,32 @@
+[run]
+source = vllm
+omit =
+    */tests/*
+    */test_*
+    */__pycache__/*
+    */build/*
+    */dist/*
+    */vllm.egg-info/*
+    */third_party/*
+    */examples/*
+    */benchmarks/*
+    */docs/*
+
+[report]
+exclude_lines =
+    pragma: no cover
+    def __repr__
+    if self.debug:
+    if settings.DEBUG
+    raise AssertionError
+    raise NotImplementedError
+    if 0:
+    if __name__ == .__main__.:
+    class .*\bProtocol\):
+    @(abc\.)?abstractmethod
+
+[html]
+directory = htmlcov
+
+[xml]
+output = coverage.xml
```
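The new `.coveragerc` is a standard coverage.py configuration: `[run]` scopes measurement to the `vllm` package and omits tests, build artifacts, examples, and docs; `[report] exclude_lines` keeps defensive and abstract code (e.g. `raise NotImplementedError`, `@abstractmethod`) out of the report; and `[html]`/`[xml]` fix the output locations. A minimal usage sketch (the test selection is illustrative; assumes coverage.py is installed):

```bash
# coverage.py picks up .coveragerc from the working directory automatically.
python -m coverage run -m pytest tests/v1/core -q  # measure per [run]
python -m coverage report   # terminal summary, filtered per [report]
python -m coverage html     # writes htmlcov/ per [html]
python -m coverage xml      # writes coverage.xml per [xml]
```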

.github/CODEOWNERS — 17 additions, 8 deletions

```diff
@@ -2,24 +2,27 @@
 # for more info about CODEOWNERS file
 
 # This lists cover the "core" components of vLLM that require careful review
+/vllm/attention @LucasWilkinson
 /vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
 /vllm/core @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
 /vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
 /vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
 /vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
 /vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
+/vllm/model_executor/layers/fused_moe @mgoin
 /vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @NickLucche
 /vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256
 /vllm/model_executor/layers/mamba @tdoublep
 /vllm/model_executor/model_loader @22quinn
 /vllm/multimodal @DarkLight1337 @ywang96 @NickLucche
+/vllm/v1/attention @LucasWilkinson
 /vllm/v1/sample @22quinn @houseroad
 /vllm/vllm_flash_attn @LucasWilkinson
 /vllm/lora @jeejeelee
 /vllm/reasoning @aarnphm @chaunceyjiang
 /vllm/entrypoints @aarnphm @chaunceyjiang
 /vllm/compilation @zou3519 @youkaichao @ProExpertProg
-/vllm/distributed/kv_transfer @NickLucche
+/vllm/distributed/kv_transfer @NickLucche @ApostaC
 CMakeLists.txt @tlrmchlsmth @LucasWilkinson
 
 # Any change to the VllmConfig changes can have a large user-facing impact,
@@ -30,30 +33,33 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
 /vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
 /vllm/v1/structured_output @mgoin @russellb @aarnphm @benchislett
 /vllm/v1/spec_decode @benchislett @luccafong
+/vllm/v1/attention/backends/flashinfer.py @mgoin
 /vllm/v1/attention/backends/triton_attn.py @tdoublep
-/vllm/v1/core @heheda12345
+/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat @heheda12345 @ApostaC
 /vllm/v1/kv_cache_interface.py @heheda12345
+/vllm/v1/offloading @ApostaC
 
 # Test ownership
 /.buildkite/lm-eval-harness @mgoin @simon-mo
-/tests/async_engine @njhill @robertgshaw2-redhat @simon-mo
 /tests/distributed/test_multi_node_assignment.py @youkaichao
 /tests/distributed/test_pipeline_parallel.py @youkaichao
 /tests/distributed/test_same_node.py @youkaichao
 /tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo @aarnphm @NickLucche
-/tests/kernels @tlrmchlsmth @WoosukKwon @yewentao256
+/tests/evals @mgoin
+/tests/kernels @mgoin @tlrmchlsmth @WoosukKwon @yewentao256
 /tests/models @DarkLight1337 @ywang96
 /tests/multimodal @DarkLight1337 @ywang96 @NickLucche
-/tests/prefix_caching @comaniac @KuntaiDu
 /tests/quantization @mgoin @robertgshaw2-redhat @yewentao256
 /tests/test_inputs.py @DarkLight1337 @ywang96
 /tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
 /tests/v1/structured_output @mgoin @russellb @aarnphm
-/tests/v1/core @heheda12345
+/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat @heheda12345 @ApostaC
 /tests/weight_loading @mgoin @youkaichao @yewentao256
 /tests/lora @jeejeelee
 /tests/models/language/generation/test_hybrid.py @tdoublep
-/tests/v1/kv_connector/nixl_integration @NickLucche
+/tests/v1/kv_connector/nixl_integration @NickLucche
+/tests/v1/kv_connector @ApostaC
+/tests/v1/offloading @ApostaC
 
 # Docs
 /docs @hmellor
@@ -101,4 +107,7 @@ mkdocs.yaml @hmellor
 /vllm/v1/worker/tpu* @NickLucche
 /vllm/platforms/tpu.py @NickLucche
 /vllm/v1/sample/tpu @NickLucche
-/vllm/tests/v1/tpu @NickLucche
+/vllm/tests/v1/tpu @NickLucche
+
+# KVConnector installation files
+/requirements/kv_connectors.txt @NickLucche
```