-
- ```shell
- $ git clone https://github.com/dstackai/dstack
- $ cd dstack
- ```
-
-
+Here's an example of a service that deploys
+`nvidia/Qwen3-235B-A22B-FP8` using TensorRT-LLM's `trtllm-serve` with the PyTorch backend.
-## Deployment
-
-### DeepSeek R1
-
-We normally use Triton with the TensorRT-LLM backend to serve models. While this works for the distilled Llama-based
-version, DeepSeek R1 isn’t yet compatible. So, for DeepSeek R1, we’ll use `trtllm-serve` with the PyTorch backend instead.
-
-To use `trtllm-serve`, we first need to build the TensorRT-LLM Docker image from the `main` branch.
-
-#### Build a Docker image
-
-Here’s the task config that builds the image and pushes it using the provided Docker credentials.
-
-
+
```yaml
-type: task
-name: build-image
+type: service
+name: qwen235
+
+image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc11
-privileged: true
-image: dstackai/dind
env:
- - DOCKER_USERNAME
- - DOCKER_PASSWORD
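+  # hf_transfer speeds up the initial model download from Hugging Face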
+ - HF_HUB_ENABLE_HF_TRANSFER=1
+
commands:
- - start-dockerd
- - apt update && apt-get install -y build-essential make git git-lfs
- - git lfs install
- - git clone https://github.com/NVIDIA/TensorRT-LLM.git
- - cd TensorRT-LLM
- - git submodule update --init --recursive
- - git lfs pull
- # Limit compilation to Hopper for a smaller image
- - make -C docker release_build CUDA_ARCHS="90-real"
- - docker tag tensorrt_llm/release:latest $DOCKER_USERNAME/tensorrt_llm:latest
- - echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin
- - docker push "$DOCKER_USERNAME/tensorrt_llm:latest"
+ - pip install hf_transfer
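+  # Start the OpenAI-compatible server, sharding the model across all run GPUs via tensor parallelism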
+ - |
+ trtllm-serve serve nvidia/Qwen3-235B-A22B-FP8 \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --backend pytorch \
+ --tp_size $DSTACK_GPUS_NUM \
+ --max_batch_size 32 \
+ --max_num_tokens 4096 \
+ --kv_cache_free_gpu_memory_fraction 0.75
+
+port: 8000
+model: nvidia/Qwen3-235B-A22B-FP8
+
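+# Persist the Hugging Face cache on the instance so restarts don't re-download the model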
+volumes:
+ - instance_path: /root/.cache
+ path: /root/.cache
+ optional: true
+
resources:
- cpu: 8
- disk: 500GB..
+ cpu: 96..
+ memory: 512GB..
+ shm_size: 32GB
+ disk: 1000GB..
+ gpu: H100:8
```
-To run it, pass the task configuration to `dstack apply`.
+Apply it with [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md):
```shell
-$ dstack apply -f examples/inference/trtllm/build-image.dstack.yml
-
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073
-
-Submit the run build-image? [y/n]: y
-
-Provisioning...
----> 100%
+$ dstack apply -f qwen235.dstack.yml
```
-
-#### Deploy the model
-
-Below is the service configuration that deploys DeepSeek R1 using the built TensorRT-LLM image.
-
-
-
- ```yaml
- type: service
- name: serve-r1
-
- # Specify the image built with `examples/inference/trtllm/build-image.dstack.yml`
- image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167
- env:
- - MAX_BATCH_SIZE=256
- - MAX_NUM_TOKENS=16384
- - MAX_SEQ_LENGTH=16384
- - EXPERT_PARALLEL=4
- - PIPELINE_PARALLEL=1
- - HF_HUB_ENABLE_HF_TRANSFER=1
- commands:
- - pip install -U "huggingface_hub[cli]"
- - pip install hf_transfer
- - huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir DeepSeek-R1
- - trtllm-serve
- --backend pytorch
- --max_batch_size $MAX_BATCH_SIZE
- --max_num_tokens $MAX_NUM_TOKENS
- --max_seq_len $MAX_SEQ_LENGTH
- --tp_size $DSTACK_GPUS_NUM
- --ep_size $EXPERT_PARALLEL
- --pp_size $PIPELINE_PARALLEL
- DeepSeek-R1
- port: 8000
- model: deepseek-ai/DeepSeek-R1
-
- resources:
- gpu: 8:H200
- shm_size: 32GB
- disk: 2000GB..
- ```
-
-
-
-To run it, pass the configuration to `dstack apply`.
-
-
-
-```shell
-$ dstack apply -f examples/inference/trtllm/serve-r1.dstack.yml
-
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62
-
-Submit the run serve-r1? [y/n]: y
-
-Provisioning...
----> 100%
-```
-
-
-
-### DeepSeek R1 Distill Llama 8B
-
-To deploy DeepSeek R1 Distill Llama 8B, follow the steps below.
-
-#### Convert and upload checkpoints
-
-Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format
-and uploads it to S3 using the provided AWS credentials.
-
-
-
- ```yaml
- type: task
- name: convert-model
-
- image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- env:
- - HF_TOKEN
- - MODEL_REPO=https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- - S3_BUCKET_NAME
- - AWS_ACCESS_KEY_ID
- - AWS_SECRET_ACCESS_KEY
- - AWS_DEFAULT_REGION
- commands:
- # nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0,
- # therefore we are using branch v0.17.0
- - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git
- - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git
- - git clone https://github.com/triton-inference-server/server.git
- - cd TensorRT-LLM/examples/llama
- - apt-get -y install git git-lfs
- - git lfs install
- - git config --global credential.helper store
- - huggingface-cli login --token $HF_TOKEN --add-to-git-credential
- - git clone $MODEL_REPO
- - python3 convert_checkpoint.py --model_dir DeepSeek-R1-Distill-Llama-8B --output_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --dtype bfloat16 --tp_size $DSTACK_GPUS_NUM
- # Download the AWS CLI
- - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
- - unzip awscliv2.zip
- - ./aws/install
- - aws s3 sync tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read
-
- resources:
- gpu: A100:40GB
-
- ```
-
-
-
-To run it, pass the configuration to `dstack apply`.
-
-
-
-```shell
-$ dstack apply -f examples/inference/trtllm/convert-model.dstack.yml
-
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
-
-Submit the run convert-model? [y/n]: y
-
-Provisioning...
----> 100%
-```
-
-
-
-#### Build and upload the model
-
-Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 with the provided AWS credentials.
-
-
-
- ```yaml
- type: task
- name: build-model
-
- image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- env:
- - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- - S3_BUCKET_NAME
- - AWS_ACCESS_KEY_ID
- - AWS_SECRET_ACCESS_KEY
- - AWS_DEFAULT_REGION
- - MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length
- - MAX_INPUT_LEN=4096
- - MAX_BATCH_SIZE=256
- - TRITON_MAX_BATCH_SIZE=1
- - INSTANCE_COUNT=1
- - MAX_QUEUE_DELAY_MS=0
- - MAX_QUEUE_SIZE=0
- - DECOUPLED_MODE=true # Set true for streaming
- commands:
- - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir
- - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
- - unzip awscliv2.zip
- - ./aws/install
- - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 ./tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16
- - trtllm-build --checkpoint_dir tllm_checkpoint_${DSTACK_GPUS_NUM}gpu_bf16 --gemm_plugin bfloat16 --output_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_seq_len $MAX_SEQ_LEN --max_input_len $MAX_INPUT_LEN --max_batch_size $MAX_BATCH_SIZE --gpt_attention_plugin bfloat16 --use_paged_context_fmha enable
- - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git
- - python3 TensorRT-LLM/examples/run.py --engine_dir tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --max_output_len 40 --tokenizer_dir tokenizer_dir --input_text "What is Deep Learning?"
- - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git
- - mkdir triton_model_repo
- - cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* triton_model_repo/
- - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:TYPE_BF16
- - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
- - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16,max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_BF16,logits_datatype:TYPE_BF16
- - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:tokenizer_dir,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
- - python3 tensorrtllm_backend/tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:TYPE_BF16
- - aws s3 sync triton_model_repo s3://${S3_BUCKET_NAME}/triton_model_repo --acl public-read
- - aws s3 sync tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 s3://${S3_BUCKET_NAME}/tllm_engine_${DSTACK_GPUS_NUM}gpu_bf16 --acl public-read
-
- resources:
- gpu: A100:40GB
- ```
-
-
-To run it, pass the configuration to `dstack apply`.
-
-
-
-```shell
-$ dstack apply -f examples/inference/trtllm/build-model.dstack.yml
-
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
-
-Submit the run build-model? [y/n]: y
-
-Provisioning...
----> 100%
-```
-
-
-#### Deploy the model
-
-Below is the service configuration that deploys DeepSeek R1 Distill Llama 8B.
-
-
-
-```yaml
- type: service
- name: serve-distill
-
- image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- env:
- - MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- - S3_BUCKET_NAME
- - AWS_ACCESS_KEY_ID
- - AWS_SECRET_ACCESS_KEY
- - AWS_DEFAULT_REGION
-
- commands:
- - huggingface-cli download $MODEL --exclude '*.safetensors' --local-dir tokenizer_dir
- - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
- - unzip awscliv2.zip
- - ./aws/install
- - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16
- - git clone https://github.com/triton-inference-server/server.git
- - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000
- port: 8000
- model: ensemble
-
- resources:
- gpu: A100:40GB
-
-```
-
-
-To run it, pass the configuration to `dstack apply`.
-
-
-
-```shell
-$ dstack apply -f examples/inference/trtllm/serve-distill.dstack.yml
-
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
-
-Submit the run serve-distill? [y/n]: y
-
-Provisioning...
----> 100%
-```
## Access the endpoint
@@ -335,38 +70,30 @@ If no gateway is created, the service endpoint will be available at `
```shell
-$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill/v1/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/qwen235/v1/chat/completions \
-X POST \
-H 'Authorization: Bearer <dstack token>' \
-H 'Content-Type: application/json' \
-d '{
- "model": "deepseek-ai/DeepSeek-R1",
+ "model": "nvidia/Qwen3-235B-A22B-FP8",
"messages": [
- {
- "role": "system",
- "content": "You are a helpful assistant."
- },
{
"role": "user",
- "content": "What is Deep Learning?"
+ "content": "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"
}
],
- "stream": true,
- "max_tokens": 128
+ "chat_template_kwargs": {"enable_thinking": true},
+ "max_tokens": 1024,
+ "temperature": 0.0
}'
```
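+
+The same OpenAI-compatible endpoint can also be queried from code. Below is a minimal sketch using the `openai` Python package (not part of this example's dependencies); the base URL and token are the same placeholders as in the `curl` command above:
+
+```python
+from openai import OpenAI
+
+# Placeholder base URL and token, matching the curl example above (no gateway configured)
+client = OpenAI(
+    base_url="http://127.0.0.1:3000/proxy/services/main/qwen235/v1",
+    api_key="<dstack token>",
+)
+
+completion = client.chat.completions.create(
+    model="nvidia/Qwen3-235B-A22B-FP8",
+    messages=[{"role": "user", "content": "What is Deep Learning?"}],
+    max_tokens=512,
+)
+print(completion.choices[0].message.content)
+```
+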
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill.