-
- ```yaml
- type: service
- name: tgi
-
- image: ghcr.io/huggingface/tgi-gaudi:2.3.1
- env:
- - HF_TOKEN
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- - PORT=8000
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
- - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
- - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
- - MAX_TOTAL_TOKENS=2048
- - BATCH_BUCKET_SIZE=256
- - PREFILL_BATCH_BUCKET_SIZE=4
- - PAD_SEQUENCE_TO_MULTIPLE_OF=64
- - ENABLE_HPU_GRAPH=true
- - LIMIT_HPU_GRAPH=true
- - USE_FLASH_ATTENTION=true
- - FLASH_ATTENTION_RECOMPUTE=true
- commands:
- - text-generation-launcher
- --sharded true
- --num-shard $DSTACK_GPUS_NUM
- --max-input-length 1024
- --max-total-tokens 2048
- --max-batch-prefill-tokens 4096
- --max-batch-total-tokens 524288
- --max-waiting-tokens 7
- --waiting-served-ratio 1.2
- --max-concurrent-requests 512
- port: 8000
- model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
-
- resources:
- gpu: gaudi2:8
-
- # Uncomment to cache downloaded models
- #volumes:
- # - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
- ```
-
-
-
-=== "vLLM"
-
-
-
- ```yaml
- type: service
- name: deepseek-r1-gaudi
-
- image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
- env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- - HABANA_VISIBLE_DEVICES=all
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
- commands:
- - git clone https://github.com/HabanaAI/vllm-fork.git
- - cd vllm-fork
- - git checkout habana_main
- - pip install -r requirements-hpu.txt
- - python setup.py develop
- - vllm serve $MODEL_ID
- --tensor-parallel-size 8
- --trust-remote-code
- --download-dir /data
- port: 8000
- model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
-
-
- resources:
- gpu: gaudi2:8
-
- # Uncomment to cache downloaded models
- #volumes:
- # - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
- ```
-
-
-
-## Fine-tuning
-
-Below is an example of LoRA fine-tuning of [`DeepSeek-R1-Distill-Qwen-7B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
-using [Optimum for Intel Gaudi](https://github.com/huggingface/optimum-habana)
-and [DeepSpeed](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide/DeepSpeed_User_Guide.html#deepspeed-user-guide) with
-the [`lvwerra/stack-exchange-paired`](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) dataset.
-
-
-
-```yaml
-type: task
-name: trl-train
-
-image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
-env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- - WANDB_API_KEY
- - WANDB_PROJECT
-commands:
- - pip install --upgrade-strategy eager optimum[habana]
- - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
- - git clone https://github.com/huggingface/optimum-habana.git
- - cd optimum-habana/examples/trl
- - pip install -r requirements.txt
- - pip install wandb
- - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
- --model_name_or_path $MODEL_ID
- --dataset_name "lvwerra/stack-exchange-paired"
- --deepspeed ../language-modeling/llama2_ds_zero3_config.json
- --output_dir="./sft"
- --do_train
- --max_steps=500
- --logging_steps=10
- --save_steps=100
- --per_device_train_batch_size=1
- --per_device_eval_batch_size=1
- --gradient_accumulation_steps=2
- --learning_rate=1e-4
- --lr_scheduler_type="cosine"
- --warmup_steps=100
- --weight_decay=0.05
- --optim="paged_adamw_32bit"
- --lora_target_modules "q_proj" "v_proj"
- --bf16
- --remove_unused_columns=False
- --run_name="sft_deepseek_70"
- --report_to="wandb"
- --use_habana
- --use_lazy_mode
-
-resources:
- gpu: gaudi2:8
-```
-
-
-
-To fine-tune `DeepSeek-R1-Distill-Llama-70B` with eight Gaudi 2 accelerators,
-you can partially offload parameters to CPU memory using the DeepSpeed configuration file.
-For more details, refer to [parameter offloading](https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeedzerooffloadparamconfig).
-
-## Applying a configuration
-
-Once the configuration is ready, run `dstack apply -f `.
-
-
-
-```shell
-$ dstack apply -f examples/inference/vllm/.dstack.yml
-
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 ssh remote 152xCPU,1007GB,8xGaudi2:96GB yes $0 idle
-
-Submit a new run? [y/n]: y
-
-Provisioning...
----> 100%
-```
-
-
-
-## Source code
-
-The source code for this example can be found in
-[`examples/llms/deepseek/tgi/intel`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/tgi/intel),
-[`examples/llms/deepseek/vllm/intel`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/vllm/intel) and
-[`examples/llms/deepseek/trl/intel`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/trl/intel).
-
-!!! info "What's next?"
- 1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), and [services](https://dstack.ai/docs/services).
- 2. See also [Intel Gaudi Documentation](https://docs.habana.ai/en/latest/index.html), [vLLM Inference with Gaudi](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/vLLM_Inference.html)
- and [Optimum for Gaudi examples](https://github.com/huggingface/optimum-habana/blob/main/examples/trl/README.md).
diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md
index 41d73e9e9..ae467891f 100644
--- a/examples/llms/deepseek/README.md
+++ b/examples/llms/deepseek/README.md
@@ -78,95 +78,6 @@ Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` usin
Note, when using `Deepseek-R1-Distill-Llama-70B` with `vLLM` on a 192GB GPU, we must limit the context size to 126432 tokens to fit in memory.
-### Intel Gaudi
-
-Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B`
-using [TGI on Gaudi](https://github.com/huggingface/tgi-gaudi)
-and [vLLM](https://github.com/HabanaAI/vllm-fork) (Gaudi fork) with Intel Gaudi 2.
-
-> Neither [TGI on Gaudi](https://github.com/huggingface/tgi-gaudi)
-> nor [vLLM](https://github.com/HabanaAI/vllm-fork) supports `Deepseek-V2-Lite`.
-> See [this](https://github.com/huggingface/tgi-gaudi/issues/271)
-> and [this](https://github.com/HabanaAI/vllm-fork/issues/809#issuecomment-2652454824) issue.
-
-=== "TGI"
-
-
- ```yaml
- type: service
-
- name: tgi
-
- image: ghcr.io/huggingface/tgi-gaudi:2.3.1
-
- auth: false
- port: 8000
-
- model: DeepSeek-R1-Distill-Llama-70B
-
- env:
- - HF_TOKEN
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- - PORT=8000
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
- - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
- - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
- - MAX_TOTAL_TOKENS=2048
- - BATCH_BUCKET_SIZE=256
- - PREFILL_BATCH_BUCKET_SIZE=4
- - PAD_SEQUENCE_TO_MULTIPLE_OF=64
- - ENABLE_HPU_GRAPH=true
- - LIMIT_HPU_GRAPH=true
- - USE_FLASH_ATTENTION=true
- - FLASH_ATTENTION_RECOMPUTE=true
-
- commands:
- - text-generation-launcher
- --sharded true
- --num-shard 8
- --max-input-length 1024
- --max-total-tokens 2048
- --max-batch-prefill-tokens 4096
- --max-batch-total-tokens 524288
- --max-waiting-tokens 7
- --waiting-served-ratio 1.2
- --max-concurrent-requests 512
-
- resources:
- gpu: Gaudi2:8
- ```
-
-
-=== "vLLM"
-
-
- ```yaml
- type: service
- name: deepseek-r1-gaudi
-
- image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
-
-
- env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- - HABANA_VISIBLE_DEVICES=all
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
-
- commands:
- - git clone https://github.com/HabanaAI/vllm-fork.git
- - cd vllm-fork
- - git checkout habana_main
- - pip install -r requirements-hpu.txt
- - python setup.py develop
- - vllm serve $MODEL_ID
- --tensor-parallel-size 8
- --trust-remote-code
- --download-dir /data
-
- port: 8000
- ```
-
-
### NVIDIA
Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-8B`
@@ -241,7 +152,7 @@ Approximate memory requirements for loading the model (excluding context and CUD
| `DeepSeek-R1-Distill-Qwen` | **7B** | 16GB | 8GB | 4GB |
For example, the FP8 version of Deepseek-R1 671B fits on a single node of MI300X with eight 192GB GPUs, a single node of
-H200 with eight 141GB GPUs, or a single node of Intel Gaudi2 with eight 96GB GPUs.
+H200 with eight 141GB GPUs.
### Applying the configuration
@@ -400,65 +311,6 @@ Here are the examples of LoRA fine-tuning of `Deepseek-V2-Lite` and GRPO fine-tu
Note, the `GRPO` fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` consumes up to 135GB of VRAM.
-### Intel Gaudi
-
-Here is an example of LoRA fine-tuning of `DeepSeek-R1-Distill-Qwen-7B` on Intel Gaudi 2 GPUs using
-HuggingFace's [Optimum for Intel Gaudi](https://github.com/huggingface/optimum-habana)
-and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed). Both also support `LoRA`
-fine-tuning of `Deepseek-V2-Lite` with the same configuration as below.
-
-=== "LoRA"
-
-
- ```yaml
- type: task
- name: trl-train
-
- image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
-
- env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- - WANDB_API_KEY
- - WANDB_PROJECT
- commands:
- - pip install --upgrade-strategy eager optimum[habana]
- - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
- - git clone https://github.com/huggingface/optimum-habana.git
- - cd optimum-habana/examples/trl
- - pip install -r requirements.txt
- - pip install wandb
- - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py
- --model_name_or_path $MODEL_ID
- --dataset_name "lvwerra/stack-exchange-paired"
- --deepspeed ../language-modeling/llama2_ds_zero3_config.json
- --output_dir="./sft"
- --do_train
- --max_steps=500
- --logging_steps=10
- --save_steps=100
- --per_device_train_batch_size=1
- --per_device_eval_batch_size=1
- --gradient_accumulation_steps=2
- --learning_rate=1e-4
- --lr_scheduler_type="cosine"
- --warmup_steps=100
- --weight_decay=0.05
- --optim="paged_adamw_32bit"
- --lora_target_modules "q_proj" "v_proj"
- --bf16
- --remove_unused_columns=False
- --run_name="sft_deepseek_70"
- --report_to="wandb"
- --use_habana
- --use_lazy_mode
-
- resources:
- gpu: gaudi2:8
- ```
-
-
-
-
### NVIDIA
Here are examples of LoRA fine-tuning of `DeepSeek-R1-Distill-Qwen-1.5B` and QLoRA fine-tuning of `DeepSeek-V2-Lite`
diff --git a/examples/llms/deepseek/tgi/intel/.dstack.yml b/examples/llms/deepseek/tgi/intel/.dstack.yml
deleted file mode 100644
index 16d083092..000000000
--- a/examples/llms/deepseek/tgi/intel/.dstack.yml
+++ /dev/null
@@ -1,45 +0,0 @@
-type: service
-
-name: tgi
-
-image: ghcr.io/huggingface/tgi-gaudi:2.3.1
-
-auth: false
-port: 8000
-
-model: DeepSeek-R1-Distill-Llama-70B
-
-env:
- - HF_TOKEN
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- - PORT=8000
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
- - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
- - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
- - MAX_TOTAL_TOKENS=2048
- - BATCH_BUCKET_SIZE=256
- - PREFILL_BATCH_BUCKET_SIZE=4
- - PAD_SEQUENCE_TO_MULTIPLE_OF=64
- - ENABLE_HPU_GRAPH=true
- - LIMIT_HPU_GRAPH=true
- - USE_FLASH_ATTENTION=true
- - FLASH_ATTENTION_RECOMPUTE=true
-
-commands:
- - text-generation-launcher
- --sharded true
- --num-shard 8
- --max-input-length 1024
- --max-total-tokens 2048
- --max-batch-prefill-tokens 4096
- --max-batch-total-tokens 524288
- --max-waiting-tokens 7
- --waiting-served-ratio 1.2
- --max-concurrent-requests 512
-
-resources:
- gpu: Gaudi2:8
-
-# Uncomment to cache downloaded models
-#volumes:
-# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
diff --git a/examples/llms/deepseek/trl/intel/.dstack.yml b/examples/llms/deepseek/trl/intel/.dstack.yml
deleted file mode 100644
index 9963e4844..000000000
--- a/examples/llms/deepseek/trl/intel/.dstack.yml
+++ /dev/null
@@ -1,46 +0,0 @@
-type: task
-# The name is optional; if not specified, it's generated randomly
-name: trl-train
-
-image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
-
-# Required environment variables
-env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- - WANDB_API_KEY
- - WANDB_PROJECT
-# Commands of the task
-commands:
- - pip install --upgrade-strategy eager optimum[habana]
- - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
- - git clone https://github.com/huggingface/optimum-habana.git
- - cd optimum-habana/examples/trl
- - pip install -r requirements.txt
- - pip install wandb
- - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py
- --model_name_or_path $MODEL_ID
- --dataset_name "lvwerra/stack-exchange-paired"
- --deepspeed ../language-modeling/llama2_ds_zero3_config.json
- --output_dir="./sft"
- --do_train
- --max_steps=500
- --logging_steps=10
- --save_steps=100
- --per_device_train_batch_size=1
- --per_device_eval_batch_size=1
- --gradient_accumulation_steps=2
- --learning_rate=1e-4
- --lr_scheduler_type="cosine"
- --warmup_steps=100
- --weight_decay=0.05
- --optim="paged_adamw_32bit"
- --lora_target_modules "q_proj" "v_proj"
- --bf16
- --remove_unused_columns=False
- --run_name="sft_deepseek_70"
- --report_to="wandb"
- --use_habana
- --use_lazy_mode
-
-resources:
- gpu: gaudi2:8
diff --git a/examples/llms/deepseek/trl/intel/deepseek_v2.dstack.yml b/examples/llms/deepseek/trl/intel/deepseek_v2.dstack.yml
deleted file mode 100644
index 7aa13d677..000000000
--- a/examples/llms/deepseek/trl/intel/deepseek_v2.dstack.yml
+++ /dev/null
@@ -1,45 +0,0 @@
-type: task
-# The name is optional; if not specified, it's generated randomly
-name: trl-train-deepseek-v2-lite
-
-image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
-
-# Required environment variables
-env:
- - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite
- - WANDB_API_KEY
- - WANDB_PROJECT
-# Commands of the task
-commands:
- - pip install git+https://github.com/huggingface/optimum-habana.git
- - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
- - git clone https://github.com/huggingface/optimum-habana.git
- - cd optimum-habana/examples/trl
- - pip install -r requirements.txt
- - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py
- --model_name_or_path $MODEL_ID
- --dataset_name "lvwerra/stack-exchange-paired"
- --deepspeed ../language-modeling/llama2_ds_zero3_config.json
- --output_dir="./sft"
- --do_train
- --max_steps=500
- --logging_steps=10
- --save_steps=100
- --per_device_train_batch_size=1
- --per_device_eval_batch_size=1
- --gradient_accumulation_steps=2
- --learning_rate=1e-4
- --lr_scheduler_type="cosine"
- --warmup_steps=100
- --weight_decay=0.05
- --optim="paged_adamw_32bit"
- --lora_target_modules "q_proj" "v_proj"
- --bf16
- --remove_unused_columns=False
- --run_name="sft_deepseek_v2lite"
- --report_to="wandb"
- --use_habana
- --use_lazy_mode
-
-resources:
- gpu: gaudi2:8
diff --git a/examples/llms/deepseek/vllm/intel/.dstack.yml b/examples/llms/deepseek/vllm/intel/.dstack.yml
deleted file mode 100644
index d28a0152d..000000000
--- a/examples/llms/deepseek/vllm/intel/.dstack.yml
+++ /dev/null
@@ -1,31 +0,0 @@
-type: service
-name: deepseek-r1-gaudi
-
-image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
-
-env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- - HABANA_VISIBLE_DEVICES=all
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
-
-commands:
- - git clone https://github.com/HabanaAI/vllm-fork.git
- - cd vllm-fork
- - git checkout habana_main
- - pip install -r requirements-hpu.txt
- - python setup.py develop
- - vllm serve $MODEL_ID
- --tensor-parallel-size 8
- --trust-remote-code
- --download-dir /data
-
-port: 8000
-
-model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
-
-resources:
- gpu: gaudi2:8
-
-# Uncomment to cache downloaded models
-#volumes:
-# - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
diff --git a/mkdocs.yml b/mkdocs.yml
index 3fcc531f2..d51482aac 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -310,7 +310,6 @@ nav:
- Accelerators:
- AMD: examples/accelerators/amd/index.md
- TPU: examples/accelerators/tpu/index.md
- - Intel Gaudi: examples/accelerators/intel/index.md
- Tenstorrent: examples/accelerators/tenstorrent/index.md
- Models:
- Wan2.2: examples/models/wan22/index.md
From fac3c2368a36344cdac33dd7db378ec1dc6f3f67 Mon Sep 17 00:00:00 2001
From: Andrey Cheptsov
Date: Sun, 12 Apr 2026 14:28:05 +0200
Subject: [PATCH 4/7] Make the `Examples` page a tree
---
docs/examples.md | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/docs/examples.md b/docs/examples.md
index baa5945a8..a9f840979 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -3,16 +3,16 @@ title: Examples
description: Collection of examples for training, inference, and clusters
#template: examples.html
hide:
- - navigation
-# - toc
- - footer
+# - navigation
+ - toc
+# - footer
---
-
+ -->
## Single-node training
From 7cfcc5d71edb1f271521c1807b532b7a546c4de8 Mon Sep 17 00:00:00 2001
From: Andrey Cheptsov
Date: Sun, 12 Apr 2026 14:33:59 +0200
Subject: [PATCH 5/7] Updated `Disable flags` in `contributing/DOCS.md`
---
contributing/DOCS.md | 13 +++++++++++++
scripts/docs/gen_cli_reference.py | 7 +++----
scripts/docs/gen_openapi_reference.py | 9 +++++++++
scripts/docs/gen_rest_plugin_spec_reference.py | 5 +++++
scripts/docs/hooks.py | 17 +++++++++++++++++
5 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/contributing/DOCS.md b/contributing/DOCS.md
index a885c5f51..4fcc04d6d 100644
--- a/contributing/DOCS.md
+++ b/contributing/DOCS.md
@@ -52,6 +52,19 @@ uv run mkdocs build -s
The documentation uses a custom build system with MkDocs hooks to generate various files dynamically.
+### Disable flags
+
+Use these in `.envrc` to disable expensive docs regeneration, especially during `mkdocs serve` auto-reload. Set any of them to disable the corresponding artifact.
+
+```shell
+export DSTACK_DOCS_DISABLE_EXAMPLES=1
+export DSTACK_DOCS_DISABLE_LLM_TXT=1
+export DSTACK_DOCS_DISABLE_CLI_REFERENCE=1
+export DSTACK_DOCS_DISABLE_YAML_SCHEMAS=1
+export DSTACK_DOCS_DISABLE_OPENAPI_REFERENCE=1
+export DSTACK_DOCS_DISABLE_REST_PLUGIN_SPEC_REFERENCE=1
+```
+
### Build hooks
The build process is customized via hooks in `scripts/docs/hooks.py`:
diff --git a/scripts/docs/gen_cli_reference.py b/scripts/docs/gen_cli_reference.py
index 04db41df4..b72f48d1f 100644
--- a/scripts/docs/gen_cli_reference.py
+++ b/scripts/docs/gen_cli_reference.py
@@ -22,9 +22,6 @@
DISABLE_ENV = "DSTACK_DOCS_DISABLE_CLI_REFERENCE"
-logger.info("Generating CLI reference...")
-
-
@cache # TODO make caching work
def call_dstack(command: str) -> str:
return subprocess.check_output(shlex.split(command)).decode()
@@ -59,8 +56,10 @@ def process_file(file: File):
def main():
if os.environ.get(DISABLE_ENV):
- logger.warning(f"CLI reference generation is disabled: {DISABLE_ENV} is set")
+ logger.warning("CLI reference generation is disabled")
exit()
+
+ logger.info("Generating CLI reference...")
# Sequential processing take > 10s
with concurrent.futures.ThreadPoolExecutor() as pool:
futures = []
diff --git a/scripts/docs/gen_openapi_reference.py b/scripts/docs/gen_openapi_reference.py
index bb3a3d42f..847bf74c4 100644
--- a/scripts/docs/gen_openapi_reference.py
+++ b/scripts/docs/gen_openapi_reference.py
@@ -3,11 +3,20 @@
"""
import json
+import logging
+import os
from pathlib import Path
from dstack._internal.server.main import app
from dstack._internal.settings import DSTACK_VERSION
+disable_env = "DSTACK_DOCS_DISABLE_OPENAPI_REFERENCE"
+if os.environ.get(disable_env):
+ logging.getLogger("mkdocs.plugins.dstack.openapi").warning(
+ "OpenAPI reference generation is disabled"
+ )
+ exit(0)
+
app.title = "OpenAPI Spec"
app.servers = [
{"url": "http://localhost:3000", "description": "Local server"},
diff --git a/scripts/docs/gen_rest_plugin_spec_reference.py b/scripts/docs/gen_rest_plugin_spec_reference.py
index 6d9fa93c8..bfc5018dc 100644
--- a/scripts/docs/gen_rest_plugin_spec_reference.py
+++ b/scripts/docs/gen_rest_plugin_spec_reference.py
@@ -4,11 +4,16 @@
import json
import logging
+import os
from pathlib import Path
from dstack._internal.settings import DSTACK_VERSION
logger = logging.getLogger("mkdocs.plugins.dstack.rest_plugin_schema")
+disable_env = "DSTACK_DOCS_DISABLE_REST_PLUGIN_SPEC_REFERENCE"
+if os.environ.get(disable_env):
+ logger.warning("REST plugin spec reference generation is disabled")
+ exit(0)
try:
from example_plugin_server.main import app
diff --git a/scripts/docs/hooks.py b/scripts/docs/hooks.py
index 4530172d1..ce5b3740b 100644
--- a/scripts/docs/hooks.py
+++ b/scripts/docs/hooks.py
@@ -15,6 +15,8 @@
WELL_KNOWN_SKILLS_DIR = ".well-known/skills"
SKILL_PATH = ("skills", "dstack", "SKILL.md")
DISABLE_EXAMPLES_ENV = "DSTACK_DOCS_DISABLE_EXAMPLES"
+DISABLE_LLM_TXT_ENV = "DSTACK_DOCS_DISABLE_LLM_TXT"
+DISABLE_YAML_SCHEMAS_ENV = "DSTACK_DOCS_DISABLE_YAML_SCHEMAS"
SCHEMA_REFERENCE_PREFIX = "docs/reference/"
@@ -35,6 +37,8 @@ def _get_schema_expanded_content(rel_path, config, src_path=None):
"""Return expanded markdown for reference/**/*.md that contain #SCHEMA#, else None.
If src_path is given (e.g. from on_post_build loop), read from it; else build path from config.
"""
+ if os.environ.get(DISABLE_YAML_SCHEMAS_ENV):
+ return None
if not rel_path.startswith(SCHEMA_REFERENCE_PREFIX) or not rel_path.endswith(".md"):
log.debug(f"Skipping {rel_path}: not in {SCHEMA_REFERENCE_PREFIX} or not .md")
return None
@@ -88,6 +92,16 @@ def on_page_read_source(page, config):
return None
+def on_config(config):
+ if os.environ.get(DISABLE_EXAMPLES_ENV):
+ log.warning("Examples documentation is disabled")
+ if os.environ.get(DISABLE_YAML_SCHEMAS_ENV):
+ log.warning("YAML schema reference generation is disabled")
+ if os.environ.get(DISABLE_LLM_TXT_ENV):
+ log.warning("llms.txt generation is disabled")
+ return config
+
+
def on_page_context(context, page, config, nav):
"""Override edit_url only for example stubs so Edit points to the README; other pages use theme default from edit_uri."""
repo_url = (config.get("repo_url") or "").rstrip("/")
@@ -204,6 +218,9 @@ def _write_well_known_skills(config, site_dir):
def _generate_llms_files(config, site_dir):
"""Generate llms.txt and llms-full.txt using external script."""
+ if os.environ.get(DISABLE_LLM_TXT_ENV):
+ return
+
repo_root = os.path.dirname(config["config_file_path"])
# Import and run the generator
From 080f2b43fd433d0057ce30e6d5bb0a4223f298b2 Mon Sep 17 00:00:00 2001
From: Andrey Cheptsov
Date: Sun, 12 Apr 2026 14:43:46 +0200
Subject: [PATCH 6/7] Removed TGI references
---
docs/docs/concepts/services.md | 3 +-
docs/docs/reference/dstack.yml/service.md | 45 --------
docs/examples.md | 9 --
docs/examples/inference/tgi/index.md | 0
examples/accelerators/amd/README.md | 44 +-------
examples/inference/tgi/.dstack.yml | 32 ------
examples/inference/tgi/README.md | 124 ----------------------
examples/inference/tgi/amd/.dstack.yml | 21 ----
examples/inference/tgi/tpu/.dstack.yml | 27 -----
examples/inference/vllm/README.md | 4 +-
mkdocs.yml | 3 -
11 files changed, 8 insertions(+), 304 deletions(-)
delete mode 100644 docs/examples/inference/tgi/index.md
delete mode 100644 examples/inference/tgi/.dstack.yml
delete mode 100644 examples/inference/tgi/README.md
delete mode 100644 examples/inference/tgi/amd/.dstack.yml
delete mode 100644 examples/inference/tgi/tpu/.dstack.yml
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
index 1eb63dd01..685b793bc 100644
--- a/docs/docs/concepts/services.md
+++ b/docs/docs/concepts/services.md
@@ -1093,6 +1093,5 @@ The rolling deployment stops when all replicas are updated or when a new deploym
1. Read about [dev environments](dev-environments.md) and [tasks](tasks.md)
2. Learn how to manage [fleets](fleets.md)
3. See how to set up [gateways](gateways.md)
- 4. Check the [TGI](../../examples/inference/tgi/index.md),
- [vLLM](../../examples/inference/vllm/index.md), and
+ 4. Check the [vLLM](../../examples/inference/vllm/index.md) and
[NIM](../../examples/inference/nim/index.md) examples
diff --git a/docs/docs/reference/dstack.yml/service.md b/docs/docs/reference/dstack.yml/service.md
index 59411a540..8aba6f827 100644
--- a/docs/docs/reference/dstack.yml/service.md
+++ b/docs/docs/reference/dstack.yml/service.md
@@ -20,51 +20,6 @@ The `service` configuration type allows running [services](../../concepts/servic
type:
required: true
-=== "TGI"
-
- > TGI provides an OpenAI-compatible API starting with version 1.4.0,
- so models served by TGI can be defined with `format: openai` too.
-
- #SCHEMA# dstack.api.TGIChatModel
- overrides:
- show_root_heading: false
- type:
- required: true
-
- ??? info "Chat template"
-
- By default, `dstack` loads the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating)
- from the model's repository. If it is not present there, manual configuration is required.
-
- ```yaml
- type: service
-
- image: ghcr.io/huggingface/text-generation-inference:latest
- env:
- - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
- commands:
- - text-generation-launcher --port 8000 --trust-remote-code --quantize gptq
- port: 8000
-
- resources:
- gpu: 80GB
-
- # Enable the OpenAI-compatible endpoint
- model:
- type: chat
- name: TheBloke/Llama-2-13B-chat-GPTQ
- format: tgi
- chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' }}{% endif %}{% endfor %}"
- eos_token: ""
- ```
-
- Please note that model mapping is an experimental feature with the following limitations:
-
- 1. Doesn't work if your `chat_template` uses `bos_token`. As a workaround, replace `bos_token` inside `chat_template` with the token content itself.
- 2. Doesn't work if `eos_token` is defined in the model repository as a dictionary. As a workaround, set `eos_token` manually, as shown in the example above (see Chat template).
-
- If you encounter any other issues, please make sure to file a
- [GitHub issue](https://github.com/dstackai/dstack/issues/new/choose).
### `scaling`
diff --git a/docs/examples.md b/docs/examples.md
index a9f840979..cbecf2435 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -165,15 +165,6 @@ hide:
Deploy Llama 3.1 with vLLM