diff --git a/docs/examples.md b/docs/examples.md index c380b98bf..a4f63a7a8 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -98,6 +98,16 @@ hide: ## LLMs
+ +

+ DeepSeek +

+ +

+ Deploy DeepSeek models +

+

diff --git a/docs/examples/llms/deepseek/index.md b/docs/examples/llms/deepseek/index.md new file mode 100644 index 000000000..e69de29bb diff --git a/examples/.dstack.yml b/examples/.dstack.yml index 1079937ce..a77fbb4f8 100644 --- a/examples/.dstack.yml +++ b/examples/.dstack.yml @@ -2,9 +2,9 @@ type: dev-environment # The name is optional, if not specified, generated randomly name: vscode -python: "3.11" +#python: "3.11" # Uncomment to use a custom Docker image -#image: dstackai/base:py3.13-0.6-cuda-12.1 +image: dstackai/base:py3.13-0.6-cuda-12.1 ide: vscode diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md new file mode 100644 index 000000000..ee7808c08 --- /dev/null +++ b/examples/llms/deepseek/README.md @@ -0,0 +1,210 @@ +# DeepSeek +This example walks you through how to deploy DeepSeek-R1 with `dstack`. + +??? info "Prerequisites" +    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead and clone the repo, then run `dstack init`. +
+ + ```shell + $ git clone https://github.com/dstackai/dstack + $ cd dstack + $ dstack init + ``` +
+ +## Deployment +### AMD +Here's an example of a service that deploys DeepSeek-R1 using `SGLang` and `vLLM` on an AMD `MI300X` GPU. + +=== "SGLang" +
+ ```yaml + type: service + name: deepseek-r1-amd + + image: lmsysorg/sglang:v0.4.1.post4-rocm620 + env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B + commands: + - python3 -m sglang.launch_server + --model-path $MODEL_ID + --port 8000 + --trust-remote-code + + port: 8000 + model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B + + resources: + gpu: mi300x + disk: 300Gb + + ``` +
+ +=== "vLLM" + +
+ ```yaml + type: service + name: deepseek-r1-amd + + image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 + env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B + - MAX_MODEL_LEN=126432 + commands: + - vllm serve $MODEL_ID + --max-model-len $MAX_MODEL_LEN + + port: 8000 + + model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B + + + resources: + gpu: mi300x + disk: 300Gb + ``` +
+ +Note that when using `DeepSeek-R1-Distill-Llama-70B` with vLLM on a 192GB GPU, we must limit the context size to 126432 tokens to fit into memory. + + +### NVIDIA +Here's an example of a service that deploys DeepSeek-R1 using `SGLang` and `vLLM` on an NVIDIA GPU with 24GB of memory. + +=== "SGLang" +
+ ```yaml + type: service + name: deepseek-r1-nvidia + + image: lmsysorg/sglang:latest + env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B + commands: + - python3 -m sglang.launch_server + --model-path $MODEL_ID + --port 8000 + --trust-remote-code + + port: 8000 + + model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + + resources: + gpu: 24GB + ``` +
+ +=== "vLLM" + +
+ ```yaml + type: service + name: deepseek-r1-nvidia + + image: vllm/vllm-openai:latest + env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B + - MAX_MODEL_LEN=4096 + commands: + - vllm serve $MODEL_ID + --max-model-len $MAX_MODEL_LEN + + port: 8000 + + model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + + resources: + gpu: 24GB + ``` +
+ +Note that when using `DeepSeek-R1-Distill-Llama-8B` with vLLM on a 24GB GPU, we must limit the context size to 4096 tokens to fit into memory. + +### Memory requirements + +Below are the approximate memory requirements for loading the model. +This excludes memory for the model context and CUDA/ROCm kernel reservations. + +| Model size | FP16 | FP8 | INT4 | +|------------|---------|---------|---------| +| **671B** | ~1342GB | ~671GB | ~336GB | +| **70B** | ~161GB | ~80.5GB | ~40GB | +| **32B** | ~74GB | ~37GB | ~18.5GB | +| **14B** | ~32GB | ~16GB | ~8GB | +| **8B** | ~18GB | ~9GB | ~4.5GB | +| **7B** | ~16GB | ~8GB | ~4GB | +| **1.5B** | ~3.5GB | ~2GB | ~1GB | + +For example, the FP16 version of DeepSeek-R1 671B would fit into a single node of `MI300X` with eight 192GB GPUs or +two nodes of `H200` with eight 141GB GPUs each. + + + +### Running a configuration + +To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. +
+ +```shell +$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml + + # BACKEND REGION RESOURCES SPOT PRICE + 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 + +Submit the run deepseek-r1-amd? [y/n]: y + +Provisioning... +---> 100% +``` +
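+
+You can check the run's status with `dstack ps` and follow its logs with `dstack logs deepseek-r1-amd` (the run name from the example above).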
+ +Once the service is up, the model will be available via the OpenAI-compatible endpoint +at `<dstack server URL>/proxy/models/<project name>/`. +
+ +```shell +curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ + -X POST \ + -H 'Authorization: Bearer <dstack token>' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "What is Deep Learning?" + } + ], + "stream": true, + "max_tokens": 512 + }' +``` +
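+
+Since the endpoint is OpenAI-compatible, you can also query it from code, for example with the OpenAI Python SDK. Below is a minimal sketch; it assumes the same local server address, `main` project, and `<dstack token>` placeholder as the `curl` example above:
+
+```python
+from openai import OpenAI  # pip install openai
+
+# Assumes a local dstack server and the `main` project, as in the curl example above
+client = OpenAI(
+    base_url="http://127.0.0.1:3000/proxy/models/main",
+    api_key="<dstack token>",
+)
+
+stream = client.chat.completions.create(
+    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Deep Learning?"},
+    ],
+    max_tokens=512,
+    stream=True,
+)
+
+# Print the streamed completion as it arrives
+for chunk in stream:
+    print(chunk.choices[0].delta.content or "", end="")
+```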
+ + +When a [gateway](https://dstack.ai/docs/concepts/gateways.md) is configured, the OpenAI-compatible endpoint +is available at `https://gateway.<gateway domain>/`. + +## Source code + +The source code of this example can be found in +[`examples/llms/deepseek` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek). + +## What's next? +1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), + [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). +2. Browse [AMD Instinct GPUs Power DeepSeek :material-arrow-top-right-thin:{ .external }](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html). + + diff --git a/examples/llms/deepseek/sglang/amd/.dstack.yml b/examples/llms/deepseek/sglang/amd/.dstack.yml new file mode 100644 index 000000000..18086f287 --- /dev/null +++ b/examples/llms/deepseek/sglang/amd/.dstack.yml @@ -0,0 +1,18 @@ +type: service +name: deepseek-r1-amd + +image: lmsysorg/sglang:v0.4.1.post4-rocm620 +env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B +commands: + - python3 -m sglang.launch_server + --model-path $MODEL_ID + --port 8000 + --trust-remote-code + +port: 8000 +model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B + +resources: + gpu: mi300x + disk: 300Gb diff --git a/examples/llms/deepseek/sglang/nvidia/.dstack.yml b/examples/llms/deepseek/sglang/nvidia/.dstack.yml new file mode 100644 index 000000000..fa5786965 --- /dev/null +++ b/examples/llms/deepseek/sglang/nvidia/.dstack.yml @@ -0,0 +1,18 @@ +type: service +name: deepseek-r1-nvidia + +image: lmsysorg/sglang:latest +env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B +commands: + - python3 -m sglang.launch_server + --model-path $MODEL_ID + --port 8000 + --trust-remote-code + +port: 8000 + +model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + +resources: + gpu: 24GB diff --git a/examples/llms/deepseek/vllm/amd/.dstack.yml b/examples/llms/deepseek/vllm/amd/.dstack.yml new file mode 100644 index 000000000..a35fb4ace --- /dev/null +++ b/examples/llms/deepseek/vllm/amd/.dstack.yml @@ -0,0 +1,19 @@ +type: service +name: deepseek-r1-amd + +image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 +env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B + - MAX_MODEL_LEN=126432 +commands: + - vllm serve $MODEL_ID + --max-model-len $MAX_MODEL_LEN + +port: 8000 + +model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B + + +resources: + gpu: mi300x + disk: 300Gb diff --git a/examples/llms/deepseek/vllm/nvidia/.dstack.yml b/examples/llms/deepseek/vllm/nvidia/.dstack.yml new file mode 100644 index 000000000..62e41e207 --- /dev/null +++ b/examples/llms/deepseek/vllm/nvidia/.dstack.yml @@ -0,0 +1,17 @@ +type: service +name: deepseek-r1-nvidia + +image: vllm/vllm-openai:latest +env: + - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B + - MAX_MODEL_LEN=4096 +commands: + - vllm serve $MODEL_ID + --max-model-len $MAX_MODEL_LEN + +port: 8000 + +model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + +resources: + gpu: 24GB diff --git a/examples/llms/llama31/README.md b/examples/llms/llama31/README.md index a345d56b0..fb754e4fc 100644 --- a/examples/llms/llama31/README.md +++ b/examples/llms/llama31/README.md @@ -181,7 +181,7 @@ Provisioning...

Once the service is up, the model will be available via the OpenAI-compatible endpoint -at `/proxy/models//. +at `/proxy/models//`.
diff --git a/mkdocs.yml b/mkdocs.yml index 77d47e3ef..b63764789 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -267,6 +267,7 @@ nav: - AMD: examples/accelerators/amd/index.md - TPU: examples/accelerators/tpu/index.md - LLMs: + - DeepSeek: examples/llms/deepseek/index.md - Llama 3.1: examples/llms/llama31/index.md - Llama 3.2: examples/llms/llama32/index.md - Misc: