diff --git a/docs/examples.md b/docs/examples.md
index c380b98bf..2bcf1d2a7 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -82,7 +82,16 @@ hide:
         Deploy and fine-tune LLMs on AMD

+
+        Intel
+
+
+        Deploy and fine-tune LLMs on Intel
+

diff --git a/docs/examples/accelerators/intel/index.md b/docs/examples/accelerators/intel/index.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/accelerators/intel/README.md b/examples/accelerators/intel/README.md
new file mode 100644
index 000000000..c712637ae
--- /dev/null
+++ b/examples/accelerators/intel/README.md
@@ -0,0 +1,194 @@
+# Intel Gaudi
+
+`dstack` supports running dev environments, tasks, and services on Intel Gaudi2 accelerators.
+To use them, set up an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh)
+with your on-prem Gaudi2 nodes.
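+
+For reference, below is a minimal sketch of such a fleet configuration. The fleet
+name, user, key path, and host address are placeholders; replace them with the
+details of your own machines:
+
+```yaml
+type: fleet
+# A hypothetical fleet name
+name: gaudi2-fleet
+
+# Placeholders for your on-prem hosts
+ssh_config:
+  user: ubuntu
+  identity_file: ~/.ssh/id_rsa
+  hosts:
+    - 192.168.100.10
+```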
+
+## Deployment
+
+Serving frameworks such as vLLM and TGI support Intel Gaudi2. Below is an example of a [service](https://dstack.ai/docs/services) that deploys
+[DeepSeek-R1-Distill-Llama-70B :material-arrow-top-right-thin:{ .external }](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B){:target="_blank"} in BF16 using either [TGI :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/tgi-gaudi){:target="_blank"} or [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/installation/ai_accelerator/index.html?device=hpu-gaudi){:target="_blank"}.
+
+=== "TGI"
+
+    ```yaml
+    type: service
+
+    name: tgi
+
+    image: ghcr.io/huggingface/tgi-gaudi:2.3.1
+
+    port: 8000
+
+    model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+
+    env:
+      - HF_TOKEN
+      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+      - PORT=8000
+      - OMPI_MCA_btl_vader_single_copy_mechanism=none
+      - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
+      - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
+      - MAX_TOTAL_TOKENS=2048
+      - BATCH_BUCKET_SIZE=256
+      - PREFILL_BATCH_BUCKET_SIZE=4
+      - PAD_SEQUENCE_TO_MULTIPLE_OF=64
+      - ENABLE_HPU_GRAPH=true
+      - LIMIT_HPU_GRAPH=true
+      - USE_FLASH_ATTENTION=true
+      - FLASH_ATTENTION_RECOMPUTE=true
+
+    commands:
+      - text-generation-launcher
+        --sharded true
+        --num-shard $DSTACK_GPUS_NUM
+        --max-input-length 1024
+        --max-total-tokens 2048
+        --max-batch-prefill-tokens 4096
+        --max-batch-total-tokens 524288
+        --max-waiting-tokens 7
+        --waiting-served-ratio 1.2
+        --max-concurrent-requests 512
+
+    resources:
+      gpu: Gaudi2:8
+
+    # Uncomment to cache downloaded models
+    #volumes:
+    #  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
+    ```
+
+=== "vLLM"
+
+    ```yaml
+    type: service
+    name: deepseek-r1-gaudi
+
+    image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
+    env:
+      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+    commands:
+      - git clone https://github.com/vllm-project/vllm.git
+      - cd vllm
+      - pip install -r requirements-hpu.txt
+      - python setup.py develop
+      - vllm serve $MODEL_ID
+        --tensor-parallel-size $DSTACK_GPUS_NUM
+        --download-dir /data/hub
+
+    port: 8000
+
+    model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+
+    resources:
+      gpu: Gaudi2:8
+
+    # Uncomment to cache downloaded models
+    #volumes:
+    #  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
+    ```
+
+## Fine-tuning
+
+Below is an example of distributed LoRA fine-tuning of [DeepSeek-R1-Distill-Qwen-7B :material-arrow-top-right-thin:{ .external }](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B){:target="_blank"} in BF16 using [Optimum Habana :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-habana){:target="_blank"}
+and [DeepSpeed :material-arrow-top-right-thin:{ .external }](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide/DeepSpeed_User_Guide.html#deepspeed-user-guide){:target="_blank"} on the [`lvwerra/stack-exchange-paired` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/lvwerra/stack-exchange-paired){:target="_blank"} dataset. Optimum Habana adapts Hugging Face libraries, including TRL, to Intel Gaudi accelerators.
+
+```yaml
+type: task
+# The name is optional; if not specified, it's generated randomly
+name: trl-train
+
+image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
+
+# Required environment variables
+env:
+  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
+  - WANDB_API_KEY
+  - WANDB_PROJECT
+# Commands of the task
+commands:
+  - pip install --upgrade-strategy eager optimum[habana]
+  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
+  - git clone https://github.com/huggingface/optimum-habana.git
+  - cd optimum-habana/examples/trl
+  - pip install -r requirements.txt
+  - pip install wandb
+  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
+    --model_name_or_path $MODEL_ID
+    --dataset_name "lvwerra/stack-exchange-paired"
+    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
+    --output_dir="./sft"
+    --do_train
+    --max_steps=500
+    --logging_steps=10
+    --save_steps=100
+    --per_device_train_batch_size=1
+    --per_device_eval_batch_size=1
+    --gradient_accumulation_steps=2
+    --learning_rate=1e-4
+    --lr_scheduler_type="cosine"
+    --warmup_steps=100
+    --weight_decay=0.05
+    --optim="paged_adamw_32bit"
+    --lora_target_modules "q_proj" "v_proj"
+    --bf16
+    --remove_unused_columns=False
+    --run_name="sft_deepseek_70"
+    --report_to="wandb"
+    --use_habana
+    --use_lazy_mode
+
+resources:
+  gpu: Gaudi2:8
+```
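+
+Because `WANDB_API_KEY` and `WANDB_PROJECT` are declared in `env` without
+values, `dstack` reads them from the environment where you invoke the CLI. A
+sketch (the project name is hypothetical):
+
+```shell
+$ export WANDB_API_KEY=<your W&B API key>
+$ export WANDB_PROJECT=gaudi2-sft
+$ dstack apply -f examples/fine-tuning/trl/intel/.dstack.deepseek_distill.yml
+```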
+
+Note: to fine-tune DeepSeek-R1-Distill-Llama-70B in BF16 on 8x Gaudi2, you can partially offload parameters to CPU memory via the DeepSpeed configuration file, as sketched below. For more details, refer to [cpu_offload](https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeedzerooffloadparamconfig).
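+
+A minimal sketch of the relevant ZeRO-3 settings, assuming you extend the
+`llama2_ds_zero3_config.json` used above (field names follow the documented
+DeepSpeed ZeRO-3 schema):
+
+```json
+{
+  "zero_optimization": {
+    "stage": 3,
+    "offload_param": {
+      "device": "cpu",
+      "pin_memory": true
+    }
+  }
+}
+```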
+
+## Running a configuration
+
+Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will run the
+configuration on your on-prem fleet.
+
+```shell
+$ dstack apply -f examples/deployment/vllm/intel/.dstack.yml
+
+ #  BACKEND  REGION  RESOURCES                      SPOT  PRICE
+ 1  ssh      remote  152xCPU,1007GB,8xGaudi2:96GB   yes   $0     idle
+
+Submit a new run? [y/n]: y
+
+Provisioning...
+---> 100%
+```
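+
+Once the service is up, you can query its OpenAI-compatible endpoint. A sketch,
+assuming the vLLM service above with the endpoint reachable at `localhost:8000`
+(for example, with the port forwarded to your machine):
+
+```shell
+$ curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+          "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
+          "messages": [{"role": "user", "content": "What is 2 + 2?"}],
+          "max_tokens": 128
+        }'
+```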
+
+## Source code
+
+The source code of these examples can be found in
+[`examples/deployment/tgi/intel` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/tgi/intel){:target="_blank"},
+[`examples/deployment/vllm/intel` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/deployment/vllm/intel){:target="_blank"}, and
+[`examples/fine-tuning/trl/intel` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/trl/intel){:target="_blank"}.
+
+## What's next?
+
+1. Browse the [Intel Gaudi documentation :material-arrow-top-right-thin:{ .external }](https://docs.habana.ai/en/latest/index.html){:target="_blank"} and the [Optimum Habana TRL examples :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-habana/blob/main/examples/trl/README.md){:target="_blank"}.
+2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), and
+   [services](https://dstack.ai/docs/services).
diff --git a/examples/deployment/tgi/intel/.dstack.yml b/examples/deployment/tgi/intel/.dstack.yml
new file mode 100644
index 000000000..16d083092
--- /dev/null
+++ b/examples/deployment/tgi/intel/.dstack.yml
@@ -0,0 +1,45 @@
+type: service
+
+name: tgi
+
+image: ghcr.io/huggingface/tgi-gaudi:2.3.1
+
+auth: false
+port: 8000
+
+model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+  - PORT=8000
+  - OMPI_MCA_btl_vader_single_copy_mechanism=none
+  - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
+  - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
+  - MAX_TOTAL_TOKENS=2048
+  - BATCH_BUCKET_SIZE=256
+  - PREFILL_BATCH_BUCKET_SIZE=4
+  - PAD_SEQUENCE_TO_MULTIPLE_OF=64
+  - ENABLE_HPU_GRAPH=true
+  - LIMIT_HPU_GRAPH=true
+  - USE_FLASH_ATTENTION=true
+  - FLASH_ATTENTION_RECOMPUTE=true
+
+commands:
+  - text-generation-launcher
+    --sharded true
+    --num-shard 8
+    --max-input-length 1024
+    --max-total-tokens 2048
+    --max-batch-prefill-tokens 4096
+    --max-batch-total-tokens 524288
+    --max-waiting-tokens 7
+    --waiting-served-ratio 1.2
+    --max-concurrent-requests 512
+
+resources:
+  gpu: Gaudi2:8
+
+# Uncomment to cache downloaded models
+#volumes:
+#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
diff --git a/examples/deployment/vllm/intel/.dstack.yml b/examples/deployment/vllm/intel/.dstack.yml
new file mode 100644
index 000000000..b44681aff
--- /dev/null
+++ b/examples/deployment/vllm/intel/.dstack.yml
@@ -0,0 +1,26 @@
+type: service
+name: deepseek-r1-gaudi
+
+image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
+env:
+  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+commands:
+  - git clone https://github.com/vllm-project/vllm.git
+  - cd vllm
+  - pip install -r requirements-hpu.txt
+  - python setup.py develop
+  - vllm serve $MODEL_ID
+    --tensor-parallel-size 8
+    --download-dir /data/hub
+
+port: 8000
+
+model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+
+resources:
+  gpu: Gaudi2:8
+
+# Uncomment to cache downloaded models
+#volumes:
+#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
diff --git a/examples/fine-tuning/trl/intel/.dstack.deepseek_distill.yml b/examples/fine-tuning/trl/intel/.dstack.deepseek_distill.yml
new file mode 100644
index 000000000..9963e4844
--- /dev/null
+++ b/examples/fine-tuning/trl/intel/.dstack.deepseek_distill.yml
@@ -0,0 +1,46 @@
+type: task
+# The name is optional; if not specified, it's generated randomly
+name: trl-train
+
+image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
+
+# Required environment variables
+env:
+  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
+  - WANDB_API_KEY
+  - WANDB_PROJECT
+# Commands of the task
+commands:
+  - pip install --upgrade-strategy eager optimum[habana]
+  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
+  - git clone https://github.com/huggingface/optimum-habana.git
+  - cd optimum-habana/examples/trl
+  - pip install -r requirements.txt
+  - pip install wandb
+  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py
+    --model_name_or_path $MODEL_ID
+    --dataset_name "lvwerra/stack-exchange-paired"
+    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
+    --output_dir="./sft"
+    --do_train
+    --max_steps=500
+    --logging_steps=10
+    --save_steps=100
+    --per_device_train_batch_size=1
+    --per_device_eval_batch_size=1
+    --gradient_accumulation_steps=2
+    --learning_rate=1e-4
+    --lr_scheduler_type="cosine"
+    --warmup_steps=100
+    --weight_decay=0.05
+    --optim="paged_adamw_32bit"
+    --lora_target_modules "q_proj" "v_proj"
+    --bf16
+    --remove_unused_columns=False
+    --run_name="sft_deepseek_70"
+    --report_to="wandb"
+    --use_habana
+    --use_lazy_mode
+
+resources:
+  gpu: Gaudi2:8
diff --git a/examples/fine-tuning/trl/intel/.dstack.deepseek_v2_lite.yml b/examples/fine-tuning/trl/intel/.dstack.deepseek_v2_lite.yml
new file mode 100644
index 000000000..7aa13d677
--- /dev/null
+++ b/examples/fine-tuning/trl/intel/.dstack.deepseek_v2_lite.yml
@@ -0,0 +1,45 @@
+type: task
+# The name is optional; if not specified, it's generated randomly
+name: trl-train-deepseek-v2-lite
+
+image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
+
+# Required environment variables
+env:
+  - MODEL_ID=deepseek-ai/DeepSeek-V2-Lite
+  - WANDB_API_KEY
+  - WANDB_PROJECT
+# Commands of the task
+commands:
+  - pip install git+https://github.com/huggingface/optimum-habana.git
+  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
+  - git clone https://github.com/huggingface/optimum-habana.git
+  - cd optimum-habana/examples/trl
+  - pip install -r requirements.txt
+  - pip install wandb
+  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py
+    --model_name_or_path $MODEL_ID
+    --dataset_name "lvwerra/stack-exchange-paired"
+    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
+    --output_dir="./sft"
+    --do_train
+    --max_steps=500
+    --logging_steps=10
+    --save_steps=100
+    --per_device_train_batch_size=1
+    --per_device_eval_batch_size=1
+    --gradient_accumulation_steps=2
+    --learning_rate=1e-4
+    --lr_scheduler_type="cosine"
+    --warmup_steps=100
+    --weight_decay=0.05
+    --optim="paged_adamw_32bit"
+    --lora_target_modules "q_proj" "v_proj"
+    --bf16
+    --remove_unused_columns=False
+    --run_name="sft_deepseek_v2lite"
+    --report_to="wandb"
+    --use_habana
+    --use_lazy_mode
+
+resources:
+  gpu: Gaudi2:8
diff --git a/mkdocs.yml b/mkdocs.yml
index 77d47e3ef..0d409d2c7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -265,6 +265,7 @@ nav:
           - TRL: examples/fine-tuning/trl/index.md
         - Accelerators:
           - AMD: examples/accelerators/amd/index.md
+          - Intel: examples/accelerators/intel/index.md
           - TPU: examples/accelerators/tpu/index.md
         - LLMs:
           - Llama 3.1: examples/llms/llama31/index.md