28 changes: 28 additions & 0 deletions examples/fine-tuning/alignment-handbook/.dstack.yml
@@ -0,0 +1,28 @@
type: dev-environment
name: ah-vscode

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY

# Uncomment if you want the environment to be pre-installed
#init:
# - conda install cuda
# - git clone https://github.com/huggingface/alignment-handbook.git
# - cd alignment-handbook
# - pip install .
# - pip install flash-attn --no-build-isolation
# - pip install wandb

ide: vscode

spot_policy: auto

resources:
  # Minimum 24GB of memory, one or more GPUs
gpu: 24GB..:1..
242 changes: 120 additions & 122 deletions examples/fine-tuning/alignment-handbook/README.md
@@ -1,182 +1,180 @@
# Alignment Handbook

[Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face offers recipes, configs, and
scripts for fine-tuning LLMs. It includes all the code needed to run CPT, SFT, DPO, and ORPO using Hugging Face
libraries such as `transformers`, `peft`, `accelerate`, and `trl`. You just need to modify the recipes and run the
appropriate script.

This example shows how to use Alignment Handbook and `dstack` to fine-tune Gemma 7B on your own SFT dataset.

!!! info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`:

    ```shell
    git clone https://github.com/dstackai/dstack
    cd dstack
    dstack init
    ```

## Training configuration recipe

Alignment Handbook's training script reads the model, LoRA, and dataset arguments, as well as the trainer
configuration, from a YAML file.
This file can be found at [`examples/fine-tuning/alignment-handbook/config.yaml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/config.yaml).
You can modify it as needed.
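
The recipe follows the Zephyr-7B-Gemma QLoRA recipe: `dataset_mixer` points at the SFT dataset to train on, the
LoRA arguments enable QLoRA fine-tuning, and `hub_model_id` and `output_dir` control where the model and its
checkpoints are saved. Below is an abbreviated sketch of the kind of fields it contains; the `hub_model_id` and
`output_dir` values are illustrative placeholders:

```yaml
# Model arguments
model_name_or_path: google/gemma-7b
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# LoRA (QLoRA) arguments
load_in_4bit: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05

# Data arguments: point this at your own SFT dataset on the Hugging Face Hub
dataset_mixer:
  chansung/merged_ds_coding: 1.0
dataset_splits:
  - train_sft
  - test_sft

# SFT trainer arguments
per_device_train_batch_size: 2
push_to_hub: true
hub_model_id: "<your-username>/<your-model-name>"  # placeholder
output_dir: "data/<your-model-name>"               # placeholder
report_to:
  - wandb
```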

## Single-node training

The easiest way to run a training script with `dstack` is by creating a task configuration file.
This file can be found at [`examples/fine-tuning/alignment-handbook/train.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/train.dstack.yml). Below is its content:

```yaml
type: task
name: ah-train

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY

# Commands of the task
commands:
  - conda install cuda
  - git clone https://github.com/huggingface/alignment-handbook.git
  - cd alignment-handbook
  - pip install .
  - pip install flash-attn --no-build-isolation
  - pip install wandb
  - accelerate launch
    --config_file recipes/accelerate_configs/multi_gpu.yaml
    --num_processes=$DSTACK_GPUS_NUM
    scripts/run_sft.py
    ../examples/fine-tuning/alignment-handbook/config.yaml

# Expose 6006 to access TensorBoard
ports:
  - 6006

resources:
  # Required resources
  gpu: 24GB
```

The task clones Alignment Handbook's repo, installs the dependencies, and runs the training script.

Our `config.yaml` sets `report_to` to `wandb`, which is why the task also installs `wandb`.
The `DSTACK_GPUS_NUM` environment variable is automatically passed to the container
according to the `resources` property.
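
For example, if you request two GPUs per run, `DSTACK_GPUS_NUM` is set to `2` and `accelerate` launches two
processes; a sketch, not part of the shipped configuration:

```yaml
resources:
  # Two GPUs with at least 24GB of VRAM each; DSTACK_GPUS_NUM is then set to 2
  gpu: 24GB:2
```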

To run the task, use `dstack apply`:

```shell
HUGGING_FACE_HUB_TOKEN=...
ACCELERATE_LOG_LEVEL=...
WANDB_API_KEY=...

dstack apply -f examples/fine-tuning/alignment-handbook/train.dstack.yml
```

If you list `tensorboard` under `report_to` in [`examples/fine-tuning/alignment-handbook/config.yaml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/config.yaml),
you'll be able to access experiment metrics via `http://localhost:6006` (while the task is running).

## Multi-node training

The multi-node training task configuration file can be found at [`examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml).
Below is its content:

```yaml
type: task
name: ah-train-distrib

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY

# Commands of the task (dstack runs them on each node)
commands:
  - conda install cuda
  - git clone https://github.com/huggingface/alignment-handbook.git
  - cd alignment-handbook
  - pip install .
  - pip install flash-attn --no-build-isolation
  - pip install wandb
  - accelerate launch
    --config_file ../examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    scripts/run_sft.py
    ../examples/fine-tuning/alignment-handbook/config.yaml

# Expose 6006 to access TensorBoard
ports:
  - 6006

# The number of interconnected instances required
nodes: 2

resources:
  # Required resources
  gpu: 24GB
  # Shared memory size for inter-process communication
  shm_size: 24GB
```

Here's how the multi-node task differs from the single-node one:

1. The `nodes` property is set to the number of required nodes (it should match the fleet's node count).
2. Under `resources`, `shm_size` specifies the shared memory size used for communication between parallel
   processes within a node (in case multiple GPUs per node are used).
3. Instead of Alignment Handbook's [`recipes/accelerate_configs/multi_gpu.yaml`](https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/multi_gpu.yaml),
   we use [`examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml)
   as the `accelerate` config (see the sketch after this list).
4. We use the `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, `DSTACK_GPUS_NUM`, and `DSTACK_NODES_NUM` environment variables
   to configure `accelerate`. These environment variables are automatically passed
   to the container on each node based on the task configuration.
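
For reference, an FSDP-based `accelerate` config along the lines of `fsdp_qlora_full_shard.yaml` might look like the
sketch below; it's an illustration, not the exact file from the repo. With `fsdp_sharding_strategy: FULL_SHARD`,
parameters, gradients, and optimizer states are sharded across all GPUs on all nodes; `HYBRID_SHARD` would instead
shard within each node and replicate across nodes.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP  # Fully Sharded Data Parallel
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_use_orig_params: false
  # ... (other FSDP options)
mixed_precision: bf16
num_machines: 1   # overridden by --num_machines at launch time
num_processes: 1  # overridden by --num_processes at launch time
```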

## Fleets

> By default, `dstack apply` reuses `idle` instances from one of the existing [fleets](https://dstack.ai/docs/fleets).
> If no `idle` instances meet the requirements, it creates a new fleet using one of the configured backends.

The example folder includes two cloud fleet configurations: [`examples/fine-tuning/alignment-handbook/fleet.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/fleet.dstack.yml) (a single node with a `24GB` GPU)
and [`examples/fine-tuning/alignment-handbook/fleet-distrib.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/fleet-distrib.dstack.yml) (a cluster of two nodes, each with a `24GB` GPU).

You can update the fleet configurations to change the vRAM size, GPU model, number of GPUs per node, or number of nodes.
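
For illustration, a distributed fleet configuration along these lines might look like the sketch below (assuming two
interconnected nodes with one `24GB` GPU each; see the files linked above for the exact contents):

```yaml
type: fleet
name: ah-fleet-distrib

# Provision two interconnected instances placed in the same cluster
nodes: 2
placement: cluster

resources:
  gpu: 24GB
```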

A fleet can be provisioned with `dstack apply`:

```shell
dstack apply -f examples/fine-tuning/alignment-handbook/fleet.dstack.yml
```

Once provisioned, the fleet can run dev environments and fine-tuning tasks.
To delete the fleet, use `dstack fleet delete`.

> To ensure `dstack apply` always reuses an existing fleet,
> pass `--reuse` to `dstack apply` (or set `creation_policy` to `reuse` in the task configuration).
> The default policy is `reuse_or_create`.
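
For example, a task that should only ever reuse fleet instances could set the policy directly in its configuration;
a minimal sketch, assuming `creation_policy` is set at the top level of the task:

```yaml
type: task
name: ah-train

# Only reuse existing `idle` instances; don't provision new ones
creation_policy: reuse
```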

## Dev environment

If you'd like to play with the example using a dev environment, run
[.dstack.yml](.dstack.yml) via `dstack apply`:

```shell
dstack apply -f examples/fine-tuning/alignment-handbook/.dstack.yml
```

## Results

- [merged_ds_coding](https://huggingface.co/datasets/chansung/merged_ds_coding): an SFT dataset focused solely on coding tasks, with roughly 60k training examples.
- [chansung/coding_llamaduo_60k_v0.2](https://huggingface.co/chansung/coding_llamaduo_60k_v0.2): a QLoRA adapter for Gemma 7B fine-tuned on the `merged_ds_coding` dataset with exactly the same configuration as in [`config.yaml`](./config.yaml), using 2xA6000 GPUs via `dstack` Sky.

## What's next?

1. Browse [Alignment Handbook](https://github.com/huggingface/alignment-handbook).
2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
[services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/fleets).
3. See other [examples](https://github.com/dstackai/dstack/blob/master/examples/).
23 changes: 12 additions & 11 deletions examples/fine-tuning/alignment-handbook/config.yaml
@@ -3,7 +3,8 @@ model_name_or_path: google/gemma-7b
model_revision: main
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml # Custom tokenizer with <|im_start|> and <|im_end|> tokens
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bnb_4bit_quant_storage: bfloat16

# LoRA arguments
load_in_4bit: true
@@ -12,20 +13,20 @@ lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Data training arguments
dataset_mixer:
chansung/merged_ds_coding: 1.0
dataset_splits:
  - train_sft
  - test_sft
preprocessing_num_workers: 12

# SFT trainer config
@@ -55,7 +56,7 @@ per_device_eval_batch_size: 2
per_device_train_batch_size: 2
push_to_hub: true
report_to:
#- tensorboard (temporarily disabled due to an issue with Alignment Handbook throwing an exception)
- wandb
save_strategy: "steps"
save_steps: 100