From ad9ccaea9abeabd27d6cb338ecbd380c91cfe418 Mon Sep 17 00:00:00 2001
From: Chansung Park
Date: Thu, 11 Jul 2024 02:02:00 +0900
Subject: [PATCH 1/3] update with multi-node mode

---
 .../fine-tuning/alignment-handbook/README.md | 62 ++++++++++++++++++-
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/examples/fine-tuning/alignment-handbook/README.md b/examples/fine-tuning/alignment-handbook/README.md
index 1e3df4673..2ae293156 100644
--- a/examples/fine-tuning/alignment-handbook/README.md
+++ b/examples/fine-tuning/alignment-handbook/README.md
@@ -45,7 +45,7 @@ ACCELERATE_LOG_LEVEL=info accelerate launch \
 
 For more details and other alignment methods, please check out the alignment-handbook's [official repository](https://github.com/huggingface/alignment-handbook).
 
-## Running via `dstack`
+## Running via `dstack` {#running-via-dstack}
 
 This example demonstrates how to run an Alignment Handbook recipe via `dstack`.
 
@@ -105,7 +105,65 @@ WANDB_API_KEY=<...> \
 dstack run . -f examples/fine-tuning/alignment-handbook/train.dstack.yaml
 ```
 
+## Multi-node
+
+With `dstack`, we can easily manage multiple nodes with multiple GPUs each. To leverage multiple nodes for the Alignment Handbook with `dstack`, we need to adjust two things: Hugging Face's `accelerate` configuration and `dstack`'s task description.
+
+### Accelerate configurations
+
+The `accelerate` configuration itself doesn't have to change; it can remain the same as the `multi_gpu.yaml` used in the previous [Running via `dstack`](#running-via-dstack) section. However, it is worth understanding the `fsdp_sharding_strategy` option.
+
+```yaml
+compute_environment: LOCAL_MACHINE
+distributed_type: FSDP # Use Fully Sharded Data Parallelism
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch: BACKWARD_PRE
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_use_orig_params: false
+  fsdp_offload_params: true
+  fsdp_sharding_strategy: FULL_SHARD
+  # ... (other FSDP configurations)
+# ... (other configurations)
+```
+
+With `distributed_type` set to `FSDP` and `fsdp_sharding_strategy` set to `FULL_SHARD`, the model's parameters, gradients, and optimizer states are sharded across all participating GPUs, which on a multi-node run means across every GPU on every node. If you instead want each node to keep its own copy of the model, sharded only across the GPUs inside that node, with the nodes training data-parallel on different batches of the dataset, set `fsdp_sharding_strategy` to `HYBRID_SHARD`.
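+
+For reference, a minimal sketch of that hybrid setup could look like the following; this is only an illustration and assumes every value not shown stays exactly as in the configuration above:
+
+```yaml
+fsdp_config:
+  # Shard the model within each node and replicate that sharded copy across nodes,
+  # so the nodes act as data-parallel replicas of each other.
+  fsdp_sharding_strategy: HYBRID_SHARD
+  # ... (all other FSDP configurations unchanged)
+```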
+
+### dstack task description
+
+Fine-tuning LLMs on multiple nodes means the nodes must be provisioned, connected to the same network, and managed as one cluster, and `dstack` takes care of this automatically. The `dstack` task description below assumes three nodes, each with two GPUs:
+
+```yaml
+type: task
+python: "3.11"
+nodes: 3
+env:
+  - ACCEL_CONFIG_PATH
+  - FT_MODEL_CONFIG_PATH
+  - HUGGING_FACE_HUB_TOKEN
+  - WANDB_API_KEY
+commands:
+  # ... (setup steps, cloning repo, installing requirements)
+  - ACCELERATE_LOG_LEVEL=info accelerate launch \
+    --config_file recipes/custom/accel_config.yaml \
+    --main_process_ip=$DSTACK_MASTER_NODE_IP \
+    --main_process_port=8008 \
+    --machine_rank=$DSTACK_NODE_RANK \
+    --num_processes=$DSTACK_GPUS_NUM \
+    --num_machines=$DSTACK_NODES_NUM \
+    scripts/run_sft.py recipes/custom/config.yaml
+ports:
+  - 6006
+resources:
+  gpu: 1..2
+  shm_size: 24GB
+```
+
+Once you set `nodes` to a number greater than `1`, `dstack` sets up the multi-node environment for you. Furthermore, within the YAML file you can access special variables that `dstack` provides automatically: `$DSTACK_MASTER_NODE_IP`, `$DSTACK_NODE_RANK`, `$DSTACK_GPUS_NUM`, and `$DSTACK_NODES_NUM` are exactly the pieces of information `accelerate` needs to run a job across multiple nodes. Hence, `dstack` integrates smoothly with Hugging Face's open-source ecosystem.
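+
+If you want to sanity-check what these variables resolve to on each node, one simple option (a hypothetical debugging step, not part of the recipe) is to echo them from the task's `commands` before launching the training:
+
+```yaml
+commands:
+  # Hypothetical debugging step: print the values dstack injects on each node.
+  - echo "master=$DSTACK_MASTER_NODE_IP rank=$DSTACK_NODE_RANK gpus=$DSTACK_GPUS_NUM nodes=$DSTACK_NODES_NUM"
+  # ... (the accelerate launch command shown above)
+```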
+
+Also, these special variables are better resolved at runtime than hard-coded. It is common to run a job on a cluster of cheaper machines for the unit-testing phase and then run the same job on a much bigger cluster of expensive machines for the actual fine-tuning phase. `dstack` lets us focus on adjusting only `nodes` and `resources`.
+
 ## Results
 
 - [merged_ds_coding](https://huggingface.co/datasets/chansung/merged_ds_coding): SFT dataset solely for coding tasks, containing roughly 60k training examples.
-- [chansung/coding_llamaduo_60k_v0.2](https://huggingface.co/chansung/coding_llamaduo_60k_v0.2): QLoRA adapter for Gemma 7B with exactly the same configuration as in [`config.yaml`](./config.yaml). This adapter is fine-tuned on the `merged_ds_coding` dataset with 2xA6000 GPUs via `dstack` Sky.
\ No newline at end of file
+- [chansung/coding_llamaduo_60k_v0.2](https://huggingface.co/chansung/coding_llamaduo_60k_v0.2): QLoRA adapter for Gemma 7B with exactly the same configuration as in [`config.yaml`](./config.yaml). This adapter is fine-tuned on the `merged_ds_coding` dataset with 2xA6000 GPUs via `dstack` Sky.

From 4b25a994253a4d5143cec9f0f16bb4b25e89076f Mon Sep 17 00:00:00 2001
From: Chansung Park
Date: Thu, 11 Jul 2024 16:20:30 +0900
Subject: [PATCH 2/3] update with the running command

---
 .../fine-tuning/alignment-handbook/README.md | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/examples/fine-tuning/alignment-handbook/README.md b/examples/fine-tuning/alignment-handbook/README.md
index 2ae293156..b3033003c 100644
--- a/examples/fine-tuning/alignment-handbook/README.md
+++ b/examples/fine-tuning/alignment-handbook/README.md
@@ -145,7 +145,7 @@ env:
 commands:
   # ... (setup steps, cloning repo, installing requirements)
   - ACCELERATE_LOG_LEVEL=info accelerate launch \
-    --config_file recipes/custom/accel_config.yaml \
+    --config_file examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml \
     --main_process_ip=$DSTACK_MASTER_NODE_IP \
     --main_process_port=8008 \
     --machine_rank=$DSTACK_NODE_RANK \
@@ -155,7 +155,7 @@ commands:
 ports:
   - 6006
 resources:
-  gpu: 1..2
+  gpu: 24GB:2
   shm_size: 24GB
 ```
 
 Once you set `nodes` to a number greater than `1`, `dstack` sets up the multi-node environment for you. Furthermore, within the YAML file you can access special variables that `dstack` provides automatically: `$DSTACK_MASTER_NODE_IP`, `$DSTACK_NODE_RANK`, `$DSTACK_GPUS_NUM`, and `$DSTACK_NODES_NUM` are exactly the pieces of information `accelerate` needs to run a job across multiple nodes. Hence, `dstack` integrates smoothly with Hugging Face's open-source ecosystem.
 
 Also, these special variables are better resolved at runtime than hard-coded. It is common to run a job on a cluster of cheaper machines for the unit-testing phase and then run the same job on a much bigger cluster of expensive machines for the actual fine-tuning phase. `dstack` lets us focus on adjusting only `nodes` and `resources`.
 
+### Running multi-node task
+
+This tutorial comes with pre-defined YAML files for the [accelerate configuration](./fsdp_qlora_full_shard.yaml) and `dstack`'s [task description](./train-distrib.dstack.yaml). You can try multi-node fine-tuning with `dstack` by running the following command:
+
+```shell
+HUGGING_FACE_HUB_TOKEN=<...> \
+WANDB_API_KEY=<...> \
+dstack run . -f examples/fine-tuning/alignment-handbook/train-distrib.dstack.yaml
+```
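+
+The task description also declares `FT_MODEL_CONFIG_PATH` and `ACCEL_CONFIG_PATH` in its `env` section, so values for them need to be supplied as well. As a purely illustrative sketch, assuming you point them at the config files shipped alongside this example, the invocation might look like this:
+
+```shell
+# Illustrative values only: adjust the paths to your own recipe and accelerate config.
+FT_MODEL_CONFIG_PATH=examples/fine-tuning/alignment-handbook/config.yaml \
+ACCEL_CONFIG_PATH=examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml \
+HUGGING_FACE_HUB_TOKEN=<...> \
+WANDB_API_KEY=<...> \
+dstack run . -f examples/fine-tuning/alignment-handbook/train-distrib.dstack.yaml
+```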
+
+> [!NOTE]
+> Weights and Biases doesn't yet log everything from multiple nodes, since Hugging Face's library doesn't support it.
+
 ## Results
 
 - [merged_ds_coding](https://huggingface.co/datasets/chansung/merged_ds_coding): SFT dataset solely for coding tasks, containing roughly 60k training examples.

From c5004d264eb9518a87b79f602ff58e6218e8d76e Mon Sep 17 00:00:00 2001
From: Chansung Park
Date: Thu, 11 Jul 2024 07:24:26 +0000
Subject: [PATCH 3/3] add required materials

---
 .../fine-tuning/alignment-handbook/README.md |  2 +-
 .../fsdp_qlora_full_shard.yaml               | 26 ++++++++++++
 .../train-distrib.dstack.yml                 | 41 +++++++++++++++++++
 3 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
 create mode 100644 examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml

diff --git a/examples/fine-tuning/alignment-handbook/README.md b/examples/fine-tuning/alignment-handbook/README.md
index b3033003c..3bfcdf173 100644
--- a/examples/fine-tuning/alignment-handbook/README.md
+++ b/examples/fine-tuning/alignment-handbook/README.md
@@ -165,7 +165,7 @@ Also, these special variables are better resolved at runtime than hard-coded.
 
 ### Running multi-node task
 
-This tutorial comes with pre-defined YAML files for the [accelerate configuration](./fsdp_qlora_full_shard.yaml) and `dstack`'s [task description](./train-distrib.dstack.yaml). You can try multi-node fine-tuning with `dstack` by running the following command:
+This tutorial comes with pre-defined YAML files for the [accelerate configuration](./fsdp_qlora_full_shard.yaml) and `dstack`'s [task description](./train-distrib.dstack.yml). You can try multi-node fine-tuning with `dstack` by running the following command:
 
 ```shell
 HUGGING_FACE_HUB_TOKEN=<...> \
 WANDB_API_KEY=<...> \
 dstack run . -f examples/fine-tuning/alignment-handbook/train-distrib.dstack.yaml

diff --git a/examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml b/examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
new file mode 100644
index 000000000..ad51566b2
--- /dev/null
+++ b/examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
@@ -0,0 +1,26 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch: BACKWARD_PRE
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_forward_prefetch: false
+  fsdp_offload_params: false
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sync_module_states: true
+  fsdp_use_orig_params: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 2
+num_processes: 4
+rdzv_backend: static
+same_network: false
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
\ No newline at end of file

diff --git a/examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml b/examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml
new file mode 100644
index 000000000..d6f4cf2a0
--- /dev/null
+++ b/examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml
@@ -0,0 +1,41 @@
+type: task
+
+python: "3.11"
+
+nodes: 3
+
+env:
+  - FT_MODEL_CONFIG_PATH
+  - ACCEL_CONFIG_PATH
+
+  - HUGGING_FACE_HUB_TOKEN
+  - WANDB_API_KEY
+
+commands:
+  - conda install cuda
+  - git clone https://github.com/huggingface/alignment-handbook.git
+  - mkdir -p alignment-handbook/recipes/custom/
+  - cp "$FT_MODEL_CONFIG_PATH" alignment-handbook/recipes/custom/config.yaml
+  - cp "$ACCEL_CONFIG_PATH" alignment-handbook/recipes/custom/accel_config.yaml
+
+  - cd alignment-handbook
+  - python -m pip -q install .
+  - python -m pip install -q flash-attn --no-build-isolation
+
+  - pip install -q wandb
+  - wandb login $WANDB_API_KEY
+
+  - ACCELERATE_LOG_LEVEL=info accelerate launch
+    --config_file examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
+    --main_process_ip=$DSTACK_MASTER_NODE_IP
+    --main_process_port=8008
+    --machine_rank=$DSTACK_NODE_RANK
+    --num_processes=$DSTACK_GPUS_NUM
+    --num_machines=$DSTACK_NODES_NUM
+    scripts/run_sft.py recipes/custom/config.yaml
+ports:
+  - 6006
+
+resources:
+  gpu: 24GB:2
+  shm_size: 24GB