75 changes: 73 additions & 2 deletions examples/fine-tuning/alignment-handbook/README.md
@@ -45,7 +45,7 @@ ACCELERATE_LOG_LEVEL=info accelerate launch \
For more details and other alignment methods, please check out the
alignment-handbook's [official repository](https://github.com/huggingface/alignment-handbook).

## Running via `dstack` {#running-via-dstack}

This example demonstrates how to run an Alignment Handbook recipe via `dstack`.

@@ -105,7 +105,78 @@ WANDB_API_KEY=<...> \
dstack run . -f examples/fine-tuning/alignment-handbook/train.dstack.yaml
```

## Multi-node

With `dstack`, we can easily manage multiple nodes, each with multiple GPUs. To run an Alignment Handbook recipe on multiple nodes with `dstack`, we need to adjust two things: Hugging Face's `accelerate` configuration and the `dstack` task description.

### Accelerate configurations

The `accelerate` configuration itself doesn't have to change much; it could stay the same as the `multi_gpu.yaml` used in the previous [Running via `dstack`](#running-via-dstack) section. However, the `fsdp_sharding_strategy` option is worth understanding.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP # Use Fully Sharded Data Parallelism
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  # ... (other FSDP configurations)
# ... (other configurations)
```

With `distributed_type: FSDP` and `fsdp_sharding_strategy: FULL_SHARD`, the model is sharded across every GPU in the run, including GPUs that live on different nodes. If you would rather shard the model only within each node and replicate it across nodes, so that every node holds its own sharded copy and works on different batches of the dataset, set `fsdp_sharding_strategy` to `HYBRID_SHARD`.
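
A minimal sketch of that change (all other FSDP options from the config above stay the same):

```yaml
fsdp_config:
  # FULL_SHARD within each node, DDP-style replication across nodes
  fsdp_sharding_strategy: HYBRID_SHARD
  # ... (other FSDP configurations unchanged)
```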

### dstack task description

Fine-tuning LLMs on multiple nodes requires the nodes to be connected and managed on the same network, and `dstack` handles this automatically. The following `dstack` task description assumes three nodes, each with two GPUs:

```yaml
type: task
python: "3.11"
nodes: 3
env:
  - ACCEL_CONFIG_PATH
  - FT_MODEL_CONFIG_PATH
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  # ... (setup steps, cloning repo, installing requirements)
  - ACCELERATE_LOG_LEVEL=info accelerate launch
      --config_file recipes/custom/accel_config.yaml
      --main_process_ip=$DSTACK_MASTER_NODE_IP
      --main_process_port=8008
      --machine_rank=$DSTACK_NODE_RANK
      --num_processes=$DSTACK_GPUS_NUM
      --num_machines=$DSTACK_NODES_NUM
      scripts/run_sft.py recipes/custom/config.yaml
ports:
  - 6006
resources:
  gpu: 24GB:2
  shm_size: 24GB
```

Once you set `nodes` to a number greater than `1`, `dstack` automatically provisions and wires up a multi-node environment. Within the YAML file, you can then use special environment variables that `dstack` provides: `$DSTACK_MASTER_NODE_IP`, `$DSTACK_NODE_RANK`, `$DSTACK_GPUS_NUM`, and `$DSTACK_NODES_NUM` supply exactly the information `accelerate` needs to launch a job across multiple nodes. In this way, `dstack` integrates smoothly with Hugging Face's open-source ecosystem.
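
For illustration, on the three-node, two-GPU-per-node cluster above, the launch command on the first node would resolve to something like the following (the IP address is made up; `$DSTACK_GPUS_NUM` is the total number of GPUs across all nodes):

```shell
# Resolved values in this example: DSTACK_MASTER_NODE_IP=10.0.0.4 (illustrative),
# DSTACK_NODE_RANK=0, DSTACK_GPUS_NUM=6, DSTACK_NODES_NUM=3
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/custom/accel_config.yaml \
  --main_process_ip=10.0.0.4 \
  --main_process_port=8008 \
  --machine_rank=0 \
  --num_processes=6 \
  --num_machines=3 \
  scripts/run_sft.py recipes/custom/config.yaml
```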

Also, it is worth noting that these special variables are best left to be resolved at runtime rather than hard-coded. It is common to run a job on a cluster of cheaper machines during the testing phase and then run the same job on a much bigger cluster of more powerful machines for the actual fine-tuning. `dstack` lets us focus on adjusting only `nodes` and `resources`.
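
For example, moving from the test cluster above to a larger fine-tuning cluster could be as small a change as the following (illustrative values; everything else in the task description stays untouched):

```yaml
# Only the cluster shape changes; the $DSTACK_* variables and the
# accelerate flags derived from them adjust automatically.
nodes: 4

resources:
  gpu: 80GB:8
  shm_size: 128GB
```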

### Running multi-node task

This tutorial comes with predefined YAML files for the [accelerate configuration](./fsdp_qlora_full_shard.yaml) and the `dstack` [task description](./train-distrib.dstack.yml). You can try multi-node fine-tuning with `dstack` by running the following command:

```shell
HUGGING_FACE_HUB_TOKEN=<...> \
WANDB_API_KEY=<...> \
dstack run . -f examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml
```
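
Note that the task description also declares `FT_MODEL_CONFIG_PATH` and `ACCEL_CONFIG_PATH`, so these need to be set as well if they aren't already exported in your shell, for example (paths shown for illustration):

```shell
FT_MODEL_CONFIG_PATH=examples/fine-tuning/alignment-handbook/config.yaml \
ACCEL_CONFIG_PATH=examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml \
HUGGING_FACE_HUB_TOKEN=<...> \
WANDB_API_KEY=<...> \
dstack run . -f examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml
```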

> [!NOTE]
> Weights & Biases doesn't log everything from multiple nodes, since Hugging Face's libraries don't support it yet.

## Results

- [merged_ds_coding](https://huggingface.co/datasets/chansung/merged_ds_coding): SFT dataset solely for coding tasks. It contains roughly 60k training examples.
- [chansung/coding_llamaduo_60k_v0.2](https://huggingface.co/chansung/coding_llamaduo_60k_v0.2): QLoRA adapter for Gemma 7B with exactly the same configuration as in [`config.yaml`](./config.yaml). This adapter is fine-tuned on the `merged_ds_coding` dataset with 2xA6000 GPUs via `dstack` Sky.
26 changes: 26 additions & 0 deletions examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
@@ -0,0 +1,26 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
41 changes: 41 additions & 0 deletions examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml
@@ -0,0 +1,41 @@
type: task

python: "3.11"

nodes: 3

env:
  - FT_MODEL_CONFIG_PATH
  - ACCEL_CONFIG_PATH

  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY

commands:
  - conda install cuda
  - git clone https://github.com/huggingface/alignment-handbook.git
  - mkdir -p alignment-handbook/recipes/custom/
  - cp "$FT_MODEL_CONFIG_PATH" alignment-handbook/recipes/custom/config.yaml
  - cp "$ACCEL_CONFIG_PATH" alignment-handbook/recipes/custom/accel_config.yaml

  - cd alignment-handbook
  - python -m pip -q install .
  - python -m pip install -q flash-attn --no-build-isolation

  - pip install -q wandb
  - wandb login $WANDB_API_KEY

  - ACCELERATE_LOG_LEVEL=info accelerate launch
      --config_file recipes/custom/accel_config.yaml
      --main_process_ip=$DSTACK_MASTER_NODE_IP
      --main_process_port=8008
      --machine_rank=$DSTACK_NODE_RANK
      --num_processes=$DSTACK_GPUS_NUM
      --num_machines=$DSTACK_NODES_NUM
      scripts/run_sft.py recipes/custom/config.yaml
ports:
  - 6006

resources:
  gpu: 24GB:2
  shm_size: 24GB