28 changes: 28 additions & 0 deletions examples/fine-tuning/alignment-handbook/.dstack.yml
@@ -0,0 +1,28 @@
type: dev-environment
name: ah-vscode

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY

# Uncomment if you want the environment to be pre-installed
#init:
# - conda install cuda
# - git clone https://github.com/huggingface/alignment-handbook.git
# - cd alignment-handbook
# - pip install .
# - pip install flash-attn --no-build-isolation
# - pip install wandb

ide: vscode

spot_policy: auto

resources:
  # Minimum 24GB of memory, one or more GPUs
gpu: 24GB..:1..
242 changes: 120 additions & 122 deletions examples/fine-tuning/alignment-handbook/README.md
@@ -1,182 +1,180 @@
# Alignment Handbook

[Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face offers recipes, configs, and
scripts for fine-tuning LLMs. It includes all the code needed to run CPT, SFT, DPO, and ORPO using Hugging Face
libraries such as `transformers`, `peft`, `accelerate`, and `trl`. You just need to modify the recipes and run the
appropriate script.

This example shows how to use Alignment Handbook and `dstack` to fine-tune Gemma 7B on your own SFT dataset.

!!! info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`:

    ```shell
    git clone https://github.com/dstackai/dstack
    cd dstack
    dstack init
    ```

## Training configuration recipe

Alignment Handbook's training script reads the model, LoRA, and dataset arguments, as well as the trainer
configuration, from a YAML file.
This file can be found at [`examples/fine-tuning/alignment-handbook/config.yaml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/config.yaml).
You can modify it as needed.
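
The recipe follows the Zephyr-7B-Gemma QLoRA recipe: `dataset_mixer` points at the SFT dataset to train on, the
LoRA arguments enable QLoRA fine-tuning, and `hub_model_id` and `output_dir` control where the model and its
checkpoints are saved. Below is an abbreviated sketch of the kind of fields it contains; the `hub_model_id` and
`output_dir` values are illustrative placeholders:

```yaml
# Model arguments
model_name_or_path: google/gemma-7b
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# LoRA (QLoRA) arguments
load_in_4bit: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05

# Data arguments: point this at your own SFT dataset on the Hugging Face Hub
dataset_mixer:
  chansung/merged_ds_coding: 1.0
dataset_splits:
  - train_sft
  - test_sft

# SFT trainer arguments
per_device_train_batch_size: 2
push_to_hub: true
hub_model_id: "<your-username>/<your-model-name>"  # placeholder
output_dir: "data/<your-model-name>"               # placeholder
report_to:
  - wandb
```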

## Single-node training

The easiest way to run a training script with `dstack` is by creating a task configuration file.
This file can be found at [`examples/fine-tuning/alignment-handbook/train.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/train.dstack.yml). Below is its content:

```yaml
type: task
name: ah-train

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY

# Commands of the task
commands:
  - conda install cuda
  - git clone https://github.com/huggingface/alignment-handbook.git
  - cd alignment-handbook
  - pip install .
  - pip install flash-attn --no-build-isolation
  - pip install wandb
  - accelerate launch
    --config_file recipes/accelerate_configs/multi_gpu.yaml
    --num_processes=$DSTACK_GPUS_NUM
    scripts/run_sft.py
    ../examples/fine-tuning/alignment-handbook/config.yaml

# Expose 6006 to access TensorBoard
ports:
  - 6006

resources:
  # Required resources
  gpu: 24GB
```

The task clones Alignment Handbook's repo, installs the dependencies, and runs the training script.

Our `config.yaml` sets `report_to` to `wandb`, which is why the task also installs `wandb`.
The `DSTACK_GPUS_NUM` environment variable is automatically passed to the container
according to the `resources` property.
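
For example, if you request two GPUs per run, `DSTACK_GPUS_NUM` is set to `2` and `accelerate` launches two
processes; a sketch, not part of the shipped configuration:

```yaml
resources:
  # Two GPUs with at least 24GB of VRAM each; DSTACK_GPUS_NUM is then set to 2
  gpu: 24GB:2
```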

To run the task, use `dstack apply`:

```shell
HUGGING_FACE_HUB_TOKEN=...
ACCELERATE_LOG_LEVEL=...
WANDB_API_KEY=...

dstack apply -f examples/fine-tuning/alignment-handbook/train.dstack.yml
```

If you list `tensorboard` under `report_to` in [`examples/fine-tuning/alignment-handbook/config.yaml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/config.yaml),
you'll be able to access experiment metrics via `http://localhost:6006` (while the task is running).

## Multi-node training

The multi-node training task configuration file can be found at [`examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/train-distrib.dstack.yml).
Below is its content:

```yaml
type: task
name: ah-train-distrib

# If `image` is not specified, dstack uses its default image
python: "3.10"

# Required environment variables
env:
  - HUGGING_FACE_HUB_TOKEN
  - ACCELERATE_LOG_LEVEL=info
  - WANDB_API_KEY

# Commands of the task (dstack runs them on each node)
commands:
  - conda install cuda
  - git clone https://github.com/huggingface/alignment-handbook.git
  - cd alignment-handbook
  - pip install .
  - pip install flash-attn --no-build-isolation
  - pip install wandb
  - accelerate launch
    --config_file ../examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml
    --main_process_ip=$DSTACK_MASTER_NODE_IP
    --main_process_port=8008
    --machine_rank=$DSTACK_NODE_RANK
    --num_processes=$DSTACK_GPUS_NUM
    --num_machines=$DSTACK_NODES_NUM
    scripts/run_sft.py
    ../examples/fine-tuning/alignment-handbook/config.yaml

# Expose 6006 to access TensorBoard
ports:
  - 6006

# The number of interconnected instances required
nodes: 2

resources:
  # Required resources
  gpu: 24GB
  # Shared memory size for inter-process communication
  shm_size: 24GB
```

Here's how the multi-node task differs from the single-node one:

1. The `nodes` property is set to the number of required nodes (it should match the fleet's node count).
2. Under `resources`, `shm_size` specifies the shared memory size used for communication between parallel
   processes within a node (in case multiple GPUs per node are used).
3. Instead of Alignment Handbook's [`recipes/accelerate_configs/multi_gpu.yaml`](https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/multi_gpu.yaml),
   we use [`examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/fsdp_qlora_full_shard.yaml)
   as the `accelerate` config (see the sketch after this list).
4. We use the `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, `DSTACK_GPUS_NUM`, and `DSTACK_NODES_NUM` environment variables
   to configure `accelerate`. These environment variables are automatically passed
   to the container on each node based on the task configuration.
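
For reference, an FSDP-based `accelerate` config along the lines of `fsdp_qlora_full_shard.yaml` might look like the
sketch below; it's an illustration, not the exact file from the repo. With `fsdp_sharding_strategy: FULL_SHARD`,
parameters, gradients, and optimizer states are sharded across all GPUs on all nodes; `HYBRID_SHARD` would instead
shard within each node and replicate across nodes.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP  # Fully Sharded Data Parallel
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_use_orig_params: false
  # ... (other FSDP options)
mixed_precision: bf16
num_machines: 1   # overridden by --num_machines at launch time
num_processes: 1  # overridden by --num_processes at launch time
```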

## Fleets

> By default, `dstack apply` reuses `idle` instances from one of the existing [fleets](https://dstack.ai/docs/fleets).
> If no `idle` instances meet the requirements, it creates a new fleet using one of the configured backends.

The example folder includes two cloud fleet configurations: [`examples/fine-tuning/alignment-handbook/fleet.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/fleet.dstack.yml) (a single node with a `24GB` GPU)
and [`examples/fine-tuning/alignment-handbook/fleet-distrib.dstack.yml`](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/fleet-distrib.dstack.yml) (a cluster of two nodes, each with a `24GB` GPU).

You can update the fleet configurations to change the vRAM size, GPU model, number of GPUs per node, or number of nodes.
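
For illustration, a distributed fleet configuration along these lines might look like the sketch below (assuming two
interconnected nodes with one `24GB` GPU each; see the files linked above for the exact contents):

```yaml
type: fleet
name: ah-fleet-distrib

# Provision two interconnected instances placed in the same cluster
nodes: 2
placement: cluster

resources:
  gpu: 24GB
```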

A fleet can be provisioned with `dstack apply`:

```shell
dstack apply -f examples/fine-tuning/alignment-handbook/fleet.dstack.yml
```

Once provisioned, the fleet can run dev environments and fine-tuning tasks.
To delete the fleet, use `dstack fleet delete`.

> To ensure `dstack apply` always reuses an existing fleet,
> pass `--reuse` to `dstack apply` (or set `creation_policy` to `reuse` in the task configuration).
> The default policy is `reuse_or_create`.
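
For example, a task that should only ever reuse fleet instances could set the policy directly in its configuration;
a minimal sketch, assuming `creation_policy` is set at the top level of the task:

```yaml
type: task
name: ah-train

# Only reuse existing `idle` instances; don't provision new ones
creation_policy: reuse
```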

## Dev environment

If you'd like to play with the example using a dev environment, run
[.dstack.yml](.dstack.yml) via `dstack apply`:

```shell
dstack apply -f examples/fine-tuning/alignment-handbook/.dstack.yml
```

## Results

- [merged_ds_coding](https://huggingface.co/datasets/chansung/merged_ds_coding): an SFT dataset focused solely on coding tasks, with roughly 60k training examples.
- [chansung/coding_llamaduo_60k_v0.2](https://huggingface.co/chansung/coding_llamaduo_60k_v0.2): a QLoRA adapter for Gemma 7B fine-tuned on the `merged_ds_coding` dataset with exactly the same configuration as in [`config.yaml`](./config.yaml), using 2xA6000 GPUs via `dstack` Sky.

## What's next?

1. Browse [Alignment Handbook](https://github.com/huggingface/alignment-handbook).
2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
[services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/fleets).
3. See other [examples](https://github.com/dstackai/dstack/blob/master/examples/).
23 changes: 12 additions & 11 deletions examples/fine-tuning/alignment-handbook/config.yaml
@@ -3,7 +3,8 @@ model_name_or_path: google/gemma-7b
model_revision: main
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml # Custom tokenizer with <|im_start|> and <|im_end|> tokens
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bnb_4bit_quant_storage: bfloat16

# LoRA arguments
load_in_4bit: true
@@ -12,20 +13,20 @@ lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Data training arguments
dataset_mixer:
chansung/merged_ds_coding: 1.0
dataset_splits:
  - train_sft
  - test_sft
preprocessing_num_workers: 12

# SFT trainer config
@@ -55,7 +56,7 @@ per_device_eval_batch_size: 2
per_device_train_batch_size: 2
push_to_hub: true
report_to:
#- tensorboard (temporarily disabled due to an issue with Alignment Handbook throwing an exception)
- wandb
save_strategy: "steps"
save_steps: 100