# Train Falcon model using SageMaker Distributed Data Parallel Library (SMDDP) and PyTorch Fully Sharded Data Parallelism (FSDP)

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

---

In this tutorial, we show how to train or fine-tune [Falcon-7B-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) on the [GLUE/SST2](https://huggingface.co/datasets/glue/viewer/sst2/train) dataset. We use 2 p4d.24xlarge instances, each with 8 NVIDIA A100 40GB GPUs, together with the PyTorch Fully Sharded Data Parallelism (FSDP) technique to train this large model efficiently with limited GPU memory.

To speed up training, we also use the **SageMaker Distributed Data Parallel Library (SMDDP)**, which accelerates GPU communication across P4d instances during sharded data parallel training.

## Files

* `scripts/train.py` - The entry point for the training script where we initialize the SMDDP library.
* `scripts/utils.py` - Helper script for defining dataloaders.
* `scripts/requirements.txt` - List of dependencies required for this example to train on SageMaker.

*Note: The SMDDP library for accelerated sharded data parallel training is compatible with deep learning containers from PyTorch 2.0 onwards.  Ensure you are using PyTorch >=2.0 for this example.*

### How optimized GPU communication is enabled with SMDDP in FSDP
Enabling the SMDDP library in an existing FSDP training script is seamless.  As shown in `train.py`, the only code modifications required are:
* Importing the library: `import smdistributed.dataparallel.torch.torch_smddp`
* Creating the process group with `"smddp"` backend: `torch.distributed.init_process_group("smddp")`
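
The two changes listed above can be sketched as follows. This is a minimal illustration rather than the contents of `train.py`: the helper names `select_backend` and `init_distributed` are hypothetical, and the fallback to `"nccl"` when the SMDDP library is not installed is an assumption added so the snippet can also be imported outside a SageMaker container.

```python
def select_backend() -> str:
    """Return "smddp" if the SMDDP library can be imported, else "nccl".

    The NCCL fallback is an assumption for running outside SageMaker.
    """
    try:
        # Importing this module registers the "smddp" backend with PyTorch.
        import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
        return "smddp"
    except ImportError:
        return "nccl"


def init_distributed() -> str:
    """Create the default process group with the selected backend."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    backend = select_backend()
    # Rank and world size are read from the environment set by the launcher.
    dist.init_process_group(backend)
    return backend
```

In a training script, `init_distributed()` would be called once per process before the model is wrapped with FSDP.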

## 1. Getting started

First, we import the SageMaker Python SDK and set up the session and execution role used in the rest of the notebook.

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [52]:
import datetime

import boto3
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

try:
    # Inside SageMaker, the execution role is attached to the environment.
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look the role up by name; adjust RoleName for your account.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sagemaker_session.boto_region_name}")

sagemaker role arn: arn:aws:iam::015476483300:role/service-role/AmazonSageMaker-ExecutionRole-20240126T111548
sagemaker bucket: sagemaker_session_bucket
sagemaker session region: us-west-2


## 2. Load and prepare the dataset

As the base dataset, we use the [GLUE/SST2](https://huggingface.co/datasets/glue/viewer/sst2/train) dataset. Before training the model, we need to preprocess the data: we create chunks of `2048` tokens ([model max length](https://huggingface.co/EleutherAI/gpt-neox-20b)) to avoid unnecessary padding and computation.
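
The chunking idea can be sketched in plain Python. This is an illustration under stated assumptions, not the preprocessing code used by this example: the function name `chunk_token_ids` is hypothetical, the input is assumed to be a list of already tokenized documents (lists of token ids), and leftover tokens that do not fill a whole chunk are simply dropped.

```python
from itertools import chain


def chunk_token_ids(tokenized_docs, chunk_length=2048):
    """Concatenate token-id lists and split them into equal-size chunks.

    Tokens left over after the last full chunk are dropped.
    """
    all_ids = list(chain.from_iterable(tokenized_docs))
    usable = (len(all_ids) // chunk_length) * chunk_length
    return [all_ids[i : i + chunk_length] for i in range(0, usable, chunk_length)]


# Toy example with a chunk length of 4 instead of 2048.
chunks = chunk_token_ids([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because every chunk has exactly `chunk_length` tokens, no padding tokens are needed when the chunks are batched.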

The first step is to load our dataset from Hugging Face.

In [24]:
model_id = "tiiuae/falcon-7b"

In [129]:
def create_worker_batches(num_of_hf_splits, num_of_workers, token, split_name):
    """
    Create a list of worker batches for processing a dataset.

    Args:
        num_of_hf_splits (int): The number of Hugging Face dataset splits.
        num_of_workers (int): The number of worker processes.
        token (str): The token to be replaced in the split name.
        split_name (str): The base name of the dataset split files.

    Returns:
        list: A list of worker batches, where each batch is a list of file names.
    """
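    # Integer division: splits beyond num_of_workers * batch_size are not
    # assigned to any worker (e.g. 24 of 1024 splits with 200 workers).
    batch_size = num_of_hf_splits // num_of_workers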

    worker_batches = []
    for worker_index in range(num_of_workers):
        worker_node = []
        start = worker_index * batch_size
        end = start + batch_size
        for batch in range(start, end):
            file_name = split_name.replace(token, str(batch).zfill(5))
            worker_node.append(file_name)
        worker_batches.append(worker_node)

    return worker_batches

# Example usage
num_of_hf_splits = 1024
num_of_workers = 200
token = 'split_number'
split_name = f'en/c4-train.{token}-of-01024.json.gz'

worker_batches = create_worker_batches(num_of_hf_splits, num_of_workers, token, split_name)

In [130]:
for worker_batch in worker_batches:
    print(worker_batch)

['en/c4-train.00000-of-01024.json.gz', 'en/c4-train.00001-of-01024.json.gz', 'en/c4-train.00002-of-01024.json.gz', 'en/c4-train.00003-of-01024.json.gz', 'en/c4-train.00004-of-01024.json.gz']
['en/c4-train.00005-of-01024.json.gz', 'en/c4-train.00006-of-01024.json.gz', 'en/c4-train.00007-of-01024.json.gz', 'en/c4-train.00008-of-01024.json.gz', 'en/c4-train.00009-of-01024.json.gz']
['en/c4-train.00010-of-01024.json.gz', 'en/c4-train.00011-of-01024.json.gz', 'en/c4-train.00012-of-01024.json.gz', 'en/c4-train.00013-of-01024.json.gz', 'en/c4-train.00014-of-01024.json.gz']
['en/c4-train.00015-of-01024.json.gz', 'en/c4-train.00016-of-01024.json.gz', 'en/c4-train.00017-of-01024.json.gz', 'en/c4-train.00018-of-01024.json.gz', 'en/c4-train.00019-of-01024.json.gz']
['en/c4-train.00020-of-01024.json.gz', 'en/c4-train.00021-of-01024.json.gz', 'en/c4-train.00022-of-01024.json.gz', 'en/c4-train.00023-of-01024.json.gz', 'en/c4-train.00024-of-01024.json.gz']
['en/c4-train.00025-of-01024.json.gz', 'en/c4
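
With 1024 splits and 200 workers, the integer division in `create_worker_batches` gives each worker 5 files, which covers 1000 files and leaves 24 unassigned. If every split should be processed, one option is to spread the remainder over the first workers. The sketch below shows that idea on plain indices; the function name `split_indices_evenly` is hypothetical and is not part of this example's scripts.

```python
def split_indices_evenly(num_items, num_workers):
    """Assign indices 0..num_items-1 to workers, spreading the remainder.

    The first (num_items % num_workers) workers receive one extra index.
    """
    base, extra = divmod(num_items, num_workers)
    batches, start = [], 0
    for worker_index in range(num_workers):
        size = base + (1 if worker_index < extra else 0)
        batches.append(list(range(start, start + size)))
        start += size
    return batches


batches = split_indices_evenly(1024, 200)
print(sum(len(b) for b in batches))  # 1024
print(len(batches[0]), len(batches[-1]))  # 6 5
```

Each index can then be turned into a file name with the same `zfill(5)` substitution used in `create_worker_batches`.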

## 3. Train the Falcon model using FSDP and SMDDP on Amazon SageMaker

We begin by uploading the tokenized data to S3, from where it is made available to the training cluster during training.

After processing the datasets, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload our dataset to S3. We use `sagemaker_session.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
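
As a small illustration of how such an S3 path might be assembled, the sketch below joins a bucket name and key parts into an `s3://` URI. The helper `build_s3_uri`, the placeholder bucket name, and the key prefix are all hypothetical; in the notebook the bucket name would come from `sagemaker_session.default_bucket()`.

```python
def build_s3_uri(bucket, *key_parts):
    """Join a bucket name and key parts into an s3:// URI."""
    key = "/".join(part.strip("/") for part in key_parts if part.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# A placeholder bucket name is used so the snippet runs without AWS credentials.
training_input_path = build_s3_uri("my-example-bucket", "processed", "falcon", "train")
print(training_input_path)  # s3://my-example-bucket/processed/falcon/train
```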

In [131]:
use_fsx = True

if use_fsx:
    from sagemaker.inputs import FileSystemInput

    file_system_directory_path = "/56uy3bev/c4"
    file_system_access_mode = "rw"
    file_system_type = "FSxLustre"
    train_fs = FileSystemInput(
        file_system_id='fs-0335d20f0f69afd9b',
        file_system_type=file_system_type,
        directory_path=file_system_directory_path,
        file_system_access_mode=file_system_access_mode,
    )
    data_channels = {"train": train_fs}

In [None]:
data_output_path = "s3://sagemaker-demo-c4/"

base_job_name = "huggingface-dataset-workertest"

job_names = []

# Launch only the first two worker jobs as a test; iterate over
# range(len(worker_batches)) instead to process every batch.
for worker_index in range(2):
    current_time = datetime.datetime.now().strftime("%H-%M-%S")

    dataset_job_name = f'{base_job_name}-{worker_index}-{current_time}'

    print(dataset_job_name)

    job_names.append(dataset_job_name)
    
    # hyperparameters, which are passed into the training job
    hyperparameters = {
        "model_id": model_id,  # model id from huggingface.co/models
        "num_proc": 72,
        "split_range": ','.join(map(str, worker_batches[worker_index])),
        "job_name": dataset_job_name
    }
    
    # estimator
    estimator = PyTorch(
        entry_point="data.py",
        max_run=1800,
        role=role,
        framework_version="2.0.1",
        py_version="py310",
        source_dir="./scripts",
        instance_count=1,
        instance_type="ml.c5.18xlarge",
        volume_size=200,
        sagemaker_session=sagemaker_session,
        disable_output_compression=True,
        keep_alive_period_in_seconds=600,
        hyperparameters=hyperparameters,
        output_path=data_output_path,
        subnets=['subnet-0067baa7d7be55e38'],
        security_group_ids=['sg-05ffe325d7d90c501']
    )

    # start the processing job, passing the FSx for Lustre data channel as input
    estimator.fit(inputs=data_channels, wait=True, job_name=dataset_job_name)

huggingface-dataset-workertest-0-00-25-54


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-dataset-workertest-0-00-25-54


2024-04-19 00:25:54 Starting - Starting the training job...
2024-04-19 00:26:11 Starting - Preparing the instances for training...
2024-04-19 00:26:46 Downloading - Downloading input data........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-04-19 00:27:56,749 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-04-19 00:27:56,750 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-04-19 00:27:56,750 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-04-19 00:27:56,759 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-04-19 00:27:56,760 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-04-19 00:27:58,047 sagemaker-training-toolkit INFO     Installing depen

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-dataset-workertest-1-00-32-41


Training seconds: 337
Billable seconds: 337
huggingface-dataset-workertest-1-00-32-41
2024-04-19 00:32:41 Starting - Starting the training job..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-04-19 00:32:52,294 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-04-19 00:32:52,294 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-04-19 00:32:52,295 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-04-19 00:32:52,304 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-04-19 00:32:52,305 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-04-19 00:32:53,632 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/

In [146]:
job_names


['huggingface-dataset-workertest-0-00-25-54',
 'huggingface-dataset-workertest-1-00-32-41']

In [128]:
!pwd

/home/sagemaker-user/amazon-sagemaker-examples/training/distributed_training/pytorch/data_parallel/fully_sharded_data_parallel/falcon


We can now start our training job with the `.fit()` method, passing our S3 path to the training script.

## 4. Expected Output

After training begins, you should see output similar to the following:

```
4%|▍         | 1/25 [00:08<03:29,  8.72s/it]
8%|▊         | 2/25 [00:10<01:41,  4.39s/it]
12%|█▏        | 3/25 [00:11<01:06,  3.01s/it]
16%|█▌        | 4/25 [00:12<00:49,  2.35s/it]
20%|██        | 5/25 [00:14<00:39,  1.99s/it]
24%|██▍       | 6/25 [00:15<00:33,  1.77s/it]
28%|██▊       | 7/25 [00:16<00:29,  1.64s/it]
32%|███▏      | 8/25 [00:18<00:26,  1.55s/it]
36%|███▌      | 9/25 [00:19<00:23,  1.48s/it]
40%|████      | 10/25 [00:20<00:21,  1.44s/it]
44%|████▍     | 11/25 [00:22<00:19,  1.41s/it]
48%|████▊     | 12/25 [00:23<00:18,  1.40s/it]
52%|█████▏    | 13/25 [00:24<00:16,  1.38s/it]
56%|█████▌    | 14/25 [00:26<00:15,  1.37s/it]
60%|██████    | 15/25 [00:27<00:13,  1.37s/it]
64%|██████▍   | 16/25 [00:29<00:12,  1.36s/it]
68%|██████▊   | 17/25 [00:30<00:10,  1.36s/it]
72%|███████▏  | 18/25 [00:31<00:09,  1.35s/it]
76%|███████▌  | 19/25 [00:33<00:08,  1.35s/it]
80%|████████  | 20/25 [00:34<00:06,  1.35s/it]
84%|████████▍ | 21/25 [00:35<00:05,  1.35s/it]
88%|████████▊ | 22/25 [00:37<00:04,  1.35s/it]
92%|█████████▏| 23/25 [00:38<00:02,  1.35s/it]
96%|█████████▌| 24/25 [00:39<00:01,  1.35s/it]
100%|██████████| 25/25 [00:41<00:00,  1.35s/it]
100%|██████████| 25/25 [00:41<00:00,  1.65s/it]
******epoch=0: train_ppl=tensor(43260.7148, device='cuda:0') train_loss=tensor(10.6750, device='cuda:0')******
0it [00:00, ?it/s]
0it [00:00, ?it/s]
*******epoch=0: eval_ppl=tensor(nan, device='cuda:0') eval_loss=tensor(nan, device='cuda:0')*******
Training done!
```
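
The logged `train_ppl` and `train_loss` values are consistent with perplexity being the exponential of the mean cross-entropy loss. The short check below works under that assumption; it reproduces the relationship from the logged numbers and is not taken from `train.py`.

```python
import math

train_loss = 10.6750  # value reported in the log above
train_ppl = math.exp(train_loss)

# The logged perplexity is 43260.7148; a small gap is expected because the
# loss is printed with only four decimal places.
print(math.isclose(train_ppl, 43260.7148, rel_tol=1e-3))  # True
```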

## 5. Terminate the warm pool cluster if no longer needed

You can terminate the warm pool cluster once you have finished experimenting to reduce billed time.

In [None]:
sagemaker_session.update_training_job(
    estimator.latest_training_job.job_name, resource_config={"KeepAlivePeriodInSeconds": 0}
)

In [144]:
current_time = datetime.datetime.now().strftime("%H-%M-%S")
training_job_name = f'huggingface-training-worker-{current_time}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    "model_id": model_id,  # model id from huggingface.co/models
    "dataset_path": "/opt/ml/input/data/train",  # path where sagemaker will save training dataset
    "valid_path": "/opt/ml/input/data/valid",
    "gradient_checkpointing": True,  # enable gradient checkpointing
    "bf16": True,  # enable mixed precision training
    "optimizer": "adamw_torch",  # optimizer
    "per_device_train_batch_size": 1,  # batch size per device during training
    "epochs": 1,  # number of epochs to train
    "fsdp": '"full_shard auto_wrap"',  # fully sharded data parallelism
    "cache_dir": "/opt/ml/sagemaker/warmpoolcache",  # change this to /tmp if not using warmpools
    "max_steps": 1000,
}

# estimator
estimator = PyTorch(
    entry_point="train.py",
    max_run=1800,
    base_job_name=training_job_name,
    role=role,
    framework_version="2.0.1",
    py_version="py310",
    source_dir="./scripts",
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    sagemaker_session=sagemaker_session,
    disable_output_compression=True,
    distribution={"torch_distributed": {"enabled": True}},
    keep_alive_period_in_seconds=600,
    hyperparameters=hyperparameters,
    output_path=training_input_path,  # S3 URI for job output; define it before running this cell
    subnets=['subnet-0067baa7d7be55e38'],
    security_group_ids=['sg-05ffe325d7d90c501'],
)

In [147]:
estimator.fit(inputs=data_channels, wait=True, job_name=training_job_name)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-training-worker-02-36-20-2024-04-19-02-49-04-099


2024-04-19 02:49:04 Starting - Starting the training job...
2024-04-19 02:49:08 Pending - Training job failed for insufficient capacity
2024-04-19 02:49:08 Failed - Training job failed
..

CapacityError: Error for Training job huggingface-training-worker-02-36-20-2024-04-19-02-49-04-099: Failed. Reason: CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.
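
The run above ended with a `CapacityError` because no `ml.p4d.24xlarge` capacity was available at the time. One common way to handle this kind of transient failure is to retry after a delay. The helper below is a generic sketch of retry with exponential backoff; the name `run_with_retries` and its parameters are hypothetical, and it takes any callable rather than depending on the SageMaker SDK.

```python
import time


def run_with_retries(start_job, max_attempts=3, base_delay_seconds=60.0,
                     retry_on=(Exception,), sleep=time.sleep):
    """Call start_job() until it succeeds or max_attempts is reached.

    Waits base_delay_seconds * 2**attempt between attempts.
    """
    for attempt in range(max_attempts):
        try:
            return start_job()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_seconds * 2 ** attempt)
```

In this notebook, `start_job` could be a function that calls `estimator.fit(...)`, with `retry_on` set to the capacity exception class raised by your SDK version. Training job names must be unique, so each retry would need a fresh job name or none at all.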

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/training|distributed_training|pytorch|data_parallel|fully_sharded_data_parallel|falcon|smddp_fsdp_example.ipynb)