### Train an MPT model using Mosaic Composer and Amazon SageMaker


We start by upgrading the SageMaker Python SDK and boto3. We then import sagemaker and create the session required to launch the training job.

In [None]:
! pip install -U sagemaker boto3

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


### Prepare and upload dataset to S3

We will download the small C4 dataset splits, convert them to a streaming format, and upload them to S3.

In [None]:
! python scripts/convert_dataset_hf.py \
    --dataset c4 \
    --data_subset en \
    --out_root data/my-copy-c4 \
    --splits train_small val_small \
    --concat_tokens 2048 \
    --tokenizer EleutherAI/gpt-neox-20b \
    --eos_text '<|endoftext|>'
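
Before uploading, it can help to confirm that the conversion produced output for every split. The sketch below assumes that the script writes one folder per split under `--out_root` and that each folder holds an `index.json` file, as the MDS streaming format does. The helper name `missing_split_indexes` is hypothetical and not part of the conversion script.

```python
from pathlib import Path


def missing_split_indexes(out_root, splits):
    """Return the splits under out_root that have no index.json file."""
    root = Path(out_root)
    return [s for s in splits if not (root / s / "index.json").is_file()]


if __name__ == "__main__":
    # Paths and split names mirror the conversion command above.
    missing = missing_split_indexes("data/my-copy-c4", ["train_small", "val_small"])
    print("missing splits:", missing or "none")
```

An empty result suggests that both splits were written. Any split listed in the result is worth re-running before the upload.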

In [None]:
train_data_url = sess.upload_data(
    path="data",
    key_prefix="dataset/c4small",
)

In [None]:
print(f"Training data uploaded here - {train_data_url}")

Update the remote path in the YAML file with the S3 URL printed above. For this job we will use the mpt-7b.yaml file.
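
The edit can be made by hand, or scripted along the lines of the minimal sketch below. It assumes that the YAML file keeps the remote location in a top-level key named `data_remote`; check the key name in your copy of mpt-7b.yaml. The helper `set_top_level_key`, the example YAML excerpt, and the bucket name are all illustrative.

```python
import re


def set_top_level_key(yaml_text, key, value):
    """Replace the value of the first top-level `key:` line, leaving other lines untouched."""
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    # A function replacement avoids backslash handling in the new value.
    new_text, count = pattern.subn(lambda m: f"{key}: {value}", yaml_text, count=1)
    if count == 0:
        raise KeyError(f"top-level key {key!r} not found")
    return new_text


# Hypothetical excerpt standing in for the contents of the YAML file.
example_yaml = (
    "data_local: ./my-copy-c4\n"
    "data_remote: # If blank, files must be present in data_local\n"
    "max_seq_len: 2048\n"
)

# Placeholder URL; in the notebook this would be built from train_data_url.
remote = "s3://my-bucket/dataset/c4small/my-copy-c4"
updated = set_top_level_key(example_yaml, "data_remote", remote)
print(updated)
```

Because `upload_data` was called with `path="data"`, the converted splits most likely sit under the `my-copy-c4` sub-folder of the uploaded prefix, which is why the placeholder URL ends with it.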

### Start the training job using the Custom Docker Image

Update the image_uri in the estimator below to use the custom image that we built. As mentioned in the beginning, we will use Amazon SageMaker and Mosaic Composer to train our model. Amazon SageMaker makes it easy to create a multi-node cluster to train our model in a distributed manner. The SageMaker Python SDK supports running training jobs with torchrun, which distributes the script across multiple nodes and GPUs.

To use torchrun to execute our scripts, we only have to define the distribution parameter in our Estimator and set it to {"torch_distributed": {"enabled": True}}. This tells SageMaker to launch our training job with torchrun.
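
For orientation, torchrun starts one process per GPU and gives each process its position in the job through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows how a script could read them. The helper name `read_torchrun_env` is hypothetical, and the sample values assume two ml.p4d.24xlarge instances with 8 GPUs each.

```python
import os


def read_torchrun_env(environ=os.environ):
    """Read the process layout that torchrun exports, with single-process defaults."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global index of this process
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index of this process on its node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Illustrative values for the second GPU on the second node of a 2 x 8 GPU job.
layout = read_torchrun_env({"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "16"})
print(layout)
```

Outside of torchrun the variables are unset, so the helper falls back to a single-process layout. This keeps the same script usable for local debugging.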



In [None]:
import time

from sagemaker.pytorch import PyTorch
# define the training job name
job_name = f'mosaic-llmfoundry-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'


# These environment variables are useful when training on P4d instances, in order to enable EFA-based training.
env = {}
env['FI_PROVIDER'] = 'efa'
env['NCCL_PROTO'] = 'simple'
env['FI_EFA_USE_DEVICE_RDMA'] = '1'
env['RDMAV_FORK_SAFE'] = '1'

# hyperparameters are passed to the entry point as command-line arguments
hyperparameters = {}
hyperparameters["config_path"] = "yamls"
hyperparameters["config_name"] = "mpt-7b.yaml"
hyperparameters["backend"] = "nccl" # Use smddp when you scale the cluster size for better performance.
# estimator
pt_estimator = PyTorch(
    entry_point='run.py',
    source_dir='./scripts',
    instance_type="ml.p4d.24xlarge",
    image_uri="xxxx.dkr.ecr.us-west-2.amazonaws.com/mosaic-llm-foundry-dlc:latest",  # replace with your custom image
    instance_count=2,
    role=role,
    hyperparameters=hyperparameters,
    environment=env,
    disable_output_compression=True,
    keep_alive_period_in_seconds=600,  # keep the warm pool alive for 10 minutes after the job
    distribution={"torch_distributed": {"enabled": True}} # enable torchrun
)

In [None]:
# the job name is an argument of fit(), not of the estimator constructor
pt_estimator.fit(job_name=job_name)

### Terminate warm pools when not needed

In [None]:
sess.update_training_job(pt_estimator.latest_training_job.job_name, resource_config={"KeepAlivePeriodInSeconds":0})
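
To confirm that the warm pool has been released, one option is to inspect the `WarmPoolStatus` field of the DescribeTrainingJob response. The sketch below works on a plain dictionary so that it runs without AWS access. The helper name `warm_pool_status` and the sample response are hypothetical. In the notebook, the dictionary would presumably come from `sess.describe_training_job(...)`.

```python
def warm_pool_status(description):
    """Return the warm pool status from a DescribeTrainingJob response, or 'NotConfigured'."""
    return description.get("WarmPoolStatus", {}).get("Status", "NotConfigured")


# Hand-written response fragment for illustration only, not real output.
sample = {"TrainingJobName": "example-job", "WarmPoolStatus": {"Status": "Terminated"}}
print(warm_pool_status(sample))
```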