# ModelTrainer - SageMaker PySDK Training Redesign

---

## Introduction

The `ModelTrainer` class in the SageMaker Python SDK simplifies launching and managing training jobs on Amazon SageMaker. It provides an intuitive interface for customizing training jobs, including the ability to specify custom scripts, custom containers, and distributed training configurations, and to run training locally. This notebook outlines how to get started with the `ModelTrainer` class, describes its features, and provides examples to help you use it effectively.

### Benefits of the ModelTrainer

The ModelTrainer is designed to address the usability challenges associated with the Estimator class, moving training with the SageMaker PySDK toward a best-in-class developer experience.

Key Improvements Include:
1. **Improved Intuitiveness** - The ModelTrainer reduces complexity by leveraging configuration classes and minimizing the interface to only a few core parameters.
1. **Simplified Script Mode and BYOC** - The ModelTrainer natively supports script mode and removes the coupling to the SageMaker Training Toolkit for running a job in script mode. By removing this runtime dependency, users can bring their own image to launch a training job without needing to adapt it for script mode on SageMaker.
1. **Simplified Distributed Training** - The ModelTrainer gives users more flexibility to customize distributed training: they can specify the exact commands to execute in their container using the `command` parameter of the `SourceCode` class, or use a distributed training configuration class such as `Torchrun()`.


#### Install SageMaker PySDK

In [None]:
!pip install sagemaker "datasets[s3]" "requests<2.32.0" "protobuf<3.20"

In [None]:
import sagemaker

print(sagemaker.__version__)

## ModelTrainer - Basic Execution

This example shows a minimal setup for a ModelTrainer. A user need only provide a training image and the commands to execute in the container using the `SourceCode` class.

In [None]:
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

source_code = SourceCode(
    command="echo 'Hello World'"
)

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="model-trainer-basic-execution",
)

In [None]:
model_trainer.train()

## ModelTrainer - Script Mode

This example showcases an abstracted setup for script mode, where a user provides their training image and a `SourceCode` config with the path to their `source_dir`, the `entry_script`, and any additional `requirements` to install in the training container for their job.

In [None]:
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

source_code = SourceCode(
    source_dir="basic-script-mode",
    requirements="requirements.txt",
    entry_script="custom_script.py",
)

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="model-trainer-script-mode",
)

In [None]:
model_trainer.train()

## ModelTrainer - Local Container Mode

This example showcases how a user can use the `LOCAL_CONTAINER` mode to run a training job in their local environment as Docker containers for local experimentation and testing.

In [None]:
from sagemaker.modules.train.model_trainer import ModelTrainer, Mode
from sagemaker.modules.configs import InputData, SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"


source_code = SourceCode(
    source_dir="basic-script-mode",
    entry_script="local_training_script.py",
)

train_data = InputData(
    channel_name="train",
    data_source="basic-script-mode/data/train/",
)

test_data = InputData(
    channel_name="test",
    data_source="basic-script-mode/data/test/",
)

model_trainer = ModelTrainer(
    training_mode=Mode.LOCAL_CONTAINER,
    training_image=pytorch_image,
    source_code=source_code,
    input_data_config=[train_data, test_data],
    base_job_name="local-container-mode"
)

In [None]:
model_trainer.train()
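
The entry script `local_training_script.py` is not shown in this notebook. As a rough sketch of how such a script might locate the `train` and `test` channels and the model output directory, the helper below falls back to the `/opt/ml/input/data/<channel_name>` and `/opt/ml/model` paths that appear in the hyperparameters later in this notebook. The helper names `channel_dir` and `model_dir` are hypothetical, and the use of `SM_CHANNEL_<NAME>` and `SM_MODEL_DIR` environment variables as overrides is an assumption about the container environment.

```python
import os
from pathlib import Path

# Default container paths, matching the hyperparameters used later in this notebook.
DEFAULT_INPUT_ROOT = "/opt/ml/input/data"
DEFAULT_MODEL_DIR = "/opt/ml/model"


def channel_dir(channel_name, env=None):
    """Return the directory where a channel's data is expected to be mounted."""
    env = os.environ if env is None else env
    # Assumption: SM_CHANNEL_<NAME> may be set in the container; otherwise use the default root.
    override = env.get(f"SM_CHANNEL_{channel_name.upper()}")
    return Path(override) if override else Path(DEFAULT_INPUT_ROOT) / channel_name


def model_dir(env=None):
    """Return the directory where the script should write model artifacts."""
    env = os.environ if env is None else env
    # Assumption: SM_MODEL_DIR may be set in the container; otherwise use the default path.
    return Path(env.get("SM_MODEL_DIR", DEFAULT_MODEL_DIR))


if __name__ == "__main__":
    print("train data:", channel_dir("train"))
    print("test data:", channel_dir("test"))
    print("model dir:", model_dir())
```

Passing an explicit `env` dictionary makes the helpers easy to exercise outside a container.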

---

# Distributed Training

In this section, we will walk through how the ModelTrainer can be used for more complex Distributed Training jobs.

### Setup Variables

In [None]:
model_id = "openlm-research/open_llama_7b"
dataset_name = "tatsu-lab/alpaca"

### Load Data Set

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)


# Load dataset from huggingface.co
dataset = load_dataset(dataset_name)

dataset = dataset.shuffle(seed=42)

In [None]:
if "validation" not in dataset.keys():
    dataset["validation"] = load_dataset(dataset_name, split="train[:1%]")

    dataset["train"] = load_dataset(dataset_name, split="train[1%:]")

### Prepare Dataset

In [None]:
from itertools import chain
from functools import partial


def group_texts(examples, block_size=2048):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


column_names = dataset["train"].column_names

lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"], return_token_type_ids=False),
    batched=True,
    remove_columns=list(column_names),
).map(
    partial(group_texts, block_size=2048),
    batched=True,
)
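
To make the behavior of `group_texts` concrete, the sketch below repeats the function and runs it on a small hand-written batch with `block_size=4`. The toy token IDs are made up for illustration. It also shows an edge case: when the concatenated length is shorter than `block_size`, the short sequence is kept as a single chunk rather than dropped.

```python
from itertools import chain


def group_texts(examples, block_size=2048):
    # Concatenate every sequence in the batch, per column.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split each column into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Toy batch: three tokenized "documents" with 3, 4, and 2 tokens (9 tokens total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]; token 9 is dropped
print(grouped["labels"])     # same chunks as input_ids

# Fewer tokens than block_size: the single short chunk is kept.
short = group_texts({"input_ids": [[1, 2, 3]]}, block_size=4)
print(short["input_ids"])    # [[1, 2, 3]]
```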

### Save Dataset

In [None]:
# save data locally

training_input_path = "distributed-training/processed/data/"
lm_dataset.save_to_disk(training_input_path)

print(f"Saved data to: {training_input_path}")

## ModelTrainer - Distributed Training - Explicit Commands

This example shows how a user could perform a more complex distributed training setup by calling `torchrun` directly through the `command` parameter of the `SourceCode` class.

In [None]:
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode, InputData

env = {}
env["FI_PROVIDER"] = "efa"
env["NCCL_PROTO"] = "simple"
env["NCCL_SOCKET_IFNAME"] = "eth0"
env["NCCL_IB_DISABLE"] = "1"
env["NCCL_DEBUG"] = "WARN"

compute = Compute(
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

hugging_face_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04"

hyperparameters = {
    "dataset_path": "/opt/ml/input/data/dataset",
    "model_dir": "/opt/ml/model",
    "cache_dir": None,
    "max_train_steps": None,
    "num_train_steps": 1,
    "num_warmup_steps": 0,
    "num_train_epochs": 1,
    "forward_prefetch": False,
    "limit_all_gathers": False,
    "lr_scheduler_type": "linear",
    "weight_decay": 0.0,
    "learning_rate": 5e-5,
    "epochs": 1,
    "max_steps": 100,
    "seed": 42,
    "fsdp": "full_shard auto_wrap",
    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    "gradient_checkpointing": True,
    "gradient_accumulation_steps": 1,
    "optimizer": "adamw_torch",
    "per_device_train_batch_size": 1,
    "model_id": model_id,
}

In [None]:
source_code = SourceCode(
    source_dir="distributed-training/scripts",
    requirements="requirements.txt",
    command="torchrun --nnodes 1 \
            --nproc_per_node 8 \
            --master_addr algo-1 \
            --master_port 7777 \
            --node_rank $SM_CURRENT_HOST_RANK \
            run_clm_no_trainer.py",
)

model_trainer = ModelTrainer(
    training_image=hugging_face_image,
    compute=compute,
    environment=env,
    hyperparameters=hyperparameters,
    source_code=source_code,
    base_job_name="model-trainer-distributed-commands",
)

In [None]:
test_data = InputData(
    channel_name="dataset",
    data_source=training_input_path,
)
model_trainer.train(input_data_config=[test_data], wait=False)
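
The `command` string above is written with backslash continuations inside a string literal, which leaves the indentation of each continued line embedded in the command. As an alternative sketch, the command can be assembled from its parts. The helper `build_torchrun_command` is hypothetical and not part of the SDK. Its defaults mirror the values used above, and `$SM_CURRENT_HOST_RANK` is left unquoted on the assumption that the command is run through a shell in the container, as the cell above implies.

```python
def build_torchrun_command(
    entry_script,
    nnodes=1,
    nproc_per_node=8,
    master_addr="algo-1",
    master_port=7777,
    node_rank="$SM_CURRENT_HOST_RANK",
):
    """Assemble a single-line torchrun invocation.

    node_rank defaults to an unquoted shell variable so that it is expanded
    where the command runs, not where the string is built.
    """
    parts = [
        "torchrun",
        f"--nnodes {nnodes}",
        f"--nproc_per_node {nproc_per_node}",
        f"--master_addr {master_addr}",
        f"--master_port {master_port}",
        f"--node_rank {node_rank}",
        entry_script,
    ]
    return " ".join(parts)


command = build_torchrun_command("run_clm_no_trainer.py")
print(command)
```

The resulting string could be passed as `command=` to `SourceCode`. Changing `nnodes` and `nproc_per_node` in one place keeps the command in step with the `Compute` configuration.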

## ModelTrainer - Distributed Training - Abstraction

This example shows how a user could perform distributed training using the abstraction provided by the `Torchrun` distributed training configuration class.

In [None]:
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import (
    Compute, SourceCode, InputData
)

compute = Compute(
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600
)

hugging_face_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04"

hyperparameters = {
    "dataset_path": "/opt/ml/input/data/dataset",
    "model_dir": "/opt/ml/model",
    "cache_dir": None,
    "max_train_steps": None,
    "num_train_steps": 1,
    "num_warmup_steps": 0,
    "num_train_epochs": 1,
    "forward_prefetch": False,
    "limit_all_gathers": False,
    "lr_scheduler_type": "linear",
    "weight_decay": 0.0,
    "learning_rate": 5e-5,
    "epochs": 1,
    "max_steps": 100,
    "seed": 42,
    "fsdp": "full_shard auto_wrap",
    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    "gradient_checkpointing": True,
    "gradient_accumulation_steps": 1,
    "optimizer": "adamw_torch",
    "per_device_train_batch_size": 1,
    "model_id": model_id,
}

In [None]:
from sagemaker.modules.distributed import Torchrun

source_code = SourceCode(
    source_dir="distributed-training/scripts",
    requirements="requirements.txt",
    entry_script="run_clm_no_trainer.py",
)

# Run using Torchrun
torchrun = Torchrun()

model_trainer = ModelTrainer(
    training_image=hugging_face_image,
    compute=compute,
    hyperparameters=hyperparameters,
    source_code=source_code,
    distributed=torchrun,
    base_job_name="model-trainer-distributed-abstraction",
)

In [None]:
test_data = InputData(
    channel_name="dataset",
    data_source=training_input_path,
)
model_trainer.train(input_data_config=[test_data], wait=False)
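
When moving from one instance to two, it helps to keep track of how many processes take part in training and what that means for the effective batch size. The arithmetic below is a sketch under two assumptions: each `ml.p4d.24xlarge` instance runs 8 processes (matching `--nproc_per_node 8` in the explicit-command example), and every process consumes its own batch of `per_device_train_batch_size` samples per step. The function names are hypothetical.

```python
def world_size(instance_count, processes_per_instance):
    """Total number of training processes across all instances."""
    return instance_count * processes_per_instance


def global_batch_size(
    instance_count,
    processes_per_instance,
    per_device_train_batch_size,
    gradient_accumulation_steps=1,
):
    """Samples contributing to one optimizer step across all processes."""
    return (
        world_size(instance_count, processes_per_instance)
        * per_device_train_batch_size
        * gradient_accumulation_steps
    )


# Values from the Compute config and hyperparameters above.
# Assumption: 8 processes per instance.
print(world_size(2, 8))               # 16
print(global_batch_size(2, 8, 1, 1))  # 16
```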

---

## ModelTrainer - SageMaker Recipes

This example showcases how a user can use the pre-defined SageMaker training recipe `training/mistral/hf_mistral_7b_seq8k_gpu_p5x16_pretrain` to train a Mistral model on synthetic data.

In [None]:
from sagemaker.modules import Session
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, TensorBoardOutputConfig

sagemaker_session = Session()

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "trainer": {
        "num_nodes": 1,
    },
    "exp_manager": {
        "exp_dir": "/opt/ml/output",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "model": {
        "fp8": False,
        "train_batch_size": 1,
        "num_hidden_layers": 4,
        "shard_degree": 4,
        "data": {
            "use_synthetic_data": True
        }
    }
}

compute = Compute(
    instance_type="ml.p4d.24xlarge",
    keep_alive_period_in_seconds=3600,
)

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://{sagemaker_session.default_bucket()}/output/tensorboard",
    local_path="/opt/ml/output/tensorboard"
)

smp_image = "658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"

model_trainer = ModelTrainer.from_recipe(
    sagemaker_session=sagemaker_session,
    training_image=smp_image,
    training_recipe="training/mistral/hf_mistral_7b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=recipe_overrides,
    compute=compute,
    base_job_name="model-trainer-recipes",
).with_tensorboard_output_config(tensorboard_output_config)

In [None]:
model_trainer.train(wait=False)
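
The `recipe_overrides` dictionary only names the keys to change; everything else comes from the recipe. The sketch below illustrates that idea with a plain recursive merge. It is not the SDK's actual merge logic, which is not shown in this notebook, and the `base_recipe` contents (including the `devices` key) are hypothetical placeholders rather than the real recipe values.

```python
import copy


def deep_merge(base, overrides):
    """Return a new dict with overrides layered on top of base, key by key."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge recursively so sibling keys survive.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, or a key missing from base: the override wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base values, for illustration only.
base_recipe = {
    "trainer": {"num_nodes": 16, "devices": 8},
    "model": {"fp8": True, "data": {"use_synthetic_data": False}},
}

overrides = {
    "trainer": {"num_nodes": 1},
    "model": {"fp8": False, "data": {"use_synthetic_data": True}},
}

merged = deep_merge(base_recipe, overrides)
print(merged["trainer"])  # {'num_nodes': 1, 'devices': 8}
print(merged["model"])    # {'fp8': False, 'data': {'use_synthetic_data': True}}
```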