# Fine-tune GPT-NeoX and Llama-v2 with SageMaker-PyTorch FSDP at large-scale using tensor parallelism, hybrid sharding, and activation offloading
---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

---

In this notebook, you will learn how to fine-tune the Hugging Face Transformers GPT-NeoX and Llama-v2 models with tensor parallelism, hybrid sharding, and activation offloading. You can either launch this notebook from an Amazon SageMaker notebook instance, which handles all credentials automatically, or run it locally and set credentials manually.

This notebook is accompanied by the following files:
- `train.py`: The entry point script that is passed to the SageMaker PyTorch estimator later in this notebook when launching the fine-tuning job.
- `arguments.py`: This has functions for parsing arguments (the hyperparameters).
- `checkpoints.py`: This has functions for saving and loading checkpoints.
- `data_utils.py`: This has functions for handling S3 URLs.
- `data`: This directory has scripts for preparing and loading data.
- `fsdp_utils.py`: This has utility functions for fully sharded data parallelism.
- `learning_rates.py`: This has functions for the learning rate schedule.
- `logging_utils.py`: This has functions to handle logging.
- `memory_tracker.py`: This has functions to track memory usage.
- `requirements.txt`: This lists the dependencies to install, including Hugging Face Transformers.
- `train_lib.py`: This has functions for running end-to-end training of the GPT-NeoX or Llama-v2 model with SMP FSDP and hybrid sharding, including the code to save, load, and fine-tune the model.
- `train_utils.py`: This has utility functions for training.

## Additional Resources
- To learn more about launching a multi-node distributed PyTorch training job, see [Launching a Distributed Training Job](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#launching-a-distributed-training-job).
- To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).
- To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

## Prerequisites
You need to create an S3 bucket to store the input data for training. This bucket must be located in the same AWS Region in which you launch your training job. To learn how to create an S3 bucket, see [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) in the Amazon S3 documentation.

## Launching Environment
### Amazon SageMaker Notebook
You can run the notebook on an Amazon SageMaker notebook instance without manually setting your AWS credentials.
1. Create a new SageMaker notebook instance and open it.
2. Zip the contents of this folder and upload the archive to the instance with the Upload button on the top-right.
3. Open a new terminal with `New -> Terminal`.
4. Within the terminal, enter the correct directory and unzip the file.
    1. `cd SageMaker && unzip <your-zip-name-here>.zip`

### Locally
You can run the notebook locally by launching a Jupyter notebook server with `jupyter notebook`. This requires you to set your AWS credentials in the environment manually. See [Configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) for more details.

## Amazon SageMaker Initialization
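Run the following cells to upgrade the SageMaker Python SDK to the latest version, import SageMaker modules, and retrieve information about your current SageMaker work environment, such as your AWS account ID, the AWS Region, and the ARN of your Amazon SageMaker execution role. In the first cell, replace the `"..."` placeholders with your own FSx for Lustre settings and pretrained model location, which are used later in this notebook.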

**NOTE:** This step might require a kernel restart.

In [None]:
FILE_SYSTEM_ID = "..."
FSX_SECURITY_GROUP_ID = "..."
FSX_SUBNET = "..."
BASE_PATH = "..."
PRETRAINED_MODEL = "..."
PRETRAINED_DIR = "..."

In [None]:
%pip install --upgrade "sagemaker>=2.2"
%pip install sagemaker-experiments

In [None]:
%%time
import os

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role: {role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

sm_boto_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session(boto_session=session)

# get default bucket
default_bucket = sagemaker_session.default_bucket()
print("Default bucket for this session: ", default_bucket)

## Download and prepare GLUE/SST2 data
Here you will download and prepare the GLUE/SST2 dataset, and then copy the files to S3.

### Install the Hugging Face Transformers and Datasets libraries

In [None]:
! pip install -q datasets==2.15.0 transformers pytest

In [None]:
import datasets
from datasets import load_dataset, load_from_disk, load_metric

In [None]:
from sagemaker.pytorch import PyTorch
import transformers
import logging

from transformers import (
    AutoTokenizer,
)

from transformers.testing_utils import CaptureLogger

In [None]:
logger = logging.getLogger(__name__)

### Choose Model
Choose to train either the GPT-NeoX or Llama-v2 model.

In [None]:
model_type = "llama_v2"  # [gpt_neox, llama_v2]
max_context_width = 4096  # For Llama v2 model

### Load data
This section loads the [GLUE/SST2](https://huggingface.co/datasets/glue/viewer/sst2/train) dataset and splits it into training and validation datasets. You can update this section to load any Hugging Face dataset you want.

In [None]:
hyperparameters = {
    "dataset_name": "glue",
    "dataset_config_name": "sst2",
    "do_train": True,
    "do_eval": True,
    "cache_dir": "tmp",
}

In [None]:
raw_datasets = load_dataset(
    hyperparameters["dataset_name"],
    hyperparameters["dataset_config_name"],
)

In [None]:
# Remove existing validation dataset as it is too small
# to shard across all ranks.
del raw_datasets["validation"]
if "validation" not in raw_datasets.keys():
    validation_percentage = "10%"
    raw_datasets["validation"] = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
        split=f"train[:{validation_percentage}]",
        cache_dir=hyperparameters["cache_dir"],
    )

    raw_datasets["train"] = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
        split=f"train[{validation_percentage}:]",
        cache_dir=hyperparameters["cache_dir"],
    )

### Load tokenizer
Nearly every NLP task begins with a tokenizer. A tokenizer converts your text data into a format (tokens) that can be processed by the NLP model.
The following cell loads the tokenizer for the model specified by `PRETRAINED_MODEL` using [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.19.4/en/autoclass_tutorial#autotokenizer).

In [None]:
tokenizer_kwargs = {
    "cache_dir": hyperparameters["cache_dir"],
}

# Pretrained meta-llama/Llama-2-7b-hf requires HuggingFace access, https://huggingface.co/meta-llama/Llama-2-7b-hf
# There also exist pretrained models without special access requirement e.g., https://huggingface.co/NousResearch/Llama-2-7b-chat-hf
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL, **tokenizer_kwargs)

### Preprocess data

The following two cells define functions that run the tokenizer and group the tokenized texts into chunks of `block_size` tokens, and then apply them to the dataset.
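
To make the grouping step concrete, here is a minimal, self-contained sketch of the same idea on made-up token IDs. The helper name `group_into_blocks` and the toy batch are hypothetical, and this simplified variant always drops the trailing remainder, so it is an illustration rather than the notebook's exact function.

```python
from itertools import chain


def group_into_blocks(examples, block_size):
    """Concatenate each column's token lists and split them into block_size chunks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so that every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three short "tokenized sentences" with made-up token IDs (10 tokens in total).
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_into_blocks(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Sentence boundaries are ignored: tokens from different sentences share a chunk, and the last two tokens are dropped because they do not fill a complete block.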

In [None]:
def tokenize_function(examples):
    tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(examples[text_column_name])
        # clm input could be much much longer than block_size
        if "Token indices sequence length is longer than the" in cl.out:
            tok_logger.warning(
                "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model."
            )
    return output


# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(block_size, examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size. This runs even when total_length < block_size,
    # so that `result` is always defined.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
column_names = raw_datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

# Force the logger to load before tokenize_function is pickled, to avoid a _LazyModule error in Hasher.
tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=1,
    remove_columns=column_names,
    desc="Running tokenizer on dataset",
)

import functools

lm_datasets = tokenized_datasets.map(
    functools.partial(group_texts, max_context_width),
    batched=True,
    #     num_proc=args.preprocessing_num_workers,
    desc=f"Grouping texts in chunks of {max_context_width}",
)

The following cells select the training and validation datasets, write them to JSON files, upload the files to the default S3 bucket, and store the S3 locations for later use.

In [None]:
if hyperparameters["do_train"]:
    if "train" not in tokenized_datasets:
        raise ValueError("--do_train requires a train dataset")
    train_dataset = lm_datasets["train"]


if hyperparameters["do_eval"]:
    if "validation" not in tokenized_datasets:
        raise ValueError("--do_eval requires a validation dataset")
    eval_dataset = lm_datasets["validation"]

In [None]:
training_dataset_location = None
validation_dataset_location = None


if hyperparameters["do_train"]:
    train_dataset.to_json("./training.json")
    training_dataset_location = "s3://{}/dataset/train/".format(default_bucket)

if hyperparameters["do_eval"]:
    eval_dataset.to_json("./validation.json")
    validation_dataset_location = "s3://{}/dataset/validation/".format(default_bucket)

In [None]:
if training_dataset_location is not None:
    command = "aws s3 cp ./training.json {}".format(training_dataset_location)
    os.system(command)

if validation_dataset_location is not None:
    command = "aws s3 cp ./validation.json {}".format(validation_dataset_location)
    os.system(command)

In [None]:
if hyperparameters["do_train"]:
    command = "rm ./training.json"
    os.system(command)

if hyperparameters["do_eval"]:
    command = "rm ./validation.json"
    os.system(command)

In [None]:
%store training_dataset_location
%store validation_dataset_location

In [None]:
%store

## Specify Amazon S3 Bucket Paths
Here you specify the S3 paths of the training data to be used by your job. The bucket must be in the same Region in which the training job runs. In the cells above, you downloaded the GLUE/SST2 dataset, split it into training and validation sets, and uploaded the JSON files to an S3 bucket in your account. This example trains on those JSON files.

After you successfully run this example tensor parallel + fully sharded data parallel training job, you can change the S3 paths to point to your own dataset.

In [None]:
%store -r training_dataset_location
%store -r validation_dataset_location

In [None]:
s3_train_bucket = training_dataset_location
s3_test_bucket = validation_dataset_location

The following S3 bucket will store the output artifacts of the training job. You can modify this as needed.

In [None]:
s3_output_bucket = f"s3://sagemaker-{region}-{account}/smp-fsdp/{model_type}-outputdir/"

## Define Data Channels for SageMaker Training Using Amazon S3
In this step, define the SageMaker training data channels that point to the S3 locations.



In [None]:
# Set use_fsx to True if you want to use FSx (see the next cell).
use_fsx = True
if not use_fsx:
    # NOTE: `mock_data` is not defined in this notebook; define it before
    # running this branch with a dataset location that is None.
    if s3_train_bucket is not None:
        train = sagemaker.inputs.TrainingInput(
            s3_train_bucket, distribution="FullyReplicated", s3_data_type="S3Prefix"
        )
        data_channels = {"train": train}
    else:
        data_channels = {"train": mock_data}
    if s3_test_bucket is not None:
        test = sagemaker.inputs.TrainingInput(
            s3_test_bucket, distribution="FullyReplicated", s3_data_type="S3Prefix"
        )
        data_channels["test"] = test
    else:
        data_channels["test"] = mock_data

### (Optional) Set Up and Use Amazon FSx for Data Channels and Checkpoints
While the previous option of using Amazon S3 is easier to set up, using Amazon FSx for Lustre can be more stable and can improve performance when dealing with large inputs and large models. In general, checkpointing should be done using FSx.

Follow the instructions in [Distributed Training of Mask-RCNN in Amazon SageMaker Using FSx](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-fsx.ipynb) to create an FSx for Lustre file system and import the dataset from the S3 bucket into it. Note that the FSx file system must be created in a private subnet with internet gateway to ensure that the training job has access to the internet. For general guidance on setting up an FSx for Lustre file system as a data input channel, see "Configure Data Input Channel to Use Amazon FSx for Lustre" in the Amazon SageMaker documentation.

In [None]:
# Instructions obtained from:
# https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-fsx.ipynb
if use_fsx:
    from sagemaker.inputs import FileSystemInput

    # Specify FSx Lustre file system id.
    file_system_id = FILE_SYSTEM_ID

    # Specify the SG and subnet used by the FSX, these are passed to SM Estimator so jobs use this as well
    fsx_security_group_id = FSX_SECURITY_GROUP_ID
    fsx_subnet = FSX_SUBNET

    # Specify directory path for input data on the file system.
    # You need to provide normalized and absolute path below.
    # Your mount name can be provided by you when creating fsx, or generated automatically.
    # You can find this mount_name on the FSX page in console.
    # Example of fsx generated mount_name: "3x5lhbmv"
    # Example base path: "/3x5lhbmv"
    base_path = BASE_PATH

    # Specify your file system type.
    file_system_type = "FSxLustre"

    train = FileSystemInput(
        file_system_id=file_system_id,
        file_system_type=file_system_type,
        directory_path=base_path,
        file_system_access_mode="rw",
    )

    data_channels = {"train": train, "test": train}

### Set hyperparameters, metric definitions, and MPI options
#### Tensor Parallelism
Tensor parallelism is a type of model parallelism in which specific model weights, gradients, and/or optimizer states are split across devices by replacing specific submodules in the model with their distributed implementations. The tensor parallel degree controls the sharding level and can be set from 1 to `world_size`, though we recommend values from 1 to 8, assuming an 8-GPU node such as ml.p4d.24xlarge. This is because inter-node tensor parallel communication is much slower than intra-node tensor parallel communication.

For more information, see [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html).
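
As a conceptual illustration only (this is not how the SMP library implements it), the following pure-Python sketch splits the columns of a weight matrix across two hypothetical devices and shows that concatenating the per-device outputs reproduces the full matrix product. The helper names and the toy matrices are made up for this example.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split the columns of w into num_shards equal, contiguous shards."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width : (s + 1) * width] for row in w] for s in range(num_shards)]


x = [[1, 2], [3, 4]]  # activations: 2 tokens x 2 features
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # full weight matrix: 2 x 4

tp_degree = 2
shards = split_columns(w, tp_degree)  # each "device" holds a 2 x 2 shard
partial_outputs = [matmul(x, shard) for shard in shards]

# Concatenate the per-device outputs along the feature dimension,
# which is what an all-gather would do across real devices.
combined = [sum((part[i] for part in partial_outputs), []) for i in range(len(x))]

print(combined == matmul(x, w))  # True
```

Each device stores and multiplies only its own shard of the weights, which is where the memory saving comes from; the cost is the communication needed to combine the partial outputs.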

#### Hybrid Sharding
Hybrid sharding is a memory-saving technique that sits between `FULL_SHARD`, which saves the most memory, and `NO_SHARD`, which saves none. It shards parameters within each hybrid shard degree (HSD) group and replicates them across groups. The HSD controls how many GPUs the parameters are sharded across and can be set to an integer from 0 to `world_size` (at most `world_size // tensor_parallel_degree` when combined with tensor parallelism).
- An HSD of 8 applies `FULL_SHARD` within a node and replicates parameters across nodes, because the nodes used here have 8 GPUs each. This reduces communication volume because the expensive all-gathers and reduce-scatters happen only within a node, which can be more performant for medium-sized models.
- An HSD of 0 falls back to the native PyTorch implementation and API in the script. If the strategy is `FULL_SHARD`, parameters are sharded across the whole cluster of GPUs. If the strategy is `HYBRID_SHARD` or `_HYBRID_SHARD_ZERO2`, the default is equivalent to an HSD of 8.

Generally, use the smallest HSD that does not cause out-of-memory (OOM) errors. If you hit OOM, increase the HSD to reduce per-GPU memory usage.

For more information, see [fsdp.ShardingStrategy](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp).
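
The following sketch checks a tensor parallel degree and an HSD against the ranges described in this notebook. The function name `sharding_layout` is hypothetical, and two things here are assumptions rather than documented SMP behavior: that the degrees must divide evenly, and that the number of parameter replicas equals `world_size // (tensor_parallel_degree * hybrid_shard_degree)`.

```python
def sharding_layout(world_size, tensor_parallel_degree, hybrid_shard_degree):
    """Validate the degrees and return the implied layout (assumed relationships)."""
    if not 1 <= tensor_parallel_degree <= world_size:
        raise ValueError("tensor_parallel_degree must be in [1, world_size]")
    if world_size % tensor_parallel_degree != 0:
        # Assumption: the tensor parallel degree divides the world size evenly.
        raise ValueError("tensor_parallel_degree must divide world_size")

    max_hsd = world_size // tensor_parallel_degree
    if not 0 <= hybrid_shard_degree <= max_hsd:
        raise ValueError(f"hybrid_shard_degree must be in [0, {max_hsd}]")

    if hybrid_shard_degree == 0:
        # 0 defers to the native PyTorch FSDP strategy set in the script.
        return {"max_hybrid_shard_degree": max_hsd, "replicas": None}
    if max_hsd % hybrid_shard_degree != 0:
        # Assumption: sharding groups must tile the remaining GPUs evenly.
        raise ValueError("hybrid_shard_degree must divide world_size // tensor_parallel_degree")
    return {
        "max_hybrid_shard_degree": max_hsd,
        "replicas": max_hsd // hybrid_shard_degree,
    }


# One ml.p4d.24xlarge (8 GPUs) with the values used later in this notebook.
print(sharding_layout(world_size=8, tensor_parallel_degree=2, hybrid_shard_degree=4))
# Two such instances: the parameters would be replicated twice under this assumption.
print(sharding_layout(world_size=16, tensor_parallel_degree=2, hybrid_shard_degree=4))
```

Under these assumptions, a single 8-GPU node with a tensor parallel degree of 2 and an HSD of 4 holds exactly one copy of the parameters.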

#### Activation Offloading
Activation offloading is a memory-saving technique that requires activation checkpointing to be enabled. When enabled, it offloads activations to CPU memory to save GPU memory. This is useful when the model is too large to fit in GPU memory or when you want to train with a larger batch size.

##### SageMaker Activation Offloading
SageMaker activation offloading improves performance by pre-fetching activations from the CPU before they are needed, so that the GPU does not wait for the activations to be loaded.

Setting `"sm_activation_offloading": True` enables the SageMaker implementation.

Note: Activation offloading is generally only needed for models with 20B or more parameters, or when you hit OOM with a given batch size. It is used here simply to illustrate how to enable it.

##### Activation Loading Horizon
The activation loading horizon is the maximum number of loaded tensors that can be in the GPU memory simultaneously. This has to be greater than or equal to 1, and defaults to 2.
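
The effect of the horizon can be pictured with a toy simulation. This is a conceptual sketch, not the SMP implementation: the function `simulate_backward` is hypothetical, each layer is assumed to have one offloaded activation, and the backward pass is assumed to need them in reverse layer order.

```python
from collections import deque


def simulate_backward(num_layers, horizon):
    """Replay a backward pass over offloaded activations.

    Returns the order in which activations are loaded and the peak number
    of activations resident on the GPU at the same time.
    """
    if horizon < 1:
        raise ValueError("horizon must be >= 1")
    pending = deque(range(num_layers - 1, -1, -1))  # backward needs the last layer first
    resident = deque()
    load_order, peak = [], 0
    for layer in range(num_layers - 1, -1, -1):
        # Prefetch ahead of use, but never hold more than `horizon` activations.
        while pending and len(resident) < horizon:
            nxt = pending.popleft()
            resident.append(nxt)
            load_order.append(nxt)
        peak = max(peak, len(resident))
        assert resident[0] == layer  # the activation needed now is already loaded
        resident.popleft()  # free it once this layer's gradients are computed
    return load_order, peak


order, peak = simulate_backward(num_layers=4, horizon=2)
print(order, peak)  # [3, 2, 1, 0] 2
```

A larger horizon lets more activations be fetched ahead of time at the cost of more GPU memory, while a horizon of 1 keeps only the activation currently in use.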

In [None]:
tensor_parallel_degree = 2  # An integer in [1, world_size]. Note: we recommend using TP_DEGREE in [1,8] for intra-node communication as inter-node TP communication is slow.
hybrid_shard_degree = (
    4  # An integer in [0, world_size // tensor_parallel_degree] and its default value is 0.
)
offload_activations = True  # Enables SM activation offloading implementation.
activation_loading_horizon = (
    2  # Activation loading horizon, a positive integer and its default value is 2.
)
save_steps = 50  # Save step interval.
max_steps = 50  # Maximum training steps.

hyperparameters = {
    "train_batch_size": 2,
    "val_batch_size": 4,
    "fast_validation": 0,
    "max_steps": max_steps,
    "epochs": 100,
    "seed": 12345,
    "bf16": 1,
    "lr": 0.0001,
    "min_lr": 1e-05,
    "beta1": 0.9,
    "beta2": 0.95,
    "lr_decay_style": "cosine",
    "lr_decay_iters": 47683,
    "warmup": 0.0032,
    "plateau": 0.0,
    "delayed_param": 1,
    "num_kept_checkpoints": 2,
    "checkpoint_freq": save_steps,
    "checkpoint_dir": "/opt/ml/checkpoints",
    "validation_freq": save_steps,
    "logging_freq": 1,
    "weight_decay": 0.2,
    "clean_cache": 0,
    "activation_checkpointing": 1,
    "enable_memory_profiling": 0,
    "forward_prefetch": 1,
    "vocab_size": 50257,
    "limit_all_gathers": 1,
    "backward_fetch_policy": "backward_pre",
    "sharding_strategy": "hybrid_shard",
    "auto_wrap_policy": "transformer_auto_wrap_policy",
    "model_type": model_type,
    "use_smp_flash_attn": 1,
    "distributed_backend": "nccl",
}

if use_fsx:
    # make sure to update paths for training_dir and test_dir based on the paths of datasets in fsx
    # If you want to resume training, set checkpoint_dir to the same path as a previous job.
    SM_TRAIN_DIR = "/opt/ml/input/data/train"
    hyperparameters["checkpoint_dir"] = f"{SM_TRAIN_DIR}/smp-v2/{model_type}/checkpointdir"
    hyperparameters["training_dir"] = f"{SM_TRAIN_DIR}/datasets/c4/en/hf-tokenized/llama/train"
    hyperparameters["test_dir"] = f"{SM_TRAIN_DIR}/datasets/c4/en/hf-tokenized/llama/val"
    hyperparameters["zipped_data"] = 1
    hyperparameters["dataset_type"] = "hf"
else:
    hyperparameters["zipped_data"] = 0
    hyperparameters["dataset_type"] = "gpt_jsonl"

# The checkpoint path (hyperparameters['checkpoint_dir'] or checkpoint_s3_uri) is not unique per job.
# You need to modify as needed for different runs.
# If same path is used for unrelated runs, this may increase time when downloading unnecessary checkpoints,
# and cause conflicts when loading checkpoints.

metric_definitions = [
    {"Name": "base_metric", "Regex": "<><><><><><>"}
]  # Add your custom metric definitions

## Fine-tuning
In this example, we set `"hf_pretrained_model_name_or_dir"` in the hyperparameters, which activates the fine-tuning functionality in the script `train.py`. Its value can be either a Hugging Face model name (e.g., `meta-llama/Llama-2-7b-hf`) or the path to a model stored on FSx (e.g., `/fsx/users/.../hf_pretrained_models/Llama-2-7b-hf`). Note that some Hugging Face models require you to register for access.

In [None]:
if use_fsx:
    hyperparameters["hf_pretrained_model_name_or_dir"] = f"{SM_TRAIN_DIR}{PRETRAINED_DIR}"
else:
    hyperparameters["hf_pretrained_model_name_or_dir"] = PRETRAINED_MODEL

In [None]:
# Select your model size.
model_config = "7b"  # [7b, 65b]

if model_type == "gpt_neox":
    if model_config == "7b":
        model_params = {
            "max_context_width": 1024,
            "hidden_width": 4096,
            "num_layers": 32,
            "num_heads": 32,
        }
    elif model_config == "65b":
        model_params = {
            "max_context_width": 1024,
            "hidden_width": 8192,
            "num_layers": 80,
            "num_heads": 64,
        }
    else:
        raise RuntimeError("Unknown model config")
elif model_type == "llama_v2":
    if model_config == "7b":
        model_params = {
            "max_context_width": 4096,
            "hidden_width": 4096,
            "num_layers": 32,
            "num_heads": 32,
            "llama_intermediate_size": 11008,
        }
    elif model_config == "65b":
        model_params = {
            "max_context_width": 4096,
            "hidden_width": 8192,
            "num_layers": 80,
            "num_heads": 64,
            "llama_intermediate_size": 22016,
        }
    else:
        raise RuntimeError("Unknown model config")

for k, v in model_params.items():
    hyperparameters[k] = v

## Specify Essential Parameters for a SageMaker Training Job
Next, you use the SageMaker `PyTorch` estimator class to define a SageMaker training job, passing values through the following parameters for the number of EC2 instances, the instance type, the size of the volume attached to the instances, and the training job name.

- `instance_count`
- `instance_type`
- `volume_size`
- `base_job_name`

### Update the Type and Number of EC2 Instance to Use
The instance type and the number of instances that you specify for the `instance_type` and `instance_count` parameters, respectively, determine the total number of GPUs (world size).
$$\text{(world size) = (the number of GPUs on a single instance)}\times\text{(the number of instances)}$$
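
As a quick check of this formula, the following sketch computes the world size for the instance type used in this notebook. The mapping `GPUS_PER_INSTANCE` and the helper `world_size` are hypothetical names; only the ml.p4d.24xlarge entry (8 GPUs) comes from this notebook.

```python
# GPUs per instance. Extend this mapping for other instance types you use.
GPUS_PER_INSTANCE = {"ml.p4d.24xlarge": 8}


def world_size(instance_type, instance_count):
    """Total number of GPUs across all instances in the training job."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


print(world_size("ml.p4d.24xlarge", 1))  # 8
print(world_size("ml.p4d.24xlarge", 8))  # 64
```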

In [None]:
instance_type = "ml.p4d.24xlarge"

# You need >= 1 p4d for 7b model.
# You need >= 8 p4d for 65b model.
instance_count = 1

# set to the number of GPUs on that instance
processes_per_host = 8

### Specify a Base Job Name

In [None]:
machine_str = instance_type.split(".")[1] + instance_type.split(".")[2][:3]
base_job_name = f'smp-{model_config}-{machine_str}-hs{hybrid_shard_degree}-ao{offload_activations}-bs{hyperparameters["train_batch_size"]}'

In [None]:
if not use_fsx:
    # If you want to resume training, set checkpoint_s3_uri to the same path as a previous job.
    # Previous checkpoint to load must have same model config.
    checkpoint_bucket = f"s3://sagemaker-{region}-{account}"
    checkpoint_s3_uri = (
        f"{checkpoint_bucket}/experiments/smp_fsdp-{model_type}-checkpoints/{base_job_name}/"
    )

In [None]:
kwargs = {}
if use_fsx:
    # Use the security group and subnet that was used to create the fsx filesystem
    kwargs["security_group_ids"] = [fsx_security_group_id]
    kwargs["subnets"] = [fsx_subnet]

smp_estimator = PyTorch(
    entry_point="train.py",
    hyperparameters=hyperparameters,
    source_dir=os.path.join(os.getcwd(), "../shared-scripts"),
    role=role,
    checkpoint_s3_uri=checkpoint_s3_uri if not use_fsx else None,
    checkpoint_local_path=hyperparameters["checkpoint_dir"] if use_fsx else None,
    instance_type=instance_type,
    volume_size=400,
    instance_count=instance_count,
    sagemaker_session=sagemaker_session,
    distribution={
        "torch_distributed": {"enabled": True},  # Use torchrun.
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "tensor_parallel_degree": tensor_parallel_degree,
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "sm_activation_offloading": offload_activations,
                    "activation_loading_horizon": activation_loading_horizon,
                },
            }
        },
    },
    py_version="py310",
    framework_version="2.0.1",
    # image_uri=$IMAGE,  # Either provide `framework_version` or `image_uri`
    output_path=s3_output_bucket,
    max_run=86400,
    debugger_hook_config=False,
    base_job_name=base_job_name,
    metric_definitions=metric_definitions,
    **kwargs,
)

Finally, run the `estimator.fit` method to launch the SageMaker fine-tuning job with tensor parallelism, hybrid sharding, and activation offloading.

In [None]:
smp_estimator.fit(inputs=data_channels)

## Accessing the Launched SageMaker Training Job
You can access the launched training job from the [SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) console.  
Go to `Amazon SageMaker -> Training -> Training jobs`.  
You can also access the training logs from here with `View Logs` which opens CloudWatch directly.

## Accessing the Training Logs

You can access the training logs from [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html).

You can use CloudWatch to track SageMaker GPU and memory utilization during training and inference. To view the metrics and logs that SageMaker writes to CloudWatch, see [SageMaker Jobs and Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs) in the Amazon SageMaker Developer Guide.

If you are a new user of CloudWatch, see [Getting Started with Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/GettingStarted.html). 

For additional information on monitoring and analyzing Amazon SageMaker training jobs, see [Monitor and Analyze Training Jobs Using Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).
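
SageMaker extracts custom metrics by applying the regular expression in each metric definition to the training log and reading the value from its capture group. The sketch below mimics that locally with Python's `re` module. The metric name, the regex, and the log line are hypothetical; check the actual log format written by `train.py` before defining your own.

```python
import re

# Hypothetical metric definition and log line, for illustration only.
example_metric_definitions = [{"Name": "train:loss", "Regex": r"Loss: ([0-9.]+)"}]
log_line = "Batch 12 Loss: 2.3457, Speed: 4.21 samples/sec"

extracted = {}
for definition in example_metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        # The first capture group holds the metric value.
        extracted[definition["Name"]] = float(match.group(1))

print(extracted)  # {'train:loss': 2.3457}
```

Testing a regex against a sample log line this way is a cheap way to catch patterns that never match before you launch a long training job.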

## Deploying Trained Model for Inference

In most cases, a trained model can be deployed on a single device for inference because inference requires far less memory than training.

After you build and train your models, you can deploy them to get predictions in one of two ways:

* To set up a persistent endpoint to get predictions from your models, use SageMaker hosting services. For an overview on deploying a single model or multiple models with SageMaker hosting services, see [Deploy a Model on SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html#how-it-works-hosting).
* To get predictions for an entire dataset, use SageMaker batch transform. For an overview on deploying a model with SageMaker Batch Transform, see [Get Inferences for an Entire Dataset with Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html).

To learn more about deploying models for inference using SageMaker, see [Deploy Models for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html). 


## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/training|distributed_training|pytorch|model_parallel_v2|llama_v2|smp-finetuning-llama-fsdp-tp.ipynb)
