## Step 1: Building a custom container for training with torchtitan

[SageMaker BYOC (Bring Your Own Container)](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html) allows you to use custom Docker containers to train and deploy machine learning models. Typically, Amazon SageMaker provides built-in algorithms and pre-configured environments for popular machine learning frameworks. However, some workloads have unique or proprietary algorithms, dependencies, or specific requirements that the built-in options do not cover, which necessitates a custom container. That is the case here: we need the nightly versions of torch and the torchao package to train with FP8.

This notebook demonstrates how to build and use a simple custom Docker container for training with Amazon SageMaker that leverages the sagemaker-training-toolkit library to define framework containers.

A reference guide is available in this blog post: https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/

We start by defining the execution role, region, and the default Amazon S3 bucket to be used by Amazon SageMaker.

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
account_id = role.split(":")[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)
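
The cell above reads the account ID from the fifth colon-separated field of the role ARN. As a minimal sketch of why index 4 is the right field, the helper below parses an ARN explicitly. The function name `account_id_from_arn` and the example ARN are hypothetical and used only for illustration.

```python
def account_id_from_arn(arn: str) -> str:
    """Return the account ID field of an ARN.

    ARNs have the form arn:partition:service:region:account-id:resource,
    so the account ID is the fifth colon-separated field (index 4).
    IAM ARNs leave the region field empty, which is why two colons
    appear back to back.
    """
    parts = arn.split(":", 5)  # the resource part may itself contain colons
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[4]


# Hypothetical role ARN, for illustration only
example_arn = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"
print(account_id_from_arn(example_arn))
```

Validating the shape of the ARN first makes a malformed role string fail loudly instead of silently yielding the wrong field.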

## Using the Amazon SageMaker Studio Image Build CLI to build the torchtitan image

We will use the Amazon SageMaker Studio Image Build convenience package that offers a CLI to simplify the process of building custom container images directly from SageMaker Studio notebooks. This tool eliminates the need for manual setup of Docker build environments, streamlining the workflow for data scientists and developers. The CLI automatically manages the underlying AWS services (S3, CodeBuild, and ECR) required for image building, allowing users to focus on their machine learning tasks rather than infrastructure setup. It offers a simple command interface, handles packaging of Dockerfiles and container code, and provides the resulting image URI for use in SageMaker training and hosting. 

We need to install the package with the command below:

In [None]:
!pip install sagemaker-studio-image-build

## Updating the Execution Role with the required IAM permissions and policies to use the Image Build CLI

Next, ensure that the IAM execution role used in this notebook has a trust relationship with CodeBuild, so that the CodeBuild service is allowed to assume the role. You will need to update the role permissions and trust policy from the IAM console.

You also need to make sure that your role includes the permissions required to run the build in CodeBuild, create a repository in Amazon ECR, and push images to that repository. Treat any example policy, such as the one in the blog post referenced above, as a starting point and modify it as necessary to meet your needs and security requirements.
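
As a sketch of what the trust relationship with CodeBuild usually looks like, the snippet below builds a trust policy document as a Python dictionary and prints it as JSON for pasting into the IAM console. It assumes the standard `codebuild.amazonaws.com` service principal. The variable name `codebuild_trust_policy` is illustrative, and this covers only the trust relationship, not the permissions policy.

```python
import json

# Trust policy allowing the CodeBuild service to assume the role.
# Assumption: this statement is added alongside the statements the role
# already has (for example the one trusting SageMaker), not in place of them.
codebuild_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["codebuild.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(codebuild_trust_policy, indent=2))
```

If the role is also the SageMaker execution role, keep its existing SageMaker trust statement, since removing it would stop SageMaker from assuming the role.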

### Creating the Dockerfile

In the following steps, we create the Dockerfile with the required packages.

We first extend a pre-built image so that we can leverage the included deep learning libraries and settings without having to create an image from scratch, using the line below:

    FROM 763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker

We then specify the libraries to install. We will need the nightly versions of the torch, torchdata, and torchao libraries. Update the `REGION` build argument to match your region.


In [None]:
%%writefile Dockerfile
# Set the default value for the REGION build argument
ARG REGION=us-west-2

# SageMaker PyTorch image for TRAINING
FROM 763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker

RUN pip install --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu121

RUN pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly

# Install torchtitan dependencies
# Version specifiers are quoted so the shell does not treat ">" as a redirect
RUN pip install --no-cache-dir \
    "datasets>=2.19.0" \
    "tomli>=1.1.0" \
    tensorboard \
    sentencepiece \
    tiktoken \
    blobfile \
    tabulate

# Install torchao (FP8 package)
# Available index variants: cpu/cu118/cu121/cu124
RUN pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Display installed packages for reference
RUN pip freeze
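
The `pip freeze` output in the build log can be long. As a minimal sketch for confirming that the nightly builds were actually picked up, the script below reports the installed versions of the three packages. It relies on the assumption that PyTorch-family nightly wheels carry a `.devYYYYMMDD` segment in their version string. The helper names `is_nightly` and `report` are hypothetical, and the script is meant to be run inside the built container rather than in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def is_nightly(version_string: str) -> bool:
    """Heuristic: nightly wheels are pre-releases with a '.dev' segment."""
    return ".dev" in version_string


def report(packages=("torch", "torchdata", "torchao")):
    """Map each package name to its installed version and build type."""
    result = {}
    for name in packages:
        try:
            installed = version(name)
        except PackageNotFoundError:
            result[name] = "not installed"
            continue
        kind = "nightly" if is_nightly(installed) else "release"
        result[name] = f"{installed} ({kind})"
    return result


for name, status in report().items():
    print(f"{name}: {status}")
```

Using `importlib.metadata` avoids importing torch itself, so the check stays fast and also works on machines without a GPU.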

## Building the Image and pushing to ECR

We are now ready to use the Image Build CLI to build and push our image to ECR:

In [None]:
!sm-docker build --repository torchtitan:latest .
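
The Dockerfile defaults `REGION` to `us-west-2`. As a sketch of how to avoid editing the file for each region, the helper below assembles an `sm-docker` command that overrides the build argument. It assumes the CLI forwards extra arguments such as `--build-arg` to `docker build`, so check this against the CLI's documentation before relying on it. The function name `sm_docker_build_command` is hypothetical.

```python
import shlex


def sm_docker_build_command(repository: str, region: str, context: str = "."):
    """Assemble an sm-docker invocation that overrides the REGION build arg.

    Assumption: sm-docker passes --build-arg through to docker build.
    """
    return [
        "sm-docker", "build", context,
        "--repository", repository,
        "--build-arg", f"REGION={region}",
    ]


cmd = sm_docker_build_command("torchtitan:latest", "us-east-1")
print(shlex.join(cmd))
```

The printed string can be pasted into a notebook cell after a leading `!`, with the `region` variable defined earlier substituted for the example value.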

Note the image URI on ECR that the build prints, as we will pass it to the estimator in the next step to launch the training jobs.

(Optional) We then remove unused data and artifacts from the Docker system to reclaim space on our notebook instance.
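
For reference, ECR image URIs follow a fixed pattern, so the expected URI can also be reconstructed from values defined earlier in the notebook. The sketch below assumes a standard commercial AWS region, since other partitions use a different domain suffix. The function name `ecr_image_uri` and the sample account ID are illustrative.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build an ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Sample values; in the notebook, pass the account_id and region defined earlier
print(ecr_image_uri("123456789012", "us-west-2", "torchtitan"))
```

Compare the result against the URI printed by the build and prefer the printed value if the two differ.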

In [None]:
!yes | docker system prune -a #free up space