# Lab: Bring your own custom container with Amazon SageMaker

<div class="alert alert-block alert-info">
⚠️ In order to run this notebook, please ensure Docker is enabled on your SageMaker Studio Domain. If running this notebook at an AWS facilitated event, <b>you can skip this part</b>. If you have provisioned your own SageMaker Studio Domain, please  <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local.html#studio-updated-local-enable">read here</a> about how to enable Docker via the AWS CLI on an existing SageMaker Studio Domain. After running this command, <b>you must restart your JupyterApp for the changes to take effect</b>.
</div>

## Overview

### Background
Here, we'll show how to bring your own Docker container that packages your environment and code. We showcase the [decision tree](http://scikit-learn.org/stable/modules/tree.html) algorithm from the widely used [scikit-learn](http://scikit-learn.org/stable/) machine learning package. The example is purposefully fairly trivial, since the point is to show the surrounding structure that you'll want to add to your own container so you can bring it to Amazon SageMaker for training and hosting.


### High-level overview

The following diagram shows how you typically train and deploy a model with Amazon SageMaker:

<div>
<img src="https://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-architecture.png" width="900"/>
</div>

The area labeled SageMaker highlights the two components of SageMaker: model training and model deployment. The area labeled EC2 container registry refers to [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/), where we store, manage, and deploy our Docker container images. The training data and model artifacts are stored in an S3 bucket.

In this lab, we use a single image to support both model training and hosting for simplicity. Sometimes you’ll want separate images for training and hosting because they have different requirements. 

The high-level steps include:
1. **Building the container** - We walk through the components of the container and inspect the Dockerfile. Then we build the container image and push it to ECR.
2. **Setup & Upload Data** - Once our container is built and registered, we set up the SageMaker session and upload the data to S3.
3. **Model Training** - Create a training job using the SageMaker Python SDK. It pulls data from S3 and uses the container we built.
4. **Model Deployment** - Once training is complete, deploy our model to an HTTP endpoint using the SageMaker Python SDK.
5. **Run Inferences** - Run predictions to test our model.
6. **Cleanup**



## Building the container
[Docker](https://aws.amazon.com/docker/#:~:text=Docker%20is%20a%20software%20platform,test%2C%20and%20deploy%20applications%20quickly.&text=Running%20Docker%20on%20AWS%20provides,distributed%20applications%20at%20any%20scale.) packages software into standardized units called [containers](https://aws.amazon.com/containers/) that have everything the software needs to run including libraries, system tools, code, and runtime. Using Docker, you can quickly deploy and scale applications into any environment and know your code will run.


Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms. For more details, see [how to use Docker containers with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html).

### Walkthrough of the container directory
You can find the source code of the sample container we are using in [this GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own). 

The container directory contains all the components you need to package for SageMaker:

```
.
|-- Dockerfile
|-- build_and_push.sh
|-- local_test
`-- decision_trees
    |-- nginx.conf
    |-- predictor.py
    |-- serve
    |-- train
    `-- wsgi.py
```

Let’s discuss each of these in turn:

- `Dockerfile` describes how to build your Docker container image. More details below.
- `build_and_push.sh` is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We’ll invoke the commands directly later in this notebook, but you can just copy and run the script for your own algorithms.
- `local_test` is a directory that shows how to test your new container on any computer that can run Docker, including an Amazon SageMaker notebook instance. Using this method, you can quickly iterate using small datasets to eliminate any structural bugs before you use the container with Amazon SageMaker. Testing is not the focus of this lab, but feel free to check out the example in your own time.
- `decision_trees` is the directory which contains the files that will be installed in the container.

In this simple application, we only install five files in the container. These five show the standard structure of our Python containers, although you are free to choose a different toolset or programming language and therefore could have a different layout.

The files that we’ll put in the container are:

- `nginx.conf` is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.
- `predictor.py` is the program that actually implements the Flask web server and the decision tree predictions for this app. You’ll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.
- `serve` is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in predictor.py. You should be able to take this file as-is.
- `train` is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.
- `wsgi.py` is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.

In summary, the two files you will probably want to change for your application are `train` and `predictor.py`.
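To make the role of `train` more concrete, here is a minimal sketch of the filesystem contract that such a program follows. It assumes the standard SageMaker training layout under `/opt/ml` (input channels under `input/data/<channel>`, hyperparameters in `input/config/hyperparameters.json`, model output in `model`). The `run_training` function, its `prefix` parameter, and the majority-class "model" are hypothetical stand-ins for illustration, not the sample's actual scikit-learn code.

```python
import csv
import json
import pickle
from collections import Counter
from pathlib import Path


def run_training(prefix="/opt/ml", channel="training"):
    """Read every CSV in the channel, fit a toy model, and save it.

    Hypothetical stand-in: the "model" is just the most common label
    (first column), where the real sample fits a decision tree.
    """
    prefix = Path(prefix)
    input_dir = prefix / "input" / "data" / channel
    model_dir = prefix / "model"

    # Hyperparameters, when present, arrive as a JSON file of string values.
    hp_path = prefix / "input" / "config" / "hyperparameters.json"
    hyperparameters = json.loads(hp_path.read_text()) if hp_path.exists() else {}

    # Collect the label (first column) of every row in every CSV file.
    labels = []
    for csv_file in sorted(input_dir.glob("*.csv")):
        with csv_file.open(newline="") as handle:
            labels.extend(row[0] for row in csv.reader(handle) if row)
    if not labels:
        raise ValueError(f"No training data found in {input_dir}")

    model = {
        "majority_label": Counter(labels).most_common(1)[0][0],
        "hyperparameters": hyperparameters,
    }

    # Files written to the model directory are packaged as the model artifact.
    model_dir.mkdir(parents=True, exist_ok=True)
    model_path = model_dir / "model.pkl"
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model_path
```

Passing a temporary directory as `prefix` lets you exercise the same logic outside a container, which is the idea behind the `local_test` directory.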

### Install packages
Please choose the `Python 3 (ipykernel)` kernel to proceed.

We will first install the prerequisite packages.

In [None]:
# cell 00

!pip install --root-user-action=ignore --upgrade pip
!pip install --root-user-action=ignore -q pandas==2.1.4
!pip install --root-user-action=ignore -q awswrangler==3.5.1 --no-cache

### Install Docker

To use Docker, you must manually install it from the terminal of your JupyterLab application. Before you proceed, get familiar with the [Docker operations that are currently supported in Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local.html).


In [None]:
%%bash

# see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

## Currently only Docker version 20.10.X is supported in Studio: see https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local.html
# pick the latest patch from:
# apt-cache madison docker-ce | awk '{ print $3 }' | grep -i 20.10
VERSION_STRING=5:20.10.24~3-0~ubuntu-jammy
sudo apt-get install docker-ce-cli=$VERSION_STRING docker-compose-plugin -y

# validate the Docker Client is able to access Docker Server at [unix:///docker/proxy.sock]
docker version

We will then unzip and copy over the files we need:
- `scikit_bring_your_own/container` → `lab03_container`
- `scikit_bring_your_own/data` → `lab03_data` 

In [None]:
# cell 01

!unzip -q scikit_bring_your_own.zip
!mv scikit_bring_your_own/data/ ./lab03_data/
!mv scikit_bring_your_own/container/ ./lab03_container/
!rm -rf scikit_bring_your_own

### The Dockerfile
The `Dockerfile` describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations.

For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things needed by `scikit-learn`. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.

Let's take a look at what's inside our `Dockerfile`:

In [None]:
!pygmentize lab03_container/Dockerfile

### Building and registering the container

In [None]:
%%sh
# Login to ECR
aws --region ${AWS_DEFAULT_REGION} ecr get-login-password | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/sagemaker-decision-trees

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "sagemaker-decision-trees" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "sagemaker-decision-trees" > /dev/null
fi

cd lab03_container

chmod +x decision_trees/train
chmod +x decision_trees/serve

# Build the image - it might take a few minutes to complete this step
docker build --network sagemaker . -t ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/sagemaker-decision-trees:latest
# Push the image to ECR
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/sagemaker-decision-trees:latest
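The shell cell above reads the account ID and Region from the `AWS_ACCOUNT_ID` and `AWS_DEFAULT_REGION` environment variables and tags the image with its full ECR name. As a small sketch of that naming convention, the hypothetical helper below assembles the same string in Python. It assumes a standard commercial AWS Region (other partitions use a different domain suffix), and the account ID in the usage line is a placeholder.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI of the form <registry>/<repository>:<tag>."""
    # The registry host name is derived from the account ID and the Region.
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-east-1", "sagemaker-decision-trees"))
```

The same string is needed again when the estimator is created, so keeping the repository name and tag consistent between the build step and the training step matters.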

## Setup & Upload Data

### Setup the Environment 
Here we specify the S3 key prefix to use and the IAM role that will be used for working with SageMaker.



In [None]:
# cell 03

S3_prefix = "DEMO-scikit-byo-iris"

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

The session remembers our connection parameters to SageMaker. We’ll use it to perform all of our SageMaker operations.

In [None]:
# cell 04

import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

### Upload data to S3 Bucket

When training large models with huge amounts of data, you’ll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we’re using the [classic Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) in the `lab03_data` directory.

We can use the tools provided by the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) to upload the data to a default bucket.

In [None]:
# cell 05

WORK_DIRECTORY = "lab03_data"

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=S3_prefix)
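When a directory is uploaded this way, the returned `data_location` is the S3 prefix, and each file is expected to land under that prefix with its relative path preserved. The sketch below only illustrates that naming; it does not call S3, and the `expected_s3_uris` helper and the bucket name in the usage comment are hypothetical.

```python
from pathlib import Path, PurePosixPath


def expected_s3_uris(local_dir, bucket, key_prefix):
    """List the S3 URIs that files under local_dir would be expected to get."""
    local_dir = Path(local_dir)
    uris = []
    for path in sorted(p for p in local_dir.rglob("*") if p.is_file()):
        # S3 keys always use forward slashes, whatever the local OS uses.
        relative_key = PurePosixPath(*path.relative_to(local_dir).parts)
        uris.append(f"s3://{bucket}/{key_prefix}/{relative_key}")
    return uris


# Example call with a placeholder bucket name:
# expected_s3_uris("lab03_data", "example-bucket", "DEMO-scikit-byo-iris")
```

This is why a single file can later be addressed by appending its name to `data_location`.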

## Model Training

In order to use SageMaker to fit our algorithm, we create an [`estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

- `image_uri (str)` - The [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) path where the Docker image is registered. This is constructed in *cell 06*.
- `role (str)` - SageMaker IAM role as obtained above in *cell 03*.
- `instance_count (int)` - number of machines to use for training.
- `instance_type (str)` - the type of machine to use for training.
- `output_path (str)` - where the model artifact will be written.
- `sagemaker_session (sagemaker.session.Session)` - the SageMaker session object that we defined in *cell 04*.



Then we use the `estimator.fit()` method to train against the data that we uploaded.
The method calls the Amazon SageMaker `CreateTrainingJob` API to start model training. It builds the `CreateTrainingJob` request from the configuration you provided when creating the `estimator` and the specified input training data.

In [None]:
# cell 06

account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image_uri = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-decision-trees:latest".format(account, region)

tree = sage.estimator.Estimator(
    image_uri,
    role,
    instance_count=1,
    instance_type="ml.c4.2xlarge",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
)

file_location = data_location + "/iris.csv"
tree.fit(file_location)

## Model Deployment
You can use a trained model to get real-time predictions through an HTTP endpoint. The following steps walk you through the process.

After the model training successfully completes, you can call the [`estimator.deploy()` method](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator.deploy). The `deploy()` method creates a deployable model, configures the SageMaker hosting services endpoint, and launches the endpoint to host the model. 

The method uses the following configurations:
- `initial_instance_count (int)` – The number of instances to deploy the model.
- `instance_type (str)` – The type of instances that you want to operate your deployed model.
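- `serializer (sagemaker.serializers.BaseSerializer)` – Serializes the input data before it is sent to the endpoint. In this example, `CSVSerializer` converts input of various formats (a NumPy array, list, file, or buffer) to a CSV-formatted string.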


In [None]:
# cell 07

from sagemaker.serializers import CSVSerializer

predictor = tree.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=CSVSerializer()
)
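The endpoint talks to the container over HTTP: SageMaker hosting expects the container to answer `GET /ping` health checks and `POST /invocations` inference requests on port 8080. The sketch below mimics that contract using only the Python standard library, so it is a stand-in for the Flask app in `predictor.py`, not a copy of it. The `InferenceHandler`, `start_server`, and `call` names are hypothetical, and the "prediction" is a placeholder string.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the Flask app in predictor.py."""

    def do_GET(self):
        # Health check: return 200 when the container is ready to serve.
        if self.path == "/ping":
            self._reply(200, b"\n")
        else:
            self._reply(404, b"not found\n")

    def do_POST(self):
        # Inference: one CSV row per line in, one prediction per line out.
        if self.path != "/invocations":
            self._reply(404, b"not found\n")
            return
        length = int(self.headers.get("Content-Length", 0))
        rows = self.rfile.read(length).decode("utf-8").strip().splitlines()
        # Placeholder: a real handler would call the trained model here.
        predictions = "\n".join("placeholder" for _ in rows) + "\n"
        self._reply(200, predictions.encode("utf-8"), content_type="text/csv")

    def _reply(self, status, body, content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(port, path, data=None):
    """Send a GET (no data) or POST (with data) to the local server."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(f"http://127.0.0.1:{port}{path}", data=data)
    with opener.open(request, timeout=5) as response:
        return response.status, response.read().decode("utf-8")
```

To try it locally, start the server with `server = start_server()`, read the chosen port from `server.server_address[1]`, and send requests with `call`. Inside a real container the server would listen on port 8080 instead of a random free port.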

## Run Inferences


### Preparing test data
In order to do some predictions, we’ll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but an easy way to see how the mechanism works.

In [None]:
print(file_location)

In [None]:
# cell 08
import awswrangler as wr

shape = wr.s3.read_csv(file_location, header=None)

# shape=pd.read_csv(file_location, header=None)
shape.sample(3)

In [None]:
# cell 09

# drop the label column in the training set
shape.drop(shape.columns[[0]], axis=1, inplace=True)
shape.sample(3)

In [None]:
# cell 10

import itertools

a = [50 * i for i in range(3)]
b = [40 + i for i in range(10)]
indices = [i + j for i, j in itertools.product(a, b)]

test_data = shape.iloc[indices[:-1]]
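The index arithmetic in the cell above is easier to follow with the intermediate values written out. The sketch below repeats the same computation without the DataFrame and prints what is selected. The comments assume the file keeps the usual Iris ordering of 50 consecutive rows per class, which is why the offsets are multiples of 50.

```python
import itertools

a = [50 * i for i in range(3)]   # block offsets: 0, 50, 100
b = [40 + i for i in range(10)]  # positions 40..49 within each block
indices = [i + j for i, j in itertools.product(a, b)]
selected = indices[:-1]          # drop the final index (149)

# First three indices, last three indices, and the number of rows selected.
print(selected[:3], selected[-3:], len(selected))
# [40, 41, 42] [146, 147, 148] 29
```

In other words, `test_data` holds the last ten rows of each 50-row block, minus the very last row, for 29 rows in total.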

### Predictions

Prediction is as easy as calling `predict` with the `predictor` we got back from `deploy` and the data we want to do predictions with. The serializers take care of doing the data conversions for us.

In [None]:
# cell 11

print(predictor.predict(test_data.values).decode("utf-8"))
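To give a feel for what travels over the wire, the sketch below approximates the round trip with the standard library: rows become a CSV string on the way in, and the raw bytes returned by the endpoint are decoded into one prediction per line on the way out. The `rows_to_csv` and `parse_predictions` helpers are hypothetical illustrations, not the SDK's `CSVSerializer` implementation, and the sample response bytes are made up.

```python
import csv
import io


def rows_to_csv(rows):
    """Render a list of rows as CSV text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().rstrip("\n")


def parse_predictions(payload):
    """Decode a newline-separated response body into a list of labels."""
    return [line for line in payload.decode("utf-8").splitlines() if line]


request_body = rows_to_csv([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]])
print(request_body)

# Made-up response bytes, shaped like one label per input row.
print(parse_predictions(b"setosa\nsetosa\n"))
```

Parsing the response into a list makes it straightforward to line the predictions up against the rows of `test_data`.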

## Cleanup
After completing the lab, follow these steps to [delete the endpoint through the AWS Console](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html), or simply run the following code:


In [None]:
# cell 12
sess.delete_endpoint(predictor.endpoint_name)

Remove the container artifacts and data we downloaded.

In [None]:
# cell 13
!rm -rf lab03_container lab03_data