# Bring your own code to AWS

This notebook is designed to help you streamline your data science projects on AWS. It shows how to train a scikit-learn model in Amazon SageMaker by reusing code or containers built outside Amazon SageMaker (on a local installation, for example). This approach enables data science teams to reuse existing code, speed up their development and deployment process and benefit from all the advantages of the cloud.

In what follows, we'll address the well-known problem of classifying the Iris dataset by taking advantage of several levels of artifact reuse: 
- **Bring your own script:** In this scenario, your data science team has already built a python script using scikit-learn to train a model but wants to take advantage of the elasticity of the cloud and the SageMaker ecosystem to manage the training. 

- **Bring your own Container:** In this scenario, your team has built its own training code but also its own docker container that has, for example, specific libraries installed to run the model, and wants the model to be trained on AWS. 

- **Remote jobs:** In this scenario, your team has built its own script with specific dependencies and wants to run it on AWS, but wants to continue developing on its local development set-up while leveraging the SageMaker ecosystem. 
  

Whether you're new to SageMaker or a seasoned practitioner, this notebook is a valuable resource for understanding the power and simplicity of reusing training scripts with Amazon SageMaker.
Let's embark together on this journey into data science.

## Download data 
Before diving into SageMaker's functionality, we need access to the data that will be used in this session. The following procedure downloads the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris), and displays its first lines using Python's renowned data handling library, Pandas.

In [None]:
import boto3
import pandas as pd
import numpy as np

s3 = boto3.client("s3")
s3.download_file(f"sagemaker-sample-files", "datasets/tabular/iris/iris.data", "iris.data")

df = pd.read_csv(
"iris.data", header=None, names=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
)
df.head()

Unnamed: 0,sepal_len,sepal_wid,petal_len,petal_wid,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Data Preparation
Here, we show how to pre-process the Iris dataset to prepare it for model training. This involves data cleaning, feature scaling and splitting the data into two sets, one for training and the other for testing.

Let's start by encoding the target column (the Iris category) of the model we're going to train as an integer to meet the requirements of the scikit-learn classifier.

In [37]:
# Convert the three classes from strings to integers in {0,1,2}
df["class_cat"] = df["class"].astype("category").cat.codes
categories_map = dict(enumerate(df["class"].astype("category").cat.categories))
print(categories_map)
df.head()

{0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}


Unnamed: 0,sepal_len,sepal_wid,petal_len,petal_wid,class,class_cat
0,5.1,3.5,1.4,0.2,Iris-setosa,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0


We then split the data into a training dataset (80% of the data) and a test dataset (the remaining 20%) before saving them in CSV files.

In [38]:
# Split the data into 80-20 train-test split
num_samples = df.shape[0]
split = round(num_samples * 0.8)
train = df.iloc[:split, :]
test = df.iloc[split:, :]
print("{} train, {} test".format(split, num_samples - split))
# Write train and test CSV files
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)

120 train, 30 test


These files are then uploaded to Amazon S3, where the SageMaker python SDK can access them and use them to train the model.

In [None]:
# Create a sagemaker session to upload data to S3
import sagemaker
sagemaker_session = sagemaker.Session()
# Upload data to default S3 bucket
prefix = "DEMO-sklearn-iris"
training_input_path = sagemaker_session.upload_data("train.csv", key_prefix=prefix + "/training")

## Bring your own script
Now, we have made the data available in S3 and in the correct format to be used by scikit-learn classifier, let's jump into the first training solution we want to explore when talking about reusing existing training script. 

The data scientists might want to leverage the use of containers provided by Amazon SageMaker as they hold the packages needed to run the training script.
In this case, they can use SageMaker Python SDK to define the estimator relative to the container to use.
An estimator is a SageMaker Python SDK object for managing the configuration and execution of your SageMaker Training job which allows to run training workloads on ephemeral compute instances and obtain a zipped trained model.

In this part, you'll use the existing training script as the entry point of a SageMaker Estimator that will execute the training job on an ephemeral instance, SageMaker dealing with all the dependencies in this instance by using pre-defined containers. 

First, let's get the execution role for training. This role allows us to access the S3 bucket in the last step, where the train and test datasets is located.

In [39]:
# Use the current execution role for training. It needs access to S3
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::933067310703:role/service-role/AmazonSageMaker-ExecutionRole-20220327T140435


Then, it is time to define the SageMaker Estimator. We use an Estimator class specifically designed to train scikit-learn models called `SKLearn`. In this estimator, we define the following parameters:
1. The script that we want to use to train the model (i.e. `entry_point`). This is the heart of the Script Mode method. 
2. The role which allows us access to the S3 bucket containing the train and test data set (i.e. `role`)
3. How many instances we want to use in training (i.e. `instance_count`) and what type of instance we want to use in training (i.e. `instance_type`)
4. Which version of scikit-learn to use (i.e. `framework_version`)
5. Training hyperparameters (i.e. `hyperparameters`)

After setting these parameters, the `fit` function is invoked to train the model.

In [51]:
from sagemaker.sklearn import SKLearn

sk_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    py_version="py3",
    framework_version="1.0-1",
    hyperparameters={"estimators": 20},
)

In [52]:
# Train the estimator
sk_estimator.fit({"train": training_input_path})

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2023-07-26-15-33-50-156


Using provided s3_resource
2023-07-26 15:33:50 Starting - Starting the training job...
2023-07-26 15:34:05 Starting - Preparing the instances for training...
2023-07-26 15:34:50 Downloading - Downloading input data......
2023-07-26 15:35:41 Training - Training image download completed. Training in progress.
2023-07-26 15:35:41 Uploading - Uploading generated training model[34m2023-07-26 15:35:33,717 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2023-07-26 15:35:33,720 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-07-26 15:35:33,727 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2023-07-26 15:35:33,903 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-07-26 15:35:33,913 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-07-26 15:35:33,925 sagemaker-training-

And that's it, you have your first Random Forest model trained. Now it's on the SageMaker ecosystem, you can easily evaluate, deploy and expose it to end-user, register it to a Model Registry, and many other capabilities that relates to machine learning models lifecycle.
This solution provides simplicity, you just need to provide data and your existing training script and SageMaker takes care of the infrastructure part. 
Next, we'll see what SageMaker has to offer if we need more control on the underlying infrastructure training the model. 

## Bring your own container

Data scientists can bring their own specific dockerfile to run the training script.
There are two ways of doing this, and we'll look first at the one that asks data scientists to build and push their docker image before training the model on SageMaker. 
This approach may correspond to scenarios where they already have their own container and process for building their image and want it pushed to AWS for use in a SageMaker training job.

To do this, data scientists need to take advantage of Amazon ECR by pushing the docker image to a private ECR repository.

### Container Build and push
In our example, we prepare a dockerfile at the same level of hierarchy as the train.py script in our Python Project.
We then build the image locally using this dockerfile and then push it to ECR following this script. 

The approach is different following if you're working on SageMaker Studio or not. 

#### If you run this Notebook on SageMaker studio : 
Please install the sagemaker-studio-image-build librairy by running the following cell. It will help to build and push your docker image to ECR. 

In [53]:
## In sagemaker studio
!pip install sagemaker-studio-image-build

Collecting PyYAML==6.0
  Using cached PyYAML-6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (682 kB)
Installing collected packages: PyYAML
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 5.4.1
    Uninstalling PyYAML-5.4.1:
      Successfully uninstalled PyYAML-5.4.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
serverlessrepo 0.1.10 requires pyyaml~=5.1, but you have pyyaml 6.0 which is incompatible.
awscli 1.27.153 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.
aws-sam-cli 1.89.0 requires PyYAML==5.*,>=5.4.1, but you have pyyaml 6.0 which is incompatible.[0m[31m
[0mSuccessfully installed PyYAML-6.0
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m

In [54]:
!sh build_and_push_studio.sh

build_and_push_studio.sh: 1: %%sh: not found
build_and_push_studio.sh: 23: docker: not found
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
................[Container] 2023/07/26 16:04:25 Waiting for agent ping

[Container] 2023/07/26 16:04:26 Waiting for DOWNLOAD_SOURCE
[Container] 2023/07/26 16:04:31 Phase is DOWNLOAD_SOURCE
[Container] 2023/07/26 16:04:31 CODEBUILD_SRC_DIR=/codebuild/output/src234481480/src
[Container] 2023/07/26 16:04:31 YAML location is /codebuild/output/src234481480/src/buildspec.yml
[Container] 2023/07/26 16:04:31 Setting HTTP client timeout to higher timeout for S3 source
[Container] 2023/07/26 16:04:31 Processing environment variables
[Container] 2023/07/26 16:04:31 No runtime version selected in buildspec.
[Container] 2023/07/26 16:04:31 Moving to directory /codebuild/output/src234481480/src
[Container] 2023/07/26 16:04:31 Configuring ssm agent with target id: codebuild:c03cff44-99f6

#### If you run this Notebook outside of SageMaker studio : 


In [None]:
# Outside of sagemaker studio
!sh build_and_push.sh

### Model training : 
Finally, we setup an Amazon SageMaker estimator with as parameter the ECR image URI and launch the model training:

In [55]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'scikit-learn-custom'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

In [56]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

estimator = Estimator(image_uri=byoc_image_uri,
                      role=get_execution_role(),
                      base_job_name='scikit-custom-container-test-job',
                      instance_count=1,
                      instance_type='ml.c5.xlarge')

# Train the estimator
sk_estimator.fit({"train": training_input_path})

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2023-07-26-16-06-37-079


Using provided s3_resource
2023-07-26 16:06:37 Starting - Starting the training job...
2023-07-26 16:06:52 Starting - Preparing the instances for training......
2023-07-26 16:07:40 Downloading - Downloading input data...
2023-07-26 16:08:31 Training - Training image download completed. Training in progress..[34m2023-07-26 16:08:32,064 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2023-07-26 16:08:32,067 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-07-26 16:08:32,074 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2023-07-26 16:08:32,251 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-07-26 16:08:32,261 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2023-07-26 16:08:32,273 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0

And now we have our model trained with our own container provided and ready to run in the SageMaker ecosystem. 
In particular, this allows us to use specific dependencies that could not be provided by SageMaker's pre-built containers.

We'll see in the next section that there is a more direct way of addressing the need for specific dependencies in the training job run in SageMaker.

## Amazon SageMaker remote job

The data scientist must provide information on execution environment (Python packages to install or conda environment configuration or ECR image to use), compute instance configuration and if needed Networking and Permission configurations. 

Compared to the previous way presented to bring your own container, the remote decorator approach saves you the overhead of building and pushing the docker image to ECR by doing it in the background.

In the following cells, we defined the training script and add the @remote decorator with instance type and location of the dependency file that we want the container provided by SageMaker built with. 

In [58]:
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import joblib
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./environment.yml')
def perform_training(training_dir, estimators):
    # Read in data
    df = pd.read_csv(training_dir + "/train.csv", sep=",")

    # Preprocess data
    X = df.drop(["class", "class_cat"], axis=1)
    y = df["class_cat"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    # Build model
    regressor = RandomForestRegressor(n_estimators=estimators)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    # Save model
    joblib.dump(regressor, "model.joblib")

In [59]:
input_data_path = "s3://sagemaker-eu-west-1-933067310703/DEMO-sklearn-iris/training"
perform_training(input_data_path, 20)

2023-07-26 16:09:37,304 sagemaker.remote_function INFO     Copied dependencies file at './environment.yml' to '/tmp/tmp53k79kut/temp_workspace/sagemaker_remote_function_workspace/environment.yml'
2023-07-26 16:09:37,306 sagemaker.remote_function INFO     Successfully created workdir archive at '/tmp/tmp53k79kut/workspace.zip'
2023-07-26 16:09:37,351 sagemaker.remote_function INFO     Successfully uploaded workdir to 's3://sagemaker-eu-west-1-933067310703/perform-training-2023-07-26-16-09-37-157/sm_rf_user_ws/workspace.zip'
2023-07-26 16:09:37,352 sagemaker.remote_function INFO     Serializing function code to s3://sagemaker-eu-west-1-933067310703/perform-training-2023-07-26-16-09-37-157/function
2023-07-26 16:09:37,444 sagemaker.remote_function INFO     Serializing function arguments to s3://sagemaker-eu-west-1-933067310703/perform-training-2023-07-26-16-09-37-157/arguments
2023-07-26 16:09:37,505 sagemaker.remote_function INFO     Creating job: perform-training-2023-07-26-16-09-37-157

2023-07-26 16:09:37 Starting - Starting the training job...
2023-07-26 16:09:52 Starting - Preparing the instances for training......
2023-07-26 16:11:06 Downloading - Downloading input data
2023-07-26 16:11:06 Training - Training image download completed. Training in progress...[34mINFO: CONDA_PKGS_DIRS is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/conda/pkgs'[0m
[34mINFO: PIP_CACHE_DIR is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/pip'[0m
[34mINFO: Bootstraping runtime environment.[0m
[34m2023-07-26 16:11:15,567 sagemaker.remote_function INFO     Successfully unpacked workspace archive at '/'.[0m
[34m2023-07-26 16:11:15,567 sagemaker.remote_function INFO     '/sagemaker_remote_function_workspace/pre_exec.sh' does not exist. Assuming no pre-execution commands to run[0m
[34m/opt/conda/bin/mamba[0m
[34m2023-07-26 16:11:15,568 sagemaker.remote_function INFO     Creating conda environment sagemaker-

And we got the last model trained, possibly directly from your local development environment, specifying only the dependencies on which your model training is based. The model has benefited from dedicated compute instances and is now ready for use in the SageMaker ecosystem.