# Training a CatBoost regression model with data from DVC

With Amazon SageMaker, you can package your own algorithms so that they can then be trained and deployed in the SageMaker environment. This notebook guides you through an example that shows how to build a Docker container for SageMaker and use it for training and inference.

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. 

### California Housing dataset
We use the California Housing dataset, available through [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. "Sparse spatial auto-regressions." Statistics & Probability Letters 33.3 (1997): 291-297.


### DVC
DVC is built to make ML models shareable and reproducible. It is designed to handle large files, data sets, machine learning models, and metrics as well as code.

[DVC Official Site](https://dvc.org/)

# Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. 

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker runs your Docker container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:

* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. 
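
The second option above, a single `ENTRYPOINT` program that inspects its first argument, could look like the following minimal sketch. The functions `run_training` and `run_serving` are hypothetical placeholders and are not part of the sample container, which uses separate `train` and `serve` scripts instead.

```python
def run_training():
    # Hypothetical placeholder for the training logic
    return "training"


def run_serving():
    # Hypothetical placeholder for starting the inference server
    return "serving"


def main(argv):
    """Dispatch on the first argument that SageMaker passes to the container."""
    modes = {"train": run_training, "serve": run_serving}
    if len(argv) < 2 or argv[1] not in modes:
        raise SystemExit("usage: entrypoint [train|serve]")
    return modes[argv[1]]()

# A real entrypoint script would finish by calling main(sys.argv).
```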

#### Running your container during training

When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since this example trains on a single instance, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
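
Because every hyperparameter value arrives as a string, a training script usually converts them before use. The sketch below shows one way to do that; the hyperparameter names `iterations` and `learning_rate` and their defaults are illustrative assumptions, not values taken from the sample's `train` script.

```python
import json
from pathlib import Path


def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Read the hyperparameters file; every value arrives as a string."""
    return json.loads(Path(path).read_text())


def parse_hyperparameters(raw):
    """Convert string values to typed values (names and defaults are illustrative)."""
    return {
        "iterations": int(raw.get("iterations", "100")),
        "learning_rate": float(raw.get("learning_rate", "0.1")),
    }
```

Inside the container you would call `parse_hyperparameters(load_hyperparameters())`; the `path` argument exists only so the functions can be tried outside SageMaker.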

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
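
One way to produce the `failure` file is to wrap the training logic, write the traceback when an exception occurs, and then re-raise so that the process still exits with an error. This is a minimal sketch; `run_with_failure_report` is a hypothetical helper name, and the `output_dir` parameter is there only so it can be tested outside the container.

```python
import traceback
from pathlib import Path


def run_with_failure_report(train_fn, output_dir="/opt/ml/output"):
    """Run train_fn; on error, write the traceback to <output_dir>/failure and re-raise."""
    try:
        return train_fn()
    except Exception:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "failure").write_text(traceback.format_exc())
        raise
```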

#### Running your container during hosting

Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests:

![Request serving stack](stack.png)

This stack is implemented in the sample code here and you can mostly just leave it alone. 

Amazon SageMaker uses two URLs in the container:

* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. 
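
To make the two routes concrete, here is a dependency-free sketch written as a bare WSGI callable. It is not the sample's `predictor.py`, which uses Flask; the "model" here is a placeholder that only counts the CSV rows it receives.

```python
def app(environ, start_response):
    """Minimal WSGI app exposing the two routes SageMaker calls."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")

    if path == "/ping" and method == "GET":
        # Health check: return 200 when the container can accept requests
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"\n"]

    if path == "/invocations" and method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)
        # Placeholder "model": report how many non-empty CSV rows were received
        rows = [line for line in body.decode("utf-8").splitlines() if line.strip()]
        start_response("200 OK", [("Content-Type", "text/csv")])
        return [f"{len(rows)}\n".encode("utf-8")]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found\n"]
```

Any WSGI server can run a callable like this; in the sample container that role is played by gunicorn behind nginx.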

The container will have the model files in the same place they were written during training:

    /opt/ml
    `-- model
        `-- <model files>



### The parts of the sample container

In the `container` directory are all the components you need to package the sample algorithm for Amazon SageMaker:

    .
    |-- Dockerfile
    `-- catboost_regressor
        |-- nginx.conf
        |-- predictor.py
        |-- serve
        |-- train
        `-- wsgi.py

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image. More details below.
* __`catboost_regressor`__ is the directory which contains the files that will be installed in the container.
* __`local_test`__ is a directory that shows how to test your new container on any computer that can run Docker, including an Amazon SageMaker notebook instance. Using this method, you can quickly iterate using small datasets to eliminate any structural bugs before you use the container with Amazon SageMaker. We'll walk through local testing later in this notebook.

In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.

The files that we'll put in the container are:

* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.
* __`predictor.py`__ is the program that actually implements the Flask web server and the CatBoost predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.
* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.
* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.
* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.

In summary, the two files you will probably want to change for your application are `train` and `predictor.py`.

In [None]:
!cat container/Dockerfile

In the `container` directory are all the components you need to package the sample algorithm for Amazon SageMaker:

    .
    `-- container/
        |-- Dockerfile
        |-- README.md
        `-- catboost_regressor/
            |-- nginx.conf
            |-- predictor.py
            |-- serve
            |-- train
            `-- wsgi.py


## Building and registering the container

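The following cell builds the container image and pushes it to Amazon Elastic Container Registry (Amazon ECR). It makes the `train` and `serve` scripts executable, looks up the AWS account ID and Region, and creates the ECR repository `sagemaker-catboost-dvc` if it does not already exist. The last command, `sm-docker build`, comes from the SageMaker Studio Image Build CLI: it runs the Docker build remotely in AWS CodeBuild and pushes the resulting image to the named repository, so no local Docker daemon is required.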

In [None]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-catboost-dvc

cd container

chmod +x catboost_regressor/train
chmod +x catboost_regressor/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to eu-west-1 if none defined)
region=$(aws configure get region)
region=${region:-eu-west-1}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

sm-docker build . --repository "${algorithm_name}:latest"

## Configure DVC for data versioning

Let us create a subdirectory where we prepare the data, i.e. `sagemaker-dvc-sample`.
Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in AWS CodeCommit.
Finally, the `dvc` configurations and files for data tracking will be versioned in this repository.

One of the advantages of using AWS CodeCommit is its integration with IAM for authentication, meaning we can use IAM roles to push and pull data without the need to fetch credentials or SSH keys. Setting the appropriate permissions on the SageMaker execution role also allows the SageMaker training job to interact securely with AWS CodeCommit.

In [None]:
%%sh

## Create the repository

repo_name="sagemaker-dvc-sample"

aws codecommit create-repository --repository-name ${repo_name} --repository-description "Sample repository to describe how to use dvc with sagemaker and codecommit"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to eu-west-1 if none defined)
region=$(aws configure get region)
region=${region:-eu-west-1}

## repo_name is already in the .gitignore of the root repo

mkdir -p ${repo_name}
cd ${repo_name}

# initialize new repo in subfolder
git init
## Change the remote to the codecommit
git remote add origin https://git-codecommit."${region}".amazonaws.com/v1/repos/"${repo_name}"

# Configure git
git config --global user.email "you@example.com"
git config --global user.name "Your Name"

git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

# Initialize dvc
dvc init

git commit -m 'Add dvc configuration'

# Set the DVC remote storage to S3
dvc remote add -d storage s3://sagemaker-"${region}"-"${account}"/DEMO-sagemaker-experiments-dvc
git commit .dvc/config -m "initialize DVC local remote"

# set the DVC cache to S3
dvc remote add s3cache s3://sagemaker-"${region}"-"${account}"/DEMO-sagemaker-experiments-dvc/cache
dvc config cache.s3 s3cache

# disable sending anonymized data to dvc for troubleshooting
dvc config core.analytics false

git add .dvc/config
git commit -m 'update dvc config'

git push --set-upstream origin master --force

### Prepare first dataset

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from pathlib import Path

databunch = fetch_california_housing()
dataset = np.concatenate((databunch["target"].reshape(-1, 1), databunch["data"]), axis=1)

print(f"Dataset shape = {dataset.shape}")

train, other = train_test_split(dataset, test_size=0.1)
validation, test = train_test_split(other, test_size=0.5)

print(f"Train shape = {train.shape}")
print(f"Validation shape = {validation.shape}")
print(f"Test shape = {test.shape}")

base_dir = './sagemaker-dvc-sample/dataset'

for path in ['train', 'validation', 'test']:
    output_dir = Path(f"{base_dir}/{path}/")
    output_dir.mkdir(parents=True, exist_ok=True)

pd.DataFrame(train).to_csv(f"{base_dir}/train/california_train.csv", header=False, index=False)
pd.DataFrame(validation).to_csv(f"{base_dir}/validation/california_validation.csv", header=False, index=False)
pd.DataFrame(test).to_csv(f"{base_dir}/test/california_test.csv", header=False, index=False)

### Version data with DVC

In [None]:
%%sh

repo_name="sagemaker-dvc-sample"
cd "${repo_name}"

git checkout -b dataset_v1.0.0

dvc add dataset/test/california_test.csv
dvc add dataset/validation/california_validation.csv
dvc add dataset/train/california_train.csv

git add .

git commit -m 'add dev_dataset_1'

dvc push
git push --set-upstream origin dataset_v1.0.0

git tag v1.0.0
git push --tags

## Using your Algorithm in Amazon SageMaker

In [None]:
import numpy as np
import pandas as pd
import boto3
import sagemaker
import time
from time import strftime

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
sm_client = boto3.client("sagemaker")
region = boto_session.region_name
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
account = sagemaker_session.boto_session.client("sts").get_caller_identity()["Account"]

prefix = 'DEMO-sagemaker-experiments-dvc'

print(f"account: {account}")
print(f"bucket: {bucket}")
print(f"region: {region}")
print(f"role: {role}")

## Setup Experiments

To track this run in SageMaker, we create an experiment and define a trial within it. For simplicity, we start with a single trial, but an experiment can hold any number of trials, for example when you want to compare different algorithms.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

experiment_name = 'DEMO-sagemaker-experiments-dvc'

# create the experiment if it doesn't exist
try:
    my_experiment = Experiment.load(experiment_name=experiment_name)
    print("existing experiment loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_experiment = Experiment.create(
            experiment_name = experiment_name,
            description = "How to integrate DVC"
        )
        print("new experiment created")
    else:
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

first_trial_name = "dvc-trial-v1"

try:
    my_first_trial = Trial.load(trial_name=first_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_first_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=first_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

## Create an estimator and fit the model

To use the DVC integration, pass `dvc-repo-url` and `dvc-tag` as hyperparameters when you create the Estimator object. Here they are set from the `dvc_repo_url` and `dvc_tag` variables.

We will train on the `v1.0.0` tag first.

When doing `dvc pull`, this is the dataset structure:

```
dataset
    |-- train
    |   |-- california_train.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation.csv
```

In [None]:
dvc_repo_url = "codecommit::{}://sagemaker-dvc-sample".format(region)
dvc_tag = "v1.0.0"

In [None]:
dvc_repo_extended = "https://git-codecommit.{}.amazonaws.com/v1/repos/sagemaker-dvc-sample".format(region)

with Tracker.create(display_name="DatasetLineage") as tracker:
    tracker.log_parameters(
        {
            "dataset_git_tag": dvc_tag,
            "dataset_git_repo": dvc_repo_extended
        }
    )

my_first_trial.add_trial_component(tracker.trial_component)

In [None]:
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest".format(account, region)

metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    metric_definitions=metric_definitions,
    instance_type="ml.m5.large",
    sagemaker_session=sagemaker_session,
    hyperparameters={
            "dvc-repo-url": dvc_repo_url,
            "dvc-tag": dvc_tag
    },
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name,
    "TrialComponentDisplayName": "Training"
}
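
Before launching a job, it can help to check the `metric_definitions` regex locally against a sample log line. The snippet below is only an approximation of how the metric is extracted: the log line is a hypothetical example of what the training script might print, and Python's `re.search` is used as a stand-in for the matching that SageMaker performs on the job logs.

```python
import re

metric_regex = r"AE-at-50th-percentile: ([0-9.]+).*$"
sample_log_line = "AE-at-50th-percentile: 0.3142"  # hypothetical log line

# The first capture group holds the metric value
match = re.search(metric_regex, sample_log_line)
median_ae = float(match.group(1)) if match else None
print(median_ae)
```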

In [None]:
estimator.fit(experiment_config=experiment_config)

In the logs above, you can see the following lines, which list the files pulled by DVC:

```
Running dvc pull command
A       train/california_train.csv
A       test/california_test.csv
A       validation/california_validation.csv
3 files added and 3 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']
Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']
```

### Second version of the data


Data preparation

In [None]:
## Data creation 

def split_dataframe(df, num=5):
    # Split df into `num` contiguous chunks of nearly equal size
    return [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), num)]

for index, chunk in enumerate(split_dataframe(pd.DataFrame(train))):
    chunk.to_csv(f"{base_dir}/train/california_train_{index + 1}.csv", header=False, index=False)

for index, chunk in enumerate(split_dataframe(pd.DataFrame(validation), 3)):
    chunk.to_csv(f"{base_dir}/validation/california_validation_{index + 1}.csv", header=False, index=False)

In [None]:
%%sh

repo_name="sagemaker-dvc-sample"
cd "${repo_name}"

git checkout -b dataset_v2.0.0

dvc add dataset/test/california_test*.csv
dvc add dataset/validation/california_validation*.csv
dvc add dataset/train/california_train*.csv

git add .

git commit -m 'add dev_dataset_2'

dvc push
git push --set-upstream origin dataset_v2.0.0

git tag v2.0.0
git push --tags

We will now train on the `v2.0.0` tag.

When doing `dvc pull`, this is the dataset structure:

```
dataset
    |-- train
    |   |-- california_train_1.csv
    |   |-- california_train_2.csv
    |   |-- california_train_3.csv
    |   |-- california_train_4.csv
    |   |-- california_train_5.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation_1.csv
    |   |-- california_validation_2.csv
    |   |-- california_validation_3.csv
```

In [None]:
second_trial_name = "dvc-trial-v2"

try:
    my_second_trial = Trial.load(trial_name=second_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_second_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=second_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

In [None]:
dvc_tag = "v2.0.0"

In [None]:
with Tracker.create(display_name="DatasetLineage") as ptracker:
    ptracker.log_parameters(
        {
            "dataset_git_tag": dvc_tag,
            "dataset_git_repo": dvc_repo_extended
        }
    )

my_second_trial.add_trial_component(ptracker.trial_component)

In [None]:
estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    metric_definitions=metric_definitions,
    instance_type="ml.m5.large",
    sagemaker_session=sagemaker_session,
    hyperparameters={
            "dvc-repo-url": dvc_repo_url,
            "dvc-tag": dvc_tag
        },
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_second_trial.trial_name,
    "TrialComponentDisplayName": "Training"
}

In [None]:
estimator.fit(experiment_config=experiment_config)

In the logs above, you can see the following lines, which list the files pulled by DVC:

```
Running dvc pull command
A       validation/california_validation_2.csv
A       validation/california_validation_1.csv
A       validation/california_validation_3.csv
A       train/california_train_4.csv
A       train/california_train_5.csv
A       train/california_train_2.csv
A       train/california_train_3.csv
A       train/california_train_1.csv
A       test/california_test.csv
9 files added and 9 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']
```
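
The `Found train files` and `Found validation files` lines suggest that the training script picks up every CSV file in a directory. The sample's `train` script is not shown in this notebook, so the following is only a sketch of one way to read such a directory; `read_channel` is a hypothetical helper, and it assumes header-less, purely numeric CSV files like the ones written earlier.

```python
import csv
from pathlib import Path


def read_channel(channel_dir):
    """Read every header-less CSV in channel_dir into one list of float rows."""
    rows = []
    for path in sorted(Path(channel_dir).glob("*.csv")):
        with open(path, newline="") as f:
            # Skip empty lines and convert each remaining row to floats
            rows.extend([float(v) for v in row] for row in csv.reader(f) if row)
    return rows
```

Sorting the file names keeps the row order deterministic, which the unordered listing in the log does not guarantee.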

## Hosting your model

In [None]:
# SageMaker Python SDK v2 exposes serializers as classes in sagemaker.serializers
from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(1, "ml.t2.medium", serializer=CSVSerializer())

## Invoke endpoint with the Python SDK

In [None]:
predicted = predictor.predict(test).decode('utf-8')
print(predicted)

### Delete the Endpoint

Make sure to delete the endpoint to avoid unexpected costs.

In [None]:
predictor.delete_endpoint()

### (Optional) Delete the Experiment and all Trials and TrialComponents

In [None]:
my_experiment.delete_all(action="--force")

### Delete the AWS CodeCommit repository

In [None]:
!aws codecommit delete-repository --repository-name sagemaker-dvc-sample