# Creating a custom container and Estimator to run CatBoost on SageMaker

In this notebook, we use the SageMaker Training Toolkit (https://github.com/aws/sagemaker-training-toolkit) to create a SageMaker-compatible Docker image that runs Python scripts using the CatBoost library. We also show how to create a custom SageMaker training `Estimator` from the SageMaker `Framework` class (https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Framework).

CatBoost is a high-performance open source library for gradient boosting on decision trees. You can learn more about it at the following links:
* https://tech.yandex.com/catboost/
* https://catboost.ai/
* https://github.com/catboost/catboost


<br/><br/><br/>

The text below describes the Boston Housing dataset from Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset. Note that the code cells in Step 2 load a wine dataset (`wines.csv`) instead.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

References

 * Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
 * Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive code from it for your own use cases!**

## Step 1: Container creation and upload to Amazon ECR

### Creating a SageMaker-compatible CatBoost container
We derive our Dockerfile from the SageMaker Scikit-Learn Dockerfile: https://github.com/aws/sagemaker-scikit-learn-container/blob/master/docker/0.20.0/base/Dockerfile.cpu

In [None]:
%%writefile Dockerfile

FROM ubuntu:16.04

RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq libatlas3-base

RUN curl -LO http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -bfp /miniconda3 && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/miniconda3/bin:${PATH}
        
RUN apt-get update && apt-get install -y python-pip && pip install sagemaker-training catboost scikit-learn setuptools wheel spacy && python -m spacy download en_core_web_sm

ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1 PYTHONIOENCODING=UTF-8

### Sending the container to ECR

In [None]:
import boto3
import sagemaker

from sagemaker import get_execution_role

role = get_execution_role()

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'catboost-image'

ecr_repository_name = ecr_namespace + prefix
account_id = role.split(':')[4]
region = boto3.Session().region_name
sess = sagemaker.session.Session()
bucket = sess.default_bucket()

print('Account: {}'.format(account_id))
print('Region: {}'.format(region))
print('Role: {}'.format(role))
print('S3 Bucket: {}'.format(bucket))

In [None]:
%%writefile build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3


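REGISTRY=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# Build the image and tag it for the target ECR repository
sudo docker build -f Dockerfile -t $REPO_NAME .

docker tag $REPO_NAME $REGISTRY/$REPO_NAME:latest

# Log in to ECR (`aws ecr get-login` is not available in AWS CLI v2)
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $REGISTRY

# Create the repository if it does not exist yet
aws ecr describe-repositories --region $REGION --repository-names $REPO_NAME || aws ecr create-repository --region $REGION --repository-name $REPO_NAME

docker push $REGISTRY/$REPO_NAME:latest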

In [None]:
!bash build_and_push.sh $account_id $region $ecr_repository_name

In [None]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print('ECR image URI: {}'.format(container_image_uri))

The Docker image is now pushed to ECR and ready for use! In the next section, we step into the shoes of an ML practitioner who develops a CatBoost model and trains it remotely on Amazon SageMaker.

## Step 2: Local ML development and remote training job with Amazon SageMaker

We install CatBoost and the other dependencies in the notebook environment for local development.

In [None]:
!pip install catboost 
!pip install scikit-optimize
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

### Data processing
We use pandas to split a small local dataset into training and test sets.

We could also design code that loads all the data and runs cross-validation within the script. 
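
To illustrate what running cross-validation within the script means, here is a minimal standard-library sketch of k-fold index splitting. The function name `k_fold_indices` is hypothetical and this is not how CatBoost implements its folds; it only shows that every row serves as validation data exactly once.

```python
import random


def k_fold_indices(n_rows, n_folds=3, seed=42):
    """Yield (train_idx, valid_idx) pairs; each row is in exactly one validation fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


for train_idx, valid_idx in k_fold_indices(n_rows=10, n_folds=3):
    print(len(train_idx), len(valid_idx))
```

With cross-validation, a single dataset is enough to estimate the validation error, at the cost of training the model once per fold.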

In [None]:
import os

import pandas as pd
from sklearn.model_selection import train_test_split

### Developing a local training script

In [None]:
!wget https://black-belt-ml-challenge.s3.us-east-2.amazonaws.com/wines.csv

In [None]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
#import spacy
#nlp = spacy.load('en_core_web_sm')

In [None]:
df = pd.read_csv('wines.csv')
df.loc[:,'log1p_price'] = np.log1p(df.price)
df.loc[:,'len_description']=df.description.str.len()
df.loc[:,'len_title']=df.title.str.len()
df.loc[:,'len_winery']=df.winery.str.len()
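
The cell above derives a log-transformed price and a few string-length features. As a small sketch of what these transformations compute, the example below applies the same idea to two made-up rows using only the standard library (the row values are illustrative and do not come from `wines.csv`):

```python
import math

# Made-up rows standing in for wines.csv; the values are illustrative only
rows = [
    {"price": 15.0, "description": "Ripe and fruity."},
    {"price": 235.0, "description": "Dense, structured, long finish."},
]

for row in rows:
    row["log1p_price"] = math.log1p(row["price"])      # log(1 + price), defined at price == 0
    row["len_description"] = len(row["description"])   # number of characters

# expm1 inverts log1p, so the original price can be recovered
recovered = [math.expm1(row["log1p_price"]) for row in rows]
print(rows[0]["len_description"], recovered)
```

The log transform compresses the long right tail that prices typically have, which is one common reason to add such a feature.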

In [None]:
#df.loc[:,'doc'] = df.apply(lambda x : nlp(x.description),axis=1)

In [None]:
#good_vectors = [ 1,  2,  3,  5,  6, 11, 14, 15, 16, 20, 21, 22, 24, 25, 26, 27, 28,
#       29, 30, 31, 32, 33, 36, 37, 38, 39, 41, 42, 43, 45, 46, 47, 48, 49,
#       51, 52, 53, 54, 58, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73,
#       74, 75, 76, 77, 79, 81, 82, 84, 85, 87, 88, 89, 90, 91, 93, 94, 95]
#df_tensor_sum = df.apply(lambda x : x.doc.tensor[:,good_vectors].sum(axis=0),axis=1,result_type='expand')

In [None]:
#df_tensor_sum.columns = ['tensor_value_'+str(i) for i in df_tensor_sum.columns]

In [None]:
#df = pd.concat([df,df_tensor_sum],axis=1)

In [None]:
#df.loc[:,'max_tensor'] = df.doc.apply(lambda x : x.tensor.max())
#df.loc[:,'sum_tensor'] = df.doc.apply(lambda x : x.tensor.sum())
#df.loc[:,'count_ents']= df.doc.apply(lambda x : len([ent for ent in x.ents]))
#df.loc[:,'count_ADJ']= df.doc.apply(lambda x : len([token.pos_ for token in x if token.pos_ =='ADJ']))
#df.loc[:,'count_is_not_stop']= df.doc.apply(lambda x : len([token.pos_ for token in x if token.is_stop==False ]))



In [None]:
#df.loc[:,'contains_ripe']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='ripe']))
#df.loc[:,'contains_red']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='red']))
#df.loc[:,'contains_rich']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='rich']))
#df.loc[:,'contains_fresh']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='fresh']))
#df.loc[:,'contains_soft']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='soft']))
#df.loc[:,'contains_sweet']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='sweet']))
#df.loc[:,'contains_green']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='green']))
#df.loc[:,'contains_simple']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='simple']))
#df.loc[:,'contains_light']= df.doc.apply(lambda x : len([token for token in x if token.pos_ =='ADJ' and  token.is_stop==False and token.lemma_=='light']))


In [None]:
text_features = ['description', 'designation']

cat_features = ['country', 'province', 'region_1', 'region_2', 
                'taster_name', 'taster_twitter_handle', 'variety',
                'winery']

df.loc[:,cat_features] = df.loc[:,cat_features].fillna('Missing')

df.loc[:,text_features] = df.loc[:,text_features].fillna('Missing')
#df = df.drop(columns=['doc', 'title'])
df = df.drop(columns=['title'])
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)

In [None]:
local_train='wines_train.csv'
local_test='wines_test.csv'

In [None]:
df_train.to_csv(local_train, index=False)  # skip the index to avoid an extra unnamed column

In [None]:
df_test.to_csv(local_test, index=False)

In [None]:
# send data to S3. SageMaker will take training data from S3
train_location = sess.upload_data(
    path=local_train, 
    bucket=bucket,
    key_prefix='catboost')

test_location = sess.upload_data(
    path=local_test, 
    bucket=bucket,
    key_prefix='catboost')
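
As a rough sketch of where the files end up, the helpers below compose the S3 URI we expect `upload_data` to return for a single file, and the path at which a File-mode input channel is usually mounted in the training container. Both helper names and the bucket name are hypothetical, and the layout is stated as an assumption rather than taken from this notebook's outputs.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Assumed layout for a single uploaded file: s3://<bucket>/<key_prefix>/<filename>."""
    filename = posixpath.basename(local_path)
    return "s3://" + posixpath.join(bucket, key_prefix, filename)


def expected_container_path(channel, local_path):
    """Assumed File-mode location of a channel inside the training container."""
    return posixpath.join("/opt/ml/input/data", channel, posixpath.basename(local_path))


print(expected_s3_uri("example-bucket", "catboost", "wines_train.csv"))
print(expected_container_path("train", "wines_train.csv"))
```

The second path is what the training script reads through the `SM_CHANNEL_TRAIN` environment variable.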

In [None]:
%%writefile catboost_training_wines.py

import argparse
import logging
import os

from catboost import CatBoostRegressor
from catboost import Pool, cv
import numpy as np
import pandas as pd
from sklearn import metrics


if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='wines_train.csv')
    parser.add_argument('--test-file', type=str, default='wines_test.csv')
    parser.add_argument('--model-name', type=str, default='catboost_model.dump')
    parser.add_argument('--features', type=str)  # space-separated feature column names, set explicitly by the user
    parser.add_argument('--cat_features', type=str)  # space-separated categorical column names, set explicitly by the user
    parser.add_argument('--target', type=str)  # name of the target column, set explicitly by the user
    parser.add_argument('--learning_rate', type=float)  # CatBoost learning rate
    parser.add_argument('--depth', type=int)  # CatBoost tree depth
    parser.add_argument('--l2_leaf_reg', type=int)  # CatBoost L2 regularization coefficient
    
    args, _ = parser.parse_known_args()

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    
    logging.info('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    logging.info('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]
        
    # define and train model
    #model = CatBoostRegressor(learning_rate=args.learning_rate,depth=args.depth,l2_leaf_reg=args.l2_leaf_reg,cat_features=args.cat_features.split())
    #
    #model.fit(X_train, y_train, eval_set=(X_test, y_test), logging_level='Silent') 
    #
    ## print abs error
    #logging.info('validating model')
    #abs_err = np.abs(model.predict(X_test) - y_test)
    #preds = model.predict(X_test).round(0)
    #models_evals = {'explained_variance_score' : [metrics.explained_variance_score(y_test, preds)],
    #            'max_error' : [metrics.max_error(y_test, preds)],
    #            'mean_absolute_error' : [metrics.mean_absolute_error(y_test, preds)],
    #            'root_mean_squared_error' : [metrics.mean_squared_error(y_test, preds)**(1/2)],
    #            'mean_squared_error' : [metrics.mean_squared_error(y_test, preds)],
    #            'mean_squared_log_error' : [metrics.mean_squared_log_error(y_test, preds)],
    #            'median_absolute_error' : [metrics.median_absolute_error(y_test, preds)],
    #            #metrics.mean_absolute_percentage_error(y_test, preds),
    #            'r2_score' : [metrics.r2_score(y_test, preds)]}
    
        # print couple perf metrics
    #for q in models_evals.keys():
    #    logging.info(str(q)+' : '+ str(models_evals[q]))
    
    cv_dataset = Pool(data=X_train,
                  label=y_train,
                  cat_features=args.cat_features.split())

    params = {"iterations": 1000,
              "learning_rate":args.learning_rate,
              "depth": args.depth,
              "loss_function": "RMSE",
              "l2_leaf_reg": args.l2_leaf_reg,
              "verbose": False}

    scores = cv(cv_dataset,
                params,
                fold_count=3, 
            )
    
    logging.info('rmse'+': '+ str(scores['test-RMSE-mean'].iloc[-1]))
    # print couple perf metrics
    #for q in [10, 50, 90]:
    #    logging.info('AE-at-' + str(q) + 'th-percentile: '
    #          + str(np.percentile(a=abs_err, q=q)))
    
    # persist model
    #path = os.path.join(args.model_dir, args.model_name)
    #logging.info('saving to {}'.format(path))
    #model.save_model(path)
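
The script receives its configuration in two ways: data locations default to `SM_*` environment variables, and hyperparameters arrive as command-line flags. The sketch below reproduces that pattern in isolation. The `parse_args` helper, the environment value set by hand and the `--unknown-flag` argument are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Falls back to the environment variable when the flag is not passed
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--features", type=str)
    parser.add_argument("--depth", type=int)
    # parse_known_args ignores flags the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the variable that would be set inside the training container
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"

args = parse_args(["--features", "price country", "--depth", "4", "--unknown-flag", "x"])
print(args.train, args.features.split(), args.depth)
```

Because column lists travel as a single space-separated string, this approach assumes that no column name contains whitespace.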


### Testing our script locally

In [None]:
features_str = ' '.join([i for i in df_train.columns if i != 'points'])  # every column except the target
features_str
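
One caveat when filtering column names in Python: `('points')` is a plain string, not a one-element tuple, so `in` performs a substring test against it. The sketch below, with made-up column names, shows the difference:

```python
columns = ["points", "price", "point", "country"]  # made-up column names

# ("points") is just the string "points": `in` checks for substrings
as_string = [c for c in columns if c not in ("points")]

# ("points",) is a one-element tuple: `in` checks for exact membership
as_tuple = [c for c in columns if c not in ("points",)]

print(as_string)
print(as_tuple)
```

With the string form, a column such as `point` would silently be dropped from the feature list as well.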

In [None]:
cat_features_str = ' '.join([i for i in df_train.columns if i in cat_features+text_features])
cat_features_str

In [None]:
%%time
# local test

! python catboost_training_wines.py \
    --train ./ \
    --test ./ \
    --model-dir ./ \
    --features 'country description designation price province region_1 region_2 taster_name taster_twitter_handle variety winery log1p_price len_description len_title len_winery' \
    --cat_features 'country description designation province region_1 region_2 taster_name taster_twitter_handle variety winery' \
    --target 'points' \
    --learning_rate 0.1 \
    --depth 4 \
    --l2_leaf_reg 2 

## Remote training in SageMaker

### Option 1: Launch a SageMaker training job from code uploaded to S3

With this option, we first need to send the code to S3. This could also be done automatically by a build system.

In [None]:
import tarfile

In [None]:
# first compress the code and send to S3
program = 'catboost_training_wines.py'
source = 'source.tar.gz'
project = 'catboost'

tar = tarfile.open(source, 'w:gz')
tar.add(program)
tar.close()

# upload_data appends the file name to key_prefix, so the prefix is only the project folder
submit_dir = sess.upload_data(
    path=source, 
    bucket=bucket,
    key_prefix=project)

print(submit_dir)
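
Before uploading, it can be useful to check what the archive contains, since the entry-point script named in `sagemaker_program` is expected to be found once the archive is extracted. Here is a minimal, self-contained sketch that builds a throwaway archive in a temporary directory and lists its members (the script name `train.py` is hypothetical):

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    script_path = os.path.join(workdir, "train.py")  # hypothetical script name
    with open(script_path, "w") as f:
        f.write("print('hello from the training script')\n")

    archive_path = os.path.join(workdir, "source.tar.gz")
    # arcname stores the script at the archive root instead of under its full path
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(script_path, arcname="train.py")

    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getnames()

print(members)
```

Using `tarfile.open` as a context manager also closes the archive if an error occurs while adding files.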

We then launch a training job with the `Estimator` class.

In [None]:
from sagemaker.estimator import Estimator

In [None]:
output_path = 's3://' + bucket + '/' + project + '/' + 'training_jobs'

# SageMaker Python SDK v2 argument names (instance_count / instance_type)
estimator = Estimator(image_uri=container_image_uri,
                      role=role,
                      max_run=20*60,
                      instance_count=1,
                      instance_type='ml.m5.xlarge',
                      output_path=output_path,
                      use_spot_instances=True,
                      max_wait=20*60,
                      hyperparameters={'sagemaker_program': program,
                                       'sagemaker_submit_directory': submit_dir,
                                       'features': features_str,
                                       'cat_features': cat_features_str,
                                       'target': 'points'})

In [None]:
%%time
estimator.fit({'train':train_location, 'test': test_location}, logs=True)

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

In [None]:
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.01, 0.1, scaling_type="Logarithmic"),
    "depth": IntegerParameter(4, 10),
    "l2_leaf_reg": IntegerParameter(1, 9),
}
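
To give some intuition for the `Logarithmic` scaling type used for `learning_rate`, the sketch below draws values uniformly in log space between 0.01 and 0.1. This only illustrates the scaling idea; it does not reproduce the Bayesian search strategy that SageMaker applies, and the helper name is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 0.1, rng) for _ in range(1000)]

# In log space, the midpoint of [0.01, 0.1] is the geometric mean (about 0.0316)
geometric_mean = math.sqrt(0.01 * 0.1)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(round(geometric_mean, 4), share_below)
```

About half of the draws fall below roughly 0.032, whereas a linear scale would place only about a quarter of them there.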

In [None]:
objective_metric_name = "rmse"
metric_definitions = [{"Name": "rmse", "Regex": "rmse: ([0-9\\.]+)"}]
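
The tuner extracts the objective value by applying this regular expression to the training job logs, so it is worth checking the pattern against a sample line. The log line below is made up for illustration; the exact prefix added by the logging handler may differ.

```python
import re

metric_regex = r"rmse: ([0-9\.]+)"      # same pattern as in metric_definitions
log_line = "INFO:root:rmse: 2.4173"     # made-up log line

match = re.search(metric_regex, log_line)
value = float(match.group(1))
print(value)

# Caveat: a value printed in scientific notation is only partially captured
partial = re.search(metric_regex, "rmse: 1e-05").group(1)
print(partial)
```

If the script ever logs very small values, formatting the metric with a fixed number of decimals avoids the truncation shown in the second case.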

In [None]:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    strategy='Bayesian',
    objective_type="Minimize",
    max_jobs=50,
    max_parallel_jobs=10,
)

In [None]:
%%time
tuner.fit({'train':train_location, 'test': test_location},logs=True)

In [None]:
sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe().sort_values(['FinalObjectiveValue'])