# Creating a custom container and Estimator to run CatBoost on SageMaker

In this notebook, we use the SageMaker Training Toolkit (https://github.com/aws/sagemaker-training-toolkit) to create a SageMaker-compatible Docker image that runs Python scripts using the CatBoost library. We also show how to create a custom SageMaker training `Estimator` from the SageMaker `Framework` class (https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Framework).

CatBoost is a high-performance open source library for gradient boosting on decision trees. You can learn more about it at the following links:
* https://tech.yandex.com/catboost/
* https://catboost.ai/
* https://github.com/catboost/catboost



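The description below refers to the Boston Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset. Note that the code cells in this notebook load a wine review dataset (`wines.csv`) and predict its `points` column.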

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

References

 * Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980, 244-261.
 * Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive code from it for your own use cases!**

## Step 1: Container creation and upload to Amazon ECR

### Creating a SageMaker-compatible CatBoost container
We derive our Dockerfile from the SageMaker Scikit-Learn Dockerfile: https://github.com/aws/sagemaker-scikit-learn-container/blob/master/docker/0.20.0/base/Dockerfile.cpu

In [1]:
%%writefile Dockerfile

FROM ubuntu:16.04

RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq libatlas3-base

RUN curl -LO http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -bfp /miniconda3 && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/miniconda3/bin:${PATH}
        
RUN apt-get update && apt-get install -y python-pip && pip install sagemaker-training catboost scikit-learn setuptools wheel spacy && python -m spacy download en_core_web_sm

ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1 PYTHONIOENCODING=UTF-8

Writing Dockerfile


### Sending the container to ECR

In [2]:
import boto3
import sagemaker

from sagemaker import get_execution_role

role = get_execution_role()

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'catboost-image'

ecr_repository_name = ecr_namespace + prefix
account_id = role.split(':')[4]
region = boto3.Session().region_name
sess = sagemaker.session.Session()
bucket = sess.default_bucket()

print('Account: {}'.format(account_id))
print('Region: {}'.format(region))
print('Role: {}'.format(role))
print('S3 Bucket: {}'.format(bucket))

Account: 641677763413
Region: us-east-2
Role: arn:aws:iam::641677763413:role/TeamRole
S3 Bucket: sagemaker-us-east-2-641677763413
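
The account ID above is read from the fifth colon-separated field of the role ARN. As a minimal sketch of that parsing step (the helper name `account_id_from_role_arn` is hypothetical, not part of the SageMaker SDK), the same logic with a basic sanity check could look like this:

```python
def account_id_from_role_arn(role_arn: str) -> str:
    """Return the account ID embedded in an IAM role ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {role_arn!r}")
    return parts[4]


# The role ARN printed by the cell above
print(account_id_from_role_arn("arn:aws:iam::641677763413:role/TeamRole"))
```

IAM ARNs leave the region field empty, which is why two colons appear in a row before the account ID.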


In [3]:
%%writefile build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3


sudo docker build -f Dockerfile -t $REPO_NAME .

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

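# `aws ecr get-login` is not available in AWS CLI v2; `get-login-password` works in v2 and in recent v1 releases
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com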

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

Writing build_and_push.sh


In [4]:
!bash build_and_push.sh $account_id $region $ecr_repository_name

Sending build context to Docker daemon  440.3kB
Step 1/6 : FROM ubuntu:16.04
16.04: Pulling from library/ubuntu

3ba1d414: Pulling fs layer
39f216bd: Pulling fs layer
abdc9f90: Pulling fs layer
ff7bcc24: Pull complete
Digest: sha256:6aab78d1825b4c15c159fecc62b8eef4fdf0c693a15aace3a605ad44e5e2df0c
Status: Downloaded newer image for ubuntu:16.04
 ---> 065cf14a189c
Step 2/6 : RUN apt-get update &&     apt-get -y install build-essential libatlas-dev git wget curl nginx jq libatlas3-base
 ---> Running in 11531b6d690c
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial InRelease [247 kB]
Get:3 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages [2051 kB]
Get:4 http://secur

In [5]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print('ECR container image URI: {}'.format(container_image_uri))

ECR container image URI: 641677763413.dkr.ecr.us-east-2.amazonaws.com/sagemaker-training-containers/catboost-image:latest
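
The image reference printed above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, which is also what `build_and_push.sh` tags and pushes. A small sketch of a helper that assembles it (the function name `ecr_image_uri` is an illustrative assumption) makes the moving parts explicit:

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Assemble the registry address, repository name, and tag into one image reference."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


print(ecr_image_uri("641677763413", "us-east-2",
                    "sagemaker-training-containers/catboost-image"))
```

Pinning a specific tag instead of `latest` would make it easier to tell which image a given training job used.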


The Docker image is now pushed to ECR and is ready for consumption! In the next section, we step into the shoes of an ML practitioner who develops a CatBoost model and runs it remotely on Amazon SageMaker.

## Step 2: Local ML development and remote training job with Amazon SageMaker

We install CatBoost, scikit-optimize, and spaCy locally for development.

In [6]:
!pip install catboost 
!pip install scikit-optimize
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting catboost
  Downloading catboost-0.26-cp36-none-manylinux1_x86_64.whl (69.2 MB)
Collecting graphviz
  Downloading graphviz-0.16-py2.py3-none-any.whl (19 kB)
Installing collected packages: graphviz, catboost
Successfully installed catboost-0.26 graphviz-0.16
Collecting scikit-optimize
  Downloading scikit_optimize-0.8.1-py2.py3-none-any.whl (101 kB)
Collecting pyaml>=16.9
  Downloading pyaml-20.4.0-py2.py3-none-any.whl (17 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-20.4.0 scikit-optimize-0.8.1
Collecting setuptools
  Downloading setuptools-57.0.0-py3-none-any.whl (821 kB)
Installing collected packages: setuptools
  Attempting uninstall: setuptools
   

### Data processing
We use pandas to split a small local dataset into training and test sets.

We could also design code that loads all the data and runs cross-validation within the script. 
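
To illustrate the cross-validation alternative mentioned above, here is a dependency-free sketch of how row indices can be divided into folds. It is not how CatBoost or Scikit-Learn implement it; the function name `kfold_indices` and the seed are assumptions for illustration. The training script later in this notebook delegates this work to `catboost.cv` with `fold_count=3`.

```python
import random


def kfold_indices(n_rows, n_folds=3, seed=42):
    """Yield (train_idx, test_idx) pairs; every row is used as test data exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(10, n_folds=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each fold's score would then be averaged, which tends to give a steadier estimate than a single train/test split at the cost of training several models.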

In [7]:
import os

import pandas as pd
from sklearn.model_selection import train_test_split

### Developing a local training script

In [8]:
!wget https://black-belt-ml-challenge.s3.us-east-2.amazonaws.com/wines.csv

--2021-06-25 01:42:09--  https://black-belt-ml-challenge.s3.us-east-2.amazonaws.com/wines.csv
Resolving black-belt-ml-challenge.s3.us-east-2.amazonaws.com (black-belt-ml-challenge.s3.us-east-2.amazonaws.com)... 52.219.88.192
Connecting to black-belt-ml-challenge.s3.us-east-2.amazonaws.com (black-belt-ml-challenge.s3.us-east-2.amazonaws.com)|52.219.88.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39098140 (37M) [text/csv]
Saving to: ‘wines.csv’


2021-06-25 01:42:10 (72.4 MB/s) - ‘wines.csv’ saved [39098140/39098140]



In [9]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
import spacy
nlp = spacy.load('en_core_web_sm')

In [10]:
df = pd.read_csv('wines.csv')
df.loc[:,'log1p_price'] = np.log1p(df.price)
df.loc[:,'len_description']=df.description.str.len()
df.loc[:,'len_title']=df.title.str.len()
df.loc[:,'len_winery']=df.winery.str.len()
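
The `log1p_price` column above stores log(1 + price). A common reason for this kind of transform, assumed here rather than stated in the notebook, is to compress a long right tail of prices. The sketch below uses the standard library to show the transform and its inverse; the helper names are illustrative.

```python
import math


def to_log1p(price: float) -> float:
    """log(1 + price); defined for a price of zero, unlike a plain log."""
    return math.log1p(price)


def from_log1p(value: float) -> float:
    """Inverse transform: exp(value) - 1."""
    return math.expm1(value)


for price in (0.0, 18.0, 2500.0):
    transformed = to_log1p(price)
    print(price, round(transformed, 4), round(from_log1p(transformed), 4))
```

Since `price` itself is also kept as a feature, a tree-based model may gain little from the transformed copy; it mostly matters for models sensitive to feature scale.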

In [11]:
df.loc[:,'doc'] = df.apply(lambda x : nlp(x.description),axis=1)

In [12]:
good_vectors = [ 1,  2,  3,  5,  6, 11, 14, 15, 16, 20, 21, 22, 24, 25, 26, 27, 28,
       29, 30, 31, 32, 33, 36, 37, 38, 39, 41, 42, 43, 45, 46, 47, 48, 49,
       51, 52, 53, 54, 58, 61, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 73,
       74, 75, 76, 77, 79, 81, 82, 84, 85, 87, 88, 89, 90, 91, 93, 94, 95]
df_tensor_sum = df.apply(lambda x : x.doc.tensor[:,good_vectors].sum(axis=0),axis=1,result_type='expand')

In [13]:
df_tensor_sum.columns = ['tensor_value_'+str(i) for i in df_tensor_sum.columns]

In [14]:
df = pd.concat([df,df_tensor_sum],axis=1)

In [15]:
df.loc[:,'max_tensor'] = df.doc.apply(lambda x : x.tensor.max())
df.loc[:,'sum_tensor'] = df.doc.apply(lambda x : x.tensor.sum())
df.loc[:,'count_ents'] = df.doc.apply(lambda x: len(x.ents))
df.loc[:,'count_ADJ'] = df.doc.apply(lambda x: sum(1 for token in x if token.pos_ == 'ADJ'))
df.loc[:,'count_is_not_stop'] = df.doc.apply(lambda x: sum(1 for token in x if not token.is_stop))



In [16]:
# count, per description, the non-stop-word adjectives whose lemma matches each keyword
adjectives = ['ripe', 'red', 'rich', 'fresh', 'soft', 'sweet', 'green', 'simple', 'light']
for adj in adjectives:
    df.loc[:,'contains_'+adj] = df.doc.apply(
        lambda x, adj=adj: sum(1 for token in x if token.pos_ == 'ADJ' and not token.is_stop and token.lemma_ == adj))


In [17]:
text_features = ['description', 'designation']

cat_features = ['country', 'province', 'region_1', 'region_2', 
                'taster_name', 'taster_twitter_handle', 'variety',
                'winery']

df.loc[:,cat_features] = df.loc[:,cat_features].fillna('Missing')

df.loc[:,text_features] = df.loc[:,text_features].fillna('Missing')
df = df.drop(columns=['doc', 'title'])
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)

In [18]:
local_train='wines_train.csv'
local_test='wines_test.csv'

In [19]:
df_train.to_csv(local_train)

In [20]:
df_test.to_csv(local_test)

In [21]:
# send data to S3. SageMaker will take training data from S3
train_location = sess.upload_data(
    path=local_train, 
    bucket=bucket,
    key_prefix='catboost')

test_location = sess.upload_data(
    path=local_test, 
    bucket=bucket,
    key_prefix='catboost')

In [27]:
%%writefile catboost_training_wines.py

import argparse
import logging
import os

from catboost import CatBoostRegressor
from catboost import Pool, cv
import numpy as np
import pandas as pd
from sklearn import metrics


if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='wines_train.csv')
    parser.add_argument('--test-file', type=str, default='wines_test.csv')
    parser.add_argument('--model-name', type=str, default='catboost_model.dump')
    parser.add_argument('--features', type=str)  # in this script we ask user to explicitly name features
    parser.add_argument('--cat_features', type=str)  # in this script we ask user to explicitly name cat_features
    parser.add_argument('--target', type=str) # in this script we ask user to explicitly name the target
    parser.add_argument('--learning_rate', type=float) # CatBoost learning rate
    parser.add_argument('--depth', type=int) # maximum tree depth
    parser.add_argument('--l2_leaf_reg', type=int) # L2 regularization coefficient on leaf values
    
    args, _ = parser.parse_known_args()

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    
    logging.info('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    logging.info('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]
        
    # Unlike catboost_training_wines_v0.py, this version does not fit and evaluate a single model.
    # It runs 3-fold cross-validation on the training data and logs the resulting RMSE.
    
    cv_dataset = Pool(data=X_train,
                  label=y_train,
                  cat_features=args.cat_features.split())

    params = {"iterations": 1000,
              "learning_rate":args.learning_rate,
              "depth": args.depth,
              "loss_function": "RMSE",
              "l2_leaf_reg": args.l2_leaf_reg,
              "verbose": False}

    scores = cv(cv_dataset,
                params,
                fold_count=3, 
            )
    
    logging.info('rmse'+': '+ str(scores['test-RMSE-mean'].iloc[-1]))
    # Note: this script does not persist a model; see catboost_training_wines_v0.py for a version that does.


Overwriting catboost_training_wines.py
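
The script above takes its defaults from `SM_*` environment variables and receives everything else as command-line arguments, so it can run both locally and inside the container. Below is a reduced sketch of that pattern. The environment values are hypothetical and the `parse_args` wrapper is an illustration, not part of the notebook's script.

```python
import argparse
import os


def parse_args(argv=None, environ=os.environ):
    """Parse script arguments, falling back to SageMaker-style environment variables."""
    parser = argparse.ArgumentParser()
    # Defaults come from the environment, so the same script runs locally and remotely.
    parser.add_argument('--model-dir', type=str, default=environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--features', type=str)
    parser.add_argument('--depth', type=int)
    # parse_known_args tolerates unexpected arguments instead of exiting with an error.
    return parser.parse_known_args(argv)


# Hypothetical values, for illustration only.
fake_env = {'SM_MODEL_DIR': '/opt/ml/model', 'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train'}
args, unknown = parse_args(['--features', 'price country', '--depth', '4', '--extra', '1'],
                           fake_env)
print(args.train, args.features.split(), args.depth, unknown)
```

Passing the feature list as one space-separated string and calling `.split()` keeps the hyperparameter a plain string, at the cost of not supporting column names that contain spaces.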


In [28]:
%%writefile catboost_training_wines_v0.py

import argparse
import logging
import os

from catboost import CatBoostRegressor
import numpy as np
import pandas as pd
from sklearn import metrics

if __name__ =='__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='wines_train.csv')
    parser.add_argument('--test-file', type=str, default='wines_test.csv')
    parser.add_argument('--model-name', type=str, default='catboost_model.dump')
    parser.add_argument('--features', type=str)  # in this script we ask user to explicitly name features
    parser.add_argument('--cat_features', type=str)  # in this script we ask user to explicitly name cat_features
    parser.add_argument('--target', type=str) # in this script we ask user to explicitly name the target
    parser.add_argument('--learning_rate', type=float) # CatBoost learning rate
    parser.add_argument('--depth', type=int) # maximum tree depth
    parser.add_argument('--l2_leaf_reg', type=int) # L2 regularization coefficient on leaf values
    
    args, _ = parser.parse_known_args()

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    
    logging.info('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    logging.info('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]
        
    # define and train model
    model = CatBoostRegressor(learning_rate=args.learning_rate,depth=args.depth,l2_leaf_reg=args.l2_leaf_reg,cat_features=args.cat_features.split())
    
    model.fit(X_train, y_train, eval_set=(X_test, y_test), logging_level='Silent') 
    
    # print abs error
    logging.info('validating model')
    abs_err = np.abs(model.predict(X_test) - y_test)
    preds = model.predict(X_test).round(0)
    models_evals = {'explained_variance_score' : [metrics.explained_variance_score(y_test, preds)],
                'max_error' : [metrics.max_error(y_test, preds)],
                'mean_absolute_error' : [metrics.mean_absolute_error(y_test, preds)],
                'root_mean_squared_error' : [metrics.mean_squared_error(y_test, preds)**(1/2)],
                'mean_squared_error' : [metrics.mean_squared_error(y_test, preds)],
                'mean_squared_log_error' : [metrics.mean_squared_log_error(y_test, preds)],
                'median_absolute_error' : [metrics.median_absolute_error(y_test, preds)],
                #metrics.mean_absolute_percentage_error(y_test, preds),
                'r2_score' : [metrics.r2_score(y_test, preds)]}
    
    # log every evaluation metric
    for q in models_evals.keys():
        logging.info(str(q)+' : '+ str(models_evals[q]))
    
    logging.info('rmse'+': '+ str(metrics.mean_squared_error(y_test, preds)**(1/2)))
    # log absolute-error percentiles
    for q in [10, 50, 90]:
        logging.info('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))
    
    # persist model
    path = os.path.join(args.model_dir, args.model_name)
    logging.info('saving to {}'.format(path))
    model.save_model(path)

Overwriting catboost_training_wines_v0.py


### Testing our script locally

In [34]:
features_str = ' '.join([i for i in df_train.columns if i != 'points'])
features_str

'country description designation price province region_1 region_2 taster_name taster_twitter_handle variety winery log1p_price len_description len_title len_winery tensor_value_0 tensor_value_1 tensor_value_2 tensor_value_3 tensor_value_4 tensor_value_5 tensor_value_6 tensor_value_7 tensor_value_8 tensor_value_9 tensor_value_10 tensor_value_11 tensor_value_12 tensor_value_13 tensor_value_14 tensor_value_15 tensor_value_16 tensor_value_17 tensor_value_18 tensor_value_19 tensor_value_20 tensor_value_21 tensor_value_22 tensor_value_23 tensor_value_24 tensor_value_25 tensor_value_26 tensor_value_27 tensor_value_28 tensor_value_29 tensor_value_30 tensor_value_31 tensor_value_32 tensor_value_33 tensor_value_34 tensor_value_35 tensor_value_36 tensor_value_37 tensor_value_38 tensor_value_39 tensor_value_40 tensor_value_41 tensor_value_42 tensor_value_43 tensor_value_44 tensor_value_45 tensor_value_46 tensor_value_47 tensor_value_48 tensor_value_49 tensor_value_50 tensor_value_51 tensor_value_5

In [35]:
cat_features_str = ' '.join([i for i in df_train.columns if i in cat_features+text_features])
cat_features_str

'country description designation province region_1 region_2 taster_name taster_twitter_handle variety winery'
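
One detail worth keeping in mind when excluding columns by name: in Python, `('points')` is a plain string, not a one-element tuple, so `in` performs a substring test on it. The toy column list below (including the hypothetical `point` column) shows the difference.

```python
columns = ["points", "point", "price", "country"]

# ('points') is just the string 'points', so `in` checks for a substring.
substring_filtered = [c for c in columns if c not in ('points')]

# ('points',) is a one-element tuple, so `in` checks for membership.
tuple_filtered = [c for c in columns if c not in ('points',)]

print(substring_filtered)
print(tuple_filtered)
```

With the wine dataset both forms happen to give the same feature list, because no other column name is a substring of `points`.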

In [31]:
# local test
! python catboost_training_wines.py \
    --train ./ \
    --test ./ \
    --model-dir ./ \
    --features 'country description designation price province region_1 region_2 taster_name taster_twitter_handle variety winery log1p_price len_description len_title len_winery tensor_value_0 tensor_value_1 tensor_value_2 tensor_value_3 tensor_value_4 tensor_value_5 tensor_value_6 tensor_value_7 tensor_value_8 tensor_value_9 tensor_value_10 tensor_value_11 tensor_value_12 tensor_value_13 tensor_value_14 tensor_value_15 tensor_value_16 tensor_value_17 tensor_value_18 tensor_value_19 tensor_value_20 tensor_value_21 tensor_value_22 tensor_value_23 tensor_value_24 tensor_value_25 tensor_value_26 tensor_value_27 tensor_value_28 tensor_value_29 tensor_value_30 tensor_value_31 tensor_value_32 tensor_value_33 tensor_value_34 tensor_value_35 tensor_value_36 tensor_value_37 tensor_value_38 tensor_value_39 tensor_value_40 tensor_value_41 tensor_value_42 tensor_value_43 tensor_value_44 tensor_value_45 tensor_value_46 tensor_value_47 tensor_value_48 tensor_value_49 tensor_value_50 tensor_value_51 tensor_value_52 tensor_value_53 tensor_value_54 tensor_value_55 tensor_value_56 tensor_value_57 tensor_value_58 tensor_value_59 tensor_value_60 tensor_value_61 tensor_value_62 tensor_value_63 tensor_value_64 tensor_value_65 tensor_value_66 tensor_value_67 max_tensor sum_tensor count_ents count_ADJ count_is_not_stop contains_ripe contains_red contains_rich contains_fresh contains_soft contains_sweet contains_green contains_simple contains_light' \
    --cat_features 'country description designation province region_1 region_2 taster_name taster_twitter_handle variety winery' \
    --target 'points' \
    --learning_rate 0.1 \
    --depth 4 \
    --l2_leaf_reg 2 

extracting arguments
INFO:root:reading data
INFO:root:building training and testing datasets
INFO:root:rmse: 1.678887888711105


## Remote training in SageMaker

### Option 1: Launch a SageMaker training job from code uploaded to S3

With this option, we first need to send the code to S3. This could also be done automatically by a build system.

In [32]:
import tarfile

In [33]:
# first compress the code and send to S3
program = 'catboost_training_wines.py'
source = 'source.tar.gz'
project = 'catboost'

with tarfile.open(source, 'w:gz') as tar:
    tar.add(program)

submit_dir = sess.upload_data(
    path=source, 
    bucket=bucket,
    key_prefix=project+ '/' + source)

print(submit_dir)

s3://sagemaker-us-east-2-641677763413/catboost/source.tar.gz/source.tar.gz
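
Before uploading, it can be useful to check what actually ended up in the archive. The following sketch builds a `source.tar.gz` in a temporary directory and lists its members. The script name `train.py` and the helper `make_source_archive` are placeholders for illustration.

```python
import os
import tarfile
import tempfile


def make_source_archive(script_path, archive_path):
    """Pack one script at the archive root and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops the directory part so the script sits at the top level.
        tar.add(script_path, arcname=os.path.basename(script_path))
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "train.py")  # placeholder script
    with open(script, "w") as f:
        f.write("print('hello')\n")
    names = make_source_archive(script, os.path.join(tmp, "source.tar.gz"))

print(names)
```

Note also that the S3 URI printed above ends in `source.tar.gz/source.tar.gz`: the `key_prefix` passed to `upload_data` already contains the archive name, and the file name appears to be appended to it.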


We then launch a training job with the `Estimator` class

In [36]:
from sagemaker.estimator import Estimator

In [42]:
86400/60/60

24.0

In [44]:
output_path = 's3://' + bucket + '/' + project + '/' + 'training_jobs'

estimator = Estimator(image_uri=container_image_uri,
                      role=role,
                      max_run=20*60,
                      instance_count=1,
                      instance_type='ml.m5.xlarge',
                      output_path=output_path,
                      use_spot_instances=True,
                      max_wait=20*60,
                      hyperparameters={'sagemaker_program': program,
                                       'sagemaker_submit_directory': submit_dir,
                                       'features': features_str,
                                       'cat_features': cat_features_str,
                                       'target': 'points'})
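
The estimator above expresses `max_run` and `max_wait` in seconds. With spot instances, `max_wait` covers both the time spent waiting for capacity and the training time, so it has to be at least as large as `max_run`. A small sketch of that check follows (the helper name is hypothetical):

```python
def spot_wait_budget(max_run: int, max_wait: int) -> int:
    """Return the seconds left for waiting on spot capacity, or raise if the limits conflict."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return max_wait - max_run


print(spot_wait_budget(max_run=20 * 60, max_wait=20 * 60))
print(86400 / 60 / 60)  # seconds in a day, expressed in hours
```

With both limits set to 20 minutes, as in this notebook, the budget is zero: a job that has to wait for capacity has correspondingly less time left to train.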


In [45]:
%%time
estimator.fit({'train':train_location, 'test': test_location}, logs=True)

2021-06-25 02:16:45 Starting - Starting the training job...
2021-06-25 02:17:08 Starting - Launching requested ML instances
ProfilerReport-1624587405: InProgress
...
2021-06-25 02:17:36 Starting - Preparing the instances for training.........
2021-06-25 02:19:13 Downloading - Downloading input data
2021-06-25 02:19:13 Training - Downloading the training image......
2021-06-25 02:20:09 Training - Training image download completed. Training in progress.
2021-06-25 02:20:04,301 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-25 02:20:04,301 sagemaker-training-toolkit INFO     Failed to parse hyperparameter features value country description designation price province region_1 region_2 taster_name taster_twitter_handle variety winery log1p_price len_description len_title len_winery tensor_value_0 tensor_value_1 tensor_value_2 tensor_value_3 tensor_value_4 tensor_value_5 tensor_value_6 tensor_value_7 tensor_value_8 tensor_value_9 tensor

In [46]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

In [47]:
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.01, 0.1, scaling_type="Logarithmic"),
    "depth": IntegerParameter(4, 10),
    "l2_leaf_reg": IntegerParameter(1, 9),
}
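
The `learning_rate` range above uses `scaling_type="Logarithmic"`. As an illustration of what searching on a log scale means, the sketch below draws log-uniform samples with the standard library. It is not how SageMaker's Bayesian strategy picks values; the sampler and seed are assumptions.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 0.1, rng) for _ in range(1000)]

# On a log scale, the midpoint of the range is the geometric mean (about 0.0316).
geometric_mid = math.sqrt(0.01 * 0.1)
share_below_mid = sum(s < geometric_mid for s in samples) / len(samples)
print(round(share_below_mid, 2))
```

Roughly half of the samples fall below 0.0316, whereas a linear scale would place only about a quarter of them there. That is why log scaling suits parameters spanning an order of magnitude.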

In [48]:
objective_metric_name = "rmse"
metric_definitions = [{"Name": "rmse", "Regex": "rmse: ([0-9\\.]+)"}]
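
The regex in `metric_definitions` is applied to the training job's log output, so it has to match what the script logs (`'rmse' + ': ' + value`). A quick local check with Python's `re` module against the log line from the local test run:

```python
import re

metric_regex = "rmse: ([0-9\\.]+)"

# Line logged by the local test run earlier in this notebook
log_line = "INFO:root:rmse: 1.678887888711105"

match = re.search(metric_regex, log_line)
rmse = float(match.group(1))
print(rmse)

# Lines without the metric produce no match.
no_match = re.search(metric_regex, "INFO:root:reading data")
print(no_match)
```

The pattern captures only digits and dots, so a value logged in scientific notation (for example `1e-05`) would be captured only up to the `e`.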

In [49]:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    strategy='Bayesian',
    objective_type="Minimize",
    max_jobs=50,
    max_parallel_jobs=10,
)

In [50]:
%%time
tuner.fit({'train':train_location, 'test': test_location},logs=True)

..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................!
CPU times: user 2.87 s, sys: 194 ms, total: 3.06 s
Wall time: 52min
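
As a rough sanity check on the wall time above, the sketch below computes a lower bound for a tuning job under a simplifying assumption: every training job takes the same time, and jobs run in full rounds of `max_parallel_jobs`. The 370-second job duration is an assumed round figure, and the bound ignores instance provisioning and waiting for spot capacity.

```python
import math


def min_tuning_wall_time(max_jobs: int, max_parallel_jobs: int, seconds_per_job: float) -> float:
    """Lower bound on wall time if all jobs take equally long and run in full rounds."""
    rounds = math.ceil(max_jobs / max_parallel_jobs)
    return rounds * seconds_per_job


lower_bound = min_tuning_wall_time(max_jobs=50, max_parallel_jobs=10, seconds_per_job=370)
print(round(lower_bound / 60, 1), "minutes")
```

The gap between this bound and the observed wall time would then be overhead outside the training itself.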


In [52]:
sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe().sort_values(['FinalObjectiveValue'])

Unnamed: 0,depth,l2_leaf_reg,learning_rate,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
6,6.0,1.0,0.081272,catboost-image-210625-0228-044-ab9e4f3e,Completed,1.665864,2021-06-25 03:07:02+00:00,2021-06-25 03:13:13+00:00,371.0
4,6.0,1.0,0.078587,catboost-image-210625-0228-046-2a60f852,Completed,1.666028,2021-06-25 03:08:39+00:00,2021-06-25 03:14:56+00:00,377.0
5,6.0,1.0,0.076896,catboost-image-210625-0228-045-d89a24d7,Completed,1.666336,2021-06-25 03:07:24+00:00,2021-06-25 03:13:34+00:00,370.0
1,6.0,1.0,0.086404,catboost-image-210625-0228-049-5f472ca4,Completed,1.666691,2021-06-25 03:13:01+00:00,2021-06-25 03:19:06+00:00,365.0
10,6.0,1.0,0.093777,catboost-image-210625-0228-040-dd3dae91,Completed,1.667073,2021-06-25 03:04:46+00:00,2021-06-25 03:10:59+00:00,373.0
7,6.0,1.0,0.079422,catboost-image-210625-0228-043-627e04cd,Completed,1.667274,2021-06-25 03:06:57+00:00,2021-06-25 03:13:09+00:00,372.0
2,6.0,1.0,0.08104,catboost-image-210625-0228-048-343f064d,Completed,1.66778,2021-06-25 03:12:09+00:00,2021-06-25 03:18:20+00:00,371.0
13,6.0,1.0,0.1,catboost-image-210625-0228-037-22bf6018,Completed,1.668052,2021-06-25 03:00:09+00:00,2021-06-25 03:06:14+00:00,365.0
3,6.0,1.0,0.078687,catboost-image-210625-0228-047-b679b8af,Completed,1.668223,2021-06-25 03:10:29+00:00,2021-06-25 03:16:44+00:00,375.0
0,6.0,1.0,0.082928,catboost-image-210625-0228-050-b3583846,Completed,1.66829,2021-06-25 03:13:28+00:00,2021-06-25 03:19:39+00:00,371.0


In [None]:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    strategy='Bayesian',
    objective_type="Minimize",
    max_jobs=30,
    max_parallel_jobs=4,
)

In [None]:
%%time
tuner.fit({'train':train_location, 'test': test_location},logs=True)

......................................................................................................................................................................................................................

In [207]:
sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe().sort_values('FinalObjectiveValue').head()

Unnamed: 0,depth,l2_leaf_reg,learning_rate,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
3,8.0,1.0,0.070562,catboost-image-210623-2325-027-b0f2f1ab,Completed,1.77942,2021-06-24 00:31:28+00:00,2021-06-24 00:38:30+00:00,422.0
10,8.0,1.0,0.061068,catboost-image-210623-2325-020-fdf6b766,Completed,1.77952,2021-06-24 00:13:34+00:00,2021-06-24 00:20:18+00:00,404.0
5,8.0,1.0,0.068164,catboost-image-210623-2325-025-8c340d04,Completed,1.779521,2021-06-24 00:31:55+00:00,2021-06-24 00:38:48+00:00,413.0
2,8.0,2.0,0.070007,catboost-image-210623-2325-028-5023caf3,Completed,1.780036,2021-06-24 00:32:24+00:00,2021-06-24 00:39:26+00:00,422.0
4,8.0,1.0,0.068335,catboost-image-210623-2325-026-fcf84fc4,Completed,1.780158,2021-06-24 00:31:04+00:00,2021-06-24 00:37:57+00:00,413.0


In [210]:
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    strategy='Bayesian',
    objective_type="Minimize",
    max_jobs=30,
    max_parallel_jobs=10,
)

In [211]:
%%time
tuner.fit({'train':train_location, 'test': test_location},logs=True)

..........................................................................................................................................................................................................................................................................................................................................................................................!
CPU times: user 1.79 s, sys: 114 ms, total: 1.91 s
Wall time: 31min 55s


In [15]:
!python bb3_test_apigw.py --apigw_url https://tis633974c.execute-api.us-east-2.amazonaws.com/prod/ --api_key TopSecret_BlackBelt_2021_ApiKey

Validating API Gateway:
endpoint=`https://tis633974c.execute-api.us-east-2.amazonaws.com/prod/`
api_key=`TopSecret_BlackBelt_2021_ApiKey`

Sending simple payload:
('{"input": ["US", "Hailing from Underwood Mountain, ...", "Reminiscence", '
 '18.0, "Washington", "Columbia Gorge (WA)", "Washington Other", "Sean P. '
 'Sullivan", "@wawinereport", "Ancestry 2012 Reminiscence ...", "Riesling", '
 '"Ancestry"]}')

Response code: 200
Response body:
{'output': 'Congratulations, everything is fine here! This output should contain the '
           'inference result of your model'}

Congratulations! It looks like you just deployed the default API Gateway with API keys successfully for the AWS Black Belt 3.0 competition!
Now:
1. develop a simple model and put it behind the API Gateway (can be in a SageMaker endpoint, Lambda, Fargate, whatever you want!)
2. run again this script to test
3. if everything goes successfully again, start improving model performance, reducing inference latency, and the cost of infer
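
The internals of `bb3_test_apigw.py` are not shown, so the following is only a sketch of how a comparable request could be built with the standard library. It assumes the API Gateway stage expects the API key in the `x-api-key` header and a JSON body of the form `{"input": [...]}`, as in the payload printed above. The helper name `build_request`, the URL, and the key are placeholders.

```python
import json
import urllib.request


def build_request(apigw_url, api_key, features):
    """Build (but do not send) a POST request carrying {"input": [...]}."""
    body = json.dumps({"input": features}).encode("utf-8")
    return urllib.request.Request(
        apigw_url,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Placeholder URL and key: substitute the values of your own deployment.
request = build_request(
    "https://example.execute-api.us-east-2.amazonaws.com/prod/",
    "my-api-key",
    ["US", "Hailing from Underwood Mountain, ...", "Reminiscence", 18.0],
)

# To actually call the endpoint:
# with urllib.request.urlopen(request) as response:
#     print(response.status, json.loads(response.read()))
```

Building the request separately from sending it makes it possible to inspect the headers and body without network access.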

In [213]:
sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe().sort_values('FinalObjectiveValue').head()

Unnamed: 0,depth,l2_leaf_reg,learning_rate,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
13,7.0,1.0,0.095614,catboost-image-210624-0141-017-e9f8ed26,Completed,1.77974,2021-06-24 01:52:42+00:00,2021-06-24 01:58:44+00:00,362.0
5,7.0,1.0,0.09724,catboost-image-210624-0141-025-1b03dae5,Completed,1.779804,2021-06-24 02:00:50+00:00,2021-06-24 02:07:19+00:00,389.0
1,7.0,1.0,0.073689,catboost-image-210624-0141-029-2b867a62,Completed,1.77988,2021-06-24 02:05:04+00:00,2021-06-24 02:11:22+00:00,378.0
6,7.0,1.0,0.084732,catboost-image-210624-0141-024-d7a1cf36,Completed,1.78133,2021-06-24 02:01:15+00:00,2021-06-24 02:07:47+00:00,392.0
3,7.0,1.0,0.0949,catboost-image-210624-0141-027-f60c9659,Completed,1.781364,2021-06-24 02:00:53+00:00,2021-06-24 02:07:03+00:00,370.0
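
Because the tuner was configured with `objective_type="Minimize"`, the best job is the one with the smallest `FinalObjectiveValue`. As a minimal sketch using only the standard library, the snippet below picks that row from a few rows copied from the table above and formats its hyperparameters as flags for `catboost_training_wines_v0.py`. The variable names are illustrative.

```python
import csv
from io import StringIO

# Three rows copied from the tuning results table above (subset of columns).
results_csv = """depth,l2_leaf_reg,learning_rate,TrainingJobName,FinalObjectiveValue
7.0,1.0,0.095614,catboost-image-210624-0141-017-e9f8ed26,1.77974
7.0,1.0,0.09724,catboost-image-210624-0141-025-1b03dae5,1.779804
7.0,1.0,0.073689,catboost-image-210624-0141-029-2b867a62,1.77988
"""

rows = list(csv.DictReader(StringIO(results_csv)))

# The objective is minimized, so the best job has the smallest value.
best = min(rows, key=lambda row: float(row["FinalObjectiveValue"]))

# Turn the winning hyperparameters into command-line flags.
flags = (
    f"--learning_rate {best['learning_rate']} "
    f"--depth {int(float(best['depth']))} "
    f"--l2_leaf_reg {int(float(best['l2_leaf_reg']))}"
)
print(best["TrainingJobName"])
print(flags)
```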


In [215]:
# local test
! python catboost_training_wines_v0.py \
    --train ./ \
    --test ./ \
    --model-dir ./ \
    --features 'country description designation price province region_1 region_2 taster_name taster_twitter_handle title variety winery log1p_price len_description' \
    --cat_features 'country description designation province region_1 region_2 taster_name taster_twitter_handle title variety winery' \
    --target 'points' \
    --learning_rate 0.070562 \
    --depth 8 \
    --l2_leaf_reg 1 

extracting arguments
INFO:root:reading data
INFO:root:building training and testing datasets
INFO:root:validating model
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:root:explained_variance_score : [0.6598887276255376]
INFO:root:max_error : [10.0]
INFO:root:mean_absolute_error : [1.3378812747914102]
INFO:root:root_mean_squared_error : [1.764412484561082]
INFO:root:mean_squared_error : [3.11315141567501]
INFO:root:mean_squared_log_error : [0.0003917503139340069]
INFO:root:median_absolute_error : [1.0]
INFO:root:r2_score : [0.6598817922827849]
INFO:root:rmse: 1.764412484561082
INFO:root:AE-at-10th-percentile: 0.20159649435218657
INFO:root:AE-at-50th-percentile: 1.1097220587133165
INFO:root:AE-at-90th-percentile: 2.8573572417264996
INFO:root:saving to ./catboost_model.dump
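
The log above reports `rmse` together with absolute-error (AE) percentiles. As a rough illustration of what these numbers summarize, the sketch below computes both with the standard library. The function names and the linear-interpolation percentile are assumptions; the training script's own implementation is not shown here and may differ (for example, by relying on NumPy or scikit-learn). The data is a toy example, not the wine dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Toy targets and predictions.
y_true = [88.0, 90.0, 92.0, 85.0, 87.0]
y_pred = [87.0, 91.0, 90.0, 85.0, 88.0]
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

print("rmse:", rmse(y_true, y_pred))
for q in (10, 50, 90):
    print(f"AE-at-{q}th-percentile:", percentile(abs_errors, q))

# Consistency check on the logged values: rmse should equal sqrt(mean_squared_error).
logged_mse, logged_rmse = 3.11315141567501, 1.764412484561082
print(math.isclose(math.sqrt(logged_mse), logged_rmse, rel_tol=1e-9))
```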


In [216]:
# local test
! python catboost_training_wines_v0.py \
    --train ./ \
    --test ./ \
    --model-dir ./ \
    --features 'country description designation price province region_1 region_2 taster_name taster_twitter_handle title variety winery log1p_price len_description' \
    --cat_features 'country description designation province region_1 region_2 taster_name taster_twitter_handle title variety winery' \
    --target 'points' \
    --learning_rate 0.095614 \
    --depth 7 \
    --l2_leaf_reg 1 

extracting arguments
INFO:root:reading data
INFO:root:building training and testing datasets
INFO:root:validating model
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:root:explained_variance_score : [0.6565277677054229]
INFO:root:max_error : [10.0]
INFO:root:mean_absolute_error : [1.345882916153741]
INFO:root:root_mean_squared_error : [1.7731315093566513]
INFO:root:mean_squared_error : [3.1439953494733963]
INFO:root:mean_squared_log_error : [0.00039565025294792226]
INFO:root:median_absolute_error : [1.0]
INFO:root:r2_score : [0.6565120289524071]
INFO:root:rmse: 1.7731315093566513
INFO:root:AE-at-10th-percentile: 0.2052530606654187
INFO:root:AE-at-50th-percentile: 1.1198794754509507
INFO:root:AE-at-90th-percentile: 2.8591277268137305
INFO:root:saving to ./catboost_model.dump


In [14]:
# local test with depth=10.0, l2_leaf_reg=4.0, learning_rate=0.072544
! python catboost_training_wines_v0.py \
    --train ./ \
    --test ./ \
    --model-dir ./ \
    --features 'country description designation price province region_1 region_2 taster_name taster_twitter_handle title variety winery log1p_price len_description' \
    --cat_features 'country description designation province region_1 region_2 taster_name taster_twitter_handle title variety winery' \
    --target 'points' \
    --learning_rate 0.072544 \
    --depth 10 \
    --l2_leaf_reg 4 

extracting arguments
INFO:root:reading data
INFO:root:building training and testing datasets
INFO:root:validating model
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:root:explained_variance_score : [0.6604153563348261]
INFO:root:max_error : [10.0]
INFO:root:mean_absolute_error : [1.3364450827520176]
INFO:root:root_mean_squared_error : [1.7630359380284681]
INFO:root:mean_squared_error : [3.1082957187799205]
INFO:root:mean_squared_log_error : [0.00039121675108169803]
INFO:root:median_absolute_error : [1.0]
INFO:root:r2_score : [0.6604122871751503]
INFO:root:rmse: 1.7630359380284681
INFO:root:AE-at-10th-percentile: 0.20611983719317095
INFO:root:AE-at-50th-percentile: 1.1079707267312102
INFO:root:AE-at-90th-percentile: 2.8580109338098043
INFO:root:saving to ./catboost_model.dump


In [55]:
import pandas as pd 
from io import StringIO


In [63]:
pd.DataFrame.from_dict(
    {
        "input": [
            'US',  # country
            'Hailing from Underwood Mountain, ...',  # description
            'Reminiscence',  # designation
            18.0,  # price
            'Washington',  # province
            'Columbia Gorge (WA)',  # region_1
            'Washington Other',  # region_2
            'Sean P. Sullivan',  # taster_name
            '@wawinereport',  # taster_twitter_handle
            'Ancestry 2012 Reminiscence ...',  # title
            'Riesling',  # variety
            'Ancestry',  # winery
        ]
    }
)  # .to_json('input_test.json')

Unnamed: 0,input
0,US
1,"Hailing from Underwood Mountain, ..."
2,Reminiscence
3,18
4,Washington
5,Columbia Gorge (WA)
6,Washington Other
7,Sean P. Sullivan
8,@wawinereport
9,Ancestry 2012 Reminiscence ...


In [64]:
pd.read_json('input_test.json').sort_index()

Unnamed: 0,input
0,US
1,"Hailing from Underwood Mountain, ..."
2,Reminiscence
3,18
4,Washington
5,Columbia Gorge (WA)
6,Washington Other
7,Sean P. Sullivan
8,@wawinereport
9,Ancestry 2012 Reminiscence ...
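
The payload carries the features as a positional list, so an inference handler has to map each position back to a column name before calling the model. The sketch below does this with the standard library, using the column order from the comments in the payload cell. The helper `payload_to_record` is hypothetical, and the definitions of `log1p_price` and `len_description` (both named in the `--features` flag) are assumptions inferred from their names.

```python
import json
import math

# Column order taken from the comments in the payload cell above.
FEATURE_NAMES = [
    "country", "description", "designation", "price", "province",
    "region_1", "region_2", "taster_name", "taster_twitter_handle",
    "title", "variety", "winery",
]


def payload_to_record(payload_json):
    """Map the positional "input" list onto named features."""
    values = json.loads(payload_json)["input"]
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(
            f"expected {len(FEATURE_NAMES)} values, got {len(values)}"
        )
    record = dict(zip(FEATURE_NAMES, values))
    # Assumed definitions of the two derived features.
    record["log1p_price"] = math.log1p(record["price"])
    record["len_description"] = len(record["description"])
    return record


payload = json.dumps({"input": [
    "US", "Hailing from Underwood Mountain, ...", "Reminiscence", 18.0,
    "Washington", "Columbia Gorge (WA)", "Washington Other",
    "Sean P. Sullivan", "@wawinereport", "Ancestry 2012 Reminiscence ...",
    "Riesling", "Ancestry",
]})
record = payload_to_record(payload)
print(record["variety"], record["price"])
```

Validating the list length up front gives a clear error when a client sends a payload with missing or extra fields, instead of silently misaligning columns.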


In [65]:
import json