# Amazon SageMaker XGBoost model applied to NYC TLC trip data

In this example, we train a model using the [Amazon SageMaker XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to predict NYC TLC trip duration in minutes from a feature vector that includes the pickup zone, destination zone, month, day of the week, and hour of the day for the ride.

The data used to train the XGBoost model was prepared on a PySpark cluster, as described in this [companion notebook](emr-pyspark-nyc-tlc.ipynb), and saved in an S3 bucket. If you have not executed the companion notebook, please do that before executing this notebook.

### Get Amazon SageMaker Execution Role

This notebook requires a Python 3 kernel with SageMaker support, for example the `conda_python3` kernel.

The first step in using Amazon SageMaker is to obtain an execution role that encapsulates the permissions Amazon SageMaker uses to access other AWS services.

In [None]:
from sagemaker import get_execution_role
import boto3

# Get the SageMaker execution role attached to this notebook
role = get_execution_role()
print(f'Amazon SageMaker execution role: {role}')

### Get XGBoost Amazon ECR URI
Every Amazon SageMaker model training job requires an [Amazon ECR](https://aws.amazon.com/ecr/) image that provides an [Amazon SageMaker training container](https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-containers.html). So, next, we specify the URI of the Amazon ECR Docker image for the Amazon SageMaker XGBoost algorithm.

In [None]:
# get the URI of the container image for the XGBoost algorithm
from sagemaker.amazon.amazon_estimator import get_image_uri
xgboost_image = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
print(f'Amazon SageMaker XGBoost Algorithm ECR Image URI: {xgboost_image}')

### Split training data into training, validation and test

At the end of the data preparation step executed in the [companion notebook](emr-pyspark-nyc-tlc.ipynb), multiple CSV files are saved in your S3 bucket. Each CSV file contains columns for trip duration in minutes, origin zone, destination zone, pickup month, ordinal day of the week, and hour of day.

Set `source_bucket` to the S3 bucket where you saved the prepared data in the [companion notebook](emr-pyspark-nyc-tlc.ipynb). Set `dest_bucket` to the S3 bucket where you want to upload the split training, validation and test datasets. The `source_bucket` and `dest_bucket` can be the same S3 bucket.

In [None]:
# source bucket with CSV files produced by Amazon EMR spark cluster data preparation
source_bucket=
source_prefix='emr/output/uber_nyc/v1'

# destination bucket to upload SageMaker training input data files
# can be the same as source_bucket
dest_bucket=source_bucket
dest_prefix = 'sagemaker/input/uber_nyc/v1'

We split the input data into <b>training, validation and test</b> datasets for SageMaker XGBoost model training. The split is done per file: we download the input CSV data files from `source_bucket` and reserve 80% of the files for training, 15% for validation and 5% for test. We then upload the files under the `train`, `validation` and `test` prefixes in the `dest_bucket` S3 bucket to stage the data for SageMaker XGBoost model training.

In [None]:
import tempfile
import csv

s3 = boto3.client('s3')

response=s3.list_objects_v2(Bucket=source_bucket, Prefix=source_prefix)
contents=response['Contents']
count=len(contents)

sbucket = boto3.resource('s3').Bucket(source_bucket)
ntrain=int(count*0.80)
nval = int(count*0.15)
ntest = count - ntrain - nval

def stage_data(start, end, name):  
    for i in range(start, end, 1):
        item=contents[i]
        key =item['Key']    
        if not key.lower().endswith(".csv"):
            continue
        with tempfile.NamedTemporaryFile(mode='w+b', suffix='.csv', prefix='data-', delete=True) as csv_file:
            print(f'Download {key} file')
            sbucket.download_fileobj(key, csv_file)
            # rewind (this also flushes buffered writes) so the upload reads from the start
            csv_file.seek(0)

            dest_key = f'{dest_prefix}/{name}/part-{i}.csv'
            print(f'upload {dest_key} file')
            # the temporary file is closed and deleted when the with block exits
            s3.upload_fileobj(csv_file, dest_bucket, dest_key)
            
stage_data(0, ntrain, 'train')
stage_data(ntrain, ntrain+nval, 'validation')
stage_data(ntrain+nval, count, 'test')
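
To make the split arithmetic concrete, here is a minimal sketch of how the file counts fall out for a given number of input files. The helper name `split_counts` and the file count of 22 are illustrative assumptions, not part of the notebook; the fractions mirror the 80/15/5 split used above.

```python
def split_counts(count, train_frac=0.80, val_frac=0.15):
    """Return (ntrain, nval, ntest) file counts, mirroring the split above."""
    ntrain = int(count * train_frac)  # truncated toward zero
    nval = int(count * val_frac)      # truncated toward zero
    ntest = count - ntrain - nval     # whatever remains goes to the test set
    return ntrain, nval, ntest

# Hypothetical example: 22 objects listed under the source prefix
ntrain, nval, ntest = split_counts(22)
print(ntrain, nval, ntest)
```

Because the counts are truncated with `int`, the test set absorbs the remainder and can end up larger than 5% for small file counts. Note also that `count` covers every object under the prefix, so non-CSV objects that are skipped during staging still occupy an index.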

### Create Amazon SageMaker data input channels

Next, we define <b>train and validation</b> input channels for Amazon SageMaker XGBoost model training.

In [None]:
from sagemaker import s3_input

s3_train = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/train', content_type='csv')
s3_validation = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/validation', content_type='csv')

output_path=f's3://{dest_bucket}/sagemaker/output/uber_nyc/xgboost'
data_channels = {'train': s3_train, 'validation': s3_validation}

### Create Amazon SageMaker XGBoost Estimator

Next, we define a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/estimators.html) for training with the [Amazon SageMaker XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). In this example, we use one training instance of type `ml.c5.9xlarge`.

 

In [None]:
from sagemaker.estimator import Estimator
from sagemaker import Session

sagemaker_session = Session()

xgb = Estimator(image_name=xgboost_image,
                            role=role, 
                            train_instance_count=1, 
                            train_instance_type='ml.c5.9xlarge',
                            output_path=output_path,
                            sagemaker_session=sagemaker_session)


### Specify XGBoost Hyper-parameters
Next, we specify the required [XGBoost hyper-parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) that we know at this point. Two other XGBoost hyper-parameters, `num_round` and `max_depth`, will be discovered through hyper-parameter tuning.

In [None]:
xgb.set_hyperparameters(objective='reg:linear',
                       grow_policy='depthwise',
                       tree_method = 'approx')


### Create Amazon SageMaker XGBoost Hyper-parameter Tuner ###

Before we train the XGBoost model, we need to discover good values for the `num_round` and `max_depth` hyper-parameters using a hyper-parameter tuner.

The hyper-parameter tuner expects ranges for the hyper-parameters that need to be tuned, so we define those ranges next.

In [None]:
from sagemaker.tuner import IntegerParameter
from sagemaker.tuner import HyperparameterTuner
from sagemaker.tuner import CategoricalParameter

objective_metric_name = "validation:rmse"

num_round = IntegerParameter(50,200)
max_depth = IntegerParameter(8,32)
hyperparameter_ranges={}
hyperparameter_ranges['num_round'] = num_round
hyperparameter_ranges['max_depth'] = max_depth

We define a [hyper-parameter tuner](https://sagemaker.readthedocs.io/en/stable/tuner.html) that uses Bayesian search to minimize the validation <b>Root Mean Square Error (RMSE)</b> objective. We limit the total number of training jobs, `max_jobs`, to 15 and `max_parallel_jobs` to 1.

The hyper-parameter tuning job runs up to `max_parallel_jobs` training jobs at a time, each with different values for the `num_round` and `max_depth` hyper-parameters. The objective values obtained from completed training jobs are used by the Bayesian search strategy to choose new values for `num_round` and `max_depth` to explore. The total number of training jobs attempted is limited by `max_jobs`.

In [None]:
hyperparameter_tuner=HyperparameterTuner(xgb, 
                                         objective_metric_name, 
                                         hyperparameter_ranges, 
                                         strategy='Bayesian', 
                                         objective_type='Minimize', 
                                         max_jobs=15, 
                                         max_parallel_jobs=1, 
                                         base_tuning_job_name='xgboost-tuning')
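
For reference, the `validation:rmse` objective that the tuner minimizes is the root mean square error between expected and predicted values on the validation channel. The following is a minimal sketch of that metric in plain Python; the function name `rmse` and the sample trip durations are illustrative assumptions, and this is not the code the training container runs.

```python
import math

def rmse(expected, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(expected) != len(predicted) or len(expected) == 0:
        raise ValueError('expected and predicted must be non-empty and the same length')
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical trip durations in minutes
print(rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```

Since the target is trip duration in minutes, the RMSE is also expressed in minutes, and larger errors are penalized more heavily than smaller ones.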

### Start Hyper-parameter Tuning ###
Next, we start the hyper-parameter tuning job, which is launched <b>asynchronously</b>. You can use the Amazon SageMaker console to monitor hyper-parameter tuning jobs. The duration of this asynchronous step depends on the `max_jobs` and `max_parallel_jobs` values specified above.

In [None]:
hyperparameter_tuner.fit(inputs=data_channels, logs=True)

### Use Best Hyper-parameters

Next, we use the best hyper-parameter values reported in the Amazon SageMaker console to set the `num_round` and `max_depth` hyper-parameters for the training estimator. Below is an example. You may see different best values for `num_round` and `max_depth`.

In [None]:
xgb.set_hyperparameters(objective='reg:linear',
                        grow_policy='depthwise',
                        tree_method = 'approx',
                        num_round=166, 
                        max_depth=9)
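
As an alternative to copying the values from the console, the best values can be read programmatically. The sketch below assumes the boto3 SageMaker client's `describe_hyper_parameter_tuning_job` call and its `BestTrainingJob` / `TunedHyperParameters` response fields; the helper names and the sample response fragment are hypothetical and only illustrate the parsing step.

```python
def parse_best_hyperparameters(response):
    """Extract num_round and max_depth as integers from a tuning job description."""
    tuned = response['BestTrainingJob']['TunedHyperParameters']
    # tuned values come back as strings, so convert them before reuse
    return {name: int(float(tuned[name])) for name in ('num_round', 'max_depth')}

def fetch_best_hyperparameters(tuning_job_name):
    """Look up the best training job of a completed tuning job (requires AWS access)."""
    import boto3  # imported here so the parsing helper has no AWS dependency
    client = boto3.client('sagemaker')
    response = client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name)
    return parse_best_hyperparameters(response)

# Hypothetical response fragment, using the example values from this notebook
sample = {'BestTrainingJob': {'TunedHyperParameters': {'num_round': '166', 'max_depth': '9'}}}
best = parse_best_hyperparameters(sample)
print(best)
```

The resulting dictionary could then be merged into the `set_hyperparameters` call, for example with `**best`, once the tuning job has completed.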

### Start Training Job ###
After specifying best values for `num_round` and `max_depth` hyper-parameters, we are ready to start the training job.

In [None]:
xgb.fit(inputs=data_channels, logs=True)

### Deploy Model to Endpoint ###

Once training is complete, we are ready to deploy the trained model to an Amazon SageMaker model endpoint. We use one instance of type `ml.m5.xlarge` to serve the model from a REST service endpoint. This step can take several minutes, so please be patient.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1, 
               instance_type='ml.m5.xlarge')

### Make Predictions for Test Data ###
Below we import the serializer and deserializer used to make predictions with test data. They serialize the request data sent to the SageMaker endpoint and deserialize its response.

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = json_deserializer
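
With the `text/csv` content type, each request body is a single comma-separated row of feature values without the label column. The following is a minimal sketch of that payload shape, not the SDK's serializer implementation; the helper name `to_csv_row`, the sample values, and the assumed feature order (origin zone, destination zone, month, day of week, hour) are illustrative.

```python
def to_csv_row(features):
    """Join feature values into one comma-separated line, without a label column."""
    return ','.join(str(value) for value in features)

# Hypothetical feature vector: origin zone, destination zone, month, day of week, hour
payload = to_csv_row([132, 48, 7, 3, 18])
print(payload)
```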

Below we download the test data, submit it to the deployed Amazon SageMaker endpoint to make predictions, and compare the predictions to the expected output.

In [None]:
import numpy as np
import tempfile

s3 = boto3.client('s3')
bucket = boto3.resource('s3').Bucket(dest_bucket)
with tempfile.NamedTemporaryFile(mode='w+b', suffix='.csv', prefix='data-', delete=True) as test_csv:
    # adjust the part number to match a file staged under the test prefix
    key=f'{dest_prefix}/test/part-21.csv'
    print(f'download: {key}')
    bucket.download_fileobj(key, test_csv)
    # flush buffered writes so the file can be read back by name
    test_csv.flush()
    print("read test csv file")
    array = np.genfromtxt(test_csv.name, delimiter=',', skip_header=0)
    np.random.shuffle(array)
    # predict on at most 100 randomly chosen rows; column 0 is the expected trip duration
    for i in range(min(100, len(array))):
        print(f'test input: {array[i, 1:]}')
        result = xgb_predictor.predict(array[i, 1:])
        print(f'predicted: {result}')
        print(f'expected: {array[i,0]}')
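
Printing each prediction next to its expected value is useful for spot checks, but a single summary number is easier to compare across runs. The following is a minimal sketch that computes the mean absolute error in minutes; the function name and the sample values are illustrative assumptions, and in practice the two lists would be collected inside the prediction loop.

```python
def mean_absolute_error(expected, predicted):
    """Average absolute difference between expected and predicted values."""
    if len(expected) != len(predicted) or len(expected) == 0:
        raise ValueError('expected and predicted must be non-empty and the same length')
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)

# Hypothetical expected and predicted trip durations in minutes
print(mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```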