# Amazon SageMaker Experiments applied to NYC TLC trip data

In this example, we use SageMaker Experiments to track the training of a model with the [Amazon SageMaker XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). The model predicts NYC TLC trip duration in minutes from a feature vector that includes the ride's pickup zone, destination zone, month, day of the week and hour of the day.

The data used to train the XGBoost model was prepared on a PySpark cluster, as described in this [companion notebook](emr-pyspark-nyc-tlc.ipynb), and saved in an S3 bucket. If you have not executed the companion notebook, please do that before executing this notebook.

### Get Amazon SageMaker Execution Role

This notebook requires a Python 3 kernel with SageMaker support, for example `conda_python3` kernel.

The first step in using Amazon SageMaker is to obtain an execution role that encapsulates the permissions Amazon SageMaker uses to access other AWS services.

In [None]:
from sagemaker import get_execution_role
import boto3

# Get the SageMaker execution role attached to this notebook
role = get_execution_role()
print(f'Amazon SageMaker execution role: {role}')

### Get XGBoost Amazon ECR URI
Every Amazon SageMaker model training requires an [Amazon ECR](https://aws.amazon.com/ecr/) image that provides an [Amazon SageMaker training container](https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-containers.html). So, next, we specify the URI for the Amazon ECR docker image for Amazon SageMaker XGBoost algorithm.

In [None]:
# get the URI of the container image for the XGBoost algorithm
from sagemaker.amazon.amazon_estimator import get_image_uri
xgboost_image = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
print(f'Amazon SageMaker XGBoost Algorithm ECR Image URI: {xgboost_image}')

### Split training data into training, validation and test

At the end of the data preparation step executed in the [companion notebook](emr-pyspark-nyc-tlc.ipynb), multiple CSV files are saved in your S3 bucket. Each CSV file contains columns for trip duration in minutes, origin zone, destination zone, pickup month, ordinal day of the week, and hour of day.

Set `source_bucket` to the S3 bucket where you saved the prepared data in the [companion notebook](emr-pyspark-nyc-tlc.ipynb). Set `dest_bucket` to the S3 bucket where you want to upload the split training, validation and test datasets. The `source_bucket` and `dest_bucket` can be the same S3 bucket.

In [None]:
# source bucket with CSV files produced by the Amazon EMR Spark cluster data preparation
source_bucket=
source_prefix='emr/output/uber_nyc/v1'

# destination bucket to upload SageMaker training input data files
# can be the same as source_bucket
dest_bucket=source_bucket
dest_prefix = 'sagemaker/input/uber_nyc/v1'

We split input data into <b>training, validation and test</b> datasets for SageMaker XGBoost model training. We split the data by downloading the input CSV data files from `source_bucket` and separating the downloaded files into three folders. We reserve 80% of the downloaded files for training, 15% for validation and 5% for test, and upload the separated data files using `training`, `validation` and `test` prefixes into the `dest_bucket` S3 bucket to stage the data for SageMaker XGBoost model training. 

In [None]:
import tempfile
import csv

s3 = boto3.client('s3')

response=s3.list_objects_v2(Bucket=source_bucket, Prefix=source_prefix)
contents=response['Contents']
count=len(contents)

sbucket = boto3.resource('s3').Bucket(source_bucket)
ntrain=int(count*0.80)
nval = int(count*0.15)
ntest = count - ntrain - nval

def stage_data(start, end, name):  
    for i in range(start, end, 1):
        item=contents[i]
        key =item['Key']    
        if not key.lower().endswith(".csv"):
            continue
        with tempfile.NamedTemporaryFile(mode='w+b', suffix='.csv', prefix='data-', delete=True) as csv_file:
            print(f'Download {key} file')
            sbucket.download_fileobj(key, csv_file)
            # flush buffered bytes to disk before reopening the file by name
            csv_file.flush()

            with open(csv_file.name, 'rb') as data_reader:
                dest_key = f'{dest_prefix}/{name}/part-{i}.csv'
                print(f'upload {dest_key} file')
                s3.upload_fileobj(data_reader, dest_bucket, dest_key)
            
stage_data(0, ntrain, 'train')
stage_data(ntrain, ntrain+nval, 'validation')
stage_data(ntrain+nval, count, 'test')
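
Two details of the split described above are easy to miss: a single `list_objects_v2` call returns at most 1,000 keys, and the listing can include non-CSV objects that still count toward the 80/15/5 arithmetic. The following is a minimal sketch, not part of the original notebook, of how the CSV keys could be collected with a paginator and how the index ranges fall out of a given file count. The helper names `list_csv_keys` and `split_ranges` are hypothetical, and the S3 client is passed in as an argument.

```python
def list_csv_keys(s3_client, bucket, prefix):
    """Return every key under the prefix that ends in .csv, following pagination."""
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page that holds no objects
        for item in page.get('Contents', []):
            if item['Key'].lower().endswith('.csv'):
                keys.append(item['Key'])
    return keys


def split_ranges(count, train_frac=0.80, val_frac=0.15):
    """Return (start, end) index ranges for train, validation and test."""
    ntrain = int(count * train_frac)
    nval = int(count * val_frac)
    # the test split takes whatever remains after train and validation
    return {
        'train': (0, ntrain),
        'validation': (ntrain, ntrain + nval),
        'test': (ntrain + nval, count),
    }
```

With `keys = list_csv_keys(s3, source_bucket, source_prefix)`, calling `split_ranges(len(keys))` gives ranges computed over CSV files only, so marker files written by Spark do not shift the proportions.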

### Create Amazon SageMaker data input channels

Next, we define <b>train and validation</b> input channels for Amazon SageMaker XGBoost model training.

In [None]:
from sagemaker import s3_input

s3_train = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/train', content_type='csv')
s3_validation = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/validation', content_type='csv')

output_path=f's3://{dest_bucket}/sagemaker/output/uber_nyc/xgboost'
data_channels = {'train': s3_train, 'validation': s3_validation}

### Create Amazon SageMaker Session

Next, we create a boto3 session and a SageMaker client. The SageMaker `Session` that wraps them is created when we start the training jobs.

In [None]:
from sagemaker.estimator import Estimator
from sagemaker import Session

sess = boto3.Session()
sm = sess.client('sagemaker')

## Define SageMaker Experiment

To define a SageMaker Experiment, we first install the `sagemaker-experiments` package.

In [None]:
!pip install sagemaker-experiments

Next, we import the SageMaker Experiment modules.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
import time

Next, we define a Tracker that records the input data used by the SageMaker Trials in this Experiment. The logged value is the S3 URL of the prepared dataset; change the value and the dataset name if you are using a different dataset.

In [None]:
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    # log the S3 URI of the dataset used for training
    tracker.log_input(name="nyc-tlc-dataset", 
                      media_type="s3/uri", 
                      value=f"s3://{source_bucket}/{source_prefix}" # S3 URL of the prepared dataset
                     )

Next, we create a SageMaker Experiment.

In [None]:
xgb_experiment = Experiment.create(
    experiment_name=f"xgb-nyc-tlc-experiment-{int(time.time())}", 
    description="XGBoost NYC TLC experiment", 
    sagemaker_boto_client=sm)
print(xgb_experiment)

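### Start Training Jobs
We start one training job per trial, each with a different pair of values for the `num_round` and `max_depth` hyper-parameters. Because `fit` is called with `wait=False`, each job is started without waiting for the previous one to finish.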

In [None]:
trial_params = [ (110, 11), (100, 10), (90, 9), (80, 8)]
sagemaker_session=Session(sess, sm)
for num_round, max_depth in trial_params:
    
    trial_name = f"xgb-nyc-tlc-{int(time.time())}"
    xgb_trial = Trial.create(
                        trial_name=trial_name, 
                        experiment_name=xgb_experiment.experiment_name,
                        sagemaker_boto_client=sm,
    )
    
    # associate the preprocessing trial component with the current trial
    xgb_trial.add_trial_component(tracker.trial_component)
    print(xgb_trial)

    xgb = Estimator(image_name=xgboost_image,
                            role=role, 
                            train_instance_count=2, 
                            train_instance_type='ml.c5.4xlarge',
                            output_path=output_path,
                            sagemaker_session=sagemaker_session)
    
    xgb.set_hyperparameters(objective='reg:linear',
                        grow_policy='depthwise',
                        tree_method = 'approx',
                        num_round=num_round, 
                        max_depth=max_depth)
    
    
    xgb.fit(inputs=data_channels, 
                        logs=True,  
                        experiment_config={"TrialName": xgb_trial.trial_name, 
                                           "TrialComponentDisplayName": "Training"},
                        wait=False)
    print(xgb)

    # sleep in between starting two trials
    time.sleep(2)
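
Because the jobs above are started with `wait=False`, their metrics are only available once they finish. The following is a minimal sketch, not part of the original notebook, of how one might poll the jobs until they all reach a terminal state. It relies on the `describe_training_job` call of the SageMaker client; the helper name `wait_for_training_jobs` is hypothetical, and it assumes the job names were collected inside the loop (for example from `xgb.latest_training_job.name`, which is an assumption about the SDK version in use).

```python
import time

# statuses after which a training job no longer changes
TERMINAL_STATES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_jobs(sm_client, job_names, poll_seconds=30, sleep=time.sleep):
    """Poll until every job reaches a terminal state; return {job_name: status}."""
    statuses = {}
    pending = list(job_names)
    while pending:
        still_running = []
        for name in pending:
            response = sm_client.describe_training_job(TrainingJobName=name)
            statuses[name] = response['TrainingJobStatus']
            if statuses[name] not in TERMINAL_STATES:
                still_running.append(name)
        pending = still_running
        if pending:
            # wait before asking again about the jobs that are not done yet
            sleep(poll_seconds)
    return statuses
```

Calling `wait_for_training_jobs(sm, job_names)` before running the analytics cells would avoid comparing trials whose metrics have not been reported yet.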

### Compare the model training runs for an experiment

Now we use the analytics capabilities of the SageMaker Python SDK to query and compare the training runs and identify the best model produced by our experiment. Trial components are retrieved with a search expression; the one below selects the components whose display name is `Training`.

In [None]:
search_expression = {
    "Filters":[
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}

In [None]:
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,
    experiment_name=xgb_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.validation:rmse.min",
    sort_order="Ascending",
    metric_names=['validation:rmse', 'train:rmse'],
    parameter_names=['max_depth', 'num_round']
)


In [None]:
analytic_table = trial_component_analytics.dataframe()

In [None]:
analytic_table
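
Since the table is sorted by ascending validation RMSE, the best trial is the one with the smallest value. The following is a minimal sketch of that selection on plain Python dictionaries rather than on the DataFrame itself. The helper name `best_trial`, the key `validation_rmse_min`, and the numbers are all hypothetical placeholders, not output of this experiment or column names of the analytics table.

```python
def best_trial(rows, metric='validation_rmse_min'):
    """Return the row with the smallest metric value, skipping rows without one."""
    # a job that has not reported the metric yet has no value to compare
    scored = [row for row in rows if row.get(metric) is not None]
    if not scored:
        raise ValueError(f'no row has a value for {metric}')
    return min(scored, key=lambda row: row[metric])


# Placeholder values for illustration only; these are not experiment results
example_rows = [
    {'num_round': 110, 'max_depth': 11, 'validation_rmse_min': 5.4},
    {'num_round': 100, 'max_depth': 10, 'validation_rmse_min': 5.1},
    {'num_round': 90, 'max_depth': 9, 'validation_rmse_min': None},
]
print(best_trial(example_rows))
```

Skipping rows without a metric matters here because a training job that is still running, or that failed, contributes a trial component with no validation RMSE.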