<h2>Kaggle Bike Sharing Demand Dataset</h2>

In this notebook, we train a model that predicts bike sharing demand. The data comes from a Kaggle competition and can be downloaded from https://www.kaggle.com/c/bike-sharing-demand/data


<h3>Objective:</h3> 
<quote>You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period (Ref: Kaggle.com)</quote>
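The day-of-month split described in the objective can be sketched with the standard library alone. The function name `split_by_day_of_month` and the sample month (January 2011) are illustrative assumptions, not part of the competition files.

```python
from datetime import datetime, timedelta

def split_by_day_of_month(timestamps, last_train_day=19):
    """Put timestamps on days 1..last_train_day in train, the rest in test."""
    train, test = [], []
    for ts in timestamps:
        (train if ts.day <= last_train_day else test).append(ts)
    return train, test

# Hourly timestamps for one 31-day month (illustrative sample, not real data)
start = datetime(2011, 1, 1)
hours = [start + timedelta(hours=h) for h in range(31 * 24)]

train_hours, test_hours = split_by_day_of_month(hours)
print(len(train_hours), len(test_hours))  # 456 288
```

With a 31-day month, 19 days of 24 hours go to training and the remaining 12 days go to test.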

### Import packages

In [2]:
import numpy as np
import pandas as pd

# AWS SDK and SageMaker imports; get_execution_role is used later to look up the IAM role

import boto3
import re
import sagemaker
from sagemaker import get_execution_role

### Upload Data to S3

In [18]:
bucket_name = 'fish-dsci'
training_file_key = 'sagemaker/tutorial/biketrain/bike_train.csv'
validation_file_key = 'sagemaker/tutorial/biketrain/bike_validation.csv'
test_file_key = 'sagemaker/tutorial/biketrain/bike_test.csv'

s3_model_output_location = r's3://{0}/sagemaker/tutorial/biketrain/model'.format(bucket_name)
s3_training_file_location = r's3://{0}/{1}'.format(bucket_name,training_file_key)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,validation_file_key)
s3_test_file_location = r's3://{0}/{1}'.format(bucket_name,test_file_key)

In [19]:
print(s3_model_output_location)
print(s3_training_file_location)
print(s3_validation_file_location)
print(s3_test_file_location)

s3://fish-dsci/sagemaker/tutorial/biketrain/model
s3://fish-dsci/sagemaker/tutorial/biketrain/bike_train.csv
s3://fish-dsci/sagemaker/tutorial/biketrain/bike_validation.csv
s3://fish-dsci/sagemaker/tutorial/biketrain/bike_test.csv
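The four locations above all follow the same `s3://bucket/key` pattern. As a minimal sketch, the formatting could be wrapped in a pair of small helpers; `s3_uri` and `split_s3_uri` are hypothetical names introduced here for illustration.

```python
def s3_uri(bucket, key):
    """Join a bucket name and an object key into an s3:// URI."""
    return 's3://{0}/{1}'.format(bucket, key.lstrip('/'))

def split_s3_uri(uri):
    """Split an s3:// URI back into a (bucket, key) pair."""
    prefix = 's3://'
    if not uri.startswith(prefix):
        raise ValueError('not an S3 URI: {0}'.format(uri))
    bucket, _, key = uri[len(prefix):].partition('/')
    return bucket, key

training_uri = s3_uri('fish-dsci', 'sagemaker/tutorial/biketrain/bike_train.csv')
print(training_uri)
print(split_s3_uri(training_uri))
```

Keeping the bucket and key separate until the last moment is convenient because the upload step takes them separately, while the training input takes the full URI.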


In [5]:
# A function to write data to s3

def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

In [6]:
write_to_s3('bike_train.csv',bucket_name,training_file_key)
write_to_s3('bike_validation.csv',bucket_name,validation_file_key)
write_to_s3('bike_test.csv',bucket_name,test_file_key)
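After uploading, it can be useful to confirm that the three files actually landed under the expected prefix. The sketch below assumes a boto3 S3 client is passed in and uses its `list_objects_v2` call; the helper name `list_uploaded_keys` is hypothetical, and only the first page of results is read.

```python
def list_uploaded_keys(s3_client, bucket, prefix):
    """Return the object keys under `prefix` (first page of results only)."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # 'Contents' is absent from the response when no object matches the prefix
    return [obj['Key'] for obj in response.get('Contents', [])]

# Usage inside the notebook (requires AWS credentials):
#   import boto3
#   keys = list_uploaded_keys(boto3.client('s3'), bucket_name,
#                             'sagemaker/tutorial/biketrain/')
#   print(keys)
```

Passing the client as an argument keeps the helper easy to exercise with a stand-in object when no AWS account is available.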

## Training Algorithm Docker Image
### AWS Maintains a separate image for every region and algorithm

In [7]:
# Registry Path for algorithms provided by SageMaker
#  https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}
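Because the dictionary only covers four regions, indexing it with any other region raises a bare `KeyError`. A minimal sketch of a lookup with a clearer message follows; `lookup_container` is a hypothetical helper, and the dictionary is abbreviated to two of the entries above so the block runs on its own.

```python
def lookup_container(containers, region):
    """Return the image URI for `region`, or raise a descriptive error."""
    try:
        return containers[region]
    except KeyError:
        known = ', '.join(sorted(containers))
        raise ValueError(
            'No XGBoost image configured for region {0!r}; known regions: {1}'
            .format(region, known))

# Abbreviated copy of the mapping, for illustration only
sample_containers = {
    'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
    'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
}

print(lookup_container(sample_containers, 'us-east-1'))
```

In the notebook, the region would come from `boto3.Session().region_name`, as in the estimator cell.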

In [8]:
role = get_execution_role()

In [9]:
# This role contains the permissions needed to train and deploy models.
# The SageMaker service is trusted to assume this role.
print(role)

arn:aws:iam::586260461210:role/service-role/AmazonSageMaker-ExecutionRole-20180812T064411


## Build Model

In [10]:
sess = sagemaker.Session()

In [11]:
# Select the algorithm container image for the current region.
#  Specify how many instances to use for distributed training and what type of machine to use.
#  Finally, specify where the trained model artifacts need to be stored.
#   Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html

estimator = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m4.xlarge',
                                       output_path=s3_model_output_location,
                                       sagemaker_session=sess,
                                       base_job_name ='xgboost-biketrain-v1')

In [12]:
# Specify hyperparameters that are appropriate for the training algorithm
# XGBoost Training Parameter Reference: 
#   https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

# max_depth=5,eta=0.1,subsample=0.7,num_round=150

estimator.set_hyperparameters(max_depth=5,objective="reg:linear",
                              eta=0.1,subsample=0.7,num_round=150)

In [13]:
estimator.hyperparameters()

{'max_depth': 5,
 'objective': 'reg:linear',
 'eta': 0.1,
 'subsample': 0.7,
 'num_round': 150}

### Specify Training Data Location and Optionally, Validation Data Location

In [20]:
# content type can be libsvm or csv for XGBoost

training_input_config = sagemaker.session.s3_input(s3_data=s3_training_file_location,content_type="csv")
validation_input_config = sagemaker.session.s3_input(s3_data=s3_validation_file_location,content_type="csv")

In [21]:
print(training_input_config.config)
print(validation_input_config.config)

{'DataSource': {'S3DataSource': {'S3DataDistributionType': 'FullyReplicated', 'S3DataType': 'S3Prefix', 'S3Uri': 's3://fish-dsci/sagemaker/tutorial/biketrain/bike_train.csv'}}, 'ContentType': 'csv'}
{'DataSource': {'S3DataSource': {'S3DataDistributionType': 'FullyReplicated', 'S3DataType': 'S3Prefix', 'S3Uri': 's3://fish-dsci/sagemaker/tutorial/biketrain/bike_validation.csv'}}, 'ContentType': 'csv'}
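For CSV input, the built-in XGBoost algorithm expects the target in the first column and no header row; the training log in this notebook reports `label_column=0` accordingly. The sketch below shows one way to produce that layout with the standard library. The helper name `to_xgboost_csv`, the column names, and the sample values are illustrative assumptions.

```python
import csv
import io

def to_xgboost_csv(records, label, feature_names):
    """Serialize dict records as header-less CSV with the label column first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for record in records:
        writer.writerow([record[label]] + [record[name] for name in feature_names])
    return buffer.getvalue()

# Made-up sample rows; real rows would come from the prepared training data
records = [
    {'count': 16, 'season': 1, 'temp': 9.84},
    {'count': 40, 'season': 1, 'temp': 9.02},
]
csv_text = to_xgboost_csv(records, label='count', feature_names=['season', 'temp'])
print(csv_text)
```

The same column order has to be used for the training and validation files, since the algorithm identifies features by position only.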


### Train the model

In [16]:
# XGBoost supports "train", "validation" channels
# Reference: Supported channels by algorithm
#   https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html


estimator.fit({'train':training_input_config, 'validation':validation_input_config})

INFO:sagemaker:Creating training-job with name: xgboost-biketrain-v1-2018-08-25-00-32-37-073


....................
Arguments: train
[2018-08-25:00:35:47:INFO] Running standalone xgboost training.
[2018-08-25:00:35:47:INFO] File size need to be processed in the node: 0.65mb. Available memory size in the node: 8603.52mb
[2018-08-25:00:35:47:INFO] Determined delimiter of CSV input is ','
[00:35:47] S3DistributionType set as FullyReplicated
[00:35:47] 7620x13 matrix with 99060 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2018-08-25:00:35:47:INFO] Determined delimiter of CSV input is ','
[00:35:47] S3DistributionType set as FullyReplicated
[00:35:47] 3266x13 matrix with 42458 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
[00:35:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 46 extra nodes, 0 pruned nodes, max_depth=5
[0]    train-rmse:3.90451    validation-rmse:3.91509
[00:35:47] src/tre


Billable seconds: 88
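The `train-rmse` and `validation-rmse` values in the log are root mean squared errors computed after each boosting round. As a reminder of what the metric measures, here is a minimal pure-Python sketch; the sample numbers are made up and are not taken from the training run.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    if not actual:
        raise ValueError('at least one value is required')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3)
print(rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))
```

A validation RMSE that stops falling while the training RMSE keeps improving is a common sign of overfitting.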


## Deploy Model

In [17]:
# Ref: http://sagemaker.readthedocs.io/en/latest/estimators.html

predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name = 'xgboost-biketrain-v1')

INFO:sagemaker:Creating model with name: xgboost-2018-08-25-00-36-19-364
INFO:sagemaker:Creating endpoint with name xgboost-biketrain-v1


---------------------------------------------------------------!
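Once the endpoint is in service, it can be called with CSV rows that contain the features only (no label column). The sketch below uses the boto3 `sagemaker-runtime` client's `invoke_endpoint` call rather than the `predictor` object; the helper name `predict_rows` is hypothetical, and the assumption that the response is a comma- or newline-separated list of numbers should be checked against the container version in use.

```python
import re

def predict_rows(runtime_client, endpoint_name, rows):
    """Send feature rows as CSV to an endpoint and return predictions as floats."""
    payload = '\n'.join(','.join(str(value) for value in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Body=payload)
    text = response['Body'].read().decode('utf-8')
    # Assumption: predictions are separated by commas and/or whitespace
    return [float(token) for token in re.split(r'[,\s]+', text.strip()) if token]

# Usage inside the notebook (requires AWS credentials and a live endpoint):
#   import boto3
#   runtime = boto3.client('sagemaker-runtime')
#   predictions = predict_rows(runtime, 'xgboost-biketrain-v1', feature_rows)
```

Each row must list the features in the same order as the training file, minus the first (label) column.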