# XGBoost Built-in Algorithm - Bike Rental Regression Example 

## Overview

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

## Data Fields



* datetime - hourly date + timestamp
* season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
* holiday - whether the day is considered a holiday
* workingday - whether the day is neither a weekend nor holiday
* weather - one of four categories:
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp - temperature in Celsius
* atemp - "feels like" temperature in Celsius
* humidity - relative humidity
* windspeed - wind speed
* casual - number of non-registered user rentals initiated
* registered - number of registered user rentals initiated
* count - number of total rentals
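
The prediction cell at the end of this notebook sends only numeric values and passes the result through `math.expm1`, which suggests that `datetime` was split into numeric parts and that the model was trained on `log1p(count)`. The preprocessing step is not shown in this notebook, so the following is only a sketch of one plausible way to do it. The helper names `datetime_features` and `log_target`, and the exact set of derived columns, are assumptions.

```python
import math
from datetime import datetime


def datetime_features(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into numeric parts (assumed feature set)."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "dayofweek": dt.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
    }


def log_target(count):
    """Compress the skewed rental count; math.expm1 reverses this transform."""
    return math.log1p(count)


features = datetime_features("2011-07-07 03:00:00")
print(features)
print(log_target(16))
```

Training on the log scale reduces the influence of a few very busy hours, at the cost of needing the inverse transform at prediction time.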

In [None]:
import numpy as np
import pandas as pd

import boto3
import re
import math

import sagemaker
from sagemaker import get_execution_role
# Boto3 (AWS SDK for Python) documentation: https://aws.amazon.com/sdk-for-python/
# SageMaker Python SDK documentation: https://sagemaker.readthedocs.io/en/stable/

## Upload Data to S3

In [None]:
# Define the bucket name and the key prefixes (folders) used by this project
bucket_name = 'bucket-.....-.....'
project_name = 'bikerental'

training_folder = f'{project_name}/training/'
validation_folder = f'{project_name}/validation/'
test_folder = f'{project_name}/test/'
output_folder = f'{project_name}/model/'

s3_model_output_location = f's3://{bucket_name}/{output_folder}'
s3_training_file_location = f's3://{bucket_name}/{training_folder}'
s3_validation_file_location = f's3://{bucket_name}/{validation_folder}'
s3_test_file_location = f's3://{bucket_name}/{test_folder}'

In [None]:
print(s3_model_output_location)
print(s3_training_file_location)
print(s3_validation_file_location)
print(s3_test_file_location)
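
String formatting works well when every part is written carefully, but a stray leading or trailing slash produces keys such as `bikerental//training/`. As a minimal sketch, the path construction above could be wrapped in a small helper that normalizes slashes. The function name `s3_prefix` and the bucket name `example-bucket` are hypothetical and used only for illustration.

```python
def s3_prefix(bucket, *parts):
    """Join a bucket and key parts into an s3:// prefix that ends with one slash."""
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return f"s3://{bucket}/" + "/".join(cleaned) + "/"


# 'example-bucket' is a placeholder, not the bucket used in this notebook
training_location = s3_prefix("example-bucket", "bikerental", "training")
messy_location = s3_prefix("example-bucket", "bikerental/", "/validation/")
print(training_location)
print(messy_location)
```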

In [None]:
# Writing to and reading from S3 is straightforward.
# Files are referred to as objects in S3.
# A file name is referred to as a key name in S3.

# An object stored in S3 (Standard storage class) is automatically replicated
# across at least 3 Availability Zones in the Region where the bucket was created.


def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)
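
When several files go to the same bucket, the upload can be driven from a list of (local path, key) pairs. The sketch below assumes an S3 client object is passed in, so that it can be exercised without AWS access. `upload_files` and `RecordingClient` are hypothetical names; in the notebook the client would be `boto3.client('s3')`, whose `upload_file(Filename, Bucket, Key)` method takes the same arguments.

```python
def upload_files(s3_client, bucket, files):
    """Upload each (local_path, key) pair and return the list of uploaded keys."""
    uploaded = []
    for local_path, key in files:
        s3_client.upload_file(local_path, bucket, key)
        uploaded.append(key)
    return uploaded


class RecordingClient:
    """Stand-in for an S3 client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def upload_file(self, Filename, Bucket, Key):
        self.calls.append((Filename, Bucket, Key))


fake_client = RecordingClient()
keys = upload_files(
    fake_client,
    "example-bucket",  # placeholder bucket name
    [
        ("bike_train.csv", "bikerental/training/bike_train.csv"),
        ("bike_validation.csv", "bikerental/validation/bike_validation.csv"),
    ],
)
print(keys)
```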



In [None]:
write_to_s3('bike_train.csv', 
            bucket_name,
            training_folder + 'bike_train.csv')

write_to_s3('bike_validation.csv',
            bucket_name,
            validation_folder + 'bike_validation.csv')

write_to_s3('test.csv',
            bucket_name,
            test_folder + 'bike_test.csv')

In [None]:
conn = boto3.client('s3')
contents = conn.list_objects(Bucket=bucket_name, Prefix=project_name)['Contents']
for f in contents:
    print(f['Key'])

## Training Algorithm Docker Image
### SageMaker maintains a separate image for each algorithm and region


In [None]:
# Managed spot training can save up to 90% of training cost compared to on-demand instances
# Reference: https://aws.amazon.com/ec2/spot/

# If you are still within the two-month free tier, you can use on-demand instances by setting:
# use_spot_instances = False

# On-demand is used here; set this to True to train on spot instances
use_spot_instances = False
max_run = 3600 # in seconds
max_wait = 7200 if use_spot_instances else None # in seconds

job_name = 'xgboost-bikerental'

#checkpoint_s3_uri = None

#if use_spot_instances:
 #   checkpoint_s3_uri = f's3://{bucket_name}/bikerental/checkpoints/{job_name}'
    
#print (f'Checkpoint uri: {checkpoint_s3_uri}')
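
For managed spot training, `max_wait` has to be at least as large as `max_run`, because it covers both the time spent waiting for spot capacity and the training time itself. The following is a minimal sketch of collecting these settings in one place and checking that constraint before the estimator is created. The helper name `spot_training_kwargs` is hypothetical; the dictionary keys mirror the `Estimator` arguments used in this notebook.

```python
def spot_training_kwargs(use_spot, max_run, max_wait=None, checkpoint_s3_uri=None):
    """Return Estimator keyword arguments for on-demand or spot training."""
    kwargs = {"use_spot_instances": use_spot, "max_run": max_run}
    if not use_spot:
        # max_wait and checkpoints only matter for spot training
        return kwargs
    if max_wait is None or max_wait < max_run:
        raise ValueError("max_wait must be set and be >= max_run for spot training")
    kwargs["max_wait"] = max_wait
    if checkpoint_s3_uri:
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs


on_demand = spot_training_kwargs(False, 3600)
spot = spot_training_kwargs(True, 3600, max_wait=7200)
print(on_demand)
print(spot)
```

The resulting dictionary could then be unpacked into the estimator constructor with `**kwargs`.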

In [None]:
# Establish a session with AWS
sess = sagemaker.Session()


In [None]:
print(sess.sagemaker_client)
print(sess.sagemaker_runtime_client)
print(sess.boto_session)
print(sess.boto_region_name)

In [None]:
role = get_execution_role()

In [None]:
role

In [None]:
# This role contains the permissions needed to train, deploy models
# SageMaker Service is trusted to assume this role
print(role)

In [None]:
# https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html#sagemaker.image_uris.retrieve
# https://sagemaker.readthedocs.io/en/stable/api/utility/session.html

# SageMaker Python SDK v2 uses image_uris.retrieve to look up the container image location

# Use XGBoost version 1.2
container = sagemaker.image_uris.retrieve("xgboost",sess.boto_region_name,version="1.2-2")

print (f'Using XGBoost Container: {container}')

## Build Model

In [None]:
# Configure the training job
# Specify the type and number of instances to use
# (the free tier covers 50 hours of m4.xlarge or m5.xlarge training instances)
# and the S3 location where the final artifacts need to be stored


# for managed spot training, specify the use_spot_instances flag, max_run, max_wait and checkpoint_s3_uri

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=s3_model_output_location,
    sagemaker_session=sess,
    base_job_name = job_name,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    #checkpoint_s3_uri=checkpoint_s3_uri
    )

In [None]:
# Specify hyperparameters that are appropriate for the training algorithm
# XGBoost Training Parameter Reference
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
estimator.set_hyperparameters(max_depth=5,
                              eta=0.1,
                              num_round=150)

In [None]:
estimator.hyperparameters()

### Specify Training Data Location and Optionally, Validation Data Location

In [None]:
# content type can be libsvm or csv for XGBoost
training_input_config = sagemaker.session.TrainingInput(
    s3_data=s3_training_file_location,
    content_type='csv',
    s3_data_type='S3Prefix')

validation_input_config = sagemaker.session.TrainingInput(
    s3_data=s3_validation_file_location,
    content_type='csv',
    s3_data_type='S3Prefix'
)

data_channels = {'train': training_input_config, 'validation': validation_input_config}

In [None]:
print(training_input_config.config)
print(validation_input_config.config)
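
With `content_type='csv'`, the built-in XGBoost algorithm expects files without a header row and with the target variable in the first column. A quick local check before uploading can catch a stray header or a ragged row. The sketch below uses only the standard library; `check_training_csv` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io


def check_training_csv(text, expected_columns):
    """Return a list of problems found in CSV text intended for XGBoost training."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != expected_columns:
            problems.append(
                f"line {line_no}: expected {expected_columns} columns, found {len(row)}"
            )
            continue
        try:
            [float(value) for value in row]
        except ValueError:
            problems.append(f"line {line_no}: non-numeric value (header row?)")
    return problems


# Made-up sample data: target first, then features, no header
good_csv = "2.83,3,0,1,2,28.7\n1.10,1,0,0,1,9.84\n"
bad_csv = "count,season,holiday\n2.83,3,0\n1.10,1\n"
print(check_training_csv(good_csv, 6))
print(check_training_csv(bad_csv, 3))
```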

### Train the model

In [None]:
# XGBoost supports "train", "validation" channels
# Reference: Supported versions of xgboost algorithm
# https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-east-1.html#xgboost-us-east-1.title
estimator.fit(data_channels)

In [None]:
conn = boto3.client('s3')
contents = conn.list_objects(Bucket=bucket_name, Prefix=project_name)['Contents']
for f in contents:
    print(f['Key'])

## Deploy Model

In [None]:
# Ref: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m5.xlarge',
                             endpoint_name = job_name)

## Run Predictions

In [None]:

from sagemaker.serializers import CSVSerializer

In [None]:
predictor.serializer = CSVSerializer()

In [None]:
result=predictor.predict([[3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3]])
result = result.decode("utf-8")
result

In [None]:
# expm1 is the inverse of log1p; applying it here implies the model predicts log1p(count)
print('Predicted Count', math.expm1(float(result)))
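
When several rows are sent in one request, the response contains several predictions. As a sketch, assuming the endpoint returns them as text separated by commas or newlines (the separator is an assumption here, so both are handled), the values can be converted back to rental counts in one step. `predictions_to_counts` is a hypothetical helper and the sample response string is made up.

```python
import math


def predictions_to_counts(response_text):
    """Convert log1p-scale predictions in a delimited string back to rental counts."""
    values = response_text.replace("\n", ",").split(",")
    return [math.expm1(float(value)) for value in values if value.strip()]


# Made-up response text with three predictions on the log1p scale
counts = predictions_to_counts("2.833213,0.693147\n1.609438")
print([round(count) for count in counts])
```

Because a regression model can return slightly negative values on the log scale, it may also be worth clipping the converted counts at zero before reporting them.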