# Bike Rental Forecasting

This notebook builds a very simple forecasting model of bike rentals based on the UCI Bike Sharing dataset: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset.

(Yes, this is similar to https://gallery.azure.ai/Experiment/bike-rentals-regression but using SageMaker.)

"Very simple" means that it uses the bare minimum number of function calls needed to get the data and to create and deploy the model.

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [4]:
bucket = 'sagemakertestak'
prefix = 'sagemaker/Bike-Rental-Forecasting'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now we'll import the Python libraries we'll need.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import time
import json
import sagemaker.amazon.common as smac
import sagemaker
from sagemaker.predictor import csv_serializer, json_deserializer

## Get the Data

Let's download the data from the bucket:

In [6]:
local = "hour.csv"
key = "hour.csv"
boto3.resource('s3').Bucket(bucket).download_file(key, local)

Let's see some rows of the csv:

In [7]:
bike = pd.read_csv(local)
bike = bike.drop('dteday', axis=1)  # TODO: dteday (the date) is dropped for now; see if it can be included as a category
display(bike.head())

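|   | instant | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |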
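
One thing worth noticing in the rows above: in every row shown, `cnt` equals `casual + registered`. Since the cells below drop only `cnt` from the features, the model can reconstruct the target from those two columns. The sketch below checks that identity on the five displayed rows only; the rows are hard-coded from the output above, and `rows` and `leaks_target` are illustrative names, not part of the notebook.

```python
# (casual, registered, cnt) copied from the five rows displayed above.
rows = [
    (3, 13, 16),
    (8, 32, 40),
    (5, 27, 32),
    (3, 10, 13),
    (0, 1, 1),
]

def leaks_target(rows):
    """Return True if cnt == casual + registered for every row."""
    return all(casual + registered == cnt for casual, registered, cnt in rows)

print(leaks_target(rows))
```

If the goal is to forecast demand without knowing the rider breakdown in advance, dropping `casual` and `registered` (and adjusting `feature_dim` to match) would probably give a more honest picture. That is a suggestion, not something this notebook does.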


Next, we split the dataset into training (first 60% of rows), validation (next 20%), and test (last 20%) sets:

In [8]:
split_train = int(len(bike) * 0.6)
split_test = int(len(bike) * 0.8)

train_y = bike['cnt'][:split_train]
train_X = bike.drop('cnt', axis=1).iloc[:split_train, ].to_numpy()
validation_y = bike['cnt'][split_train:split_test]
validation_X = bike.drop('cnt', axis=1).iloc[split_train:split_test, ].to_numpy()
test_y = bike['cnt'][split_test:]
test_X = bike.drop('cnt', axis=1).iloc[split_test:, ].to_numpy()
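
The split above is positional rather than random: the rows are cut at two index boundaries, so the test set is the last part of the file. A minimal sketch of the same index arithmetic, using a plain list in place of the DataFrame (the helper name `split_bounds` and the row count 17379 are assumptions for illustration):

```python
def split_bounds(n_rows, train_frac=0.6, test_start_frac=0.8):
    """Return the two cut points used for a positional train/validation/test split."""
    split_train = int(n_rows * train_frac)
    split_test = int(n_rows * test_start_frac)
    return split_train, split_test

n_rows = 17379  # assumed row count, for illustration only
split_train, split_test = split_bounds(n_rows)

rows = list(range(n_rows))
train = rows[:split_train]
validation = rows[split_train:split_test]
test = rows[split_test:]

# The three slices are contiguous and cover every row exactly once.
print(len(train), len(validation), len(test))
```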

Now, we'll convert the datasets to the recordIO-wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3.  We'll start with training data.

In [9]:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(train_X).astype('float32'), np.array(train_y).astype('float32'))
buf.seek(0)

0

In [10]:
key = 'linear_train.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

uploaded training data location: s3://sagemakertestak/sagemaker/Bike-Rental-Forecasting/train/linear_train.data
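
The upload cell builds the object key with `os.path.join` and the URI with `str.format`. On a Linux notebook instance that yields forward slashes, but S3 keys use `/` regardless of the local operating system, so `posixpath.join` is a slightly safer choice. A small sketch (the helper `s3_location` is hypothetical, not part of the SageMaker SDK):

```python
import posixpath

def s3_location(bucket, prefix, channel, filename):
    """Return (key, uri) for an object stored under prefix/channel/filename."""
    key = posixpath.join(prefix, channel, filename)
    uri = 's3://{}/{}'.format(bucket, key)
    return key, uri

key, uri = s3_location('sagemakertestak',
                       'sagemaker/Bike-Rental-Forecasting',
                       'train',
                       'linear_train.data')
print(uri)
```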


Next we'll convert and upload the validation dataset.

In [11]:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(validation_X).astype('float32'), np.array(validation_y).astype('float32'))
buf.seek(0)

0

In [12]:
key = 'linear_validation.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', key)).upload_fileobj(buf)
s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, key)
print('uploaded validation data location: {}'.format(s3_validation_data))

uploaded validation data location: s3://sagemakertestak/sagemaker/Bike-Rental-Forecasting/validation/linear_validation.data


## Train the Model

Now we can begin to specify our linear model.  First, let's specify the containers for the Linear Learner algorithm.  Since we want this notebook to run in all of Amazon SageMaker's regions, we'll use a convenience function to look up the container image name for our current region.  More details on algorithm containers can be found in [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

In [13]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


Amazon SageMaker's Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.  This functionality is automatically enabled.  We can influence this using parameters like:

- `num_models` to increase the total number of models run.  The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a nearby solution that may be better.  In this case, we're going to use the max of 32.
- `loss` which controls how we penalize mistakes in our model estimates.  For this case, let's use absolute loss as we haven't spent much time cleaning the data, and absolute loss will adjust less to accommodate outliers.
- `wd` or `l1` which control regularization.  Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability.  In this case, we'll leave these parameters as their default "auto" though.
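
To see why absolute loss reacts less to outliers than squared loss, consider fitting a single constant to a handful of values. The constant that minimizes squared loss is the mean, while the constant that minimizes absolute loss is the median. The toy numbers below are made up for illustration and have nothing to do with the bike data:

```python
import statistics

def squared_loss(values, c):
    """Sum of squared deviations of values from the constant c."""
    return sum((v - c) ** 2 for v in values)

def absolute_loss(values, c):
    """Sum of absolute deviations of values from the constant c."""
    return sum(abs(v - c) for v in values)

clean = [10, 12, 11, 13, 12]
with_outlier = clean + [200]  # one extreme value

for name, values in [('clean', clean), ('with outlier', with_outlier)]:
    mean = statistics.mean(values)      # minimizer of squared loss
    median = statistics.median(values)  # minimizer of absolute loss
    print(name, 'mean:', mean, 'median:', median)
    print('  squared loss at mean:', squared_loss(values, mean))
    print('  absolute loss at median:', absolute_loss(values, median))
```

The single outlier drags the mean far from the bulk of the data, while the median stays put. This is the intuition behind choosing `absolute_loss` for lightly cleaned data.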

Let's kick off our training job in SageMaker's distributed, managed training.  Because training is managed (AWS handles spinning up and spinning down hardware), we don't have to wait for our job to finish to continue, but for this case, we'll use the Python SDK to wait and track our progress.

In [14]:
sess = sagemaker.Session()

linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m5.large',
                                       output_path='s3://{}/{}/output'.format(bucket, prefix),
                                       sagemaker_session=sess)
linear.set_hyperparameters(feature_dim=15,
                           mini_batch_size=100,
                           predictor_type='regressor',
                           epochs=10,
                           num_models=32,
                           loss='absolute_loss')

linear.fit({'train': s3_train_data, 'validation': s3_validation_data})

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-02-01 16:27:59 Starting - Starting the training job...
2021-02-01 16:28:22 Starting - Launching requested ML instancesProfilerReport-1612196878: InProgress
.........
2021-02-01 16:29:43 Starting - Preparing the instances for training......
2021-02-01 16:30:59 Downloading - Downloading input data
2021-02-01 16:30:59 Training - Downloading the training image...
2021-02-01 16:31:27 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[02/01/2021 16:31:30 INFO 140433487607616] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_sc

## Deploy the Model

Now that we've trained the linear algorithm on our data, let's create a model and deploy that to a hosted endpoint.

In [15]:
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')

---------------------!

## Forecast

Now we configure the predictor's serializer and deserializer, so that requests are sent as CSV and responses are parsed as JSON:

In [19]:
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

Next, we'll invoke the endpoint to get predictions.

In [20]:
result = linear_predictor.predict(test_X)
one_step = np.array([r['score'] for r in result['predictions']])

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The json_deserializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
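
Conceptually, the CSV serializer turns each feature row into one comma-separated line, and the JSON deserializer turns the response body into a dictionary with a `predictions` list. The sketch below mimics both steps with the standard library only. It does not call the endpoint, the response string is hand-written in the shape seen in this notebook, and `rows_to_csv` and `scores_from_response` are hypothetical helper names.

```python
import json

def rows_to_csv(rows):
    """Serialize a list of feature rows as CSV text, one row per line."""
    return '\n'.join(','.join(str(value) for value in row) for row in rows)

def scores_from_response(body):
    """Extract the 'score' of every prediction from a JSON response body."""
    return [p['score'] for p in json.loads(body)['predictions']]

payload = rows_to_csv([[1, 3, 1, 8], [2, 3, 1, 9]])
fake_response = '{"predictions": [{"score": 284.24}, {"score": 301.5}]}'

print(payload)
print(scores_from_response(fake_response))
```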


Here is what a single input row for prediction looks like:

In [21]:
display(test_X[1])

array([1.3905e+04, 3.0000e+00, 1.0000e+00, 8.0000e+00, 1.3000e+01,
       0.0000e+00, 2.0000e+00, 1.0000e+00, 2.0000e+00, 8.0000e-01,
       7.4240e-01, 5.2000e-01, 1.9400e-01, 6.8000e+01, 1.8500e+02])

Let's verify we got results back:

In [22]:
display(result['predictions'][0])

{'score': 284.24127197265625}
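
The notebook stops at checking that predictions come back. A natural next step would be to compare `one_step` with `test_y`, for example with the mean and median absolute error. A minimal sketch on made-up numbers (the values are illustrative, not results from this model, and `absolute_errors` is a hypothetical helper):

```python
import statistics

def absolute_errors(actual, predicted):
    """Return the element-wise absolute differences between two sequences."""
    return [abs(a - p) for a, p in zip(actual, predicted)]

actual = [16, 40, 32, 13, 1]               # made-up targets
predicted = [18.0, 37.0, 32.0, 10.0, 5.0]  # made-up predictions

errors = absolute_errors(actual, predicted)
print('Mean absolute error:', statistics.mean(errors))
print('Median absolute error:', statistics.median(errors))
```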

And here's an array to manually play around with some values:

In [23]:
test = [1, 3, 1, 8, 13, 0, 2, 1, 2, 1, 1, 1, 1, 800, 300]
result = linear_predictor.predict(test)
display(result['predictions'][0])

{'score': 1112.4727783203125}

For the record, you can call the endpoint from the AWS CLI with a command like this one: `aws sagemaker-runtime invoke-endpoint --endpoint-name "endpoint name" --body "base64-encoded csv text" --content-type "text/csv" <outputfile.json>`
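
The `--body` value in that command is the CSV row, base64-encoded. A sketch of producing it in Python, using the manual test row from above (whether base64 is required depends on the AWS CLI version and its binary-format setting, so treat that as an assumption to verify against your CLI):

```python
import base64

# Same 15 feature values as the manual test row used earlier.
features = [1, 3, 1, 8, 13, 0, 2, 1, 2, 1, 1, 1, 1, 800, 300]

# One CSV line, then base64 over its UTF-8 bytes.
csv_text = ','.join(str(value) for value in features)
body = base64.b64encode(csv_text.encode('utf-8')).decode('ascii')

print(csv_text)
print(body)
```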


## (Optional) Clean-up

If you're ready to be done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)