# Bike Rental Forecasting

This notebook builds a very simple forecasting model of bike rentals based on the UCI Bike Sharing Dataset: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset.

(Yes, this is similar to https://gallery.azure.ai/Experiment/bike-rentals-regression but using SageMaker.)

"Very simple" means that it uses the bare minimum number of function calls to get the data, then create and deploy the model.

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
bucket = 'datalake-ak'
prefix = 'sagemaker/Bike-Rental-Forecasting'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now we'll import the Python libraries we'll need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import time
import json
import sagemaker.amazon.common as smac
import sagemaker
from sagemaker.predictor import csv_serializer, json_deserializer

## Get the Data

Let's download the data from the bucket:

In [None]:
local = "hour.csv"
key = "hour.csv"
boto3.resource('s3').Bucket(bucket).download_file(key, local)

Let's load the CSV, drop the `dteday` column, and look at the first few rows:

In [None]:
bike = pd.read_csv(local)
bike = bike.drop('dteday', axis=1)  # TODO: dteday is dropped for now; see if it can be included as a category
display(bike.head())

Next, we split the dataset into training, validation, and test sets (the first 60%, the next 20%, and the last 20% of the rows):

In [None]:
split_train = int(len(bike) * 0.6)
split_test = int(len(bike) * 0.8)

train_y = bike['cnt'][:split_train]
train_X = bike.drop('cnt', axis=1).iloc[:split_train, ].to_numpy()
validation_y = bike['cnt'][split_train:split_test]
validation_X = bike.drop('cnt', axis=1).iloc[split_train:split_test, ].to_numpy()
test_y = bike['cnt'][split_test:]
test_X = bike.drop('cnt', axis=1).iloc[split_test:, ].to_numpy()
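
The slicing above is positional: nothing is shuffled, so the model trains on the earliest rows, validates on the next block, and is tested on the most recent block. Here is a minimal sketch of the same boundary arithmetic on a toy list. The helper name `chronological_split` and the ten-row stand-in are illustrative only and are not part of the notebook.

```python
def chronological_split(rows, train_frac=0.6, test_start_frac=0.8):
    """Split rows by position into train, validation, and test parts.

    Hypothetical helper mirroring the notebook's split_train / split_test
    arithmetic; rows keep their original order (no shuffling).
    """
    split_train = int(len(rows) * train_frac)
    split_test = int(len(rows) * test_start_frac)
    return rows[:split_train], rows[split_train:split_test], rows[split_test:]


# Toy stand-in for the hourly records, ordered oldest to newest.
hours = list(range(10))
train, validation, test = chronological_split(hours)
print(len(train), len(validation), len(test))  # 6 2 2
```

Keeping the rows in order means the test set is always "the future" relative to the training set, which is usually what you want when the goal is forecasting.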

Now, we'll convert the datasets to the recordIO-wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3.  We'll start with training data.

In [None]:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(train_X).astype('float32'), np.array(train_y).astype('float32'))
buf.seek(0)

In [None]:
key = 'linear_train.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

Next we'll convert and upload the validation dataset.

In [None]:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, np.array(validation_X).astype('float32'), np.array(validation_y).astype('float32'))
buf.seek(0)

In [None]:
key = 'linear_validation.data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', key)).upload_fileobj(buf)
s3_validation_data = 's3://{}/{}/validation/{}'.format(bucket, prefix, key)
print('uploaded validation data location: {}'.format(s3_validation_data))
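
Both uploads above build the object key with `os.path.join` and then format the `s3://` URI separately. On a Linux notebook instance `os.path.join` uses `/`, which matches S3 keys, but that depends on the local operating system. The sketch below builds the key and the URI from the same parts with `posixpath`, so the two cannot drift apart. The helper names `s3_key` and `s3_uri` are hypothetical.

```python
import posixpath


def s3_key(prefix, channel, filename):
    """Join key parts with forward slashes, regardless of the local OS."""
    return posixpath.join(prefix, channel, filename)


def s3_uri(bucket, key):
    """Format the s3:// URI for an object key in a bucket."""
    return 's3://{}/{}'.format(bucket, key)


bucket = 'datalake-ak'
prefix = 'sagemaker/Bike-Rental-Forecasting'

train_key = s3_key(prefix, 'train', 'linear_train.data')
print(s3_uri(bucket, train_key))
# s3://datalake-ak/sagemaker/Bike-Rental-Forecasting/train/linear_train.data
```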

## Train the Model

Now we can begin to specify our linear model.  First, let's specify the containers for the Linear Learner algorithm.  Since we want this notebook to run in all of Amazon SageMaker's regions, we'll use a convenience function to look up the container image name for our current region.  More details on algorithm containers can be found in [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

Amazon SageMaker's Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.  This functionality is automatically enabled.  We can influence this using parameters like:

- `num_models` to increase the total number of models run.  The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a nearby solution that may be better.  In this case, we're going to use the max of 32.
- `loss` which controls how we penalize mistakes in our model estimates.  For this case, let's use absolute loss as we haven't spent much time cleaning the data, and absolute loss will adjust less to accommodate outliers.
- `wd` or `l1` which control regularization.  Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability.  In this case, we'll leave these parameters at their default of "auto".
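
To see why absolute loss reacts less to outliers, consider the simplest possible "model": a single constant estimate. The constant that minimizes absolute loss is the median, while the constant that minimizes squared loss is the mean. This is only an illustration of the principle with made-up numbers. Linear Learner fits a linear model, not a constant, and the function names below are hypothetical.

```python
from statistics import mean, median


def absolute_loss(values, estimate):
    """Mean absolute error of a single constant estimate."""
    return sum(abs(v - estimate) for v in values) / len(values)


def squared_loss(values, estimate):
    """Mean squared error of a single constant estimate."""
    return sum((v - estimate) ** 2 for v in values) / len(values)


# Made-up hourly counts; the last value plays the role of an outlier.
counts = [10, 12, 11, 13, 12, 500]

best_for_absolute = median(counts)  # stays near the typical values
best_for_squared = mean(counts)     # gets pulled toward the outlier

print('absolute-loss minimizer:', best_for_absolute)
print('squared-loss minimizer: ', best_for_squared)
print('absolute loss at each:', absolute_loss(counts, best_for_absolute),
      absolute_loss(counts, best_for_squared))
```

One extreme value moves the squared-loss estimate far from the bulk of the data, while the absolute-loss estimate stays close to the typical counts.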

Let's kick off our training job in SageMaker's distributed, managed training.  Because training is managed (AWS handles spinning up and spinning down hardware), we don't have to wait for our job to finish to continue, but for this case, we'll use the Python SDK to wait and track our progress.

In [None]:
sess = sagemaker.Session()

linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m5.large',
                                       output_path='s3://{}/{}/output'.format(bucket, prefix),
                                       sagemaker_session=sess)
linear.set_hyperparameters(feature_dim=15,
                           mini_batch_size=100,
                           predictor_type='regressor',
                           epochs=10,
                           num_models=32,
                           loss='absolute_loss')

linear.fit({'train': s3_train_data, 'validation': s3_validation_data})
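
The `feature_dim=15` hyperparameter has to match the number of feature columns in the training data. The sketch below shows where 15 comes from, assuming `hour.csv` has the standard header of the UCI Bike Sharing Dataset (the header string is reproduced here as an assumption rather than read from the file).

```python
import csv
import io

# Assumed header of hour.csv from the UCI Bike Sharing Dataset.
header_line = ('instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,'
               'weathersit,temp,atemp,hum,windspeed,casual,registered,cnt')

columns = next(csv.reader(io.StringIO(header_line)))

# The notebook drops dteday up front and uses cnt as the label.
feature_columns = [c for c in columns if c not in ('dteday', 'cnt')]

feature_dim = len(feature_columns)
print(feature_dim)  # 15
```

In the notebook itself, `train_X.shape[1]` should give the same number, which avoids hard-coding it.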

## Deploy the Model

Now that we've trained the linear algorithm on our data, let's create a model and deploy that to a hosted endpoint.

In [None]:
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')

## Forecast

Now we configure the predictor returned by `deploy` so that it sends CSV requests and parses JSON responses:

In [None]:
linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

Next, we'll invoke the endpoint to get predictions.

In [None]:
result = linear_predictor.predict(test_X)
one_step = np.array([r['score'] for r in result['predictions']])
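
The `one_step` array holds one forecast per test row, but nothing above compares it with `test_y`. Since the model was trained with absolute loss, mean absolute error is a natural score. Here is a minimal sketch with made-up numbers; the function name `mean_absolute_error` is an illustrative helper, not something the notebook defines.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up rental counts and forecasts, purely for illustration.
actual_counts = [120, 80, 45]
forecasts = [110.0, 95.0, 40.0]

print(mean_absolute_error(actual_counts, forecasts))  # 10.0
```

In this notebook the equivalent call would be along the lines of `mean_absolute_error(list(test_y), list(one_step))`.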

Here we just want to see what a single row of prediction input looks like:

In [None]:
display(test_X[1])

Let's verify we got results back:

In [None]:
display(result['predictions'][0])
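
Conceptually, each request is CSV text and each response is a JSON document with a `predictions` list whose entries carry a `score`. The sketch below mimics both directions without calling the endpoint. It approximates the idea rather than reproducing the SDK's serializer, and the response body and its score value are made up.

```python
import json


def rows_to_csv(rows):
    """Serialize feature rows as CSV text, one comma-separated line per row."""
    return '\n'.join(','.join(str(value) for value in row) for row in rows)


def scores_from_response(body):
    """Pull the 'score' field out of each prediction in a JSON response body."""
    return [p['score'] for p in json.loads(body)['predictions']]


# A single 15-value feature row.
rows = [[1, 3, 1, 8, 13, 0, 2, 1, 2, 1, 1, 1, 1, 80, 200]]
print(rows_to_csv(rows))  # 1,3,1,8,13,0,2,1,2,1,1,1,1,80,200

# Hypothetical response body; the score value is made up.
sample_body = '{"predictions": [{"score": 279.5}]}'
print(scores_from_response(sample_body))  # [279.5]
```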

And here's a hand-built feature row (15 values, matching `feature_dim`) to manually play around with some values:

In [None]:
test = [1, 3, 1, 8, 13, 0, 2, 1, 2, 1, 1, 1, 1, 80, 200]
result = linear_predictor.predict(test)
display(result['predictions'][0])

For the record, you can call this endpoint from the AWS CLI with a command like this one: `aws sagemaker-runtime invoke-endpoint --endpoint-name "endpoint name" --body "base64-encoded csv text" --content-type "text/csv" <outputfile.json>`
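
The base64-encoded body for that command can be produced in Python. This is a small sketch, assuming your CLI configuration expects binary parameters as base64 text; the feature row is the same hand-built example used above.

```python
import base64

# One 15-value feature row, written as CSV text.
csv_text = '1,3,1,8,13,0,2,1,2,1,1,1,1,80,200'

# Base64-encode the UTF-8 bytes to get a string suitable for --body.
body = base64.b64encode(csv_text.encode('utf-8')).decode('ascii')
print(body)

# Decoding recovers the original CSV text.
print(base64.b64decode(body).decode('utf-8') == csv_text)  # True
```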


## (Optional) Clean-up

When you're done with this notebook, run the cell below.  It removes the hosted endpoint you created and avoids charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)