# Time Series Forecasting with Linear Learner
_**Using Linear Regression to Forecast Weekly Demand**_

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
  1. [Forecast](#Forecast)
1. [Extensions](#Extensions)

---

## Background

Forecasting is potentially the most broadly relevant machine learning topic there is.  Whether predicting future sales in retail, housing prices in real estate, traffic in cities, or patient visits in healthcare, almost every industry could benefit from improvements in their forecasts.  There are numerous statistical methodologies that have been developed to forecast time-series data, but still, the process for developing forecasts tends to be a mix of objective statistics and subjective interpretations.

Properly modeling time-series data takes a great deal of care.  What's the right level of aggregation to model at?  Too granular and the signal gets lost in the noise, too aggregate and important variation is missed.  Also, what is the right cyclicality?  Daily, weekly, monthly?  Are there holiday peaks?  How should we weight recent versus overall trends?

Linear regression with appropriate controls for trend, seasonality, and recent behavior remains a common method for forecasting stable time-series with reasonable volatility.  This notebook will build a linear model to forecast weekly output for US gasoline products from 1991 to 2005.  It will focus almost exclusively on the application.  For a more in-depth treatment on forecasting in general, see [Forecasting: Principles & Practice](https://robjhyndman.com/uwafiles/fpp-notes.pdf).

---

## Setup

Let's start by specifying:

* AWS region.
* The IAM role ARN used to give learning and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto call with the appropriate full IAM role ARN string.
* The S3 bucket that you want to use for training and model data.

In [None]:
import os
import boto3

os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
role = boto3.client('iam').list_instance_profiles()['InstanceProfiles'][0]['Roles'][0]['Arn']

bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/linear_time_series_forecast'

Now we'll import the Python libraries we'll need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import convert_data
import io
import time
import json

---
## Data

Let's download the data.

In [None]:
!wget http://robjhyndman.com/data/gasoline.csv

And take a look at it.

In [None]:
gas = pd.read_csv('gasoline.csv', header=None, names=['thousands_barrels'])
display(gas.head())
plt.plot(gas)
plt.show()

As we can see, there's a definite upward trend and some yearly seasonality, but enough volatility to make the problem non-trivial.  There are several unexpected dips, and some years show more pronounced seasonality than others.  These same characteristics are common in many topline time-series.

Next we'll transform the dataset to make it look a bit more like a standard prediction model.  Our target variable is `thousands_barrels`.  Let's create explanatory features, like:
- `thousands_barrels` for each of the 4 preceding weeks.
- Trend.  The chart above suggests the trend is simply linear, but we'll create log and quadratic trends in case.
- Indicator variables {0 or 1} that will help capture seasonality and key holiday weeks.

In [None]:
gas['thousands_barrels_lag1'] = gas['thousands_barrels'].shift(1)
gas['thousands_barrels_lag2'] = gas['thousands_barrels'].shift(2)
gas['thousands_barrels_lag3'] = gas['thousands_barrels'].shift(3)
gas['thousands_barrels_lag4'] = gas['thousands_barrels'].shift(4)
gas['trend'] = np.arange(len(gas))
gas['log_trend'] = np.log1p(np.arange(len(gas)))
gas['sq_trend'] = np.arange(len(gas)) ** 2
weeks = pd.get_dummies(np.array(list(range(52)) * 15)[:len(gas)], prefix='week')
gas = pd.concat([gas, weeks], axis=1)
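
To make the feature construction above easier to inspect, here is a minimal sketch on a six-row toy series.  The values and the 3-week cycle are illustrative assumptions (they are not the gasoline data); the cycle simply stands in for the 52-week year so the indicator columns fit on screen.

```python
import numpy as np
import pandas as pd

# Toy weekly series (hypothetical values, not the gasoline data)
toy = pd.DataFrame({'y': [10.0, 12.0, 11.0, 13.0, 14.0, 15.0]})

# shift(k) moves values down k rows, so row t holds the value from week t-k
toy['y_lag1'] = toy['y'].shift(1)
toy['y_lag2'] = toy['y'].shift(2)

# Linear trend counter
toy['trend'] = np.arange(len(toy))

# Position-in-cycle indicators; a 3-week cycle stands in for the 52-week year
cycle = np.arange(len(toy)) % 3
toy = pd.concat([toy, pd.get_dummies(cycle, prefix='week')], axis=1)

print(toy)
```

The first rows of each lag column are `NaN` because there is no earlier week to copy from, which is why rows without complete lag history need to be dropped before training.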

Now, we'll:
- Clear out the first four rows where we don't have lagged information.
- Split the target off from the explanatory features.
- Split the data into training, validation, and test groups so that we can tune our model and then evaluate its accuracy on data it hasn't seen yet.  Since this is time-series data, we'll use the first 60% for training, the second 20% for validation, and the final 20% for final test evaluation.

In [None]:
gas = gas.iloc[4:, ]
split_train = int(len(gas) * 0.6)
split_test = int(len(gas) * 0.8)

# Note: DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
train_y = gas['thousands_barrels'][:split_train]
train_X = gas.drop('thousands_barrels', axis=1).iloc[:split_train, ].values
validation_y = gas['thousands_barrels'][split_train:split_test]
validation_X = gas.drop('thousands_barrels', axis=1).iloc[split_train:split_test, ].values
test_y = gas['thousands_barrels'][split_test:]
test_X = gas.drop('thousands_barrels', axis=1).iloc[split_test:, ].values
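
As a small sketch of the time-ordered split used here, the helper below returns contiguous index ranges with no shuffling, so every validation row comes after every training row and every test row comes after both.  The function name `chronological_split` and its parameters are hypothetical, introduced only for illustration.

```python
import numpy as np

def chronological_split(n_rows, train_end=0.6, validation_end=0.8):
    """Return (train, validation, test) index arrays in time order, without shuffling.

    train_end and validation_end are cumulative fractions of the data.
    """
    split_train = int(n_rows * train_end)
    split_test = int(n_rows * validation_end)
    idx = np.arange(n_rows)
    return idx[:split_train], idx[split_train:split_test], idx[split_test:]

train_idx, validation_idx, test_idx = chronological_split(10)
print(train_idx, validation_idx, test_idx)
```

Keeping the groups in time order matters because a random split would let the model train on weeks that come after the ones it is evaluated on.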

Now, we'll convert the datasets to the recordIO wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3.  We'll start with training data.

In [None]:
train_file = 'linear_train.data'

f = io.BytesIO()
for features, target in zip(train_X, train_y):
    convert_data.write_recordio(f, convert_data.list_to_record_bytes(features, label=target, feature_size=59))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)

Next we'll convert and upload the validation dataset.

In [None]:
validation_file = 'linear_validation.data'

f = io.BytesIO()
for features, target in zip(validation_X, validation_y):
    convert_data.write_recordio(f, convert_data.list_to_record_bytes(features, label=target, feature_size=59))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', validation_file)).upload_fileobj(f)

---
## Train

Now we can begin to specify our linear model.  Amazon SageMaker's Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.  This functionality is automatically enabled.  We can influence this using parameters like:

- `num_models` to increase the total number of models run.  The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a nearby solution that may be more optimal.  In this case, we're going to use the max of 32.
- `loss` which controls how we penalize mistakes in our model estimates.  For this case, let's use absolute loss as we haven't spent much time cleaning the data, and absolute loss will adjust less to accommodate outliers.
- `wd` or `l1` which control regularization.  Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability.  In this case, we'll leave these parameters as their default "auto" though.
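
To illustrate why absolute loss adjusts less to outliers, here is a minimal sketch that is independent of Linear Learner: it searches for the single constant prediction that minimizes each loss on a handful of made-up observations, one of which is an outlier.  The data and the grid search are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical observations with one outlier (illustrative values only)
y = np.array([10.0, 11.0, 9.0, 10.0, 50.0])

# Candidate constant predictions on a fine grid
candidates = np.linspace(0.0, 60.0, 6001)
errors = y[None, :] - candidates[:, None]

# Best constant under mean squared loss and under mean absolute loss
best_squared = candidates[np.argmin((errors ** 2).mean(axis=1))]
best_absolute = candidates[np.argmin(np.abs(errors).mean(axis=1))]

print('squared loss picks ', round(best_squared, 2), '(mean   =', y.mean(), ')')
print('absolute loss picks', round(best_absolute, 2), '(median =', np.median(y), ')')
```

Squared loss lands on the mean, which the single outlier drags well above the typical value, while absolute loss lands on the median, which barely moves.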

In [None]:
linear_job = 'linear-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Job name is:", linear_job)

linear_training_params = {
    "RoleArn": role,
    "TrainingJobName": linear_job,
    "AlgorithmSpecification": {
        "TrainingImage": "900597767885.dkr.ecr.us-east-1.amazonaws.com/aialgorithmslinearlearnercontainer:latest",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "ShardedByS3Key"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/validation/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }

    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/".format(bucket, prefix)
    },
    "HyperParameters": {
        "feature_dim": "59",
        "mini_batch_size": "100",
        "predictor_type": "regressor",
        "epochs": "10",
        "num_models": "32",
        "loss": "absolute_loss"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 60 * 60
    }
}

Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created.  Because training is managed (AWS handles spinning up and spinning down hardware), we don't have to wait for our job to finish to continue, but for this case, let's use a waiter so that the cell blocks until training is done and we can check its final status.

In [None]:
%%time

region = boto3.Session().region_name
sm = boto3.Session().client(service_name='sagemaker', endpoint_url='https://im.{}.amazonaws.com'.format(region))

sm.create_training_job(**linear_training_params)

status = sm.describe_training_job(TrainingJobName=linear_job)['TrainingJobStatus']
print(status)
sm.get_waiter('TrainingJob_Created').wait(TrainingJobName=linear_job)
# Re-read the status after waiting; the value fetched above predates the end of training
status = sm.describe_training_job(TrainingJobName=linear_job)['TrainingJobStatus']
print(status)
if status == 'Failed':
    message = sm.describe_training_job(TrainingJobName=linear_job)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')

---
## Host

Now that we've trained the linear algorithm on our data, let's set up a model which can later be hosted.  We will:
1. Point to the scoring container
1. Point to the model.tar.gz that came from training
1. Create the hosting model

In [None]:
linear_hosting_container = {
    'Image': "900597767885.dkr.ecr.us-east-1.amazonaws.com/aialgorithmslinearlearnercontainer:latest",
    'ModelDataUrl': sm.describe_training_job(TrainingJobName=linear_job)['ModelArtifacts']['S3ModelArtifacts']
}

create_model_response = sm.create_model(
    ModelName=linear_job,
    ExecutionRoleArn=role,
    PrimaryContainer=linear_hosting_container)

print(create_model_response['ModelArn'])

Once we've set up a model, we can configure what our hosting endpoints should be.  Here we specify:
1. EC2 instance type to use for hosting
1. Initial number of instances
1. Our hosting model name

In [None]:
linear_endpoint_config = 'linear-endpoint-config-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(linear_endpoint_config)
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=linear_endpoint_config,
    ProductionVariants=[{
        'InstanceType': 'ml.c4.2xlarge',
        'InitialInstanceCount': 1,
        'ModelName': linear_job,
        'VariantName': 'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Now that we've specified how our endpoint should be configured, we can create it.  This can be done in the background, but for now let's wait on the endpoint and print its status so that we know when it is ready for use.

In [None]:
%%time

linear_endpoint = 'linear-endpoint-' + time.strftime("%Y%m%d%H%M", time.gmtime())
print(linear_endpoint)
create_endpoint_response = sm.create_endpoint(
    EndpointName=linear_endpoint,
    EndpointConfigName=linear_endpoint_config)
print(create_endpoint_response['EndpointArn'])

resp = sm.describe_endpoint(EndpointName=linear_endpoint)
status = resp['EndpointStatus']
print("Status: " + status)

sm.get_waiter('Endpoint_Created').wait(EndpointName=linear_endpoint)

resp = sm.describe_endpoint(EndpointName=linear_endpoint)
status = resp['EndpointStatus']
print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

if status != 'InService':
    raise Exception('Endpoint creation did not succeed')

### Forecast

Now that we have our hosted endpoint, we can generate statistical forecasts from it.  Let's forecast on our test dataset to understand how accurate our model may be.

There are many metrics to measure forecast error.  Common examples include:
- Root Mean Square Error (RMSE)
- Mean Absolute Percent Error (MAPE)
- Geometric Mean of the Relative Absolute Error (GMRAE)
- Quantile forecast errors
- Errors that account for asymmetric loss in over or under-prediction

For our example we'll keep things simple and use Median Absolute Percent Error (MdAPE), but we'll also compare it to a naive benchmark forecast: demand in the same week last year, scaled by that week's year-over-year growth (that week last year * that week last year / that week two years ago).
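
To make the metric and the benchmark concrete, here is a minimal sketch.  MdAPE is the median over the test weeks of |actual - forecast| / actual, and the naive forecast for week t is y[t-52] * y[t-52] / y[t-104].  The function names and the synthetic series (a repeating seasonal pattern growing 5% per year) are assumptions made for illustration.

```python
import numpy as np

def mdape(actual, forecast):
    """Median absolute percent error, expressed as a fraction."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.median(np.abs(actual - forecast) / actual)

def seasonal_naive_with_growth(y, season=52):
    """Forecast y[t] as y[t-season] scaled by the growth y[t-season] / y[t-2*season]."""
    y = np.asarray(y, dtype=float)
    forecast = np.full(len(y), np.nan)  # the first two seasons have no forecast
    forecast[2 * season:] = y[season:-season] ** 2 / y[:-2 * season]
    return forecast

# Synthetic series: 4 "years" of a fixed seasonal shape with 5% yearly growth
t = np.arange(208)
base = 100 + 10 * np.sin(2 * np.pi * np.arange(52) / 52)
y = base[t % 52] * 1.05 ** (t // 52)

naive = seasonal_naive_with_growth(y)
print('Naive MdAPE on synthetic data =', mdape(y[104:], naive[104:]))
print('MdAPE of a hand-made example  =', mdape([100, 200, 400], [110, 180, 400]))
```

On this synthetic series the benchmark is essentially exact, because growth is perfectly constant from year to year; real data with uneven growth and noise is where it starts to miss.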

There are also multiple ways to generate forecasts.
- One-step-ahead forecasts:  When predicting for multiple data points, one-step-ahead forecasts update the history with the correct known value.  These are common, easy to produce, and can give us some intuition of whether our model is performing as expected.  However, they can also present an unnecessarily optimistic evaluation of the forecast.  In most real-life cases, we want to predict out well into the future, because the actions we may take based on that forecast are not immediate.  In these cases, we won't know what the time periods in between will bring, so generating a forecast that relies on knowing them can be misleading.
- Multi-step-ahead (or horizon) forecasts: In this case, when forecasting out of sample, each forecast builds off of the forecasted periods that precede it.  So, errors early on in the test data can compound to create large deviations for observations late in the test data.  Although this is more realistic, it can be difficult to create the forecasts, particularly as model complexity increases.
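
The difference between the two approaches can be sketched with a toy one-lag model.  The intercept, coefficient, and data values below are hypothetical and unrelated to the model trained in this notebook; the point is only to show where each forecast gets its lagged input.

```python
import numpy as np

# Hypothetical fitted one-lag model: y_hat[t] = intercept + coef * y[t-1]
intercept, coef = 2.0, 0.8

actual = np.array([10.0, 11.0, 9.0, 12.0, 10.0])  # illustrative test-period values
last_known = 12.0                                  # final observation before the test period

# One-step-ahead: every prediction uses the true previous value
prev_actual = np.concatenate([[last_known], actual[:-1]])
one_step = intercept + coef * prev_actual

# Multi-step-ahead: each prediction becomes the lag for the next one
multi_step = []
prev = last_known
for _ in actual:
    prev = intercept + coef * prev
    multi_step.append(prev)
multi_step = np.array(multi_step)

print('one-step  :', one_step)
print('multi-step:', multi_step)
```

Both forecasts agree on the first test period, since both start from the same known history.  After that, the one-step forecast keeps reacting to the actual values while the multi-step forecast only sees its own output and drifts smoothly toward the model's long-run level.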

For our example, we'll calculate both, but focus on the multi-step forecast accuracy.

Let's start by generating the naive forecast.

In [None]:
gas['thousands_barrels_lag52'] = gas['thousands_barrels'].shift(52)
gas['thousands_barrels_lag104'] = gas['thousands_barrels'].shift(104)
gas['thousands_barrels_naive_forecast'] = gas['thousands_barrels_lag52'] ** 2 / gas['thousands_barrels_lag104']
naive = gas[split_test:]['thousands_barrels_naive_forecast'].values

And investigating its accuracy.

In [None]:
print('Naive MdAPE =', np.median(np.abs(test_y - naive) / test_y))
plt.plot(np.array(test_y), label='actual')
plt.plot(naive, label='naive')
plt.legend()
plt.show()

Now we'll generate the one-step-ahead forecast.  First we need a function to convert our numpy arrays into a format that can be handled by our inference container.  In this case that's a simple CSV. 

In [None]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()
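
As a quick check of what this helper produces, the sketch below repeats the function so it runs on its own and applies it to a small made-up array.  It also shows one property worth keeping in mind: `fmt='%g'` keeps six significant digits, so large feature values are rounded in the payload.

```python
import io
import numpy as np

def np2csv(arr):
    # Same helper as above, repeated so this example is self-contained
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

# Two rows of three made-up features: one CSV line per row
sample = np.array([[1.0, 2.5, 0.0],
                   [3.0, 4.0, 1.0]])
payload = np2csv(sample)
print(payload)

# '%g' keeps six significant digits, so large values lose precision
rounded = np2csv(np.array([[1234567.0]]))
print(rounded)
```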

Next, we'll invoke the endpoint to get predictions.

In [None]:
runtime = boto3.Session().client(service_name='sagemaker-runtime', endpoint_url='https://maeveruntime.prod.us-west-2.ml-platform.aws.a2z.com')

payload = np2csv(test_X)
response = runtime.invoke_endpoint(EndpointName=linear_endpoint,
                                   ContentType='text/csv',
                                   Body=payload)
result = json.loads(response['Body'].read().decode())
one_step = np.array([r['score'] for r in result['predictions']])
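
The parsing step above expects a JSON body with a `predictions` list whose entries each carry a `score`.  The sketch below runs the same parsing on a hand-written body, so it can be tried without a live endpoint; the two score values are made up.

```python
import json
import numpy as np

# Hand-written stand-in for response['Body'].read().decode(); the scores are made up
body = '{"predictions": [{"score": 8123.4}, {"score": 8250.1}]}'

result = json.loads(body)
scores = np.array([r['score'] for r in result['predictions']])
print(scores)
```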

Let's compare forecast errors.

In [None]:
print('One-step-ahead MdAPE = ', np.median(np.abs(test_y - one_step) / test_y))
plt.plot(np.array(test_y), label='actual')
plt.plot(one_step, label='forecast')
plt.legend()
plt.show()

As we can see, our MdAPE is substantially better than the naive forecast's, and we actually swing from a forecast that's too volatile to one that under-represents the noise in our data.  However, the overall shape of the statistical forecast does appear to better represent the actual data.

Next, let's generate the multi-step-ahead forecast.  To do this, we'll need to loop over invoking the endpoint one row at a time and make sure the lags in our model are updated appropriately.

In [None]:
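multi_step = []
# Copy the starting lags so that updating them doesn't overwrite test_X in place
lags = test_X[0, 0:4].copy()
for row in test_X:
    row = row.copy()
    # Replace the true lags with our own most recent predictions
    row[0:4] = lags
    payload = np2csv([row])
    response = runtime.invoke_endpoint(EndpointName=linear_endpoint,
                                       ContentType='text/csv',
                                       Body=payload)
    result = json.loads(response['Body'].read().decode())
    prediction = result['predictions'][0]['score']
    multi_step.append(prediction)
    # Shift the lags back one week and put the newest prediction first
    lags[1:4] = lags[0:3].copy()
    lags[0] = prediction

multi_step = np.array(multi_step)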
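
One detail of a loop like this is easy to get wrong: slicing a NumPy array returns a view, so a lag window taken straight from the feature matrix shares memory with it, and updating the window rewrites the matrix.  The sketch below shows the rolling update on a copy, using a tiny made-up matrix whose first four columns play the role of lag1 to lag4.

```python
import numpy as np

# Made-up feature matrix: columns 0-3 are lag1..lag4, column 4 is a trend counter
X = np.array([[4.0, 3.0, 2.0, 1.0, 0.0],
              [5.0, 4.0, 3.0, 2.0, 1.0]])
original = X.copy()

# A plain slice is a view onto X; .copy() gives an independent lag window
is_view = np.shares_memory(X[0, 0:4], X)
lags = X[0, 0:4].copy()

# Rolling update: the newest prediction becomes lag1, older lags move back one slot
prediction = 4.5  # hypothetical model output
lags = np.concatenate([[prediction], lags[:-1]])

print('slice shares memory with X:', is_view)
print('updated lags:', lags)
print('X unchanged:', np.array_equal(X, original))
```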

And now calculate the accuracy of these predictions.

In [None]:
print('Multi-step-ahead MdAPE =', np.median(np.abs(test_y - multi_step) / test_y))
plt.plot(np.array(test_y), label='actual')
plt.plot(multi_step, label='forecast')
plt.legend()
plt.show()

As we can see, our multi-step-ahead forecast performs worse than our one-step-ahead forecast, but nevertheless remains substantially stronger than the naive benchmark forecast.  This 1.5 percentage point difference may not seem particularly meaningful, but at the large scale of many topline forecasts it can mean millions of dollars in excess supply or lost sales.

---
## Extensions

Our linear model does a good job of predicting gasoline demand, but of course, improvements could be made.  The fact that the statistical forecast underrepresents some of the volatility in the data could suggest that we have over-regularized the model.  Or, perhaps our choice of absolute loss was incorrect.  Rerunning the model with further tweaks to these hyperparameters may provide more accurate out-of-sample forecasts.  We also did not do a large amount of feature engineering.  Occasionally, the lagging time periods have complex interrelationships with one another that should be explored.  Finally, alternative forecasting algorithms could be explored.  Less interpretable methods like ARIMA, and black-box methods like LSTM Recurrent Neural Networks, have been shown to predict time-series very well.  Balancing the simplicity of a linear model with predictive accuracy is an important subjective question where the right answer depends on the problem being solved and its implications for the business.