# Machine Learning Immersion Day

This notebook serves as a template for the overall process of taking a non-ideal time series dataset and integrating it into [Amazon Forecast](https://aws.amazon.com/forecast/).

## Overview

1. Introduction to Amazon Forecast
1. Obtaining Your Data
1. Fitting the Data to Forecast
1. Determining Your Forecast Horizon (1st pass)
1. Building Your Predictors
1. Visualizing Predictors
1. Making Decisions
1. Next Steps


## Introduction to Amazon Forecast

If you are not familiar with Amazon Forecast you can learn more about this tool on these pages:

* [Product Page](https://aws.amazon.com/forecast/)
* [GitHub Sample Notebooks](https://github.com/aws-samples/amazon-forecast-samples)
* [Product Docs](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html)


## Obtaining Your Data 

A critical requirement to use Amazon Forecast is to have access to time-series data for your selected use case. To learn more about time series data:

1. [Wikipedia](https://en.wikipedia.org/wiki/Time_series)
1. [Towards Data Science Primer](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775)
1. [O'Reilly Book](https://www.amazon.com/gp/product/1492041653/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1)

For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly.

We will use Pandas to read the CSV and to show a sample of the data.

Before that, the data files need to be in place. If you cloned the repository, the data is already part of it; otherwise complete the following steps:

1. Create a directory for the data files.
1. Download the sample data into the directory.
1. Extract the archive file into the directory.
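
The three steps above could be sketched as follows. This is a minimal sketch, not the notebook's own download cell: the archive URL is not given here, so `archive_url` is a parameter you must supply, and the ZIP format and the local file name `sample-data.zip` are assumptions.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(archive_url: str, data_dir: str = "../data") -> list:
    """Create data_dir, download a ZIP archive into it, and extract it there."""
    target_dir = Path(data_dir)
    # 1. Create a directory for the data files (no error if it already exists).
    target_dir.mkdir(parents=True, exist_ok=True)

    # 2. Download the sample data into the directory.
    archive_path = target_dir / "sample-data.zip"  # hypothetical local file name
    urllib.request.urlretrieve(archive_url, str(archive_path))

    # 3. Extract the archive file into the directory and report its contents.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Calling `fetch_and_extract("<archive URL>")` returns the names of the extracted files, which is a quick way to confirm that `item-demand-time.csv` arrived.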

If you are running this in Amazon SageMaker Studio, please select the Data Science image and the Python 3 kernel.

We will first upgrade pandas and then restart the kernel. After this we will import the Pandas library as well as a few other data science tools in order to inspect the information.


In [None]:
!pip install --upgrade pip
!pip install --upgrade pandas

After the upgrade is done, please restart the kernel.

In [None]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import dateutil.parser

In [None]:
df = pd.read_csv("../data/item-demand-time.csv", dtype = object, names=['timestamp','value','item'])
df.drop(df.loc[df['item']!='client_12'].index, inplace=True)
df.head(3)

In [None]:
df.describe()

Notice in the output above there are 3 columns of data:

1. The Timestamp
1. A Value
1. An Item

These are the 3 key required pieces of information to generate a forecast with Amazon Forecast. More can be added but these 3 must always remain present.
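
Before uploading anything, it can help to check that a DataFrame really has these three pieces in usable form. The helper below is a hedged sketch: `check_target_time_series` is a hypothetical name, and the column names are the ones this notebook passes to `pd.read_csv`.

```python
import pandas as pd

REQUIRED_COLUMNS = ["timestamp", "value", "item"]  # names used in this notebook


def check_target_time_series(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    # Timestamps must parse as dates and values must be numeric.
    if pd.to_datetime(frame["timestamp"], errors="coerce").isna().any():
        problems.append("some timestamps could not be parsed")
    if pd.to_numeric(frame["value"], errors="coerce").isna().any():
        problems.append("some values are not numeric")
    if frame["item"].isna().any():
        problems.append("some rows have no item id")
    return problems
```

Running `check_target_time_series(df)` after loading the CSV should return an empty list if the file is in the expected shape.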

The dataset happens to span January 01, 2014 to December 31, 2014. For our testing we would like to keep the most recent data in a separate CSV. The cells below split the data at the end of October: everything before October 31 is saved to a training CSV, and the data from October 31 through November is saved to a validation CSV.

You may notice a variable named `df`. This is a popular convention when using Pandas if you are using the library's DataFrame object, which is similar to a table in a database. You can learn more here: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Our dataset has information about 3 clients. The loading cell above already kept only the rows for client_12, which is the focus of this exercise.
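
The split relies on comparing timestamp strings, which has a subtle boundary effect. The sketch below uses a tiny made-up stand-in for `df` (the three rows are illustrative, not real data) to show where the boundary falls, and an alternative with parsed timestamps that makes the cutoff explicit.

```python
import pandas as pd

# Tiny stand-in for the notebook's `df`; the values are made up for illustration.
sample = pd.DataFrame({
    "timestamp": ["2014-10-30 23:00:00", "2014-10-31 00:00:00", "2014-11-30 23:00:00"],
    "value": ["1.0", "2.0", "3.0"],
    "item": ["client_12"] * 3,
})

# String comparison: "2014-10-31 00:00:00" sorts after "2014-10-31",
# so `<= '2014-10-31'` keeps only rows up to the end of October 30.
train_by_string = sample[sample["timestamp"] <= "2014-10-31"]

# Same split with parsed timestamps and a half-open interval, which makes
# the boundary explicit: training is everything strictly before the cutoff.
parsed = pd.to_datetime(sample["timestamp"])
cutoff = pd.Timestamp("2014-10-31")
train = sample[parsed < cutoff]
validation = sample[parsed >= cutoff]
```

Both approaches give the same training rows here; the parsed version is simply easier to reason about when the cutoff changes.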

In [None]:
jan_to_oct = df[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] <= '2014-10-31')]
remaining_df = df[(df['timestamp'] >= '2014-10-31') & (df['timestamp'] <= '2014-12-01')]

Now export them to CSV files and place them into your `data` folder.

In [None]:
jan_to_oct.to_csv("../data/item-demand-time-train.csv", header=False, index=False)
remaining_df.to_csv("../data/item-demand-time-validation.csv", header=False, index=False)

### Uploading your training data to S3

At this time the data is ready to be sent to S3 where Forecast will use it later. The following cells will upload the data to S3.

Please paste the bucket name and the Forecast role ARN from the Outputs section of your CloudFormation stack.


In [None]:
# Replace the bucket name and role ARN below: get the execution role from the SageMaker Studio console and the bucket name from the S3 console

bucket_name = "forecastimmersiondayluseloso"
role_arn = "arn:aws:iam::144386903708:role/ForecastSteps-ForecastRole-OPD2KYME2LGI"
role_name = role_arn.split("/")[1]

target_time_series_filename ="elec_data/item-demand-time-train.csv"

boto3.Session().resource('s3').Bucket(bucket_name).Object(target_time_series_filename).upload_file("../data/item-demand-time-train.csv")

## Getting Started With Forecast

Now that all of the required data to get started exists, our next step is to build the dataset groups and datasets required for our problem. Inside Amazon Forecast a DatasetGroup is an abstraction that contains all the datasets for a particular collection of Forecasts. There is no information sharing between DatasetGroups so if you'd like to try out various alternatives to the schemas we create below, you could create a new DatasetGroup and make your changes inside its corresponding Datasets.

The order of the process below will be as follows:

1. Create a DatasetGroup for our POC.
1. Create a `Target-Time-Series` Dataset.
1. Attach the Dataset to the DatasetGroup.
1. Import the data into the Dataset.
1. Generate Forecasts with CNN-QR.
1. Query the Forecasts.
1. Plot the Forecasts and metrics. 


At that point we can see how the model is performing and discuss how to add related data to our POC.

The cell below defines a few global settings for our POC with the service.

In [None]:
DATASET_FREQUENCY = "H" 
TIMESTAMP_FORMAT = "yyyy-MM-dd hh:mm:ss"

project = 'forecast_immersion_day'
datasetName= project+'_ds'
datasetGroupName= project +'_dsg'

Now, using the metadata stored on this instance of a SageMaker Notebook, determine the region we are operating in. If you are using a Jupyter Notebook outside of SageMaker, simply define `region` as the string that indicates the region you would like to use for Forecast and S3.
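
If the same notebook has to run both inside and outside SageMaker, the lookup can fall back gracefully. This is a sketch under assumptions: `detect_region` is a hypothetical helper, the fallback reads the standard `AWS_DEFAULT_REGION` environment variable, and the `us-east-1` default is only a placeholder.

```python
import json
import os

METADATA_PATH = "/opt/ml/metadata/resource-metadata.json"


def detect_region(metadata_path: str = METADATA_PATH, default: str = "us-east-1") -> str:
    """Read the region from SageMaker metadata, else the environment, else `default`."""
    try:
        with open(metadata_path) as notebook_info:
            resource_arn = json.load(notebook_info)["ResourceArn"]
        # ARN layout: arn:partition:service:region:account-id:resource
        return resource_arn.split(":")[3]
    except (OSError, KeyError, IndexError, json.JSONDecodeError):
        # Not running on SageMaker (or metadata unreadable): fall back.
        return os.environ.get("AWS_DEFAULT_REGION", default)
```

With this in place, `region = detect_region()` behaves like the metadata lookup on SageMaker and still produces a value elsewhere.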


In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)

Configure your AWS APIs

In [None]:
session = boto3.Session(region_name=region) 
forecast = session.client(service_name='forecast') 
forecast_query = session.client(service_name='forecastquery')

You will get a permission-related error in the next step if the role does not have the permissions required for using Amazon Forecast. To fix it, go to the IAM console and attach the `AmazonForecastFullAccess` policy to the SageMaker execution role. Also update the role's trust relationship as shown below.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "sagemaker.amazonaws.com",
          "forecast.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
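
The same two changes can also be made from code rather than the console. The sketch below is one possible approach, assuming the identity running it is allowed to modify the role; `grant_forecast_access` and `build_trust_policy` are hypothetical helper names, and the client is passed in so the function can be exercised without touching a real account.

```python
import json

FORECAST_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonForecastFullAccess"


def build_trust_policy() -> dict:
    """Trust policy letting both SageMaker and Forecast assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "forecast.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }],
    }


def grant_forecast_access(iam_client, role_name: str) -> None:
    """Attach the managed policy and replace the role's trust relationship."""
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=FORECAST_POLICY_ARN)
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(build_trust_policy()),
    )

# Usage (requires IAM permissions to modify the role):
#   import boto3
#   grant_forecast_access(boto3.client("iam"), role_name)
```

Note that `update_assume_role_policy` replaces the whole trust policy, so any other principals the role already trusts would need to be added to `build_trust_policy`.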


In [None]:
# Create the DatasetGroup
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,
                                                              Domain="CUSTOM",
                                                             )
datasetGroupArn = create_dataset_group_response['DatasetGroupArn']

In [None]:
forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)

In [None]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"target_value",
         "AttributeType":"float"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}

In [None]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = schema
)

In [None]:
target_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=target_datasetArn)

In [None]:
# Attach the Dataset to the Dataset Group:
forecast.update_dataset_group(DatasetGroupArn=datasetGroupArn, DatasetArns=[target_datasetArn])

In [None]:
# Finally, import the data into the dataset
target_s3DataPath = "s3://"+bucket_name+"/"+target_time_series_filename
datasetImportJobName = 'DSIMPORT_JOB_TARGET'
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=target_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":target_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [None]:
ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_import_job_arn)

The cell below will run and poll every 30 seconds until the import process has completed. From there we will be able to create our model.

In [None]:
while True:
    dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
    print(dataImportStatus)
    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
        sleep(30)
    else:
        break
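
The same polling pattern is used again below for the predictor and the forecast, so it could be factored into one helper. This is a sketch, not part of the original notebook: `wait_for_status` is a hypothetical name, the three-hour timeout is an arbitrary assumption, and unlike the loop above it raises an error on a failed status instead of silently stopping.

```python
import time


def wait_for_status(describe, poll_seconds=30, timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll `describe()` until its 'Status' is ACTIVE; raise on failure or timeout."""
    waited = 0
    while True:
        status = describe()["Status"]
        print(status)
        if status == "ACTIVE":
            return status
        if status.endswith("FAILED"):
            raise RuntimeError(f"resource ended in status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Usage with the import job from this notebook:
#   wait_for_status(lambda: forecast.describe_dataset_import_job(
#       DatasetImportJobArn=ds_import_job_arn))
```

Passing the describe call as a function keeps the helper independent of which Forecast resource is being polled.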

## Model building bits

Given that our data is hourly and we want to generate a forecast on the hour, Forecast limits us to a horizon of 500 time steps, whatever the data frequency is. With hourly data, this means we can predict up to about 20 days into the future. In our case we are going to predict 3 days, or 72 hours.

The cells below define a few variables to be used with our model. Then there is an API call to create a `Predictor` based on CNN-QR.
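
The arithmetic behind those numbers is simple enough to check in a few lines. The sketch below only encodes the 500-step limit quoted above; `horizon_in_days` is a hypothetical helper name.

```python
MAX_HORIZON_STEPS = 500   # limit quoted above, in time steps
HOURS_PER_DAY = 24


def horizon_in_days(steps: int, hours_per_step: int = 1) -> float:
    """Convert a horizon expressed in time steps into days."""
    if steps > MAX_HORIZON_STEPS:
        raise ValueError(f"{steps} steps exceeds the limit of {MAX_HORIZON_STEPS}")
    return steps * hours_per_step / HOURS_PER_DAY


max_days = horizon_in_days(MAX_HORIZON_STEPS)   # 500 / 24, a little under 21 days
chosen_days = horizon_in_days(72)               # 72 / 24 = 3 days
```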


In [None]:
forecastHorizon = 72
NumberOfBacktestWindows = 1
BackTestWindowOffset = 72
ForecastFrequency = "H"

In [None]:
cnnQR_algorithmArn = 'arn:aws:forecast:::algorithm/CNN-QR'

In [None]:
# CNN-QR Specifics
cnnQR_predictorName= project+'_cnnQR_algo_1'

In [None]:
# Build CNN-QR:
cnnQR_create_predictor_response=forecast.create_predictor(PredictorName=cnnQR_predictorName, 
                                                  AlgorithmArn=cnnQR_algorithmArn,
                                                  ForecastHorizon=forecastHorizon,
                                                  PerformAutoML= False,
                                                  PerformHPO=False,
                                                  EvaluationParameters= {"NumberOfBacktestWindows": NumberOfBacktestWindows, 
                                                                         "BackTestWindowOffset": BackTestWindowOffset}, 
                                                  InputDataConfig= {"DatasetGroupArn": datasetGroupArn},
                                                  FeaturizationConfig= {"ForecastFrequency": ForecastFrequency, 
                                                                        "Featurizations": 
                                                                        [
                                                                          {"AttributeName": "target_value", 
                                                                           "FeaturizationPipeline": 
                                                                            [
                                                                              {"FeaturizationMethodName": "filling", 
                                                                               "FeaturizationMethodParameters": 
                                                                                {"frontfill": "none", 
                                                                                 "middlefill": "zero", 
                                                                                 "backfill": "zero"}
                                                                              }
                                                                            ]
                                                                          }
                                                                        ]
                                                                       }
                                                 )

These calls will take an hour or so to complete in full. So feel free to take lunch here, go grab a pint, really anything that is going to kill a decent volume of time.

The following while loop keeps track of the CNN-QR predictor's progress.

In [None]:
while True:
    status = forecast.describe_predictor(PredictorArn=cnnQR_create_predictor_response['PredictorArn'])['Status']
    print(status)
    if status != 'ACTIVE' and status != 'CREATE_FAILED':
        sleep(30)
    else:
        break

## Examine the Model

First we are going to get the metrics for the model:

In [None]:
# CNN-QR Metrics
cnnQR_arn = cnnQR_create_predictor_response['PredictorArn']
cnnQR_metrics = forecast.get_accuracy_metrics(PredictorArn=cnnQR_arn)
pp = pprint.PrettyPrinter()
pp.pprint(cnnQR_metrics)
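
The pretty-printed response is deeply nested, so a small flattening step can make the numbers easier to read. This sketch assumes the response layout as I understand it from the boto3 documentation (`PredictorEvaluationResults`, then `TestWindows`, then `Metrics`); `summarize_metrics` is a hypothetical helper, and the example response contains made-up numbers, not results from a real run.

```python
def summarize_metrics(response: dict) -> list:
    """Flatten a get_accuracy_metrics response into one dict per test window."""
    rows = []
    for result in response.get("PredictorEvaluationResults", []):
        for window in result.get("TestWindows", []):
            metrics = window.get("Metrics", {})
            row = {
                "evaluation_type": window.get("EvaluationType"),
                "rmse": metrics.get("RMSE"),
            }
            # One weighted quantile loss per forecast quantile (e.g. 0.1, 0.5, 0.9).
            for loss in metrics.get("WeightedQuantileLosses", []):
                row[f"wql_{loss['Quantile']}"] = loss["LossValue"]
            rows.append(row)
    return rows


# Illustrative response; the numbers are made up, not taken from a real run.
example_response = {
    "PredictorEvaluationResults": [{
        "AlgorithmArn": "arn:aws:forecast:::algorithm/CNN-QR",
        "TestWindows": [{
            "EvaluationType": "SUMMARY",
            "Metrics": {
                "RMSE": 12.5,
                "WeightedQuantileLosses": [
                    {"Quantile": 0.9, "LossValue": 0.04},
                    {"Quantile": 0.5, "LossValue": 0.08},
                    {"Quantile": 0.1, "LossValue": 0.05},
                ],
            },
        }],
    }]
}
summary = summarize_metrics(example_response)
```

In the notebook, `summarize_metrics(cnnQR_metrics)` would give one row per backtest window, which can be passed straight to `pd.DataFrame` for display.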

## Generate a Forecast 

The next phase is to generate a Forecast from the Predictor so we can see the results and understand visually how the model is performing.

In [None]:
# CNN-QR
cnnQR_forecastName = project+'_cnnQR_algo_forecast'
cnnQR_create_forecast_response=forecast.create_forecast(ForecastName=cnnQR_forecastName,
                                                  PredictorArn=cnnQR_arn)
cnnQR_forecast_arn = cnnQR_create_forecast_response['ForecastArn']

In [None]:
while True:
    status = forecast.describe_forecast(ForecastArn=cnnQR_forecast_arn)['Status']
    print(status)
    if status != 'ACTIVE' and status != 'CREATE_FAILED':
        sleep(30)
    else:
        break

## Exporting your Forecasts to S3

In [None]:
#CNN-QR Forecast

cnnQR_path = "s3://" + bucket_name + "/cnnQR"
cnnQR_job_name = "mlimday_cnnQR_algo_forecast"
forecast.create_forecast_export_job(ForecastExportJobName=cnnQR_job_name,
                                    ForecastArn=cnnQR_forecast_arn,
                                    Destination={
                                        "S3Config": {
                                            "Path": cnnQR_path,
                                            "RoleArn": role_arn
                                        }
                                    })

This exporting process is another one of those items that will take **5 minutes** to complete. Just poll for progress in the console. From the earlier page where you saw the status turn `Active` for a Forecast, click it and you can see the progress of the export.

### Obtaining the Forecasts

At this point the forecasts are exported into S3, but you need to obtain the results locally so we can explore them. The cells below will do that.

In [None]:

# CNN-QR File
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket_name)
cnnQR_filename = ""
cnnQR_files = list(s3_bucket.objects.filter(Prefix="cnnQR"))
for file in cnnQR_files:
    #There will be a collection of CSVs if the forecast is large, modify this to go get them all
    if "csv" in file.key:
        cnnQR_filename = file.key.split('/')[1]
        s3.Bucket(bucket_name).download_file(file.key, "../data/"+cnnQR_filename)
print(cnnQR_filename)


## Evaluating the Forecast

Even before exporting the forecasts themselves, we can see a few things in the logs above.

Mainly, the RMSE for the model:

CNN-QR - RMSE: 13.448013461697567



This tells us that our model is doing the best when evaluating the p50 result.

The next stage would be to plot these numbers over a particular window.

To make this particular process easier, we exported the forecasts as CSVs and read them in below. An improvement would be to use the JSON API and convert the response to a DataFrame that way.
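
The JSON route mentioned above could look roughly like this. It is a sketch under assumptions: `predictions_to_frame` is a hypothetical helper, and it expects the response shape returned by `query_forecast` as I understand it (a `Forecast` dict whose `Predictions` map each quantile name to a list of `Timestamp`/`Value` points).

```python
import pandas as pd


def predictions_to_frame(query_response: dict) -> pd.DataFrame:
    """Turn a query_forecast response into a frame indexed by timestamp."""
    predictions = query_response["Forecast"]["Predictions"]
    columns = {}
    for quantile, points in predictions.items():
        # Each quantile (p10, p50, p90, ...) becomes one column.
        columns[quantile] = pd.Series(
            [point["Value"] for point in points],
            index=pd.to_datetime([point["Timestamp"] for point in points]),
        )
    return pd.DataFrame(columns).sort_index()

# Usage with the clients created earlier in this notebook:
#   response = forecast_query.query_forecast(
#       ForecastArn=cnnQR_forecast_arn, Filters={"item_id": "client_12"})
#   cnnQR_predicts = predictions_to_frame(response)
```

This avoids the export job and the S3 download for a single item, at the cost of one query per item.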

Note the files were downloaded and placed into the `../data/` folder for exploration.

In [None]:
# CNNQR Eval
cnnQR_predicts = pd.read_csv("../data/" + cnnQR_filename)
cnnQR_predicts.sample()

In [None]:
cnnQR_predicts.plot()

In [None]:
# Parse the date column into datetime values
cnnQR_predicts['date'] = pd.to_datetime(cnnQR_predicts['date'])

In [None]:

cnnQR_predicts.sample()

In [None]:
# Remove the timezone, then use the date as the index
cnnQR_predicts['date'] = cnnQR_predicts['date'].dt.tz_convert(None)
cnnQR_predicts.set_index('date', inplace=True)

In [None]:
cnnQR_predicts.plot()

In [None]:
print (cnnQR_predicts.index.min())
print (cnnQR_predicts.index.max())

Here we can see our prediction goes from October 31st to November 2nd, as expected given our 72-hour forecast horizon. We can also see the cyclical nature of the predictions over the entire timeframe.

Now we are going to create a dataframe of the prediction values from this Forecast and the actual values.

First, let us drop the item ID column before continuing.

In [None]:
cnnQR_predicts = cnnQR_predicts[['p10', 'p50', 'p90']]
cnnQR_predicts.plot()

In [None]:
# Inspect the column types and the index
cnnQR_predicts.info()

In [None]:
actual_df = pd.read_csv("../data/item-demand-time-validation.csv", names=['timestamp','value','item'])
actual_df.tail()

In [None]:
actual_df = actual_df[(actual_df['timestamp'] >= '2014-10-31') & (actual_df['timestamp'] < '2014-11-03')]

# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0
rows = []
for index, row in actual_df.iterrows():
    clean_timestamp = dateutil.parser.parse(row['timestamp'])
    rows.append({'timestamp': clean_timestamp, 'value': row['value'], 'source': 'actual'})
results_df = pd.DataFrame(rows, columns=['timestamp', 'value', 'source'])
                                   
validation_df = results_df.pivot(columns='source', values='value', index="timestamp")

In [None]:
validation_df.plot()

In [None]:
# Finally let us join the dataframes together
cnnQR_val_df = cnnQR_predicts.join(validation_df, how='outer')

In [None]:
# Plot
cnnQR_val_df.plot()
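
Alongside the plot, two simple numbers can summarize how the forecast compares to the actuals. The sketch below is one way to do it and is not part of the original evaluation: `interval_coverage` and `p50_mean_abs_error` are hypothetical helpers, and they assume a frame with `p10`, `p50`, `p90`, and `actual` columns like the joined `cnnQR_val_df`.

```python
import pandas as pd


def interval_coverage(frame: pd.DataFrame) -> float:
    """Fraction of rows where the actual value falls inside [p10, p90]."""
    scored = frame.dropna(subset=["p10", "p90", "actual"])
    inside = (scored["actual"] >= scored["p10"]) & (scored["actual"] <= scored["p90"])
    return float(inside.mean())


def p50_mean_abs_error(frame: pd.DataFrame) -> float:
    """Mean absolute error of the p50 forecast against the actual values."""
    scored = frame.dropna(subset=["p50", "actual"])
    return float((scored["actual"] - scored["p50"]).abs().mean())
```

If the quantiles are well calibrated, `interval_coverage(cnnQR_val_df)` should land somewhere near 0.8, since p10 to p90 is meant to span roughly 80% of outcomes.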

In [None]:
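# Optional clean-up: delete an exported forecast file from S3. The key below refers to an
# earlier DeepAR export; replace it with the key of the file you want to remove.
boto3.Session().resource('s3').Bucket(bucket_name).Object("DeepAR/mlimday_deep_ar_algo_forecast_2020-04-16T18-12-03Z_part0.csv").delete()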

In [None]:
%store datasetGroupArn
%store target_datasetArn
%store role_name
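key = target_time_series_filename  # assumption: `key` is the S3 object key uploaded earlier; it is not defined elsewhere in this notebook
%store key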
%store bucket_name
%store region
%store ds_import_job_arn
%store cnnQR_forecast_arn
%store cnnQR_arn
%store cnnQR_filename