# Module 4 Companion Notebook

This Jupyter notebook is the companion notebook for the Module 4 demonstration, Creating a Forecast with Amazon Forecast.


## Dataset attributions

This notebook uses the following dataset: 

[Online Retail II Data Set](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)


This dataset is from:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


## Instructor notes

In this demonstration, you will show how to create a forecast by using Amazon Forecast. Students will work through the same process as part of the lab. Thus, you could choose to use this demonstration as a summary of the lab, or you could also omit the demonstration, if needed.

You could choose to deliver this demonstration in a few different ways:

1. (Recommended) Run the entire notebook before the demonstration, and walk through the console instructions at the end of this notebook. See the section in this notebook titled **Reviewing the creation of the forecast in the console**.

2. Work through this notebook with the students. (**Note:** If you choose this option, it can take an hour to complete this demonstration.)

3. Prepare the data by using this notebook, but create the forecast in the console. See the section in this notebook titled **Creating the forecast by using the console**. (**Note:** If you choose this option, it can take an hour to complete this demonstration.)

Regardless of your choice, you should review this notebook in its entirety before you start the demonstration. 

## Notebook summary

This notebook loads and preprocesses the Online Retail II dataset. The data is uploaded to Amazon Simple Storage Service (Amazon S3), where it is used to create a forecast by using Amazon Forecast. This notebook performs the following steps:

- **Importing Python packages and creating functions** – Imports the packages that are used and creates helper functions
- **Importing data** – Downloads and loads the data into a pandas DataFrame
- **Pre-processing data** – Filters the data so that it's ready for training
- **Generating training and test DataFrames** – Downsamples the data to a daily frequency, and splits the dataset into a training and test DataFrame
- **Uploading to Amazon S3** – Uploads the DataFrames to Amazon S3 as comma-separated values (CSV) files
- **Creating the Amazon Forecast dataset group** – Creates the project dataset group
- **Creating the datasets** – Creates the datasets in the dataset group and waits for the import to complete
- **Creating the predictor** – Trains the predictor by using the dataset group
- **Getting accuracy metrics** – Displays the metrics for the predictor
- **Creating the forecast** – Creates a test forecast
- **Optional: Cleaning up** – Lists instructions for performing cleanup after the demonstration is complete

This notebook takes 60–90 minutes to complete.




## Importing Python packages and creating functions

The following code imports these packages:

- *boto3* represents the AWS SDK for Python (Boto3), which is the Python library for AWS
- *pandas* provides DataFrames for manipulating time series data
- *matplotlib* provides plotting functions
- *sagemaker* represents the API that's needed to work with Amazon SageMaker
- *time*, *sys*, *os*, *io*, and *json* provide helper functions 

The code also creates two helper functions:

- `upload_s3_csv` uploads pandas DataFrames to Amazon S3 as CSV files. The header is removed, but the index is *not* removed.
- `StatusIndicator` provides a status function for long-running API calls to Forecast.
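
As a rough illustration of the file layout that `upload_s3_csv` produces, the following sketch writes a small, made-up DataFrame to an in-memory buffer with the same `to_csv` options. The sample rows and the `sample` name are assumptions for illustration only, and nothing is uploaded to Amazon S3.

```python
import io

import pandas as pd

# Made-up rows that mimic the shape of the daily time series used later.
sample = pd.DataFrame(
    {"StockCode": [21232, 21232], "Quantity": [5, 7]},
    index=pd.DatetimeIndex(["2011-01-01", "2011-01-02"], name="InvoiceDate"),
)

# Same options as the helper: no header row, but keep the index column.
csv_buffer = io.StringIO()
sample.to_csv(csv_buffer, header=False, index=True)

csv_lines = csv_buffer.getvalue().splitlines()
for line in csv_lines:
    print(line)
```

Because the header row is dropped, the column order in the file (timestamp, item, value) is the only thing that ties each column to the dataset schema that is defined later in this notebook.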



In [None]:
import warnings
warnings.filterwarnings('ignore')
bucket_name='c45317a617679l1523854t1w00381652629-sandboxbucket-3apxi73oxsw6'

import boto3
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import time, sys, os, io, json
import sagemaker

s3_resource = boto3.Session().resource('s3')

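def upload_s3_csv(filename, folder, dataframe):
    # Relies on the global `prefix`, which is defined in the upload cell later in this notebook.
    csv_buffer = io.StringIO()
    # Write a local copy, then write the same CSV to an in-memory buffer for the upload.
    dataframe.to_csv(filename, header=False, index=True)
    dataframe.to_csv(csv_buffer, header=False, index=True)
    s3_resource.Bucket(bucket_name).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())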

class StatusIndicator:
    
    def __init__(self):
        self.previous_status = None
        self.need_newline = False
        
    def update( self, status ):
        if self.previous_status != status:
            if self.need_newline:
                sys.stdout.write("\n")
            sys.stdout.write( status + " ")
            self.need_newline = True
            self.previous_status = status
        else:
            sys.stdout.write(".")
            self.need_newline = True
        sys.stdout.flush()

    def end(self):
        if self.need_newline:
            sys.stdout.write("\n")

## Importing data

The following cell downloads the dataset, which is a Microsoft Excel file. This file is loaded into pandas as a DataFrame.

This cell takes 1–2 minutes to complete.

In [None]:
%%time

session = boto3.Session()
forecast = session.client(service_name='forecast') 
forecast_query = session.client(service_name='forecastquery')

!aws s3 cp s3://aws-tc-largeobjects/CUR-TF-200-ACAIML-1/forecast/ . --recursive
retail = pd.concat(pd.read_excel('online_retail_II.xlsx',sheet_name=None), ignore_index=True)

## Pre-processing data

The following cell completes these pre-processing steps:

- Removes instances with missing values
- Sets the index to the InvoiceDate feature
- Only keeps instances that are from the United Kingdom
- Only keeps instances that use the target stock code (21232)
- Keeps instances where the price is greater than 0
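
The effect of these filters can be sketched on a handful of made-up rows. The column names follow the Online Retail II dataset, but the values and the `raw` and `cleaned` names are assumptions for illustration. Only one of the five rows survives every filter.

```python
import pandas as pd

# Made-up rows; each of the last four rows fails exactly one filter.
raw = pd.DataFrame(
    {
        "InvoiceDate": [
            "2011-01-01 10:00",
            "2011-01-01 11:30",
            "2011-01-02 09:15",
            "2011-01-03 14:00",
            "2011-01-04 08:45",
        ],
        "StockCode": [21232, 21232, 22423, 21232, 21232],
        "Price": [1.25, 0.0, 2.10, 1.25, 1.25],
        "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France", "United Kingdom"],
        "Customer ID": [13085.0, 13085.0, 13078.0, 12345.0, None],
    }
)

cleaned = raw.dropna()  # drops the row with a missing customer ID
cleaned = cleaned.assign(InvoiceDate=pd.to_datetime(cleaned["InvoiceDate"]))
cleaned = cleaned.set_index("InvoiceDate")
cleaned = cleaned[cleaned["Country"].isin(["United Kingdom"])]  # drops the France row
cleaned = cleaned[cleaned["StockCode"].isin([21232])]  # drops stock code 22423
cleaned = cleaned[cleaned["Price"] > 0]  # drops the zero-price row

print(len(raw), "rows before,", len(cleaned), "row(s) after")
```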



In [None]:
retail = retail.dropna()
retail['InvoiceDate'] = pd.to_datetime(retail.InvoiceDate)
retail = retail.set_index('InvoiceDate')

country_filter = ['United Kingdom']
retail = retail[retail['Country'].isin(country_filter)]

#stockcodes = ['ADJUST', 'ADJUST2', 'POST', 'M']
#stockcodes = [21232,22423]
stockcodes = [21232]
retail = retail[retail.StockCode.isin(stockcodes)]

retail = retail[retail['Price']>0]

## Generating training and test DataFrames

The following cell:

- Splits the data into a time series DataFrame and a related time series DataFrame.
- Downsamples the data from multiple sales entries per day into a single daily value. The **Quantity** column is summed, and the mean is used for the **Price** column.
- Splits the DataFrames into a training set that contains data from December 2009–October 2011, and a test set that contains data from November 2011–December 2011.
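
The downsampling and the date-based split can be sketched for a single item as follows. This is a simplified version that skips the per-item `groupby`; the sample values and the `sales`, `train_part`, and `test_part` names are assumptions for illustration.

```python
import pandas as pd

# Made-up sales entries: two on the same day, then a gap, then two more days.
sales = pd.DataFrame(
    {"Quantity": [2, 3, 4, 1], "Price": [1.0, 2.0, 1.5, 1.5]},
    index=pd.to_datetime(
        ["2011-10-30 09:00", "2011-10-30 15:00", "2011-11-01 10:00", "2011-11-02 12:00"]
    ),
)

# One value per day: quantities are summed, prices are averaged.
daily_quantity = sales["Quantity"].resample("D").sum()
daily_price = sales["Price"].resample("D").mean()

# Partial date strings select whole months, inclusive on both ends.
train_part = daily_quantity.loc["2009-12":"2011-10"]
test_part = daily_quantity.loc["2011-11":"2011-12"]

print(train_part)
print(test_part)
print(daily_price)
```

Days without any sales get a quantity of 0 but a missing price, which is one reason the predictor configuration later in this notebook includes filling rules for the price feature.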



In [None]:
%%time

retail_timeseries = retail[['StockCode','Quantity']]

retail_timeseries = retail_timeseries.groupby('StockCode').resample('D').sum().reset_index().set_index(['InvoiceDate'])

df_related_time_series = retail[['StockCode','Price']]
df_related_time_series2 = df_related_time_series.groupby('StockCode').resample('D').mean().reset_index().set_index(['InvoiceDate'])
df_related_time_series3 = df_related_time_series2.groupby('StockCode').ffill()

#df_related_time_series4 = df_related_time_series3.reset_index().set_index('InvoiceDate')

# Training data: December 2009 through October 2011. Test data: November through December 2011.
jan_to_oct = retail_timeseries['2009-12':'2011-10']
nov_to_dec = retail_timeseries['2011-11':'2011-12']
jan_to_oct_related = df_related_time_series2['2009-12':'2011-10']

## Uploading to Amazon S3

The following cell uploads the DataFrames to Amazon S3 by using the helper function that was created earlier.

*Tip:* Update the prefix to something unique. If previous demos have not cleaned up completely, the notebook will fail. Changing the prefix will avoid this.
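
One way to get a unique prefix is to append a timestamp, as in the following sketch. The `make_unique_prefix` helper is hypothetical and not part of this notebook. The name pattern in the code reflects an assumption that Amazon Forecast resource names must start with a letter and contain only letters, digits, and underscores; confirm this against the current documentation.

```python
import re
from datetime import datetime, timezone

# Assumed naming rule for Amazon Forecast resources (verify before relying on it).
NAME_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def make_unique_prefix(base, now=None):
    """Return base plus a UTC timestamp suffix made of digits and underscores."""
    now = now or datetime.now(timezone.utc)
    candidate = f"{base}_{now.strftime('%Y%m%d_%H%M%S')}"
    if not NAME_PATTERN.fullmatch(candidate):
        raise ValueError(f"Prefix {candidate!r} contains unsupported characters")
    return candidate


# A fixed time is passed here so that the example output is reproducible.
example_prefix = make_unique_prefix(
    "mod_4_demo", datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)
)
print(example_prefix)
```

In the notebook, you would call `make_unique_prefix('mod_4_demo')` without the second argument and assign the result to `prefix`.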

In [None]:
%%time

prefix='mod_4_demo'
train='retail_ts_train.csv'
train_related='related_ts_train.csv'
test='retail_ts_test.csv'

key=prefix + '/forecast/' + train
# key='lab_4_forecast_t/forecast/retail_time_series_train.csv'
related_key = prefix + '/forecast/' + train_related
# related_key='lab_4_forecast_t/forecast/related.csv'

upload_s3_csv(train, 'forecast', jan_to_oct)
upload_s3_csv(train_related, 'forecast', jan_to_oct_related)
upload_s3_csv(test, 'forecast', nov_to_dec)

dataset_frequency = "D" 
timestamp_format = "yyyy-MM-dd"

# project = prefix
dataset_name = prefix+'_ds'
related_dataset_name = prefix+'_rds'
dataset_group_name = prefix +'_dsg'

s3_data_path = "s3://"+bucket_name+"/"+key
s3_related_data_path = "s3://"+bucket_name+"/"+related_key

In [None]:

%store prefix
%store train
%store test
%store key

# STOP HERE IF YOU ARE GOING TO DEMONSTRATE THE PROCESS OF CREATING FORECASTS IN THE CONSOLE

For steps that outline how to create the forecast, see the **Creating the forecast by using the console** instructions at the end of this notebook.

**Tip**: Remove this cell before you deliver the demonstration to students.

## Creating the Amazon Forecast dataset group

The following cell creates the dataset group for the forecast.

In [None]:
%%time
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=dataset_group_name,
                                                              Domain="RETAIL"
                                                             )
dataset_group_arn = create_dataset_group_response['DatasetGroupArn']

## Creating the datasets

The following cell creates the time series and related datasets, and adds them to the dataset group.

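The cell loops and displays the status until both dataset imports are complete.

**Note:** If necessary, update the role ARN in the following cell with the ARN from your sandbox environment. By default, the cell looks up the role named `ForecastRole`.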
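
The wait loops in this notebook all follow the same pattern: poll a status, stop on a terminal value, and sleep between checks. A minimal sketch of that pattern with an upper bound on the number of checks might look like the following. The `wait_for_status` helper and its parameters are hypothetical, and a fake status source stands in for the Amazon Forecast API so that the example runs without AWS access.

```python
import time


def wait_for_status(describe, terminal=("ACTIVE", "CREATE_FAILED"),
                    delay=10, max_checks=360, sleep=time.sleep):
    """Call describe() until it returns a terminal status or max_checks is reached."""
    status = None
    for _ in range(max_checks):
        status = describe()
        if status in terminal:
            return status
        sleep(delay)
    raise TimeoutError(f"Last status was {status!r} after {max_checks} checks")


# Stand-in for a call such as
# forecast.describe_dataset_import_job(DatasetImportJobArn=...)['Status']
fake_statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

Unlike a bare `while True` loop, this version raises an error instead of polling forever if the job never reaches a terminal status.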

In [None]:
%%time

iam = boto3.resource('iam')
role_arn = iam.Role('ForecastRole').arn


# This is the schema of the timeseries dataset.
schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"demand",
         "AttributeType":"float"
      }
   ]
}

time_series_response=forecast.create_dataset(
                    Domain="RETAIL",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=dataset_name,
                    DataFrequency=dataset_frequency, 
                    Schema = schema
)

dataset_arn = time_series_response['DatasetArn']
# forecast.describe_dataset(DatasetArn=dataset_arn)

# Create the import job for the time series dataset
dataset_import_job_name = 'EP_DSIMPORT_JOB_TARGET'
data_source = {"S3Config" : {"Path":s3_data_path,"RoleArn": role_arn} }
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=dataset_import_job_name,
                                                          DatasetArn=dataset_arn,
                                                          DataSource= data_source,
                                                          TimestampFormat=timestamp_format
                                                         )

ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']

# This is the schema of the related data, containing the price.
related_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"price",
         "AttributeType":"float"
      }
   ]
}

related_time_series_response=forecast.create_dataset(
                    Domain="RETAIL",
                    DatasetType='RELATED_TIME_SERIES',
                    DatasetName=related_dataset_name,
                    DataFrequency=dataset_frequency, 
                    Schema = related_schema
)
related_dataset_arn = related_time_series_response['DatasetArn']

# forecast.describe_dataset(DatasetArn=related_dataset_arn)


related_dataset_import_job_name = 'EP_DSIMPORT_JOB_TARGET_RELATED'

related_data_source = {"S3Config" : {"Path":s3_related_data_path,"RoleArn": role_arn} }

ds_related_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=related_dataset_import_job_name,
                                                          DatasetArn=related_dataset_arn,
                                                          DataSource= related_data_source,
                                                          TimestampFormat=timestamp_format
                                                         )

ds_related_import_job_arn=ds_related_import_job_response['DatasetImportJobArn']

# Add the time series and related dataset to the dataset group.
forecast.update_dataset_group(DatasetGroupArn=dataset_group_arn, DatasetArns=[dataset_arn, related_dataset_arn])
#forecast.update_dataset_group(DatasetGroupArn=dataset_group_arn, DatasetArns=[dataset_arn])

# Wait for the related dataset to finish
status_indicator = StatusIndicator()

while True:
    status = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_related_import_job_arn)['Status']
    status_indicator.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator.end()

# Wait for the time series dataset to finish - this typically takes longer than the related set.
status_indicator = StatusIndicator()

while True:
    status = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
    status_indicator.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator.end()

The following code stores the Amazon Resource Names (ARNs) for the forecast objects that were previously created. They can be loaded from other notebooks.

In [None]:
%store ds_import_job_arn
%store dataset_arn
%store dataset_group_arn
%store related_dataset_arn
%store ds_related_import_job_arn

## Creating the predictor

The following cell creates the predictor by using the following parameters:

- The forecast horizon is set to *90 days*.
- *DeepAR+* is the selected algorithm. For more information, see [DeepAR+ Algorithm](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html) in the AWS Documentation.
- Hyperparameters for the algorithm are defined in `training_parameters`. These hyperparameters were generated by running the forecast with **PerformHPO** set to *true*. This setting created a hyperparameter tuning job on the model, which produced the values that you see in the following cell. In the cell as written, the **TrainingParameters** argument is commented out, so these values are not passed to the predictor unless you uncomment it.
- A single backtest window with an offset of *90 days* is used.
- The **input_data_config** field is set to the dataset group that was created previously.
- Holidays in the United Kingdom are added as supplementary features.
- A featurization pipeline is created for the price features. For more information, see the [Handling Missing Values](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-missing-values.html) topic in the documentation.

The cell loops and displays the status until the predictor is created.
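
To give a sense of what a single backtest window means, the following sketch holds out the last 90 points of a made-up daily series for evaluation. Amazon Forecast performs this split internally; the code reflects an assumption about how the offset and the horizon divide the data, and the `series`, `training_part`, and `evaluation_part` names are illustrative only.

```python
import pandas as pd

forecast_horizon = 90
backtest_window_offset = 90

# Made-up daily series that stands in for the target time series.
dates = pd.date_range("2011-01-01", periods=300, freq="D")
series = pd.Series(range(300), index=dates)

# Everything before the offset is used for training; the next
# forecast_horizon points are held out and compared with the predictions.
split_point = len(series) - backtest_window_offset
training_part = series.iloc[:split_point]
evaluation_part = series.iloc[split_point:split_point + forecast_horizon]

print(len(training_part), "training points,", len(evaluation_part), "evaluation points")
```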

In [None]:
%%time

predictor_name= prefix+'_deeparp_algo'
forecast_horizon = 90
algorithm_arn = 'arn:aws:forecast:::algorithm/Deep_AR_Plus'

training_parameters =  {'context_length': '172', 
                        'epochs': '500', 
                        'learning_rate': '0.00023391131837525837', 
                        'learning_rate_decay': '0.5', 
                        'likelihood': 'student-t', 
                        'max_learning_rate_decays': '0', 
                        'num_averaged_models': '1', 
                        'num_cells': '40', 
                        'num_layers': '2', 
                        'prediction_length': '30'}

evaluation_parameters= {"NumberOfBacktestWindows": 1, "BackTestWindowOffset": 90}

input_data_config = {"DatasetGroupArn": dataset_group_arn, "SupplementaryFeatures": [ {"Name": "holiday","Value": "UK"} ]}
                  
featurization_config= {"ForecastFrequency": dataset_frequency,
                      "Featurizations": 
                      [
                          {
                            "AttributeName": "price",
                            "FeaturizationPipeline": [
                                {
                                    "FeaturizationMethodName": "filling",
                                    "FeaturizationMethodParameters": {
                                        "middlefill": "median",
                                        "backfill": "min",
                                        "futurefill": "max"               
                                        }
                                }
                            ]
                        }
                      ]}


create_predictor_response=forecast.create_predictor(PredictorName = predictor_name, 
                                                  AlgorithmArn = algorithm_arn,
                                                  ForecastHorizon = forecast_horizon,
                                                  PerformAutoML = False,
                                                  PerformHPO = False,
                                                  EvaluationParameters= evaluation_parameters, 
                                                  InputDataConfig = input_data_config,
                                                  FeaturizationConfig = featurization_config #,
#                                                   TrainingParameters = training_parameters
                                                 )

predictor_arn = create_predictor_response['PredictorArn']
status_indicator = StatusIndicator()

while True:
    status = forecast.describe_predictor(PredictorArn=predictor_arn)['Status']
    status_indicator.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator.end()

In [None]:
f = forecast.describe_predictor(PredictorArn=predictor_arn)
print(f['TrainingParameters'])

## Getting accuracy metrics

The next cell prints the accuracy metrics for the predictor that was just created.
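
The metrics that are returned include a root mean square error (RMSE) and weighted quantile losses. As an aid to reading those numbers, the following sketch computes both for a few made-up values. The formulas are written as I understand them from the Amazon Forecast documentation, and the service aggregates over items and backtest windows in ways that this example does not reproduce.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def weighted_quantile_loss(actual, quantile_forecast, tau):
    """Quantile loss at level tau, scaled by the total absolute demand."""
    loss = sum(
        tau * max(a - q, 0) + (1 - tau) * max(q - a, 0)
        for a, q in zip(actual, quantile_forecast)
    )
    return 2 * loss / sum(abs(a) for a in actual)


# Made-up demand values and quantile forecasts.
actual = [10, 12, 8, 11]
p50 = [9, 12, 10, 11]
p90 = [13, 15, 12, 14]

print("RMSE:", rmse(actual, p50))
print("wQL[0.5]:", weighted_quantile_loss(actual, p50, 0.5))
print("wQL[0.9]:", weighted_quantile_loss(actual, p90, 0.9))
```

Lower values are better for both metrics. The quantile loss at 0.9 penalizes under-forecasting more heavily than over-forecasting.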

In [None]:
forecast.get_accuracy_metrics(PredictorArn=predictor_arn)

## Creating the forecast

The following cell creates a forecast from the predictor that was created previously. 

The predictor and forecast ARN values are stored so that they can be retrieved from the lab notebook.



In [None]:
%%time
forecast_Name= prefix+'_deeparp_algo_forecast'
create_forecast_response=forecast.create_forecast(ForecastName=forecast_Name,
                                                  PredictorArn=predictor_arn)
forecast_arn = create_forecast_response['ForecastArn']

In [None]:
%store forecast_arn
%store predictor_arn

In [None]:
%%time
status_indicator = StatusIndicator()
while True:
    status = forecast.describe_forecast(ForecastArn=forecast_arn)['Status']
    status_indicator.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator.end()

print(forecast_arn)

The following cell queries the forecast for item 21232 as a quick test, which is useful for troubleshooting.

In [None]:
print()
forecast_response = forecast_query.query_forecast(
    ForecastArn=forecast_arn,
    Filters={"item_id":"21232"}
)
print(forecast_response)
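
The printed response is a nested dictionary that can be hard to read. The following sketch shows one way to turn the predictions into a pandas DataFrame with one column per quantile. The `sample_response` dictionary is a hand-written stand-in with made-up values that follows the shape of a query response as I understand it, and `predictions_to_frame` is a hypothetical helper.

```python
import pandas as pd

# Hand-written stand-in for a query response; the values are made up.
sample_response = {
    "Forecast": {
        "Predictions": {
            "p10": [
                {"Timestamp": "2011-11-01T00:00:00", "Value": 1.0},
                {"Timestamp": "2011-11-02T00:00:00", "Value": 2.0},
            ],
            "p50": [
                {"Timestamp": "2011-11-01T00:00:00", "Value": 3.0},
                {"Timestamp": "2011-11-02T00:00:00", "Value": 4.0},
            ],
            "p90": [
                {"Timestamp": "2011-11-01T00:00:00", "Value": 6.0},
                {"Timestamp": "2011-11-02T00:00:00", "Value": 7.0},
            ],
        }
    }
}


def predictions_to_frame(response):
    """Return a DataFrame indexed by timestamp with one column per quantile."""
    predictions = response["Forecast"]["Predictions"]
    columns = {
        quantile: pd.Series(
            [point["Value"] for point in points],
            index=pd.to_datetime([point["Timestamp"] for point in points]),
        )
        for quantile, points in predictions.items()
    }
    return pd.DataFrame(columns)


forecast_frame = predictions_to_frame(sample_response)
print(forecast_frame)
```

In the notebook, you could pass `forecast_response` to the helper and then call `forecast_frame.plot()` to chart the quantiles.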

## Optional: Cleaning up

To delete the forecast that was generated by using this notebook, select the following cell, change the cell to code by pressing Y, and then run it.

forecast.delete_forecast(ForecastArn=forecast_arn)
time.sleep(60)

forecast.delete_predictor(PredictorArn=predictor_arn)
time.sleep(60)

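# Delete both import jobs (target and related) before deleting their datasets.
forecast.delete_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)
forecast.delete_dataset_import_job(DatasetImportJobArn=ds_related_import_job_arn)
time.sleep(60)

forecast.delete_dataset(DatasetArn=dataset_arn)
forecast.delete_dataset(DatasetArn=related_dataset_arn)
time.sleep(60)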

forecast.delete_dataset_group(DatasetGroupArn=dataset_group_arn)

## Demonstration steps complete!

If you walked through the code with students, you can stop here and delete the following cells.


# CREATING THE FORECAST BY USING THE CONSOLE: 
# STEP-BY-STEP INSTRUCTIONS

The following instructions demonstrate how to complete the demonstration by using the console. 

**Note:** You should have run the previous cells to create the dataset *and* upload the data to Amazon S3.

## Task 1: Creating the dataset group

1. On the AWS Management Console, on the **Services** menu, choose **Amazon Forecast**.
2. Choose **Create dataset group**, and in the form, provide these values.

    - **Dataset group name**: Enter an appropriate name
    - **Forecasting domain**: *Retail*

Your screen should look similar to the following example:

![Screen capture of the Creating dataset group task](images/mod4-demo.PNG)

3. Choose **Next**.




## Task 2: Creating the target time series dataset

1. In the **Dataset name** box, enter an appropriate name.
2. Update the **Data schema** by moving the timestamp to the first position, as in the following example:

![Screen capture of the Creating target time series dataset task](images/mod4-demo2.PNG)

3. Choose **Next**.

Make a note of the S3 bucket where the data is located by running the following cell. You will need this information in the next step.

In [None]:
print(f's3://{bucket_name}/{prefix}/forecast/{train}')

## Task 3: Importing target time series data

1. In the **Dataset import details** form, provide these values.

    - **Dataset import name**: Enter an appropriate name
    - **Timestamp format**: `yyyy-MM-dd`
    - **IAM Role**: Select the existing role (**Note:** This role was created as part of creating the sandbox environment)
    - **Dataset location**: Enter the path to the S3 bucket

The screen should look like the following example:

![Screen capture of the Importing the target time series data task](images/mod4-demo3.PNG)

2. Choose **Start import**.


**Note:** It takes 5–20 minutes for the data to be imported. *Make sure that the import is completed before you proceed.*

## Task 4: Training a predictor

1. Under the **Train a predictor** section, choose **Start**.
2. Provide the following values.

    - **Predictor name**: Enter an appropriate name
    - **Forecast horizon**: `90` 
    - **Algorithm selection**: *Manual*
    - **Algorithm**: *Deep_AR_Plus*
    - **Country for holidays**: *United Kingdom* 

Your screen should look similar to the following example:

![Screen capture of the Training a predictor task](images/mod4-demo4.PNG)

3. Choose **Train predictor**.

**Note:** It takes 20–40 minutes to train the predictor. *Make sure that the predictor training has finished before you continue.*

## Task 5: Generating the forecast

1. In the **Generate forecasts** section, choose **Start**, and provide these values.
    - **Forecast name**: Enter an appropriate name
    - **Predictor**: Select the predictor that you just created 
    - **Forecast types**: Leave this box empty

Your screen should look like the following example:

![Screen capture of the Generating the forecast task](images/mod4-demo5.PNG)

2. Choose **Create a forecast**.


**Note:** It takes 20–40 minutes to create the forecast. *Make sure that the forecast has been created before you continue.*

## Task 6: Looking up the forecast

1. Choose **Lookup forecast** and provide these values.
    - **Forecast**: Select the forecast that you just created 
    - **Start date**: `2011/10/02`
    - **End date**: `2011/12/31`
    - **Value**: `21232`

Your screen should look similar to the following example:

![Screen capture of the Looking up the forecast task](images/mod4-demo6.PNG)

2. Choose **Get Forecast**.



In a few seconds, you should get a forecast that's similar to the following example:

![Screen capture of the forecast results](images/mod4-demo7.PNG)

# Reviewing the creation of the forecast in the console 



## Task 1: Reviewing the datasets

1. Choose **View dataset groups**.
2. From the list of dataset groups, select **mod_4_demo_dsg**.

If you have previously created the forecast, your Amazon Forecast dashboard should look like the following example:
    
![Amazon Forecast dashboard](images/mod4-demo8.PNG)

3. Choose **View datasets**.

Your screen should look like the following example:

![Dataset View](images/mod4-demo9.PNG)

4. Choose **mod_4_demo_ds**, which is the time series dataset.
5. With the students, review the **Dataset import field statistics** and **Schema** sections.
6. To return to the dataset dashboard, choose **Dashboard**.


## Task 2: Reviewing the predictor

1. Choose **View predictors**.
2. From the list of predictors, select **mod_4_demo_deeparp_algo**.

Your screen should look similar to the following example:

![Predictor overview](images/mod4-demo9.PNG)


3. Review the **Forecast Configurations** section, and point out the **Forecast horizon**, **Forecast frequency**, and **Country for holidays** settings.
4. Review the **Predictor metrics** section.
5. To return to the dataset dashboard, choose **Dashboard**.


## Task 3: Looking up a forecast

1. Choose **Lookup forecast** and provide these values.
    - **Forecast**: *mod_4_demo_deeparp_algo_forecast*
    - **Start date**: `2011/10/02`
    - **End date**: `2011/12/31`
    - **Value**: `21232`
2. Choose **Get Forecast**.



In a few seconds, you should get a forecast that's similar to the following example:

![Screen capture of the forecast results](images/mod4-demo7.PNG)

## Console demonstration steps complete!