## Using Amazon Forecast's Machine Learning tools to predict P&G future sales in Oman
Chirayu Khimji

MSc Applied Computational Science and Engineering

Imperial College London


Note: This notebook will not run on other machines, because the P&G data is private and stored only on my local machine. Its purpose is to showcase the power of data science for the KR CPG group.

The overall process for using Amazon Forecast is the following:

1. Create a Dataset Group. This is the container that isolates models, and the data they are trained on, from each other.
2. Create a Dataset. Forecast has three dataset types: Target Time Series, Related Time Series, and Item Metadata. The Target Time Series is required; the others provide additional context for certain algorithms.
3. Import data. This moves the information from S3 into a storage volume where it can be used for training and validation.
4. Train a model. Forecast automates this process, but you can also select particular algorithms, and either provide your own hyperparameters or use Hyperparameter Optimization (HPO) to determine the most performant values.
5. Deploy a Predictor. Here you deploy your model so you can use it to generate a forecast.
6. Query the Forecast. Given a request for an item bounded by time, return the forecast for it. Once you have this, you can evaluate its performance or use it to guide your decisions about the future.

In this notebook we walk through the steps outlined above. One additional task is to trim part of our training and validation data, so that we can measure the accuracy of a forecast against the actual values that were held back.

## Table Of Contents
* Setup
* Data Preparation
* Creating the Dataset Group and Dataset
* Next Steps


**Read Every Cell FULLY before executing it**

For more information about the APIs, please check the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html).

## Setup

Import the standard Python libraries that are used in this lesson.



In [1]:
import sys
import os
import json
import time
import pandas as pd
import glob
import paramiko
import boto3

# importing forecast notebook utility from notebooks/common directory
sys.path.insert( 0, os.path.abspath("../../common") )
import util

Configure the S3 bucket name and region name for this lesson.

- If you don't have an S3 bucket, create one in S3 first. If you used the CloudFormation Wizard to set up the environment, use the same bucket name that you specified during setup.
- Although the region is set to ap-south-1 as the default value below, you can choose any region in which the service is available.

In [2]:
##Code needs to be fixed
text_widget_bucket = util.create_text_widget( "bucket_name", "forecastdemochirayukhimji")
text_widget_region = util.create_text_widget( "region", "ap-south-1")

Text(value='', description='bucket_name', placeholder='forecastdemochirayukhimji')

Text(value='', description='region', placeholder='ap-south-1')

In [3]:
##Code needs to be fixed
bucket_name = "forecastdemochirayukhimji"
assert bucket_name, "bucket_name not set."

region = "ap-south-1"
assert region, "region not set."

The last part of the setup process is to validate that your account can communicate with Amazon Forecast. The cell below does just that.

In [4]:
##Code needs to be fixed
session = boto3.Session(region_name=region) 
forecast = session.client(service_name='forecast') 
forecastquery = session.client(service_name='forecastquery')
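
Creating the boto3 clients is not expected to send a request to Amazon Forecast on its own. A minimal sketch of an explicit connectivity check follows, assuming the `forecast` client from the cell above and an account that is permitted to call `ListDatasetGroups`. The helper name `check_forecast_access` is hypothetical.

```python
def check_forecast_access(forecast_client):
    """Return the dataset groups visible to this account (possibly an empty list).

    Any credential, permission, or region problem surfaces here as an
    exception raised by the client, rather than later in the notebook.
    """
    response = forecast_client.list_dataset_groups(MaxResults=1)
    return response.get("DatasetGroups", [])


# Usage in the notebook (requires valid AWS credentials):
# check_forecast_access(forecast)
```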

## Data Preparation
First we combine all the p&gsales_*.csv files using Pandas. This is done in order to get a single dataframe that is chronologically sorted as a time series from 2016 to 2020.

In [5]:
#df = pd.read_excel("./p&gsales_1.xlsx", index_col = "Customer")
df1 = pd.read_csv("./p&gsales_1.csv")
df2 = pd.read_csv("./p&gsales_2.csv")
df3 = pd.read_csv("./p&gsales_3.csv")
df4 = pd.read_csv("./p&gsales_4.csv")

frames = [df1, df2, df3, df4]
df = pd.concat(frames)

Check the first 5 entries, i.e. make sure they match the start of p&gsales_1.csv.

In [6]:
df.head()

Unnamed: 0,Customer,Date,Material,Billed Value,Billed Qty
0,60000707,03/01/2016,179045,40.752,24.0
1,60000707,03/01/2016,179047,13.296,4.0
2,60000707,03/01/2016,179051,18.176,64.0
3,60000707,03/01/2016,179053,13.296,4.0
4,60000707,03/01/2016,179058,40.752,24.0


Check the last 5 entries, i.e. make sure they match the end of p&gsales_4.csv.

In [7]:
df.tail()

Unnamed: 0,Customer,Date,Material,Billed Value,Billed Qty
302134,60097965,30/04/2020,212450,-8.556,-6.0
302135,60097965,30/04/2020,248420,-22.128,-12.0
302136,60097965,30/04/2020,266730,-6.24,-6.0
302137,60097965,30/04/2020,267363,-15.096,-12.0
302138,60097965,30/04/2020,283299,-27.36,-12.0


Print out the dimensions of this dataframe

In [8]:
print(df.shape)


(3056772, 5)
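
The `df.tail()` output above ends at index label 302138 although the frame has 3,056,772 rows, because `pd.concat` keeps each file's own index by default. A sketch of an alternative loader follows, assuming the files follow the `p&gsales_<n>.csv` naming pattern. The helper `load_sales` and its numeric sort are illustrative assumptions, not part of the original notebook.

```python
import glob
import re

import pandas as pd


def load_sales(pattern="./p&gsales_*.csv"):
    """Read every file matching `pattern` and stack them into one DataFrame."""

    def file_number(path):
        # Sort by the number in the file name so that _10 would follow _9.
        match = re.search(r"_(\d+)\.csv$", path)
        return int(match.group(1)) if match else 0

    paths = sorted(glob.glob(pattern), key=file_number)
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives one continuous 0..n-1 index across all files.
    return pd.concat(frames, ignore_index=True)
```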


## Further Cleaning the data and formatting columns to correct data types

Convert Date column into date-time data type

Convert Billed Value column into float data type

Convert Billed Qty column into float data type


In [9]:
df['Date'] = df['Date'].astype('datetime64[ns]')
df['Billed Value'] = df['Billed Value'].astype('float')
df['Billed Qty'] = df['Billed Qty'].astype('float')

Inspect the dataframe

In [10]:
df

Unnamed: 0,Customer,Date,Material,Billed Value,Billed Qty
0,60000707,2016-03-01,179045,40.752,24.0
1,60000707,2016-03-01,179047,13.296,4.0
2,60000707,2016-03-01,179051,18.176,64.0
3,60000707,2016-03-01,179053,13.296,4.0
4,60000707,2016-03-01,179058,40.752,24.0
...,...,...,...,...,...
302134,60097965,2020-04-30,212450,-8.556,-6.0
302135,60097965,2020-04-30,248420,-22.128,-12.0
302136,60097965,2020-04-30,266730,-6.240,-6.0
302137,60097965,2020-04-30,267363,-15.096,-12.0
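
In the output above, `03/01/2016` became `2016-03-01` (read month-first) while `30/04/2020` became `2020-04-30` (read day-first), so the inferred parsing is not consistent. If the source dates are all day-first, which is an assumption about the P&G export, passing an explicit format removes the ambiguity. A small sketch on made-up sample values:

```python
import pandas as pd

# Two sample strings in the same shape as the Date column (not real sales data).
sample = pd.Series(["03/01/2016", "30/04/2020"])

# Assumption: every date is day/month/year. An explicit format raises an
# error on any value that does not fit, instead of silently guessing.
parsed = pd.to_datetime(sample, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2016-01-03', '2020-04-30']
```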


## Create Training and Validation Set
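
As described in the introduction, part of the data is held back so that a forecast can later be compared with actual sales. A minimal sketch of a date-based split follows, assuming `Date` is already a datetime column. The cutoff date and the helper name `split_by_date` are hypothetical choices, not taken from the original analysis.

```python
import pandas as pd


def split_by_date(frame, cutoff, date_col="Date"):
    """Return (train, validation): rows before `cutoff`, and rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame[date_col] < cutoff]
    validation = frame[frame[date_col] >= cutoff]
    return train, validation


# Example with a hypothetical cutoff: hold out everything from April 2020 on.
# train_df, validation_df = split_by_date(df, "2020-04-01")
```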

## Creating the Dataset Group and Dataset <a class="anchor" id="dataset"></a>

In Amazon Forecast, a dataset is a collection of one or more files containing data that is relevant to a forecasting task. A dataset must conform to a schema provided by Amazon Forecast.

More details about `Domain` and dataset type can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we are using the [CUSTOM](https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html) domain with its 3 required attributes: `timestamp`, `target_value` and `item_id`.
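
To make the CUSTOM-domain requirement concrete, here is a sketch of a target time series schema and the client calls that create and link the dataset group and dataset. The column order in the schema and the helper name `create_target_dataset` are assumptions. `create_dataset_group`, `create_dataset` and `update_dataset_group` are boto3 Forecast client operations.

```python
# Attribute order should match the column order of the CSV that is imported.
TARGET_SCHEMA = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}


def create_target_dataset(forecast_client, group_name, dataset_name, frequency="D"):
    """Create a CUSTOM dataset group and a target time series dataset.

    Returns (dataset_group_arn, dataset_arn).
    """
    group = forecast_client.create_dataset_group(
        DatasetGroupName=group_name, Domain="CUSTOM"
    )
    dataset = forecast_client.create_dataset(
        Domain="CUSTOM",
        DatasetType="TARGET_TIME_SERIES",
        DatasetName=dataset_name,
        DataFrequency=frequency,
        Schema=TARGET_SCHEMA,
    )
    # Attach the dataset to the group so predictors trained on the group see it.
    forecast_client.update_dataset_group(
        DatasetGroupArn=group["DatasetGroupArn"],
        DatasetArns=[dataset["DatasetArn"]],
    )
    return group["DatasetGroupArn"], dataset["DatasetArn"]
```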


It is also important to tell Amazon Forecast how to interpret your time-series information. The cell immediately below does that; the next one configures the variable names for the Project, DatasetGroup, and Dataset.

In [11]:
DATASET_FREQUENCY = "D"  
TIMESTAMP_FORMAT = "yyyy-MM-dd"
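
With a daily frequency, the series being modelled is one value per item per day, with timestamps written as `yyyy-MM-dd`. A sketch of building that shape explicitly from the cleaned dataframe follows, assuming `Material` is the item identifier and `Billed Qty` is the quantity to forecast (both are assumptions about the intended setup). The helper name `to_target_time_series` is hypothetical.

```python
import pandas as pd


def to_target_time_series(frame):
    """Sum sales to one row per item per day, in timestamp, target_value, item_id order."""
    daily = (
        frame.groupby([frame["Date"].dt.normalize(), "Material"])["Billed Qty"]
        .sum()
        .reset_index()
    )
    return pd.DataFrame(
        {
            # strftime's "%Y-%m-%d" corresponds to Forecast's "yyyy-MM-dd".
            "timestamp": daily["Date"].dt.strftime("%Y-%m-%d"),
            "target_value": daily["Billed Qty"],
            "item_id": daily["Material"].astype(str),
        }
    )
```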

In [12]:
project = 'KRCPG_P&G_forecastdemo'
datasetName= project+'_ds'
datasetGroupName= project +'_dsg'
s3DataPath = "s3://"+bucket_name+"/"+key

NameError: name 'key' is not defined
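
The `NameError` above occurs because `key` is used before it is assigned. A minimal sketch of a fix follows, using a placeholder object key, since the real S3 key is not given in this notebook. It also swaps the `&` out of the project name, on the understanding that Forecast resource names must start with a letter and contain only letters, digits, and underscores. The helper `is_valid_forecast_name` and the replacement project name are hypothetical.

```python
import re

bucket_name = "forecastdemochirayukhimji"

# Placeholder: replace with the actual key of the CSV uploaded to the bucket.
key = "path/to/target_time_series.csv"


def is_valid_forecast_name(name):
    """Check a name against the pattern ^[a-zA-Z][a-zA-Z0-9_]*$."""
    return re.fullmatch(r"[a-zA-Z][a-zA-Z0-9_]*", name) is not None


project = "KRCPG_PG_forecastdemo"  # hypothetical rename without the "&"
datasetName = project + "_ds"
datasetGroupName = project + "_dsg"
s3DataPath = "s3://" + bucket_name + "/" + key

print(is_valid_forecast_name("KRCPG_P&G_forecastdemo"))  # False
print(is_valid_forecast_name(datasetName))  # True
print(s3DataPath)
```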