# Getting Data Ready

The overall process for using Amazon Forecast is the following:

1. Create a Dataset Group. This is the container that isolates models, and the data they are trained on, from one another.
1. Create a Dataset. Forecast has 3 dataset types: Target Time Series, Related Time Series, and Item Metadata. The Target Time Series is required; the others provide additional context for certain algorithms.
1. Import data. This moves the information from S3 into a storage volume where it can be used for training and validation.
1. Train a model. Forecast can automate this process for you, but you can also select a particular algorithm and either provide your own hyperparameters or use Hyperparameter Optimization (HPO) to find the best-performing values.
1. Deploy a Predictor. Here you deploy your model so that you can use it to generate a forecast.
1. Query the Predictor. Given a request for an item bounded by time, the Predictor returns the forecast for it. You can then evaluate its performance or use it to guide your decisions about the future.

In this notebook we will walk through the first 3 steps outlined above. One additional task is to hold back part of our data so that we can later measure the accuracy of a forecast against the actual values.


## Table Of Contents
* Setup
* Data Preparation
* Creating the Dataset Group and Dataset
* Next Steps


**Read Every Cell FULLY before executing it**

For more information about the APIs, please check the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html).

## Setup
Amazon Forecast is still in preview. To update to the latest functionality, execute the cells below.

In [1]:
# Configures your AWS CLI for Amazon Forecast
!aws configure add-model --service-model file://../sdk/forecastquery-2019-08-12.normal.json --service-name forecastquery
!aws configure add-model --service-model file://../sdk/forecast-2019-08-12.normal.json --service-name forecast

Next import the standard Python Libraries that are used in this lesson.

In [2]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time

The last part of the setup process is to create the clients that your account will use to communicate with Amazon Forecast. The cell below does just that.

In [3]:
session = boto3.Session(region_name='us-west-2') 
forecast = session.client(service_name='forecast') 
forecastquery = session.client(service_name='forecastquery')
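
Creating the clients does not by itself contact the service. A minimal sketch of an explicit connectivity check is shown here; the helper name `can_reach_forecast` is hypothetical, and it assumes your credentials allow the read-only `ListDatasetGroups` call.

```python
def can_reach_forecast(client) -> bool:
    """Return True if a lightweight Forecast API call succeeds.

    `client` is expected to be a boto3 'forecast' client. The helper name is
    illustrative and is not part of boto3.
    """
    try:
        # ListDatasetGroups is read-only; MaxResults=1 keeps the response small.
        client.list_dataset_groups(MaxResults=1)
        return True
    except Exception as err:  # for example, missing credentials or permissions
        print(f"Could not reach Amazon Forecast: {err}")
        return False
```

In this notebook it could be called as `can_reach_forecast(forecast)` right after the clients are created.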

## Data Preparation<a class="anchor" id="DataPrep"></a>

For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly. 

To begin, use Pandas to read the CSV and to show a sample of the data.

In [4]:
df = pd.read_csv("../data/item-demand-time.csv", dtype = object, names=['timestamp','value','item'])
df.head(3)

| | timestamp | value | item |
|---|---|---|---|
| 0 | 2014-01-01 01:00:00 | 38.34991708126038 | client_12 |
| 1 | 2014-01-01 02:00:00 | 33.5820895522388 | client_12 |
| 2 | 2014-01-01 03:00:00 | 34.41127694859037 | client_12 |


Notice in the output above there are 3 columns of data:

1. The Timestamp
1. A Value
1. An Item

These are the 3 key required pieces of information to generate a forecast with Amazon Forecast. More can be added but these 3 must always remain present.

The dataset happens to span January 01, 2014 to December 31, 2014. For our testing we would like to keep the last month of information in a different CSV. We are also going to save January to November to a separate CSV. Both will be uploaded to S3 for use later.

You may notice a variable named `df`. This is a popular convention when working with a Pandas DataFrame object, which is similar to a table in a database. You can learn more here: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html


In [5]:
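# Select January to November for one dataframe.
# NOTE: the timestamps are strings here, so `<= '2014-11-30'` stops at
# '2014-11-29 23:00:00' (any timestamp on Nov 30 sorts after the date-only string).
# Use `< '2014-12-01'` instead if you want to include all of November.
jan_to_nov = df[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] <= '2014-11-30')]

# Select the month of December for another dataframe.
# NOTE: for the same reason, `<= '2014-12-31'` excludes timestamps on Dec 31;
# use `< '2015-01-01'` to include the full month.
dec_df = df[(df['timestamp'] >= '2014-12-01') & (df['timestamp'] <= '2014-12-31')]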
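
Because the CSV is read with `dtype = object`, the filter above compares timestamps as plain strings. The sketch below uses a few made-up sample timestamps (not rows from the dataset) and only the standard library to show how a date-only upper bound behaves under string comparison, and how an exclusive bound at the start of the next month keeps the whole month.

```python
from datetime import datetime

# Sample timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as the CSV.
stamps = [
    "2014-11-29 23:00:00",
    "2014-11-30 00:00:00",
    "2014-11-30 13:00:00",
    "2014-12-01 00:00:00",
]

# String comparison: a date-only bound sorts before every timestamp on that date,
# so nothing from Nov 30 passes this filter.
inclusive_bound = [s for s in stamps if s <= "2014-11-30"]

# An exclusive bound at the start of the next month keeps all of November.
exclusive_bound = [s for s in stamps if s < "2014-12-01"]

# Parsing first gives the same result without relying on string ordering.
fmt = "%Y-%m-%d %H:%M:%S"
parsed_bound = [s for s in stamps if datetime.strptime(s, fmt) < datetime(2014, 12, 1)]

print(inclusive_bound)
print(exclusive_bound)
print(parsed_bound == exclusive_bound)
```

The same ordering rules apply to a Pandas column of strings, which is why the choice of bound matters when splitting the data by month.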

Now export them to CSV files and place them into your `data` folder.

In [6]:
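# Training data (January to November). Note that this overwrites the original file.
jan_to_nov.to_csv("../data/item-demand-time.csv", header=False, index=False)
# Validation data (December).
dec_df.to_csv("../data/item-demand-time-validation.csv", header=False, index=False)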

At this point the data is ready to be sent to S3, where Forecast will use it later. Update the bucket name in the cell below to the value you obtained from the output of the CloudFormation step.

In [7]:
bucket_name = "forecastdemochrisking" # Remember to change this.

The following cells will upload the data to S3.

In [8]:
s3 = session.client('s3')
key="elec_data/item-demand-time.csv"
s3.upload_file(Filename="../data/item-demand-time.csv", Bucket=bucket_name, Key=key)

## Creating the Dataset Group and Dataset <a class="anchor" id="dataset"></a>

In Amazon Forecast, a dataset is a collection of one or more files that contain data relevant to a forecasting task. A dataset must conform to a schema provided by Amazon Forecast.

More details about `Domain` and dataset type can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we are using the [CUSTOM](https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html) domain with 3 required attributes: `timestamp`, `target_value`, and `item_id`.


It is also important to tell Amazon Forecast how to interpret your time-series information. The cell immediately below does that, and the next one configures your variable names for the Project, DatasetGroup, and Dataset.

In [9]:
DATASET_FREQUENCY = "H" 
TIMESTAMP_FORMAT = "yyyy-MM-dd hh:mm:ss"
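
Before importing, it can help to confirm that the timestamps in the CSV really follow this layout. The sketch below is one way to do that with the standard library. The Python format string `%Y-%m-%d %H:%M:%S` is an assumed equivalent for the 24-hour values seen in the data, and `find_bad_timestamps` is a hypothetical helper, not part of any Forecast SDK.

```python
from datetime import datetime

# Assumed Python equivalent of the layout used in the CSV (24-hour clock).
PY_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def find_bad_timestamps(values, fmt=PY_TIMESTAMP_FORMAT):
    """Return the values that do not parse with `fmt`."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad

# Two well-formed values and one in a different layout (illustrative only).
sample = ["2014-01-01 01:00:00", "2014-01-01 23:00:00", "01/01/2014 01:00"]
print(find_bad_timestamps(sample))
```

With the notebook's dataframe, the same helper could be applied to `df['timestamp']`.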

In [10]:
project = 'util_power_forecastdemo'
datasetName= project+'_ds'
datasetGroupName= project +'_dsg'
s3DataPath = "s3://"+bucket_name+"/"+key

### Creating a Schema

Amazon Forecast relies on a concept called a schema to map the content in your CSV file into a format that Forecast can understand. The cells below will create a schema that matches the CSV provided, and will make the API calls needed to build the Dataset and Dataset Group.

In [11]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"target_value",
         "AttributeType":"float"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}
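
Since the schema is matched to the CSV by column order, a quick local check of one row against the schema can catch a swapped column before the import job runs. This is a minimal sketch: `row_matches_schema` is a hypothetical helper, it handles only the `timestamp`, `float`, `integer`, and `string` attribute types, and the Python timestamp format is an assumption based on the data shown earlier.

```python
from datetime import datetime

# Same structure as the schema defined in the notebook.
schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}

def row_matches_schema(row, schema, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Check that a CSV row (list of strings) fits the schema, in column order."""
    attributes = schema["Attributes"]
    if len(row) != len(attributes):
        return False
    for value, attribute in zip(row, attributes):
        kind = attribute["AttributeType"]
        try:
            if kind == "timestamp":
                datetime.strptime(value, timestamp_format)
            elif kind == "float":
                float(value)
            elif kind == "integer":
                int(value)
            # "string" accepts any text, so there is nothing to check.
        except ValueError:
            return False
    return True

good_row = ["2014-01-01 01:00:00", "38.34991708126038", "client_12"]
swapped_row = ["client_12", "38.34991708126038", "2014-01-01 01:00:00"]
print(row_matches_schema(good_row, schema))
print(row_matches_schema(swapped_row, schema))
```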

In [12]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = schema
)

In [13]:
datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=datasetArn)

{'DatasetArn': 'arn:aws:forecast:us-west-2:059124553121:dataset/util_power_forecastdemo_ds',
 'DatasetName': 'util_power_forecastdemo_ds',
 'Domain': 'CUSTOM',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'H',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'target_value', 'AttributeType': 'float'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2019, 8, 20, 21, 38, 17, 377000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2019, 8, 20, 21, 38, 17, 377000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '9c50284f-35c0-4541-8f33-a2e0519d86d5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 20 Aug 2019 21:38:18 GMT',
   'x-amzn-requestid': '9c50284f-35c0-4541-8f33-a2e0519d86d5',
   'content-length': '513',
   'connection': 'keep-alive'},
 

In [14]:
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,
                                                              Domain="CUSTOM",
                                                              DatasetArns= [datasetArn]
                                                             )
datasetGroupArn = create_dataset_group_response['DatasetGroupArn']

In [15]:
forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)

{'DatasetGroupName': 'util_power_forecastdemo_dsg',
 'DatasetGroupArn': 'arn:aws:forecast:us-west-2:059124553121:dataset-group/util_power_forecastdemo_dsg',
 'DatasetArns': ['arn:aws:forecast:us-west-2:059124553121:dataset/util_power_forecastdemo_ds'],
 'Domain': 'CUSTOM',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2019, 8, 20, 21, 38, 21, 709000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2019, 8, 20, 21, 38, 21, 709000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'b100c938-3eb7-4c7b-9b4d-8258a510a4e3',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 20 Aug 2019 21:38:22 GMT',
   'x-amzn-requestid': 'b100c938-3eb7-4c7b-9b4d-8258a510a4e3',
   'content-length': '353',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

### Create IAM Role for Forecast

Like many AWS services, Forecast will need to assume an IAM role in order to interact with your S3 resources securely. The code below will create the role and it will be used later for accessing your data in S3.


In [16]:
iam = boto3.client("iam")

role_name = "ForecastRoleDemo"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "forecast.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# Attach the AmazonForecastFullAccess managed policy to the role.
# The role also needs access to your S3 bucket (attached further below). AmazonS3FullAccess is broad;
# consider creating and attaching a new policy that provides read access to your bucket only,
# or attaching the AmazonS3ReadOnlyAccess policy to the role instead.
policy_arn = "arn:aws:iam::aws:policy/AmazonForecastFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::059124553121:role/ForecastRoleDemo
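
As an alternative to `AmazonS3FullAccess`, the role could be given read access to a single bucket. The sketch below only builds such a policy document. The helper name `s3_read_policy` and the bucket name are hypothetical, and the assumption that `s3:GetObject` plus `s3:ListBucket` is enough for an import should be checked against the Forecast documentation.

```python
import json

def s3_read_policy(bucket: str) -> dict:
    """Build an IAM policy document granting read access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read the objects stored in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # List the bucket's contents.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }

# "my-forecast-bucket" is a placeholder; substitute your own bucket name.
policy_document = json.dumps(s3_read_policy("my-forecast-bucket"), indent=2)
print(policy_document)
```

The resulting JSON could then be attached as an inline policy, for example with `iam.put_role_policy(RoleName=role_name, PolicyName=..., PolicyDocument=policy_document)`, in place of the `AmazonS3FullAccess` attachment.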


### Create Data Import Job


Now that Forecast knows how to understand the CSV we are providing, the next step is to import the data from S3 into Amazon Forecast.

In [17]:
datasetImportJobName = 'EP_AML_DSIMPORT_JOB_TARGET'
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [18]:
ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_import_job_arn)

arn:aws:forecast:us-west-2:059124553121:dataset-import-job/util_power_forecastdemo_ds/EP_AML_DSIMPORT_JOB_TARGET


Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this can take 5 to 10 minutes.

In [None]:
while True:
    dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
    print(dataImportStatus)
    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
        sleep(30)
    else:
        break

CREATE_PENDING
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
ACTIVE
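
The loop above runs until the job reaches a final state. A variant with a timeout is sketched here for cases where you would rather fail than wait indefinitely. The helper name `wait_for_status` and its default values are assumptions for illustration, not part of boto3.

```python
import time

def wait_for_status(get_status, done=("ACTIVE", "CREATE_FAILED"),
                    interval_seconds=30, timeout_seconds=3600):
    """Call `get_status()` until it returns a value in `done` or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print(status)
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_seconds} seconds")
        time.sleep(interval_seconds)
```

In this notebook, `get_status` could be `lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']`. Passing a callable keeps the helper reusable for other resources that report a `Status`.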


In [20]:
forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)

{'DatasetImportJobName': 'EP_AML_DSIMPORT_JOB_TARGET',
 'DatasetImportJobArn': 'arn:aws:forecast:us-west-2:059124553121:dataset-import-job/util_power_forecastdemo_ds/EP_AML_DSIMPORT_JOB_TARGET',
 'DatasetArn': 'arn:aws:forecast:us-west-2:059124553121:dataset/util_power_forecastdemo_ds',
 'TimestampFormat': 'yyyy-MM-dd hh:mm:ss',
 'DataSource': {'S3Config': {'Path': 's3://forecastdemochrisking/elec_data/item-demand-time.csv',
   'RoleArn': 'arn:aws:iam::059124553121:role/ForecastRoleDemo'}},
 'FieldStatistics': {'date': {'Count': 23973,
   'CountDistinct': 7991,
   'CountNull': 0,
   'Min': '2014-01-01T01:00:00Z',
   'Max': '2014-11-29T23:00:00Z'},
  'item': {'Count': 23973, 'CountDistinct': 3, 'CountNull': 0},
  'target': {'Count': 23973,
   'CountDistinct': 4818,
   'CountNull': 0,
   'CountNan': 0,
   'Min': '0.0',
   'Max': '212.27197346600326',
   'Avg': 50.447323170680725,
   'Stddev': 38.72169238224658}},
 'DataSize': 0.0010688817128539085,
 'Status': 'ACTIVE',
 'CreationTime': d
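
The `FieldStatistics` in this response allow a quick sanity check on the import. The sketch below copies the counts and the `Min`/`Max` timestamps from the output above and tests whether the series is complete at hourly frequency, under the assumption that every item has a value for every hour.

```python
from datetime import datetime, timedelta

# Values copied from the FieldStatistics output above.
row_count = 23973
distinct_timestamps = 7991
distinct_items = 3
first = datetime(2014, 1, 1, 1)    # 'Min': '2014-01-01T01:00:00Z'
last = datetime(2014, 11, 29, 23)  # 'Max': '2014-11-29T23:00:00Z'

# Number of hourly steps from first to last, counting both ends.
expected_hours = (last - first) // timedelta(hours=1) + 1

# No missing hours if the expected count equals the distinct timestamps seen.
no_gaps = expected_hours == distinct_timestamps

# Every item has every hour if rows = timestamps x items.
fully_populated = distinct_timestamps * distinct_items == row_count

print(expected_hours, no_gaps, fully_populated)
```

Both checks hold for the numbers shown: 7991 hourly timestamps for each of 3 items gives 23973 rows.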

## Next Steps

At this point you have successfully imported your data into Amazon Forecast, and it is time to build your first model in the next notebook. To continue, open `Building_Your_Model.ipynb` and paste in the DatasetGroup ARN printed below:

In [23]:
print("DatasetGroupArn: ")
print(datasetGroupArn)

DatasetGroupArn: 
arn:aws:forecast:us-west-2:059124553121:dataset-group/util_power_forecastdemo_dsg
