# Validating and Importing Related Time Series Data

## Obtaining Your Data

This notebook picks up where you left off with your target time series data. In this particular example, one master file contained both the target and the related time series information. That may or may not be the case for your problem. The goal here is to produce a file that contains the following 2 required attributes:

1. Timestamp - Must use the same format and cover the same total range as the target time series data, and must also extend through the dates you intend to forecast.
1. Item_ID - Must exist for every timestamp of each item in your time series dataset.
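One way to check both requirements is to compare the (timestamp, item) pairs of the two datasets. The sketch below is only an illustration on toy data: the helper name `missing_related_rows` and the column names `timestamp` and `item_id` are assumptions, not part of this notebook.

```python
import pandas as pd

def missing_related_rows(target_df, related_df, ts_col="timestamp", id_col="item_id"):
    """Return the (timestamp, item_id) pairs found in target_df but not in related_df."""
    target_keys = pd.MultiIndex.from_frame(target_df[[ts_col, id_col]])
    related_keys = pd.MultiIndex.from_frame(related_df[[ts_col, id_col]])
    return target_keys.difference(related_keys)

# Toy data (hypothetical values): the related frame lacks the last hour.
hours = pd.to_datetime([
    "2017-01-01 00:00:00",
    "2017-01-01 01:00:00",
    "2017-01-01 02:00:00",
    "2017-01-01 03:00:00",
])
target = pd.DataFrame({"timestamp": hours, "item_id": "1", "traffic_volume": [10.0, 12.0, 9.0, 7.0]})
related = pd.DataFrame({"timestamp": hours[:3], "item_id": "1", "temp": [269.75, 269.95, 269.75]})

missing = missing_related_rows(target, related)
print(list(missing))
```

An empty result would mean every target row has a matching related row.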

In addition to those attributes, we are looking for variables that change over time and that have some impact on our goal of predicting traffic volumes.

Again, the data was already bundled together for us in this sample, so we will skip obtaining it a second time, but that is where you would start otherwise.

With the data ready to go, skip the blank cell (feel free to add to it if you need to manipulate your own data) and execute the cells that handle our imports and retrieve our stored values from the previous notebook.


In [28]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
%store -r

## Building The Related Time Series File

The challenge here is to make sure that we leave no entries with NaN values, or the service will throw an error when building a Predictor. Every value must be present so that the service can estimate the impact of each variable.
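As a small illustration of this kind of cleanup, the sketch below uses made-up values to show that forward filling alone cannot repair a gap at the very start of a series, so it is worth counting the remaining NaNs afterwards. Dropping the leading rows is just one possible remedy and is an assumption here, not something this notebook requires.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with blanks and NaNs, including in the first row.
frame = pd.DataFrame(
    {
        "temp": [np.nan, 269.75, np.nan, 269.65],
        "weather_description": ["", "mist", "", "sky is clear"],
    },
    index=pd.to_datetime([
        "2017-01-01 00:00:00",
        "2017-01-01 01:00:00",
        "2017-01-01 02:00:00",
        "2017-01-01 03:00:00",
    ]),
)

# Treat empty strings as missing, then carry the last known value forward.
filled = frame.replace("", np.nan).ffill()

# The first row has nothing before it to copy from, so it is still missing.
remaining_nans = int(filled.isnull().sum().sum())
print(remaining_nans)

# One option: drop rows that could not be filled.
cleaned = filled.dropna()
print(int(cleaned.isnull().sum().sum()))
```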

In [8]:
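# Start from a copy of the target data and drop any rows with missing values
related_time_series_df = targtet_df.copy()
related_time_series_df = related_time_series_df.dropna()
# Outer-join onto the full dataset so every timestamp is retained
related_time_series_df = full_df.join(related_time_series_df, how='outer')
# Treat empty strings as missing, then forward-fill the gaps
cols = related_time_series_df.columns.tolist()
related_time_series_df[cols] = related_time_series_df[cols].replace('', np.nan).ffill()
# Keep only 2017 onward and confirm the resulting date range
related_time_series_df = related_time_series_df.loc['2017-01-01':]
print(related_time_series_df.index.min())
print(related_time_series_df.index.max())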

2017-01-01 00:00:00
2018-09-30 23:00:00


We can now see that the data covers the range of our target time series, all of 2017, and continues to the end of our known data in 2018. We have not yet defined a forecast horizon, but it is important to note here that the related data needs to cover that time span as well. To spoil later work, the horizon for us is 480 hours, or 20 days, which fits comfortably within the 9 months of 2018 data we hold for validation.
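The coverage requirement comes down to simple date arithmetic. The sketch below assumes the target series ends at 2017-12-31 23:00 (inferred from "all of 2017", not printed in this notebook) and reuses the related data end date printed above.

```python
from datetime import datetime, timedelta

FORECAST_HORIZON_HOURS = 480  # 20 days, as stated in the text

# Assumption: the target series ends with the last hour of 2017.
target_end = datetime(2017, 12, 31, 23)
# Taken from the index maximum printed above.
related_end = datetime(2018, 9, 30, 23)

# The related data must reach at least the end of the forecast horizon.
required_end = target_end + timedelta(hours=FORECAST_HORIZON_HOURS)
covers_horizon = related_end >= required_end

print(required_end)
print(covers_horizon)
```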

As a last step in prepping the base set of data, we validate that there are no blanks or NaNs. An empty result means every row is complete.

In [9]:
related_time_series_df[related_time_series_df.isnull().any(axis=1)]

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,traffic_volume


### Look at the columns and decide what we should keep:


In [10]:
related_time_series_df.sample(3)

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,traffic_volume
2017-11-23 22:00:00,,278.15,0.0,0.0,1.0,Clear,sky is clear,1962.0
2018-04-02 06:00:00,,265.99,0.0,0.0,1.0,Clear,sky is clear,5147.0
2018-05-02 02:00:00,,285.51,0.0,0.0,40.0,Drizzle,drizzle,253.0


A few things to note here:

* Holidays are not needed. Given that this data is from the US, we can just use the Holidays feature within Forecast: https://docs.aws.amazon.com/forecast/latest/dg/API_SupplementaryFeature.html
* Weather description seems to have more variety than weather main, so we keep the description and drop the other.
* Traffic volume will be removed here, since it is the target value rather than related data.
* We still need to add back the item_id field.
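For reference, here is a rough sketch of how the holiday option mentioned in the first bullet might be expressed later, as a `SupplementaryFeatures` entry inside the `InputDataConfig` passed when creating a predictor. The dataset group ARN below is a placeholder, not a value from this notebook.

```python
# Holiday calendar for the US, expressed as a supplementary feature.
holiday_feature = {"Name": "holiday", "Value": "US"}

# Placeholder ARN (hypothetical); substitute your own dataset group ARN.
input_data_config = {
    "DatasetGroupArn": "arn:aws:forecast:REGION:ACCOUNT_ID:dataset-group/EXAMPLE",
    "SupplementaryFeatures": [holiday_feature],
}

print(input_data_config["SupplementaryFeatures"])
```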

This leaves us with the following schema:

* `timestamp` - The Index
* `temp` - float
* `rain_1h` - float
* `snow_1h` - float
* `clouds_all` - float
* `weather_description` - string
* `item_ID` - string

The cell below will build that file for us.


In [11]:
# Restrict the columns to the related features (the timestamp stays as the index)
related_time_series_df = related_time_series_df[['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_description']].copy()
# Add in item_id
related_time_series_df['item_ID'] = "1"
# Validate the structure
related_time_series_df.head()


Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,weather_description,item_ID
2017-01-01 00:00:00,269.75,0.0,0.0,75.0,broken clouds,1
2017-01-01 01:00:00,269.95,0.0,0.0,1.0,sky is clear,1
2017-01-01 02:00:00,269.75,0.0,0.0,1.0,sky is clear,1
2017-01-01 03:00:00,269.65,0.0,0.0,40.0,scattered clouds,1
2017-01-01 04:00:00,269.48,0.0,0.0,1.0,sky is clear,1


In [13]:
# Save it off as a file:
related_time_series_filename = "related_time_series.csv"
related_time_series_path = data_dir + "/" + related_time_series_filename
related_time_series_df.to_csv(related_time_series_path, header=False)
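Because the file is written without a header row, the column order is the only thing that ties the file to a schema. A quick way to sanity check a file like this is to read it back with explicit names. The sketch below does so on a two-row toy frame written to an in-memory buffer rather than to `related_time_series_path`; the values are illustrative only.

```python
import io
import pandas as pd

# Toy frame (hypothetical values) shaped like the related time series data.
sample = pd.DataFrame(
    {
        "temp": [269.75, 269.95],
        "rain_1h": [0.0, 0.0],
        "weather_description": ["broken clouds", "sky is clear"],
        "item_ID": ["1", "1"],
    },
    index=pd.to_datetime(["2017-01-01 00:00:00", "2017-01-01 01:00:00"]),
)

# Write without a header, the same way the cell above does.
buffer = io.StringIO()
sample.to_csv(buffer, header=False)
buffer.seek(0)

# Read back, naming the columns in file order: the index comes first.
expected_columns = ["timestamp"] + sample.columns.tolist()
round_trip = pd.read_csv(buffer, header=None, names=expected_columns, dtype={"item_ID": str})

print(round_trip.shape)
print(int(round_trip.isnull().sum().sum()))
```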

## Adding Related Data to the DatasetGroup

Next we create a related time series dataset, add it to our dataset group, import our data, and validate that it looks good. Afterwards we can delete this dataset import so that the first models do not receive any extra information from the related data.

You can, of course, skip the deletion and start building models informed by the related data right away.

In [22]:
session = boto3.Session(region_name=region)
forecast = session.client(service_name='forecast')
forecast_query = session.client(service_name='forecastquery')

In [14]:
# Upload Related File
boto3.Session().resource('s3').Bucket(bucket_name).Object(related_time_series_filename).upload_file(related_time_series_path)
related_s3DataPath = "s3://"+bucket_name+"/"+related_time_series_filename

In [15]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
related_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"temperature",
         "AttributeType":"float"
      },
       {
         "AttributeName":"rain_1h",
         "AttributeType":"float"
      },
       {
         "AttributeName":"snow_1h",
         "AttributeType":"float"
      },
       {
         "AttributeName":"clouds_all",
         "AttributeType":"float"
      },
       {
         "AttributeName":"weather",
         "AttributeType":"string"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}
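Since the attribute names in the schema do not have to match the DataFrame column names, the order and types are what matter. The sketch below is one possible way to compare a schema's attribute types with a DataFrame's columns in file order. The helper `forecast_types_in_file_order` and the toy data are assumptions for illustration only.

```python
import pandas as pd

def forecast_types_in_file_order(df):
    """Expected AttributeType per CSV column: the index first, then each data column."""
    types = ["timestamp"]  # assumes the index holds the timestamps
    for dtype in df.dtypes:
        if pd.api.types.is_float_dtype(dtype):
            types.append("float")
        elif pd.api.types.is_integer_dtype(dtype):
            types.append("integer")
        else:
            types.append("string")
    return types

# Toy frame and schema (hypothetical, shortened versions of the ones above).
toy_df = pd.DataFrame(
    {"temp": [269.75], "clouds_all": [75.0], "weather_description": ["broken clouds"], "item_ID": ["1"]},
    index=pd.to_datetime(["2017-01-01 00:00:00"]),
)
toy_schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "temperature", "AttributeType": "float"},
        {"AttributeName": "clouds_all", "AttributeType": "float"},
        {"AttributeName": "weather", "AttributeType": "string"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}

schema_types = [attr["AttributeType"] for attr in toy_schema["Attributes"]]
schema_matches = schema_types == forecast_types_in_file_order(toy_df)
print(schema_matches)
```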

In [25]:
related_DSN = datasetName + "_related"
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='RELATED_TIME_SERIES',
                    DatasetName=related_DSN,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = related_schema
)

In [26]:
related_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=related_datasetArn)

{'DatasetArn': 'arn:aws:forecast:us-east-1:059124553121:dataset/forecast_poc_ds_related',
 'DatasetName': 'forecast_poc_ds_related',
 'Domain': 'CUSTOM',
 'DatasetType': 'RELATED_TIME_SERIES',
 'DataFrequency': 'H',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'temperature', 'AttributeType': 'float'},
   {'AttributeName': 'rain_1h', 'AttributeType': 'float'},
   {'AttributeName': 'snow_1h', 'AttributeType': 'float'},
   {'AttributeName': 'clouds_all', 'AttributeType': 'float'},
   {'AttributeName': 'weather', 'AttributeType': 'string'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2019, 12, 31, 20, 41, 19, 417000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2019, 12, 31, 20, 41, 19, 417000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '369185a5-b46e-4bd6-b90a-23f7f1b9ba03',
  'HTTPStatusCode': 2

In [36]:
datasetImportJobName = 'DSIMPORT_JOB_RELATEDPOC'
related_ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=related_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":related_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [37]:
rel_ds_import_job_arn=related_ds_import_job_response['DatasetImportJobArn']
print(rel_ds_import_job_arn)

arn:aws:forecast:us-east-1:059124553121:dataset-import-job/forecast_poc_ds_related/DSIMPORT_JOB_RELATEDPOC


The cell below will poll until the import process has completed. Once it has, we can review the metrics and decide whether or not to delete the data.

In [38]:
while True:
    dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)['Status']
    print(dataImportStatus)
    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
        sleep(30)
    else:
        break

CREATE_PENDING
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
ACTIVE
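A loop like the one above runs until the job reaches a terminal status. If you would rather give up after a fixed amount of time, a variation along the following lines may help. This is only a sketch: `wait_for_terminal_status` is a hypothetical helper, and the status source is faked here so the example runs without AWS access.

```python
import time

def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE or CREATE_FAILED, or raise on timeout."""
    terminal = {"ACTIVE", "CREATE_FAILED"}
    waited = 0
    while True:
        status = get_status()
        print(status)
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Fake status source standing in for the describe call (illustration only).
fake_statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

With the real client, `get_status` could be a lambda that returns `forecast.describe_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)['Status']`.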


## Evaluating the Related Time Series Data

First let us examine the dataframe that we provided to Forecast:

In [32]:
related_time_series_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 18609 entries, 2017-01-01 00:00:00 to 2018-09-30 23:00:00
Data columns (total 6 columns):
temp                   18609 non-null float64
rain_1h                18609 non-null float64
snow_1h                18609 non-null float64
clouds_all             18609 non-null float64
weather_description    18609 non-null object
item_ID                18609 non-null object
dtypes: float64(4), object(2)
memory usage: 1.6+ MB


In [33]:
related_time_series_df.sample(3)

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,weather_description,item_ID
2017-06-13 01:00:00,292.55,0.0,0.0,1.0,mist,1
2017-04-19 16:00:00,281.84,0.0,0.0,90.0,mist,1
2018-07-04 01:00:00,297.42,0.0,0.0,1.0,sky is clear,1


Above we see 18,609 entries and not one is a NaN value! This is perfect. Now to double check what we imported:

In [34]:
forecast.describe_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)

{'DatasetImportJobName': 'DSIMPORT_JOB_RELATEDPOC',
 'DatasetImportJobArn': 'arn:aws:forecast:us-east-1:059124553121:dataset-import-job/forecast_poc_ds_related/DSIMPORT_JOB_RELATEDPOC',
 'DatasetArn': 'arn:aws:forecast:us-east-1:059124553121:dataset/forecast_poc_ds_related',
 'TimestampFormat': 'yyyy-MM-dd hh:mm:ss',
 'DataSource': {'S3Config': {'Path': 's3://059124553121forecastpoc/related_time_series.csv',
   'RoleArn': 'arn:aws:iam::059124553121:role/ForecastRolePOC'}},
 'FieldStatistics': {'clouds_all': {'Count': 18609,
   'CountDistinct': 21,
   'CountNull': 0,
   'CountNan': 0,
   'Min': '0.0',
   'Max': '92.0',
   'Avg': 48.0432586382933,
   'Stddev': 39.557027100024285},
  'item_id': {'Count': 18609, 'CountDistinct': 1, 'CountNull': 0},
  'rain_1h': {'Count': 18609,
   'CountDistinct': 87,
   'CountNull': 0,
   'CountNan': 0,
   'Min': '0.0',
   'Max': '10.6',
   'Avg': 0.052013004460207436,
   'Stddev': 0.41238663420793753},
  'snow_1h': {'Count': 18609,
   'CountDistinct': 1,
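Instead of reading the statistics by eye, the null and NaN counts can also be checked programmatically. The sketch below walks a `FieldStatistics` dictionary shaped like the response above. The helper name is an assumption, and the sample values are a shortened copy of the output, not a live response.

```python
def fields_with_missing_values(field_statistics):
    """Map each field name to its combined null and NaN count, keeping only non-zero totals."""
    problems = {}
    for name, stats in field_statistics.items():
        # String fields report CountNull only, so CountNan defaults to 0.
        missing = stats.get("CountNull", 0) + stats.get("CountNan", 0)
        if missing:
            problems[name] = missing
    return problems

# Shortened copy of the statistics shown above.
sample_stats = {
    "clouds_all": {"Count": 18609, "CountDistinct": 21, "CountNull": 0, "CountNan": 0},
    "item_id": {"Count": 18609, "CountDistinct": 1, "CountNull": 0},
    "rain_1h": {"Count": 18609, "CountDistinct": 87, "CountNull": 0, "CountNan": 0},
}

problems = fields_with_missing_values(sample_stats)
print(problems)
```

With a live response, the dictionary to pass in would be the `'FieldStatistics'` entry of the `describe_dataset_import_job` result.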

Fantastic! No NaNs or nulls, and the entire dataset is ready to go. If you'd like to delete this information so you can build your models without related data, simply uncomment the cell below. Once that is done, you are ready to move forward with building your models with Amazon Forecast.

In [35]:
#forecast.delete_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)

{'ResponseMetadata': {'RequestId': 'efd0f06f-a7ad-4b47-91a0-7d67df5aeab4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 31 Dec 2019 20:53:56 GMT',
   'x-amzn-requestid': 'efd0f06f-a7ad-4b47-91a0-7d67df5aeab4',
   'content-length': '0',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}