# Preparing Related Time-Series Data

> *This notebook should work well in the `Python 3 (Data Science)` kernel in SageMaker Studio, or `conda_python3` in SageMaker Notebook Instances*

In [Notebook 1](1.%20Preparing%20Target%20Time-Series%20Data.ipynb), we prepared training and validation datasets for our **target time-series**: the quantity we're actually trying to predict, which is the minimum required data to get started with Amazon Forecast.

In most real-world use cases, forecasts can be significantly improved using:

- **Related Time-Series**: Other time-varying factors which can be informative *inputs* to our forecast
- **Item Metadata**: Static attributes which help us find correlations between different `item_ids` in our forecast

**In this notebook:** we'll use Python code to prepare a Related Time-Series file for our example use case - ignoring item metadata since our example contains only a single `item_id`.

## Sourcing Data

As we saw right at the start of the first notebook, our raw example traffic data actually **already includes** some time-varying attributes which might make good RTS candidates: **weather information**. This is the data source we'll use in this notebook.

In real-world use cases, extra information might be stored in the same system as your target variable or somewhere else entirely: for example, you might need to bring weather or product discontinuation/stock data together with demand recorded in sales data.

Our goal should be to find variables which are likely *significant and useful to our forecast*. For example:

- Adding out-of-stock data to a retail forecast might be very important, because without it we could be training our model to forecast 0 sales on days when actually there is lots of demand - just because we sold out of a product in the past.
- Adding a calendar of promotional events in Indonesia might *not* be very relevant, if our forecast is only modelling sales in the Philippines.

Since our traffic data is already loaded on the notebook, we'll simply start by loading our required libraries as there's no extra downloading to do!

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
from time import sleep

# External Dependencies:
import boto3
import numpy as np
import pandas as pd

# Local Dependencies:
import util

%store -r

In [None]:
# Our RTS source data was already downloaded in notebook 1!

## Reviewing and Pre-Processing the Data

As with the TTS, it's vitally important that we understand whether any data might be [missing](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-missing-values.html) from our RTS or if there might be any [conflicts in data frequency](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-datasets-groups.html#howitworks-data-alignment) and how those should be resolved.

Although Amazon Forecast provides functionality for handling both these cases (see the linked docs), we need to check that any such treatments are correct for our use case.

Because related time-series are **inputs** to our forecast, they must typically span **the forecast horizon as well as the history**, as shown below:

![Graph: RTS must extend past TTS into the Forecast Horizon](https://docs.aws.amazon.com/forecast/latest/dg/images/short-long-rts.png)

Per [the documentation](https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html#related-time-series-historical-futurelooking), **only the CNN-QR algorithm** is able to use so-called "historical related time-series" where future values are not provided... and even for this algorithm, RTS data will be much more useful where forecasts of its future values are available.
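
As a minimal sketch of the coverage requirement above, the hypothetical helper below checks whether a related time-series index reaches the last timestep of the forecast horizon. The function name, the toy hourly data and the 24-hour horizon are assumptions for illustration only; they are not part of this workshop's `util` module.

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)


def rts_covers_horizon(rts_index, tts_end, horizon_hours):
    """Hypothetical helper: True if the RTS reaches the last forecast timestep."""
    horizon_end = pd.Timestamp(tts_end) + horizon_hours * ONE_HOUR
    return rts_index.max() >= horizon_end


# Toy hourly target series covering two days
tts_index = pd.date_range("2018-01-01", periods=48, freq=ONE_HOUR)
# A "forward-looking" RTS that extends 24 hours past the end of the target
long_rts_index = pd.date_range("2018-01-01", periods=72, freq=ONE_HOUR)

covers_long = rts_covers_horizon(long_rts_index, tts_index.max(), horizon_hours=24)
# An RTS that stops where the target stops does not cover the horizon
covers_short = rts_covers_horizon(tts_index, tts_index.max(), horizon_hours=24)
print(covers_long, covers_short)
```

A check like this could be run against your own RTS before import, using whatever horizon you plan to forecast.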

Since we'll be exploring a range of algorithms, our goal in preparing this example data will be to **remove any NaN/missing** entries from our weather data.

To get started, we'll load up our target dataframe from the first notebook and explore the available data range:

In [None]:
related_time_series_df = target_df.copy()
related_time_series_df = full_df.join(related_time_series_df, how="outer")
cols = related_time_series_df.columns.tolist()
related_time_series_df = related_time_series_df.loc["2017-01-01":]
print(related_time_series_df.index.min())
print(related_time_series_df.index.max())

We can see that the data now covers the same range as our target time series: from the start of 2017 through to the end of our known data, around the end of September 2018.

However, we will see that many records are missing weather data fields:


In [None]:
related_time_series_df[related_time_series_df.isnull().any(axis=1)]

> ⚠️ **Remember:** As with the TTS, these may not be the only values your Forecast Predictor sees as "missing": for example, there may be hours in the history or forecast period with no record at all!

For this example, we will:

- Forward-fill these missing values before preparing the data file for Forecast
- Assume (correctly) that every required timestep does have **at least one** record in the file (so filling missing cells is sufficient)
- Assume (correctly) that every hour timestep has **exactly one** record in the file (so no duplicate or close-together records will get aggregated)
- Therefore ignore Forecast's missing value and aggregation configurations, because our input data is fully sanitized.
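
The two assumptions above (at least one, and exactly one, record per hour) can be checked rather than taken on trust. The sketch below uses a hypothetical helper and a small toy index, both of which are illustrative assumptions rather than part of the workshop code:

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)


def audit_hourly_index(index):
    """Hypothetical helper: find missing and duplicated hourly timestamps."""
    # Every hour we would expect between the first and last record
    expected = pd.date_range(index.min(), index.max(), freq=ONE_HOUR)
    missing = expected.difference(index.unique())
    # Timestamps that appear more than once
    duplicated = index[index.duplicated()].unique()
    return missing, duplicated


# Toy index with 02:00 missing and 03:00 recorded twice
toy_index = pd.DatetimeIndex(
    ["2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 03:00", "2017-01-01 03:00"]
)
missing, duplicated = audit_hourly_index(toy_index)
print("Missing:", list(missing))
print("Duplicated:", list(duplicated))
```

Applied to this notebook's data, something like `audit_hourly_index(related_time_series_df.index)` should return two empty results if the assumptions hold.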

Below we fill the missing values and check again for any issues:

In [None]:
# Forward-fill missing values:
related_time_series_df[cols] = related_time_series_df[cols].replace("", np.nan).ffill()

# Re-check for missing:
related_time_series_df[related_time_series_df.isnull().any(axis=1)]
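
One caveat worth keeping in mind: forward-filling can only copy an earlier value forward, so a gap at the very start of a series stays empty. This is why the re-check above matters. The toy example below (illustrative data, not taken from the traffic dataset) shows the behaviour, with a backward-fill as one possible fallback:

```python
import numpy as np
import pandas as pd

toy = pd.Series(
    [np.nan, 1.5, np.nan, np.nan, 2.0],
    index=pd.date_range("2017-01-01", periods=5, freq=pd.Timedelta(hours=1)),
)

filled = toy.ffill()
# The first value has nothing before it to copy, so it is still missing
remaining = int(filled.isnull().sum())

# One possible fallback (an assumption: check it suits your data) is to backward-fill the rest
fully_filled = filled.bfill()
print(remaining, fully_filled.tolist())
```

Whether a backward-fill is appropriate depends on the variable; for some fields a fixed default may be a better choice.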

With all our missing/duplicate worries solved, let's **review the columns and decide what we should keep:**

In [None]:
related_time_series_df.sample(3)

A few things to note here:

- `holiday` information is not needed because the use case is in a supported country, so we can simply use the [Holidays feature within Amazon Forecast](https://docs.aws.amazon.com/forecast/latest/dg/API_SupplementaryFeature.html).
- `traffic_volume` is our actual target field, so of course will not be part of the RTS
- We'll still need to add the `item_id` field back in to the dataset, as we did for the TTS.
- `weather_main` seems pretty redundant if `weather_description` is provided, but the rest of the fields seem interesting.

Therefore we'll scope our RTS dataset to the following fields:

* `timestamp` - The Index
* `temp` - float
* `rain_1h` - float
* `snow_1h` - float
* `clouds_all` - float
* `weather_description` - string
* `item_id` - string

In [None]:
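# Restrict the columns to keep (matching the field list above).
# Taking a copy avoids pandas' SettingWithCopyWarning when item_id is added below.
related_time_series_df = related_time_series_df[
    ["temp", "rain_1h", "snow_1h", "clouds_all", "weather_description"]
].copy()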

# Add in item_id
related_time_series_df["item_id"] = "all"

# Validate the structure
related_time_series_df.head()

## Saving the Related Time-Series File

Since our uploaded RTS data should extend out into the forecast horizon, there's no need to split it into a training and validation set as we did for the TTS.

We'll simply save the full set to a CSV file, as below:

In [None]:
# Save it off as a file:
related_time_series_filename = "related_time_series.csv"
related_time_series_path = f"{data_dir}/{related_time_series_filename}"
related_time_series_df.to_csv(related_time_series_path, header=False)
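
Because the file is written without a header row, Amazon Forecast relies on the dataset schema to name each column, so the order of attributes in the schema needs to match the column order of the CSV (with the index first). As a rough sketch, a hypothetical helper like the one below could derive a schema dictionary from a dataframe. The helper name, the dtype mapping and the toy data are assumptions for illustration; the later notebooks define the schema actually used in this workshop.

```python
import pandas as pd


def sketch_forecast_schema(df, index_name="timestamp"):
    """Hypothetical helper: build a schema dict in CSV column order (index first)."""
    attributes = [{"AttributeName": index_name, "AttributeType": "timestamp"}]
    for col, dtype in df.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            attr_type = "integer"
        elif pd.api.types.is_float_dtype(dtype):
            attr_type = "float"
        else:
            # Anything non-numeric is treated as a string attribute in this sketch
            attr_type = "string"
        attributes.append({"AttributeName": col, "AttributeType": attr_type})
    return {"Attributes": attributes}


# Toy dataframe standing in for the real RTS data
toy_df = pd.DataFrame(
    {"temp": [271.2, 270.9], "item_id": ["all", "all"]},
    index=pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00"]),
)
schema = sketch_forecast_schema(toy_df)
print(schema)
```

Deriving the schema from the dataframe, rather than typing it by hand, is one way to reduce the risk of the two drifting out of order.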

## Uploading Data to Amazon S3

As before, our final step is to upload the prepared file to Amazon S3 ready for import to Amazon Forecast:

In [None]:
# Replace the below with e.g. region = "ap-southeast-1" if you didn't run notebook 0
%store -r region
assert isinstance(region, str), "`region` must be a region name string e.g. 'us-east-1'"

# Replace the below with e.g. bucket_name = "DOC-EXAMPLE-BUCKET" if you didn't run notebook 0
%store -r bucket_name
assert isinstance(bucket_name, str), "`bucket_name` must be a data bucket name string"

session = boto3.Session(region_name=region)
s3 = session.resource("s3")

In [None]:
# Upload Related File
s3.Bucket(bucket_name).Object(related_time_series_filename).upload_file(related_time_series_path)
related_s3uri = f"s3://{bucket_name}/{related_time_series_filename}"
%store related_s3uri
print(f"Uploaded RTS to {related_s3uri}")

## All Done!

Now our Related Time-Series data is prepared and staged in an Amazon S3 bucket ready to import.

In the next notebooks, we'll show how to import this additional dataset and use it to build improved forecast models.

You can follow along with either the [notebook 4a (AWS Console)](4a.%20Incorporating%20RTS%20Data%20(Console).ipynb) or [notebook 4b (Python SDK)](4b.%20Incorporating%20RTS%20Data%20(Python%20SDK).ipynb) variant.