# Time-Series Forecasting - Merge Amazon Forecast Datasets for Amazon SageMaker Canvas API

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/autopilot|autopilot_ts_data_merge.ipynb)

---

### 1. Introduction <a name='introduction'></a>

The SageMaker Canvas API for time-series forecasting uses a single dataset, unlike Amazon Forecast, which has separate datasets for target time series (TTS), related time series (RTS), and item metadata. This notebook is useful for customers who would like to move from Amazon Forecast to SageMaker Canvas. It demonstrates how to 1/ use a Python code snippet to combine the three datasets into one, 2/ create a configuration file that distinguishes related time series fields from item metadata, and 3/ use the AutoML API to programmatically train a model on the combined dataset and run batch inference.

The training job produces artifacts that include:
- backtest (holdout) forecasts per base model over multiple time windows,
- accuracy metrics per base model,
- backtest results and accuracy metrics for the ensembled model,
- a scaled explainability report displaying the importance of each covariate and static metadata feature,
- all model artifacts, provided on S3, which can be registered or used for batch or real-time inference.

If you are running this notebook in your own environment, you can bring your own data. The sample dataset used here is available [here](https://amazon-forecast-samples.s3.us-west-2.amazonaws.com/ml_ops/FoodDemand.zip). After downloading the dataset, unzip it, create a 'data' folder at the same level as the notebook, and place the three CSV files in that folder.

### 2. Setup <a name='setup'></a>

In [None]:
# Update boto3 and the SageMaker SDK using this method, or your preferred method
!pip install --upgrade boto3 --quiet
!pip install --upgrade sagemaker --quiet

In [None]:
# Import the libraries used throughout this notebook
import sagemaker
import boto3
from botocore.exceptions import ClientError
import os
import json
import logging  # used by the upload cells to report S3 errors
from sagemaker import get_execution_role
from time import gmtime, strftime, sleep
import pandas as pd
from datetime import datetime as dt

region = boto3.Session().region_name
session = sagemaker.Session()
client = boto3.client("sts")
account_id = client.get_caller_identity()["Account"]

# Modify the following to use buckets of your choosing
bucket = session.default_bucket()
# bucket = 'my-bucket'
# data_bucket stores the feature configuration file; the bucket must already exist
data_bucket = "rawdata-" + region + "-" + account_id
prefix = "moving-to-canvas"

role = get_execution_role()

# This is the client we will use to interact with SageMaker Autopilot
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

In [None]:
# Assign column headings to the three data files: target time series, related time series, and item metadata
columns_tts = ["item_id", "store_id", "demand", "ts"]

columns_rts = ["item_id", "store_id", "price", "ts"]

columns_items = ["item_id", "item_type", "item_description"]

In [None]:
# Read from data file and explore the data. Also change the time stamp format to desired one if needed.
tbl_tts = pd.read_csv("data/food-forecast-tts-uc1.csv", header=None)
tbl_tts.columns = columns_tts
tbl_tts["ts"] = pd.to_datetime(tbl_tts["ts"], format="%m/%d/%y").dt.strftime("%Y-%m-%d")
# print(tbl_tts.shape)
# tbl_tts.head()
# tbl_tts['ts'].min(), tbl_tts['ts'].max()
# print(tbl_tts.dtypes)
# print(tbl_tts.isnull().sum())

In [None]:
# Read from data file and explore the data. Also change the time stamp format to desired one if needed.
tbl_rts = pd.read_csv("data/food-forecast-rts-uc1.csv", header=None)
tbl_rts.columns = columns_rts
tbl_rts["ts"] = pd.to_datetime(tbl_rts["ts"], format="%m/%d/%y").dt.strftime("%Y-%m-%d")
# print(tbl_rts.shape)
# tbl_rts.head()
# tbl_rts['ts'].min(), tbl_rts['ts'].max()
# print(tbl_rts.dtypes)
# print(tbl_rts.isnull().sum())

In [None]:
# Read from data file and explore the data
tbl_item = pd.read_csv("data/food-forecast-item.csv", header=None)
tbl_item.columns = columns_items
# tbl_item = tbl_item.set_index('item_id')
# print(tbl_item.shape)
# tbl_item.head()

In [None]:
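# Join the three datasets into one.
# The outer join keeps rows that appear in only one of tts or rts;
# the left join attaches the static item metadata to every row of each item.
tts_rts_combined_outer = tbl_tts.merge(tbl_rts, on=["item_id", "store_id", "ts"], how="outer")
combined_tts_rts_im = tts_rts_combined_outer.merge(tbl_item, on="item_id", how="left")
combined_tts_rts_im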
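
Before relying on the merged table, it can help to check how the join keys line up. The following is a minimal sketch and not part of the original workflow: `merge_report` is a hypothetical helper, and the toy frames (with made-up values) stand in for `tbl_tts` and `tbl_rts`. It uses the `indicator` and `validate` options of pandas `merge` to count rows found in both datasets or in only one, and to fail early if a key combination is duplicated.

```python
import pandas as pd


def merge_report(tts, rts, keys):
    """Summarize how the rows of two frames match on the join keys.

    validate="one_to_one" makes pandas raise MergeError if any key
    combination appears more than once in either frame.
    """
    merged = tts.merge(rts, on=keys, how="outer", indicator=True, validate="one_to_one")
    return {
        "both": int((merged["_merge"] == "both").sum()),
        "tts_only": int((merged["_merge"] == "left_only").sum()),
        "rts_only": int((merged["_merge"] == "right_only").sum()),
    }


# Toy data (hypothetical values) with the same key columns as the sample files.
toy_tts = pd.DataFrame(
    {
        "item_id": ["a", "a"],
        "store_id": [1, 1],
        "demand": [10, 12],
        "ts": ["2023-01-01", "2023-02-01"],
    }
)
toy_rts = pd.DataFrame(
    {
        "item_id": ["a", "a", "a"],
        "store_id": [1, 1, 1],
        "price": [2.5, 2.5, 2.75],
        "ts": ["2023-01-01", "2023-02-01", "2023-03-01"],
    }
)

report = merge_report(toy_tts, toy_rts, keys=["item_id", "store_id", "ts"])
print(report)
```

With the notebook's data, the equivalent call would be `merge_report(tbl_tts, tbl_rts, ["item_id", "store_id", "ts"])`. A large `tts_only` count means many target rows have no related value and will carry NaN in `price` after the outer join.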

In [None]:
# Write the combined dataset to csv file which will be used for training the model using SageMaker Canvas API
file_name = "combined_tts_rts_item.csv"
full_path = "data/" + file_name
combined_tts_rts_im.to_csv(full_path, index=False)

In [None]:
# All columns in tts will be included in TimeSeriesConfig as it contains
# target, itemID, timestamp, and additional forecast dimensions.
exclude_columns = columns_tts
columns_to_include = [col for col in combined_tts_rts_im.columns if col not in exclude_columns]

json_data = {"FeatureAttributeNames": columns_to_include, "FeatureDataTypes": {}}

for col in columns_to_include:
    dtype = combined_tts_rts_im[col].dtype
    # All rts columns must be numeric to be treated as related features
    if col in columns_rts:
        json_data["FeatureDataTypes"][col] = "numeric"
    elif isinstance(dtype, pd.CategoricalDtype):
        json_data["FeatureDataTypes"][col] = "categorical"
    elif pd.api.types.is_datetime64_any_dtype(dtype):
        json_data["FeatureDataTypes"][col] = "datetime"
    else:
        json_data["FeatureDataTypes"][col] = "text"

json_str = json.dumps(json_data, indent=4)

# print(json_str)

with open("data/feature.json", "w") as f:
    f.write(json_str)
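
A quick consistency check on the generated file can catch mistakes before the training job is submitted. The sketch below is illustrative only: `check_feature_spec` is a hypothetical helper, `ALLOWED_TYPES` simply lists the four type names used in the cell above (treating that list as complete is an assumption), and the rule that `TimeSeriesConfig` columns should stay out of the feature list mirrors the exclusion logic of the cell above rather than a documented requirement.

```python
ALLOWED_TYPES = {"numeric", "categorical", "text", "datetime"}


def check_feature_spec(spec, dataset_columns, reserved_columns):
    """Return a list of human-readable problems found in a feature specification."""
    problems = []
    names = spec.get("FeatureAttributeNames", [])
    types = spec.get("FeatureDataTypes", {})
    for name in names:
        if name not in dataset_columns:
            problems.append(f"{name}: not a column of the dataset")
        if name in reserved_columns:
            problems.append(f"{name}: already used in TimeSeriesConfig")
        if types.get(name) not in ALLOWED_TYPES:
            problems.append(f"{name}: missing or unknown data type {types.get(name)!r}")
    for name in types:
        if name not in names:
            problems.append(f"{name}: typed but not listed in FeatureAttributeNames")
    return problems


# Hypothetical example mirroring the sample schema.
example_spec = {
    "FeatureAttributeNames": ["price", "item_type", "item_description"],
    "FeatureDataTypes": {"price": "numeric", "item_type": "text", "item_description": "text"},
}
dataset_columns = ["item_id", "store_id", "demand", "ts", "price", "item_type", "item_description"]
reserved = ["item_id", "store_id", "demand", "ts"]
print(check_feature_spec(example_spec, dataset_columns, reserved))
```

In the notebook, the same check could be run as `check_feature_spec(json_data, list(combined_tts_rts_im.columns), columns_tts)`; an empty list means no problems were found.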

In [None]:
# Upload the data file and config file to S3 bucket

s3 = boto3.client("s3")
object_name = prefix + "/train/" + file_name
# print(object_name)
try:
    response = s3.upload_file(full_path, bucket, object_name)
except ClientError as e:
    logging.error(e)

config_file_name = "feature.json"
object_name = prefix + "/" + config_file_name
config_full_path = "data/" + config_file_name

try:
    response = s3.upload_file(config_full_path, data_bucket, object_name)
except ClientError as e:
    logging.error(e)

### 3. Model Training <a name='training'></a>

Establish an AutoML training job name.

In [None]:
timestamp_suffix = strftime("%Y%m%d-%H%M%S", gmtime())
auto_ml_job_name = "ts-" + timestamp_suffix
print("AutoMLJobName: " + auto_ml_job_name)

Define training job specifications. More information about [create_auto_ml_job_v2](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job_v2.html) can be found in our SageMaker documentation.

This JSON body leverages the built-in sample data schema. Please consult the documentation to understand how to alter the parameters for your unique schema.

In [None]:
input_data_config = [
    {
        "ChannelType": "training",
        "ContentType": "text/csv;header=present",
        "CompressionType": "None",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
            }
        },
    }
]

output_data_config = {"S3OutputPath": "s3://{}/{}/train_output".format(bucket, prefix)}

optimizaton_metric_config = {"MetricName": "AverageWeightedQuantileLoss"}

automl_problem_type_config = {
    "TimeSeriesForecastingJobConfig": {
        "FeatureSpecificationS3Uri": "s3://{}/{}/feature.json".format(data_bucket, prefix),
        "ForecastFrequency": "M",
        "ForecastHorizon": 2,
        "ForecastQuantiles": ["p50", "p60", "p70", "p80", "p90"],
        "Transformations": {
            "Filling": {
                "demand": {"middlefill": "zero", "backfill": "zero"},
                "price": {"middlefill": "zero", "backfill": "zero", "futurefill": "zero"},
            }
        },
        "TimeSeriesConfig": {
            "TargetAttributeName": "demand",
            "TimestampAttributeName": "ts",
            "ItemIdentifierAttributeName": "item_id",
            "GroupingAttributeNames": ["store_id"],
        },
    }
}
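
When adapting the configuration to a different schema, a common slip is to reference a column that the combined CSV does not contain. The following sketch shows one way to check this locally before calling the API. Both helper names are hypothetical, and `example_config` is a trimmed copy of the configuration above used only for illustration.

```python
def referenced_columns(problem_type_config):
    """Collect every column name mentioned in a time-series forecasting job config."""
    job = problem_type_config["TimeSeriesForecastingJobConfig"]
    ts_config = job["TimeSeriesConfig"]
    columns = {
        ts_config["TargetAttributeName"],
        ts_config["TimestampAttributeName"],
        ts_config["ItemIdentifierAttributeName"],
    }
    columns.update(ts_config.get("GroupingAttributeNames", []))
    # The keys of the Filling section are column names as well.
    columns.update(job.get("Transformations", {}).get("Filling", {}))
    return columns


def missing_columns(problem_type_config, dataset_columns):
    """Return the referenced column names that are absent from the dataset, sorted."""
    return sorted(referenced_columns(problem_type_config) - set(dataset_columns))


# Trimmed copy of the configuration above, used here only for illustration.
example_config = {
    "TimeSeriesForecastingJobConfig": {
        "Transformations": {
            "Filling": {"demand": {"backfill": "zero"}, "price": {"backfill": "zero"}}
        },
        "TimeSeriesConfig": {
            "TargetAttributeName": "demand",
            "TimestampAttributeName": "ts",
            "ItemIdentifierAttributeName": "item_id",
            "GroupingAttributeNames": ["store_id"],
        },
    }
}

# The toy column list deliberately omits "price" to show a detected gap.
print(missing_columns(example_config, ["item_id", "store_id", "demand", "ts"]))
```

In the notebook, `missing_columns(automl_problem_type_config, combined_tts_rts_im.columns)` should return an empty list for the sample data.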

With parameters now defined, invoke the [training job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job_v2.html) and monitor for its completion.

In [None]:
sm.create_auto_ml_job_v2(
    AutoMLJobName=auto_ml_job_name,
    AutoMLJobInputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLProblemTypeConfig=automl_problem_type_config,
    AutoMLJobObjective=optimizaton_metric_config,
    RoleArn=role,
)

Next, we demonstrate a looping mechanism to query (monitor) job status. When the status is ```Completed```, you may review the accuracy of the model and decide whether to perform inference on a batch or real-time API basis as described in this notebook. Please consult documentation for [describe_auto_ml_job_v2](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_auto_ml_job_v2.html) as needed.

NOTE: Training the model will take approximately 30 minutes. Please take this time to work on other labs in this workshop.

In [None]:
describe_response = sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        dt.now(),
        describe_response["AutoMLJobStatus"]
        + " - "
        + describe_response["AutoMLJobSecondaryStatus"],
    )
    sleep(180)
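
The loop above runs until the job reaches a terminal state, however long that takes. As an alternative, here is a minimal sketch of a reusable polling helper with an upper time limit. `wait_for_status` is a hypothetical function, and the four-hour default timeout is an arbitrary assumption, not a service value. The status source is passed in as a callable, so the same helper can serve both the AutoML job and the batch transform job.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_status(get_status, poll_seconds=180, timeout_seconds=4 * 3600,
                    sleep_fn=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep_fn(poll_seconds)


# Demonstration with a fake status source and no real sleeping.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep_fn=lambda seconds: None)
print(final_status)
```

With the notebook's client, the callable could be `lambda: sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)["AutoMLJobStatus"]`. Injecting `sleep_fn` and `clock` keeps the helper easy to test without waiting.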

Once training is completed, you can use the describe function to iterate over model leaderboard results. Below is an example to use the best candidate in the subsequent inference phase. Please consult our documentation on [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html) as needed.

In [None]:
best_candidate = sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_containers = best_candidate["InferenceContainers"]
best_candidate_name = best_candidate["CandidateName"]

response = sm.create_model(
    ModelName=best_candidate_name, ExecutionRoleArn=role, Containers=best_candidate_containers
)

print("BestCandidateName:", best_candidate_name)
print("BestCandidateContainers:", best_candidate_containers)

### 4. Batch Predictions (Inference) <a name='batch'></a>

Please review the [service limits](https://docs.aws.amazon.com/marketplace/latest/userguide/ml-service-restrictions-and-limits.html) for batch transform. At the time of writing, the documentation says the maximum size of the input data per invocation is 100 MB. In practice, when working with datasets over 100 MB, you will need to prepare your data by splitting (sharding) it into multiple files. Take care to ensure each file contains whole time series. One potential way to do this is to use a function that splits data on the item key, or similar.
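
One possible shape for such a splitting function is sketched below. It is an illustration under stated assumptions, not the notebook's method: `shard_by_item` is a hypothetical helper, and it uses a row count per shard as a stand-in for file size, so the row limit would need to be chosen so that each written file stays under the size limit. Rows are grouped by the item key and whole groups are packed greedily, so no item's series is split across files.

```python
import pandas as pd


def shard_by_item(df, item_col, max_rows_per_shard):
    """Greedily pack whole item groups into shards of at most max_rows_per_shard rows.

    An item with more rows than the limit is kept intact in its own oversized shard.
    """
    shards, current, current_rows = [], [], 0
    for _, group in df.groupby(item_col, sort=False):
        # Start a new shard when adding this item would exceed the limit.
        if current and current_rows + len(group) > max_rows_per_shard:
            shards.append(pd.concat(current, ignore_index=True))
            current, current_rows = [], 0
        current.append(group)
        current_rows += len(group)
    if current:
        shards.append(pd.concat(current, ignore_index=True))
    return shards


# Toy data (hypothetical values): three items with 3, 2, and 2 rows.
toy = pd.DataFrame(
    {
        "item_id": ["a", "a", "a", "b", "b", "c", "c"],
        "ts": ["2023-01-01", "2023-02-01", "2023-03-01",
               "2023-01-01", "2023-02-01", "2023-01-01", "2023-02-01"],
        "demand": [1, 2, 3, 4, 5, 6, 7],
    }
)
toy_shards = shard_by_item(toy, "item_id", max_rows_per_shard=4)
print([len(shard) for shard in toy_shards])
```

Each shard can then be written with `shard.to_csv(..., index=False)` under its own file name and uploaded to the same input prefix. Grouping on `item_id` alone keeps all stores of an item together, which is coarser than strictly necessary but guarantees that every item and store series stays whole.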



In [None]:
timestamp_suffix = strftime("%Y%m%d-%H%M%S", gmtime())
transform_job_name = f"{best_candidate_name}-" + timestamp_suffix
print("BatchTransformJob: " + transform_job_name)

The next cells prepare the combined dataset for inference and upload it to a ```batch_transform/input``` folder in S3. Ideally, this input dataset can be all of your time-series, or a fraction thereof. Please take care to ensure the dataset is within the limits described.

In [None]:
# Replace missing (NaN) values with 0 in the input file used for inference
df = pd.read_csv("data/combined_tts_rts_item.csv")
df.fillna(0, inplace=True)
df.to_csv("data/combined_tts_rts_item_modified.csv", index=False)

In [None]:
# Upload the data file to S3 bucket for batch prediction
s3 = boto3.client("s3")
file_name = "combined_tts_rts_item.csv"
modified_file_name = "combined_tts_rts_item_modified.csv"
full_path = "data/" + modified_file_name
object_name = prefix + "/batch_transform/input/" + file_name
# print(object_name)
try:
    response = s3.upload_file(full_path, bucket, object_name)
except ClientError as e:
    logging.error(e)

In [None]:
response = sm.create_transform_job(
    TransformJobName=transform_job_name,
    ModelName=best_candidate_name,
    MaxPayloadInMB=0,
    ModelClientConfig={"InvocationsTimeoutInSeconds": 3600},
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/batch_transform/input/".format(bucket, prefix),
            }
        },
        "ContentType": "text/csv",
        "SplitType": "None",
    },
    TransformOutput={
        "S3OutputPath": "s3://{}/{}/batch_transform/output/".format(bucket, prefix),
        "AssembleWith": "Line",
    },
    TransformResources={"InstanceType": "ml.m5.4xlarge", "InstanceCount": 1},
)

Poll for batch transformation job to complete. Once completed, resulting prediction files are available at the URI shown in the prior cell, ```S3OutputPath```. We use the API method [describe_transform_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_transform_job.html) to complete this step.

In [None]:
describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)

job_run_status = describe_response["TransformJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response["TransformJobStatus"]

    print(dt.now(), describe_response["TransformJobStatus"])
    sleep(120)

Once the batch predictions are complete, download and review the resulting output. This will display the first 10 predictions.



In [None]:
s3 = boto3.resource("s3")
s3.Bucket(bucket).download_file(
    "{}/batch_transform/output/combined_tts_rts_item.csv.out".format(prefix),
    "combined_tts_rts_item.csv.out",
)
df = pd.read_csv("combined_tts_rts_item.csv.out")
df.head(10)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/autopilot|autopilot_ts_data_merge.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/autopilot|autopilot_ts_data_merge.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/autopilot|autopilot_ts_data_merge.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/autopilot|autopilot_ts_data_merge.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/autopilot|autopilot_ts_data_merge.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/autopilot|autopilot_ts_data_merge.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/autopilot|autopilot_ts_data_merge.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/autopilot|autopilot_ts_data_merge.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/autopilot|autopilot_ts_data_merge.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/autopilot|autopilot_ts_data_merge.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/autopilot|autopilot_ts_data_merge.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/autopilot|autopilot_ts_data_merge.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/autopilot|autopilot_ts_data_merge.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/autopilot|autopilot_ts_data_merge.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/autopilot|autopilot_ts_data_merge.ipynb)