# Time-Series Forecasting with Amazon SageMaker Autopilot

### Contents

1. [Introduction](#introduction)
1. [Setup](#setup)
1. [Model Training](#training)
1. [Batch Predictions (Inference)](#batch)


### 1. Introduction <a name='introduction'></a>

This notebook uses Amazon SageMaker Autopilot to train a time-series model and produce predictions from the trained model. At a high level, you provide a set of tabular historical data on S3 and make an API call to train a model. Once the model has been trained, you can elect to produce predictions as a batch or via a real-time endpoint.

As part of the training process, SageMaker Autopilot manages and runs multiple time-series models concurrently. All of these models are combined into a single ensembled model, which blends the candidate models in a ratio that minimizes forecast error. You are provided with metadata and models for the ensemble and for all underlying candidate models. SageMaker Autopilot orchestrates this entire process and provides several artifacts as a result.

These artifacts include: 
- backtest (holdout) forecasts per base model over multiple time windows,
- accuracy metrics per base model,
- backtest results and accuracy metrics for the ensembled model,
- a scaled explainability report displaying the importance of each covariate and static metadata feature,
- all model artifacts on S3, which can be registered or used for batch or real-time inference.

### 2. Setup <a name='setup'></a>

In [2]:
# Update boto3 using this method, or your preferred method
#!pip install --upgrade boto3 --quiet
#!pip install --upgrade sagemaker --quiet

In [None]:
import sagemaker
import boto3
from sagemaker import get_execution_role
from time import gmtime, strftime, sleep
import datetime
import pandas as pd

region = boto3.Session().region_name
session = sagemaker.Session()
client = boto3.client('sts')
account_id = client.get_caller_identity()["Account"]

# Modify the following default_bucket to use a bucket of your choosing
bucket = session.default_bucket()
data_bucket = 'rawdata-' + region + '-' + account_id
#bucket = 'my-bucket'
prefix = 'my-project-name'

role = get_execution_role()

# This is the client we will use to interact with SageMaker Autopilot
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

We provide a sample set of data to accompany this notebook. You may use our synthetic dataset, or alter this notebook to accommodate your own data. Note that the next cell copies a file to the S3 bucket and prefix defined in the last cell. Alternatively, you can download the file to your local disk.

IMPORTANT: When training a model, your input data can contain a mixture of covariates and static item metadata. Take care to create future-dated rows that extend to the end of your prediction horizon. In the future-dated rows, carry your static item metadata and expected covariate values. The target value (y) in future-dated rows should be empty. Please download the example synthetic file using the S3 copy command in the next cell. You can inspect the data programmatically or in a text editor as an example.

The structure of the CSV file provided is as follows:
- item_id (INT)
- store_id (INT)
- ts (TIMESTAMP)
- demand (FLOAT)
- price (FLOAT)
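The future-dated rows described above can be sketched as follows. This is a minimal illustration, not the way the sample file was produced: it assumes monthly data with timestamps on the first day of each month, and `add_months`, `future_rows`, and the `expected_price` callable are hypothetical helpers introduced only for this example.

```python
import datetime


def add_months(d, n):
    """Return the first day of the month that is n months after date d."""
    month_index = d.year * 12 + (d.month - 1) + n
    return datetime.date(month_index // 12, month_index % 12 + 1, 1)


def future_rows(history, horizon, expected_price):
    """Build future-dated rows for every (item_id, store_id) series.

    history: list of dicts with keys item_id, store_id, ts (date), demand, price.
    expected_price: hypothetical callable (item_id, store_id, ts) -> float.
    """
    # Find the last observed timestamp of each series.
    last_ts = {}
    for row in history:
        key = (row["item_id"], row["store_id"])
        if key not in last_ts or row["ts"] > last_ts[key]:
            last_ts[key] = row["ts"]

    rows = []
    for (item_id, store_id), ts in sorted(last_ts.items()):
        for step in range(1, horizon + 1):
            future_ts = add_months(ts, step)
            rows.append({
                "item_id": item_id,
                "store_id": store_id,
                "ts": future_ts,
                "demand": "",  # the target stays empty in future-dated rows
                "price": expected_price(item_id, store_id, future_ts),
            })
    return rows


history = [
    {"item_id": 1, "store_id": 10, "ts": datetime.date(2023, 11, 1), "demand": 5.0, "price": 9.99},
    {"item_id": 1, "store_id": 10, "ts": datetime.date(2023, 12, 1), "demand": 7.0, "price": 9.99},
]
extra = future_rows(history, horizon=2, expected_price=lambda item, store, ts: 9.99)
```

Here `horizon=2` mirrors a two-period forecast horizon; the rows in `extra` would be appended to the historical rows before writing the CSV.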

In [4]:
s3 = boto3.resource('s3')
copy_source = {
    'Bucket': data_bucket,
    'Key': 'consumer_electronics.csv'
}

s3.meta.client.copy(copy_source, bucket, prefix+'/train/consumer_electronics.csv')
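To inspect the sample file on local disk rather than on S3, a small helper along these lines could be used. It is only a sketch: `download_sample` is a hypothetical name, and it relies on the `Bucket(...).download_file(key, filename)` call of the boto3 S3 resource, which is passed in as a parameter.

```python
def download_sample(s3_resource, source_bucket, key, local_path):
    """Download one S3 object to local disk and return the local path."""
    s3_resource.Bucket(source_bucket).download_file(key, local_path)
    return local_path


# Usage in this notebook (s3 and data_bucket come from earlier cells):
# download_sample(s3, data_bucket, 'consumer_electronics.csv', 'consumer_electronics.csv')
```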

### 3. Model Training <a name='training'></a>

Establish an AutoML training job name

In [None]:
timestamp_suffix = strftime("%Y%m%d-%H%M%S", gmtime())
auto_ml_job_name = "ts-" + timestamp_suffix
print("AutoMLJobName: " + auto_ml_job_name)

Define the training job specifications. More information about [create_auto_ml_job_v2](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job_v2.html) can be found in our SageMaker documentation.

This JSON body leverages the built-in sample data schema. Please consult the documentation to understand how to alter the parameters for your unique schema.

In [9]:
input_data_config = [
    {
        'ChannelType': 'training',
        'ContentType': 'text/csv;header=present',
        'CompressionType': 'None',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://{}/{}/train/'.format(bucket, prefix),
            }
        }
    }
]

output_data_config = {'S3OutputPath': 's3://{}/{}/train_output'.format(bucket, prefix)}

optimizaton_metric_config = {'MetricName': 'AverageWeightedQuantileLoss'}

automl_problem_type_config = {
    'TimeSeriesForecastingJobConfig': {
        'ForecastFrequency': 'M',
        'ForecastHorizon': 2,
        'ForecastQuantiles': ['p50', 'p60', 'p70', 'p80', 'p90'],
        'Transformations': {
            'Filling': {
                'demand': {
                    'middlefill': 'zero',
                    'backfill': 'zero'
                },
                'price': {
                    'middlefill': 'zero',
                    'backfill': 'zero',
                    'futurefill': 'zero'
                }
            }
        },
        'TimeSeriesConfig': {
            'TargetAttributeName': 'demand',
            'TimestampAttributeName': 'ts',
            'ItemIdentifierAttributeName': 'item_id',
            'GroupingAttributeNames': [
                'store_id'
            ]
        }
    }
}
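The objective metric named in the configuration above, `AverageWeightedQuantileLoss`, can be illustrated with a short calculation. The sketch below follows the weighted quantile loss as it is commonly defined (twice the summed pinball loss divided by the summed absolute actuals, averaged over the requested quantiles). Whether Autopilot computes the metric identically in every detail is an assumption here, and the function names are made up for this example.

```python
def weighted_quantile_loss(actuals, forecasts, tau):
    """Weighted quantile loss for one quantile tau in (0, 1)."""
    total = 0.0
    for y, q in zip(actuals, forecasts):
        # Under-forecasts are penalized by tau, over-forecasts by (1 - tau).
        total += tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
    denom = sum(abs(y) for y in actuals)
    return 2.0 * total / denom if denom else float("nan")


def average_wql(actuals, forecasts_by_quantile):
    """Mean of the per-quantile weighted quantile losses."""
    losses = [
        weighted_quantile_loss(actuals, forecasts, tau)
        for tau, forecasts in forecasts_by_quantile.items()
    ]
    return sum(losses) / len(losses)


# Toy data: two actual values and forecasts at the p50 and p90 quantiles.
actuals = [10.0, 20.0]
forecasts = {0.5: [12.0, 18.0], 0.9: [15.0, 25.0]}
score = average_wql(actuals, forecasts)
```

A quantile such as `p90` corresponds to `tau = 0.9` in this sketch; lower scores indicate forecasts closer to the actuals.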

With parameters now defined, invoke the [training job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job_v2.html) and monitor for its completion.

In [None]:
sm.create_auto_ml_job_v2(
    AutoMLJobName=auto_ml_job_name,
    AutoMLJobInputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLProblemTypeConfig = automl_problem_type_config,
    AutoMLJobObjective=optimizaton_metric_config,
    RoleArn=role
)

Next, we demonstrate a looping mechanism to query (monitor) job status. When the status is ```Completed```, you may review the accuracy of the model and decide whether to perform inference on a batch or real-time API basis as described in this notebook. Please consult documentation for [describe_auto_ml_job_v2](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_auto_ml_job_v2.html) as needed.

NOTE: Training the model will take approximately 30 minutes. Please take this time to work on other labs in this workshop.

In [None]:
describe_response = sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
       datetime.datetime.now(), describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )
    sleep(180)
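The same polling pattern can be wrapped in a reusable helper that also enforces a timeout and raises on failure. This is a sketch under assumptions: `wait_for_job` is a hypothetical name, the default timeout is arbitrary, and the error message reads the `FailureReason` field only if the describe response contains one.

```python
import time


def wait_for_job(describe, status_key, poll_seconds=180, timeout_seconds=4 * 3600,
                 terminal=("Completed", "Failed", "Stopped"), sleep=time.sleep):
    """Poll describe() until status_key is terminal or the timeout is exceeded.

    describe: zero-argument callable returning a describe-style response dict.
    Returns the final response when the job completed; raises otherwise.
    """
    waited = 0
    while True:
        response = describe()
        status = response[status_key]
        if status in terminal:
            if status != "Completed":
                reason = response.get("FailureReason", "no reason given")
                raise RuntimeError(f"Job ended as {status}: {reason}")
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in this notebook (sm and auto_ml_job_name come from earlier cells):
# wait_for_job(lambda: sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name),
#              "AutoMLJobStatus")
```

Passing the describe call as a callable keeps the helper usable for both AutoML jobs and transform jobs, since only the status key differs.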

Once training is completed, you can use the describe function to iterate over model leaderboard results. Below is an example of using the best candidate in the subsequent inference phase. Please consult our documentation on [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html) as needed.

In [None]:
best_candidate = sm.describe_auto_ml_job_v2(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_containers = best_candidate['InferenceContainers'] 
best_candidate_name = best_candidate['CandidateName']

# Register the best candidate as a SageMaker model for use in batch transform
response = sm.create_model(
    ModelName=best_candidate_name,
    ExecutionRoleArn=role,
    Containers=best_candidate_containers
)

print('BestCandidateName:',best_candidate_name)
print('BestCandidateContainers:',best_candidate_containers)
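To look beyond the best candidate, the leaderboard can be ranked by objective metric. The sketch below assumes candidate records shaped like the entries returned by `list_candidates_for_auto_ml_job` (a `CandidateName` plus a `FinalAutoMLJobObjectiveMetric` with `MetricName` and `Value`); `rank_candidates` and the sample records are hypothetical and purely illustrative.

```python
def rank_candidates(candidates, lower_is_better=True):
    """Return (candidate name, metric name, value) tuples sorted best-first."""
    rows = []
    for candidate in candidates:
        metric = candidate.get("FinalAutoMLJobObjectiveMetric")
        if metric is None:
            continue  # skip candidates without a final objective value
        rows.append((candidate["CandidateName"], metric["MetricName"], metric["Value"]))
    return sorted(rows, key=lambda row: row[2], reverse=not lower_is_better)


# In this notebook the list could come from the SageMaker client, for example:
# candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name)["Candidates"]
sample_candidates = [
    {"CandidateName": "candidate-a",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "AverageWeightedQuantileLoss", "Value": 0.31}},
    {"CandidateName": "candidate-b",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "AverageWeightedQuantileLoss", "Value": 0.24}},
    {"CandidateName": "candidate-c"},
]
leaderboard = rank_candidates(sample_candidates)
for name, metric_name, value in leaderboard:
    print(name, metric_name, value)
```

Because the objective here is a loss, lower values rank first. A long leaderboard may span several pages, in which case the response's `NextToken` would need to be followed.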

### 4. Batch Predictions (Inference) <a name='batch'></a>

Please review [service limits](https://docs.aws.amazon.com/marketplace/latest/userguide/ml-service-restrictions-and-limits.html) with batch transform. At the time of writing, the documentation says the maximum size of the input data per invocation is 100 MB. In practice, when working with datasets over 100 MB, you will need to prepare your data by splitting (sharding) it into multiple files. Take care to ensure each file contains whole time series. One potential way to do this is to use a function that splits data on the item key, or similar.
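One way to split on the item key while keeping each time series whole is sketched below. It is an illustration only: `shard_by_series` is a hypothetical helper, rows are assumed to be dictionaries keyed by the CSV column names, and the size estimate ignores the header line, so `max_bytes` should leave some headroom below the actual limit.

```python
import csv
import io
from collections import defaultdict


def shard_by_series(rows, header, key_fields, max_bytes):
    """Pack whole series into shards whose CSV size stays under max_bytes.

    Returns a list of shards, each a list of row dicts. A single series that
    is larger than max_bytes still gets its own shard, since it is not split.
    """
    series = defaultdict(list)
    for row in rows:
        series[tuple(row[f] for f in key_fields)].append(row)

    def csv_size(series_rows):
        # Serialize the rows to measure their encoded size in bytes.
        buf = io.StringIO()
        csv.DictWriter(buf, fieldnames=header).writerows(series_rows)
        return len(buf.getvalue().encode("utf-8"))

    shards, current, current_size = [], [], 0
    for key in sorted(series):
        size = csv_size(series[key])
        if current and current_size + size > max_bytes:
            shards.append(current)  # close the shard before it overflows
            current, current_size = [], 0
        current.extend(series[key])
        current_size += size
    if current:
        shards.append(current)
    return shards


header = ["item_id", "store_id", "ts", "demand", "price"]
demo_rows = [
    {"item_id": str(i), "store_id": "10", "ts": f"2023-0{m}-01", "demand": "5", "price": "9.99"}
    for i in (1, 2, 3) for m in (1, 2)
]
demo_shards = shard_by_series(demo_rows, header, ("item_id", "store_id"), max_bytes=1)
```

Each shard can then be written to its own CSV file under the batch transform input prefix.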



Launch a batch transform job using [create_transform_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_transform_job.html). The runtime of the job is a function of the size of your data and the InstanceType and InstanceCount provided. Once the job is complete, results are available on S3 at the declared ```S3OutputPath``` location. From there, you can use an event handler or other mechanism to consume the results.

In [None]:
timestamp_suffix = strftime("%Y%m%d-%H%M%S", gmtime())
transform_job_name=f'{best_candidate_name}-' + timestamp_suffix
print("BatchTransformJob: " + transform_job_name)

The next cell copies a dataset once again, this time placing it in a ```batch_transform/input``` folder. This input dataset can contain all of your time series, or a fraction thereof. Please take care to ensure the dataset is within the limits described.

IMPORTANT: The data you supply for inference must have at least four valid historical values for each time-series.
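A quick pre-flight check for this requirement might look like the sketch below. It treats a "valid" historical value as any non-empty target value, which is an assumption, and `series_missing_history` is a hypothetical helper; the column names follow the sample schema.

```python
from collections import Counter


def series_missing_history(rows, key_fields=("item_id", "store_id"),
                           target="demand", minimum=4):
    """Return the series keys with fewer than `minimum` non-empty target values."""
    counts = Counter()
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        seen.add(key)
        value = row.get(target)
        # Future-dated rows carry an empty target and are not counted.
        if value is not None and str(value).strip() != "":
            counts[key] += 1
    return sorted(key for key in seen if counts[key] < minimum)


payload = (
    [{"item_id": "1", "store_id": "10", "demand": "5"} for _ in range(4)]
    + [{"item_id": "2", "store_id": "10", "demand": "3"} for _ in range(3)]
    + [{"item_id": "2", "store_id": "10", "demand": ""}]
)
too_short = series_missing_history(payload)
print(too_short)
```

Series reported by the check would need more history, or be left out of the payload, before the transform job is launched.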

In [14]:
s3 = boto3.resource('s3')
copy_source = {
    'Bucket': data_bucket,
    'Key': 'consumer_electronics_payload.csv'
}

s3.meta.client.copy(copy_source, bucket, prefix+'/batch_transform/input/consumer_electronics_payload.csv')

In [17]:
response = sm.create_transform_job(
    TransformJobName=transform_job_name, 
    ModelName=best_candidate_name,
    MaxPayloadInMB=0,
    ModelClientConfig={
        'InvocationsTimeoutInSeconds': 3600
    },
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://{}/{}/batch_transform/input/'.format(bucket, prefix)
            }
        },
        'ContentType': 'text/csv',
        'SplitType': 'None'
    },
    TransformOutput={
        'S3OutputPath': 's3://{}/{}/batch_transform/output/'.format(bucket, prefix),
        'AssembleWith': 'Line',
    },
    TransformResources={
        'InstanceType': 'ml.m5.4xlarge',
        'InstanceCount': 1
    }
)

Poll for the batch transform job to complete. Once it has completed, the resulting prediction files are available at the URI shown in the prior cell, ```S3OutputPath```. We use the API method [describe_transform_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_transform_job.html) to complete this step.

In [None]:
describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)

job_run_status = describe_response["TransformJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response["TransformJobStatus"]

    print(
       datetime.datetime.now(), describe_response["TransformJobStatus"]
    )
    sleep(60)

Once the batch predictions are complete, download and review the resulting output. This will display the first 10 predictions.



In [None]:
s3.Bucket(bucket).download_file('{}/batch_transform/output/consumer_electronics_payload.csv.out'.format(prefix), 
                                'consumer_electronics_payload.csv.out')
df = pd.read_csv('consumer_electronics_payload.csv.out')
df.head(10)