# Lab 1: time series forecast with Amazon SageMaker Canvas
This notebook shows how to use no code AutoML [Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html) to train models and generate predictions without needing to write any code.

Canvas supports training of a custom model for [time series forecasting](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html). Canvas time series model training uses the [Sagemaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html), which enables the use of various Autopilot’s public APIs. These include operations like `CreateAutoMLJobV2`, `ListCandidatesForAutoMLJob`, and `DescribeAutoMLJobV2` among others. This integration facilitates a streamlined process for training machine learning models directly within the Canvas environment.

Canvas automatically trains candidate models using [these time series forecasting algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-forecasting-algorithms.html) and then creates a model ensemble as a final model.

In this notebook you learn how to use Canvas features to explore and process data and how to train a time series model in different build modes. Finally you learn how to deploy a trained model to a [SageMaker real-time inference endpoint](real-time) and register the model in the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html).

<div class="alert alert-info">
While working with Canvas, you don't need to write any code or use Jupyter notebooks. This notebook is only for instructions and to download and preprocess a time series dataset.
</div>

## Setup notebook environment

In [1]:
import boto3
import zipfile
import sagemaker
import os
import numpy as np
import pandas as pd
import json
from pprint import pprint
from time import gmtime, strftime, sleep

sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


'2.227.0'

In [2]:
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

experiment_prefix = "canvas"
s3_bucket = sagemaker_session.default_bucket()  # replace with your bucket if needed
s3_prefix = "canvas-demo-notebook" 
s3_data_path = f"s3://{s3_bucket}/{s3_prefix}/data"
sm = boto3.client("sagemaker")

In [3]:
# create a folder to keep model performance reports
os.makedirs("./model-performance", exist_ok=True)

In [4]:
# get domain_id and user profile name
NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"
domain_id = None

if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        metadata = json.loads(f.read())
        domain_id = metadata.get('DomainId')
        space_name = metadata.get('SpaceName')
        print(f"SageMaker domain id: {domain_id}")

if not space_name:
    raise Exception(f"Cannot find the current space name. Make sure you run this notebook in a JupyterLab in the SageMaker Studio")
else:
    print(f"Space name: {space_name}")
    
r = boto3.client("sagemaker").describe_space(DomainId=domain_id, SpaceName=space_name)
user_profile_name = r['OwnershipSettings']['OwnerUserProfileName']

assert(user_profile_name)
print(f"User profile: {user_profile_name}")

%store domain_id
%store space_name
%store user_profile_name
%store region

SageMaker domain id: d-mv9ybtbztu4a
Space name: ts-space
User profile: studio-user-ts-4b422b90
Stored 'domain_id' (str)
Stored 'space_name' (str)
Stored 'user_profile_name' (str)
Stored 'region' (str)


## Prepare the data

All notebooks in this workshop using the real-world same dataset. It makes possible to compare model metrics across different approaches. 

You use the [electricity dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) from the repository of the University of California, Irvine:
> Trindade, Artur. (2015). ElectricityLoadDiagrams20112014. UCI Machine Learning Repository. https://doi.org/10.24432/C58C86.

### Download the dataset
Download the from the SageMaker example S3 bucket.

In [5]:
os.makedirs("./data", exist_ok=True)

In [6]:
dataset_zip_file_name = 'LD2011_2014.txt.zip'
dataset_path = f'./data/LD2011_2014.txt'

s3_dataset_path = f"datasets/timeseries/uci_electricity/{dataset_zip_file_name}"

In [7]:
if not os.path.isfile(dataset_path):
    print(f'Downloading and unzipping the dataset to {dataset_path}')
    s3_client = boto3.client("s3")
    s3_client.download_file(
        f"sagemaker-example-files-prod-{region}", s3_dataset_path, f"./data/{dataset_zip_file_name}"
    )

    zip_ref = zipfile.ZipFile(f"./data/{dataset_zip_file_name}", "r")
    zip_ref.extractall("./data")
    zip_ref.close()
    dataset_path = '.'.join(zip_ref.filename.split('.')[:-1])
else:
    print(f'The dataset {dataset_path} exists, skipping download and unzip!')

Downloading and unzipping the dataset to ./data/LD2011_2014.txt


In [8]:
# see what is inside the file
!head -n 2 {dataset_path} 

"";"MT_001";"MT_002";"MT_003";"MT_004";"MT_005";"MT_006";"MT_007";"MT_008";"MT_009";"MT_010";"MT_011";"MT_012";"MT_013";"MT_014";"MT_015";"MT_016";"MT_017";"MT_018";"MT_019";"MT_020";"MT_021";"MT_022";"MT_023";"MT_024";"MT_025";"MT_026";"MT_027";"MT_028";"MT_029";"MT_030";"MT_031";"MT_032";"MT_033";"MT_034";"MT_035";"MT_036";"MT_037";"MT_038";"MT_039";"MT_040";"MT_041";"MT_042";"MT_043";"MT_044";"MT_045";"MT_046";"MT_047";"MT_048";"MT_049";"MT_050";"MT_051";"MT_052";"MT_053";"MT_054";"MT_055";"MT_056";"MT_057";"MT_058";"MT_059";"MT_060";"MT_061";"MT_062";"MT_063";"MT_064";"MT_065";"MT_066";"MT_067";"MT_068";"MT_069";"MT_070";"MT_071";"MT_072";"MT_073";"MT_074";"MT_075";"MT_076";"MT_077";"MT_078";"MT_079";"MT_080";"MT_081";"MT_082";"MT_083";"MT_084";"MT_085";"MT_086";"MT_087";"MT_088";"MT_089";"MT_090";"MT_091";"MT_092";"MT_093";"MT_094";"MT_095";"MT_096";"MT_097";"MT_098";"MT_099";"MT_100";"MT_101";"MT_102";"MT_103";"MT_104";"MT_105";"MT_106";"MT_107";"MT_108";"MT_109";"MT_110";"MT_111

### Dataset overview

The dataset contains electricity consumption measurements recorded every 15 minutes from 2011 to 2014. It is a large-scale dataset with 370 instances or clients and 140,256 features, making it suitable for various machine learning tasks such as regression and clustering.

#### Data Structure and Format

The data is stored in a CSV format with semicolon (;) separators. The file structure is as follows:

- The first column contains date and time information in the format 'yyyy-mm-dd hh:mm:ss'
- Subsequent columns represent individual clients, with each containing float values of electricity consumption in kilowatts (kW)

#### Key Characteristics

- **Time resolution**: Measurements are taken every 15 minutes, resulting in 96 readings per day (24 hours * 4 readings per hour).
- **Unit of measurement**: Values are recorded in kilowatts (kW). To convert to kilowatt-hours (kWh), the values need to be divided by 4.
- **Data completeness**: The dataset has no missing values, ensuring data integrity for analysis.
- **Time Zone**: All time labels are reported in Portuguese time.

#### Special Considerations

1. **New clients**: Some clients were added after 2011. For periods before their addition, their consumption is recorded as zero.
2. **Daylight saving time changes**:
   - In March (spring forward): The day with 23 hours shows zero consumption between 1:00 AM and 2:00 AM for all clients.
   - In October (fall back): The 25-hour day combines the consumption of two hours between 1:00 AM and 2:00 AM.

#### Potential Applications

This dataset is valuable for various research and practical applications in the field of energy consumption analysis and forecasting. It can be used for:

- Developing predictive models for electricity demand
- Identifying consumption patterns and anomalies
- Clustering clients based on their consumption behaviors
- Studying the impact of seasonal changes on electricity usage

### Preprocess data

In [9]:
# load the dataset into a DataFrame from the file
df_raw = pd.read_csv(
    dataset_path, 
    sep=';', 
    index_col=0,
    decimal=',',
    parse_dates=True,
)
df_raw

Unnamed: 0,MT_001,MT_002,MT_003,MT_004,MT_005,MT_006,MT_007,MT_008,MT_009,MT_010,...,MT_361,MT_362,MT_363,MT_364,MT_365,MT_366,MT_367,MT_368,MT_369,MT_370
2011-01-01 00:15:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 00:30:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 00:45:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 01:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 01:15:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-12-31 23:00:00,2.538071,22.048364,1.737619,150.406504,85.365854,303.571429,11.305822,282.828283,68.181818,72.043011,...,276.945039,28200.0,1616.033755,1363.636364,29.986962,5.851375,697.102722,176.961603,651.026393,7621.621622
2014-12-31 23:15:00,2.538071,21.337127,1.737619,166.666667,81.707317,324.404762,11.305822,252.525253,64.685315,72.043011,...,279.800143,28300.0,1569.620253,1340.909091,29.986962,9.947338,671.641791,168.614357,669.354839,6702.702703
2014-12-31 23:30:00,2.538071,20.625889,1.737619,162.601626,82.926829,318.452381,10.175240,242.424242,61.188811,74.193548,...,284.796574,27800.0,1556.962025,1318.181818,27.379400,9.362200,670.763828,153.589316,670.087977,6864.864865
2014-12-31 23:45:00,1.269036,21.337127,1.737619,166.666667,85.365854,285.714286,10.175240,225.589226,64.685315,72.043011,...,246.252677,28000.0,1443.037975,909.090909,26.075619,4.095963,664.618086,146.911519,646.627566,6540.540541


For Canvas model building you need to resample the time series to one of the supported forecast intervals. At the time of this writing, Canvas [supports](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html) the following intervals:

- 1, 5, 15, 30 min
- 1 hour
- 1 day
- 1 week
- 1 month
- 1 year

The current time series is sampled to 15 min interval, which is supported by Canvas. To decrease the number of rows and model training time in Canvas, you need to resample the dataset to a more coarse interval, for example to 1 hour or to 1 day. Both time series 1H and 1D aggregation can be processed by Canvas.

Consider the following model building times when selecting your aggregation or number of time series in the dataset:

Dataset size|Sampling|Build type|Training time
---|---|---|---
Small - 2 time series, 70130 rows|1 hour|Quick|8-10 min
Small - 2 time series|1 hour|Standard|150 min
Small - 2 time series, 2924 rows|1 day|Quick|8-10 min
Small - 2 time series|1 day|Standard|30-40 min 
Full - 370 time series, 12974050 rows|1 hour|Quick|10-14 min
Full - 370 time series|1 hour|Standard|130 min 
Full - 370 time series, 540940 rows|1 day|Quick|10-14 min
Full - 370 time series|1 day|Standard|30-40 min 

Note that training time is an approximation only and might be different in your environment.

<div class="alert alert-info">
For this workshop we recommend to use a small or full dataset with 1D aggregation. You can also train multiple models in parallel in Canvas. The standard build for this dataset takes about 30 min.
</div>

In [11]:
# resample to Canvas supported interval
# select and uncomment your aggregation interval and the divider for kW data
# freq = "1h" 
# div = 4
freq = "1d"
div = 96

data_kw = df_raw.resample(freq).sum() / div

In [12]:
data_kw

Unnamed: 0,MT_001,MT_002,MT_003,MT_004,MT_005,MT_006,MT_007,MT_008,MT_009,MT_010,...,MT_361,MT_362,MT_363,MT_364,MT_365,MT_366,MT_367,MT_368,MT_369,MT_370
2011-01-01 00:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 01:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 02:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 03:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 04:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-12-31 20:00:00,2.220812,25.248933,1.737619,186.483740,92.073171,340.773810,11.305822,315.656566,91.783217,81.451613,...,333.511777,39700.0,1702.531646,2238.636364,74.967405,4.388531,375.768218,108.931553,688.416422,8405.405405
2014-12-31 21:00:00,2.538071,22.759602,1.737619,162.093496,86.280488,319.940476,11.588468,269.360269,76.486014,70.161290,...,327.266238,38575.0,1649.789030,1477.272727,74.967405,3.949678,465.539947,154.841402,662.023460,8283.783784
2014-12-31 22:00:00,1.903553,22.048364,1.737619,161.077236,86.890244,314.732143,11.305822,251.683502,71.678322,72.311828,...,306.209850,35475.0,1636.075949,1375.000000,64.211213,7.753072,655.179982,195.325543,679.252199,7594.594595
2014-12-31 23:00:00,2.220812,21.337127,1.737619,161.585366,83.841463,308.035714,10.740531,250.841751,64.685315,72.580645,...,271.948608,28075.0,1546.413502,1232.954545,28.357236,7.314219,676.031607,161.519199,659.274194,6932.432432


In [13]:
# select two random time series to include in a small dataset
sample_size = 2
columns_to_keep = np.random.choice(data_kw.columns.to_list(), size=sample_size, replace=False)
columns_to_keep

array(['MT_246', 'MT_022'], dtype='<U6')

In [14]:
data_kw_small = data_kw[columns_to_keep]
data_kw_small

Unnamed: 0,MT_246,MT_022
2011-01-01 00:00:00,310.643154,0.000000
2011-01-01 01:00:00,381.317427,0.000000
2011-01-01 02:00:00,389.616183,0.000000
2011-01-01 03:00:00,390.653527,0.000000
2011-01-01 04:00:00,394.813278,0.000000
...,...,...
2014-12-31 20:00:00,678.703320,42.185554
2014-12-31 21:00:00,641.234440,37.515567
2014-12-31 22:00:00,532.520747,33.156912
2014-12-31 23:00:00,315.912863,33.935243


You need to stack individual time series in one column since Canvas supports only one column with `timeseries id`.

In [15]:
def stack_timeseries(df, ts_name, var_name, value_name):
    # Melt the DataFrame
    melted_df = pd.melt(
        df.reset_index(),
        id_vars='index', 
        value_vars=df.columns, 
        var_name=var_name, 
        value_name=value_name
    )
    
    # Rename the 'index' column to 'timestamp'
    return melted_df.rename(columns={'index': ts_name})

In [16]:
stacked_df = stack_timeseries(data_kw, 'ts', 'mt_id', 'consumption')
stacked_df

Unnamed: 0,ts,mt_id,consumption
0,2011-01-01 00:00:00,MT_001,0.000000
1,2011-01-01 01:00:00,MT_001,0.000000
2,2011-01-01 02:00:00,MT_001,0.000000
3,2011-01-01 03:00:00,MT_001,0.000000
4,2011-01-01 04:00:00,MT_001,0.000000
...,...,...,...
12974045,2014-12-31 20:00:00,MT_370,8405.405405
12974046,2014-12-31 21:00:00,MT_370,8283.783784
12974047,2014-12-31 22:00:00,MT_370,7594.594595
12974048,2014-12-31 23:00:00,MT_370,6932.432432


In [17]:
stacked_df_small = stack_timeseries(data_kw_small, 'ts', 'mt_id', 'consumption')
stacked_df_small

Unnamed: 0,ts,mt_id,consumption
0,2011-01-01 00:00:00,MT_246,310.643154
1,2011-01-01 01:00:00,MT_246,381.317427
2,2011-01-01 02:00:00,MT_246,389.616183
3,2011-01-01 03:00:00,MT_246,390.653527
4,2011-01-01 04:00:00,MT_246,394.813278
...,...,...,...
70125,2014-12-31 20:00:00,MT_022,42.185554
70126,2014-12-31 21:00:00,MT_022,37.515567
70127,2014-12-31 22:00:00,MT_022,33.156912
70128,2014-12-31 23:00:00,MT_022,33.935243


In [18]:
stacked_df.to_csv(f"./data/canvas_ts_full_{freq}.csv", index=False, header=True)
stacked_df_small.to_csv(f"./data/canvas_ts_small_{freq}.csv", index=False, header=True)

### Upload to S3

In [19]:
# !aws s3 rm {s3_data_path}/ --recursive

In [20]:
# upload the datasets to S3
!aws s3 cp ./data/canvas_ts_full_{freq}.csv {s3_data_path}/
!aws s3 cp ./data/canvas_ts_small_{freq}.csv {s3_data_path}/

upload: data/canvas_ts_full_1h.csv to s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_full_1h.csv
upload: data/canvas_ts_small_1h.csv to s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_small_1h.csv


In [21]:
!aws s3 ls {s3_data_path}/

2024-08-29 12:20:06   18138757 canvas_ts_full_1D.csv
2024-08-26 20:26:58  546310782 canvas_ts_full_1H.csv
2024-09-24 10:05:31   18138757 canvas_ts_full_1d.csv
2024-09-30 08:22:22  546310782 canvas_ts_full_1h.csv
2024-08-29 12:20:07      88893 canvas_ts_small_1D.csv
2024-08-26 20:27:05    2665442 canvas_ts_small_1H.csv
2024-09-24 10:05:33      96336 canvas_ts_small_1d.csv
2024-09-30 08:22:26    3034606 canvas_ts_small_1h.csv


In [22]:
print(f"""
S3 path: {s3_data_path}/

Uploaded datasets:
Small {data_kw_small.shape[1]} timeseries, {freq} aggregation -> {stacked_df_small.shape[0]} rows
Full {data_kw.shape[1]} timeseries, {freq} aggregation -> {stacked_df.shape[0]} rows
""")


S3 path: s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/

Uploaded datasets:
Small 2 timeseries, 1h aggregation -> 70130 rows
Full 370 timeseries, 1h aggregation -> 12974050 rows



## Check required quotas

In order to follow the optimal workshop flow you need to run at least three AutoML jobs in parallel – two model trainings in Canvas and one in the Autopilot notebook. The following code checks your AWS account quotas for a maximum number of concurrent AutoML Jobs.

In [23]:
r = boto3.client("service-quotas").get_service_quota(ServiceCode="sagemaker", QuotaCode='L-CFC2D5B6')['Quota']
q, n = r["Value"], r["QuotaName"]
min_n = 3

print(f"\033[92mSUCCESS: Quota {q} for {n} >= required {min_n}\033[0m" if q >= min_n else f"\033[91mWARNING: Quota {q} for {n} < required {min_n}\033[0m")

[92mSUCCESS: Quota 4.0 for Maximum number of concurrent AutoML Jobs >= required 3[0m


## Train models in Amazon SageMaker Canvas



### Start Canvas

Log in to SageMaker Canvas from the SageMaker Studio by following [this instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-getting-started-step1).

### Step 1: load and explore time series
<div class="alert alert-info">In this section you import the dataset into Canvas.</div>

Follow the instructions how to [import tabular data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html#canvas-import-dataset-tabular).

Choose Amazon S3 as the data source and use the S3 bucket and path printed by the previous code cell in this notebook.

You must import at least one dataset - for example, the small one `canvas_ts_small_1D.csv`. Optionally you can repeat import operation and create another dataset in Canvas with the data from the full time series dataset or different aggregation - for example, `canvas_ts_full_1D.csv` or `canvas_ts_small_1H.csv`.  The file names, Amazon S3 bucket name, and row number in each datasets are printed by the previous code cell in this notebook.

After you imported dataset or datasets, you see them in the **Datasets** view in Canvas. Note that Canvas shows the size of the datasets and number of cells:

![](../img/canvas-datasets.png)

#### Transformations
You don't need to perform any transformations on this time series dataset. In case you need to perform data cleaning, feature engineering, or any data processing, you can use Canvas built-in [data transformation functionality](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-transform.html) with Amazon SageMaker Data Wrangler.

Refer to documentation for [time series transformations](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-transform.html#canvas-transform-time-series) and [chat for data preparation](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-chat-for-data-prep.html) for more details.

To try out Canvas data processing with Data Wrangler, you need to create a data flow with the imported dataset. In the **Datasets** view choose your dataset and select **Create a data flow**:

![](../img/canvas-create-data-flow.png)

By using Data Wrangler flows you can:
- perform complex data transformations and feature engineering
- generate visualizations on your data
- create a model with the prepared data
- export the dataset to Canvas or to Amazon S3
- export data flow as Python code in a Jupyter notebook

Refer to the documentation about [Canvas data processing](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-data-processing.html) for more details.

#### Visualizations
You can [perform EDA](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html) on the imported datasets in Canvas. 

To create built-in visualizations for time series datasets, follow the instructions for:
- [Seasonal trend decomposition in time series data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html#canvas-seasonal-trend-decomposition)
- [Detect anomalies in time series data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html#canvas-time-series-anomaly-detection)

Note, that with the current dataset you cannot create the visualizations above because there are duplicated timestamps due to stacking all `mt_id` into a single column. You can use Canvas Data Wrangler transformations to "unstack" the dataset for visualizations.

### Step 2: create and configure a model
<div class="alert alert-info">In this section you configure a time series forecast model using the imported dataset.</div>

Canvas can train a [custom model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html) of one of the pre-defined model types. In this example you build a [time series forecast](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html).

When you use Canvas for your real-world projects, be aware about the following limitations:

Limit|Time series forecasting
---|---
Quick build time|2-20 min
Standard build time|2-4 hours
Downsample size|30GB
Min number of rows for Quick build|N/A
Min number of rows for Standard build|50
Max number of rows for Quick build|N/A
Max number of rows for Standard build|150K
Max number of columns|1K

Navigate to the **Datasets** view, choose the the dataset you've just imported and choose **Create a model**:

![](../img/canvas-create-model.png)

Configure your model on the model page:

![](../img/canvas-configure-model-config.png)

- Select the `consumption` column as **Target column**
- Select **Configure model**:
    - Choose correct column names: `mt_id` as **Item ID column** and `ts` as **time stamp column**
    - enter 30 hours for prediction length if you use 1h sampling or 7 days if you use 1d sampling. **Note that for Quick build you can choose maximum 30 hours as a prediction interval**
    - choose Avg. wQL as the **Objective metric**
    - choose algorithms in the **Algorithms** view
    - keep default `0.10, 0.50, 0.90` forecast quantiles

Refer to the documentation on [Advanced time series forecasting model settings](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-advanced-settings.html#canvas-advanced-settings-time-series) for more details on algorithms and settings.

### Experiment 1: quick build mode
<div class="alert alert-info">In this section you build a single time series model.</div>

<div class="alert alert-info">
If you use an AWS-provided account as part of an AWS-led workshop and train more than two models, you may encounter an error message in the Canvas UI about resource limits. This is caused by the account-level quota on Amazon SageMaker inference endpoint instances, which limits you to running predictions on a maximum of two endpoints - corresponding to two trained models. However, you can still train as many models in Canvas as you like.
</div>

After configuring the model in the previous section, first choose **Quick build** option from the drop down menu of the build button. Then click **Quick build** to begin a build for your model:

![](../img/canvas-start-quick-build.png)

The quick build takes about 5-7 minutes for this dataset.

Note, that if you do a **Quick build** on a dataset with more than 50,000 rows, then Canvas samples your data down to 50,000 rows for a shorter model training time.

#### Analyze
After the quick build is completed, you can [evaluate the performance of the time series forecasting model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-scoring.html#canvas-scoring-time-series).

Here is the screenshot of the model metrics for the 1H sampling small dataset with two time series:
![](../img/canvas-analyze-quick-build.png)

You should get similar metrics on your dataset. If you metrics differ by a large margin, check if you configured a correct prediction length in the model configuration.

#### Predict
Now when the model ready, you can [make predictions for your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-make-predictions.html).

Choose **Predict** in the model view, **Single prediction**, and select a target time seires in the Item drop down:

![](../img/canvas-predictions.png)

Canvas predicts 30 data points for the Quick build if you use 1H sampling or 7 points if you use 1D sampling by using the trained model.

To generate predictions after you build a model in Canvas, Canvas automatically deploys an asynchronous SageMaker endpoint into your AWS account. Canvas uses the endpoint to generate a single prediction. For batch predictions, Canvas starts a SageMaker batch transform job. The endpoint deployed by Canvas can be used only for in-app predictions and cannot be used outside Canvas.

You can see these inference endpoins in the Studio UI by opening the link constructed by the code cell below.

In [36]:
from IPython.display import HTML

# Show the inference endpoints link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/inference-experience/endpoints">inference endpoints</a> in the Studio UI</b>'.format(
            domain_id, region))
)

### Experiment 2: standard build mode
<div class="alert alert-info">In this section you train six different time series models and create an ensemble as the final model.</div>

In standard build mode Canvas trains the [six built-in algorithms]((https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-forecasting-algorithms.html)) with your target time series. Then, using a stacking ensemble method, it combines these model candidates to create an optimal forecasting model for a given objective metric. A standard build usually takes about 1-4 hours.

To start a standard build, you need to create a new model version. Refer to the documentation [Adding model versions in Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-update-model.html) for more details on model versioning in Canvas.

To create a new model version navigate to the **My Models** view and choose **View** on your model. Create a new version for the standard build:

![](../img/canvas-model-new-version.png)

<div class="alert alert-info">In the <b>Build</b> view select <b>Configure model</b> and enter 168 hours for prediction length if you use 1H sampling or 7 days if you use 1D sampling.</div>

Now select **Standard build**. Canvas starts the model building. You can navigate away from that page.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
The standard build on 1D dataset takes about 40 minutes. You can continue with the next notebook and then come back to this one.
</div>

#### Analyze
After the standard build finished you see the Status `Ready` in the model version list and can compare the final metrics between builds:

![](../img/canvas-build-comparison.png)

Choose the model version with the standard build to have more details on metrics, generated artifacts, and the [model leaderboard](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model-candidates.html).

![](../img/canvas-analyze-standard-build.png)

In the model leaderboard you can see model details and use any model as the default model for predictions:

![](../img/canvas-model-leaderboard.png)

#### Download and convert model performance report
To compare Canvas model performance with other approaches, download the model performance report.

<div class="alert alert-info">Repeat this section for each model you trained and would like to compare. Note you can download performance report only for a standard build.</div>

In the model **Analyze** view, choose **Download** in the right bottom of the screen and then choose **Download accuracy metrics JSON**:

![](../img/canvas-download-performance-report.png)

Copy the downloaded file to the `notebooks` folder on the JupyterLab file volume: `modern-time-series-forecasting-on-aws/notebooks`. Rename the file if needed.

The following code converts the JSON document to a DataFrame and save as an `CSV` file on the JupyterLab file volume. Each notebook in this workshop saves own model performance report in the same format to compare all approaches.

In [24]:
model_metrics_json = {'Report is not loaded'}
report_file_name = 'report (7).json' # rename file or adjust the variable if needed

try:
    with open(f'./{report_file_name}', 'r') as f:
        # Load the JSON report data from the file
        model_metrics_json = json.load(f)
except FileNotFoundError as e:
    print("Please follow the instructions above, download the performance report from Canvas and copy it to the local JupyterLab file volume.")

model_metrics_json

{'unweighted_p10_quantile_loss': 4087574.648355538,
 'unweighted_p50_quantile_loss': 4840342.181498209,
 'unweighted_p90_quantile_loss': 1888101.066152417,
 'w_p10_quantile_loss': 0.13282640830014242,
 'w_p50_quantile_loss': 0.1572877126955313,
 'w_p90_quantile_loss': 0.06135415367290962,
 'mse': 225106.30766406297,
 'MAPE': 0.5823401377761089,
 'MASE': 2.930670760949932,
 'RMSE': 474.45369390917693,
 'WAPE': 0.1579361085764277,
 'total_demand': 30773809.97247936,
 'mean_wQuantileLoss': 0.11715609155619444}

In [25]:
# set report_is_full_dataset to True if you loaded the full dataset with all time series to Canvas, otherwise leave at False
report_is_full_dataset = False

# set the corresponding time series frequency frequency
# report_freq = '1H' 
# report_freq = '1D'

experiment_name = f"{experiment_prefix}-{report_freq}-{data_kw.shape[1]}-{data_kw.shape[0]}" if report_is_full_dataset else f"{experiment_prefix}-{report_freq}-{data_kw_small.shape[1]}-{data_kw_small.shape[0]}"
timestamp = strftime("%Y%m%d-%H%M%S", gmtime())

In [26]:
# check the experiment name and adjust if needed
# the experiment name contains: {approach_name}-{aggregation}-{number of time series in the dataset}-{number of rows}
experiment_name

'canvas-1H-370-35065'

In [27]:
# convert to a DataFrame and save as a file for later comparison
model_metrics_df = pd.DataFrame.from_dict(model_metrics_json, orient='index', columns=['value'])
model_metrics_df = model_metrics_df.reset_index().rename(columns={'index': 'metric_name'})
model_metrics_df['experiment'] = experiment_name
model_metrics_df['timestamp'] = timestamp
model_metrics_df = model_metrics_df[['timestamp','metric_name','value','experiment']]

model_metrics_df

Unnamed: 0,timestamp,metric_name,value,experiment
0,20240930-104508,unweighted_p10_quantile_loss,4087575.0,canvas-1H-370-35065
1,20240930-104508,unweighted_p50_quantile_loss,4840342.0,canvas-1H-370-35065
2,20240930-104508,unweighted_p90_quantile_loss,1888101.0,canvas-1H-370-35065
3,20240930-104508,w_p10_quantile_loss,0.1328264,canvas-1H-370-35065
4,20240930-104508,w_p50_quantile_loss,0.1572877,canvas-1H-370-35065
5,20240930-104508,w_p90_quantile_loss,0.06135415,canvas-1H-370-35065
6,20240930-104508,mse,225106.3,canvas-1H-370-35065
7,20240930-104508,MAPE,0.5823401,canvas-1H-370-35065
8,20240930-104508,MASE,2.930671,canvas-1H-370-35065
9,20240930-104508,RMSE,474.4537,canvas-1H-370-35065


In [28]:
model_metrics_df.to_csv(f'./model-performance/{experiment_name}-{timestamp}.csv', index=False)

#### Predict

As with the quick build, Canvas automatically deploys an asynchronous SageMaker endpoint to your AWS account. This endpoint cannot be used outside Canvas. If you'd like to deploy a SageMaker endpoint hosting this model, you can use Canvas **Deploy** feature.

<div class="alert alert-info">
If you use an AWS-provided account as a part of AWS-led workshop and train more than two models, you might have an error message about resource limit when you run predict in the Canvas UI. This is caused by the account-level quota on SageMaker inference endpoint instances. With this limit you can run predictions on maximum two endpoints - two trained models.
</div>

### Experiment 3: register a model in the Model Registry
<div class="alert alert-info">In this section you register a trained model in the SageMaker Model Registry.</div>

Canvas supports [MLOps features](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-mlops.html) to implement a no-code end-to-end ML development and deployment workflow.

You can keep you model in the SageMaker central model registry by registering the model directly from Canvas UI. Follow the instructions in [register a model version in the SageMaker model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-register-model.html#canvas-register-model-register) to register any of your trained models.

For example, you can register a model from the model version list:

![](../img/canvas-add-to-model-registry.png)

After registering the model, Canvas shows the model package details:

![](../img/canvas-model-registry-details.png)

You can now use "Model package group name" to have access to a model version via [boto3 API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_endpoint.html) or [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html).

Click on the link constructed by the code cell below to open the Model Registry experience in the Studio UI.

In [44]:
# Show the model registry link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/models/registered-models">model registry</a> in the Studio UI</b>'.format(
            domain_id, region))
)

### Experiment 4: deploy the trained model as a SageMaker endpoint
<div class="alert alert-info">In this section you deploy a trained model to a real-time inference endpoint.</div>

Follow the instructions in [deploy your models to an endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-deploy-model.html#canvas-deploy-model-deploy) to create a real-time inference endpoint.

One of the options to deploy a specific model version is to select **Deploy** from the model version list:

![](../img/canvas-model-deploy.png)

<div class="alert alert-info">For this model use <code>ml.m5.xlarge</code> or <code>ml.m5.2xlarge</code> instance types.</div>

You can also [view all deployments](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-deploy-model.html#canvas-deploy-model-view) of the model in the Canvas UI on the model version page or in **ML Ops** view.

To see the inference endpoints in the Studio UI, click on the link constructed by the code cell below.

In [66]:
# Show the inference endpoints link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/inference-experience/endpoints">inference endpoints</a> in the Studio UI</b>'.format(
            domain_id, region))
)

## Compare Canvas model performance

Canvas calculates advanced [metrics for time series forecasts](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-metrics.html#canvas-time-series-forecast-metrics) for each model training run and for both quick and standard builds. Refer also to Amazon SageMaker Autopilot [time series metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-objective-metric.html) for more details on the calculated metrics.

The following table summarizes typical build metrics for the datasets used in this notebook. Note that your metrics might differ especially for small datasets because the time series in the small datasets are randomly selected from the full set of 370 time series.

Dataset|Build|WAPE|Avg. wQL|MAPE|MASE|RMSE
---|---|---|---|---|---|---
Small 1h|Standard|0.180|**0.131**|0.320|0.402|36.419
Small 1h|Quick|0.170|**0.138**|0.320|1.683|37.239
Small 1d|Standard|0.226|**0.176**|24.413|0.737|71.674
Small 1d|Quick|0.223|**0.198**|25.069|3.509|71.550
Full 1h|Standard|0.161|**0.121**|0.611|3.007|525.860
Full 1h|Quick|0.170|**0.122**|2.063|1.438|166.361
Full 1d|Standard|0.247|**0.195**|29.616|5.478|743.551
Full 1d|Quick|0.188|**0.168**|48.062|3.718|218.941


Print out all saved model performance reports:

In [79]:
path = './model-performance'
dfs = []

for fn in os.listdir(path):
    # Load all "canvas" experiments
    if 'canvas' in fn:
        dfs.append(pd.read_csv(os.path.join(path, fn)))

pd.concat(dfs).set_index('experiment')

Unnamed: 0_level_0,timestamp,metric_name,value
experiment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
canvas-1D-2-35065,20240920-091227,unweighted_p10_quantile_loss,522.1545
canvas-1D-2-35065,20240920-091227,unweighted_p50_quantile_loss,418.2918
canvas-1D-2-35065,20240920-091227,unweighted_p90_quantile_loss,88.93524
canvas-1D-2-35065,20240920-091227,w_p10_quantile_loss,0.2705572
canvas-1D-2-35065,20240920-091227,w_p50_quantile_loss,0.2167402
canvas-1D-2-35065,20240920-091227,w_p90_quantile_loss,0.04608228
canvas-1D-2-35065,20240920-091227,mse,5214.696
canvas-1D-2-35065,20240920-091227,MAPE,13.40919
canvas-1D-2-35065,20240920-091227,MASE,0.8540038
canvas-1D-2-35065,20240920-091227,RMSE,72.21285


---

## Clean up

### Delete unused inference endpoints
If you deployed any Canvas model to a SageMaker endpoint, delete the deployed endpoints to avoid incurring costs.

Run the code cell below to see what endpoints exists in your AWS account.

In [87]:
endpoints = boto3.client("sagemaker").list_endpoints()

In [88]:
# this doesn't show Canvas async endpoints deployed automatically
[f"{e['EndpointName']} -> {e['EndpointStatus']}" for e in endpoints['Endpoints'] if 'async' not in e['EndpointName']]

['canvas-new-deployment-08-27-2024-3-57-PM -> InService',
 'deepar-electricity-demo-2024-08-27-13-24-59-350 -> InService']

Open the link constructed by the code cell below and delete unused endpoints:

In [89]:
# Show the inference endpoints link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/inference-experience/endpoints">inference endpoints</a> in the Studio UI</b>'.format(
            domain_id, region))
)

### Log out of Canvas UI
[Log out](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-log-out.html) of Canvas after you finished working with it.