# Lab 1: time series forecast with Amazon SageMaker Canvas
This notebook shows how to use [Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html), a no-code AutoML tool, to train models and generate predictions without writing any code.

You can train a Canvas custom model for [time series forecasting](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html). Canvas time series model training uses [SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html), which makes Autopilot's public APIs available, including `CreateAutoMLJobV2`, `ListCandidatesForAutoMLJob`, and `DescribeAutoMLJobV2`. This integration streamlines training machine learning models directly within the Canvas environment.
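
Because the training runs on Autopilot, a job can in principle be inspected with the same APIs. The following is a minimal sketch, assuming you already know the AutoML job name. The helper `summarize_automl_job` and its `job_name` argument are hypothetical and not part of Canvas. It relies on the boto3 operations `describe_auto_ml_job_v2` and `list_candidates_for_auto_ml_job`, which correspond to `DescribeAutoMLJobV2` and `ListCandidatesForAutoMLJob`.

```python
def summarize_automl_job(sm_client, job_name, max_candidates=10):
    """Return the job status and the objective metric of each candidate.

    sm_client is expected to be boto3.client("sagemaker").
    job_name is a placeholder (assumption): Canvas chooses the real job name.
    """
    job = sm_client.describe_auto_ml_job_v2(AutoMLJobName=job_name)
    response = sm_client.list_candidates_for_auto_ml_job(
        AutoMLJobName=job_name,
        MaxResults=max_candidates,
    )

    candidates = []
    for candidate in response.get("Candidates", []):
        metric = candidate.get("FinalAutoMLJobObjectiveMetric", {})
        candidates.append(
            {
                "name": candidate["CandidateName"],
                "status": candidate.get("CandidateStatus"),
                "metric": metric.get("MetricName"),
                "value": metric.get("Value"),
            }
        )
    return {"status": job["AutoMLJobStatus"], "candidates": candidates}


# Usage (requires AWS credentials and an existing AutoML job):
# import boto3
# summary = summarize_automl_job(boto3.client("sagemaker"), "<your-automl-job-name>")
```

The client is passed in as an argument so that the helper can be exercised without AWS access, for example with a stub object.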

Canvas automatically trains candidate models using these [time series forecasting algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-forecasting-algorithms.html) and then creates a model ensemble as a final model.

In this notebook you learn how to use Canvas features to explore and process data and how to train a time series model in different build modes.

<div class="alert alert-info">
While working with Canvas, you don't need to write any code or use Jupyter notebooks. This notebook only provides instructions and the code to download and preprocess a time series dataset.
</div>

## Setup notebook environment

In [57]:
import boto3
import zipfile
import sagemaker
import os
import numpy as np
import pandas as pd
import json

sagemaker.__version__

'2.224.4'

In [4]:
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name


s3_bucket = sagemaker_session.default_bucket()  # replace with an existing bucket if needed
s3_prefix = "canvas-demo-notebook" 
s3_data_path = f"s3://{s3_bucket}/{s3_prefix}/data"

In [59]:
# get domain_id and user profile name
NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"
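domain_id = None
space_name = None  # stays None if the metadata file is missing, so the check below raises a clear error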

if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        metadata = json.loads(f.read())
        domain_id = metadata.get('DomainId')
        space_name = metadata.get('SpaceName')
        print(f"SageMaker domain id: {domain_id}")

if not space_name:
    raise Exception("Cannot find the current space name. Make sure you run this notebook in JupyterLab in SageMaker Studio")
else:
    print(f"Space name: {space_name}")
    
r = boto3.client("sagemaker").describe_space(DomainId=domain_id, SpaceName=space_name)
user_profile_name = r['OwnershipSettings']['OwnerUserProfileName']

assert(user_profile_name)
print(f"User profile: {user_profile_name}")

%store domain_id
%store space_name
%store user_profile_name
%store region

SageMaker domain id: d-d345khpp52hx
Space name: sagemaker-space
User profile: studio-user-15cf2030
Stored 'domain_id' (str)
Stored 'space_name' (str)
Stored 'user_profile_name' (str)
Stored 'region' (str)


## Prepare the data

All notebooks in this workshop use the same dataset, which makes it possible to compare model metrics across different approaches.

You use the [electricity dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) from the repository of the University of California, Irvine:
> Trindade, Artur. (2015). ElectricityLoadDiagrams20112014. UCI Machine Learning Repository. https://doi.org/10.24432/C58C86.

Download the dataset from the SageMaker example S3 bucket.

### Download the dataset

In [5]:
dataset_zip_file_name = "LD2011_2014.txt.zip"
dataset_file_name = dataset_zip_file_name.split('.')[0]
s3_dataset_path = f"datasets/timeseries/uci_electricity/{dataset_zip_file_name}"

In [6]:
os.makedirs("./data", exist_ok=True)

s3_client = boto3.client("s3")
s3_client.download_file(
    f"sagemaker-example-files-prod-{region}", s3_dataset_path, f"./data/{dataset_zip_file_name}"
)

In [7]:
with zipfile.ZipFile(f"./data/{dataset_zip_file_name}", "r") as zip_ref:
    zip_ref.extractall("./data")
    # strip the trailing ".zip" to get the path of the extracted file
    dataset_path = '.'.join(zip_ref.filename.split('.')[:-1])

In [8]:
# see what is inside the file
!head -n 2 {dataset_path} 

"";"MT_001";"MT_002";"MT_003";"MT_004";"MT_005";"MT_006";"MT_007";"MT_008";"MT_009";"MT_010";"MT_011";"MT_012";"MT_013";"MT_014";"MT_015";"MT_016";"MT_017";"MT_018";"MT_019";"MT_020";"MT_021";"MT_022";"MT_023";"MT_024";"MT_025";"MT_026";"MT_027";"MT_028";"MT_029";"MT_030";"MT_031";"MT_032";"MT_033";"MT_034";"MT_035";"MT_036";"MT_037";"MT_038";"MT_039";"MT_040";"MT_041";"MT_042";"MT_043";"MT_044";"MT_045";"MT_046";"MT_047";"MT_048";"MT_049";"MT_050";"MT_051";"MT_052";"MT_053";"MT_054";"MT_055";"MT_056";"MT_057";"MT_058";"MT_059";"MT_060";"MT_061";"MT_062";"MT_063";"MT_064";"MT_065";"MT_066";"MT_067";"MT_068";"MT_069";"MT_070";"MT_071";"MT_072";"MT_073";"MT_074";"MT_075";"MT_076";"MT_077";"MT_078";"MT_079";"MT_080";"MT_081";"MT_082";"MT_083";"MT_084";"MT_085";"MT_086";"MT_087";"MT_088";"MT_089";"MT_090";"MT_091";"MT_092";"MT_093";"MT_094";"MT_095";"MT_096";"MT_097";"MT_098";"MT_099";"MT_100";"MT_101";"MT_102";"MT_103";"MT_104";"MT_105";"MT_106";"MT_107";"MT_108";"MT_109";"MT_110";"MT_111

### Preprocess data

In [9]:
# load the dataset from the file
df_raw = pd.read_csv(
    dataset_path, 
    sep=';', 
    index_col=0,
    decimal=',',
    parse_dates=True,
)

For Canvas model building you need to resample the time series to one of the supported intervals. At the time of this writing, Canvas [supports](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html) the following intervals:

- 1, 5, 15, 30 min
- 1 hour
- 1 day
- 1 week
- 1 month
- 1 year

The current time series is sampled at a 15-minute interval. To reduce the number of rows and the Canvas training time, resample the dataset to 1 hour or to 1 day. Canvas can process both resampled time series.

Consider the following model building times when selecting your aggregation or number of time series:

Dataset size|Sampling|Build type|Training time
---|---|---|---
Small - 2 time series, 70130 rows|1 hour|Quick|5 min
Small - 2 time series, 2924 rows|1 day|Quick|5 min
Small - 2 time series|1 hour|Standard|150 min
Small - 2 time series|1 day|Standard|30 min 
Full - 370 time series, 12974050 rows|1 hour|Quick|5 min
Full - 370 time series, 540940 rows|1 day|Quick|5 min
Full - 370 time series|1 hour|Standard|130 min 
Full - 370 time series|1 day|Standard|30 min 

Note that the training times are approximate and might differ in your environment.
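
The row counts in the table follow from simple arithmetic: the stacked dataset has one row per time series and per resampled period. The sketch below reproduces them with the standard library only. It assumes that the raw data starts at 2011-01-01 00:15 and ends at 2015-01-01 00:00 (an assumption about the dataset that is consistent with the row counts above); `count_long_format_rows` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def count_long_format_rows(first_ts, last_ts, interval, n_series):
    """Rows in the stacked (long) dataset: periods per series x number of series."""
    # Resampling creates one bin per interval, from the bin that contains the
    # first timestamp up to and including the bin that contains the last one.
    epoch = datetime(1970, 1, 1)  # midnight, so hourly and daily bins align to the clock
    first_bin = (first_ts - epoch) // interval
    last_bin = (last_ts - epoch) // interval
    n_periods = last_bin - first_bin + 1
    return n_periods * n_series


# Assumed first and last timestamps of the raw 15-minute data
FIRST_TS = datetime(2011, 1, 1, 0, 15)
LAST_TS = datetime(2015, 1, 1, 0, 0)

for label, interval in [("1 hour", timedelta(hours=1)), ("1 day", timedelta(days=1))]:
    for n_series in (2, 370):
        rows = count_long_format_rows(FIRST_TS, LAST_TS, interval, n_series)
        print(f"{n_series} time series, {label}: {rows} rows")
```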

In [43]:
# resample to Canvas supported interval
# select and uncomment your aggregation interval
# div is the number of 15-minute samples per interval (4 per hour, 96 per day)
# freq = "1H" 
# div = 4
freq = "1D"
div = 96

data_kw = df_raw.resample(freq).sum() / div
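
Dividing the resampled sum by the number of 15-minute samples per interval gives the average value of the interval, as long as the interval is complete. The following sketch illustrates this on synthetic data (the `MT_demo` column and the dates are made up for the example). It also shows that an interval with fewer samples, such as one at the edge of a series, is scaled down by the fixed divisor, whereas `mean()` averages only the samples that are present.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute readings: one full day (96 samples) plus one sample of the next day
idx = pd.date_range("2024-01-01 00:00", periods=97, freq="15min")
demo = pd.DataFrame({"MT_demo": np.full(len(idx), 8.0)}, index=idx)

scaled = demo.resample("1D").sum() / 96  # fixed divisor, as in the cell above
mean = demo.resample("1D").mean()        # average over the samples actually present

print(pd.concat([scaled, mean], axis=1, keys=["sum / 96", "mean"]))
```

For the complete first day both approaches agree. For the second day, which contains a single sample, the fixed divisor yields a much smaller value than the mean.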

In [44]:
data_kw

Unnamed: 0,MT_001,MT_002,MT_003,MT_004,MT_005,MT_006,MT_007,MT_008,MT_009,MT_010,...,MT_361,MT_362,MT_363,MT_364,MT_365,MT_366,MT_367,MT_368,MT_369,MT_370
2011-01-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-02,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-03,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-04,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-05,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-12-28,2.366223,22.203947,1.574718,149.242039,70.591972,209.604415,4.475221,263.082211,53.321678,49.943996,...,300.157626,34084.375000,2299.182489,2682.054924,85.098870,5.753852,478.279558,45.892460,688.912818,16179.054054
2014-12-29,2.590948,23.048542,1.674269,146.532012,74.987297,237.754216,5.735350,315.481201,69.766171,66.017025,...,300.261718,32386.458333,2154.711674,2803.030303,87.896567,12.019700,561.759219,134.529772,769.611437,18817.567568
2014-12-30,2.419099,22.974455,1.719519,148.860942,74.885671,248.759921,6.112210,321.969697,67.580857,67.596326,...,296.752320,30253.125000,2126.318565,2745.975379,110.576923,9.295153,586.817749,166.631886,770.314027,19453.828829
2014-12-31,2.392661,23.678284,1.737619,145.896850,73.158028,244.915675,7.195685,298.961841,64.703526,52.441756,...,280.945158,28633.333333,1693.301688,2248.816288,77.249022,5.522235,523.540386,137.973011,733.504399,14228.040541


In [45]:
# select two distinct random time series to include in a small dataset
# (sampling without replacement avoids picking the same column twice)
columns_to_keep = np.random.choice(data_kw.shape[1], size=2, replace=False).tolist()
columns_to_keep

[336, 89]

In [55]:
data_kw_small = data_kw.iloc[:, columns_to_keep]
data_kw_small

Unnamed: 0,MT_337,MT_090
2011-01-01,0.000000,0.000000
2011-01-02,0.000000,0.000000
2011-01-03,0.000000,0.000000
2011-01-04,0.000000,0.000000
2011-01-05,0.000000,0.000000
...,...,...
2014-12-28,117.935861,10.713916
2014-12-29,158.112540,30.545647
2014-12-30,157.305922,22.674990
2014-12-31,140.659685,10.584677


In [47]:
def stack_timeseries(df, ts_name, var_name, value_name):
    # Melt the wide DataFrame (one column per time series) into long format
    melted_df = pd.melt(
        df.reset_index(),
        id_vars='index', 
        value_vars=df.columns, 
        var_name=var_name, 
        value_name=value_name
    )
    
    # Rename the 'index' column to the name given in ts_name
    return melted_df.rename(columns={'index': ts_name})

In [48]:
stacked_df = stack_timeseries(data_kw, 'ts', 'mt_id', 'consumption')
stacked_df

Unnamed: 0,ts,mt_id,consumption
0,2011-01-01,MT_001,0.000000
1,2011-01-02,MT_001,0.000000
2,2011-01-03,MT_001,0.000000
3,2011-01-04,MT_001,0.000000
4,2011-01-05,MT_001,0.000000
...,...,...,...
540935,2014-12-28,MT_370,16179.054054
540936,2014-12-29,MT_370,18817.567568
540937,2014-12-30,MT_370,19453.828829
540938,2014-12-31,MT_370,14228.040541


In [49]:
stacked_df_small = stack_timeseries(data_kw_small, 'ts', 'mt_id', 'consumption')
stacked_df_small

Unnamed: 0,ts,mt_id,consumption
0,2011-01-01,MT_337,0.000000
1,2011-01-02,MT_337,0.000000
2,2011-01-03,MT_337,0.000000
3,2011-01-04,MT_337,0.000000
4,2011-01-05,MT_337,0.000000
...,...,...,...
2919,2014-12-28,MT_090,10.713916
2920,2014-12-29,MT_090,30.545647
2921,2014-12-30,MT_090,22.674990
2922,2014-12-31,MT_090,10.584677


In [50]:
stacked_df.to_csv(f"./data/canvas_ts_full_{freq}.csv", index=False, header=True)
stacked_df_small.to_csv(f"./data/canvas_ts_small_{freq}.csv", index=False, header=True)

### Upload to S3

In [51]:
!aws s3 rm {s3_data_path}/ --recursive

delete: s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_full_1H.csv
delete: s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_small_1H.csv


In [52]:
# upload the datasets to S3
!aws s3 cp ./data/canvas_ts_full_{freq}.csv {s3_data_path}/
!aws s3 cp ./data/canvas_ts_small_{freq}.csv {s3_data_path}/

upload: data/canvas_ts_full_1D.csv to s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_full_1D.csv
upload: data/canvas_ts_small_1D.csv to s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_small_1D.csv


In [53]:
!aws s3 ls {s3_data_path}/

2024-08-23 15:04:15   18138757 canvas_ts_full_1D.csv
2024-08-23 15:04:16      85475 canvas_ts_small_1D.csv


In [54]:
print(f"The dataset S3 path: {s3_data_path}/")

The dataset S3 path: s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/


In [82]:
print(f"""
Uploaded datasets:
Small {freq} aggregation -> {stacked_df_small.shape[0]} rows
Full {freq} aggregation -> {stacked_df.shape[0]} rows
""")


Uploaded datasets:
Small 1D aggregation -> 2924 rows
Full 1D aggregation -> 540940 rows



## Train models in Amazon SageMaker Canvas



### Start Canvas

Log in to SageMaker Canvas from SageMaker Studio by following [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-getting-started-step1).

### Experiment 1: load and explore time series

[Import data into Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-importing-data.html)

[Import tabular data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html#canvas-import-dataset-tabular)

As the data source, choose Amazon S3 and use the S3 bucket and path printed in the **Upload to S3** section above.

You must import at least one dataset: the small one, `canvas_ts_small_{freq}.csv` (for example, `canvas_ts_small_1D.csv` for daily aggregation). Optionally, repeat the import operation to create a second dataset in Canvas from the full time series file, `canvas_ts_full_{freq}.csv`.

If you created two datasets, you will have them in the dataset list in Canvas. Note the size of the datasets:

![](../img/canvas-datasets.png)


[Prepare data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-data-prep.html)

[Create a Data Flow](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-data-flow.html)


#### Visualizations

[Seasonal trend decomposition in time series data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html#canvas-seasonal-trend-decomposition)

![](../img/canvas-analysis-config.png)

[Detect anomalies in time series data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html#canvas-time-series-anomaly-detection)



#### Transformations

[Transform Time Series](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-transform.html#canvas-transform-time-series)

[Chat for data prep](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-chat-for-data-prep.html)

#### Export data

[Process data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-data-processing.html)

- create a model with the prepared data
- export the dataset to Canvas or to Amazon S3
- export data flow as Python code in a Jupyter notebook

### Experiment 1: create and configure a model

[Build a custom model](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html)

[Time Series Forecasts in Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-time-series.html)

Limit|Time series forecasting
---|---
Quick build time|2-20 min
Standard build time|2-4 hours
Downsample size|30GB
Min number of rows for Quick build|N/A
Min number of rows for Standard build|50
Max number of rows for Quick build|N/A
Max number of rows for Standard build|150K
Max number of columns|1K

Navigate to the **Datasets** view, choose the small dataset you've just imported and choose **Create a model**:

![](../img/canvas-create-model.png)


Configure your model on the model page:

![](../img/canvas-configure-model-config.png)

- Select the `consumption` column as **Target column**
- Select **Configure model**:
    - Choose the correct column names: `mt_id` as **Item ID column** and `ts` as **time stamp column**
    - Enter 30 hours for the prediction length if you use 1h sampling, or 7 days if you use 1d sampling. **Note that for a Quick build the maximum prediction length you can choose is 30 hours**
    - Choose Avg. wQL as the **Objective metric**
    - Choose algorithms in the **Algorithms** view
    - Keep the default `0.10, 0.50, 0.90` forecast quantiles

Refer to the documentation on [Advanced time series forecasting model settings](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-advanced-settings.html#canvas-advanced-settings-time-series) for more details on algorithms and settings.
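
For orientation, the settings above have counterparts in the Autopilot `CreateAutoMLJobV2` request. The sketch below builds the time series part of such a request as a plain dictionary. Whether Canvas sends exactly this payload is an assumption and is not confirmed by this notebook; `build_forecasting_config` is a hypothetical helper, and the column names are the ones used in the dataset prepared earlier.

```python
def build_forecasting_config(frequency, horizon, quantiles=("p10", "p50", "p90")):
    """Build the AutoMLProblemTypeConfig for a time series forecasting job."""
    return {
        "TimeSeriesForecastingJobConfig": {
            "ForecastFrequency": frequency,        # for example "1H" or "1D"
            "ForecastHorizon": horizon,            # number of periods to predict
            "ForecastQuantiles": list(quantiles),  # counterpart of 0.10, 0.50, 0.90
            "TimeSeriesConfig": {
                "TargetAttributeName": "consumption",
                "TimestampAttributeName": "ts",
                "ItemIdentifierAttributeName": "mt_id",
            },
        }
    }


config = build_forecasting_config(frequency="1D", horizon=7)

# Usage sketch (requires AWS credentials; all values in <...> are placeholders):
# import boto3
# boto3.client("sagemaker").create_auto_ml_job_v2(
#     AutoMLJobName="<job-name>",
#     AutoMLJobInputDataConfig=[{
#         "ChannelType": "training",
#         "ContentType": "text/csv;header=present",
#         "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "<s3-data-uri>"}},
#     }],
#     OutputDataConfig={"S3OutputPath": "<s3-output-uri>"},
#     AutoMLProblemTypeConfig=config,
#     AutoMLJobObjective={"MetricName": "AverageWeightedQuantileLoss"},
#     RoleArn="<execution-role-arn>",
# )
```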

### Experiment 2: quick build mode

If you choose to do a **Quick build** on a dataset with more than 50,000 rows, then Canvas samples your data down to 50,000 rows for a shorter model training time.



#### Analyze

[Evaluate time series forecasting models](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-scoring.html#canvas-scoring-time-series)

Here is a screenshot of the model metrics for the small dataset with 1h sampling:
![](../img/canvas-analyze-quick-build.png)

#### Predict

[Make predictions for your data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-make-predictions.html)

![](../img/canvas-predictions.png)

To generate predictions after you build a model in Canvas, Canvas automatically deploys an asynchronous SageMaker endpoint into your AWS account. Canvas uses the endpoint to generate a single prediction. For batch predictions, Canvas starts a SageMaker batch transform job. The endpoint deployed by Canvas can be used only for in-app predictions and cannot be used outside Canvas.

You can see these inference endpoints in the Studio UI by opening the link constructed by the code cell below.

In [87]:
from IPython.display import HTML, display

# Show the inference endpoints link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/inference-experience/endpoints">inference endpoints</a> in the Studio UI</b>'.format(
            domain_id, region))
)

### Experiment 3: standard build mode

In standard build mode, Canvas trains the [six built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-forecasting-algorithms.html) on your target time series. Then, using a stacking ensemble method, it combines these model candidates to create an optimal forecasting model for a given objective metric. A standard build usually takes about 1-4 hours.

To start a standard build, you need to create a new model version. Refer to the documentation [Adding model versions in Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-update-model.html) for more details on model versioning in Canvas.

To create a new model version navigate to the **My Models** view and choose **View** on your model. Create a new version for the standard build:

![](../img/canvas-model-new-version.png)

In the **Build** view select **Configure model** and enter 168 hours for prediction length if you use 1h sampling or 7 days if you use 1d sampling. 
    
Now select **Standard build**. Canvas starts the model building. You can navigate away from that page.

#### Analyze

![](../img/canvas-build-comparison.png)

![](../img/canvas-analyze-standard-build.png)



Artifacts

[Model leaderboard](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-evaluate-model-candidates.html)

![](../img/canvas-model-leaderboard.png)

#### Predict

As with the quick build, Canvas automatically deploys an asynchronous SageMaker endpoint to your AWS account. This endpoint cannot be used outside Canvas. If you'd like to deploy a SageMaker endpoint hosting this model, you can use the Canvas **Deploy** feature.

### Experiment 4: register a model in the Model Registry

[Register a model version in the SageMaker model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-register-model.html)


To keep your model in a central model registry, you can register the model directly from the Canvas UI:

![](../img/canvas-add-to-model-registry.png)

Canvas also shows the model package details:

![](../img/canvas-model-registry-details.png)

You can now use the "Model package group name" to access a model version through the boto3 API.
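
The following is a minimal sketch of such a lookup, assuming you copied the model package group name from the Canvas UI. The helper `latest_model_package` is hypothetical; it uses the boto3 operation `list_model_packages` and returns the most recently created version in the group.

```python
def latest_model_package(sm_client, model_package_group_name):
    """Return basic details of the newest model package in a group, or None if the group is empty.

    sm_client is expected to be boto3.client("sagemaker").
    """
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    packages = response.get("ModelPackageSummaryList", [])
    if not packages:
        return None

    latest = packages[0]
    return {
        "arn": latest["ModelPackageArn"],
        "version": latest.get("ModelPackageVersion"),
        "approval_status": latest.get("ModelApprovalStatus"),
    }


# Usage (requires AWS credentials; the group name is a placeholder):
# import boto3
# print(latest_model_package(boto3.client("sagemaker"), "<model-package-group-name>"))
```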

In [61]:
# Show the model registry link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/models/registered-models">model registry</a> in the Studio UI</b>'.format(
            domain_id, region))
)

### Experiment 5: deploy the trained model as a SageMaker endpoint

[Deploy your models to an endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-deploy-model.html)

![](../img/canvas-model-deploy.png)

You can also [view all deployments](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-deploy-model.html#canvas-deploy-model-view) of the model in the Canvas UI on the model version page or in **MLOps** view.

In [85]:
# Show the inference endpoints link
display(
    HTML('<b>See <a target="top" href="https://studio-{}.studio.{}.sagemaker.aws/inference-experience/endpoints">inference endpoints</a> in the Studio UI</b>'.format(
            domain_id, region))
)

## Compare Canvas model performance

Canvas calculates these advanced [metrics for time series forecasts](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-metrics.html#canvas-time-series-forecast-metrics) for each model training run, for both quick and standard builds. Refer also to Amazon SageMaker Autopilot [time series metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-objective-metric.html) for more details.
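
To build intuition for two of these metrics, the sketch below computes WAPE and the average weighted quantile loss (Avg. wQL) on a tiny made-up example. It uses a common definition of the weighted quantile loss, the one given in the Amazon Forecast documentation; treat it as an illustration under that assumption, not as the exact implementation used by Canvas. All function names and numbers are invented for the example.

```python
def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """wQL for one quantile tau: 2 * sum(pinball loss) / sum(|actual|)."""
    denominator = sum(abs(y) for y in actuals)
    if denominator == 0:
        raise ValueError("wQL is undefined when all actual values are zero")
    numerator = sum(
        tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
        for y, q in zip(actuals, quantile_forecasts)
    )
    return 2.0 * numerator / denominator


def average_wql(actuals, forecasts_by_quantile):
    """Mean of the wQL values over all quantiles, e.g. 0.10, 0.50 and 0.90."""
    losses = [
        weighted_quantile_loss(actuals, forecasts, tau)
        for tau, forecasts in forecasts_by_quantile.items()
    ]
    return sum(losses) / len(losses)


def wape(actuals, point_forecasts):
    """Weighted absolute percentage error: sum(|error|) / sum(|actual|)."""
    denominator = sum(abs(y) for y in actuals)
    if denominator == 0:
        raise ValueError("WAPE is undefined when all actual values are zero")
    return sum(abs(y - f) for y, f in zip(actuals, point_forecasts)) / denominator


# Made-up values for three time steps
actuals = [10.0, 20.0, 30.0]
forecasts = {
    0.10: [8.0, 15.0, 25.0],
    0.50: [12.0, 18.0, 30.0],
    0.90: [15.0, 24.0, 36.0],
}
print("Avg. wQL:", average_wql(actuals, forecasts))
print("WAPE of the median forecast:", wape(actuals, forecasts[0.50]))
```

With this definition, the wQL at the 0.50 quantile equals the WAPE of the median forecast, which is one way to check the implementation.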

The following table summarizes build metrics for different datasets. Note that your metrics might differ, especially for the small datasets, because the time series in the small datasets are randomly selected from the full set of 370 time series.

Dataset|Build|WAPE|Avg. wQL|MAPE|MASE|RMSE
---|---|---|---|---|---|---
Small 1h|Standard|0.180|**0.131**|0.320|0.402|36.419
Small 1h|Quick|0.170|**0.138**|0.320|1.683|37.239
Small 1d|Standard|0.226|**0.176**|24.413|0.737|71.674
Small 1d|Quick|0.223|**0.198**|25.069|3.509|71.550
Full 1h|Standard|0.161|**0.121**|0.611|3.007|525.860
Full 1h|Quick|0.170|**0.122**|2.063|1.438|166.361
Full 1d|Standard|0.247|**0.195**|29.616|5.478|743.551
Full 1d|Quick|0.188|**0.168**|48.062|3.718|218.941


---

## Clean up

### Delete unused inference endpoints
If you deployed any Canvas model to a SageMaker endpoint, delete the deployed endpoints to avoid incurring costs.

Run the code cells below to see which endpoints exist in your AWS account.

In [83]:
endpoints = boto3.client("sagemaker").list_endpoints()

In [84]:
[f"{e['EndpointName']} -> {e['EndpointStatus']}" for e in endpoints['Endpoints'] if 'async' not in e['EndpointName']]

['canvas-new-deployment-08-23-2024-10-41-PM -> InService']
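
To remove the endpoints you no longer need, you can call the boto3 operation `delete_endpoint` for each of them. The following is a minimal sketch; the helper `delete_endpoints` and its `dry_run` switch are hypothetical and default to printing only, so that nothing is deleted by accident. Note that deleting an endpoint does not delete its endpoint configuration or model.

```python
def delete_endpoints(sm_client, endpoint_names, dry_run=True):
    """Delete the given SageMaker endpoints and return the names that were deleted.

    sm_client is expected to be boto3.client("sagemaker").
    With dry_run=True the function only prints what it would delete.
    """
    deleted = []
    for name in endpoint_names:
        if dry_run:
            print(f"Would delete endpoint: {name}")
            continue
        sm_client.delete_endpoint(EndpointName=name)
        print(f"Deleted endpoint: {name}")
        deleted.append(name)
    return deleted


# Usage (requires AWS credentials; review the list before setting dry_run=False):
# import boto3
# sm_client = boto3.client("sagemaker")
# names = [e["EndpointName"] for e in sm_client.list_endpoints()["Endpoints"]
#          if "async" not in e["EndpointName"]]
# delete_endpoints(sm_client, names, dry_run=True)
```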

### Log out of Canvas UI
[Log out](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-log-out.html) of Canvas after you finish working with it.