# Lab 1: time series forecast with Amazon SageMaker Canvas
This notebook shows how to use [Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html), a no-code AutoML service, to train models and generate predictions without writing any code.

You can train a Canvas custom model for time series forecasting. Canvas time series model training uses [SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html), so you can work with the training jobs through Autopilot's public APIs, including operations such as `CreateAutoMLJobV2`, `ListCandidatesForAutoMLJob`, and `DescribeAutoMLJobV2`. This integration lets you train machine learning models directly within the Canvas environment.
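
As a rough sketch of how those Autopilot operations could be used from a notebook, the helper below calls `DescribeAutoMLJobV2` and `ListCandidatesForAutoMLJob` through boto3 and collects the job status and candidate metrics. The function name `summarize_automl_job` and the layout of the returned dictionary are hypothetical. The sketch assumes you already know the name of the Autopilot job behind your Canvas build, and it reads only the first page of candidates.

```python
from typing import Any, Dict, List


def summarize_automl_job(sm_client: Any, job_name: str) -> Dict[str, Any]:
    """Return the status and candidate metrics of an Autopilot job.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    `job_name` is an assumption: the name of the job created for your build.
    """
    job = sm_client.describe_auto_ml_job_v2(AutoMLJobName=job_name)
    # Only the first page of results is read; follow NextToken for more.
    response = sm_client.list_candidates_for_auto_ml_job(AutoMLJobName=job_name)

    candidates: List[Dict[str, Any]] = []
    for candidate in response.get("Candidates", []):
        metric = candidate.get("FinalAutoMLJobObjectiveMetric", {})
        candidates.append(
            {
                "name": candidate["CandidateName"],
                "status": candidate.get("CandidateStatus"),
                "metric": metric.get("MetricName"),
                "value": metric.get("Value"),
            }
        )

    return {
        "status": job["AutoMLJobStatus"],
        "best_candidate": job.get("BestCandidate", {}).get("CandidateName"),
        "candidates": candidates,
    }
```

To try it against your account, pass a real client, for example `summarize_automl_job(boto3.client("sagemaker"), job_name)`, where `job_name` is a value you supply.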

Canvas automatically trains candidate models using these [time series forecasting algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-forecasting-algorithms.html) and then creates a model ensemble as a final model.

In this notebook you learn how to use Canvas features to explore and process data and how to train a time series model in different build modes.

<div class="alert alert-info">
While working with Canvas, you don't need to write any code or use Jupyter notebooks. This notebook is only for instructions and to download a time series dataset.
</div>

## Setup notebook environment

In [34]:
import boto3
import zipfile
import sagemaker
import os
import numpy as np
import pandas as pd

sagemaker.__version__

'2.229.0'

In [8]:
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name


s3_bucket = sagemaker_session.default_bucket()  # replace with an existing bucket if needed
s3_prefix = "canvas-demo-notebook" 
s3_data_path = f"s3://{s3_bucket}/{s3_prefix}/data"

## Prepare the data

All notebooks in this workshop use the same dataset, which makes it possible to compare model metrics across different approaches.

You use the [electricity dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) from the repository of the University of California, Irvine:
> Trindade, Artur. (2015). ElectricityLoadDiagrams20112014. UCI Machine Learning Repository. https://doi.org/10.24432/C58C86.

Download the dataset from the SageMaker example S3 bucket.

### Download the dataset

In [28]:
dataset_zip_file_name = "LD2011_2014.txt.zip"
dataset_file_name = dataset_zip_file_name.split('.')[0]
s3_dataset_path = f"datasets/timeseries/uci_electricity/{dataset_zip_file_name}"

In [10]:
os.makedirs("./data", exist_ok=True)

s3_client = boto3.client("s3")
s3_client.download_file(
    f"sagemaker-example-files-prod-{region}", s3_dataset_path, f"./data/{dataset_zip_file_name}"
)

In [11]:
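# extract the archive; the context manager closes it even if extraction fails
with zipfile.ZipFile(f"./data/{dataset_zip_file_name}", "r") as zip_ref:
    zip_ref.extractall("./data")

# drop the trailing ".zip" to get the path of the extracted text file
dataset_path = os.path.splitext(zip_ref.filename)[0]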

In [15]:
# see what is inside the file
!head -n 2 {dataset_path} 

"";"MT_001";"MT_002";"MT_003";"MT_004";"MT_005";"MT_006";"MT_007";"MT_008";"MT_009";"MT_010";"MT_011";"MT_012";"MT_013";"MT_014";"MT_015";"MT_016";"MT_017";"MT_018";"MT_019";"MT_020";"MT_021";"MT_022";"MT_023";"MT_024";"MT_025";"MT_026";"MT_027";"MT_028";"MT_029";"MT_030";"MT_031";"MT_032";"MT_033";"MT_034";"MT_035";"MT_036";"MT_037";"MT_038";"MT_039";"MT_040";"MT_041";"MT_042";"MT_043";"MT_044";"MT_045";"MT_046";"MT_047";"MT_048";"MT_049";"MT_050";"MT_051";"MT_052";"MT_053";"MT_054";"MT_055";"MT_056";"MT_057";"MT_058";"MT_059";"MT_060";"MT_061";"MT_062";"MT_063";"MT_064";"MT_065";"MT_066";"MT_067";"MT_068";"MT_069";"MT_070";"MT_071";"MT_072";"MT_073";"MT_074";"MT_075";"MT_076";"MT_077";"MT_078";"MT_079";"MT_080";"MT_081";"MT_082";"MT_083";"MT_084";"MT_085";"MT_086";"MT_087";"MT_088";"MT_089";"MT_090";"MT_091";"MT_092";"MT_093";"MT_094";"MT_095";"MT_096";"MT_097";"MT_098";"MT_099";"MT_100";"MT_101";"MT_102";"MT_103";"MT_104";"MT_105";"MT_106";"MT_107";"MT_108";"MT_109";"MT_110";"MT_111

### Preprocess data

In [35]:
df_raw = pd.read_csv(
    dataset_path, 
    sep=';', 
    index_col=0,
    decimal=',',
    parse_dates=True,
)

In [37]:
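# resample to 2h intervals
# the raw readings are at 15-minute resolution, so each 2-hour bucket holds 8 values;
# dividing the sum by 8 gives the average reading over the bucket
freq = "2H"
data_kw = df_raw.resample(freq).sum() / 8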

In [38]:
data_kw

Unnamed: 0,MT_001,MT_002,MT_003,MT_004,MT_005,MT_006,MT_007,MT_008,MT_009,MT_010,...,MT_361,MT_362,MT_363,MT_364,MT_365,MT_366,MT_367,MT_368,MT_369,MT_370
2011-01-01 00:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 02:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 04:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 06:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2011-01-01 08:00:00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-12-31 16:00:00,2.379442,28.538407,1.954822,160.569106,67.225610,260.416667,7.560769,352.693603,75.830420,53.897849,...,439.864383,40675.0,2816.983122,3980.113636,127.933507,7.753072,504.938543,118.948247,703.445748,12033.783784
2014-12-31 18:00:00,2.220812,28.449502,2.172024,208.587398,91.310976,385.788690,12.719050,359.427609,95.498252,84.408602,...,425.053533,43300.0,2793.776371,3454.545455,105.606258,5.266238,500.877963,47.787980,706.011730,9554.054054
2014-12-31 20:00:00,2.379442,24.004267,1.737619,174.288618,89.176829,330.357143,11.447145,292.508418,84.134615,75.806452,...,330.389008,39137.5,1676.160338,1857.954545,74.967405,4.169105,420.654083,131.886477,675.219941,8344.594595
2014-12-31 22:00:00,2.062183,21.692745,1.737619,161.331301,85.365854,311.383929,11.023177,251.262626,68.181818,72.446237,...,289.079229,31775.0,1591.244726,1303.977273,46.284224,7.533645,665.605795,178.422371,669.263196,7263.513514


In [40]:
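# pick two distinct column positions at random (replace=False avoids selecting the same meter twice)
columns_to_keep = np.random.choice(data_kw.shape[1], size=2, replace=False).tolist()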
columns_to_keep

[332, 88]

In [41]:
data_kw_small = data_kw.iloc[:, columns_to_keep]
data_kw_small

Unnamed: 0,MT_333,MT_089
2011-01-01 00:00:00,0.000000,0.000000
2011-01-01 02:00:00,0.000000,0.000000
2011-01-01 04:00:00,0.000000,0.000000
2011-01-01 06:00:00,0.000000,0.000000
2011-01-01 08:00:00,0.000000,0.000000
...,...,...
2014-12-31 16:00:00,120.360825,263.888889
2014-12-31 18:00:00,150.128866,346.891534
2014-12-31 20:00:00,144.201031,346.560847
2014-12-31 22:00:00,71.649485,323.743386


In [44]:
def stack_timeseries(df, ts_name, var_name, value_name):
    # Melt the wide DataFrame (one column per meter) into long format (one row per timestamp and meter)
    melted_df = pd.melt(
        df.reset_index(),
        id_vars='index', 
        value_vars=df.columns, 
        var_name=var_name, 
        value_name=value_name
    )
    
    # Rename the 'index' column to the name passed in ts_name
    return melted_df.rename(columns={'index': ts_name})

In [47]:
stacked_df = stack_timeseries(data_kw, 'ts', 'mt_id', 'consumption')
stacked_df

Unnamed: 0,ts,mt_id,consumption
0,2011-01-01 00:00:00,MT_001,0.000000
1,2011-01-01 02:00:00,MT_001,0.000000
2,2011-01-01 04:00:00,MT_001,0.000000
3,2011-01-01 06:00:00,MT_001,0.000000
4,2011-01-01 08:00:00,MT_001,0.000000
...,...,...,...
6487205,2014-12-31 16:00:00,MT_370,12033.783784
6487206,2014-12-31 18:00:00,MT_370,9554.054054
6487207,2014-12-31 20:00:00,MT_370,8344.594595
6487208,2014-12-31 22:00:00,MT_370,7263.513514


In [48]:
stacked_df_small = stack_timeseries(data_kw_small, 'ts', 'mt_id', 'consumption')
stacked_df_small

Unnamed: 0,ts,mt_id,consumption
0,2011-01-01 00:00:00,MT_333,0.000000
1,2011-01-01 02:00:00,MT_333,0.000000
2,2011-01-01 04:00:00,MT_333,0.000000
3,2011-01-01 06:00:00,MT_333,0.000000
4,2011-01-01 08:00:00,MT_333,0.000000
...,...,...,...
35061,2014-12-31 16:00:00,MT_089,263.888889
35062,2014-12-31 18:00:00,MT_089,346.891534
35063,2014-12-31 20:00:00,MT_089,346.560847
35064,2014-12-31 22:00:00,MT_089,323.743386


In [61]:
stacked_df.to_csv('./data/canvas_ts_full.csv', index=False, header=True)
stacked_df_small.to_csv('./data/canvas_ts_small.csv', index=False, header=True)

### Upload to S3

In [62]:
!aws s3 rm {s3_data_path}/ --recursive

delete: s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_small.csv


In [63]:
# upload the datasets to S3
!aws s3 cp ./data/canvas_ts_full.csv {s3_data_path}/
!aws s3 cp ./data/canvas_ts_small.csv {s3_data_path}/

upload: data/canvas_ts_full.csv to s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_full.csv
upload: data/canvas_ts_small.csv to s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/canvas_ts_small.csv


In [64]:
!aws s3 ls {s3_data_path}/

2024-08-22 20:22:27  274368348 canvas_ts_full.csv
2024-08-22 20:22:30    1465531 canvas_ts_small.csv


In [57]:
print(f"The dataset S3 path: {s3_data_path}/")

The dataset S3 path: s3://sagemaker-us-east-1-906545278380/canvas-demo-notebook/data/


## Train models in Amazon SageMaker Canvas



### Start Canvas

Log in to SageMaker Canvas from SageMaker Studio by following [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-getting-started-step1).

### Experiment 1: load and explore time series

[Import data into Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-importing-data.html)

[Import tabular data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-import-dataset.html#canvas-import-dataset-tabular)

As the data source, choose Amazon S3 and use the S3 bucket and path printed by the last code cell in the "Upload to S3" section.
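
If you need the bucket name and the folder separately, the printed S3 path can be split with the standard library. This is a minimal sketch; the helper name `split_s3_uri` and the bucket name in the usage line are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The path starts with "/", which is not part of the object key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a made-up bucket name:
bucket, prefix = split_s3_uri("s3://my-example-bucket/canvas-demo-notebook/data/")
print(bucket)  # my-example-bucket
print(prefix)  # canvas-demo-notebook/data/
```

In this notebook you would pass `s3_data_path` instead of the literal string.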

You must import at least one dataset, the small one (`canvas_ts_small.csv`). Optionally, repeat the import to create a second Canvas dataset from the full time series file, `canvas_ts_full.csv`.

If you created two datasets, both appear in the dataset list in Canvas. Note the size of each dataset:

![](../img/canvas-datasets.png)


[Prepare data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-data-prep.html)

[Create a Data Flow](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-data-flow.html)


#### Visualizations

[Detect anomalies in time series data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html#canvas-time-series-anomaly-detection)

[Seasonal trend decomposition in time series data](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-analyses.html#canvas-seasonal-trend-decomposition)

#### Transformations

[Transform Time Series](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-transform.html#canvas-transform-time-series)

[Chat for data prep](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-chat-for-data-prep.html)

### Experiment 2: quick build mode

Quick build is only available for datasets with fewer than 50,000 rows.
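
To check a file against this limit before importing it, you can count its data rows locally. The sketch below uses only the standard library. The names `count_data_rows` and `quick_build_available` are hypothetical, and the constant simply restates the row limit given above.

```python
import csv

QUICK_BUILD_ROW_LIMIT = 50_000  # row limit for quick build, as stated in this notebook


def count_data_rows(csv_path: str) -> int:
    """Count the rows of a CSV file, excluding the header line."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row if the file is not empty
        return sum(1 for _ in reader)


def quick_build_available(csv_path: str, limit: int = QUICK_BUILD_ROW_LIMIT) -> bool:
    """Return True if the file has fewer data rows than the limit."""
    return count_data_rows(csv_path) < limit
```

For example, `quick_build_available("./data/canvas_ts_small.csv")` checks the small dataset. In the run shown in this notebook, the small dataset has 35,065 rows (the last index is 35064) and the full dataset has 6,487,209 rows, so only the small one falls under the limit.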

![](img/canvas-model-overview.png)

In quick build mode Canvas trains one model. The model and metrics are captured in SageMaker Experiments.

See Experiments in Studio Classic:
![](img/experiments-quick-builld.png)

Predict:

![](img/canvas-predictions.png)

To generate predictions after you build a model, Canvas automatically deploys an asynchronous SageMaker endpoint into your AWS account. The endpoint is temporary, and Canvas uses it to generate single predictions. For batch predictions, Canvas starts a SageMaker batch transform job. The endpoint deployed by Canvas serves in-app predictions only and cannot be used outside Canvas.

### Experiment 3: standard build mode

Customizing model build:

![](img/canvas-configure-model-config.png)
![](img/canvas-configure-model-metric.png)
![](img/canvas-configure-model-quantiles.png)

In standard build mode, Canvas trains the [six built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/timeseries-forecasting-algorithms.html) on your target time series. Then, using a stacking ensemble method, it combines these model candidates into a forecasting model optimized for the given objective metric.

All models and metrics are captured in SageMaker Experiments.

See Experiments in Studio Classic:

![](img/experiments-standard-builld.png)

Predict:

![](img/canvas-predictions-standard-build.png)

### Experiment 4: share a model with SageMaker Studio

### Experiment 5: deploy the trained model as a SageMaker endpoint

To keep your model in a central model registry, you can register the model directly from the Canvas UI:

![](img/canvas-add-to-model-registry.png)

Canvas also shows the model package details:

![](img/model-registry-details.png)

You can now use the "Model package group name" to access a model version through the boto3 API.
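
A minimal sketch of that lookup, assuming a boto3 SageMaker client and the group name shown in Canvas. The helper name `latest_model_package` is hypothetical. It uses the `ListModelPackages` and `DescribeModelPackage` operations to fetch the most recently created version in the group.

```python
from typing import Any, Dict, Optional


def latest_model_package(sm_client: Any, group_name: str) -> Optional[Dict[str, Any]]:
    """Describe the newest model package version in a model package group.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    Returns None if the group has no versions.
    """
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    if not summaries:
        return None
    # DescribeModelPackage accepts the ARN of a versioned model package.
    return sm_client.describe_model_package(
        ModelPackageName=summaries[0]["ModelPackageArn"]
    )
```

Call it as `latest_model_package(boto3.client("sagemaker"), group_name)`, where `group_name` is the value you copied from the model package details. The response includes fields such as `ModelPackageVersion` and `ModelApprovalStatus`.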

---

## Clean up