# Lab 6: results

In this notebook you analyze, visualize, and compare the model performance metrics that you captured in the previous experiments with different time series forecasting algorithms.

## Import packages

In [1]:
%pip install -q pandera dask[dataframe] seaborn pyarrow

Note: you may need to restart the kernel to use updated packages.


In [2]:
import boto3
import sagemaker
import os
import numpy as np
import pandas as pd
import json
import pandera as pa
from pandas.errors import EmptyDataError
from pandera import DataFrameSchema, Column, Check
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import (
interact, interactive, fixed, interact_manual,
Checkbox, Dropdown, SelectMultiple, Layout
)

%matplotlib inline

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Load reports files

Now load all metric reports you saved in each notebook of this workshop.

In [19]:
# define data schema for validation
report_schema = DataFrameSchema({
    "timestamp": Column(np.datetime64, coerce=True),
    "metric_name": Column(str),
    "value": Column(float, Check(lambda x: x >= 0)),
    "experiment": Column(str)
})

The code below lists the files in the `model-performance` folder:

In [20]:
model_performance_dir = 'model-performance'

In [21]:
# list all files saved in the {model_performance_dir} folder
!ls -p {model_performance_dir}*/*

 model-performance/autogluon-2h-2-17533-20241003-120222.csv
 model-performance/autogluon-2h-370-17533-20240930-084718.csv
'model-performance/autogluon-Chronos[base]-2h-2-17533-20241003-130100.csv'
'model-performance/autogluon-Chronos[base]-2h-2-17533-20241003-130326.csv'
'model-performance/autogluon-Chronos[base]-2h-2-17533-bt4-20241003-130110.csv'
'model-performance/autogluon-Chronos[base]-2h-2-17533-bt4-20241003-130334.csv'
'model-performance/autogluon-Chronos[base]-2h-370-17533-20240930-114821.csv'
'model-performance/autogluon-Chronos[base]-2h-370-17533-bt4-20240930-122110.csv'
 model-performance/autopilot-2h-370-17533-20240930-085028.csv
 model-performance/autopilot-full-2h-370-17533-20241001-091815.csv
 model-performance/canvas-1H-370-35065-20240930-104508.csv
 model-performance/chronos-2h-370-17533-bt4-20240930-110123.csv
 model-performance/chronos-2h-370-17533-off10-20240930-110123.csv
 model-performance/deepar-2h-370-17533-20240930-104323.csv
 model-performance/gluonts-1h-10-87

In [22]:
# some algorithms use different names for same metrics, map all alternative names to one
metric_name_map = {
    'WQL': 'WQL',
    'MAPE':'MAPE',
    'sMAPE':'sMAPE',
    'WAPE':'WAPE',
    'MASE':'MASE',
    'RMSE':'RMSE',
    'NRMSE':'NRMSE',
    'MSE':'MSE',
    'MSIS':'MSIS',
    'AverageWeightedQuantileLoss':'WQL',
    'mean_wQuantileLoss':'WQL',
    'test:mean_wQuantileLoss':'WQL',
    'test:MAPE':'MAPE',
    'test:WAPE':'WAPE',
    'test:MASE':'MASE',
    'test:RMSE':'RMSE',
}

def convert_metric(df):
    # Convert 'metric_name' column and drop rows with unknown values
    df['metric_name'] = df['metric_name'].map(metric_name_map)
    return df.dropna(subset=['metric_name'])
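
As a minimal sketch of what this normalization does, the example below applies the same map-then-drop pattern to a small hand-made data frame. The sample rows and the reduced `name_map` are illustrative assumptions, not data from the workshop.

```python
import pandas as pd

# Reduced, illustrative copy of the alias map (assumption: a subset of the full map)
name_map = {
    "WQL": "WQL",
    "mean_wQuantileLoss": "WQL",
    "test:MAPE": "MAPE",
}

# Hypothetical report rows; 'Coverage[0.5]' is not in the map and should be dropped
sample = pd.DataFrame({
    "metric_name": ["mean_wQuantileLoss", "test:MAPE", "Coverage[0.5]"],
    "value": [0.12, 0.35, 0.48],
})

# Series.map returns NaN for keys missing from the dict,
# so dropna removes the rows with unknown metric names
normalized = sample.assign(metric_name=sample["metric_name"].map(name_map))
normalized = normalized.dropna(subset=["metric_name"])

print(normalized["metric_name"].tolist())
```

Unlike `convert_metric`, this sketch uses `assign` to return a new data frame rather than modifying the input in place; either approach yields the same rows.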

In [23]:
# load, schema-validate, and convert metric names for each report in the folder
def read_report_csv_files(directory):
    dfs = []
    skipped_files = []

    for f in os.listdir(directory):
        if f.endswith('.csv'):
            file_path = os.path.join(directory, f)
            try:
                # Read the CSV file
                df = pd.read_csv(file_path)

                # Validate the schema
                report_schema.validate(df)

                # Apply metric conversion and drop rows with unknown metrics
                df = convert_metric(df)

                # If the DataFrame is not empty after conversion, add it to the list
                if not df.empty:
                    dfs.append(df)
                    print(f"Successfully read, validated, and converted: {f}")
                else:
                    print(f"Warning: All rows skipped due to unknown metrics in file: {f}")
                    skipped_files.append(f)

            except EmptyDataError:
                print(f"Error: Empty CSV file: {f}")
                skipped_files.append(f)
            except pd.errors.ParserError:
                print(f"Error: Unable to parse CSV file: {f}")
                skipped_files.append(f)
            except pa.errors.SchemaError as e:
                print(f"Schema validation failed for file {f}: {str(e)}")
                skipped_files.append(f)
            except Exception as e:
                print(f"An error occurred while processing file {f}: {str(e)}")
                skipped_files.append(f)

    return dfs, skipped_files

In [24]:
# Load reports
reports, skipped = read_report_csv_files(f'./{model_performance_dir}')

Successfully read, validated, and converted: deepar-2h-370-17533-20240930-104323.csv
Successfully read, validated, and converted: autopilot-2h-370-17533-20240930-085028.csv
Successfully read, validated, and converted: canvas-1H-370-35065-20240930-104508.csv
Successfully read, validated, and converted: autogluon-2h-370-17533-20240930-084718.csv
Successfully read, validated, and converted: chronos-2h-370-17533-off10-20240930-110123.csv
Successfully read, validated, and converted: chronos-2h-370-17533-bt4-20240930-110123.csv
Successfully read, validated, and converted: autogluon-Chronos[base]-2h-370-17533-20240930-114821.csv
Successfully read, validated, and converted: autogluon-Chronos[base]-2h-370-17533-bt4-20240930-122110.csv
Successfully read, validated, and converted: autopilot-full-2h-370-17533-20241001-091815.csv
Successfully read, validated, and converted: autogluon-2h-2-17533-20241003-120222.csv
Successfully read, validated, and converted: autogluon-Chronos[base]-2h-2-17533-20241

All reports are now in a list of `pandas.DataFrame` objects. Concatenate them into a single data frame and set a composite index for easier handling:

In [25]:
metric_df = pd.concat(reports).set_index(['experiment', 'timestamp']).sort_index()
metric_df

| experiment | timestamp | metric_name | value |
|---|---|---|---|
| autogluon-2h-2-17533 | 20241003-120222 | WQL | 0.126380 |
| autogluon-2h-2-17533 | 20241003-120222 | MAPE | 0.347289 |
| autogluon-2h-2-17533 | 20241003-120222 | WAPE | 0.176529 |
| autogluon-2h-2-17533 | 20241003-120222 | RMSE | 371.928233 |
| autogluon-2h-2-17533 | 20241003-120222 | MASE | 2.713621 |
| ... | ... | ... | ... |
| gluonts-TemporalFusionTransformer-1h-10-8736-bt4 | 20241106-080455 | MAPE | 0.834418 |
| gluonts-TemporalFusionTransformer-1h-10-8736-bt4 | 20241106-080455 | sMAPE | 0.405172 |
| gluonts-TemporalFusionTransformer-1h-10-8736-bt4 | 20241106-080455 | RMSE | 146.064678 |
| gluonts-TemporalFusionTransformer-1h-10-8736-bt4 | 20241106-080455 | NRMSE | 0.507565 |
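
The composite index makes it easy to select all rows for one experiment. The sketch below rebuilds the same concatenate-and-index pattern on two made-up reports and selects one experiment with `DataFrame.xs`. The experiment names, timestamps, and values are hypothetical.

```python
import pandas as pd

# Two hypothetical per-experiment reports (values are made up for illustration)
report_a = pd.DataFrame({
    "timestamp": ["20240101-000000"] * 2,
    "metric_name": ["MAPE", "RMSE"],
    "value": [0.30, 120.0],
    "experiment": ["exp-a"] * 2,
})
report_b = pd.DataFrame({
    "timestamp": ["20240102-000000"] * 2,
    "metric_name": ["MAPE", "RMSE"],
    "value": [0.25, 150.0],
    "experiment": ["exp-b"] * 2,
})

# Same pattern as in the notebook: concatenate, then index by (experiment, timestamp)
combined = (
    pd.concat([report_a, report_b])
    .set_index(["experiment", "timestamp"])
    .sort_index()
)

# Select every row of one experiment by the 'experiment' index level
exp_a_rows = combined.xs("exp-a", level="experiment")
print(exp_a_rows)
```

With the real data, the equivalent call would presumably be `metric_df.xs(<experiment name>, level='experiment')`.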


With the data frame in place, you can query and analyze it. For example, to get all scores for a specific metric:

In [26]:
# show scores for a specific metric
metric_name = 'MAPE'
metric_data = metric_df[metric_df['metric_name'] == metric_name]
metric_data

| experiment | timestamp | metric_name | value |
|---|---|---|---|
| autogluon-2h-2-17533 | 20241003-120222 | MAPE | 0.347289 |
| autogluon-2h-370-17533 | 20240930-084718 | MAPE | 1.399402 |
| autogluon-Chronos[base]-2h-2-17533 | 20241003-130100 | MAPE | 0.296382 |
| autogluon-Chronos[base]-2h-2-17533 | 20241003-130326 | MAPE | 0.296382 |
| autogluon-Chronos[base]-2h-2-17533-bt4 | 20241003-130110 | MAPE | 0.134085 |
| autogluon-Chronos[base]-2h-2-17533-bt4 | 20241003-130334 | MAPE | 0.134085 |
| autogluon-Chronos[base]-2h-370-17533 | 20240930-114821 | MAPE | 1.470925 |
| autogluon-Chronos[base]-2h-370-17533-bt4 | 20240930-122110 | MAPE | 0.741637 |
| autopilot-2h-370-17533 | 20240930-085028 | MAPE | 2.296853 |
| autopilot-full-2h-370-17533 | 20241001-091815 | MAPE | 1.392904 |
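
Another query that may be useful is a side-by-side view with one row per experiment and one column per metric. The sketch below uses `pivot_table` on a made-up long-format frame shaped like `metric_df.reset_index()`. The experiment names and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format scores, shaped like metric_df after reset_index()
long_df = pd.DataFrame({
    "experiment": ["exp-a", "exp-a", "exp-b", "exp-b", "exp-b"],
    "metric_name": ["MAPE", "RMSE", "MAPE", "MAPE", "RMSE"],
    "value": [0.30, 120.0, 0.20, 0.30, 150.0],
})

# One row per experiment, one column per metric;
# repeated runs of the same experiment are averaged
wide = long_df.pivot_table(
    index="experiment", columns="metric_name", values="value", aggfunc="mean"
)

print(wide)
```

Averaging repeated runs is one possible choice here; `aggfunc="min"` or `aggfunc="last"` may fit better if only the best or the latest run matters.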


## Visualize metric scores

In this section you build charts based on the metric data.

In [31]:
# plot a specific metric for the given list of experiments
def plot_metric(ax, metric_df: pd.DataFrame, metric: str, experiments: list[str], sort_scores: bool = False):
    # Filter data for the metric and list of experiments
    metric_data = metric_df[
        (metric_df['metric_name'] == metric) &
        (metric_df.index.get_level_values('experiment').isin(experiments))
    ].sort_values(by='value')

    # Calculate mean values for each experiment as there can be several metric scores per experiment
    mean_values = metric_data.groupby('experiment')['value'].mean().reset_index()

    # Optionally order the bars by score; drop the old index so positions match the bar order
    if sort_scores:
        mean_values = mean_values.sort_values(by='value').reset_index(drop=True)

    # Calculate overall mean and standard deviation
    overall_mean = metric_data['value'].mean()
    overall_std = metric_data['value'].std()

    # Define colors
    bar_color = 'skyblue'
    lowest_bar_color = 'orange'
    
    # Create a bar plot
    bp = sns.barplot(x='value', y='experiment', data=mean_values, ax=ax, color=bar_color)
    # Indicate the lowest value with a different color
    if len(mean_values['value']): bp.patches[mean_values['value'].idxmin()].set_facecolor(lowest_bar_color)

    # Add mean line
    ax.axvline(overall_mean, color='r', linestyle='--', label='Mean')
    
    # Add shaded area for standard deviation
    ax.axvspan(overall_mean - overall_std, overall_mean + overall_std, 
                    alpha=0.2, color='g', label='±1 Std Dev')
    
    # Set the title and labels
    ax.set_title(f'{metric}')
    ax.set_xlabel('Metric score')
    ax.set_ylabel('Experiment')

    # Add value labels on top of each bar
    for j, v in enumerate(mean_values['value']):
        ax.text(v, j, f'{v:.2f}', va='center')

    # Add legend
    ax.legend()

In [32]:
# plot all metrics for selected experiments
def plot_scores(metric_df: pd.DataFrame, metrics: list[str], experiments: list[str] = None, sort_scores: bool = False):
    # Default to all experiments found in the index
    if not experiments:
        experiments = metric_df.index.get_level_values(0).unique().to_list()

    # Calculate the number of rows and columns for subplots
    n_metrics = len(metrics)
    n_cols = 2  # You can adjust this to change the layout
    n_rows = (n_metrics + n_cols - 1) // n_cols  # Ceiling division to ensure all metrics are included
    
    # Create a figure with subplots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 8*n_rows))
    fig.suptitle('Metric scores by experiment - lower is better', fontsize=16)
    
    # Flatten the axes array for easier indexing
    axes = axes.flatten()
            
    # Create a plot for each unique metric
    for i, m in enumerate(metrics):
        plot_metric(axes[i], metric_df, m, experiments, sort_scores)
    
    # Remove any unused subplots
    for j in range(i+1, len(axes)):
        fig.delaxes(axes[j])
    
    # Adjust the layout and display the plot
    plt.tight_layout()
    plt.show()

Use the interactive controls to build metric score plots for different metrics and experiments. You can compare metrics over multiple experiments across all executed notebooks and algorithms.

In [33]:
style = {"description_width": "80px"}
metric_list = metric_df['metric_name'].unique()
experiment_list = metric_df.index.get_level_values(0).unique().to_list()

In [34]:
@interact_manual(
    metrics=SelectMultiple(options=metric_list,value=[metric_list[0]], rows=5, style=style, description='Metrics:'),
    experiments=SelectMultiple(options=experiment_list,value=experiment_list, rows=10, style=style, description='Experiments:', layout=Layout(width='400px')),
    sort=Checkbox(value=False, description='Sort scores'),
    continuous_update=False,
)
def plot_interact(metrics, experiments, sort):
    plot_scores(metric_df, metrics, experiments, sort_scores=sort)


## Considerations for scoring analysis

<div class="alert alert-info">The electricity dataset with 15min, 1H, and 1W aggregations was used to train Chronos models; see <b>Appendix B</b> of the <a href="https://arxiv.org/html/2403.07815v1#A2">Chronos paper</a>. Chronos performs better on <b>in-domain</b> evaluation than on <b>zero-shot</b> (unseen data) benchmarks, so keep this in mind when you compare its scores on this dataset with those of other models.</div>

When analyzing prediction model performance metrics like [WQL (Weighted Quantile Loss)](https://auto.gluon.ai/stable/_modules/autogluon/timeseries/metrics/quantile.html#WQL), [MAPE (Mean Absolute Percentage Error)](https://auto.gluon.ai/stable/_modules/autogluon/timeseries/metrics/point.html#MAPE), [WAPE (Weighted Absolute Percentage Error)](https://auto.gluon.ai/stable/_modules/autogluon/timeseries/metrics/point.html#WAPE), and [RMSE (Root Mean Square Error)](https://auto.gluon.ai/stable/_modules/autogluon/timeseries/metrics/point.html#RMSE), it's crucial to consider their suitability for different types of time series data and their sensitivity to outliers and trends. WQL is particularly useful for probabilistic forecasts and handles sparse data well, but can be sensitive to outliers. MAPE is easy to interpret and allows comparisons across different scales, but performs poorly with zero or near-zero values and can be biased towards lower forecasts. WAPE addresses some of MAPE's limitations by weighting errors by the sum of actual values, making it more suitable for intermittent demand forecasting. RMSE gives higher weight to larger errors and is therefore sensitive to outliers, which makes it a useful metric for fine-tuning models on complex time series with significant fluctuations.
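The sketch below illustrates the near-zero problem of MAPE with textbook definitions of MAPE, WAPE, and RMSE in plain Python. It is not the AutoGluon implementation, which may aggregate across series differently, and the sample values are made up.

```python
import math

def mape(actual, forecast):
    # Mean of |error| / |actual|; undefined when any actual value is zero
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def wape(actual, forecast):
    # Sum of |error| divided by the sum of |actual|
    total_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    return total_error / sum(abs(a) for a in actual)

def rmse(actual, forecast):
    # Square root of the mean squared error; large errors dominate
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))

# Made-up series with one near-zero actual value; every forecast misses by 10 units
actual = [100.0, 200.0, 1.0]
forecast = [110.0, 190.0, 11.0]

print(f"MAPE: {mape(actual, forecast):.3f}")
print(f"WAPE: {wape(actual, forecast):.3f}")
print(f"RMSE: {rmse(actual, forecast):.3f}")
```

Although the absolute error is the same for every point, the single near-zero actual value pushes MAPE above 3 (over 300%), while WAPE stays at about 0.10.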

To implement a robust metric score comparison across different prediction models, the most important factors to consider are:

1. The nature of your time series data – e.g., presence of zeros, outliers, or trends
2. The scale and units of your data
3. The specific business requirements and impact of over- vs under-prediction
4. The need for probabilistic vs point forecasts
5. The interpretability of the metric for stakeholders
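
Factors 3 and 4 are linked: quantile forecasts let you penalize over- and under-prediction differently. The sketch below shows the per-observation quantile (pinball) loss, the building block that WQL aggregates over quantile levels and normalizes by the sum of absolute actual values. The function name and sample values are assumptions for illustration.

```python
def pinball_loss(actual, forecast, q):
    # Quantile (pinball) loss for one observation at quantile level q in (0, 1)
    diff = actual - forecast
    # Under-prediction (diff >= 0) is weighted by q, over-prediction by (1 - q)
    return q * diff if diff >= 0 else (q - 1) * diff

actual = 100.0
under, over = 90.0, 110.0  # both forecasts miss by 10 units

for q in (0.1, 0.5, 0.9):
    print(q, pinball_loss(actual, under, q), pinball_loss(actual, over, q))
```

At `q = 0.9` an under-prediction costs about nine times as much as an over-prediction of the same size, at `q = 0.1` the relation is reversed, and at `q = 0.5` both are penalized equally.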

It's generally recommended to use multiple complementary metrics to get a comprehensive view of model performance, as each metric captures different aspects of forecast accuracy. Additionally, consider using scale-independent metrics like [MASE (Mean Absolute Scaled Error)](https://auto.gluon.ai/stable/_modules/autogluon/timeseries/metrics/point.html#MASE) for fair comparisons across different time series, and always validate your metrics on out-of-sample data to ensure generalizability of your model's performance.
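
As a sketch of why MASE is scale-independent, the example below implements the textbook definition (forecast MAE divided by the in-sample MAE of a seasonal naive forecast) and evaluates it on the same made-up series in two different units. This is not the AutoGluon implementation; the `season` parameter and the helper `rescale` are assumptions for illustration.

```python
def mase(train, actual, forecast, season=1):
    # Scale: in-sample mean absolute error of the seasonal naive forecast
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    # Out-of-sample mean absolute error of the model forecast
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

def rescale(values, factor):
    # Express the same series in different units, for example kWh vs. Wh
    return [v * factor for v in values]

train = [10.0, 12.0, 11.0, 13.0, 12.0]
actual = [14.0, 13.0]
forecast = [13.0, 13.5]

mase_original = mase(train, actual, forecast)
mase_rescaled = mase(rescale(train, 1000), rescale(actual, 1000), rescale(forecast, 1000))

print(mase_original, mase_rescaled)
```

Both calls give 0.5: the forecast error is half the error of the naive forecast, regardless of the unit. A value below 1 means that the model beats the naive baseline on this series.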

<div class="alert alert-info">
This concludes the time series forecasting workshop. Thank you for participating!
</div>

## Fill in the workshop survey

If you're participating in an AWS-led workshop, please fill in the survey distributed by your instructor.

## Star the GitHub repo

If you liked doing this workshop, please star the [GitHub repo](https://github.com/aws-samples/modern-time-series-forecasting-on-aws) by clicking the button generated by the next code cell.

In [237]:
%%html

<a class="github-button" href="https://github.com/aws-samples/modern-time-series-forecasting-on-aws" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star Modern Time Series Forecasting on AWS on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

**Click this button ^^^ above ^^^**