# Example: Browse Offline Evaluations

This notebook shows how to browse evaluations locally after downloading them via `tsbench evaluations download`.

It does not matter whether the evaluations come from your own Amazon SageMaker experiment (by passing the `--experiment` flag to the download command) or from the [publicly available evaluations](https://registry.opendata.aws/tsbench/).

By default, the command downloads only the metrics. This keeps the download at roughly 20 MiB, which suffices for most use cases. If you need the actual forecasts (e.g. to build ensembles), you must pass the `--include_forecasts` flag. The forecasts amount to almost 600 GiB of data.

## Initialize the Tracker

The tracker is responsible for accessing the evaluations that are available offline. Whenever you want to access evaluations, you should **only** use this class.

In [1]:
from pathlib import Path
from tsbench.evaluations.tracking import ModelTracker

When the tracker is initialized for the first time, it loads the performance metrics obtained from all evaluations. For the publicly available data, this takes roughly 7 seconds. Afterwards, the tracker will be cached.

_**Note:** If you modify any code related to the tracker, you need to delete the cache, which is located at `~/.cache/tsbench`._
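Clearing the cache mentioned in the note above can also be done from Python. The following is a minimal sketch: the helper name `clear_tracker_cache` is hypothetical (it is not part of `tsbench`), and it assumes the cache at `~/.cache/tsbench` is a directory.

```python
import shutil
from pathlib import Path


def clear_tracker_cache(cache_dir: Path = Path.home() / ".cache" / "tsbench") -> bool:
    """Delete the cache directory if it exists.

    Hypothetical helper, not part of tsbench. Returns True if a directory
    was removed and False if there was nothing to delete.
    """
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Calling `clear_tracker_cache()` before re-creating the tracker should then force the metrics to be loaded from the evaluation files again.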

In [2]:
tracker = ModelTracker.from_directory(Path.home() / "evaluations")

## Aggregate Metrics

First, we look at all evaluations that are available offline.

In [3]:
evaluations = tracker.get_evaluations()

In [4]:
df = evaluations.dataframe()

The data frame provides us with a mapping from datasets and model configurations to performance metrics. For example, to obtain the best model configuration in terms of nCRPS on the `m4_monthly` dataset, we can run the following:

In [5]:
import numpy as np

In [6]:
dataset_evaluations = df.query("dataset == 'm4_monthly'")
print(f"Best nCRPS: {dataset_evaluations.ncrps_mean.min():.4f}")

i = np.argmin(dataset_evaluations.ncrps_mean)
print(f"Best Model: {dataset_evaluations.index[i]}")

Best nCRPS: 0.0920
Best Model: ('m4_monthly', 'deepar', 1.0, 0.001, 2.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 2.0, 40.0)
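The same lookup can be written without NumPy by using pandas' `idxmin`, which returns the index label of the smallest value directly. The sketch below runs on a small toy data frame with a two-level index and made-up metric values. The real `df` has many more index levels and columns.

```python
import pandas as pd

# Toy stand-in for `df`: two index levels and invented nCRPS values.
index = pd.MultiIndex.from_tuples(
    [("m4_monthly", "deepar"), ("m4_monthly", "mqcnn"), ("m4_weekly", "deepar")],
    names=["dataset", "model"],
)
toy_df = pd.DataFrame({"ncrps_mean": [0.10, 0.12, 0.05]}, index=index)

# `query` can filter on index level names just like on column names.
subset = toy_df.query("dataset == 'm4_monthly'")

best_label = subset["ncrps_mean"].idxmin()  # index label of the smallest value
best_value = subset["ncrps_mean"].min()
print(f"Best nCRPS: {best_value:.4f}")
print(f"Best Model: {best_label}")
```

Because `idxmin` works on labels rather than positions, it avoids the separate positional lookup into `dataset_evaluations.index`.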


This model representation makes more sense when looking at the names of the index:

In [7]:
df.index.names

FrozenList(['dataset', 'model', 'model_training_fraction', 'model_learning_rate', 'model_context_length_multiple', 'model_mqcnn_num_filters', 'model_mqcnn_kernel_size_first', 'model_mqcnn_kernel_size_hidden', 'model_mqcnn_kernel_size_last', 'model_tft_hidden_dim', 'model_tft_num_heads', 'model_simple_feedforward_hidden_dim', 'model_simple_feedforward_num_layers', 'model_nbeats_num_stacks', 'model_nbeats_num_blocks', 'model_deepar_num_layers', 'model_deepar_num_cells'])
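To make the correspondence explicit, the index names can be paired with the values of the best model and the `nan` entries (hyperparameters that belong to other model types) dropped. The sketch below copies both from the outputs above and uses only the standard library.

```python
import math

nan = float("nan")

# Index names and best-model tuple copied from the outputs above.
index_names = [
    "dataset", "model", "model_training_fraction", "model_learning_rate",
    "model_context_length_multiple", "model_mqcnn_num_filters",
    "model_mqcnn_kernel_size_first", "model_mqcnn_kernel_size_hidden",
    "model_mqcnn_kernel_size_last", "model_tft_hidden_dim",
    "model_tft_num_heads", "model_simple_feedforward_hidden_dim",
    "model_simple_feedforward_num_layers", "model_nbeats_num_stacks",
    "model_nbeats_num_blocks", "model_deepar_num_layers",
    "model_deepar_num_cells",
]
best_model = (
    "m4_monthly", "deepar", 1.0, 0.001, 2.0,
    nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
    2.0, 40.0,
)

# Keep only the entries that are actually set for this model.
settings = {
    name: value
    for name, value in zip(index_names, best_model)
    if not (isinstance(value, float) and math.isnan(value))
}

for name, value in settings.items():
    print(f"{name}: {value}")
```

Only the dataset, the model name, the three shared hyperparameters, and the two DeepAR-specific hyperparameters remain.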

Essentially, it boils down to the following configuration:

In [8]:
from tsbench.config.model.models import DeepARModelConfig

In [9]:
model_config = DeepARModelConfig(context_length_multiple=2)

## Obtain Metrics from the Tracker

For a particular combination of model configuration and dataset, it is very easy to obtain the performance metrics, the forecasts, and plenty of additional information. Consult the documentation of the `ModelTracker` class for all possible operations.

In [10]:
from tsbench.config import DATASET_REGISTRY, Config

In [11]:
config = Config(model_config, DATASET_REGISTRY["m4_monthly"]())

In [12]:
tracker.get_performance(config)

Performance(training_time=Metric(mean=24000.048828125, std=4799.99609375), latency=Metric(mean=0.008198121096938848, std=8.529191836714745e-05), num_model_parameters=Metric(mean=23164.0, std=0.0), num_gradient_updates=Metric(mean=349418.5, std=62984.5), ncrps=Metric(mean=0.09203928336501122, std=0.00035720691084861755), mase=Metric(mean=0.9362838566303253, std=0.006132692098617554), smape=Metric(mean=0.12887728214263916, std=0.001025184988975525), nrmse=Metric(mean=0.2828601896762848, std=0.0015260577201843262), nd=Metric(mean=0.11401168629527092, std=0.0006160400807857513))

In [13]:
forecasts = tracker.get_forecasts(config)

The result is an array with one forecast object per evaluation of the provided configuration, since the same model configuration might have been evaluated on the same dataset with multiple seeds.

In [14]:
forecasts[0].quantiles

['0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9']

In [15]:
forecasts[0].values.shape

(48000, 9, 18)

A single forecast object then provides the forecasts for all time series (in this case: 48,000) across all evaluated quantiles (in this case: the nine deciles from 0.1 to 0.9) and the entire forecast horizon (in this case: 18 steps).
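Given this layout (time series, quantile, forecast step), individual quantiles can be selected by looking up their position in the `quantiles` list. The following sketch uses a small synthetic array in place of `forecasts[0].values`, so the numbers themselves are meaningless. Only the indexing pattern carries over.

```python
import numpy as np

quantiles = ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]

# Synthetic stand-in for `forecasts[0].values` with 5 series instead of 48,000.
# Sorting along the quantile axis makes the quantiles increase monotonically.
rng = np.random.default_rng(0)
values = np.sort(rng.normal(size=(5, len(quantiles), 18)), axis=1)

# Median forecast for every series and every step of the horizon.
median = values[:, quantiles.index("0.5"), :]

# Width of the central 80% interval (0.9 quantile minus 0.1 quantile).
interval_width = (
    values[:, quantiles.index("0.9"), :] - values[:, quantiles.index("0.1"), :]
)

print(median.shape, interval_width.shape)
```

Both results have one row per time series and one column per forecast step.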