# Example: Evaluate Ensemble Performance

In this notebook, we demonstrate how to evaluate the performance of ensembles. This requires that model evaluations are available locally **along with their stored forecasts**, for example by using the CLI to download the publicly available evaluations:

```bash
tsbench evaluations download --include_forecasts
```

_**Note:** The publicly available forecasts require roughly 600 GiB of available local storage._

## Initialize the Tracker

As you have seen in the `browse-offline-evaluations` notebook, we use the tracker for accessing the evaluations that are available offline.

In [1]:
from pathlib import Path
from tsbench.evaluations.tracking import ModelTracker

In [2]:
tracker = ModelTracker.from_directory(Path.home() / "evaluations")

## Manual Evaluation

For this example, we want to analyze the performance of a simple combination of two deep learning models (TFT and DeepAR) and a classical model (ARIMA) and compare it to the performance of the individual models.
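The notebook does not spell out how the members' forecasts are combined. As a rough intuition only, the sketch below assumes the simplest possible scheme: an element-wise, unweighted mean of the members' quantile forecasts. The function `average_quantile_forecasts`, the dictionary layout, and all numbers are hypothetical and are not part of the `tsbench` API.

```python
from statistics import fmean


def average_quantile_forecasts(member_forecasts):
    """Average quantile forecasts element-wise across ensemble members.

    Hypothetical helper for illustration, not the tsbench implementation.
    Each member forecast maps a quantile level to one value per time step.
    """
    levels = member_forecasts[0].keys()
    return {
        level: [
            fmean(step_values)
            for step_values in zip(*(m[level] for m in member_forecasts))
        ]
        for level in levels
    }


# Made-up forecasts for two time steps (illustrative values only).
tft = {0.5: [10.0, 11.0], 0.9: [13.0, 14.0]}
deepar = {0.5: [12.0, 13.0], 0.9: [15.0, 16.0]}
arima = {0.5: [14.0, 15.0], 0.9: [17.0, 18.0]}

ensemble = average_quantile_forecasts([tft, deepar, arima])
print(ensemble)
```

In practice, `EnsembleAnalyzer` takes care of loading and combining the stored forecasts, so nothing like this has to be written by hand.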

In [3]:
from tsbench.config import DATASET_REGISTRY, Config
from tsbench.config.model.models import (
    ARIMAModelConfig,
    DeepARModelConfig,
    TemporalFusionTransformerModelConfig,
)
from tsbench.analysis import EnsembleAnalyzer

In [4]:
analyzer = EnsembleAnalyzer(tracker)

In [5]:
members = [
    TemporalFusionTransformerModelConfig(),
    DeepARModelConfig(),
    ARIMAModelConfig(),
]

In [6]:
dataset = DATASET_REGISTRY["kdd_2018"]()

In [7]:
ensemble_performance = analyzer.get_ensemble_performance(members, dataset)

In [8]:
ensemble_performance.ncrps

Metric(mean=0.4105578065745702, std=0.0247243410990787)

In [9]:
for member in members:
    print(member.name(), ":", tracker.get_performance(Config(member, dataset)).ncrps)

tft : Metric(mean=0.4358486086130142, std=0.011365100741386414)
deepar : Metric(mean=0.41372910141944885, std=0.08784350752830505)
arima : Metric(mean=0.5097207129001617, std=0.0018174946308135986)


As this simple example shows, the ensemble achieves a lower (better) mean nCRPS than each of the individual models, although its margin over DeepAR, the best single member, is small.
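To put the comparison into numbers, the following sketch computes the relative nCRPS improvement of the ensemble over each member. The mean values are copied from the outputs above. The helper `relative_improvement` is a hypothetical name introduced here and is not part of `tsbench`.

```python
# Mean nCRPS values copied from the cell outputs above (lower is better).
ensemble_ncrps = 0.4105578065745702
member_ncrps = {
    "tft": 0.4358486086130142,
    "deepar": 0.41372910141944885,
    "arima": 0.5097207129001617,
}


def relative_improvement(baseline, candidate):
    """Fraction by which `candidate` lowers the metric relative to `baseline`."""
    return (baseline - candidate) / baseline


improvements = {
    name: relative_improvement(value, ensemble_ncrps)
    for name, value in member_ncrps.items()
}

for name, gain in improvements.items():
    print(f"{name}: {gain:.1%}")
```

Note that the improvement over DeepAR is below 1%, while DeepAR's reported standard deviation (about 0.088) is far larger than that gap, so this particular difference should be read with caution.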

## Exhaustive Evaluation

If you want to analyze the performance of ensembles across datasets, you should use the `tsbench` CLI. Check out the ensemble analysis command:

```bash
tsbench analysis ensemble --help
```