# Example: Analyze the Performance of Surrogates

In this notebook, we want to analyze how different surrogates perform.

For this, we first need to evaluate the accuracy of the surrogates' predictions. To do this, we can use the `tsbench` CLI:

```bash
tsbench analysis surrogate \
    --experiment surrogate-analysis \
    --config_path configs/analysis/surrogates.yaml
```

This command runs a hyperparameter search over the surrogate configurations stored in `configs/analysis/surrogates.yaml` and stores the resulting metrics in MongoDB using Sacred. We can later retrieve these metrics by querying the `surrogate-analysis` experiment in MongoDB.

_**Note 1:** The above command takes quite some time to run since it only terminates once all configurations have been evaluated. Consider running it in a `tmux` session._

_**Note 2:** If you want to evaluate surrogate model performance when dataset meta-features are used (the default for `configs/analysis/surrogates.yaml`), they need to be precomputed. For this, run `tsbench datasets compute-stats` and `tsbench datasets compute-catch22`. The latter command takes some time (up to 2 hours with 48 CPUs) and requires plenty of memory (up to 128 GiB) due to a memory leak in the catch22 library. If you do not have that much memory available, consider passing the `--dataset` option to the command and running it for individual datasets._

## Load the Metrics from MongoDB

First, we want to load the performance of the individual surrogates from MongoDB.

In [1]:
from tsbench.analysis.tracking import SacredMongoClient

In [2]:
client = SacredMongoClient("surrogate-analysis")

In [3]:
configs = []
metrics = []
for experiment in client:
    # Load the metrics file which contains the performances for each left-out dataset
    df = experiment.read_parquet("results.parquet")
    configs.append(experiment.config)
    # Average the performance across all test datasets
    metrics.append(df.mean())

## Build a Data Frame

After we have loaded the metrics, we can aggregate them into a dataframe.

In [4]:
import pandas as pd

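def flatten(data):
    """Flatten a nested dict, joining nested keys with '_' and converting leaf values to strings."""
    result = {}
    for k, v in data.items():
        if isinstance(v, dict):
            # Recurse into nested dicts and prefix their keys with the parent key
            result.update({f"{k}_{kk}": vv for kk, vv in flatten(v).items()})
        else:
            # Stringify leaves so that all index levels share a comparable type
            result[k] = str(v)
    return result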
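To make the effect of `flatten` concrete, here is a minimal sketch applied to a hypothetical configuration. The nested structure is an assumption chosen so that the flattened keys match those used in the query later in this notebook; the actual configs stored by Sacred may contain other keys.

```python
def flatten(data):
    """Flatten a nested dict, joining nested keys with '_' and stringifying leaves."""
    result = {}
    for k, v in data.items():
        if isinstance(v, dict):
            result.update({f"{k}_{kk}": vv for kk, vv in flatten(v).items()})
        else:
            result[k] = str(v)
    return result


# Hypothetical config, for illustration only
example_config = {
    "surrogate": "mlp",
    "mlp": {"objective": "ranking", "discount": "linear"},
    "inputs": {"use_simple_dataset_features": False},
}

flat = flatten(example_config)
for key, value in flat.items():
    print(f"{key} = {value!r}")
```

Note that the boolean `False` becomes the string `'False'`, so any later filter on this value has to compare against a string.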

In [5]:
index_df = pd.DataFrame([flatten(c) for c in configs])
df = pd.DataFrame(metrics, index=pd.MultiIndex.from_frame(index_df)).sort_index()
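The following sketch shows the same construction on toy data, so the shape of the result is easier to see. The `seed` key and all metric values are made up for illustration and do not come from the actual experiment.

```python
import pandas as pd

# Hypothetical flattened configs (all values are strings, as produced by `flatten`)
index_df = pd.DataFrame(
    [
        {"surrogate": "random", "seed": "0"},
        {"surrogate": "mlp", "seed": "1"},
        {"surrogate": "mlp", "seed": "0"},
    ]
)

# Hypothetical averaged metrics, one Series per run
metrics = [
    pd.Series({"ndcg": 0.25, "mrr": 0.125}),
    pd.Series({"ndcg": 0.75, "mrr": 0.5}),
    pd.Series({"ndcg": 0.5, "mrr": 0.25}),
]

# Each config column becomes one level of the row index
toy_df = pd.DataFrame(metrics, index=pd.MultiIndex.from_frame(index_df)).sort_index()
print(list(toy_df.index.names))
print(toy_df)
```

Every configuration key ends up as a named index level, which is what allows `query` to filter on config values by name.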

## Report the Performance

Finally, we can filter the runs by the type of surrogate (multiple random evaluations may be run per surrogate) and compute their average performance.

In [6]:
# Config values were converted to strings by `flatten`, hence the comparison against 'False'
df.query(
    "surrogate == 'mlp' and "
    "mlp_objective == 'ranking' and "
    "mlp_discount == 'linear' and "
    "inputs_use_simple_dataset_features == 'False'"
).mean().sort_index()

latency_mean  mrr            NaN
              ndcg           NaN
              nrmse          NaN
              precision_10   NaN
              precision_20   NaN
              precision_5    NaN
              smape          NaN
ncrps_mean    mrr            NaN
              ndcg           NaN
              nrmse          NaN
              precision_10   NaN
              precision_20   NaN
              precision_5    NaN
              smape          NaN
dtype: float64
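As an alternative to filtering for one configuration at a time, the runs can be averaged per surrogate type with a group-by on the index level. The sketch below uses toy data with hypothetical values and a hypothetical `seed` level. It also shows that a filter matching no rows produces an all-NaN mean, which is one possible (but not the only) cause of a result like the one above.

```python
import pandas as pd

# Toy frame indexed by hypothetical config values
index = pd.MultiIndex.from_tuples(
    [("mlp", "0"), ("mlp", "1"), ("random", "0")],
    names=["surrogate", "seed"],
)
toy_df = pd.DataFrame(
    {"ndcg": [0.5, 0.75, 0.25], "mrr": [0.25, 0.5, 0.125]},
    index=index,
)

# Average over all runs that share the same surrogate type
per_surrogate = toy_df.groupby(level="surrogate").mean()
print(per_surrogate)

# A filter that matches no rows yields an empty frame, whose mean is NaN everywhere
no_match = toy_df.query("surrogate == 'xgboost'")
print(len(no_match))
print(no_match.mean())
```

Checking `len(...)` on the filtered frame before calling `.mean()` is a cheap way to tell an empty selection apart from metrics that are genuinely missing.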