# Example 11: Benchmarking with IdentiBench

IdentiBench provides standardized benchmarks for comparing system
identification methods. This example shows how to run your TSFast models
on IdentiBench benchmarks for fair, reproducible comparison with other
methods.

## Prerequisites

- [Example 00: Your First Model](00_your_first_model.ipynb)
- [Example 01: Understanding the Data Pipeline](01_data_pipeline.ipynb)
- [Example 02: Simulation](02_simulation.ipynb)
- [Example 04: Benchmark RNN](04_benchmark_rnn.py)

## Setup

In [1]:
import identibench as idb

from tsfast.datasets.benchmark import create_dls_from_spec
from tsfast.models.rnn import RNNLearner
from tsfast.inference import InferenceWrapper
from tsfast.learner.losses import fun_rmse

## What is IdentiBench?

IdentiBench is a benchmarking framework that provides standardized
datasets, evaluation protocols, and metrics for system identification.
Each benchmark defines:

- A **dataset** with specified train/validation/test splits
- **Input and output column names** (e.g., voltage in, displacement out)
- **Evaluation metrics** (typically NRMSE -- normalized root mean square
  error)
- A **standard API** that all methods must follow, ensuring fair
  comparison
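
To make the metric concrete, here is a minimal sketch of NRMSE. It assumes the common convention of dividing the RMSE by the standard deviation of the reference signal; the exact normalization IdentiBench uses may differ. The function name `nrmse` is illustrative and not part of either library.

```python
import numpy as np


def nrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """RMSE divided by the standard deviation of the reference signal.

    Assumption: normalization by std(y_true); other conventions exist.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / np.std(y_true))


# Two reference points: a perfect prediction and a constant mean prediction.
t = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * t)
perfect_score = nrmse(y, y)
mean_score = nrmse(y, np.full_like(y, y.mean()))
```

Under this convention a perfect prediction scores 0 and predicting the signal mean scores 1, which makes scores easier to compare across datasets with different output scales.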

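The `workshop_benchmarks` dictionary contains the benchmarks used in the
IdentiBench workshop -- a curated set covering different system types and
difficulties. In the runs below it resolves to four simulation benchmarks:
`BenchmarkWH_Simulation`, `BenchmarkSilverbox_Simulation`,
`BenchmarkEMPS_Simulation`, and `BenchmarkCED_Simulation`.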

## The Build Model Function

IdentiBench requires a `build_model` function that takes a
`TrainingContext` and returns a callable model for evaluation. The context
provides:

- **`context.spec`** -- the benchmark specification (dataset path, column
  names, window sizes, metric function)
- **`context.hyperparameters`** -- your model's hyperparameters, passed
  through from the benchmark runner

The returned model must accept numpy arrays: `model(u_test, y_init)` for
simulation benchmarks, where `u_test` is the full input signal and
`y_init` is the initial output window.
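
The calling convention can be illustrated with a trivial baseline that needs no training. This is a sketch only: `HoldLastValue` is a hypothetical class, and the array shapes (`(n_steps, n_inputs)` for `u_test`, `(init_window, n_outputs)` for `y_init`) and the one-prediction-per-input-step output are assumptions, not something the text above confirms.

```python
import numpy as np


def _as_2d(x) -> np.ndarray:
    """Convert to a float array and turn a 1-D signal into a single column."""
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x


class HoldLastValue:
    """Hypothetical baseline: repeats the last sample of y_init at every step.

    Assumed shapes (not confirmed by the source):
      u_test: (n_steps, n_inputs), y_init: (init_window, n_outputs).
    """

    def __call__(self, u_test: np.ndarray, y_init: np.ndarray) -> np.ndarray:
        u = _as_2d(u_test)
        y0 = _as_2d(y_init)
        # One output row per input row, each equal to the last known output.
        return np.repeat(y0[-1:, :], u.shape[0], axis=0)


u_test = np.random.default_rng(0).normal(size=(100, 1))
y_init = np.array([[0.1], [0.2], [0.3]])
y_pred = HoldLastValue()(u_test, y_init)
```

A baseline like this can be useful for checking that the numpy-in, numpy-out plumbing works before spending time on training.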

In [2]:
def build_model(context: idb.TrainingContext):
    """Build and train a TSFast model for an IdentiBench benchmark."""
    dls = create_dls_from_spec(context.spec)

    lrn = RNNLearner(
        dls,
        rnn_type=context.hyperparameters.get('model_type', 'lstm'),
        num_layers=context.hyperparameters.get('num_layers', 1),
        hidden_size=context.hyperparameters.get('hidden_size', 40),
        n_skip=context.spec.init_window,
        metrics=[fun_rmse],
    )

    lrn.fit_flat_cos(n_epoch=10, lr=3e-3)
    return InferenceWrapper(lrn)

Key details:

- **`create_dls_from_spec`** automatically extracts column names, window
  sizes, and prediction settings from the benchmark spec. It also applies
  benchmark-specific DataLoader defaults (e.g., batch size, step size)
  from TSFast's `BENCHMARK_DL_KWARGS` table.
- **`n_skip=context.spec.init_window`** uses the benchmark-defined
  initialization window to skip the initial transient in the loss. This
  matches IdentiBench's evaluation protocol, which discards the first
  `init_window` timesteps.
- **`InferenceWrapper`** wraps the trained learner into a numpy-in,
  numpy-out callable that IdentiBench's evaluation harness can call
  directly.
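
The effect of skipping the initialization window can be sketched with plain numpy. This illustrates the idea only and is not TSFast's loss implementation; `rmse_after_skip` is a hypothetical helper.

```python
import numpy as np


def rmse_after_skip(y_true, y_pred, n_skip: int) -> float:
    """RMSE over samples from index n_skip onward; earlier samples are ignored."""
    y_true = np.asarray(y_true, dtype=float)[n_skip:]
    y_pred = np.asarray(y_pred, dtype=float)[n_skip:]
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = np.zeros(100)
y_pred = np.zeros(100)
y_pred[:10] = 1.0  # error only during the first 10 samples (the "transient")

full_rmse = rmse_after_skip(y_true, y_pred, n_skip=0)      # transient included
skipped_rmse = rmse_after_skip(y_true, y_pred, n_skip=10)  # transient excluded
```

With the transient included, the error is dominated by samples where the recurrent state has not yet settled. Excluding them means training and evaluation score the same part of the sequence.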

## Configure and Run Benchmarks

We define a hyperparameter dictionary and pass it along with the
benchmarks to `idb.run_benchmarks`. The runner:

1. Downloads each dataset (on first use)
2. Calls `build_model` with the spec and hyperparameters
3. Evaluates the returned model on the held-out test set
4. Collects metrics into a pandas DataFrame

In [3]:
model_config = {
    'model_type': 'lstm',
    'num_layers': 1,
    'hidden_size': 40,
}

benchmarks = list(idb.workshop_benchmarks.values())
results = idb.run_benchmarks(benchmarks, build_model, model_config)

--- Starting benchmark run for 4 specifications, repeating each 1 times ---

-- Repetition 1/1 --

[1/4] Running: BenchmarkWH_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.013542,0.01071,0.014256,00:02
1,0.008203,0.007862,0.010135,00:02
2,0.007334,0.005773,0.007681,00:02
3,0.00754,0.007693,0.009679,00:02
4,0.005602,0.003809,0.005187,00:02
5,0.006091,0.006287,0.007872,00:02
6,0.006572,0.008107,0.010937,00:02
7,0.00529,0.005069,0.006561,00:02
8,0.00249,0.002136,0.00321,00:02
9,0.00185,0.001917,0.00291,00:02


  -> ERROR running benchmark 'BenchmarkWH_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

[2/4] Running: BenchmarkSilverbox_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.005729,0.003897,0.005963,00:02
1,0.003457,0.003796,0.005148,00:02
2,0.003146,0.003109,0.004407,00:02
3,0.002784,0.003424,0.004669,00:02
4,0.002708,0.002582,0.003909,00:02
5,0.002986,0.002651,0.003982,00:02
6,0.002712,0.002917,0.004233,00:02
7,0.002638,0.002083,0.003457,00:02
8,0.001957,0.00194,0.003453,00:02
9,0.00173,0.001772,0.003378,00:02


  -> ERROR running benchmark 'BenchmarkSilverbox_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

[3/4] Running: BenchmarkEMPS_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.069733,0.071184,0.081173,00:02
1,0.069433,0.07141,0.082529,00:02
2,0.069793,0.071213,0.082052,00:02
3,0.067787,0.067015,0.085624,00:03
4,0.059562,0.0688,0.08461,00:03
5,0.058254,0.063308,0.082248,00:03
6,0.057195,0.06352,0.080586,00:03
7,0.056593,0.062098,0.082033,00:03
8,0.055283,0.061372,0.080548,00:03
9,0.054668,0.061886,0.081729,00:02


  -> ERROR running benchmark 'BenchmarkEMPS_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

[4/4] Running: BenchmarkCED_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.094108,0.16547,0.242301,00:02
1,0.066462,0.146954,0.214916,00:02
2,0.051098,0.128206,0.179753,00:02
3,0.045406,0.102076,0.145229,00:02
4,0.04155,0.094097,0.135932,00:02
5,0.041709,0.093857,0.13276,00:02
6,0.040035,0.096683,0.137633,00:02
7,0.036568,0.09794,0.137784,00:02
8,0.031412,0.096995,0.137803,00:02
9,0.028697,0.096767,0.137891,00:02


  -> ERROR running benchmark 'BenchmarkCED_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

--- Benchmark run finished. 0/4 individual runs completed successfully. ---


## Analyze Results

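When runs succeed, the results DataFrame shows the benchmark name, metric
score, and training/test times for each benchmark. In the run recorded
above, all four benchmarks finished training but then raised
`input.size(-1) must be equal to input_size. Expected 1, got 2`, meaning
the network received two input features where it was built for one. No
results were collected, so the DataFrame is empty.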

In [4]:
print(results)

Empty DataFrame
Columns: []
Index: []
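
The four runs above all fail after training with a feature-dimension mismatch. One way to investigate, sketched here under assumptions, is to wrap the callable returned from `build_model` so that it records the shapes of the arrays it receives. `ShapeLoggingModel` and `dummy_model` are hypothetical names, and whether the evaluation harness accepts a wrapped callable is an assumption.

```python
import numpy as np


class ShapeLoggingModel:
    """Hypothetical debugging wrapper: records argument shapes, then delegates."""

    def __init__(self, model):
        self.model = model
        self.calls = []  # one list of shapes per call

    def __call__(self, *args):
        shapes = [np.shape(a) for a in args]
        self.calls.append(shapes)
        print(f"model called with shapes: {shapes}")
        return self.model(*args)


def dummy_model(u_test, y_init):
    """Stand-in for a trained model; returns zeros with one output column."""
    return np.zeros((len(u_test), 1))


wrapped = ShapeLoggingModel(dummy_model)
y_pred = wrapped(np.zeros((50, 1)), np.zeros((5, 1)))
```

In `build_model` this would amount to returning `ShapeLoggingModel(InferenceWrapper(lrn))` instead of the bare wrapper. Comparing the logged shapes with the input size the network was trained on should show where the extra feature enters.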


## Trying Different Configurations

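One of IdentiBench's strengths is making it easy to compare different
model architectures on the same benchmarks. Here we try a GRU with 2
layers instead of a single-layer LSTM. Only the hyperparameter dictionary
changes; `build_model` stays the same. The recorded run hits the same
evaluation error as before, so `results_v2` is also empty.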

In [5]:
model_config_v2 = {
    'model_type': 'gru',
    'num_layers': 2,
    'hidden_size': 40,
}

results_v2 = idb.run_benchmarks(benchmarks, build_model, model_config_v2)

--- Starting benchmark run for 4 specifications, repeating each 1 times ---

-- Repetition 1/1 --

[1/4] Running: BenchmarkWH_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.011985,0.0102,0.013661,00:03
1,0.009718,0.016444,0.019905,00:03
2,0.007916,0.008811,0.010413,00:03
3,0.006667,0.004925,0.006617,00:03
4,0.007075,0.006787,0.008137,00:03
5,0.005515,0.005563,0.006927,00:03
6,0.006052,0.007912,0.010538,00:03
7,0.005136,0.00528,0.007136,00:03
8,0.002596,0.002249,0.003203,00:03
9,0.001502,0.001535,0.002477,00:02


  -> ERROR running benchmark 'BenchmarkWH_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

[2/4] Running: BenchmarkSilverbox_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.004455,0.00305,0.004205,00:02
1,0.003158,0.002428,0.003692,00:02
2,0.003148,0.003758,0.004969,00:02
3,0.00293,0.002887,0.00411,00:02
4,0.002939,0.003219,0.00454,00:02
5,0.002938,0.002834,0.004091,00:02
6,0.002749,0.003133,0.004395,00:02
7,0.002402,0.002544,0.0039,00:02
8,0.002036,0.001876,0.003416,00:02
9,0.00176,0.001802,0.003383,00:02


  -> ERROR running benchmark 'BenchmarkSilverbox_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

[3/4] Running: BenchmarkEMPS_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.069761,0.07143,0.081981,00:03
1,0.069618,0.071303,0.081248,00:02
2,0.068572,0.071255,0.082006,00:02
3,0.069085,0.071208,0.081936,00:02
4,0.069027,0.07102,0.082321,00:02
5,0.068211,0.068786,0.081111,00:03
6,0.055204,0.050874,0.067061,00:03
7,0.033857,0.033331,0.06067,00:03
8,0.0343,0.031959,0.055136,00:04
9,0.024241,0.019655,0.034394,00:03


  -> ERROR running benchmark 'BenchmarkEMPS_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

[4/4] Running: BenchmarkCED_Simulation (Rep 1)


epoch,train_loss,valid_loss,fun_rmse,time
0,0.105475,0.16452,0.225912,00:03
1,0.049,0.07664,0.111517,00:03
2,0.043063,0.080853,0.118647,00:03
3,0.038487,0.101073,0.145044,00:02
4,0.036761,0.106693,0.15626,00:02
5,0.035981,0.117225,0.173467,00:02
6,0.031315,0.128918,0.189117,00:02
7,0.031419,0.128531,0.191488,00:02
8,0.027531,0.131669,0.2012,00:02
9,0.024075,0.130144,0.200553,00:02


  -> ERROR running benchmark 'BenchmarkCED_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2

--- Benchmark run finished. 0/4 individual runs completed successfully. ---


In [6]:
print(results_v2)

Empty DataFrame
Columns: []
Index: []


## Key Takeaways

- **IdentiBench provides standardized, reproducible benchmarks** for fair
  comparison across system identification methods.
- The **`build_model` function** follows a simple API: receive a training
  context, build and train a model, return an `InferenceWrapper`.
- **`create_dls_from_spec`** handles dataset-specific configuration
  automatically -- column names, window sizes, and prediction settings
  are all extracted from the benchmark spec.
- **Compare different architectures** (LSTM vs. GRU, depth, width) on
  the same benchmarks with minimal code changes.
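- Results are **directly comparable** with other methods in the
  IdentiBench ecosystem once a run completes without errors. The runs
  recorded in this example failed at evaluation and produced no scores.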