# Benchmark Models
> Tutorial on how to benchmark neuralforecast models on multiple datasets

Benchmarking is crucial for time series forecasting: we want to evaluate models across different datasets, with different settings, to better understand model behaviour and help us pick the right model for a task. 

In this notebook, we show how to benchmark a set of neuralforecast models on a set of commonly used benchmark time series datasets from the academic literature. 

We will show how to:
* Load a set of benchmark datasets used in the academic literature.
* Train a set of models on these datasets.
* Forecast the test set.
* Evaluate performance.

You can run these experiments on a GPU with Google Colab.

<a href="https://colab.research.google.com/github/Nixtla/neuralforecast/blob/main/nbs/examples/LongHorizon_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Installing libraries

In [None]:
%%capture
!pip install neuralforecast datasetsforecast

## 2. Load Data, Models, Losses and Metrics

The `LongHorizon` class will automatically download a set of benchmark datasets and process them. In this example, we will benchmark `NHITS`, `BiTCN`, `TSMixer`, `DLinear` and `iTransformer`.

In [None]:
import pandas as pd
from datasetsforecast.long_horizon import LongHorizon, LongHorizonInfo
from neuralforecast.core import NeuralForecast
from neuralforecast.models import NHITS, BiTCN, TSMixer, DLinear, iTransformer
from neuralforecast.losses.pytorch import MAE

## 3. Load Models

We create a `load_models` function that returns the list of models to evaluate, given a forecast horizon, input size, number of series and seed. Feel free to add your own model to the list in the function; make sure to also import it in the import statements above. We use the models mostly with their default settings in this example: we set `early_stop_patience_steps=5` and `MAE` as the validation loss for every model, cap `DLinear` at `max_steps=1000`, and vary the `scaler_type` per model.
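
To make the `scaler_type` values used in `load_models` more concrete, here is a minimal, library-free sketch of what `'identity'`, `'standard'` and `'robust'` scaling typically mean. The function names are hypothetical and this is not neuralforecast's implementation; in particular, using the median absolute deviation for the robust variant is an assumption about one common definition.

```python
from statistics import mean, median, pstdev


def identity_scale(values):
    """'identity': leave the series unchanged."""
    return list(values)


def standard_scale(values):
    """'standard': subtract the mean, divide by the standard deviation."""
    center, spread = mean(values), pstdev(values)
    spread = spread or 1.0  # avoid dividing by zero for a constant series
    return [(v - center) / spread for v in values]


def robust_scale(values):
    """'robust' (assumed variant): subtract the median, divide by the
    median absolute deviation, so a single outlier barely moves the statistics."""
    center = median(values)
    spread = median([abs(v - center) for v in values]) or 1.0
    return [(v - center) / spread for v in values]


series = [1, 2, 3, 4, 100]  # made-up series with one outlier
robust = robust_scale(series)
standard = standard_scale(series)
```

With the standard scaler, the outlier inflates the standard deviation and squeezes the other points together; the robust variant keeps them one unit apart.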

Note that `TSMixer` and `iTransformer` are multivariate models, which means they require an additional `n_series` parameter as these models will forecast all series in the dataset concurrently.
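
As a rough illustration of where `n_series` comes from: it is the number of distinct series in the dataset. The sketch below counts them in a few made-up rows in long format. The `unique_id` column name follows the usual Nixtla convention, and the rows themselves are hypothetical. In this notebook, the value is read from `LongHorizonInfo[dataset].n_ts` instead.

```python
# Hypothetical long-format rows: one row per (series, timestamp) pair.
rows = [
    {"unique_id": "series_a", "ds": "2020-01-01 00:00", "y": 1.0},
    {"unique_id": "series_a", "ds": "2020-01-01 00:15", "y": 1.5},
    {"unique_id": "series_b", "ds": "2020-01-01 00:00", "y": 7.0},
    {"unique_id": "series_b", "ds": "2020-01-01 00:15", "y": 6.5},
]

# A multivariate model forecasts all of these series jointly,
# so it needs to know how many there are up front.
n_series = len({row["unique_id"] for row in rows})
print(n_series)
```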

In [None]:
%%capture
def load_models(horizon, input_size, n_series, seed):
    """Return the list of models to benchmark for one forecast horizon."""
    models = [
        NHITS(
            h=horizon,
            input_size=input_size,
            early_stop_patience_steps=5,
            scaler_type='robust',
            valid_loss=MAE(),
            random_seed=seed,
        ),
        DLinear(
            h=horizon,
            input_size=input_size,
            max_steps=1000,
            early_stop_patience_steps=5,
            scaler_type='standard',
            valid_loss=MAE(),
            random_seed=seed,
        ),
        BiTCN(
            h=horizon,
            input_size=input_size,
            early_stop_patience_steps=5,
            scaler_type='standard',
            valid_loss=MAE(),
            random_seed=seed,
        ),
        # Multivariate models: these additionally require n_series.
        TSMixer(
            h=horizon,
            input_size=input_size,
            n_series=n_series,
            early_stop_patience_steps=5,
            scaler_type='identity',
            valid_loss=MAE(),
            random_seed=seed,
        ),
        iTransformer(
            h=horizon,
            input_size=input_size,
            n_series=n_series,
            early_stop_patience_steps=5,
            scaler_type='identity',
            valid_loss=MAE(),
            random_seed=seed,
        ),
    ]

    return models

## 4. Train models

We will train the models in a cross-validation procedure for a given dataset, horizon, input size and set of metrics.
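
To give a feel for how many forecasts this procedure produces, the sketch below counts the rolling forecast windows that fit in a test set. It assumes that windows advance by `step_size` steps and that the count is `(test_size - horizon) / step_size + 1`, which is our understanding of how `cross_validation` behaves when `n_windows=None`. The helper name and the numbers are illustrative only.

```python
def count_windows(test_size, horizon, step_size=1):
    """Number of rolling forecast windows of length `horizon` in a test set."""
    if test_size < horizon:
        raise ValueError("test_size must be at least as large as horizon")
    if (test_size - horizon) % step_size != 0:
        raise ValueError("test_size - horizon must be a multiple of step_size")
    return (test_size - horizon) // step_size + 1


# Toy numbers: a test set of 10 points and a horizon of 3.
n_windows = count_windows(test_size=10, horizon=3)

# Each window covers the half-open index range [start, start + 3) of the test set.
windows = [(start, start + 3) for start in range(n_windows)]
print(n_windows, windows[0], windows[-1])
```

Under this assumption, long horizons leave fewer windows in the same test set, while each window contains more forecast steps.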

In [None]:
%%capture
def cross_validation(results, dataset, horizon, input_size, metrics, seed=1234567):
    
    # Access the frequency, validation size, test_size and n_series of the dataset
    freq = LongHorizonInfo[dataset].freq
    val_size = LongHorizonInfo[dataset].val_size
    test_size = LongHorizonInfo[dataset].test_size
    n_series = LongHorizonInfo[dataset].n_ts  

    # Load the dataset
    Y_df, _, _ = LongHorizon.load(directory='./', group=dataset)
    Y_df['ds'] = pd.to_datetime(Y_df['ds'])

    # Create the model list
    models = load_models(horizon, input_size, n_series, seed=seed)

    # Instantiate NeuralForecast
    nf = NeuralForecast(
        models=models,
        freq=freq)   

    # Create a set of forecasts using cross-validation
    Y_hat_df = nf.cross_validation(df=Y_df,
                                val_size=val_size,
                                test_size=test_size,
                                n_windows=None)                                 
    Y_hat_df = Y_hat_df.reset_index()    

    # Save the metric results to a dictionary
    for model in models:
        results[dataset][horizon][model] = {}
        for metric, fmetric in metrics.items():
            metric_model = fmetric(Y_hat_df['y'], Y_hat_df[f'{model}'])
            results[dataset][horizon][model][metric] = metric_model

    return results

# Helper function to process the dictionary of results in the end
# https://stackoverflow.com/questions/47416113/how-to-build-a-multiindex-pandas-dataframe-from-a-nested-dictionary-with-lists    
def get_result_df(results):
    d = results
    d = {(i, j, f'{k}'): d[i][j][k] 
        for i in d.keys() 
        for j in d[i].keys()
        for k in d[i][j].keys()}     

    mux = pd.MultiIndex.from_tuples(d.keys())
    df = pd.DataFrame(list(d.values()), index=mux).stack().reset_index()
    df.columns = ['dataset', 'horizon', 'model', 'metric', 'value']
    df['value'] = df['value'].round(3)
    df['dataset'] = pd.Categorical(df['dataset'])
    df['horizon'] = pd.Categorical(df['horizon'])
    df['model'] = pd.Categorical(df['model'])
    df['metric'] = pd.Categorical(df['metric'])
    df = df.set_index(['dataset', 'horizon', 'metric', 'model'])
    df = df.unstack('metric').unstack('model')    

    return df        
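
The helper above relies on the nested layout `results[dataset][horizon][model][metric]`. As a small illustration of its first step, the sketch below flattens such a dictionary into tuple keys, which is the shape `pd.MultiIndex.from_tuples` expects. The example values are two of the `ETTm1` entries reported in the Results section, and the model names are written as plain strings for simplicity.

```python
# Nested results: dataset -> horizon -> model -> metric -> value.
nested = {
    "ETTm1": {
        96: {
            "NHITS": {"MSE": 0.323, "MAE": 0.35},
            "DLinear": {"MSE": 0.349, "MAE": 0.365},
        },
    },
}

# Flatten the first three levels into (dataset, horizon, model) tuples.
flat = {
    (dataset, horizon, model): metric_values
    for dataset, by_horizon in nested.items()
    for horizon, by_model in by_horizon.items()
    for model, metric_values in by_model.items()
}

for key, metric_values in flat.items():
    print(key, metric_values)
```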

## 5. Running the benchmark

First, we define all our experimental settings:
- A set of datasets from `LongHorizon`
- The input size (sequence length) to the models
- A set of metrics to evaluate

In this example, we will only evaluate on the `ETTm1` dataset. You can uncomment the other datasets to include them in the benchmark. 

:::{.callout-important}
Note that benchmarking may take a long time and require a high amount of resources.
:::
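
As a back-of-the-envelope check on the cost, the sketch below counts how many models are trained in the default configuration of this notebook: one dataset, the four `ETTm1` horizons that appear in the results table, and five models. The variable names are illustrative; enabling more datasets multiplies the total accordingly.

```python
from itertools import product

enabled_datasets = ["ETTm1"]
# Horizons reported for ETTm1 in the results table of this notebook.
horizons_by_dataset = {"ETTm1": [96, 192, 336, 720]}
model_names = ["NHITS", "DLinear", "BiTCN", "TSMixer", "iTransformer"]

# One entry per model that has to be trained and evaluated.
runs = [
    (dataset, horizon, model)
    for dataset in enabled_datasets
    for horizon, model in product(horizons_by_dataset[dataset], model_names)
]
print(len(runs))  # 1 dataset x 4 horizons x 5 models = 20
```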

In [None]:
from neuralforecast.losses.numpy import mse, mae, smape

In [None]:
%%capture
# Define the set of datasets to evaluate. All available datasets are listed; uncomment those you wish to include in the benchmark.
datasets = {
            # 'ETTh1',
            # 'ETTh2',
            'ETTm1',
            # 'ETTm2',
            # 'ECL',
            # 'TrafficL',
            # 'Weather',
            # 'ILI',
            }

# Input_size and metrics to evaluate.
input_size = 96
metrics = {'MSE': mse, 
           'MAE': mae, 
           'sMAPE': smape}
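
For reference, the three metrics can be sketched in plain Python as follows. These are textbook definitions written for illustration, with hypothetical function names. The functions in `neuralforecast.losses.numpy` are the ones actually used in the benchmark, and their exact scaling (for sMAPE in particular) may differ from this sketch.

```python
def mae_sketch(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse_sketch(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def smape_sketch(y, y_hat):
    """Symmetric MAPE, here scaled to lie in [0, 2] (an assumed convention)."""
    total = 0.0
    for a, b in zip(y, y_hat):
        scale = abs(a) + abs(b)
        if scale > 0:  # treat 0 / 0 as zero error
            total += 2 * abs(a - b) / scale
    return total / len(y)


# Made-up actuals and forecasts.
actual = [1.0, 2.0, 4.0]
forecast = [1.0, 3.0, 2.0]
print(mae_sketch(actual, forecast), mse_sketch(actual, forecast), smape_sketch(actual, forecast))
```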

Now, we can run the benchmark experiment. 

The following code will loop over all datasets, and over all horizons that each dataset is commonly evaluated on (this is provided as an attribute in `LongHorizonInfo`). It will then cross-validate the set of models for each dataset-horizon combination, and return the metrics on the test set.

In [None]:
%%capture
results = {}
for dataset in datasets:
    results[dataset] = {}
    horizons = LongHorizonInfo[dataset].horizons
    for horizon in horizons:
        results[dataset][horizon] = {}
        results = cross_validation(results, dataset, horizon, input_size, metrics, seed=1234567)

df_results = get_result_df(results)

INFO:lightning_fabric.utilities.seed:Seed set to 1234567
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name         | Type          | Params
-----------------------------------------------
0 | loss         | MAE           | 0     
1 | valid_loss   | MAE           | 0     
2 | padder_train | ConstantPad1d | 0     
3 | scaler       | TemporalNorm  | 0     

## 6. Results

The results are returned in a pandas DataFrame, `df_results`. You can compare these results to the results reported in the respective papers of these methods.

As you can see, it's a tight battle between these methods on `ETTm1`: at every horizon, the best and worst MAE differ by less than 0.03.

In [None]:
df_results

| dataset | horizon | MAE: BiTCN | MAE: DLinear | MAE: NHITS | MAE: TSMixer | MAE: iTransformer | MSE: BiTCN | MSE: DLinear | MSE: NHITS | MSE: TSMixer | MSE: iTransformer | sMAPE: BiTCN | sMAPE: DLinear | sMAPE: NHITS | sMAPE: TSMixer | sMAPE: iTransformer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTm1 | 96 | 0.361 | 0.365 | 0.35 | 0.351 | 0.376 | 0.34 | 0.349 | 0.323 | 0.334 | 0.352 | 0.694 | 0.702 | 0.671 | 0.681 | 0.715 |
| ETTm1 | 192 | 0.383 | 0.386 | 0.379 | 0.374 | 0.398 | 0.383 | 0.391 | 0.377 | 0.381 | 0.397 | 0.716 | 0.723 | 0.701 | 0.703 | 0.749 |
| ETTm1 | 336 | 0.402 | 0.406 | 0.409 | 0.395 | 0.415 | 0.414 | 0.423 | 0.423 | 0.412 | 0.425 | 0.737 | 0.749 | 0.735 | 0.73 | 0.757 |
| ETTm1 | 720 | 0.436 | 0.444 | 0.446 | 0.431 | 0.45 | 0.473 | 0.49 | 0.481 | 0.476 | 0.485 | 0.775 | 0.794 | 0.787 | 0.776 | 0.8 |
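
One simple way to read the table is to pick the model with the lowest error at each horizon. The sketch below does this for MAE, with the values copied from the table above; the variable names are illustrative.

```python
# MAE on ETTm1 per horizon and model, copied from the results table.
mae_by_horizon = {
    96: {"BiTCN": 0.361, "DLinear": 0.365, "NHITS": 0.35, "TSMixer": 0.351, "iTransformer": 0.376},
    192: {"BiTCN": 0.383, "DLinear": 0.386, "NHITS": 0.379, "TSMixer": 0.374, "iTransformer": 0.398},
    336: {"BiTCN": 0.402, "DLinear": 0.406, "NHITS": 0.409, "TSMixer": 0.395, "iTransformer": 0.415},
    720: {"BiTCN": 0.436, "DLinear": 0.444, "NHITS": 0.446, "TSMixer": 0.431, "iTransformer": 0.45},
}

# For each horizon, select the model name with the smallest MAE.
best_by_horizon = {
    horizon: min(scores, key=scores.get)
    for horizon, scores in mae_by_horizon.items()
}
print(best_by_horizon)
```

On MAE, `NHITS` comes out best at horizon 96 and `TSMixer` at the three longer horizons, although the margins are small.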
