In [1]:
import numpy as np
import pandas as pd

In [2]:
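# Placeholder values for every dimension a metric can be reported along.
model = ["TFT", "TCN", "XGB", "MLP", "LSTM", "SARIMA"]
metric = ["percent_data_used", "MSE", "MAPE"]
# "own_data": subset a single model can process; "common_data": subset all models can process.
coverage = ["own_data", "common_data"]
feature = ["feat_1", "feat_2"]
test_series_id = ["room_23", "room24"]
# Forecast offsets (horizons): 1 h and 2 h.
offset = pd.timedelta_range("1h", "2h", freq="h")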

## General considerations
### Readability
We should consolidate all the metrics as much as possible.
For the code, this means that they must be accessible in a uniform, self-explanatory way.
In the current implementation, one needs to remember which levels are lists, which are data frames, which are dictionaries, and what the keys are at the current dictionary level.
A better approach would be to follow these rules:
- There should be only one column index level so that all the models are next to each other for better comparison.
- To avoid jumping between multiple tables, whenever multiple metrics have the same table structure, i.e., the same index level names, the reporter should concatenate them.
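
The concatenation rule can be sketched as follows. This is a minimal illustration, not the reporter's actual implementation: the helper `make_metric_table`, the reduced model list, and the random values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical, reduced dimensions for illustration only.
models = ["TFT", "XGB"]
series = ["room_23", "room_24"]
rng = np.random.default_rng(0)


def make_metric_table() -> pd.DataFrame:
    """Return a placeholder metric table: one row per series, one column per model."""
    return pd.DataFrame(
        rng.random((len(series), len(models))).round(2),
        index=pd.Index(series, name="series_id"),
        columns=pd.Index(models, name="model"),
    )


mse_df = make_metric_table()
mape_df = make_metric_table()

# Both tables share the same index level names, so they can be stacked
# vertically; the dict keys become a new outer row level called "metric".
combined = pd.concat({"MSE": mse_df, "MAPE": mape_df}, names=["metric"])
print(combined)
```

The column index stays single-level, so the models remain side by side, while the metric name becomes just another row level.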
### Fairness
Since we leave the model-specific data processing to the model adapter, we can end up with an unfair comparison in which a model adapter drops all data with high uncertainty and thereby gains an advantage over one that attempts forecasts on a larger portion of the data. This can be mitigated by the following steps:
1. Compute every metric twice:
    - once for the biggest subset of the data that _each_ model can process
    - once for the biggest subset of the data that _all_ models can process, i.e., the intersection of the above subsets
2. Introduce an additional mandatory metric indicating what percentage of the test set each model adapter discarded.
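
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the adapters are assumed to mark discarded samples as `NaN` in their predictions, and the series, values, and the helper `mse` are made up for the example. The usage metric is expressed here as a percentage of the test set that was kept.

```python
import numpy as np
import pandas as pd

timestamps = pd.date_range("2024-01-01", periods=6, freq="h")
y_true = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=timestamps)

# Assumption: NaN marks a sample that the model adapter discarded.
predictions = pd.DataFrame(
    {
        "XGB": [1.1, 2.1, np.nan, 4.2, 5.0, np.nan],
        "MLP": [0.9, np.nan, 3.3, 4.1, 5.2, 6.3],
    },
    index=timestamps,
)

own_mask = predictions.notna()       # per model: samples it could process
common_mask = own_mask.all(axis=1)   # intersection: samples all models processed


def mse(pred: pd.Series, truth: pd.Series) -> float:
    """Mean squared error over the samples present in `pred`."""
    return float(((pred - truth.loc[pred.index]) ** 2).mean())


row_idx = pd.MultiIndex.from_product(
    [["percent_data_used", "MSE"], ["own_data", "common_data"]],
    names=["metric", "evaluated_on"],
)
report = pd.DataFrame(
    index=row_idx,
    columns=pd.Index(predictions.columns, name="model"),
    dtype=float,
)

for name in predictions.columns:
    own = predictions.loc[own_mask[name], name]
    common = predictions.loc[common_mask, name]
    report.loc[("percent_data_used", "own_data"), name] = 100 * own_mask[name].mean()
    report.loc[("percent_data_used", "common_data"), name] = 100 * common_mask.mean()
    report.loc[("MSE", "own_data"), name] = mse(own, y_true)
    report.loc[("MSE", "common_data"), name] = mse(common, y_true)

print(report.round(3))
```

On `common_data`, every model is scored on exactly the same samples, so an adapter gains nothing from dropping hard cases there.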

## Simple Case: Scalar Metrics
The performance of a model on the entire test set is expressed through a single number.

In [3]:
scalar_col_idx = pd.Index(model, name="model")
scalar_row_idx = pd.MultiIndex.from_product([metric, coverage], names=["metric", "evaluated_on"])
scalar_metric_values = np.random.rand(
    len(scalar_row_idx), 
    len(scalar_col_idx),
).round(2)

scalar_metric_df = pd.DataFrame(
    columns=scalar_col_idx,
    index=scalar_row_idx, 
    data=scalar_metric_values
)
scalar_metric_df

| metric | evaluated_on | TFT | TCN | XGB | MLP | LSTM | SARIMA |
|---|---|---|---|---|---|---|---|
| percent_data_used | own_data | 0.43 | 0.19 | 0.59 | 0.12 | 0.87 | 0.62 |
| percent_data_used | common_data | 0.11 | 0.39 | 0.72 | 0.57 | 0.99 | 0.41 |
| MSE | own_data | 0.85 | 0.74 | 0.25 | 0.17 | 0.87 | 0.04 |
| MSE | common_data | 0.52 | 0.43 | 0.15 | 0.58 | 0.34 | 0.21 |
| MAPE | own_data | 0.19 | 0.92 | 0.21 | 0.43 | 0.49 | 0.47 |
| MAPE | common_data | 0.14 | 0.46 | 0.29 | 0.97 | 0.89 | 0.38 |
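
With this layout, every value is reachable through the same label-based lookups. The sketch below rebuilds a table of the same shape with seeded placeholder values (the seed and the variable names `df`, `mse_common`, and `common_only` are assumptions for illustration) and shows two typical queries.

```python
import numpy as np
import pandas as pd

model = ["TFT", "TCN", "XGB", "MLP", "LSTM", "SARIMA"]
metric = ["percent_data_used", "MSE", "MAPE"]
coverage = ["own_data", "common_data"]

col_idx = pd.Index(model, name="model")
row_idx = pd.MultiIndex.from_product(
    [metric, coverage], names=["metric", "evaluated_on"]
)

# Seeded placeholder values so that the example is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((len(row_idx), len(col_idx))).round(2),
    index=row_idx,
    columns=col_idx,
)

# One metric on one coverage: a Series indexed by model.
mse_common = df.loc[("MSE", "common_data"), :]
print("Lowest MSE on common data:", mse_common.idxmin())

# All metrics restricted to the common subset: one row per metric.
common_only = df.xs("common_data", level="evaluated_on")
print(common_only)
```

No knowledge of nested lists or dictionary keys is required; the index level names document the structure.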


## Complex Case: Tabular Metrics
The performance of a model needs to be communicated separately for various vertical or horizontal slices of the test set, e.g., one value per offset, one per test series, etc.

In [4]:
tabular_row_idx = pd.MultiIndex.from_product(
    [metric, coverage, feature, test_series_id, offset],
    names=["metric", "evaluated_on", "feature", "series_id", "offset"]
)
tabular_metric_values = np.random.rand(
    len(tabular_row_idx), 
    len(scalar_col_idx),
).round(2)

tabular_metric_df = pd.DataFrame(
    columns=scalar_col_idx,
    index=tabular_row_idx, 
    data=tabular_metric_values
)
tabular_metric_df

| metric | evaluated_on | feature | series_id | offset | TFT | TCN | XGB | MLP | LSTM | SARIMA |
|---|---|---|---|---|---|---|---|---|---|---|
| percent_data_used | own_data | feat_1 | room_23 | 0 days 01:00:00 | 0.21 | 0.06 | 0.19 | 0.08 | 0.32 | 0.31 |
| percent_data_used | own_data | feat_1 | room_23 | 0 days 02:00:00 | 0.7 | 0.32 | 0.89 | 0.87 | 0.7 | 0.84 |
| percent_data_used | own_data | feat_1 | room24 | 0 days 01:00:00 | 0.94 | 0.23 | 0.52 | 0.73 | 0.28 | 0.08 |
| percent_data_used | own_data | feat_1 | room24 | 0 days 02:00:00 | 0.02 | 0.7 | 0.1 | 0.0 | 0.27 | 0.84 |
| percent_data_used | own_data | feat_2 | room_23 | 0 days 01:00:00 | 0.42 | 0.05 | 0.88 | 0.9 | 0.35 | 0.24 |
| percent_data_used | own_data | feat_2 | room_23 | 0 days 02:00:00 | 0.8 | 1.0 | 0.49 | 0.1 | 0.43 | 0.61 |
| percent_data_used | own_data | feat_2 | room24 | 0 days 01:00:00 | 0.61 | 0.01 | 0.7 | 0.51 | 0.55 | 0.53 |
| percent_data_used | own_data | feat_2 | room24 | 0 days 02:00:00 | 0.04 | 0.33 | 0.48 | 0.74 | 0.25 | 0.66 |
| percent_data_used | common_data | feat_1 | room_23 | 0 days 01:00:00 | 0.55 | 0.33 | 0.71 | 0.26 | 0.58 | 0.98 |
| percent_data_used | common_data | feat_1 | room_23 | 0 days 02:00:00 | 0.26 | 0.85 | 0.36 | 0.23 | 0.58 | 0.26 |


- This example is an extreme case. In practice, one should strive to keep the depth of the row index small.
- The core takeaway is that all dimensions are relocated to the row index, keeping the column index single-level.
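
Relocating a dimension from the columns to the row index can be done with `DataFrame.stack`. The sketch below is an illustration under assumptions: the wide input table with a two-level column index, its placeholder values, and the names `wide` and `long` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical, reduced dimensions for illustration only.
models = ["TFT", "XGB"]
series = ["room_23", "room_24"]
offsets = pd.to_timedelta(["1h", "2h"])

# A table as it might arrive: two column levels (model and offset).
wide_cols = pd.MultiIndex.from_product([models, offsets], names=["model", "offset"])
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.random((len(series), len(wide_cols))).round(2),
    index=pd.Index(series, name="series_id"),
    columns=wide_cols,
)

# Move the "offset" level into the row index; only "model" remains in the columns.
long = wide.stack("offset")
print(long)
```

After stacking, the row index has the levels `series_id` and `offset`, and each model occupies exactly one column.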