This notebook covers the following topics:
1. Defining a time series forecasting `Task` consisting of multiple `EvaluationWindow`s
2. Multivariate and univariate forecasting
3. Evaluation on a `Benchmark` consisting of multiple tasks
4. Aggregating benchmark results

In [None]:
import warnings
from pathlib import Path

import datasets
import numpy as np
from tqdm.auto import tqdm

import fev

warnings.simplefilter("ignore")
datasets.disable_progress_bars()

## Main classes
The `fev` package provides 3 core classes for evaluating time series forecasting models:

1. **`Task`** - Defines a single forecasting task with dataset path, forecast horizon, and evaluation settings. Each `Task` contains one or more evaluation windows.

2. **`EvaluationWindow`** - Represents a single train/test split of the data at a specific cutoff point. Model performance is averaged across all windows within a `Task`.

3. **`Benchmark`** - A collection of multiple tasks (e.g., different datasets). Individual task results are aggregated to compute overall benchmark scores.

In short, the hierarchy is `Benchmark` -> `Task` -> `EvaluationWindow`.

This tutorial demonstrates the functionality of these classes.

### Data sources
Dataset stored on Hugging Face Hub: https://huggingface.co/datasets/autogluon/chronos_datasets

In [2]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="monash_cif_2016",
    horizon=12,
)

Dataset stored on S3

In [3]:
# Dataset consisting of a single parquet / arrow file
task = fev.Task(
    dataset_path="s3://autogluon/datasets/timeseries/m1_monthly/data.parquet",
    horizon=12,
)
# Dataset consisting of multiple parquet / arrow files
task = fev.Task(
    dataset_path="s3://autogluon/datasets/timeseries/m1_monthly/*.parquet",
    horizon=12,
)

Dataset stored locally

In [4]:
# Download dataset from HF Hub and save it locally
ds = datasets.load_dataset("autogluon/chronos_datasets", name="m4_hourly", split="train")
local_path = "/tmp/m4_hourly/data.parquet"
ds.to_parquet(local_path)

task = fev.Task(
    dataset_path=local_path,
    horizon=48,
)

### Evaluation windows
A single `Task` consists of one or more `EvaluationWindow`s. 

Each `EvaluationWindow` represents a single train/test split of the time series data at a specific cutoff point.

We'll create a task with a toy dataset to demonstrate how evaluation windows work.

In [5]:
import pandas as pd
# Create a toy dataset with a single time series
ts = {
    "id": "A",
    "timestamp": pd.date_range("2025-01-01", freq="D", periods=10),
    "target": list(range(10)),
}
ds = datasets.Dataset.from_list([ts])
dataset_path = "/tmp/toy_dataset.parquet"
ds.to_parquet(dataset_path);

We now construct a `Task` with 2 evaluation windows based on this toy dataset.

In [6]:
task = fev.Task(
    dataset_path=dataset_path,
    horizon=3,
    num_windows=2,
)

# Show the original dataset before any splits (for reference only)
full_dataset = task.load_full_dataset()
print(full_dataset)
print(full_dataset[0])

Dataset({
    features: ['id', 'timestamp', 'target'],
    num_rows: 1
})
{'id': np.str_('A'), 'timestamp': array(['2025-01-01T00:00:00.000000000', '2025-01-02T00:00:00.000000000',
       '2025-01-03T00:00:00.000000000', '2025-01-04T00:00:00.000000000',
       '2025-01-05T00:00:00.000000000', '2025-01-06T00:00:00.000000000',
       '2025-01-07T00:00:00.000000000', '2025-01-08T00:00:00.000000000',
       '2025-01-09T00:00:00.000000000', '2025-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]'), 'target': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}


Now let's examine how the data is split across the 2 evaluation windows:

In [7]:
# Show how data is split across the 2 evaluation windows
for window_index, window in enumerate(task.iter_windows()):
    past, future = window.get_input_data()
    ground_truth = window.get_ground_truth()
    print(f"Window {window_index} (cutoff={window.cutoff}):")
    print(f"  Past data:    {past[0]['target']}")
    print(f"  Ground truth: {ground_truth[0]['target']}")

Window 0 (cutoff=-6):
  Past data:    [0 1 2 3]
  Ground truth: [4 5 6]
Window 1 (cutoff=-3):
  Past data:    [0 1 2 3 4 5 6]
  Ground truth: [7 8 9]


### Customizing evaluation window parameters
You can control how evaluation windows are positioned using `initial_cutoff` and `window_step_size` parameters.

In [8]:
# Example 1: Start evaluation earlier with initial_cutoff
task = fev.Task(
    dataset_path=dataset_path,
    horizon=3,
    num_windows=2,
    initial_cutoff=-8,
)

for window_index, window in enumerate(task.iter_windows()):
    past, future = window.get_input_data()
    ground_truth = window.get_ground_truth()
    print(f"Window {window_index} (cutoff={window.cutoff}):")
    print(f"  Past data:    {past[0]['target']}")
    print(f"  Ground truth: {ground_truth[0]['target']}")

Window 0 (cutoff=-8):
  Past data:    [0 1]
  Ground truth: [2 3 4]
Window 1 (cutoff=-5):
  Past data:    [0 1 2 3 4]
  Ground truth: [5 6 7]


In [9]:
# Example 2: Use smaller window_step_size
task = fev.Task(
    dataset_path=dataset_path,
    horizon=3,
    num_windows=2,
    window_step_size=1,
)

for window_index, window in enumerate(task.iter_windows()):
    past, future = window.get_input_data()
    ground_truth = window.get_ground_truth()
    print(f"Window {window_index} (cutoff={window.cutoff}):")
    print(f"  Past data:    {past[0]['target']}")
    print(f"  Ground truth: {ground_truth[0]['target']}")

Window 0 (cutoff=-4):
  Past data:    [0 1 2 3 4 5]
  Ground truth: [6 7 8]
Window 1 (cutoff=-3):
  Past data:    [0 1 2 3 4 5 6]
  Ground truth: [7 8 9]
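
The integer cutoffs printed in the examples so far follow simple arithmetic. The sketch below is not `fev` code. It reproduces the splits with plain list slicing, assuming that `window_step_size` defaults to `horizon` and that `initial_cutoff` defaults to `-horizon - (num_windows - 1) * window_step_size`, which is consistent with the outputs shown above. The helper name `split_windows` is hypothetical.

```python
def split_windows(series, horizon, num_windows=1, initial_cutoff=None, window_step_size=None):
    """Return (cutoff, past, future) tuples for negative integer cutoffs.

    Assumed defaults (inferred from the printed outputs, not taken from fev's source):
    window_step_size = horizon, initial_cutoff = -horizon - (num_windows - 1) * step.
    """
    step = horizon if window_step_size is None else window_step_size
    if initial_cutoff is None:
        initial_cutoff = -horizon - (num_windows - 1) * step
    splits = []
    for i in range(num_windows):
        cutoff = initial_cutoff + i * step
        end = cutoff + horizon
        past = series[:cutoff]
        # When the window reaches the end of the series, slice to the end explicitly
        future = series[cutoff:end] if end < 0 else series[cutoff:]
        splits.append((cutoff, past, future))
    return splits


series = list(range(10))
for cutoff, past, future in split_windows(series, horizon=3, num_windows=2, window_step_size=1):
    print(f"cutoff={cutoff}: past={past}, ground truth={future}")
```

The sketch only handles negative integer cutoffs and does not validate that enough observations remain for each window.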


You can also set `initial_cutoff` to a pandas-compatible timestamp string and `window_step_size` to a pandas-compatible frequency string.

In [10]:
# Example 3: Use pandas timestamp-like strings
task = fev.Task(
    dataset_path=dataset_path,
    horizon=3,
    num_windows=2,
    initial_cutoff="2025-01-05",
    window_step_size="2D",
)

for window_index, window in enumerate(task.iter_windows()):
    past, future = window.get_input_data()
    ground_truth = window.get_ground_truth()
    print(f"Window {window_index} (cutoff={window.cutoff}):")
    print(f"  Past data:    {past[0]['target']}")
    print(f"  Ground truth: {ground_truth[0]['target']}")

Window 0 (cutoff=2025-01-05T00:00:00):
  Past data:    [0 1 2 3 4]
  Ground truth: [5 6 7]
Window 1 (cutoff=2025-01-07T00:00:00):
  Past data:    [0 1 2 3 4 5 6]
  Ground truth: [7 8 9]
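
As the output above suggests, with a timestamp cutoff the observation at the cutoff itself belongs to the past data. The following standard-library sketch mimics that behavior on the toy series. The helper `split_at` is hypothetical, and it assumes daily data with the step expressed as a `timedelta` rather than a pandas frequency string.

```python
from datetime import date, timedelta

# Toy series matching the dataset above: 10 daily observations with values 0..9
timestamps = [date(2025, 1, 1) + timedelta(days=i) for i in range(10)]
target = list(range(10))


def split_at(cutoff, horizon):
    """Observations with timestamp <= cutoff are past; the next `horizon` values are ground truth."""
    n_past = sum(ts <= cutoff for ts in timestamps)
    return target[:n_past], target[n_past : n_past + horizon]


initial_cutoff = date(2025, 1, 5)
step = timedelta(days=2)  # plays the role of window_step_size="2D"
windows = [split_at(initial_cutoff + i * step, horizon=3) for i in range(2)]
for past, ground_truth in windows:
    print(past, ground_truth)
```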


## Univariate forecasting

The simplest kind of forecasting task is univariate forecasting where the goal is to predict a single `target_column` for each time series in the dataset.



In [11]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="m4_hourly",
    horizon=24,
)

To evaluate a forecasting model on this task we need to make predictions for each `EvaluationWindow`.


### Predictions format
Predictions must follow a certain format that is specified by `task.predictions_schema`.

For point forecasting tasks (i.e., if `quantile_levels=None`), predictions must contain a single array of length `horizon` for each time series.

In [12]:
task.predictions_schema

{'predictions': Sequence(feature=Value(dtype='float64', id=None), length=24, id=None)}

Here is an example of a function that makes predictions for a single `EvaluationWindow` and formats them as a `datasets.Dataset`.

In [13]:
def naive_forecast(window: fev.EvaluationWindow) -> datasets.Dataset:
    predictions: list[dict[str, np.ndarray]] = []
    past_data, future_data = window.get_input_data()
    for ts in past_data:
        y = ts[window.target_columns_list[0]]
        predictions.append(
            {"predictions": np.array([y[-1] for _ in range(window.horizon)])}
        )
    return datasets.Dataset.from_list(predictions)

window = task.get_window(0)
predictions_per_window = naive_forecast(window)
predictions_per_window

Dataset({
    features: ['predictions'],
    num_rows: 414
})

Each entry in `predictions_per_window` is a dictionary where the key `"predictions"` corresponds to an array with `24` values.

In [14]:
print(predictions_per_window[0])

{'predictions': [701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0, 701.0]}


Once we have predictions for each evaluation window, we can compute the metrics and generate an evaluation summary.

In [15]:
predictions_per_window = [naive_forecast(window) for window in task.iter_windows()]
task.evaluation_summary(predictions_per_window, model_name="naive")

{'model_name': 'naive',
 'dataset_path': 'autogluon/chronos_datasets',
 'dataset_config': 'm4_hourly',
 'horizon': 24,
 'num_windows': 1,
 'initial_cutoff': -24,
 'window_step_size': 24,
 'min_context_length': 1,
 'max_context_length': None,
 'seasonality': 1,
 'eval_metric': 'MASE',
 'extra_metrics': [],
 'quantile_levels': None,
 'id_column': 'id',
 'timestamp_column': 'timestamp',
 'target_column': 'target',
 'generate_univariate_targets_from': None,
 'past_dynamic_columns': [],
 'excluded_columns': [],
 'task_name': 'm4_hourly',
 'test_error': 3.815112047601982,
 'training_time_s': None,
 'inference_time_s': None,
 'dataset_fingerprint': '19e36bb78b718d8d',
 'trained_on_this_dataset': False,
 'fev_version': '1.0.0',
 'MASE': 3.815112047601982}
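
To make the `MASE` value in the summary easier to interpret, here is a minimal single-series sketch of the textbook definition: the mean absolute forecast error divided by the in-sample mean absolute error of the seasonal naive forecast. The function `mase` is hypothetical, and how `fev` aggregates the metric across series and windows may differ from this sketch.

```python
def mase(past, future, forecast, seasonality=1):
    """Mean absolute scaled error for one series (textbook definition, not fev's code)."""
    # Out-of-sample mean absolute error of the forecast
    mae = sum(abs(y - f) for y, f in zip(future, forecast)) / len(future)
    # In-sample mean absolute error of the seasonal naive forecast
    in_sample_errors = [abs(past[t] - past[t - seasonality]) for t in range(seasonality, len(past))]
    scale = sum(in_sample_errors) / len(in_sample_errors)
    return mae / scale


past = [0, 1, 2, 3, 4, 5, 6]
future = [7, 8, 9]
naive = [past[-1]] * len(future)  # repeat the last observed value
print(mase(past, future, naive))
```

A value above 1 means the forecast is less accurate than the in-sample seasonal naive baseline used for scaling.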

### Probabilistic forecasting

For probabilistic forecasting tasks (i.e., if `quantile_levels` is provided), predictions must additionally contain a prediction for each quantile level.

In [16]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="m4_hourly",
    horizon=24,
    quantile_levels=[0.1, 0.5, 0.9],
    eval_metric="WQL",
)

In [17]:
task.predictions_schema

{'predictions': Sequence(feature=Value(dtype='float64', id=None), length=24, id=None),
 '0.1': Sequence(feature=Value(dtype='float64', id=None), length=24, id=None),
 '0.5': Sequence(feature=Value(dtype='float64', id=None), length=24, id=None),
 '0.9': Sequence(feature=Value(dtype='float64', id=None), length=24, id=None)}
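
A single prediction record for this schema therefore needs one list per quantile level in addition to `"predictions"`. The sketch below builds such a record in plain Python. It assumes that the quantile keys are the string forms of the levels, as shown in the schema, and uses a deliberately crude hypothetical heuristic (nearest-rank quantiles of past one-step changes added to the last value) purely to produce plausible numbers.

```python
def naive_quantile_record(history, horizon, quantile_levels):
    """Build one record with a point forecast and one forecast per quantile level."""
    last = float(history[-1])
    # One-step changes observed in the history, sorted for quantile lookup
    diffs = sorted(b - a for a, b in zip(history, history[1:]))
    record = {"predictions": [last] * horizon}
    for q in quantile_levels:
        # Nearest-rank empirical quantile of the one-step changes
        idx = round(q * (len(diffs) - 1))
        record[str(q)] = [last + diffs[idx]] * horizon
    return record


record = naive_quantile_record([1, 3, 2, 5, 4, 6], horizon=2, quantile_levels=[0.1, 0.5, 0.9])
print(list(record))
```

Records like this one could then be collected into a list and passed to `datasets.Dataset.from_list`, as in the point forecasting example above.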

## Covariates
By default, all columns of type `Sequence` other than the target are interpreted as known covariates (available in both `past_data` and `future_data`), and all remaining columns are interpreted as static covariates.

In [18]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
)
past_data, future_data = task.get_window(0).get_input_data()
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT'],
    num_rows: 2
})
Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL'],
    num_rows: 2
})


We can configure how the covariates are used as part of the task definition.

For example, here we specify that:
- columns `HUFL` and `HULL` are known only in the past
- columns `MUFL` and `MULL` are excluded from the dataset

In [19]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
    past_dynamic_columns=["HUFL", "HULL"],
    excluded_columns=["MUFL", "MULL"],
)

past_data, future_data = task.get_window(0).get_input_data()
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'LUFL', 'LULL', 'OT'],
    num_rows: 2
})
Dataset({
    features: ['id', 'timestamp', 'LUFL', 'LULL'],
    num_rows: 2
})
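
The column lists printed above follow a simple rule, which can be sketched in plain Python. The helper `split_columns` is hypothetical and only models which columns end up in the past and future datasets. It does not touch `fev` itself.

```python
def split_columns(all_columns, target_column, past_dynamic_columns=(), excluded_columns=()):
    """Return (past_columns, future_columns) under the rules illustrated above."""
    targets = [target_column] if isinstance(target_column, str) else list(target_column)
    # Excluded columns are dropped everywhere
    past = [c for c in all_columns if c not in excluded_columns]
    # Targets and past-only covariates are not available in the future
    future = [c for c in past if c not in targets and c not in past_dynamic_columns]
    return past, future


columns = ["id", "timestamp", "HUFL", "HULL", "MUFL", "MULL", "LUFL", "LULL", "OT"]
past_cols, future_cols = split_columns(
    columns,
    target_column="OT",
    past_dynamic_columns=["HUFL", "HULL"],
    excluded_columns=["MUFL", "MULL"],
)
print(past_cols)
print(future_cols)
```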


## Multivariate forecasting
In all previous examples we considered univariate forecasting tasks, where the goal was to predict a single `target_column` into the future. 

`fev` also supports multivariate tasks, where the goal is to simultaneously predict multiple target columns. 

### "Real" multivariate tasks
We can define multivariate forecasting tasks by setting the `target_column` attribute to a `list` of column names.


In [20]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=3,
    target_column=["OT", "LUFL", "LULL"],
)

The input data created by the task in this case is identical to what would happen if we used `["OT", "LUFL", "LULL"]` as `past_dynamic_columns`.
That is, the target columns `["OT", "LUFL", "LULL"]` are available in `past_data` but not in `future_data`.

In [21]:
past_data, future_data = task.get_window(0).get_input_data()
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT'],
    num_rows: 2
})
Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL'],
    num_rows: 2
})


The only difference in a multivariate task is that the predictions must be formatted as a `datasets.DatasetDict` where
- each key corresponds to the name of the target column
- each value is a `datasets.Dataset` containing the predictions for this column in a format compatible with `task.predictions_schema`

In [22]:
def naive_forecast_multivariate(window: fev.EvaluationWindow) -> datasets.DatasetDict:
    """Predicts the last observed value in each multivariate column."""
    past_data, future_data = window.get_input_data()
    predictions = datasets.DatasetDict()
    for col in window.target_columns_list:
        predictions_for_column = []
        for ts in past_data:
            predictions_for_column.append({"predictions": [ts[col][-1] for _ in range(window.horizon)]})
        predictions[col] = datasets.Dataset.from_list(predictions_for_column)
    return predictions

In [23]:
window = task.get_window(0)
predictions_per_window = naive_forecast_multivariate(window).cast(task.predictions_schema)
predictions_per_window

DatasetDict({
    LUFL: Dataset({
        features: ['predictions'],
        num_rows: 2
    })
    LULL: Dataset({
        features: ['predictions'],
        num_rows: 2
    })
    OT: Dataset({
        features: ['predictions'],
        num_rows: 2
    })
})

We can also look at the individual values in the `Dataset` objects:

In [24]:
for col in task.target_columns_list:
    print(f"Predictions for column '{col}'")
    print(f"\t{predictions_per_window[col].to_list()}")

Predictions for column 'LUFL'
	[{'predictions': [3.5329999923706055, 3.5329999923706055, 3.5329999923706055]}, {'predictions': [-10.331000328063965, -10.331000328063965, -10.331000328063965]}]
Predictions for column 'LULL'
	[{'predictions': [1.6749999523162842, 1.6749999523162842, 1.6749999523162842]}, {'predictions': [-1.2899999618530273, -1.2899999618530273, -1.2899999618530273]}]
Predictions for column 'OT'
	[{'predictions': [11.043999671936035, 11.043999671936035, 11.043999671936035]}, {'predictions': [48.18349838256836, 48.18349838256836, 48.18349838256836]}]


The rest of the code can stay the same.

In [25]:
task.evaluation_summary([predictions_per_window], model_name="naive")

{'model_name': 'naive',
 'dataset_path': 'autogluon/chronos_datasets_extra',
 'dataset_config': 'ETTh',
 'horizon': 3,
 'num_windows': 1,
 'initial_cutoff': -3,
 'window_step_size': 3,
 'min_context_length': 1,
 'max_context_length': None,
 'seasonality': 1,
 'eval_metric': 'MASE',
 'extra_metrics': [],
 'quantile_levels': None,
 'id_column': 'id',
 'timestamp_column': 'timestamp',
 'target_column': ['LUFL', 'LULL', 'OT'],
 'generate_univariate_targets_from': None,
 'past_dynamic_columns': [],
 'excluded_columns': [],
 'task_name': 'ETTh',
 'test_error': 1.1921320279836811,
 'training_time_s': None,
 'inference_time_s': None,
 'dataset_fingerprint': '1051fcbf7ab489b5',
 'trained_on_this_dataset': False,
 'fev_version': '1.0.0',
 'MASE': 1.1921320279836811}

### Converting multivariate tasks into univariate tasks
Alternatively, we can convert a multivariate task into a univariate one by creating multiple univariate time series from each multivariate time series.

The original `ETTh` dataset contains two multivariate time series with the following ids:

In [26]:
past_data["id"]

array(['ETTh1', 'ETTh2'], dtype='<U5')

If we set `generate_univariate_targets_from=["OT", "LUFL", "LULL"]`, `fev` will create 3 univariate time series from each time series in the original dataset.

In [27]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=3,
    generate_univariate_targets_from=["OT", "LUFL", "LULL"],
)

In [28]:
past_data, future_data = task.get_window(0).get_input_data()
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'target'],
    num_rows: 6
})
Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL'],
    num_rows: 6
})


The new dataset contains 6 items (2 original ids $\times$ 3 target columns).

In [29]:
past_data["id"]

array(['ETTh1_LUFL', 'ETTh1_LULL', 'ETTh1_OT', 'ETTh2_LUFL', 'ETTh2_LULL',
       'ETTh2_OT'], dtype='<U10')
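
Conceptually, the conversion copies each multivariate row once per target column. The sketch below reproduces the id naming seen above on made-up toy values. The helper `to_univariate` is hypothetical, and the `<id>_<column>` naming and the alphabetical column order are inferred from the printed ids rather than from `fev`'s source.

```python
def to_univariate(rows, target_columns, id_column="id"):
    """Expand each multivariate row into one univariate row per target column."""
    out = []
    for row in rows:
        for col in sorted(target_columns):
            # Keep covariates, drop the original target columns
            new_row = {k: v for k, v in row.items() if k not in target_columns}
            new_row[id_column] = f"{row[id_column]}_{col}"
            new_row["target"] = row[col]
            out.append(new_row)
    return out


# Toy values, not taken from the ETTh dataset
rows = [
    {"id": "ETTh1", "HUFL": [0.1, 0.2], "OT": [1.0, 2.0], "LUFL": [3.0, 4.0], "LULL": [5.0, 6.0]},
    {"id": "ETTh2", "HUFL": [0.3, 0.4], "OT": [7.0, 8.0], "LUFL": [9.0, 10.0], "LULL": [11.0, 12.0]},
]
univariate_rows = to_univariate(rows, ["OT", "LUFL", "LULL"])
print([r["id"] for r in univariate_rows])
```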

We can confirm that the naive forecast achieves the same MASE score on this equivalent representation of the multivariate task.

In [30]:
def naive_forecast_univariate(window: fev.EvaluationWindow) -> datasets.Dataset:
    """Predicts the last observed value."""
    past_data, future_data = window.get_input_data()
    predictions = []
    for ts in past_data:
        predictions.append({"predictions": [ts[window.target_columns_list[0]][-1] for _ in range(window.horizon)]})
    return datasets.Dataset.from_list(predictions)

In [31]:
predictions_per_window = []
for window in task.iter_windows():
    predictions_per_window.append(naive_forecast_univariate(window))
task.evaluation_summary(predictions_per_window, model_name="naive")

{'model_name': 'naive',
 'dataset_path': 'autogluon/chronos_datasets_extra',
 'dataset_config': 'ETTh',
 'horizon': 3,
 'num_windows': 1,
 'initial_cutoff': -3,
 'window_step_size': 3,
 'min_context_length': 1,
 'max_context_length': None,
 'seasonality': 1,
 'eval_metric': 'MASE',
 'extra_metrics': [],
 'quantile_levels': None,
 'id_column': 'id',
 'timestamp_column': 'timestamp',
 'target_column': 'target',
 'generate_univariate_targets_from': ['OT', 'LUFL', 'LULL'],
 'past_dynamic_columns': [],
 'excluded_columns': [],
 'task_name': 'ETTh',
 'test_error': 1.1921320279836811,
 'training_time_s': None,
 'inference_time_s': None,
 'dataset_fingerprint': '01c8288f51e0dc88',
 'trained_on_this_dataset': False,
 'fev_version': '1.0.0',
 'MASE': 1.1921320279836811}

## Evaluation on a Benchmark consisting of multiple tasks
A `fev.Benchmark` object is essentially a collection of `Task`s.

We can create a benchmark from a list of dictionaries. Each dictionary is interpreted as a `fev.TaskGenerator`.

In [32]:
tasks_configs = [
    {
        "dataset_path": "autogluon/chronos_datasets",
        "dataset_config": "monash_m3_monthly",
        "horizon": 18,
        "seasonality": 12,
        "eval_metric": "MASE",
        "num_windows": 2,
    },
    {
        "dataset_path": "autogluon/chronos_datasets",
        "dataset_config": "monash_electricity_weekly",
        "horizon": 8,
    },
]
benchmark = fev.Benchmark.from_list(tasks_configs)

Alternatively, we can load a benchmark from a YAML file:

In [33]:
benchmark_path = Path(fev.__file__).parents[2] / "benchmarks" / "example" / "tasks.yaml"
# Show contents of the benchmark YAML file
!cat {benchmark_path}

tasks:
- dataset_path: autogluon/chronos_datasets
  dataset_config: monash_m3_monthly
  horizon: 18
  seasonality: 12
  num_windows: 2
- dataset_path: autogluon/chronos_datasets
  dataset_config: monash_electricity_weekly
  horizon: 8


In [34]:
benchmark = fev.Benchmark.from_yaml(benchmark_path)

In [35]:
benchmark.tasks

[Task(dataset_path='autogluon/chronos_datasets', dataset_config='monash_m3_monthly', horizon=18, num_windows=2, initial_cutoff=-36, window_step_size=18, min_context_length=1, max_context_length=None, seasonality=12, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[], task_name='monash_m3_monthly'),
 Task(dataset_path='autogluon/chronos_datasets', dataset_config='monash_electricity_weekly', horizon=8, num_windows=1, initial_cutoff=-8, window_step_size=8, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[], task_name='monash_electricity_weekly')]

Now let's evaluate some simple forecasting models on this toy benchmark.

In [36]:
!pip install -q statsforecast "numpy<=2.2" "scipy<1.16"

In [37]:
from statsforecast.models import ARIMA, SeasonalNaive, Theta


def predict_with_model(task: fev.Task, model_name: str = "seasonal_naive") -> list[datasets.Dataset]:
    if model_name == "seasonal_naive":
        model = SeasonalNaive(season_length=task.seasonality)
    elif model_name == "theta":
        model = Theta(season_length=task.seasonality)
    elif model_name == "arima":
        model = ARIMA(season_length=task.seasonality)
    else:
        raise ValueError(f"Unknown model_name: {model_name}")

    predictions_per_window = []
    for window in task.iter_windows():
        past_data, future_data = window.get_input_data()
        predictions = [
            {"predictions": model.forecast(y=ts[task.target_column], h=task.horizon)["mean"]}
            for ts in past_data
        ]
        predictions_per_window.append(datasets.Dataset.from_list(predictions))
    return predictions_per_window

In [38]:
import time

summaries = []
for task in tqdm(benchmark.tasks, desc="Tasks completed"):
    for model_name in ["seasonal_naive", "arima", "theta"]:
        start_time = time.time()
        predictions_per_window = predict_with_model(task, model_name=model_name)
        infer_time_s = time.time() - start_time
        eval_summary = task.evaluation_summary(
            predictions_per_window,
            model_name=model_name,
            inference_time_s=infer_time_s,
            training_time_s=0.0,
        )

        summaries.append(eval_summary)

Tasks completed: 100%|██████████| 2/2 [00:20<00:00, 10.04s/it]


In [39]:
fev.leaderboard(summaries, baseline_model="seasonal_naive")

| model_name | gmean_relative_error | avg_rank | avg_inference_time_s | median_inference_time_s | avg_training_time_s | median_training_time_s | training_corpus_overlap | num_failures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| theta | 0.87858 | 2.0 | 3.181853 | 3.181853 | 0.0 | 0.0 | 0.0 | 0 |
| seasonal_naive | 1.0 | 2.0 | 2.423685 | 2.423685 | 0.0 | 0.0 | 0.0 | 0 |
| arima | 1.361267 | 2.0 | 2.748509 | 2.748509 | 0.0 | 0.0 | 0.0 | 0 |


The `leaderboard` function aggregates the performance of each model across all tasks into summary scores such as `gmean_relative_error` and `avg_rank`.

We can investigate the performance on individual tasks using the `pivot_table` function:

In [40]:
fev.pivot_table(summaries, task_columns=["dataset_config"])

| dataset_config | arima | seasonal_naive | theta |
| --- | --- | --- | --- |
| monash_electricity_weekly | 3.025121 | 3.037166 | 3.097938 |
| monash_m3_monthly | 2.130589 | 1.145216 | 0.866654 |
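
The `gmean_relative_error` and `avg_rank` columns of the leaderboard can be recomputed by hand from the per-task errors in the pivot table. The sketch below does this with the standard library only. It assumes that the relative error is each model's error divided by the baseline's error on the same task and that ties in ranks do not occur. The real `fev.leaderboard` may apply additional steps that this sketch does not model.

```python
import math

# Per-task errors copied from the pivot table above
errors = {
    "monash_electricity_weekly": {"arima": 3.025121, "seasonal_naive": 3.037166, "theta": 3.097938},
    "monash_m3_monthly": {"arima": 2.130589, "seasonal_naive": 1.145216, "theta": 0.866654},
}
baseline = "seasonal_naive"
models = sorted(next(iter(errors.values())))


def gmean(values):
    """Geometric mean computed via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ranks(task_errors):
    """Rank models on one task, 1 = lowest error (ties are not handled)."""
    ordered = sorted(task_errors, key=task_errors.get)
    return {model: position + 1 for position, model in enumerate(ordered)}


gmean_relative_error = {
    m: gmean([task[m] / task[baseline] for task in errors.values()]) for m in models
}
avg_rank = {m: sum(ranks(task)[m] for task in errors.values()) / len(errors) for m in models}

for m in models:
    print(f"{m}: gmean_relative_error={gmean_relative_error[m]:.5f}, avg_rank={avg_rank[m]}")
```

Each model ranks first on one task and third on the other, or second on both, which is why all three share an average rank of 2.0 despite very different relative errors.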


Both `leaderboard()` and `pivot_table()` accept one or more evaluation summaries in any of the following formats:
- `pandas.DataFrame`
- list of dictionaries
- paths to JSONL (`orient="records"`) or CSV files
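
For the JSONL option, each line of the file holds one evaluation summary as a JSON object. A minimal standard-library sketch of writing and reading such a file is shown below. The summaries are abridged to three fields for brevity (real summaries returned by `task.evaluation_summary` contain many more), and the file name is arbitrary.

```python
import json
import os
import tempfile

# Abridged summaries using values from the outputs earlier in this notebook
summaries = [
    {"model_name": "naive", "task_name": "m4_hourly", "test_error": 3.815112047601982},
    {"model_name": "naive", "task_name": "ETTh", "test_error": 1.1921320279836811},
]

path = os.path.join(tempfile.mkdtemp(), "naive.jsonl")

# Write one JSON object per line
with open(path, "w") as f:
    for summary in summaries:
        f.write(json.dumps(summary) + "\n")

# Read the file back into a list of dictionaries
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "summaries loaded from", os.path.basename(path))
```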

Here is an example of how we can work with URLs of CSV files:

In [41]:
summaries = [
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/chronos_zeroshot/results/auto_arima.csv",
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/chronos_zeroshot/results/auto_theta.csv",
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/chronos_zeroshot/results/seasonal_naive.csv",
]
fev.leaderboard(summaries, metric_column="MASE")

| model_name | gmean_relative_error | avg_rank | avg_inference_time_s | median_inference_time_s | avg_training_time_s | median_training_time_s | training_corpus_overlap | num_failures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| auto_theta | 0.858722 | 1.703704 | 286.465526 | 23.892088 |  |  | 0.0 | 0 |
| auto_arima | 0.869449 | 1.703704 | 1674.733082 | 75.8837 |  |  | 0.0 | 0 |
| seasonal_naive | 1.0 | 2.592593 | 2.41595 | 0.096449 |  |  | 0.0 | 0 |


In [42]:
fev.pivot_table(summaries, task_columns=["dataset_config"], metric_column="WQL")

| dataset_config | auto_arima | auto_theta | seasonal_naive |
| --- | --- | --- | --- |
| ETTh | 0.089012 | 0.132979 | 0.12209 |
| ETTm | 0.10499 | 0.078587 | 0.141348 |
| dominick | 0.484773 | 0.485493 | 0.452916 |
| ercot | 0.041214 | 0.041004 | 0.036604 |
| exchange_rate | 0.010667 | 0.009714 | 0.012984 |
| m4_quarterly | 0.079384 | 0.079077 | 0.118648 |
| m4_yearly | 0.125041 | 0.11464 | 0.161439 |
| m5 | 0.61652 | 0.636228 | 1.024088 |
| monash_australian_electricity | 0.066902 | 0.054564 | 0.083695 |
| monash_car_parts | 1.333026 | 1.336601 | 1.599952 |
