# Many Models Forecasting Demo

This notebook demonstrates how to conduct fine-grained model selection after running the `mmf_sa.run_forecast` function. Before proceeding, ensure you have run the notebooks in [`/examples/monthly`](https://github.com/databricks-industry-solutions/many-model-forecasting/tree/main/examples/monthly). You can run this notebook on serverless compute.

In [0]:
catalog = "mmf"  # Name of the catalog we use to manage our assets
db = "m4"        # Name of the schema we use to manage our assets (e.g. datasets)

The `scoring_output` table stores the forecasts from every model for each time series. Let's filter by a specific time series (e.g., `M1`) and examine the forecasts from all models.

In [0]:
scoring_output = spark.sql(f"""
    SELECT model, unique_id, date, y FROM {catalog}.{db}.monthly_scoring_output
    WHERE unique_id='M1' ORDER BY model
    """)

display(scoring_output)

This table contains forecasts from 44 different models, but we need to determine which one is best for making business decisions. This is where the `evaluation_output` table becomes useful. Let's filter by a specific time series (e.g., `M1`) and review the evaluation results (i.e., backtesting trials) from all models.

In [0]:
evaluation_output = spark.sql(f"""
    SELECT model, unique_id, backtest_window_start_date, metric_name, metric_value, forecast, actual
    FROM {catalog}.{db}.monthly_evaluation_output WHERE unique_id='M1'
    ORDER BY model, backtest_window_start_date
    """)

display(evaluation_output)

The backtesting configuration, defined by the parameters `backtest_length`, `prediction_length`, and `stride` in the `mmf_sa.run_forecast` function, yields results from 10 backtesting trials for each model. For each trial, both the forecasts and the actual values are stored, so you can compute your own evaluation metrics from the residuals. The table also includes a built-in metric for quick assessment, which is chosen with the `metric` parameter. Here the metric is `smape`; `mae`, `mse`, `rmse`, `mape`, and `smape` are currently supported.
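Because each trial keeps both `forecast` and `actual`, a custom metric can be computed directly from those two columns. The following is a minimal sketch in plain Python, assuming each column holds an equal-length sequence of numbers for one trial. The helper names `mae` and `smape` are illustrative and not part of `mmf_sa`, and the sMAPE variant shown (the mean of 2|f - a| / (|f| + |a|)) is one common definition that may differ from the library's built-in metric.

```python
from typing import Sequence


def _check(forecast: Sequence[float], actual: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing a metric."""
    if len(forecast) == 0 or len(forecast) != len(actual):
        raise ValueError("forecast and actual must be non-empty and of equal length")


def mae(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error over one backtesting trial."""
    _check(forecast, actual)
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def smape(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Symmetric MAPE: mean of 2|f - a| / (|f| + |a|); a 0/0 term counts as 0."""
    _check(forecast, actual)
    total = 0.0
    for f, a in zip(forecast, actual):
        denom = abs(f) + abs(a)
        if denom > 0:
            total += 2.0 * abs(f - a) / denom
    return total / len(actual)
```

One way to apply helpers like these would be to bring the rows for a few series to the driver with `toPandas()` and call them per row, or to wrap them in a Spark UDF for larger tables.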

We compute the mean `smape` across the 10 backtesting trials for each model and each time series. For each time series, the model with the lowest mean `smape` is selected, and its forecast is retrieved from the `scoring_output` table. The SQL query below performs this selection.

In [0]:
forecast_best_model = spark.sql(f"""
    SELECT eval.unique_id, eval.model, eval.average_smape, score.date, score.y
    FROM 
    (
      SELECT unique_id, model, average_smape,
      RANK() OVER (PARTITION BY unique_id ORDER BY average_smape ASC) AS rank
      FROM (
        SELECT unique_id, model, AVG(metric_value) AS average_smape
        FROM {catalog}.{db}.monthly_evaluation_output
        GROUP BY unique_id, model
      )
    ) AS eval
    INNER JOIN {catalog}.{db}.monthly_scoring_output AS score 
      ON eval.unique_id=score.unique_id AND eval.model=score.model
    WHERE eval.rank=1
    ORDER BY eval.unique_id
    """)

display(forecast_best_model)

These forecasts will be used to guide our business decisions. Let's count how many times each model was the best-performing one.

In [0]:
model_ranking = spark.sql(f"""
    SELECT model, COUNT(*) AS count
    FROM (
      SELECT unique_id, model, average_smape,
      RANK() OVER (PARTITION BY unique_id ORDER BY average_smape ASC) AS rank
      FROM (
        SELECT unique_id, model, AVG(metric_value) AS average_smape
        FROM {catalog}.{db}.monthly_evaluation_output
        GROUP BY unique_id, model
      )
    ) WHERE rank=1 GROUP BY model 
    ORDER BY count DESC
    """)

display(model_ranking)


On this dataset (M4 monthly), the TimesFM models performed best, as measured by the number of time series on which each model ranked first.

Exposing the `evaluation_output` and `scoring_output` tables in these formats gives you considerable flexibility in model selection. For example, you can define your own evaluation metric to compare forecasting accuracy, or aggregate metrics across backtesting trials with a weighted average or the median instead of the mean. You can also retrieve forecasts from multiple models for each time series and ensemble them. All of these options only require writing queries against these tables.
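As a rough illustration of two of these options, the sketch below works on a few in-memory rows rather than on the Spark tables. It ranks models per series by the median metric across trials and then averages the forecasts of the top two models into a simple ensemble. The row layout, the toy values, the model names, and the helpers `rank_models` and `ensemble_forecast` are all assumptions made for illustration; the same logic could equally be written as SQL against the tables.

```python
from collections import defaultdict
from statistics import median

# Toy stand-ins for evaluation rows: (unique_id, model, metric_value), one per trial.
eval_rows = [
    ("M1", "ModelA", 0.10), ("M1", "ModelA", 0.90), ("M1", "ModelA", 0.12),
    ("M1", "ModelB", 0.30), ("M1", "ModelB", 0.32), ("M1", "ModelB", 0.31),
    ("M1", "ModelC", 0.20), ("M1", "ModelC", 0.22), ("M1", "ModelC", 0.21),
]

# Toy stand-ins for scoring rows: (unique_id, model) -> forecast values.
forecasts = {
    ("M1", "ModelA"): [10.0, 12.0],
    ("M1", "ModelB"): [14.0, 16.0],
    ("M1", "ModelC"): [12.0, 14.0],
}


def rank_models(rows, agg=median):
    """Return {unique_id: [model, ...]} ordered by aggregated metric, best first."""
    trials = defaultdict(list)
    for unique_id, model, value in rows:
        trials[(unique_id, model)].append(value)
    scored = defaultdict(list)
    for (unique_id, model), values in trials.items():
        scored[unique_id].append((agg(values), model))
    return {uid: [m for _, m in sorted(pairs)] for uid, pairs in scored.items()}


def ensemble_forecast(unique_id, models, forecasts):
    """Average the forecasts of the given models point by point."""
    series = [forecasts[(unique_id, m)] for m in models]
    return [sum(step) / len(step) for step in zip(*series)]


ranking = rank_models(eval_rows)          # median across trials
best_two = ranking["M1"][:2]              # top two models for series M1
combined = ensemble_forecast("M1", best_two, forecasts)
```

With these toy values, the median ranks `ModelA` first because it discounts that model's single poor trial, whereas the mean would rank `ModelC` first. The choice of aggregate can therefore change which model is selected.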