# Many Models Forecasting Demo

This notebook showcases how to run MMF on serverless compute using foundation models. We will use [M4 competition](https://www.sciencedirect.com/science/article/pii/S0169207019301128#sec5) data. The descriptions here are mostly the same as in the [daily resolution](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/daily/global_daily.ipynb) example, so we skip the redundant parts and focus only on the essentials.

### Serverless Compute setup

Attach this notebook to [serverless compute](https://docs.databricks.com/aws/en/compute/serverless/notebooks). Then open the [Configuration](https://docs.databricks.com/aws/en/compute/serverless/dependencies) tab. Choose `A10` (GPU) in the [Accelerator](https://docs.databricks.com/aws/en/compute/serverless/gpu) section and set the [Environment version](https://docs.databricks.com/aws/en/compute/serverless/dependencies#-select-an-environment-version) to 3. In the Dependencies section, [add the path](https://docs.databricks.com/aws/en/compute/serverless/dependencies#create-common-utilities-to-share-across-your-workspace) to your `many-model-forecasting` directory, e.g., `/Workspace/Users/ryuta.yoshimatsu@databricks.com/many-model-forecasting`. This is required to use the MMF functions within the notebooks.

### Install and import packages

In [0]:
%pip install -r ../../requirements.txt --quiet
%pip install --force-reinstall --no-cache-dir --index-url https://download.pytorch.org/whl/cu121 torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121 --quiet
%pip install chronos-forecasting==1.4.1 --quiet
%pip install uni2ts==1.2.0 --quiet
dbutils.library.restartPython()

In [0]:
import logging
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)
logging.getLogger("py4j.clientserver").setLevel(logging.ERROR)

In [0]:
import pathlib
import pandas as pd
from datasetsforecast.m4 import M4

### Prepare data
We use the [`datasetsforecast`](https://github.com/Nixtla/datasetsforecast/tree/main/) package to download the M4 data.

In [0]:
# Number of time series
n = 100


def create_m4_monthly():
    y_df, _, _ = M4.load(directory=str(pathlib.Path.home()), group="Monthly")
    _ids = [f"M{i}" for i in range(1, n + 1)]
    y_df = (
        y_df.groupby("unique_id")
        .filter(lambda x: x.unique_id.iloc[0] in _ids)
        .groupby("unique_id")
        .apply(transform_group)
        .reset_index(drop=True)
    )
    return y_df


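def transform_group(df):
    # Replace the original time index of one series with synthetic month-end dates.
    unique_id = df.unique_id.iloc[0]
    _cnt = 60  # number of observations to keep per series
    _start = pd.Timestamp("2018-01-01")
    _end = _start + pd.DateOffset(months=_cnt)
    # freq="M" yields month-end dates: 2018-01-31 through 2022-12-31 (_cnt dates)
    date_idx = pd.date_range(start=_start, end=_end, freq="M", name="date")
    _df = (
        pd.DataFrame(data=[], index=date_idx)
        .reset_index()
        .rename(columns={"index": "date"})
    )
    _df["unique_id"] = unique_id
    _df["y"] = df[:_cnt].y.values  # keep the first _cnt observations
    return _df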


We are going to save this data in a Delta Lake table. Provide the names of the catalog and the schema (database) where you want to store the data.

In [0]:
catalog = "mmf" # Name of the catalog we use to manage our assets
db = "m4" # Name of the schema we use to manage our assets (e.g. datasets)
user = spark.sql('select current_user() as user').collect()[0]['user'] # User email address

In [0]:
# Making sure that the catalog and the schema exist
# (uncomment this cell on the first run to create them and write the training table)
#_ = spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
#_ = spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{db}")

#(
#    spark.createDataFrame(create_m4_monthly())
#    .write.format("delta").mode("overwrite")
#    .saveAsTable(f"{catalog}.{db}.serverless_train")
#)

Let's take a peek at the dataset:

In [0]:
display(spark.sql(f"select unique_id, count(date) as count from {catalog}.{db}.serverless_train group by unique_id order by unique_id"))

In [0]:
display(
  spark.sql(f"select * from {catalog}.{db}.serverless_train where unique_id in ('M1', 'M2', 'M3', 'M4', 'M5') order by unique_id, date")
  )

Note that monthly forecasting requires the timestamp column to represent the last day of each month.
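The month-end requirement above can be checked, or enforced, before the data is written to a table. The following is a minimal sketch that uses only the standard library; the helper names `month_end` and `is_month_end` are hypothetical and not part of MMF.

```python
import calendar
from datetime import date


def month_end(d: date) -> date:
    """Return the last calendar day of the month that contains `d`."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=last_day)


def is_month_end(d: date) -> bool:
    """True if `d` already falls on the last day of its month."""
    return d == month_end(d)


# Dates stamped on the first of the month are snapped to the month end;
# a date that is already a month end (leap day 2020-02-29) is left unchanged.
raw_dates = [date(2018, 1, 1), date(2018, 2, 1), date(2020, 2, 29)]
fixed_dates = [month_end(d) for d in raw_dates]
```

In pandas, adding `pd.offsets.MonthEnd(0)` to a datetime column has the same effect.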

### Models
Let's configure the list of models to apply to our time series for evaluation and forecasting. The models come from the [chronos](https://pypi.org/project/chronos-forecasting/) and [uni2ts](https://pypi.org/project/uni2ts/) packages installed above. Check their documentation for a detailed description of each model.

**TimesFM is currently not supported for serverless compute in MMF.**

In [0]:
active_models = [
    #"ChronosT5Tiny",
    #"ChronosT5Mini",
    #"ChronosT5Small",
    #"ChronosT5Base",
    #"ChronosT5Large",
    #"ChronosBoltTiny",
    #"ChronosBoltMini",
    #"ChronosBoltSmall",
    #"ChronosBoltBase",
    "MoiraiSmall",
    #"MoiraiBase",
    #"MoiraiLarge",
    #"MoiraiMoESmall",
    #"MoiraiMoEBase",
]

### Run MMF

Now we can run the evaluation and forecasting using the `run_forecast` function defined in [mmf_sa/models/__init__.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/mmf_sa/models/__init__.py). Refer to [README.md](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/README.md#parameters-description) for a comprehensive description of each parameter.

In [0]:
import os

cache_dir = f"/Volumes/{catalog}/{db}/hf_cache"
os.makedirs(cache_dir, exist_ok=True)
os.environ["HF_HOME"] = cache_dir  # low-level hub
os.environ["HUGGINGFACE_HUB_CACHE"] = cache_dir  # huggingface_hub >=0.20
os.environ["TRANSFORMERS_CACHE"] = cache_dir  # transformers fallback
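
The cell above points the Hugging Face cache variables at a Unity Catalog volume, presumably so that downloaded model weights land in a persistent location. As a sketch, the same three assignments can be wrapped in a small helper; `configure_hf_cache` is a hypothetical name, and a temporary directory stands in for the volume path so the example runs anywhere. These variables are generally read when the Hugging Face libraries are imported, so it is safest to set them before that import happens.

```python
import os
import tempfile


def configure_hf_cache(cache_dir: str) -> dict:
    """Create `cache_dir` and point the Hugging Face cache variables at it."""
    os.makedirs(cache_dir, exist_ok=True)
    names = ("HF_HOME", "HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE")
    for name in names:
        os.environ[name] = cache_dir
    # Return what was set so the caller can log or inspect it.
    return {name: os.environ[name] for name in names}


# Stand-in for f"/Volumes/{catalog}/{db}/hf_cache" (assumption for this sketch).
demo_dir = os.path.join(tempfile.mkdtemp(), "hf_cache")
settings = configure_hf_cache(demo_dir)
```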

In [0]:
from mmf_sa import run_forecast

run_forecast(
    spark=spark,
    train_data=f"{catalog}.{db}.serverless_train",
    scoring_data=f"{catalog}.{db}.serverless_train",
    scoring_output=f"{catalog}.{db}.serverless_scoring_output",
    evaluation_output=f"{catalog}.{db}.serverless_evaluation_output",
    model_output=f"{catalog}.{db}",
    group_id="unique_id",
    date_col="date",
    target="y",
    freq="M",
    prediction_length=3,
    backtest_length=12,
    stride=1,
    metric="smape",
    train_predict_ratio=1,
    data_quality_check=True,
    resample=False,
    active_models=active_models,
    experiment_path=f"/Users/{user}/mmf/serverless",
    use_case_name="serverless",
    accelerator="gpu",
)
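
The call above passes `metric="smape"`, the symmetric mean absolute percentage error. The sketch below implements one common definition of sMAPE in plain Python; it is meant to illustrate the metric and is not necessarily the exact formula or zero-handling convention that MMF uses internally. The input values are toy numbers, not results from this notebook.

```python
def smape(actuals, forecasts):
    """Symmetric MAPE: mean of 2*|f - a| / (|a| + |f|), in the range [0, 2]."""
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actuals, forecasts):
        denom = abs(a) + abs(f)
        # Convention chosen for this sketch: a 0/0 term counts as zero error.
        total += 0.0 if denom == 0 else 2.0 * abs(f - a) / denom
    return total / len(actuals)


# Toy values for illustration only.
perfect = smape([100.0, 200.0, 300.0], [100.0, 200.0, 300.0])
off_by_half = smape([100.0], [50.0])  # 2 * 50 / 150
```

Lower values are better, and a perfect forecast scores 0.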

### Evaluate
The `evaluation_output` table stores the evaluation results for all backtesting trials from all models. This information shows which models performed well on which time series and in which backtesting periods, which is important for selecting the final model for forecasting or the models for ensembling. The quickest way to get a feel for it is to look at the table:

In [0]:
display(spark.sql(f"""
    select * from {catalog}.{db}.serverless_evaluation_output 
    where unique_id = 'M1'
    order by unique_id, model, backtest_window_start_date
    """))
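
Each row in this table corresponds to one backtesting trial. The sketch below shows how such trials could be laid out under a rolling-origin scheme. It assumes that the last `backtest_length` observations are held out, that each trial forecasts `prediction_length` steps, and that the forecast origin advances by `stride` between trials; this is an assumption about MMF's behavior, and `backtest_windows` is a hypothetical helper, not an MMF function.

```python
def backtest_windows(n_obs, backtest_length, prediction_length, stride):
    """List (test_start, test_end) index pairs, end-exclusive.

    For each pair, observations [0, test_start) would be the history and
    [test_start, test_end) the held-out actuals.
    """
    windows = []
    test_start = n_obs - backtest_length  # first held-out observation
    while test_start + prediction_length <= n_obs:
        windows.append((test_start, test_start + prediction_length))
        test_start += stride
    return windows


# Settings from this notebook: 60 observations per series,
# backtest_length=12, prediction_length=3, stride=1.
windows = backtest_windows(60, 12, 3, 1)
```

Under these assumptions, the settings used here give 10 trials per series and model, with the first trial covering indices 48 to 50 and the last covering 57 to 59.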

### Forecast
The `scoring_output` table stores the forecasts for each time series from each model.

In [0]:
display(spark.sql(f"""
    select * from {catalog}.{db}.serverless_scoring_output 
    where unique_id = 'M1'
    order by unique_id, model, date
    """))

Refer to the [notebook](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/post-evaluation-analysis.ipynb) for guidance on performing fine-grained model selection after running `run_forecast`.
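
One simple form of the model selection mentioned above is to pick, for each time series, the model with the lowest average metric across its backtesting trials. The sketch below does this on a handful of made-up rows in plain Python. The column names `unique_id` and `model` appear in the queries in this notebook, while `metric_value` is an assumed column name; check the actual schema of the evaluation table before adapting it.

```python
from collections import defaultdict
from statistics import mean

# Toy rows standing in for the evaluation table; the values are made up.
# `metric_value` is an assumed column name.
rows = [
    {"unique_id": "M1", "model": "MoiraiSmall", "metric_value": 0.10},
    {"unique_id": "M1", "model": "MoiraiSmall", "metric_value": 0.14},
    {"unique_id": "M1", "model": "ChronosBoltSmall", "metric_value": 0.09},
    {"unique_id": "M1", "model": "ChronosBoltSmall", "metric_value": 0.11},
    {"unique_id": "M2", "model": "MoiraiSmall", "metric_value": 0.05},
    {"unique_id": "M2", "model": "ChronosBoltSmall", "metric_value": 0.08},
]


def best_model_per_series(rows):
    """Pick, per series, the model with the lowest mean metric over all trials."""
    scores = defaultdict(list)
    for row in rows:
        scores[(row["unique_id"], row["model"])].append(row["metric_value"])
    best = {}
    for (series, model), values in scores.items():
        avg = mean(values)
        if series not in best or avg < best[series][1]:
            best[series] = (model, avg)
    return {series: model for series, (model, _) in best.items()}


selection = best_model_per_series(rows)
```

Averaging over all trials is only one possible rule; weighting recent backtesting windows more heavily or comparing medians are equally reasonable choices.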

### Delete Tables
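Let's clean up the tables. Uncomment the statements below to run them. Note that `delete from` removes the rows but keeps the (now empty) tables.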

In [0]:
#display(spark.sql(f"delete from {catalog}.{db}.serverless_evaluation_output"))

In [0]:
#display(spark.sql(f"delete from {catalog}.{db}.serverless_scoring_output"))