<a target="_blank" href="https://colab.research.google.com/github/bettercodepaul/nixtla_intro_workshop/blob/main/Introduction_to_Nixtlaverse.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Introduction to Forecasting with Nixtla's Nixtlaverse

This notebook walks you through the very basics of forecasting time series with Nixtla's Nixtlaverse.

## Install and import necessary libraries

We use [Polars](https://docs.pola.rs/) for data wrangling, [Plotly](https://plotly.com/python/plotly-express/) for visualizations and Nixtla's [StatsForecast](https://nixtlaverse.nixtla.io/statsforecast/index.html) for basic time series forecasting.

In [None]:
%pip install -q statsforecast polars plotly

In [None]:
import polars as pl
import plotly.express as px
from statsforecast import StatsForecast
from datetime import date

## Initial Exploration of the data

The data for this walkthrough is simple monthly sales data from various countries.

In [None]:
df = pl.read_parquet("https://github.com/bettercodepaul/nixtla_intro_workshop/raw/refs/heads/main/retail_sales.parquet")
df.sample(5)

We can visualize the time series with Plotly.

In [None]:
# Plotly comes with useful interactive features: zoom, hover and trace isolation in the legend via click/double click
px.line(df, x="date", y="sales", color="country")

In [None]:
# too many overlapping lines? We can use a facet plot
fig = px.line(df, x="date", y="sales", facet_col="country", facet_col_wrap=2)
fig.update_yaxes(matches=None)
fig.update_layout(height=800)
fig.show()

## Hands-on

Explore the time series! What do you find interesting? Are there any obvious patterns? Are there any outliers? What could be the reasons?

In [None]:
# you can either code your own explorations here or just use the interactive plots above

## Forecasting using Nixtla's StatsForecast

Nixtla's StatsForecast package comes with many classic forecasting algorithms. If you would like to know more about them, we highly recommend the free online book ["Forecasting: Principles and Practice, the Pythonic Way"](https://otexts.com/fpppy/) by Rob J Hyndman and co-authors.

Nixtla expects specific column names: `unique_id` for the series identifier, `ds` for the timestamp and `y` for the value to forecast. We start by renaming our columns accordingly.

In [None]:
Y_df = df.rename(
    {
        "date": "ds",
        "sales": "y",
        "country": "unique_id",
    }
)
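
The renaming keeps the data in long format: one row per series and timestamp. The sketch below shows the idea in plain Python on made-up rows; the values and the helper `to_nixtla_names` are hypothetical and only serve as an illustration.

```python
# Toy rows in the original naming; the values are invented for illustration.
rows = [
    {"date": "2020-01-01", "country": "A", "sales": 10.0},
    {"date": "2020-02-01", "country": "A", "sales": 12.0},
    {"date": "2020-01-01", "country": "B", "sales": 7.0},
]

# Same mapping as in the rename cell above.
RENAME = {"date": "ds", "sales": "y", "country": "unique_id"}
REQUIRED = {"unique_id", "ds", "y"}


def to_nixtla_names(row):
    """Rename the keys of one row (hypothetical helper), leaving other keys untouched."""
    return {RENAME.get(key, key): value for key, value in row.items()}


long_rows = [to_nixtla_names(row) for row in rows]

# All required columns are present, and each unique_id identifies one series.
missing = REQUIRED - set(long_rows[0])
series_ids = sorted({row["unique_id"] for row in long_rows})
print(missing, series_ids)
```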

In [None]:
# Nixtla supports plotting time series, but the plots are not as nice as Plotly's
StatsForecast.plot(Y_df)

In [None]:
from statsforecast.models import (
    HistoricAverage,
    SeasonalNaive,
    HoltWinters,
    AutoARIMA,
)
models = [
    HistoricAverage(),
    SeasonalNaive(season_length=12),
    HoltWinters(season_length=12),
    AutoARIMA(season_length=12),
]
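
To get a feeling for the two simplest models in this list, here is a minimal plain-Python sketch of the ideas behind them. It is not the StatsForecast implementation; the function names and the toy history are made up, and we assume the history contains at least one full season.

```python
def historic_average(y, h):
    """Forecast every future step with the mean of the whole history."""
    mean = sum(y) / len(y)
    return [mean] * h


def seasonal_naive(y, h, season_length):
    """Repeat the last observed season for as many steps as needed."""
    last_season = y[-season_length:]
    return [last_season[i % season_length] for i in range(h)]


# Toy history with a season length of 3 (two full seasons).
history = [10.0, 20.0, 30.0, 12.0, 22.0, 32.0]

print(historic_average(history, h=2))                  # [21.0, 21.0]
print(seasonal_naive(history, h=5, season_length=3))   # [12.0, 22.0, 32.0, 12.0, 22.0]
```

Both serve as baselines: a more complex model such as `AutoARIMA` should at least beat them to justify its cost.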

In [None]:
sf = StatsForecast(
    models=models,
    freq="1mo",
)

In [None]:
forecasts_df = sf.forecast(df=Y_df, h=48)
forecasts_df.head()

In [None]:
# this is the Plotly equivalent to sf.plot(Y_df, forecasts_df)
plot_df = pl.concat([Y_df, forecasts_df], how="diagonal")
#plot_df = plot_df.filter(pl.col("unique_id").is_in(["Italien", "Japan"]))
#plot_df = plot_df.select("unique_id", "ds", "y", "SeasonalNaive", "AutoARIMA")
y_columns = [c for c in plot_df.columns if c not in ["unique_id", "ds"]]
fig = px.line(plot_df, x="ds", y=y_columns, facet_col="unique_id", facet_col_wrap=2)
fig.update_yaxes(matches=None)
fig.update_layout(height=800)
fig.show()

## Hands-on

Explore the forecasts! What do you find interesting? Are the models capable of reproducing the seasonal patterns? Is there a model you would prefer over the others?

In [None]:
# you can either code your own explorations here or just use the interactive plots above

## Forecast Validation

With time series data, cross-validation is done by sliding a window across the historical data and forecasting the period that follows it. This gives us a better estimate of a model's predictive ability across a wider range of points in time, while keeping the training data contiguous, as our models require.

In [None]:
from IPython.display import Image
Image(url="https://raw.githubusercontent.com/Nixtla/statsforecast/main/nbs/imgs/ChainedWindows.gif")

The `cross_validation` method of the `StatsForecast` class takes the following arguments.

- `df`: training data frame

- `h (int)`: the number of steps into the future that are forecasted. In our case, 24 months.

- `step_size (int)`: step size between consecutive windows. In other words: how often you want to run the forecasting process.

- `n_windows (int)`: number of windows used for cross-validation. In other words: how many past forecasting runs you want to evaluate.
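
How `h`, `step_size` and `n_windows` interact can be sketched in plain Python. The helper `cv_windows` below is hypothetical and works on positions instead of dates; it assumes that the last window ends at the final observation and that earlier windows are shifted back by `step_size` each, which is our understanding of how StatsForecast lays out the windows.

```python
def cv_windows(n_obs, h, step_size, n_windows):
    """Return (test_start, test_end) positions per window; training is [0, test_start)."""
    # Total number of observations touched by at least one test window.
    test_size = h + step_size * (n_windows - 1)
    if test_size >= n_obs:
        raise ValueError("not enough observations for this configuration")
    first_start = n_obs - test_size
    return [
        (first_start + i * step_size, first_start + i * step_size + h)
        for i in range(n_windows)
    ]


# 10 years of monthly data, non-overlapping windows (step_size == h).
print(cv_windows(n_obs=120, h=24, step_size=24, n_windows=4))
# [(24, 48), (48, 72), (72, 96), (96, 120)]

# A smaller step_size produces overlapping test windows.
print(cv_windows(n_obs=120, h=24, step_size=12, n_windows=4))
# [(60, 84), (72, 96), (84, 108), (96, 120)]
```

Note that with `step_size=24` the first window leaves only 24 observations for training, which is worth keeping in mind when choosing `n_windows`.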

In [None]:
# this takes some time...
cv_df = sf.cross_validation(
    df=Y_df,
    h=24,
    step_size=24, # try step_size 12 as well -> overlapping windows
    n_windows=4
)

In [None]:
# you can check the resulting windows of the sliding window validation
windows = cv_df.group_by("cutoff").agg(pl.col("ds").max()).sort("cutoff")
windows

In [None]:
# visualize the windows
colors = ["blue", "orange", "springgreen", "violet"]
fig = px.line(Y_df.group_by("ds").agg(pl.col("y").sum()).sort("ds"), x="ds", y="y")
for idx, window in enumerate(windows.rows()):
    fig.add_vrect(x0=window[0], x1=window[1], fillcolor=colors[idx%len(colors)], opacity=0.2)
fig

Look at the data and the windows. Will this validation be representative?

There are several error metrics we can use to evaluate the models based on the forecasts from the sliding window validation.

- **Mean Absolute Error**: [`mae`](https://nixtlaverse.nixtla.io/utilsforecast/losses.html#mean-absolute-error-mae)
- **Root Mean Squared Error**: [`rmse`](https://nixtlaverse.nixtla.io/utilsforecast/losses.html#root-mean-squared-error)
- **Bias**: [`bias`](https://nixtlaverse.nixtla.io/utilsforecast/losses.html#bias), use this to check if the model's forecasts are biased
- **Mean Absolute Percentage Error**: [`mape`](https://nixtlaverse.nixtla.io/utilsforecast/losses.html#mean-absolute-percentage-error), use this only for communication with stakeholders, not for choosing or comparing models
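
To make the definitions concrete, here is a plain-Python sketch of the four metrics on made-up numbers. These are not the `utilsforecast` implementations. Two details are assumptions on our side: the bias is computed as forecast minus actual, and the MAPE is returned as a fraction rather than multiplied by 100. Please check the linked documentation for the conventions the library uses.

```python
import math


def mae(y, y_hat):
    """Mean of the absolute errors."""
    return sum(abs(a - f) for a, f in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Square root of the mean squared error; penalizes large errors more."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, y_hat)) / len(y))


def bias(y, y_hat):
    """Mean signed error (assumed convention: forecast minus actual)."""
    return sum(f - a for a, f in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean absolute error relative to the actuals, as a fraction; undefined for y == 0."""
    return sum(abs(a - f) / abs(a) for a, f in zip(y, y_hat)) / len(y)


# Invented actuals and forecasts for illustration.
y = [100.0, 200.0, 400.0]
y_hat = [110.0, 190.0, 430.0]

for metric in (mae, rmse, bias, mape):
    print(metric.__name__, round(metric(y, y_hat), 3))
```

In this toy example the positive bias tells us that the forecasts are too high on average, something MAE and RMSE cannot show because they discard the sign of the error.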

In [None]:
from utilsforecast.losses import rmse, mae, mape, bias

In [None]:
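def evaluate_cv(df, metric):
    # every column that is not an id, timestamp, cutoff or target holds a model's forecasts
    models = [c for c in df.columns if c not in ('unique_id', 'ds', 'cutoff', 'y')]
    # evaluate the frame passed in as df, not the global cv_df
    evals = metric(df, models=models)
    # map the position of the smallest value in each row back to the model name
    pos2model = dict(enumerate(models))
    return evals.with_columns(
        best_model=pl.concat_list(models).list.arg_min().replace_strict(pos2model)
    ).with_columns(pl.selectors.float().round(2))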

In [None]:
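# you can try different metrics here; best_model picks the smallest value,
# which for bias is the most negative value and not the least biased model
evaluation_df = evaluate_cv(cv_df, rmse)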
evaluation_df.head()
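
The `best_model` column is a per-series argmin over the model columns. As a minimal sketch in plain Python, with invented error values and series names, the same selection plus a simple tally of wins per model could look like this:

```python
from collections import Counter

# Invented RMSE values per series and model, for illustration only.
errors = {
    "A": {"HistoricAverage": 30.0, "SeasonalNaive": 12.5, "AutoARIMA": 14.0},
    "B": {"HistoricAverage": 8.0, "SeasonalNaive": 9.0, "AutoARIMA": 7.5},
    "C": {"HistoricAverage": 21.0, "SeasonalNaive": 11.0, "AutoARIMA": 15.0},
}

# For each series, pick the model with the smallest error.
best = {series: min(scores, key=scores.get) for series, scores in errors.items()}

# Count how often each model comes out on top.
wins = Counter(best.values())
print(best)
print(wins.most_common())
```

Counting wins is only one way to summarize the result; it ignores how large the differences between the models are.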