# Kats 206 ML_AR Traditional Machine Learning (LightGBM) forecasting

This tutorial introduces the autoregressive machine learning model (MLAR) in Kats. It is a wrapper around LightGBM that handles feature engineering, global modelling, and direct modelling of many horizons and variables at once, among other things. The table of contents for Kats 206 is as follows:

1. Overview of the model  
2. Introduction to `MLARParams`  
3. A simple example

**Note:** We provide two types of tutorial notebooks:
- **Kats 101**, basic data structure and functionalities in Kats
- **Kats 20x**, advanced topics, including advanced forecasting techniques, advanced detection algorithms, `TsFeatures`, meta-learning, global models, etc.

## 1. Overview of the model

The model uses LightGBM, a gradient-boosted decision tree framework that became highly popular in forecasting after most of the top entries in the M5 competition used it. Our wrapper has the following characteristics:

### Features:

- Lags (input window)
- Calendar features and/or Fourier terms
- Past covariates and lags of those covariates
- Future covariates, together with their lags and future values
- Summary statistics over the input window: min, max, mean, median, standard deviation
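As a rough illustration of the first and last items in the list above, the following sketch builds lag features and window summary statistics with NumPy. The function name `window_features` and its signature are hypothetical and are not part of the Kats API; the actual feature engineering in the wrapper may differ.

```python
import numpy as np

# Hypothetical helper for illustration only; not part of Kats.
def window_features(series, input_window):
    """Return one row per forecast origin: the lags followed by
    min, max, mean, median, and standard deviation of the window."""
    series = np.asarray(series, dtype=float)
    rows = []
    for end in range(input_window, len(series) + 1):
        window = series[end - input_window:end]  # the input window (lags)
        stats = [
            window.min(),
            window.max(),
            window.mean(),
            np.median(window),
            window.std(),
        ]
        rows.append(np.concatenate([window, stats]))
    return np.array(rows)

toy_series = np.arange(10.0)
features = window_features(toy_series, input_window=4)
print(features.shape)  # 7 windows, each with 4 lags + 5 statistics
```

Each row corresponds to one forecast origin, so a series of length 10 with an input window of 4 yields 7 rows.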

### Normalization:

The model normalizes each input window separately, which allows it to model trends. The normalizer can be computed over the full input window or over a smaller portion of it, e.g., only the last 2-3 observations. The following options are implemented:
- The mean, median, max, or standard deviation over the window as the normalizer
- Either dividing by the normalizer or subtracting it
- Per-window z-score normalization, i.e., subtracting the mean and dividing by the standard deviation
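The options above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed names: `normalize_window`, `stat`, `mode`, and `norm_window` are hypothetical and do not correspond to the wrapper's internal functions.

```python
import numpy as np

# Hypothetical helper for illustration only; not part of Kats.
def normalize_window(window, stat="mean", mode="sub", norm_window=None):
    """Normalize one input window by a statistic of the window itself.

    norm_window: if given, compute the normalizer over only the last
    `norm_window` observations instead of the full window.
    """
    window = np.asarray(window, dtype=float)
    ref = window if norm_window is None else window[-norm_window:]
    if mode == "zscore":
        return (window - ref.mean()) / ref.std()
    stat_fns = {"mean": np.mean, "median": np.median, "max": np.max, "std": np.std}
    normalizer = stat_fns[stat](ref)
    if mode == "sub":
        return window - normalizer
    if mode == "div":
        return window / normalizer
    raise ValueError(f"unknown mode: {mode}")

window = np.array([10.0, 12.0, 14.0])
sub_mean = normalize_window(window, stat="mean", mode="sub")
div_max = normalize_window(window, stat="max", mode="div")
last_two = normalize_window(window, stat="mean", mode="sub", norm_window=2)
zscore = normalize_window(window, mode="zscore")
print(sub_mean)  # [-2.  0.  2.]
print(last_two)  # [-3. -1.  1.]
```

Because every window is expressed relative to its own level, two windows with the same shape at different levels map to the same inputs, which is what lets a tree-based model follow a trend.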

### Direct forecasting and modeling of more than one target:

You can directly specify the horizons you want to forecast. The horizon is modeled internally as an input feature that tells the model which horizon to predict, so a single model can be trained for several horizons at the same time. In a similar way, the model can be trained for several targets at the same time, which can be used, e.g., to have one model that predicts different quantiles. The past covariates and features stay the same across horizons, so they are effectively copied for each horizon, whereas the future covariates and features (calendar features) differ for each horizon. Because the input matrix is effectively copied for each horizon, training many direct horizons can be memory intensive.
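The following sketch shows how a training matrix with the horizon as a feature could be assembled, and why memory grows with the number of horizons. The function `stack_horizons` and the array layout are assumptions made for illustration; they do not reproduce the wrapper's internals.

```python
import numpy as np

# Hypothetical helper for illustration only; not part of Kats.
def stack_horizons(lag_features, targets, horizons):
    """Build one training matrix covering several direct horizons.

    lag_features: array of shape (n_windows, n_lags)
    targets: array of shape (n_windows, max_horizon), where column h - 1
             holds the value observed h steps after the window
    horizons: list of horizons to train for
    """
    x_blocks, y_blocks = [], []
    for h in horizons:
        # The lag features are copied once per horizon, and the horizon
        # itself is appended as an extra input column.
        h_col = np.full((lag_features.shape[0], 1), float(h))
        x_blocks.append(np.hstack([lag_features, h_col]))
        y_blocks.append(targets[:, h - 1])
    return np.vstack(x_blocks), np.concatenate(y_blocks)

lag_features = np.arange(15.0).reshape(5, 3)   # 5 windows, 3 lags each
targets = np.arange(20.0).reshape(5, 4)        # values 1 to 4 steps ahead
X, y = stack_horizons(lag_features, targets, horizons=[1, 2, 4])
print(X.shape, y.shape)  # (15, 4) (15,)
```

With three horizons the five input rows become fifteen, which is the memory cost mentioned above.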

### Iterative forecasting:

The model also allows for iterating out the forecast. For example, you can train a model for direct forecasting of a horizon of 30 and then iterate out to a horizon of 90, which is done in chunks of 30. Note that iterative forecasting needs future values of both past and future covariates, so if your model has past covariates, you need to forecast those as well.
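The chunked iteration can be sketched as follows. Here `iterate_forecast` is a hypothetical helper and `naive_direct` is a stand-in for a trained direct model; neither is part of Kats, and covariates are left out to keep the example short.

```python
import numpy as np

# Hypothetical helpers for illustration only; not part of Kats.
def iterate_forecast(history, direct_forecast, horizon, steps):
    """Extend a direct forecast of length `horizon` out to `steps`
    by feeding each forecast chunk back in as new history."""
    history = list(history)
    forecasts = []
    while len(forecasts) < steps:
        chunk = direct_forecast(np.asarray(history), horizon)
        forecasts.extend(chunk)
        history.extend(chunk)  # forecasts become inputs for the next chunk
    return np.asarray(forecasts[:steps])

def naive_direct(history, horizon):
    """Stand-in for a trained model: continue the last observed increment."""
    step = history[-1] - history[-2]
    return [history[-1] + step * (k + 1) for k in range(horizon)]

fc = iterate_forecast([1.0, 2.0, 3.0], naive_direct, horizon=30, steps=90)
print(len(fc), fc[0], fc[-1])  # 90 4.0 93.0
```

The direct model is called three times here, once per chunk of 30, and errors in an early chunk carry over into later ones because the forecasts are reused as inputs.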



## 2. Introduction to `MLARParams`

All parameters for the model are specified using the `MLARParams` class, which holds both the settings of the wrapper and the hyperparameters passed on to LightGBM. The arguments used in the example in this tutorial are:

* **target_variable**: the name(s) of the variable(s) to forecast, given as a list in the example;
* **horizon**: the forecast horizon(s) the model is trained for, e.g., `horizon=60` or a list such as `[1, 2, 3, 4, 5, 30, 60, 90]`;
* **input_window**: the size of the input window, i.e., the number of lags used as features;
* **freq**: the time granularity of the input time series, e.g., `freq="D"` for daily data;
* **norm_sum_stat**: the summary statistic used for per-window normalization, e.g., `"mean"`;
* **sub_div**: whether the window is normalized by subtracting the statistic (`"sub"`, as in the example) or by dividing by it;
* **calendar_features**, **fourier_features_order**, **fourier_features_period**: the calendar features and Fourier terms to generate;
* **n_estimators**, **num_leaves**, **objective**, **n_jobs**, **verbose**: LightGBM hyperparameters.

For the definition of other parameters, please see our documentation.


## 3. A simple example

The `MLARModel` class is used to build and train the model, and `MLARParams` is its parameter class. The model is global: it is trained on a dictionary of time series and returns one forecast per series.

The example in this section is designed to display the basic functionality of these classes. It is of limited scale and not expected to provide good performance.



In [None]:
%%capture
# For Google Colab:
!pip install kats

In [8]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

from kats.models.ml_ar import MLARParams, MLARModel
from kats.consts import TimeSeriesData
from kats.utils.simulator import Simulator

import time


Simulate data. 

In [None]:
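import random

# Fix the seeds so the simulated data are reproducible. NumPy's generator is
# seeded as well, in case the simulator draws its noise from it.
random.seed(358749)
np.random.seed(358749)
all_series = {}

# Simulate 20 daily series with trend, weekly and yearly seasonality, and noise.
for i in range(20):
    sim = Simulator(n=1000, freq="1D", start=pd.to_datetime("2017-01-01"))
    sim.add_trend(magnitude=50)
    sim.add_seasonality(5, period=timedelta(days=7))
    sim.add_seasonality(10, period=timedelta(days=365))
    sim.add_noise(magnitude=2)
    sim_ts = sim.stl_sim().to_dataframe()
    sim_ts.set_index("time", drop=True, inplace=True)
    all_series[f"country_{i}"] = sim_ts

# Combine into one wide data frame with one column per series.
in_data = pd.concat(all_series, axis=1)
in_data.columns = list(all_series.keys())
countries = in_data.columns

# Keep "time" both as the index and as a regular column.
in_data = in_data.reset_index()
in_data.set_index("time", inplace=True, drop=False)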


Set LGBM parameters.

In [None]:
target_var = "y"

horizon = 60
# Alternatively, a list of horizons:
# horizon = [1, 2, 3, 4, 5, 30, 60, 90]

input_window = 90

lightgbm_params = {}

# No calendar features are requested (empty list) and the Fourier settings are
# left at their defaults. The window mean is subtracted for normalization.
lightgbm_params["orig"] = MLARParams(
    # LightGBM hyperparameters
    n_jobs=20,
    n_estimators=400,
    objective="mae",
    num_leaves=128,
    verbose=3,
    # forecasting setup
    target_variable=[target_var],
    horizon=horizon,
    input_window=input_window,
    # freq="D",
    # per-window normalization: subtract the window mean
    norm_sum_stat="mean",
    sub_div="sub",
    calendar_features=[],
    # fourier_features_order=[],
    # fourier_features_period=[],
)


Partition into training and test dataset

In [None]:
start_date = pd.to_datetime("2017-01-01")

# The simulated series cover 2017-01-01 to 2019-09-27 (1000 daily points), so
# the forecast origin has to fall inside that range to leave data for testing.
fc_origin = pd.to_datetime("2018-09-30")

fc_hor1 = fc_origin + pd.DateOffset(days=1)
fc_hor365 = fc_origin + pd.DateOffset(days=365)

train_data = in_data[(in_data["time"] >= start_date) & (in_data["time"] <= fc_origin)]
test_data = in_data[(in_data["time"] >= fc_hor1) & (in_data["time"] <= fc_hor365)]

# One TimeSeriesData object per series, each with the target column named "y".
train_data_dict = {}

for curr_country in countries:
    curr_data = train_data[["time", curr_country]].rename(columns={curr_country: "y"})
    train_data_dict[curr_country] = TimeSeriesData(curr_data, time_col_name="time")


Train the models

In [None]:
# create models from parameter configs
models = {name: MLARModel(params) for name, params in lightgbm_params.items()}

all_run_times = {fc_origin: {}}
all_forecasts = {fc_origin: {}}

for mod_name, model in models.items():
    start_time = time.time()
    model.train(train_data_dict)
    end_time = time.time()
    all_run_times[fc_origin][mod_name] = end_time - start_time
    print(f"Model: {mod_name} run time: {all_run_times[fc_origin][mod_name]} seconds")

    # get forecasts; steps exceeds the trained horizon, so the forecast is iterated out
    all_forecasts[fc_origin][mod_name] = model.predict(steps=300)

all_run_times

Plot the forecasts from the model

In [None]:
forecasts = all_forecasts[fc_origin]["orig"]

for country in forecasts.keys():
    fc = forecasts[country][["time", "forecast"]].set_index("time")

    # Align forecast, training data, and test data on the time index.
    plot_df = fc.join(train_data[country], how="outer").join(
        test_data[country], how="outer", lsuffix="_train"
    )

    # Draw on an explicit axis so that no empty extra figure is created.
    fig, ax = plt.subplots(figsize=(50, 20))
    plot_df.plot(ax=ax, title=f"Country: {country}")
    plt.show()