# Kats 205 Forecasting with Global Model

This tutorial introduces the global model in Kats. The global model is a forecasting method that combines exponential smoothing models with recurrent neural networks, which can yield higher accuracy than purely statistical or purely machine learning approaches. The table of contents for Kats 205 is as follows:

1. Overview of global model for forecasting  
2. Build Your Own Global Model or Global Ensemble From Scratch  
    2.1 Introduction to `GMParam`  
    2.2 Forecasting using a single global model with `GMModel`  
    2.3 Forecasting using a global model ensemble with `GMEnsemble`  
    2.4 Backtesting with `GMBackTester`  
3. Using pretrained global model/global ensemble  

**Note:** We provide two types of tutorial notebooks:
- **Kats 101**: basic data structures and functionality in Kats.
- **Kats 20x**: advanced topics, including advanced forecasting techniques, advanced detection algorithms, `TsFeatures`, meta-learning, and the global model.

## 1. Overview of global model for forecasting

The Global Model, henceforth abbreviated as GM, is a powerful forecasting model that was originally proposed by Slawek Smyl and won the Computational Intelligence in Forecasting International Time Series Competition (2016) and the M4 Forecasting competition (2018).  The GM effectively combines exponential smoothing models with LSTM neural networks in a way that results in higher accuracy than a method that only uses pure statistics or machine learning.  

Kats has many forecasting models that only use pure statistics, and we discussed many of these approaches in [Kats 201](kats_201_forecasting.ipynb).  We also provide an ML approach to forecasting in our metalearning framework for forecasting, which is covered in [Kats 204](kats_204_metalearning.ipynb).  The GM approach generates more accurate forecasts by combining the advantages of pure statistical and ML approaches.  For more details on how this works, please see the [original paper for the GM](https://www.sciencedirect.com/science/article/pii/S0169207019301153).
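
To make the "exponential smoothing plus neural network" idea more concrete, here is a minimal sketch of the exponential smoothing half only. It computes a smoothed level for a series and divides the series by that level, which is the kind of normalization the original paper applies before the network sees the data. The function names and the fixed `alpha` are illustrative assumptions and not part of the Kats API; in the original method the smoothing coefficients are learned jointly with the network weights rather than fixed in advance.

```python
def smoothed_levels(values, alpha=0.3):
    """Simple exponential smoothing: l_t = alpha * y_t + (1 - alpha) * l_{t-1}."""
    levels = [values[0]]  # initialize the level with the first observation
    for y in values[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels


def normalize_by_level(values, alpha=0.3):
    """Divide each observation by its smoothed level (assumes positive values)."""
    levels = smoothed_levels(values, alpha)
    return [y / level for y, level in zip(values, levels)]


# Hypothetical toy series growing by 10% per step.
series = [100.0, 110.0, 121.0, 133.1]
levels = smoothed_levels(series, alpha=0.5)
normalized = normalize_by_level(series, alpha=0.5)
```

After normalization, series with very different scales live on a comparable range, which is what lets a single network be trained across many series.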

GM is trained with a large number of time series of the same time granularity.  It generally supports batch processing, meaning it can efficiently generate forecasts for several time series at the same time.  


In Kats, we build upon the original model and allow support for two different neural network types when building a GM:
1. **Recurrent neural network (RNN)**: This is the default and is best for short-term forecasting.  
2. **Sequence to Sequence (S2S)**: Better suited to medium- and long-term forecasting. 



## 2. Build Your Own Global Model or Global Ensemble From Scratch

The `GMModel` is our basic class to build a single GM.  Kats also supports Global Model Ensembles (GMEs), which are ensembles of independent GMs, with the `GMEnsemble` class.  The `GMParam` is the parameter class for both `GMModel` and `GMEnsemble`.  We also provide the `GMBackTester` class for parameter tuning and backtesting. 

The examples in this section are designed to display the basic functionality of each of the aforementioned classes.  They are of limited scale and not expected to provide good performance.

### 2.1 Introduction to `GMParam`

All parameters for a GM or a GME are specified using the `GMParam` class.  The `GMParam` class does basic parameter checking when initialized to ensure that the parameters are correctly specified.  Here are some of its key arguments:

* **freq**: `str` or `pd.Timedelta`. The time granularity of the model and of the input time series. For example, `freq='D'` indicates a daily model.
* **model_type**: `str`. The neural network type: either 'rnn' (recurrent neural network) or 's2s' (sequence to sequence). The default is 'rnn'.
* **seasonality**: `int`. The length of the seasonality period. The default value is 1, indicating a non-seasonal model.
* **input_window**: `int`. The length of each input time series. This should be greater than the `seasonality` argument.
* **fcst_window**: `int`. The number of data points forecast in a single forecast step.
* **quantile**: `list[float]`. The quantiles to forecast. The first value of this list should always be 0.5, representing the median. The default value is `[0.5, 0.05, 0.95, 0.99]`.
* **nn_structure**: `list[list[int]]`. The structure of the neural network. If not specified, the default value is `[[1, 3]]`.
* **loss_function**: `str`. The name of the loss function: either 'pinball' or 'adjustedpinball'.
* **gmfeature**: `list[str]` or `str`. A single feature name or a list of feature names.

For the definition of other parameters, please see our documentation.
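
Since the `quantile` and `loss_function` arguments work together, it may help to see what a pinball loss computes. The sketch below implements the standard pinball (quantile) loss in plain Python. It is an illustration of the general technique, not Kats' internal implementation, and it does not reproduce the 'adjustedpinball' variant; the function names are hypothetical.

```python
def pinball_loss(actual, forecast, quantile):
    """Standard pinball loss for one observation and one quantile level."""
    diff = actual - forecast
    # Under-forecasts are weighted by `quantile`, over-forecasts by `1 - quantile`.
    return max(quantile * diff, (quantile - 1) * diff)


def mean_pinball_loss(actuals, forecasts, quantile):
    """Average pinball loss over paired actuals and forecasts."""
    losses = [pinball_loss(a, f, quantile) for a, f in zip(actuals, forecasts)]
    return sum(losses) / len(losses)


# For a high quantile, forecasting too low costs far more than forecasting too high.
under = pinball_loss(actual=10.0, forecast=8.0, quantile=0.95)
over = pinball_loss(actual=8.0, forecast=10.0, quantile=0.95)

# At the median (quantile 0.5) the loss is half the absolute error.
median_loss = mean_pinball_loss([1.0, 3.0], [2.0, 2.0], quantile=0.5)
```

Minimizing this loss for a given quantile level pushes the forecast toward that quantile of the target distribution, which is why one loss can drive several quantile outputs at once.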

In [1]:
import numpy as np
import pandas as pd
import sys
import warnings
import os
import pprint

warnings.simplefilter(action='ignore')
sys.path.append("../")

from kats.models.globalmodel.utils import GMParam

Below we initialize a `GMParam` instance that we will use to train a daily model with weekly seasonality. 

In [2]:
gmparam = GMParam(
    input_window = 35, 
    fcst_window = 31,
    seasonality = 7,
    freq = 'D',
    loss_function = 'adjustedpinball',
    nn_structure = [[1,3]],
    gmfeature = ['last_date'],
    epoch_num = 1, 
    epoch_size = 2, # use a small num just for demonstration
    gmname = "daily_default",
)

### 2.2 Forecasting using a single global model with `GMModel`


In [3]:
from kats.models.globalmodel.model import GMModel, load_gmmodel_from_file
from kats.models.globalmodel.serialize import global_model_to_json, load_global_model_from_json
from kats.consts import TimeSeriesData

Now we are ready to train a global model.  A `GMModel` object can be initialized with just one parameter - an instance of the `GMParam` object.

In [4]:
gm = GMModel(gmparam)

To train a `GMModel` object, we need a list or a dictionary of `TimeSeriesData` objects. We will simulate two lists, one for training and one for testing, using the `get_ts` function from our test utilities.

In [5]:
from kats.tests.test_globalmodel import get_ts

train_TSs = [get_ts(n*5, '2020-05-06') for n in range(20, 40)]
test_TSs = [get_ts(n*2, '2020-05-06') for n in range(40, 45)]

It is straightforward to train the GM using the `train` function.  This function also saves basic information about the training process, which we look at below.

In [6]:
# train the model
training_info = gm.train(train_TSs)

# training_info stores information about the training process
pprint.pprint(training_info)

{'train_loss_monitor': [0.112794265],
 'train_loss_val': [0.40887923538684845],
 'valid_fcst_monitor': [],
 'valid_loss_monitor': [{'epoch': 0}]}


Now we can use the trained model to generate forecasts using the `predict` function.  The input can be a single `TimeSeriesData` object or a list/dictionary of `TimeSeriesData` objects, and the returned value is a dictionary of `pd.DataFrame` objects.  The `predict` function also requires you to specify the number of steps you wish to forecast.  Here, we demonstrate batch forecasting on the 5 `TimeSeriesData` objects in `test_TSs` for 3 steps.

In [7]:
fcsts = gm.predict(test_TSs, steps = 3)

This generates a dictionary with 5 keys, one for each time series in `test_TSs`.

In [8]:
fcsts.keys()

dict_keys([0, 1, 2, 3, 4])

The values in the dictionary are the 3-step forecasts for each time series in `test_TSs`.  Let's look at the forecast for the time series at index 3 (the fourth in the list).  We see a `pd.DataFrame` that gives forecasts for the 50th, 5th, 95th, and 99th percentiles for the 3 days after the time series ends.

In [9]:
fcsts[3]

| | fcst_quantile_0.5 | fcst_quantile_0.05 | fcst_quantile_0.95 | fcst_quantile_0.99 | time |
|---|---|---|---|---|---|
| 0 | -0.7346 | -0.42873 | -0.286135 | -1.340883 | 2020-07-31 |
| 1 | 1.092031 | 1.828891 | 0.928108 | 0.769238 | 2020-08-01 |
| 2 | 0.902502 | 0.978088 | -0.026781 | 0.778905 | 2020-08-02 |
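
One quick sanity check on quantile forecasts is whether they are ordered, that is, whether the 5th percentile is below the median, the median below the 95th percentile, and so on. The sketch below runs that check on the three rows shown above, copied into plain dictionaries. The helper name `quantiles_are_ordered` is hypothetical and not part of Kats.

```python
# Quantile columns listed from the lowest to the highest quantile level.
QUANTILE_COLUMNS = [
    "fcst_quantile_0.05",
    "fcst_quantile_0.5",
    "fcst_quantile_0.95",
    "fcst_quantile_0.99",
]


def quantiles_are_ordered(row):
    """Return True if forecast values do not decrease as the quantile level rises."""
    values = [row[column] for column in QUANTILE_COLUMNS]
    return all(low <= high for low, high in zip(values, values[1:]))


# The three forecast rows from the output above.
rows = [
    {"fcst_quantile_0.5": -0.7346, "fcst_quantile_0.05": -0.42873,
     "fcst_quantile_0.95": -0.286135, "fcst_quantile_0.99": -1.340883},
    {"fcst_quantile_0.5": 1.092031, "fcst_quantile_0.05": 1.828891,
     "fcst_quantile_0.95": 0.928108, "fcst_quantile_0.99": 0.769238},
    {"fcst_quantile_0.5": 0.902502, "fcst_quantile_0.05": 0.978088,
     "fcst_quantile_0.95": -0.026781, "fcst_quantile_0.99": 0.778905},
]
ordered_flags = [quantiles_are_ordered(row) for row in rows]
```

None of the three rows passes the check, which is in line with the earlier note that these small demonstration models are not expected to perform well. With a real forecast, `fcsts[3].to_dict("records")` produces rows in the same shape.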


We can save the model using the `save_model` function.

In [10]:
# save model
gm.save_model("gm_example_1.p")

Let's load the saved model and again use it to repeat the forecast we did above.

In [11]:
# load model
gm2 = load_gmmodel_from_file("gm_example_1.p")

# make prediction 
fcsts2 = gm2.predict(test_TSs, steps = 3)
fcsts2[3]

| | fcst_quantile_0.5 | fcst_quantile_0.05 | fcst_quantile_0.95 | fcst_quantile_0.99 | time |
|---|---|---|---|---|---|
| 0 | -0.7346 | -0.42873 | -0.286135 | -1.340883 | 2020-07-31 |
| 1 | 1.092031 | 1.828891 | 0.928108 | 0.769238 | 2020-08-01 |
| 2 | 0.902502 | 0.978088 | -0.026781 | 0.778905 | 2020-08-02 |


Now let's remove the saved model.

In [12]:
os.remove("gm_example_1.p")

We can also encode a GM as a JSON string using the `global_model_to_json` function.

In [13]:
gm_str = global_model_to_json(gm)

Let's repeat the same prediction one more time.

In [14]:
# load model from json string
gm3 = load_global_model_from_json(gm_str)

# make prediction 
fcsts3 = gm3.predict(test_TSs, steps = 3)
fcsts3[3]

| | fcst_quantile_0.5 | fcst_quantile_0.05 | fcst_quantile_0.95 | fcst_quantile_0.99 | time |
|---|---|---|---|---|---|
| 0 | -0.7346 | -0.42873 | -0.286135 | -1.340883 | 2020-07-31 |
| 1 | 1.092031 | 1.828891 | 0.928108 | 0.769238 | 2020-08-01 |
| 2 | 0.902502 | 0.978088 | -0.026781 | 0.778905 | 2020-08-02 |


### 2.3 Forecasting using a global model ensemble with `GMEnsemble`

You can also build an ensemble of several individual GMs with the `GMEnsemble` class. In addition to a `GMParam` object, you need to specify how the training data set should be split and how many independent `GMModel` objects should be trained. 

Here is the list of attributes:
* **gmparam**: A `GMParam` object, used to initialize each global model.
* **ensemble_type**: String. How forecasts are combined: either 'median' or 'mean'. Default is 'median'.
* **splits**: Integer. A positive integer representing the number of sub-datasets to be built. Default is 3.
* **overlap**: Boolean. Whether sub-datasets overlap with each other. Default is True. For example, when `splits=3` and `overlap=True`, each sub-dataset contains 2/3 of the training data.
* **replicate**: Integer. A positive integer representing the number of global models to be trained on each sub-dataset. Default is 1.
* **multi**: Boolean. Whether to use multi-processing for training and prediction. Default is False.

Note that a `GMEnsemble` object will build `splits*replicate` independent `GMModel` objects, and the final forecasts are aggregated from the forecasts generated by each trained `GMModel` object.
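
The interaction between `splits`, `overlap`, and `replicate` can be sketched in plain Python. The splitting scheme below is an assumption chosen to match the description above (with `splits=3` and `overlap=True`, each sub-dataset holds 2/3 of the data): the data is partitioned into `splits` folds, and each sub-dataset leaves one fold out. The function `build_sub_datasets` is hypothetical, and Kats' actual splitting logic may differ in detail.

```python
def build_sub_datasets(series_ids, splits=3, overlap=True):
    """Partition ids into `splits` folds and build one sub-dataset per fold.

    overlap=True:  each sub-dataset is all folds except one.
    overlap=False: each sub-dataset is exactly one fold.
    """
    folds = [series_ids[i::splits] for i in range(splits)]
    if not overlap:
        return folds
    return [
        [s for j, fold in enumerate(folds) if j != i for s in fold]
        for i in range(splits)
    ]


series_ids = list(range(12))  # stand-ins for 12 training time series
overlapping = build_sub_datasets(series_ids, splits=3, overlap=True)
disjoint = build_sub_datasets(series_ids, splits=3, overlap=False)

# One model is trained per (sub-dataset, replicate) pair.
replicate = 2
num_models = len(overlapping) * replicate
```

Overlapping sub-datasets give each model more data to learn from, while disjoint ones make the individual models more independent of each other.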

In [15]:
from kats.models.globalmodel.ensemble import GMEnsemble, load_gmensemble_from_file

# Initialize the ensemble
gme = GMEnsemble(gmparam, splits=3, overlap=True, replicate=1, multi=True)


Now we can train the `GMEnsemble` object. Note that one has the choice of setting aside a test set from the training data to measure the performance of each `GMModel` object throughout the training process.

In [16]:
gme.train(train_TSs, test_size = 0.1)

In [17]:
# Training information and evaluation results on the held-out test set are saved in the gm_info attribute.
gme.gm_info

[{'train_loss_monitor': [0.11851728],
  'valid_loss_monitor': [{'epoch': 0}],
  'valid_fcst_monitor': [],
  'train_loss_val': [0.26666390150785446],
  'test_info': [      smape     sbias  exceed_0.05  exceed_0.95  exceed_0.99  step  idx  epoch
   0  1.341331  0.040348     0.419355     0.387097     0.451613     0   14      0
   1  1.462254  0.196207     0.419355     0.451613     0.354839     0    5      0]},
 {'train_loss_monitor': [0.1094319],
  'valid_loss_monitor': [{'epoch': 0}],
  'valid_fcst_monitor': [],
  'train_loss_val': [0.1915058046579361],
  'test_info': [      smape     sbias  exceed_0.05  exceed_0.95  exceed_0.99  step  idx  epoch
   0  1.400731  0.037765     0.354839     0.225806     0.290323     0   14      0
   1  1.429105  0.290570     0.483871     0.322581     0.290323     0    5      0]},
 {'train_loss_monitor': [0.10012598],
  'valid_loss_monitor': [{'epoch': 0}],
  'valid_fcst_monitor': [],
  'train_loss_val': [0.3629566431045532],
  'test_info': [      smape     
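
The `test_size` argument sets aside part of the training data for evaluation. As a rough sketch of the idea, the code below holds out a random 10% of a list of series. It assumes that whole time series are held out, which would be consistent with `test_info` reporting two rows per model for the 20 training series used here. The function `hold_out_series` and its `seed` argument are hypothetical, not the Kats implementation.

```python
import random


def hold_out_series(series_list, test_size=0.1, seed=0):
    """Randomly set aside a fraction of the series for evaluation."""
    n_test = round(len(series_list) * test_size)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_idx = set(rng.sample(range(len(series_list)), n_test))
    train = [s for i, s in enumerate(series_list) if i not in test_idx]
    test = [s for i, s in enumerate(series_list) if i in test_idx]
    return train, test


# Integers stand in for the 20 training time series.
train_part, test_part = hold_out_series(list(range(20)), test_size=0.1)
```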

After training the `GMEnsemble` object, you can use it to generate forecasts. As with the `GMModel` object, the input can be a `TimeSeriesData` object or a list/dictionary of `TimeSeriesData` objects, and the returned value is a dictionary of `pd.DataFrame` objects. 

In [18]:
# generate forecasts
fcsts=gme.predict(test_TSs, steps = 3)
print(f"The generated forecasts is of type {type(fcsts)}, and it is {fcsts}.")

The generated forecasts is of type <class 'dict'>, and it is {0:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0           0.767697            0.404130            0.497639   
1           0.037443            0.071744            0.303610   
2          -1.401297           -1.378399           -0.707376   

   fcst_quantile_0.99       time  
0            0.457347 2020-07-25  
1            0.077227 2020-07-26  
2           -1.142237 2020-07-27  , 1:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0           0.313839            0.040294            0.025578   
1           0.232108            0.380484            0.459492   
2           0.970169            1.245332            2.275410   

   fcst_quantile_0.99       time  
0           -0.023659 2020-07-27  
1            0.217330 2020-07-28  
2            1.349539 2020-07-29  , 2:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0           0.679296            0.467965            0.468984   
1    
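
The ensemble forecast is an aggregate of the forecasts from the individual models, combined according to `ensemble_type` ('median' or 'mean'). The sketch below shows an element-wise aggregation over hypothetical per-model forecasts in plain Python. The function `aggregate_forecasts` and the numbers are made up for illustration; this is not Kats' internal code.

```python
from statistics import mean, median


def aggregate_forecasts(model_forecasts, ensemble_type="median"):
    """Combine forecasts step by step across models."""
    if ensemble_type not in ("median", "mean"):
        raise ValueError(f"Unsupported ensemble_type: {ensemble_type}")
    combine = median if ensemble_type == "median" else mean
    # zip(*...) groups the values that all models predicted for the same step.
    return [combine(step_values) for step_values in zip(*model_forecasts)]


# Three hypothetical models, each forecasting three steps.
model_forecasts = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 9.0],
    [6.0, 5.0, 3.0],
]
median_fcst = aggregate_forecasts(model_forecasts, "median")
mean_fcst = aggregate_forecasts(model_forecasts, "mean")
```

The median is less sensitive to a single model producing an extreme value (such as the 9.0 above), which is one reason it is a sensible default.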

As with the `GMModel` object, you can also save, load, and serialize the `GMEnsemble` object.

In [19]:
# save model
gme.save_model("gme_example_1.p")

# load model
gme2 = load_gmensemble_from_file("gme_example_1.p")

# remove the saved model
os.remove("gme_example_1.p")


# encode model into json string
gme.gm_info = None  # Note that pd.DataFrame is not serializable
gme_str = global_model_to_json(gme)

# load model from json string
gme3 = load_global_model_from_json(gme_str)

### 2.4 Backtesting with `GMBackTester`

A `GMBackTester` object helps evaluate a hyper-parameter setting (i.e., a `GMParam` object). Here is a list of some of its attributes:
* **data**: A list or a dictionary of `kats.consts.TimeSeriesData` objects for training and validation.
* **gmparam**: A `GMParam` object.
* **backtest_timestamp**: A list of strings or `pandas.Timestamp` objects representing the backtest timestamps. A backtest timestamp is used to split each time series into a training set and a testing set.
* **splits**: Integer. A positive integer representing the number of sub-datasets to be built. Default is 3.
* **overlap**: Boolean. Whether sub-datasets overlap with each other. Default is True. For example, when `splits=3` and `overlap=True`, each sub-dataset contains 2/3 of the training data.
* **replicate**: Integer. A positive integer representing the number of global models to be trained on each sub-dataset. Default is 1.

For the full list of attributes, please see our documentation.
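
To illustrate how a backtest timestamp divides a series, here is a small sketch that splits a list of `(date, value)` pairs at a cutoff. Whether the observation at the cutoff itself belongs to the training or the testing side is an assumption here (it goes to the testing side), and `split_at_timestamp` is a hypothetical helper rather than a Kats function.

```python
from datetime import date, timedelta


def split_at_timestamp(series, backtest_timestamp):
    """Split (date, value) pairs into training and testing parts at a cutoff date."""
    cutoff = date.fromisoformat(backtest_timestamp)
    train = [(d, v) for d, v in series if d < cutoff]
    test = [(d, v) for d, v in series if d >= cutoff]  # assumption: cutoff goes to test
    return train, test


# Five daily observations from 2020-08-08 to 2020-08-12.
start = date(2020, 8, 8)
series = [(start + timedelta(days=i), float(i)) for i in range(5)]
train, test = split_at_timestamp(series, "2020-08-10")
```

The model is then fit on the training part of every series and scored against the testing part, so each backtest timestamp yields one out-of-sample evaluation.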

In [20]:
from kats.models.globalmodel.backtester import GMBackTester

# initiate backtester
gbm = GMBackTester(train_TSs, gmparam, backtest_timestamp = ['2020-08-10'])

Now one can run backtesting.

In [21]:
gbm.run_backtest()

| | smape | sbias | exceed_0.05 | exceed_0.95 | exceed_0.99 | model_num | step | idx | type | backtest_ts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.552282 | -0.433595 | 0.290323 | 0.516129 | 0.387097 | 0.0 | 0 | 13.0 | single | 2020-08-10 |
| 1 | 1.402725 | -0.414422 | 0.354839 | 0.483871 | 0.451613 | 1.0 | 0 | 13.0 | single | 2020-08-10 |
| 2 | 1.502924 | -0.180256 | 0.322581 | 0.387097 | 0.483871 | 2.0 | 0 | 13.0 | single | 2020-08-10 |
| 3 | 1.465183 | -0.326782 | 0.322581 | 0.483871 | 0.451613 | | 0 | | ensemble | 2020-08-10 |
| 4 | 1.509675 | -0.267232 | 0.677419 | 0.645161 | 0.0 | 0.0 | 1 | 13.0 | single | 2020-08-10 |
| 5 | 1.39565 | 0.206955 | 0.774194 | 0.354839 | 0.290323 | 1.0 | 1 | 13.0 | single | 2020-08-10 |
| 6 | 1.16541 | 0.08344 | 0.225806 | 0.129032 | 0.258065 | 2.0 | 1 | 13.0 | single | 2020-08-10 |
| 7 | 1.363542 | -0.082123 | 0.677419 | 0.387097 | 0.193548 | | 1 | | ensemble | 2020-08-10 |
| 8 | 1.193258 | -0.169224 | 0.129032 | 0.16129 | 0.0 | 0.0 | 2 | 13.0 | single | 2020-08-10 |
| 9 | 1.563056 | -0.871619 | 0.16129 | 0.096774 | 0.064516 | 1.0 | 2 | 13.0 | single | 2020-08-10 |
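
For readers unfamiliar with the metrics in this table, the sketch below computes sMAPE in its common form scaled to the range 0 to 2 (the `smape` values above fall in that range), along with a simple exceedance rate. Both functions are illustrative assumptions: the exact definitions Kats uses for `smape` and the `exceed_*` columns may differ, and the names `smape` and `exceed_rate` here are hypothetical helpers.

```python
def smape(actuals, forecasts):
    """Symmetric MAPE scaled to [0, 2]; pairs where both values are 0 are skipped."""
    terms = [
        2 * abs(a - f) / (abs(a) + abs(f))
        for a, f in zip(actuals, forecasts)
        if abs(a) + abs(f) > 0
    ]
    return sum(terms) / len(terms) if terms else 0.0


def exceed_rate(actuals, quantile_forecasts):
    """Fraction of actual values that lie above the forecast quantile."""
    above = [a > f for a, f in zip(actuals, quantile_forecasts)]
    return sum(above) / len(above)


error = smape([1.0, 2.0, 4.0], [1.0, 1.0, 2.0])
rate = exceed_rate([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0])
```

Under this exceedance definition, a well-calibrated 0.95 quantile forecast would be exceeded by roughly 5% of the actual values.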


## 3. Using pretrained global model/global ensemble

In Kats, we provide two pretrained daily `GMEnsemble` objects: an S2S-GME and an RNN-GME. Both are trained on the M4 dataset. You can use them for forecasting exploration or as benchmarks.

In [22]:
gme_rnn = load_gmensemble_from_file("../kats/models/globalmodel/pretrained_daily_rnn.p")
gme_rnn

ERROR:root:Fail to load GMEnsemble from ../kats/models/globalmodel/pretrained_daily_rnn.p with Exception [Errno 2] No such file or directory: '../kats/models/globalmodel/pretrained_daily_rnn.p'.


ValueError: Fail to load GMEnsemble from ../kats/models/globalmodel/pretrained_daily_rnn.p with Exception [Errno 2] No such file or directory: '../kats/models/globalmodel/pretrained_daily_rnn.p'.

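The error above indicates that the pretrained file was not found at this relative path when the notebook was run; point `load_gmensemble_from_file` at the file's location in your Kats checkout. Once the model is loaded, you can use it to generate forecasts.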

In [None]:
fcsts = gme_rnn.predict(test_TSs, steps = 3)
fcsts