# Kats 205 Forecasting with Global Model

This tutorial introduces the global model in Kats. The global model is a forecasting method that combines exponential smoothing models with recurrent neural networks, and it has achieved high accuracy in forecasting competitions. The table of contents for Kats 205 is as follows:

1. Overview of global model for forecasting  
2. Building your own global model/global ensemble from scratch  
    2.1 Introduction to `GMParam`  
    2.2 Forecasting using a single global model with `GMModel`  
    2.3 Forecasting using a global model ensemble with `GMEnsemble`  
    2.4 Backtesting with `GMBackTester`  
3. Using pretrained global model/global ensemble  

**Note:** We provide two types of tutorial notebooks:
- **Kats 101**: basic data structures and functionalities in Kats
- **Kats 20x**: advanced topics, including advanced forecasting techniques, advanced detection algorithms, `TsFeatures`, meta-learning, global model, etc.

## 1. Overview of global model for forecasting

The global model (GM) is a neural-network-based forecasting model first proposed by Slawek Smyl. It won the M4 Forecasting Competition (2018) and the Computational Intelligence in Forecasting International Time Series Competition (2016) (reference for the original model: https://www.sciencedirect.com/science/article/pii/S0169207019301153). Unlike traditional forecasting models (e.g., ARIMA or Prophet), which are fit to one time series at a time, a GM is trained on a large collection of time series and can then forecast any new, unseen time series of the same time granularity. Its competition results indicate that GM can achieve high accuracy. Moreover, GM supports batch processing (i.e., generating forecasts for several time series at the same time), which makes it efficient.

In Kats, we build upon the original model and provide two types of GMs: RNN-GM (for short-term forecasting) and S2S-GM (for mid-term/long-term forecasting).
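
To make the "exponential smoothing plus neural network" idea more concrete, here is a minimal sketch of the smoothing side only. It assumes simple (non-seasonal) exponential smoothing with a fixed `alpha`, and the function names are hypothetical. This is not the Kats implementation: in the referenced model the smoothing parameters are learned together with the network, and the network forecasts the rescaled series.

```python
def exponential_smoothing_levels(values, alpha):
    """Return the simple exponential smoothing level at each time step."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    levels = [values[0]]  # initialize the level with the first observation
    for y in values[1:]:
        levels.append(alpha * y + (1.0 - alpha) * levels[-1])
    return levels


def normalize_by_level(values, levels):
    """Rescale each observation by its level (assumes positive levels)."""
    return [y / level for y, level in zip(values, levels)]


series = [10.0, 12.0, 11.0, 13.0]
levels = exponential_smoothing_levels(series, alpha=0.5)
normalized = normalize_by_level(series, levels)
print(levels)  # [10.0, 11.0, 11.0, 12.0]
print(normalized)
```

Rescaling by the level puts series of very different magnitudes on a comparable scale, which is one reason a single network can be shared across many time series.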


## 2. Building your own global model/global ensemble from scratch

The `GMModel` and `GMEnsemble` classes allow you to build a single GM or an ensemble of several independent GMs (GME). The `GMParam` class encodes all the necessary configuration of a GM, including the NN structure, time series granularity, etc. In addition, we provide the `GMBackTester` class for parameter tuning and backtesting.

In this section, we only demonstrate the functionality of each class, so the models are not well trained and may not provide good performance.

### 2.1 Introduction to `GMParam`

A `GMParam` object carries all the necessary configuration of a GM (or a GME), and it performs basic parameter checking when initialized. Here we list several important arguments for `GMParam`:

* **freq**: String or pd.Timedelta; The time granularity of the model (and the time series.) For example, `freq='D'` indicates a daily model.
* **model_type**: String; The name of neural network type. Should be either 'rnn' or 's2s'. Default is 'rnn'.
* **input_window**: Integer; The length of the input time series for each step. It should be greater than `seasonality`.
* **fcst_window**: Integer; An integer representing the length of each forecast step. When `model_type='s2s'`, the loss function is computed over the sub time series of length `fcst_window*fcst_step_num`. Note that GM/GME can generate forecasts of any length regardless of `fcst_window`.
* **seasonality**: Integer; The seasonality period. When `seasonality=1`, the global model is non-seasonal. Default is 1.
* **quantile**: List of floats; A list of floats representing the forecast quantiles (the first element should be 0.5, representing the median). Default is [0.5, 0.05, 0.95, 0.99].
* **nn_structure**: List of lists of integers; A list of lists of integers representing the neural network structure. If None, default value is [[1,3]].
* **loss_function**: String; The name of loss function, can be 'pinball' or 'adjustedpinball'.
* **gmfeature**: List of strings or string; A single or a list of feature names.

For the definition of other parameters, please see our documentation.
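
To illustrate how the `quantile` and `loss_function` arguments relate, here is a minimal sketch of the standard pinball (quantile) loss in plain Python. The function names are hypothetical, and the code is not taken from Kats: the library's `'pinball'` loss may apply additional scaling, and `'adjustedpinball'` is not covered here.

```python
def pinball_loss(actual, forecast, quantile):
    """Standard pinball (quantile) loss for a single observation."""
    diff = actual - forecast
    # Under-forecasts are weighted by `quantile`, over-forecasts by `1 - quantile`.
    return max(quantile * diff, (quantile - 1.0) * diff)


def mean_pinball_loss(actuals, forecasts, quantile):
    """Average pinball loss over paired actuals and forecasts."""
    losses = [pinball_loss(a, f, quantile) for a, f in zip(actuals, forecasts)]
    return sum(losses) / len(losses)


# For the 0.95 quantile, an actual above the forecast costs far more
# than an actual the same distance below it.
print(pinball_loss(10.0, 8.0, 0.95))  # actual is 2 above the forecast
print(pinball_loss(6.0, 8.0, 0.95))   # actual is 2 below the forecast
print(mean_pinball_loss([10.0, 6.0], [8.0, 8.0], 0.5))
```

Minimizing this loss at `quantile=0.5` targets the median, which is why the first element of `quantile` serves as the point forecast.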

In [22]:
import numpy as np
import pandas as pd
import sys
import warnings
import os

warnings.simplefilter(action='ignore')
sys.path.append("../")

from kats.models.globalmodel.utils import GMParam

In [11]:
# GMParam example  -- for daily model
gmparam = GMParam(
    input_window = 35, 
    fcst_window = 31,
    seasonality = 7,
    freq = 'D',
    loss_function = 'adjustedpinball',
    nn_structure = [[1,3]],
    gmfeature = ['last_date'],
    epoch_num = 1, 
    epoch_size = 2, # use a small num just for demonstration
    gmname = "daily_default",
)
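
The constraints listed in this section can be checked before building a `GMParam`. The sketch below collects them in a small hypothetical helper, `check_gm_config`, written in plain Python. It only mirrors the rules stated above (plus the assumption that quantiles lie strictly between 0 and 1) and is not the checking code that `GMParam` itself runs.

```python
def check_gm_config(input_window, seasonality, quantile, model_type="rnn"):
    """Hypothetical helper: return a list of violated constraints (empty if none)."""
    problems = []
    if model_type not in ("rnn", "s2s"):
        problems.append("model_type should be 'rnn' or 's2s'")
    if input_window <= seasonality:
        problems.append("input_window should be greater than seasonality")
    if not quantile or quantile[0] != 0.5:
        problems.append("the first quantile should be 0.5")
    if any(not 0.0 < q < 1.0 for q in quantile):
        problems.append("quantiles should lie strictly between 0 and 1")
    return problems


# The daily configuration used in this section passes every check.
ok = check_gm_config(35, 7, [0.5, 0.05, 0.95, 0.99])
print(ok)  # []

# A window equal to the seasonality and a wrong first quantile are both flagged.
bad = check_gm_config(7, 7, [0.05, 0.5])
print(bad)
```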

### 2.2 Forecasting using a single global model with `GMModel`

After creating a `GMParam` object, you are ready to build and train a single global model. To create a `GMModel` object, you only need to pass in the `GMParam` object.

In [25]:
from kats.models.globalmodel.model import GMModel, load_gmmodel_from_file
from kats.models.globalmodel.serialize import global_model_to_json, load_global_model_from_json
from kats.consts import TimeSeriesData

# build `GMModel` object

gm = GMModel(gmparam)

To train a `GMModel` object, we need a list or a dictionary of time series. 

In [30]:
# helper function for simulating random time series.

def get_ts(n, start_time, freq='D', has_nans=True):
    """Function for simulating time series.
    
    Args: 
        n: An integer representing the length of time series.
        start_time: A string representing the starting timestamp of time series.
        freq: A string representing the time granularity of time series.
        has_nans: A boolean representing whether or not time series has NaNs.
    
    Returns:
        A simulated time series.
    """
    t = pd.Series(pd.date_range(start_time, freq=freq, periods=n))
    val = np.random.randn(n)
    if has_nans:
        idx = np.random.choice(range(n), int(n*0.2), replace=False)
        val[idx]=np.nan
    val = pd.Series(val)
    return TimeSeriesData(time=t, value=val)

train_TSs = {i: get_ts(n*5, '2020-05-06') for i, n in enumerate(list(range(20, 40)))}
test_TSs = {i: get_ts(n*2, '2020-05-06') for i, n in enumerate(list(range(40, 42)))}

# train the model
training_info = gm.train(train_TSs)

# training_info stores information about the training process
print(training_info)

{'train_loss_monitor': [0.1050752], 'valid_loss_monitor': [{'epoch': 0}], 'valid_fcst_monitor': [], 'train_loss_val': [0.21015039831399918]}


Now we can use the trained model to generate forecasts. The input can be a `TimeSeriesData` object or a list/dictionary of `TimeSeriesData` objects. The returned value is a dictionary of `pd.DataFrame` objects.

In [18]:
test_ts = get_ts(30, '2020-05-06')
fcst = gm.predict(test_ts, steps = 3)
print(f"The dataframe of forecasts is {fcst}.")
print("=="*60)

# generate the forecasts of a batch of time series.
fcsts = gm.predict(test_TSs, steps = 3)
print(f"The dictionary of forecasts is {fcsts}.")

The dataframe of forecasts is {0:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0            1.07464            0.107831            1.566863   
1            1.71647            0.569356            1.982417   
2           -0.47167           -0.719296           -0.026999   

   fcst_quantile_0.99       time  
0            1.359847 2020-06-05  
1            1.931908 2020-06-06  
2           -0.571826 2020-06-07  }.
The dictionary of forecasts is {0:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0          -0.835291           -1.531433           -0.507440   
1          -0.916999           -1.478367           -0.646848   
2          -1.067785           -1.373956           -0.773177   

   fcst_quantile_0.99       time  
0           -0.677720 2020-07-25  
1           -0.658723 2020-07-26  
2           -1.174261 2020-07-27  , 1:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0          -1.011401           -1.823488           -0.651751   
1  
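
The `time` column in the output above continues each input series' own calendar: forecasts start one period after the last observed timestamp. A minimal sketch of that date arithmetic for daily data, using only the standard library (the helper name `forecast_dates` is hypothetical and not part of Kats):

```python
from datetime import date, timedelta


def forecast_dates(start, n_obs, steps):
    """Dates of the `steps` daily forecasts that follow `n_obs` daily observations."""
    last_observed = start + timedelta(days=n_obs - 1)
    return [last_observed + timedelta(days=i) for i in range(1, steps + 1)]


# A 30-day series starting on 2020-05-06 ends on 2020-06-04,
# so a 3-step forecast covers 2020-06-05 through 2020-06-07.
dates = forecast_dates(date(2020, 5, 6), n_obs=30, steps=3)
print([d.isoformat() for d in dates])
# ['2020-06-05', '2020-06-06', '2020-06-07']
```

The same arithmetic explains why the two series in `test_TSs` (80 and 82 observations) have forecasts starting on 2020-07-25 and 2020-07-27, respectively.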

Next, we show how to save and reload the model.

In [23]:
# save model
gm.save_model("gm_example_1.p")

# load model
gm2 = load_gmmodel_from_file("gm_example_1.p")

# remove the saved model
os.remove("gm_example_1.p")

We also provide functions for encoding a GM into a JSON string and for loading a model from a JSON string.

In [31]:
# encode model into json string
gm_str = global_model_to_json(gm)

# load model from json string
gm3 = load_global_model_from_json(gm_str)

### 2.3 Forecasting using a global model ensemble with `GMEnsemble`

You can also easily build an ensemble of several individual GMs with the `GMEnsemble` class. In addition to a `GMParam` object, you need to specify how the training data set should be split and how many independent `GMModel` objects should be trained.

Here is the list of attributes:
* **gmparam**: A `GMParam` object; This is used for initiating each global model.
* **ensemble_type**: String; A string representing how forecasts are combined. Can be 'median' or 'mean'. Default is 'median'.
* **splits**: Integer; A positive integer representing the number of sub-datasets to be built. Default is 3.
* **overlap**: Boolean; A boolean representing whether or not sub-datasets overlap with each other. Default is True. For example, when `splits=3` and `overlap=True`, each sub-dataset contains 2/3 of the training data.
* **replicate**: Integer; A positive integer representing the number of global models to be trained on each sub-dataset. Default is 1.
* **multi**: Boolean; A boolean representing whether or not to use multi-processing for training and prediction. Default is False.

Note that a `GMEnsemble` object builds `splits*replicate` independent `GMModel` objects, and the final forecasts are aggregated from the forecasts generated by each trained `GMModel` object.
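
One way to picture `splits` and `overlap` is the following sketch, which assumes a fold-based scheme: the series are cut into `splits` folds, and with `overlap=True` each sub-dataset leaves one fold out. This matches the 2/3 figure quoted above for `splits=3`, but the helper `split_dataset` is hypothetical and the actual assignment in Kats (for example, any shuffling) may differ.

```python
def split_dataset(keys, splits=3, overlap=True):
    """Assumed scheme: build `splits` folds; with overlap, each sub-dataset drops one fold."""
    folds = [keys[i::splits] for i in range(splits)]
    if not overlap:
        return folds  # disjoint sub-datasets, each about 1/splits of the data
    return [
        [k for j, fold in enumerate(folds) if j != i for k in fold]
        for i in range(splits)
    ]


keys = list(range(6))  # stand-ins for the keys of a training dictionary
overlapping = split_dataset(keys, splits=3, overlap=True)
disjoint = split_dataset(keys, splits=3, overlap=False)
for subset in overlapping:
    print(sorted(subset))  # each sub-dataset holds 4 of the 6 keys

replicate = 2
print(len(overlapping) * replicate)  # number of models: splits * replicate = 6
```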

In [33]:
from kats.models.globalmodel.ensemble import GMEnsemble, load_gmensemble_from_file

# Initiate 
gme = GMEnsemble(gmparam, splits=3, overlap=True, replicate=1, multi=True)


Now we can train the `GMEnsemble` object. Note that one has the choice of setting aside a test set from the training data to measure the performance of each `GMModel` object throughout the training process.

In [35]:
gme.train(train_TSs, test_size = 0.1)



In [38]:
# the information of training process and the evaluation results on the set-aside test set are saved in attribute gm_info.
gme.gm_info

[{'train_loss_monitor': [0.09529682],
  'valid_loss_monitor': [{'epoch': 0}],
  'valid_fcst_monitor': [],
  'train_loss_val': [0.29780256748199463],
  'test_info': [      smape     sbias  exceed_0.05  exceed_0.95  exceed_0.99  step  idx  epoch
   0  1.718492 -1.418439     0.096774     0.677419     0.677419     0    9      0
   1  1.551402 -0.258064     0.354839     0.419355     0.516129     0   11      0]},
 {'train_loss_monitor': [0.093351685],
  'valid_loss_monitor': [{'epoch': 0}],
  'valid_fcst_monitor': [],
  'train_loss_val': [0.19837233051657677],
  'test_info': [      smape     sbias  exceed_0.05  exceed_0.95  exceed_0.99  step  idx  epoch
   0  1.749226 -1.445085     0.064516     0.709677     0.709677     0    9      0
   1  1.678446 -0.379500     0.387097     0.451613     0.516129     0   11      0]},
 {'train_loss_monitor': [0.104721874],
  'valid_loss_monitor': [{'epoch': 0}],
  'valid_fcst_monitor': [],
  'train_loss_val': [0.340346097946167],
  'test_info': [      smape  

After training the `GMEnsemble` object, you can use it to generate forecasts. As with the `GMModel` object, the input can be a `TimeSeriesData` object or a list/dictionary of `TimeSeriesData` objects, and the returned value is a dictionary of `pd.DataFrame` objects.

In [39]:
# generate forecasts
fcsts=gme.predict(test_TSs, steps = 3)
print(f"The generated forecasts is of type {type(fcsts)}, and it is {fcsts}.")

The generated forecasts is of type <class 'dict'>, and it is {0:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0          -0.220554            0.013514            0.044756   
1          -0.016261           -0.315333           -0.010909   
2           0.446663            0.037812            0.246893   

   fcst_quantile_0.99       time  
0           -0.527834 2020-07-25  
1           -0.426698 2020-07-26  
2           -0.338622 2020-07-27  , 1:    fcst_quantile_0.5  fcst_quantile_0.05  fcst_quantile_0.95  \
0          -0.948326           -0.751360           -0.743056   
1          -0.470674           -0.765599           -0.452450   
2           0.806719            0.348096            0.613019   

   fcst_quantile_0.99       time  
0           -1.233405 2020-07-27  
1           -0.858695 2020-07-28  
2           -0.111488 2020-07-29  }.
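
The `ensemble_type` setting determines how the individual models' forecasts are merged into the single forecast shown above. A minimal sketch of step-by-step median or mean aggregation in plain Python follows. The helper `combine_forecasts` is hypothetical, and whether Kats applies the aggregation to each quantile column separately is an assumption here.

```python
from statistics import mean, median


def combine_forecasts(model_forecasts, ensemble_type="median"):
    """Combine forecasts from several models, one forecast step at a time."""
    if ensemble_type not in ("median", "mean"):
        raise ValueError("ensemble_type should be 'median' or 'mean'")
    combine = median if ensemble_type == "median" else mean
    # zip(*...) groups the values that different models produced for the same step.
    return [combine(step_values) for step_values in zip(*model_forecasts)]


# Three models, each forecasting three steps ahead.
model_forecasts = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 9.0],
    [3.0, 5.0, 4.0],
]
combined_median = combine_forecasts(model_forecasts, "median")
combined_mean = combine_forecasts(model_forecasts, "mean")
print(combined_median)  # [2.0, 2.0, 4.0]
print(combined_mean)
```

The median is less sensitive than the mean to a single model producing an outlying forecast (such as the 9.0 in the last step).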


Similar to the `GMModel` object, you can also easily save/load and serialize the `GMEnsemble` object.

In [41]:
# save model
gme.save_model("gme_example_1.p")

# load model
gme2 = load_gmensemble_from_file("gme_example_1.p")

# remove the saved model
os.remove("gme_example_1.p")


# encode model into json string
gme.gm_info = None  # the pd.DataFrame objects in gm_info are not JSON-serializable
gme_str = global_model_to_json(gme)

# load model from json string
gme3 = load_global_model_from_json(gme_str)

### 2.4 Backtesting with `GMBackTester`

A `GMBackTester` object helps evaluate a hyper-parameter setting (i.e., a `GMParam` object). Here is a list of some of its attributes:
* **data**: A list or a dictionary of `kats.consts.TimeSeriesData` objects for training and validation.
* **gmparam**: A `GMParam` object.
* **backtest_timestamp**: A list of strings or `pandas.Timestamp` objects representing the backtest timestamps. A backtest timestamp is used to split the time series into the training and testing set.
* **splits**: Integer; A positive integer representing the number of sub-datasets to be built. Default is 3.
* **overlap**: Boolean; A boolean representing whether or not sub-datasets overlap with each other. Default is True. For example, when `splits=3` and `overlap=True`, each sub-dataset contains 2/3 of the training data.
* **replicate**: Integer; A positive integer representing the number of global models to be trained on each sub-dataset. Default is 1.

For the full list of attributes, please see our documentation.
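
The role of a backtest timestamp can be sketched as a simple split of each series into a training part and a testing part. The helper below is hypothetical and uses only the standard library. Which side the observation at the timestamp itself belongs to is an assumption made for illustration and may differ in Kats.

```python
from datetime import date, timedelta


def split_at_timestamp(times, values, backtest_timestamp):
    """Assumed convention: observations before the timestamp train, the rest test."""
    pairs = list(zip(times, values))
    train = [(t, v) for t, v in pairs if t < backtest_timestamp]
    test = [(t, v) for t, v in pairs if t >= backtest_timestamp]
    return train, test


# A daily series of 100 points starting on 2020-05-06 (it ends on 2020-08-13).
times = [date(2020, 5, 6) + timedelta(days=i) for i in range(100)]
values = [float(i) for i in range(100)]
train, test = split_at_timestamp(times, values, date(2020, 8, 10))
print(len(train), len(test))  # 96 4
```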

In [53]:
from kats.models.globalmodel.backtester import GMBackTester

# initiate backtester
gbm = GMBackTester(train_TSs, gmparam, backtest_timestamp = ['2020-08-10'])

Now we can run the backtest.

In [55]:
gbm.run_backtest()

| | smape | sbias | exceed_0.05 | exceed_0.95 | exceed_0.99 | model_num | step | idx | type | backtest_ts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.112656 | -1.112656 | 0.0 | 0.032258 | 0.032258 | 0.0 | 0 | 0.0 | single | 2020-08-10 |
| 1 | 1.104074 | -0.895926 | 0.0 | 0.064516 | 0.032258 | 1.0 | 0 | 0.0 | single | 2020-08-10 |
| 2 | 1.056552 | -1.056552 | 0.0 | 0.064516 | 0.064516 | 2.0 | 0 | 0.0 | single | 2020-08-10 |
| 3 | 1.056552 | -1.056552 | 0.0 | 0.064516 | 0.032258 | | 0 | | ensemble | 2020-08-10 |
| 4 | 1.42255 | 0.068835 | 0.483871 | 0.419355 | 0.225806 | 0.0 | 0 | 9.0 | single | 2020-08-10 |
| 5 | 1.476043 | 0.218554 | 0.387097 | 0.258065 | 0.322581 | 1.0 | 0 | 9.0 | single | 2020-08-10 |
| 6 | 1.413662 | 0.106427 | 0.548387 | 0.354839 | 0.419355 | 2.0 | 0 | 9.0 | single | 2020-08-10 |
| 7 | 1.4259 | 0.148917 | 0.516129 | 0.387097 | 0.322581 | | 0 | | ensemble | 2020-08-10 |
| 8 | 1.041351 | 0.590719 | 0.064516 | 0.129032 | 0.258065 | 0.0 | 1 | 9.0 | single | 2020-08-10 |
| 9 | 1.318261 | 0.24254 | 0.193548 | 0.096774 | 0.225806 | 1.0 | 1 | 9.0 | single | 2020-08-10 |
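
To help read the `smape` and `exceed_*` columns, here is a sketch of how such metrics are commonly computed. It assumes the symmetric MAPE on the 0 to 2 scale (consistent with the values above 1 in the results) and defines the exceedance rate as the share of actuals above a quantile forecast. Both definitions are assumptions for illustration, and the exact formulas used by the backtester (including `sbias`) may differ.

```python
def smape(actuals, forecasts):
    """Symmetric MAPE on the 0-2 scale: mean of 2|f - y| / (|y| + |f|)."""
    # Note: a term is undefined when both the actual and the forecast are zero.
    terms = [2.0 * abs(f - y) / (abs(y) + abs(f)) for y, f in zip(actuals, forecasts)]
    return sum(terms) / len(terms)


def exceed_rate(actuals, quantile_forecasts):
    """Share of actual values that lie above the given quantile forecast."""
    exceed_count = sum(y > f for y, f in zip(actuals, quantile_forecasts))
    return exceed_count / len(actuals)


actuals = [1.0, 2.0, 4.0]
median_fcst = [1.0, 4.0, 4.0]
q95_fcst = [2.0, 1.0, 5.0]
print(smape(actuals, median_fcst))     # only the second point contributes: (2*2/6) / 3
print(exceed_rate(actuals, q95_fcst))  # one of three actuals is above its 0.95 forecast
```

Under this definition, a well-calibrated 0.95 quantile forecast would be exceeded by roughly 5% of actuals, so rates far from that suggest poorly calibrated intervals, as expected for the barely trained models in this tutorial.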


## 3. Using pretrained global model/global ensemble

In Kats, we provide two pre-trained daily `GMEnsemble` objects (one is an S2S-GME and the other is an RNN-GME). Both are trained on the M4 dataset. You can use them for forecasting exploration or as benchmarks.

In [58]:
gme_rnn = load_gmensemble_from_file("../kats/models/globalmodel/pretrained_daily_rnn.p")
gme_rnn

<kats.models.globalmodel.ensemble.GMEnsemble at 0x7f9978740e20>

You can use this loaded pre-trained model to generate forecasts.

In [59]:
fcsts = gme_rnn.predict(test_TSs, steps = 3)
fcsts

{0:    fcst_quantile_0.5  fcst_quantile_0.01  fcst_quantile_0.05  \
 0          -0.409422           -0.738302           -0.612039   
 1          -0.341009           -0.804592           -0.612023   
 2          -0.012784           -0.591704           -0.360692   
 
    fcst_quantile_0.95  fcst_quantile_0.99       time  
 0           -0.231685           -0.111664 2020-07-25  
 1           -0.089894            0.092455 2020-07-26  
 2            0.317086            0.563282 2020-07-27  ,
 1:    fcst_quantile_0.5  fcst_quantile_0.01  fcst_quantile_0.05  \
 0          -0.578904           -0.850013           -0.740762   
 1          -0.414955           -0.771584           -0.641951   
 2          -0.019759           -0.485320           -0.311198   
 
    fcst_quantile_0.95  fcst_quantile_0.99       time  
 0           -0.475661           -0.388489 2020-07-27  
 1           -0.259640           -0.137088 2020-07-28  
 2            0.232197            0.349838 2020-07-29  }