# Kats 206 ML_AR Traditional Machine Learning (LightGBM) forecasting: Elaborate example

This tutorial introduces the LightGBM-based forecasting model in Kats, which is designed for time series modeling.

1. Overview of LightGBM for Time Series Forecasting  
2. Build Your Own LightGBM Model From Scratch  
    2.1 Introduction to `MLARParams`  
    2.2 Forecasting using LightGBM model with `MLARModel`  
    2.3 Forecasting with Hyper-parameter tuning  

**Note:** We provide two types of tutorial notebooks:
- **Kats 101**, basic data structure and functionalities in Kats 
- **Kats 20x**, advanced topics, including advanced forecasting techniques, advanced detection algorithms, `TsFeatures`, meta-learning, global model etc. 

In this section we discuss a more elaborate example, with external features and some other characteristics.



In [None]:
%%capture
# For Google Colab:
!pip install kats

In [8]:
import pandas as pd
import numpy as np
from kats.consts import TimeSeriesData
from kats.models.ml_ar import MLARParams, MLARModel

import matplotlib.pyplot as plt



Load simulated data. 

In [None]:
import random
import string

# for c in input_df_orig.columns:
#     print(f"----{c}---")
#     print(input_df_orig[c].value_counts())

# min(input_df_orig["ticket_dollar"])
# max(input_df_orig["ticket_dollar"])

company = ["AA", "UA", "SOU", "Del"]
direction = ["departure", "arrival"]

ds = pd.Series(
    pd.date_range(
        "2021-01-01",
        "2022-05-01",
        freq="D",
    )
)

terminal = []
for i in range(20):
    terminal.append(''.join(random.choices(string.ascii_uppercase + string.digits, k=3)))

k_people = np.random.randint(10000, high=100000, size=len(terminal)*len(company)*len(direction)*len(ds), dtype=int)/1000
ticket_dollar = np.random.randint(6600, high=36600, size=len(terminal)*len(company)*len(direction)*len(ds), dtype=int)/100

import itertools

all_combinations = list(itertools.product(ds, terminal, company, direction))
input_df_orig = pd.DataFrame(all_combinations, columns=['ds','terminal', 'company', 'direction'])
input_df_orig["k_people"] = k_people
input_df_orig["ticket_dollar"] = ticket_dollar
input_df_orig.set_index(["ds"], inplace=True)

input_df = (
    input_df_orig.groupby(
        ["terminal", "direction", "company"]
    )
    .resample("W")[["k_people", "ticket_dollar"]]
    .quantile(q=0.99)
    .reset_index()
)

input_df_orig
input_df

We need to map the categorical feature columns to a numerical type.

In [None]:
# Encode each categorical column as integer codes (sorted values -> 0, 1, 2, ...)
for col in ["terminal", "direction", "company"]:
    unique_cat = sorted(input_df[col].unique())
    density_map = dict(zip(unique_cat, range(len(unique_cat))))
    input_df[f"{col}_n"] = input_df[col].map(density_map)
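
To make the encoding concrete, here is a minimal, self-contained sketch of the same sorted-value-to-integer mapping on a toy list. The sample values come from the `direction` column above; the variable names `values`, `mapping` and `encoded` are illustrative only.

```python
# Toy version of the mapping used above: sorted unique values -> 0, 1, 2, ...
values = ["departure", "arrival", "departure", "arrival"]

unique_cat = sorted(set(values))  # alphabetical order
mapping = dict(zip(unique_cat, range(len(unique_cat))))
encoded = [mapping[v] for v in values]

print(mapping)  # {'arrival': 0, 'departure': 1}
print(encoded)  # [1, 0, 1, 0]
```

The integer codes only identify the category; their order is an artifact of sorting, which is why the encoded columns are later passed to the model as categoricals rather than as ordinary numeric features.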


When converting the dataframe into a list or dictionary of Kats time series data, we specify the target variables, the feature list, and the categorical features:

* target variable(s): ["ticket_dollar"]
* feature(s): ["k_people"]
* categorical(s): ["terminal_n", "direction_n", "company_n"]

In [None]:
variables = ["ticket_dollar"] # can be multiple variables
feature_list = ["k_people"]
categoricals = ["terminal_n", "direction_n", "company_n"]
# dictionary of Kats Time Series Data
grouped = input_df.groupby(["terminal", "direction", "company"])[["ds"] + variables + categoricals + feature_list]
data = [] # list of time series 
data2 = {} # dictionary of time series 
for name, group in grouped:
    data.append(TimeSeriesData(
        group[["ds"] + variables + feature_list + categoricals],
        time_col_name="ds",
    ))
    data2[name] = TimeSeriesData(
        group[["ds"] + variables + feature_list + categoricals],
        time_col_name="ds",
    )


Below we initialize an `MLARParams` instance that we will use 
to train a weekly model without any seasonality assumption. 

In [None]:
lightgbm_params = MLARParams(
    boosting_type="gbdt",
    objective="quantile",
    target_variable=variables,
    horizon=2,
    input_window=5,
    freq="W",

    cov_history_input_windows={"k_people":5},
    categoricals=categoricals,

    n_estimators = 300,
    max_depth = 5,
    learning_rate = 1.3,
    min_split_gain = 0.0181,
    num_leaves = 160
)

### 2.2 Forecasting using LightGBM model with `MLAR`

We can initialize an MLAR object as follows.

In [None]:
model = MLARModel(lightgbm_params)

Now we can train the `MLAR` object.

In [None]:
import time

start_time=time.time()
model.train(data)
end_time=time.time()

mod_name = "Model 1"
run_time = end_time-start_time
print('Model: {} run time: {} seconds'.format(mod_name, run_time))


We generate predictions for a given number of steps. Note that the number of steps cannot exceed the `horizon` parameter we set in `MLARParams`.

In [None]:
forecast = model.predict(steps=2)
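
The constraint between `steps` and `horizon` can be expressed as a small guard. This is only a sketch: `validate_steps` is a hypothetical helper written for this tutorial, not a Kats function, and it says nothing about how Kats itself reacts to an oversized request.

```python
def validate_steps(steps: int, horizon: int) -> int:
    """Hypothetical guard: reject requests beyond the trained horizon."""
    if not 1 <= steps <= horizon:
        raise ValueError(
            f"steps={steps} must be between 1 and horizon={horizon}"
        )
    return steps


# The model above was configured with horizon=2, so 2 steps is the maximum.
steps = validate_steps(2, horizon=2)
print(steps)  # 2
```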

The `train_data_in` attribute of the model object shows how regressors, categorical features, and multiple target variables are arranged in the feature space. Note that we enlarge the feature space with several other important features, such as calendar features to capture seasonality, lags, and `min`, `max`, `mean` (of the target variable time series), to improve the model's performance.

In [None]:
model.train_data.head()

train_data_in = pd.DataFrame(model.train_data_in)
train_data_in.columns = model.feature_columns

train_data_in.info()
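
As a rough illustration of how lags and window statistics turn a series into supervised rows, the sketch below builds them by hand with a sliding window. It is not the Kats implementation: `make_window_features` and the `lag_k` column names are assumptions made for this example, and calendar features are left out.

```python
from statistics import mean


def make_window_features(series, input_window, horizon):
    """Build (features, targets) rows from one series with a sliding window.

    Illustrative only; the real feature construction inside Kats may differ.
    """
    rows = []
    for end in range(input_window, len(series) - horizon + 1):
        window = series[end - input_window:end]
        # lag_1 is the most recent observation, lag_<input_window> the oldest
        features = {f"lag_{input_window - j}": v for j, v in enumerate(window)}
        features.update(min=min(window), max=max(window), mean=mean(window))
        targets = series[end:end + horizon]  # the next `horizon` values
        rows.append((features, targets))
    return rows


rows = make_window_features([1, 2, 3, 4, 5, 6, 7, 8], input_window=5, horizon=2)
for features, targets in rows:
    print(features, "->", targets)
```

With `input_window=5` and `horizon=2`, an 8-point series yields two rows: each uses five past values as lags and the next two values as targets.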

After training, we can check forecasting results by category.

In [None]:
# forecasting results 
len(forecast)

In [None]:
forecast[0]

Next, we plot each series together with its forecast.

In [None]:
from matplotlib.backends.backend_pdf import PdfPages

non_interactive_plotting = True

plotfile = 'ans.pdf'

with PdfPages(plotfile) as pdf:
    for i in range(len(data)):

        curr_data = data[i]["ticket_dollar"].to_dataframe().rename(columns = {"ds" : "time"}).set_index("time")
        curr_fc = forecast[i][["time", "forecast"]].set_index("time")

        plot_df = curr_data.join(curr_fc, how="outer")

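        # DataFrame.plot creates its own figure, so no separate plt.figure() is needed
        ax = plot_df.plot(figsize=(50,20), title=f"Series: {i}")

        if non_interactive_plotting:
            # save before showing: plt.show() may clear the current figure
            pdf.savefig(ax.figure)
        plt.show()
        plt.close(ax.figure)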


In [None]:
# train process for dictionary of time series 

model = MLARModel(lightgbm_params)
model.train(data2)
forecast = model.predict(steps=2)

model.train_data.head()

model.train_data.columns

### 2.3  Forecasting with Hyper-parameter tuning 
As with other Kats models, we can use `time_series_parameter_tuning` for hyper-parameter tuning. Specifically, we define an `evaluation_function` that builds an `MLARParams` object for the `MLAR` model. In the parameter settings, `n_estimators`, `max_depth`, `learning_rate`, `min_split_gain`, and `num_leaves` are set up for hyper-parameter tuning. 


In [None]:
import kats.utils.time_series_parameter_tuning as tpt
tpt.SearchMethodFactory.create_search_method
# from hyperopt.early_stop import no_progress_loss
from kats.consts import TimeSeriesData, SearchMethodEnum
from kats.utils.parameter_tuning_utils import get_default_lightgbm_parameter_search_space


We can define the evaluation function as follows. Inside the function, we initialize `MLARParams` with the parameters that are not in the search space; parameters that are neither set here nor searched will take their default values. 


In [None]:
def evaluation_function(params):
#     try:
    lightgbm_params = MLARParams(
        boosting_type = "gbdt",
        objective = "quantile", 
        target_variable = variables,
        horizon = 3,
        input_window = 7,
        freq="W",
        #expand_feature_space=feature_list,
        cov_history_input_windows={"k_people":5},
        categoricals = categoricals,   
        calculate_fit = True,  
        #missing_threshold = 0,
        #num_missing_to_drop = 100,
        **params)
    
    model = MLARModel(lightgbm_params)
    model.train(data)
    model._predict()

    print(model.train_data)
    error = np.mean(np.abs(model.train_data['output'].values - model.train_data["forecast"].values))

    return error
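
The score returned by `evaluation_function` is a mean absolute error between the `output` and `forecast` columns of the training frame. The standalone sketch below shows the same calculation on made-up numbers; `mean_absolute_error` is a helper defined here for illustration, and the values are not taken from the model.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values, for illustration only
actual = [100.0, 120.0, 90.0]
predicted = [110.0, 115.0, 95.0]

mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))  # 6.67
```

A lower value means the fitted values track the observed targets more closely, so the tuner treats this number as the quantity to minimize.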

Then we use `tpt.SearchMethodFactory.create_search_method` to initialize the search algorithm, such as random search, and to set the `bootstrap_size`. 


In [None]:
#TODO: replace get_default_lightgbm_parameter_search_space with get_parameter_search_space from ml_ar

from kats.utils.parameter_tuning_utils import (
    get_default_lightgbm_parameter_search_space,
)

parameter_tuner = tpt.SearchMethodFactory.create_search_method(
    objective_name="hpt_example",
    parameters=get_default_lightgbm_parameter_search_space(),  # using default parameter space
    selected_search_method=SearchMethodEnum.RANDOM_SEARCH_SOBOL,
    evaluation_function=evaluation_function, 
    bootstrap_size=5,
)


In [None]:
parameter_tuner.list_parameter_value_scores()

The `arm_count` argument sets how many parameter combinations (arms) are generated and evaluated in this round of the search.

In [None]:
parameter_tuner.generate_evaluate_new_parameter_values(
    evaluation_function=evaluation_function, arm_count=5
)
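
To show the idea behind evaluating a fixed number of arms, here is a minimal random-search sketch in plain Python. It is not how Kats implements the search: it uses uniform random sampling instead of a Sobol sequence, and `random_search`, `toy_score`, and the small `space` dictionary are all hypothetical names and values made up for this example.

```python
import random


def random_search(evaluate, space, arm_count, seed=0):
    """Sample `arm_count` parameter sets, score each, return them best-first."""
    rng = random.Random(seed)
    results = []
    for _ in range(arm_count):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        results.append((evaluate(params), params))
    # sort by score only, so ties never fall back to comparing dicts
    return sorted(results, key=lambda item: item[0])


# Hypothetical search space and scoring function, for illustration only
space = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.05, 0.1]}


def toy_score(params):
    return abs(params["max_depth"] - 5) + abs(params["learning_rate"] - 0.05)


ranked = random_search(toy_score, space, arm_count=5)
best_score, best_params = ranked[0]
print(best_score, best_params)
```

Each arm is one sampled parameter set with its score; a larger `arm_count` explores more of the space at the cost of more model fits.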


We can print out the randomly selected parameters and their scores.

In [None]:
parameter_tuner.list_parameter_value_scores()

#parameter_tuner.list_parameter_value_scores()["parameters"][3]


## Non-Categorical Feature Example 

In [None]:
variables = ["ticket_dollar"] # can be multiple variables
feature_list = ["k_people"]
categoricals = []
# dictionary of Kats Time Series Data
grouped = input_df.groupby(["terminal", "direction", "company"])[["ds"] + variables + feature_list]
data = []
for name, group in grouped:
    data.append(TimeSeriesData(
        group[["ds"] + variables + feature_list],
        time_col_name="ds",
    ))

In [None]:
len(data)

In [None]:
data[0]

In [None]:
lightgbm_params = MLARParams(
    boosting_type="gbdt",
    objective="quantile",
    target_variable=variables,
    horizon=2,
    input_window=5,
    freq="W",
    cov_history_input_windows={"k_people":5},
    categoricals=categoricals,
    #missing_threshold=0,
    #num_missing_to_drop=10,
    n_estimators = 300,
    max_depth = 5,
    learning_rate = 1.3,
    min_split_gain = 0.0181,
    num_leaves = 160,
    calculate_fit = True
)

In [1]:
model = MLARModel(lightgbm_params)

model.train(data)

forecast = model.predict(steps=2)

model.train_data.head()

# forecasting results 
len(forecast)

forecast[0]

