## Analyzing the Effects of Top/Bottom Coding on the Accuracy of Global LGBM Forecasts

***

## Import Modules

In [1]:
# general modules
import numpy as np
import pandas as pd

# light gradient boosting model package
import lightgbm as lgb

# `helper_functions.py` contains custom functions we wrote to aid in our analysis.
# `full_coding_analysis` combines the full pipeline: train-test split, data protection,
# model training, accuracy comparison, and returning the accuracy results.
from helper_functions import full_coding_analysis
from helper_functions import *

# import detrender
from sktime.transformations.series.detrend import Detrender
# nice time series plots
from sktime.utils.plotting import plot_series

## Import data

In [2]:
# import weekly finance time series
Y = np.genfromtxt("../../Data/Train/Clean/weekly_finance_clean.csv", delimiter = ',', skip_header = 1)
Y = pd.DataFrame(Y)

This notebook experiments with applying top and bottom coding to detrended data. Intuition suggests that detrended series are better candidates for this type of protection than the original finance series, so we first remove the trend from each series.

In [3]:
# fit the detrender to each series (one series per row of Y) and keep the detrended values
detrender = Detrender()
detrended_series = [detrender.fit_transform(series) for _, series in Y.iterrows()]

In [4]:
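# shift each detrended series upward by |min| + 1 so that all values are strictly positive
detrended_series = [i + np.abs(np.min(i)) + 1.0 for i in detrended_series]
# reassemble the shifted series into a DataFrame with one series per row
Y = pd.concat(detrended_series, axis=1).T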
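The effect of the shift applied to the detrended series can be checked on a toy example. The values below are made up for illustration and are not taken from the finance data; the point is only that a series with negative residuals ends up with a minimum of exactly 1.0.

```python
import numpy as np

# toy "detrended" series (hypothetical values): residuals around zero, some negative
residuals = np.array([-2.5, 0.0, 1.5, -0.5])

# same shift as in the notebook: add |min| + 1 to every value
shifted = residuals + np.abs(np.min(residuals)) + 1.0

# the smallest value is now 1.0 and the spacing between values is unchanged
print(shifted.min())
print(np.diff(shifted) - np.diff(residuals))
```

Because the shift is a constant per series, it changes the level of each series but not its shape.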

***

We obtain results for every combination of forecast horizon, coding type (top or bottom), coding percentage, and model complexity:

* Forecast Horizons: (1, 5, 15)
* Coding Types: (Top, Bottom)
* Coding Percentages: (0.10, 0.20, 0.40)
* Model Complexities (window length): (10, 20, 40)
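The protection step itself lives inside `full_coding_analysis` and is not shown in this notebook. As a rough illustration of the idea, the sketch below assumes that top coding caps values above the (1 - p) quantile of a series and that bottom coding raises values below the p quantile. The function name `code_series` and the choice of per-series quantile thresholds are assumptions for this example, not the `helper_functions` implementation.

```python
import numpy as np

def code_series(y, coding_type="Top", coding_percentage=0.10):
    """Hypothetical sketch of top/bottom coding on a single series."""
    y = np.asarray(y, dtype=float)
    if coding_type == "Top":
        # cap the largest values at the (1 - p) quantile
        threshold = np.quantile(y, 1.0 - coding_percentage)
        return np.minimum(y, threshold)
    if coding_type == "Bottom":
        # raise the smallest values to the p quantile
        threshold = np.quantile(y, coding_percentage)
        return np.maximum(y, threshold)
    raise ValueError("coding_type must be 'Top' or 'Bottom'")

# toy series 1, 2, ..., 10 (made-up values)
y = np.arange(1.0, 11.0)
top_coded = code_series(y, "Top", 0.20)
bottom_coded = code_series(y, "Bottom", 0.20)

print(top_coded)
print(bottom_coded)
```

Under these assumptions, a larger coding percentage replaces more of each series with the threshold value, which is why the grid above varies it from 0.10 to 0.40.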

## Simple Model (window length = 10)

In [5]:
# LightGBM regressor with default hyperparameters
forecaster = lgb.LGBMRegressor()
# number of lagged values used as model inputs
window_length = 10

In [6]:
results_dict_10 = {}
types = ["Top", "Bottom"]
percentages = [0.10, 0.20, 0.40]
horizons = [1, 5, 15]

In [7]:
for t in types:
    for p in percentages:
        for h in horizons:
            # key format: "h=<horizon>, <coding type> <coding percentage>"
            results_dict_10[f"h={h}, {t} {p}"] = full_coding_analysis(
                Y, forecaster, forecast_horizon=h, coding_type=t,
                coding_percentage=p, window_length=window_length)
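The same triple loop is repeated for each window length, so it could be wrapped in a small function. The sketch below is one possible way to do that with `itertools.product`; `run_grid` and `fake_analysis` are hypothetical names introduced here, and `fake_analysis` is only a stand-in so the example runs without `helper_functions`. In the notebook, `full_coding_analysis` would be passed in its place.

```python
from itertools import product

def run_grid(analysis_fn, Y, forecaster, window_length,
             types=("Top", "Bottom"),
             percentages=(0.10, 0.20, 0.40),
             horizons=(1, 5, 15)):
    """Run analysis_fn over every (type, percentage, horizon) combination."""
    results = {}
    # product iterates horizons fastest, matching the nested loops in the notebook
    for t, p, h in product(types, percentages, horizons):
        results[f"h={h}, {t} {p}"] = analysis_fn(
            Y, forecaster, forecast_horizon=h, coding_type=t,
            coding_percentage=p, window_length=window_length)
    return results

def fake_analysis(Y, forecaster, **kwargs):
    # stand-in for full_coding_analysis: just echoes the settings it received
    return kwargs

demo = run_grid(fake_analysis, Y=None, forecaster=None, window_length=10)
print(len(demo))
print(list(demo)[:3])
```

The keys produced this way have the same format as those in the results dictionaries, for example `'h=1, Top 0.1'`.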

In [8]:
results_dict_10

{'h=1, Top 0.1': {'% of forecasted points adjusted downward:': 35.4,
  '% of forecasted points adjusted upward:': 64.60000000000001,
  '% Series with improved accuracy:': array([43.3, 43.3]),
  '% Series with worsened accuracy:': array([56.7, 56.7]),
  '% Series with unchanged accuracy:': array([0., 0.]),
  '% Change mean global accuracy:': array([-6.3, -6.3]),
  '% Change median global accuracy:': array([-20.9, -20.9])},
 'h=5, Top 0.1': {'% of forecasted points adjusted downward:': 35.9,
  '% of forecasted points adjusted upward:': 64.1,
  '% Series with improved accuracy:': array([52.4, 55.5]),
  '% Series with worsened accuracy:': array([47.6, 44.5]),
  '% Series with unchanged accuracy:': array([0., 0.]),
  '% Change mean global accuracy:': array([4.2, 2.8]),
  '% Change median global accuracy:': array([7.8, 4.5])},
 'h=15, Top 0.1': {'% of forecasted points adjusted downward:': 42.699999999999996,
  '% of forecasted points adjusted upward:': 57.3,
  '% Series with improved accura

***
***

## 'Medium' Model (window length = 20)

In [9]:
# LightGBM regressor with default hyperparameters
forecaster = lgb.LGBMRegressor()
# number of lagged values used as model inputs
window_length = 20

In [10]:
results_dict_20 = {}
types = ["Top", "Bottom"]
percentages = [0.10, 0.20, 0.40]
horizons = [1, 5, 15]

In [11]:
for t in types:
    for p in percentages:
        for h in horizons:
            # key format: "h=<horizon>, <coding type> <coding percentage>"
            results_dict_20[f"h={h}, {t} {p}"] = full_coding_analysis(
                Y, forecaster, forecast_horizon=h, coding_type=t,
                coding_percentage=p, window_length=window_length)

In [12]:
results_dict_20

{'h=1, Top 0.1': {'% of forecasted points adjusted downward:': 41.5,
  '% of forecasted points adjusted upward:': 58.5,
  '% Series with improved accuracy:': array([42.7, 42.7]),
  '% Series with worsened accuracy:': array([57.3, 57.3]),
  '% Series with unchanged accuracy:': array([0., 0.]),
  '% Change mean global accuracy:': array([-6.3, -6.3]),
  '% Change median global accuracy:': array([-19.2, -19.2])},
 'h=5, Top 0.1': {'% of forecasted points adjusted downward:': 38.3,
  '% of forecasted points adjusted upward:': 61.7,
  '% Series with improved accuracy:': array([53.7, 54.9]),
  '% Series with worsened accuracy:': array([46.3, 45.1]),
  '% Series with unchanged accuracy:': array([0., 0.]),
  '% Change mean global accuracy:': array([3.8, 3.4]),
  '% Change median global accuracy:': array([4.1, 7. ])},
 'h=15, Top 0.1': {'% of forecasted points adjusted downward:': 36.0,
  '% of forecasted points adjusted upward:': 64.0,
  '% Series with improved accuracy:': array([55.5, 56.7]),


***
***

## More Complex Model (window length = 40)

In [13]:
# LightGBM regressor with default hyperparameters
forecaster = lgb.LGBMRegressor()
# number of lagged values used as model inputs
window_length = 40

In [14]:
results_dict_40 = {}
types = ["Top", "Bottom"]
percentages = [0.10, 0.20, 0.40]
horizons = [1, 5, 15]

In [15]:
for t in types:
    for p in percentages:
        for h in horizons:
            # key format: "h=<horizon>, <coding type> <coding percentage>"
            results_dict_40[f"h={h}, {t} {p}"] = full_coding_analysis(
                Y, forecaster, forecast_horizon=h, coding_type=t,
                coding_percentage=p, window_length=window_length)

In [16]:
results_dict_40

{'h=1, Top 0.1': {'% of forecasted points adjusted downward:': 42.699999999999996,
  '% of forecasted points adjusted upward:': 57.3,
  '% Series with improved accuracy:': array([47.6, 47.6]),
  '% Series with worsened accuracy:': array([52.4, 52.4]),
  '% Series with unchanged accuracy:': array([0., 0.]),
  '% Change mean global accuracy:': array([-1., -1.]),
  '% Change median global accuracy:': array([-6.9, -6.9])},
 'h=5, Top 0.1': {'% of forecasted points adjusted downward:': 43.0,
  '% of forecasted points adjusted upward:': 56.99999999999999,
  '% Series with improved accuracy:': array([55.5, 55.5]),
  '% Series with worsened accuracy:': array([44.5, 44.5]),
  '% Series with unchanged accuracy:': array([0., 0.]),
  '% Change mean global accuracy:': array([3.4, 3.3]),
  '% Change median global accuracy:': array([4.8, 7.7])},
 'h=15, Top 0.1': {'% of forecasted points adjusted downward:': 48.1,
  '% of forecasted points adjusted upward:': 51.9,
  '% Series with improved accuracy:'