# Time Series Analysis: "The Final Project"

`End? No, the journey doesn't end here. Death is just another path. One that we all must take.
-J.R.R. Tolkien, The Return of the King`

---

## Libraries

In [17]:
%cd "Documents\tsa2021-m5"

[WinError 3] The system cannot find the path specified: 'Documents\\tsa2021-m5'
C:\Users\aamorado\Documents\tsa2021-m5


In [19]:
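from statsmodels.tsa.exponential_smoothing.ets import ETSModel
import statsmodels.api as sm
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb
from pandas.plotting import register_matplotlib_converters
from IPython.display import display
# scikit-learn pieces used in Parts 2 and 5
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.pipeline import Pipeline
from tsa_tools import *  # course helpers (BaseFuncModel, RecursiveRegressor, rateMyForecast, ...)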

register_matplotlib_converters()
sns.set_style('darkgrid')

np.set_printoptions(precision=4)
pd.set_option('display.precision', 4)

---

## M5 Forecasting

For this "final project", we will be forecasting the <b><u>level 9</u></b> series (unit sales of all products, aggregated for each store and department).

Load `sales_train_evaluation.csv` and use observations from `d_1 to d_1913` for training and `d_1914 to d_1941` for testing.

In [20]:
df_calendar = pd.read_csv('../data/m5/calendar.csv')
df_sales = pd.read_csv('../data/m5/sales_train_evaluation.csv')
df_weights = pd.read_csv('../data/m5/weights_validation.csv')
# display(df_calendar, df_sales, df_weights)

In [21]:
train_df = (df_sales.set_index([*df_sales.columns[5::-1]]).T
           .set_index(pd.DatetimeIndex(df_calendar.date)[:1941]).iloc[:-28])
test_df = (df_sales.set_index([*df_sales.columns[5::-1]]).T
           .set_index(pd.DatetimeIndex(df_calendar.date)[:1941]).iloc[-28:])
# display(train_df, test_df)

In [22]:
levels = {
    1: None,
    2: "state_id",
    3: "store_id",
    4: "cat_id",
    5: "dept_id",
    6: ["state_id", "cat_id"],
    7: ["state_id", "dept_id"],
    8: ["store_id", "cat_id"],
    9: ["store_id", "dept_id"],
    10: "item_id",
    11: ["state_id", "item_id"],
    12: ["store_id", "item_id"]
}


---

## Part 1. Baseline Methods (10 pts.)

### Q1. (10 pts.)

Extract all level 9 series from the dataset.

For each series, generate a 28-step forecast using the methods enumerated below and calculate the `RMSSE` against the test set:

1. `Naive`


2. `Seasonal Naive`


3. `SES`


4. `Holt's Linear`


5. `Additive Holt-Winters`

Summarize the metrics in a dataframe and print it.
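
`naivef` and `snaivef` come from the course's `tsa_tools` module, so their implementations are not shown in this notebook. As a rough sketch of what the first two baselines compute (the function names below are hypothetical stand-ins, not the `tsa_tools` API): the naive forecast repeats the last observation, and the seasonal naive forecast repeats the last full season of length `m`.

```python
def naive_forecast(history, h):
    """Repeat the last observed value h times."""
    return [history[-1]] * h


def seasonal_naive_forecast(history, h, m=7):
    """Repeat the last m observations cyclically for h steps."""
    last_season = history[-m:]
    return [last_season[i % m] for i in range(h)]


# Toy series: two weeks of daily values.
y = [10, 12, 14, 13, 15, 30, 28, 11, 13, 15, 14, 16, 31, 29]
print(naive_forecast(y, 3))                # [29, 29, 29]
print(seasonal_naive_forecast(y, 9, m=7))  # [11, 13, 15, 14, 16, 31, 29, 11, 13]
```

With `m=7` the seasonal naive forecast copies the same weekday from the last observed week, which is why it tends to do well on series with a strong weekly pattern.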

In [23]:
methods = {
    "Naive": BaseFuncModel(naivef),
    "Seasonal Naive": BaseFuncModel(snaivef, m=7),
    "SES": StatsModelsWrapper(ETSModel, trend=None, seasonal=None),
    "Holt's Linear": StatsModelsWrapper(ETSModel, trend='add', seasonal=None),
    "Additive Holt-Winters": StatsModelsWrapper(
        ETSModel, seasonal_periods=7, trend='add', seasonal='add'),
}

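# DataFrame.sum(level=...) was removed in pandas 2.0; group the transposed frame instead.
trainOG_df_9 = train_df.T.groupby(level=levels[9], sort=False).sum().T
train_df_9 = timeSeriesFiltering(trainOG_df_9, lower=10)
test_df_9 = test_df.T.groupby(level=levels[9], sort=False).sum().T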
weights_df_9 = (df_weights
                .loc[df_weights['Level_id'] == 'Level9']
                .set_index(['Agg_Level_1', 'Agg_Level_2'])[['Weight']])

In [24]:
res = {}
for method, model in methods.items():
    forecast_df_9 = pd.DataFrame(
        {label: model.fit(content).forecast(28)
        for label, content in train_df_9.items()}
        )
    res[method] = rateMyForecast(
        trainOG_df_9, test_df_9, forecast_df_9)['RMSSE']

In [25]:
pd.set_option('display.max_rows', None)
df_res_9_base = pd.DataFrame(res)
df_res_9_base.index = pd.MultiIndex.from_tuples(df_res_9_base.index)
df_res_9_base

store_id,dept_id,Naive,Seasonal Naive,SES,Holt's Linear,Additive Holt-Winters
CA_1,HOBBIES_1,1.4216,0.7428,0.8598,0.8589,0.6152
CA_1,HOBBIES_2,1.8601,1.0992,0.8494,0.8414,0.6931
CA_1,HOUSEHOLD_1,2.05,0.5036,1.1212,1.1331,0.4317
CA_1,HOUSEHOLD_2,2.2407,0.5094,1.1894,1.1486,0.5141
CA_1,FOODS_1,0.8768,0.688,0.8637,0.8808,0.687
CA_1,FOODS_2,2.0036,0.8068,2.0035,2.2574,0.5786
CA_1,FOODS_3,1.6443,0.475,1.0401,0.9787,0.4848
CA_2,HOBBIES_1,1.1753,0.688,1.0901,1.0902,0.6379
CA_2,HOBBIES_2,1.3073,1.4373,1.3402,1.3308,1.1234
CA_2,HOUSEHOLD_1,1.9863,0.624,1.3668,1.3564,0.5704
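
`rateMyForecast` is also a `tsa_tools` helper whose code is not shown here. As a minimal sketch of the `RMSSE` it reports, assuming the usual definition (forecast mean squared error scaled by the in-sample mean squared error of the one-step naive forecast), something like the following would do. The function name `rmsse` is an assumption for illustration. The M5 competition computes the scale starting from the first non-zero observation; this sketch uses the whole training series for simplicity.

```python
import math


def rmsse(train, actual, forecast):
    """Root mean squared scaled error of a forecast.

    The squared forecast errors are scaled by the in-sample mean squared
    error of the one-step naive forecast on the training series.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    naive_sq = [(train[t] - train[t - 1]) ** 2 for t in range(1, len(train))]
    scale = sum(naive_sq) / len(naive_sq)
    return math.sqrt(mse / scale)


train = [3, 5, 4, 6, 5]
actual = [6, 7]
forecast = [5, 5]
print(rmsse(train, actual, forecast))  # 1.0
```

A value below 1 means the forecast errors are smaller, on average, than the one-step naive errors seen in training. A constant training series would give a zero scale, so a real implementation needs to guard against that case.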


---

## Part 2. LightGBM (30 pts.)

### Q2. (10 pts.)

For all series, use an un-tuned `LightGBM` with 56-day lookback that uses a one-step recursive forecasting strategy to generate a 28-step forecast.

Calculate the `RMSSE` against the test set, then summarize the metrics in a dataframe and print it.
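
The recursive strategy trains a single one-step-ahead model and feeds each prediction back into the lookback window to produce the next one. `RecursiveRegressor` is a `tsa_tools` class, so the sketch below only illustrates the general idea; `recursive_forecast` and `mean_model` are hypothetical names, and `mean_model` stands in for a fitted one-step model such as an `LGBMRegressor`.

```python
def recursive_forecast(history, predict_one, w, h):
    """Forecast h steps by feeding each prediction back into the window.

    predict_one maps a list of the last w values to the next value.
    """
    window = list(history[-w:])
    forecasts = []
    for _ in range(h):
        y_next = predict_one(window)
        forecasts.append(y_next)
        window = window[1:] + [y_next]  # slide the window forward by one step
    return forecasts


def mean_model(window):
    """Stand-in one-step model: the mean of the lookback window."""
    return sum(window) / len(window)


print(recursive_forecast([1.0, 2.0, 3.0, 4.0], mean_model, w=2, h=3))
# [3.5, 3.75, 3.625]
```

Because later steps are predicted from earlier predictions, errors can accumulate over the 28-step horizon.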

In [26]:
model = RecursiveRegressor(
    lgb.LGBMRegressor(random_state=1, w=56, h=28, n_jobs=-1))  # Model: recursive-forecasting
pred = {}

for col in train_df_9:
    model.fit(None, train_df_9[col])
    pred[col] = model.predict(trainOG_df_9[col].iloc[-56:]).squeeze()

In [27]:
df_pred_9_rrlgb = pd.DataFrame(pred)
df_pred_9_rrlgb.index = test_df_9.index

df_res_9_rrlgb = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_rrlgb)['RMSSE']
res["RecursiveRegressor(LGBMRegressor)"] = df_res_9_rrlgb

df_res_9_rrlgb.index = pd.MultiIndex.from_tuples(df_res_9_rrlgb.index)
df_res_9_rrlgb.to_frame()

store_id,dept_id,RMSSE
CA_1,HOBBIES_1,0.718
CA_1,HOBBIES_2,0.6586
CA_1,HOUSEHOLD_1,0.5073
CA_1,HOUSEHOLD_2,0.5062
CA_1,FOODS_1,0.6677
CA_1,FOODS_2,0.645
CA_1,FOODS_3,0.4652
CA_2,HOBBIES_1,0.6993
CA_2,HOBBIES_2,1.2472
CA_2,HOUSEHOLD_1,0.6587


### Q3. (10 pts.)

For all series, use an un-tuned `LightGBM` with 56-day lookback that uses a direct forecasting strategy to generate a 28-step forecast.

Calculate the `RMSSE` against the test set, then summarize the metrics in a dataframe and print it.
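
The direct strategy fits one model per forecast step, all sharing the same lagged inputs. That requires turning each series into a table of `w` lagged inputs and `h` future targets. `TimeseriesGenerator` is a `tsa_tools` helper whose code is not shown, so the following is only a sketch of that windowing step under the assumption of a stride of one; `make_windows` is a hypothetical name.

```python
def make_windows(series, w, h):
    """Split a series into (X, Y) pairs: w lagged inputs, h future targets."""
    X, Y = [], []
    for start in range(len(series) - w - h + 1):
        X.append(series[start:start + w])
        Y.append(series[start + w:start + w + h])
    return X, Y


X, Y = make_windows(list(range(10)), w=4, h=3)
print(len(X))        # 4
print(X[0], Y[0])    # [0, 1, 2, 3] [4, 5, 6]
print(X[-1], Y[-1])  # [3, 4, 5, 6] [7, 8, 9]
```

Column `k` of `Y` is the target for the model that forecasts step `k + 1`; scikit-learn's `MultiOutputRegressor` fits one copy of the estimator per target column.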

In [28]:
model = MultiOutputRegressor(
    lgb.LGBMRegressor(random_state=1, n_jobs=-1),
    n_jobs=-1)  # Model: direct-forecasting
pred = {}

for col in train_df_9:
    X_train, _, y_train, _ = TimeseriesGenerator(
        X=train_df_9[col],
        y=None,
        w=56,
        h=28)
    model.fit(X_train, y_train)
    pred[col] = model.predict([trainOG_df_9[col].iloc[-56:]]).squeeze()

In [29]:
df_pred_9_morlgb = pd.DataFrame(pred, index=test_df_9.index)

df_res_9_morlgb = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_morlgb)['RMSSE']
res["MultiOutputRegressor(LGBMRegressor)"] = df_res_9_morlgb

df_res_9_morlgb.index = pd.MultiIndex.from_tuples(df_res_9_morlgb.index)
df_res_9_morlgb.to_frame()

store_id,dept_id,RMSSE
CA_1,HOBBIES_1,0.687
CA_1,HOBBIES_2,0.6902
CA_1,HOUSEHOLD_1,0.3635
CA_1,HOUSEHOLD_2,0.6683
CA_1,FOODS_1,0.6356
CA_1,FOODS_2,0.7556
CA_1,FOODS_3,0.5491
CA_2,HOBBIES_1,0.6626
CA_2,HOBBIES_2,1.4831
CA_2,HOUSEHOLD_1,0.5651


### Q4. (10 pts.)

For all series, generate a 28-step forecast by combining the forecasts generated by the models in Q2 and Q3 (i.e. simple averaging).

Calculate the `RMSSE` against the test set, then summarize the metrics in a dataframe and print it.

In [30]:
df_pred_9_combo = (df_pred_9_morlgb + df_pred_9_rrlgb) / 2

df_res_9_combo = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_combo)['RMSSE']
res["Combo(LGBMRegressor)"] = df_res_9_combo

df_res_9_combo.index = pd.MultiIndex.from_tuples(df_res_9_combo.index)
df_res_9_combo.to_frame()

store_id,dept_id,RMSSE
CA_1,HOBBIES_1,0.6904
CA_1,HOBBIES_2,0.6439
CA_1,HOUSEHOLD_1,0.4231
CA_1,HOUSEHOLD_2,0.5778
CA_1,FOODS_1,0.6198
CA_1,FOODS_2,0.6892
CA_1,FOODS_3,0.5021
CA_2,HOBBIES_1,0.6709
CA_2,HOBBIES_2,1.3542
CA_2,HOUSEHOLD_1,0.5993


---

## Part 3. WRMSSE (10 pts.)

### Q5. (10 pts.)

Calculate the `WRMSSE` for all the methods described above. The weights can be found in `weights_validation.csv`.

For reference, the M5 benchmarks have the following `WRMSSE` scores at level 9:

- `Naive` = <b>1.764</b>


- `S.Naive` = <b>0.888</b>


- `ES_bu` = <b>0.728</b>

<i>Note: The M5 benchmarks use a bottom-up method for forecasting, so they will not necessarily be equal to your scores.</i>

In [31]:
df_res_9_all = pd.DataFrame(res)
df_res_9_all.index = pd.MultiIndex.from_tuples(df_res_9_all.index)
df_res_9_all

(df_res_9_all.rename_axis(['Agg_Level_1', 'Agg_Level_2'])
 .multiply(weights_df_9.squeeze(), axis=0).sum())

Naive                                  1.5616
Seasonal Naive                         0.8919
SES                                    1.1585
Holt's Linear                          1.1855
Additive Holt-Winters                  0.8200
RecursiveRegressor(LGBMRegressor)      0.7569
MultiOutputRegressor(LGBMRegressor)    0.7814
Combo(LGBMRegressor)                   0.7505
dtype: float64
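
The pandas expression above relies on index alignment: each series' `RMSSE` is multiplied by the weight that carries the same `(Agg_Level_1, Agg_Level_2)` label, and the products are summed. A minimal plain-Python sketch of the same weighted sum, with the matching made explicit (the function name `wrmsse` and the toy numbers are assumptions for illustration, not values from the weights file):

```python
def wrmsse(rmsse_by_series, weight_by_series):
    """Weighted sum of per-series RMSSE values, matched by series key."""
    missing = set(rmsse_by_series) - set(weight_by_series)
    if missing:
        raise KeyError(f"no weight for series: {sorted(missing)}")
    return sum(rmsse_by_series[k] * weight_by_series[k] for k in rmsse_by_series)


# Toy numbers (not taken from the M5 weights file).
scores = {("CA_1", "FOODS_3"): 0.5, ("CA_1", "HOBBIES_1"): 0.8}
weights = {("CA_1", "FOODS_3"): 0.75, ("CA_1", "HOBBIES_1"): 0.25}
print(wrmsse(scores, weights))
```

Raising on a missing key is a deliberate choice in this sketch: with pandas alignment, a mismatched label silently yields `NaN`, which `sum()` then skips.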

---

## Part 4. Middle-Out Method (30 pts.)

### Q6. Bottom-Up (15 pts.)

Using your forecasts from the best performing method in Q5, use the bottom-up method described in [FPP3](https://otexts.com/fpp3/single-level.html) to generate forecasts for levels 1 to 8.

Calculate the `WRMSSE` for levels 1 to 8 against the test set, then summarize the metrics in a dataframe and print it.

For reference, you can find the benchmark `WRMSSE` scores in the paper `The M5 Accuracy competition: Results, findings and conclusions`.

<i>Note: The M5 benchmarks use a bottom-up method for forecasting, so they will not necessarily be equal to your scores.</i>

In [15]:
df_pred_9_combo.columns = (train_df.T
                           .groupby(level=["store_id", "state_id", "cat_id", "dept_id"], sort=False)
                           .sum().index)
df_pred_9_combo

store_id,CA_1,CA_1,CA_1,CA_1,CA_1,CA_1,CA_1,CA_2,CA_2,CA_2,...,WI_2,WI_2,WI_2,WI_3,WI_3,WI_3,WI_3,WI_3,WI_3,WI_3
state_id,CA,CA,CA,CA,CA,CA,CA,CA,CA,CA,...,WI,WI,WI,WI,WI,WI,WI,WI,WI,WI
cat_id,HOBBIES,HOBBIES,HOUSEHOLD,HOUSEHOLD,FOODS,FOODS,FOODS,HOBBIES,HOBBIES,HOUSEHOLD,...,FOODS,FOODS,FOODS,HOBBIES,HOBBIES,HOUSEHOLD,HOUSEHOLD,FOODS,FOODS,FOODS
dept_id,HOBBIES_1,HOBBIES_2,HOUSEHOLD_1,HOUSEHOLD_2,FOODS_1,FOODS_2,FOODS_3,HOBBIES_1,HOBBIES_2,HOUSEHOLD_1,...,FOODS_1,FOODS_2,FOODS_3,HOBBIES_1,HOBBIES_2,HOUSEHOLD_1,HOUSEHOLD_2,FOODS_1,FOODS_2,FOODS_3
date
2016-04-25,449.4956,42.4433,754.6167,195.0511,271.6638,541.5019,1917.5483,324.1858,42.1745,655.6467,...,344.8301,884.0825,1933.51,207.2512,24.6183,548.1931,158.8054,235.2273,446.0785,1615.1026
2016-04-26,396.0684,37.1498,643.8261,177.7038,253.1075,435.828,1820.1153,335.1625,41.0074,602.7929,...,327.2772,733.3352,1840.4003,177.5088,32.196,564.7635,143.3564,258.5176,412.6191,1734.2253
2016-04-27,388.4789,39.5788,586.1311,184.2913,289.1053,366.2773,1732.6518,285.6967,42.339,641.663,...,310.4772,702.6381,1727.1741,191.3946,25.9984,515.3838,129.5121,246.8596,383.138,1602.5222
2016-04-28,419.4912,37.9877,622.6572,184.2326,303.6351,340.3313,1714.0109,338.555,37.8039,631.6716,...,266.4066,824.6831,1711.54,218.4111,28.1831,579.6401,132.4203,238.3594,390.2391,1505.39
2016-04-29,456.5489,47.5803,739.6181,215.0809,314.9435,460.0979,2244.1102,365.7659,38.3668,824.5203,...,343.4292,811.1577,2078.8262,324.1257,24.4642,742.7536,168.247,264.7937,488.5465,1788.8892
2016-04-30,582.3975,55.4284,1019.7767,271.679,372.139,567.8636,2693.1875,462.4425,44.5834,1011.6265,...,382.6277,832.5479,2357.888,295.035,27.901,869.7472,208.7738,312.2918,615.5956,2453.5944
2016-05-01,546.6977,51.381,1079.9215,278.506,332.1003,623.2916,2890.6448,457.0832,45.4321,1033.0461,...,364.862,1071.6023,2462.6952,250.3653,30.4518,870.1567,204.3641,285.0829,641.34,2309.922
2016-05-02,465.732,35.9209,778.7456,195.6691,272.2665,537.8936,2030.6053,309.9696,32.2123,650.4048,...,365.2623,1276.8462,2488.2545,214.8206,26.7563,625.2188,165.1707,238.8403,569.4897,1746.4308
2016-05-03,431.2596,42.9203,655.8035,185.2495,289.3279,496.3385,1866.7531,296.4192,34.6295,607.299,...,342.9247,1333.6738,2403.2871,212.6836,24.7036,604.9833,136.9341,270.9724,680.9072,1920.9284
2016-05-04,444.1696,41.6432,628.0217,192.5918,268.0387,497.2088,1825.3116,312.1186,48.9701,635.2823,...,335.2363,1596.566,2549.1681,216.6993,31.1215,554.5598,139.3128,250.2516,648.3445,1811.1699
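
The cell above relabels the level 9 forecast columns with their `state_id` and `cat_id`. Bottom-up aggregation then amounts to summing those forecasts within each group. A minimal plain-Python sketch of that step, assuming state and category can be read off the ID prefixes (`CA_1` gives `CA`, `FOODS_3` gives `FOODS`); `bottom_up`, `state_of`, `cat_of`, and `level_keys` are hypothetical names, not part of `tsa_tools`:

```python
def bottom_up(forecasts, key_fn):
    """Sum bottom-level forecasts into groups defined by key_fn.

    forecasts maps (store_id, dept_id) to a list of h forecast values.
    key_fn maps (store_id, dept_id) to the label of the aggregate series.
    """
    out = {}
    for (store_id, dept_id), values in forecasts.items():
        key = key_fn(store_id, dept_id)
        if key not in out:
            out[key] = list(values)
        else:
            out[key] = [a + b for a, b in zip(out[key], values)]
    return out


def state_of(store_id):
    return store_id.rsplit("_", 1)[0]  # "CA_1" -> "CA"


def cat_of(dept_id):
    return dept_id.rsplit("_", 1)[0]  # "FOODS_3" -> "FOODS"


# Grouping keys for levels 1 to 8, following the `levels` dictionary.
level_keys = {
    1: lambda s, d: "Total",
    2: lambda s, d: state_of(s),
    3: lambda s, d: s,
    4: lambda s, d: cat_of(d),
    5: lambda s, d: d,
    6: lambda s, d: (state_of(s), cat_of(d)),
    7: lambda s, d: (state_of(s), d),
    8: lambda s, d: (s, cat_of(d)),
}

toy = {
    ("CA_1", "FOODS_1"): [1.0, 2.0],
    ("CA_1", "FOODS_2"): [3.0, 4.0],
    ("WI_1", "FOODS_1"): [5.0, 6.0],
}
print(bottom_up(toy, level_keys[2]))  # {'CA': [4.0, 6.0], 'WI': [5.0, 6.0]}
print(bottom_up(toy, level_keys[1]))  # {'Total': [9.0, 12.0]}
```

Applying the same aggregation to the training and test data gives the series needed to score each level.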


### Q7. Top-Down (15 pts.)

Using your forecasts from the best performing method in Q5, use the top-down method with `average historical proportions` described in [FPP3](https://otexts.com/fpp3/single-level.html) to generate forecasts for levels 10 to 12.

Calculate the `WRMSSE` for levels 10 to 12 against the test set, then summarize the metrics in a dataframe and print it.

For reference, you can find the benchmark `WRMSSE` scores in the paper `The M5 Accuracy competition: Results, findings and conclusions`.

<i>Note: The M5 benchmarks use a bottom-up method for forecasting, so they will not necessarily be equal to your scores.</i>

In [16]:
# Your code here
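
As a rough sketch of the `average historical proportions` approach from FPP3: each child's proportion is the average, over the training period, of its share of the parent series, and the parent forecast is split using those fixed proportions. The function names and toy data below are assumptions for illustration. Skipping periods where the parent is zero is also an assumption of this sketch, since the share is undefined there.

```python
def average_historical_proportions(children, parent):
    """Mean over time of each child's share of the parent series.

    Periods where the parent is zero are skipped (an assumption of this
    sketch; the share is undefined there).
    """
    valid = [t for t, total in enumerate(parent) if total != 0]
    if not valid:
        raise ValueError("parent series is zero everywhere")
    return {
        name: sum(series[t] / parent[t] for t in valid) / len(valid)
        for name, series in children.items()
    }


def top_down(parent_forecast, proportions):
    """Split a parent forecast across children using fixed proportions."""
    return {
        name: [p * value for value in parent_forecast]
        for name, p in proportions.items()
    }


children = {"item_a": [1, 3, 2], "item_b": [3, 1, 2]}
parent = [4, 4, 4]
props = average_historical_proportions(children, parent)
print(props)                          # {'item_a': 0.5, 'item_b': 0.5}
print(top_down([10.0, 20.0], props))  # {'item_a': [5.0, 10.0], 'item_b': [5.0, 10.0]}
```

For level 12 (store and item), the natural parent is the level 9 series of the item's store and department. One option for levels 10 and 11 is to sum the resulting level 12 forecasts across stores.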

---

## Part 5. King of the Hill (20 pts.)

Using whatever methods/models you desire, beat the best `WRMSSE` score in Q5.

<b><u>Do not tune your model using the test set!</u></b> If you do, you will not get points for this part.

### Q8. (10 pts.)

Describe your methodology here. 

Points will be awarded for <b>aesthetics</b> (ex. use of diagrams), <b>ease of reading</b>, <b>clarity</b>, and <b>brevity</b>.

Points will be deducted for <b>excessively long</b> walls of text and descriptions.

ANSWER HERE.

### Q9. (10 pts.)

This part is for your actual code.

In [82]:
pipe = Pipeline([
    ('transfrom_X', EndogenousTransformer(w=56, h=28, return_y=False)),
    ('transform_y', TransformedTargetRegressor(
        regressor=MultiOutputRegressor(
            lgb.LGBMRegressor(random_state=1, n_jobs=-1),
            n_jobs=-1
        ),
        transformer=EndogenousTransformer(
            w=56, h=28, return_X=False, reshape=True),
        check_inverse=False)
     )
])
pipe.get_params()

{'memory': None,
 'steps': [('transfrom_X', EndogenousTransformer(h=28, return_y=False, w=56)),
  ('transform_y',
   TransformedTargetRegressor(check_inverse=False,
                              regressor=MultiOutputRegressor(estimator=LGBMRegressor(random_state=1),
                                                             n_jobs=-1),
                              transformer=EndogenousTransformer(h=28, reshape=True,
                                                                return_X=False,
                                                                w=56)))],
 'verbose': False,
 'transfrom_X': EndogenousTransformer(h=28, return_y=False, w=56),
 'transform_y': TransformedTargetRegressor(check_inverse=False,
                            regressor=MultiOutputRegressor(estimator=LGBMRegressor(random_state=1),
                                                           n_jobs=-1),
                            transformer=EndogenousTransformer(h=28, reshape=True,
                   

In [83]:
pipe.fit(train_df_9.iloc[:, 0], train_df_9.iloc[:, 0])
pipe.predict(trainOG_df_9.iloc[-56:, 0])

array([471.7198, 402.9137, 375.4826, 425.8438, 454.9795, 612.5016,
       545.9712, 479.6887, 456.6566, 434.771 , 431.066 , 518.6863,
       590.3889, 559.975 , 501.0072, 478.5058, 417.7204, 437.7128,
       476.0144, 580.7841, 536.5159, 499.4256, 442.6632, 420.5836,
       474.8373, 443.141 , 550.5156, 531.7143])

In [8]:
param_grid = {
    'transfrom_X__w': [56, 28*3],
    'transform_y__regressor__estimator__n_estimators': [100, 200],
    'transform_y__transformer__w': [56, 28*3],
    'transform_y__regressor__estimator__num_leaves': [31, 70],
    'transform_y__regressor__estimator__max_depth': [3, 5, 7],
}

In [9]:
tscv = TimeSeriesSplit(n_splits=3, test_size=28)
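
To make the validation scheme concrete, here is a plain-Python sketch intended to mirror how `TimeSeriesSplit(n_splits=3, test_size=28)` lays out its folds with the default `gap=0`. It is an illustration, not scikit-learn's implementation, and `expanding_splits` is a hypothetical name. The example assumes a series of 1913 daily observations, i.e. the training window with no rows dropped.

```python
def expanding_splits(n_samples, n_splits, test_size):
    """Return (train_end, test_start, test_end) index bounds for each fold.

    Each fold trains on [0, train_end) and tests on [test_start, test_end);
    the test blocks are consecutive and end at the last sample.
    """
    first_test = n_samples - n_splits * test_size
    if first_test <= 0:
        raise ValueError("not enough samples for the requested folds")
    return [
        (start, start, start + test_size)
        for start in range(first_test, n_samples, test_size)
    ]


for train_end, test_start, test_end in expanding_splits(1913, 3, 28):
    print(f"train [0, {train_end})  test [{test_start}, {test_end})")
# train [0, 1829)  test [1829, 1857)
# train [0, 1857)  test [1857, 1885)
# train [0, 1885)  test [1885, 1913)
```

Every fold validates on a 28-day block that lies strictly after its training data, so the held-out test set is never touched during tuning.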

In [10]:
gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',
    error_score=-np.inf,
    return_train_score=True,
    verbose=5,
    cv=tscv,
    n_jobs=-1
)


In [11]:
gs.fit(train_df_9.iloc[:, 1], train_df_9.iloc[:, 1])

Fitting 3 folds for each of 48 candidates, totalling 144 fits


GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=28),
             error_score=-inf,
             estimator=Pipeline(steps=[('transfrom_X',
                                        EndogenousTransformer(h=28,
                                                              return_y=False,
                                                              w=56)),
                                       ('transform_y',
                                        TransformedTargetRegressor(check_inverse=False,
                                                                   regressor=MultiOutputRegressor(estimator=LGBMRegressor(random_state=1),
                                                                                                  n_jobs=-1),
                                                                   transformer=Endo...
                                                                                                     return_X=False,
               

In [13]:
pd.DataFrame(gs.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_transform_y__regressor__estimator__max_depth,param_transform_y__regressor__estimator__n_estimators,param_transform_y__regressor__estimator__num_leaves,param_transform_y__transformer__w,param_transfrom_X__w,params,...,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,3.2001,0.8192,0.2118,0.0284,3,100,31,56,56,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
1,1.2546,0.0716,0.0,0.0,3,100,31,56,84,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
2,1.4022,0.333,0.0,0.0,3,100,31,84,56,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
3,2.9345,0.6409,0.3477,0.1436,3,100,31,84,84,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
4,3.0851,0.1871,0.1838,0.0126,3,100,70,56,56,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
5,1.6383,0.2257,0.0,0.0,3,100,70,56,84,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
6,1.5259,0.3783,0.0,0.0,3,100,70,84,56,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
7,2.8444,0.59,0.3388,0.2308,3,100,70,84,84,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
8,3.8517,0.087,0.2929,0.2083,3,200,31,56,56,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,
9,1.478,0.0994,0.0,0.0,3,200,31,56,84,{'transform_y__regressor__estimator__max_depth...,...,-inf,-inf,-inf,,1,-inf,-inf,-inf,-inf,


* grid search (LightGBM only, per time series)
* model search (per time series, baseline seasonal) + direct [single-shot] (trees)
* wrapper search (for GBM: recursive / direct [multiple-shot] / chain)
* output is 70 (one value per level 9 series)
[
    T-3[S1, S2...],
    T-2[S1, S2...],
    T-1[S1, S2...]
]

[S1, S2...]

* exog (price, holiday, DoW, DoY, WoM, etc.)
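
The calendar-based exogenous features in the last note (DoW, DoY, WoM) can be derived from the date alone. A minimal sketch using only the standard library follows; `calendar_features` is a hypothetical name, and the week-of-month formula is just one possible definition. Price and holiday features would need joins against the M5 price and calendar files, which are not sketched here.

```python
from datetime import date


def calendar_features(d):
    """Simple calendar features for one date."""
    return {
        "dow": d.weekday(),            # day of week, Monday = 0
        "doy": d.timetuple().tm_yday,  # day of year, 1-based
        "wom": (d.day - 1) // 7 + 1,   # week of month, 1..5 (one possible definition)
    }


# First day of the 28-day test window.
print(calendar_features(date(2016, 4, 25)))  # {'dow': 0, 'doy': 116, 'wom': 4}
```

For the forecast horizon these features are known in advance, so they can be supplied for all 28 steps without being forecast themselves.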

In [7]:
tscv = TimeSeriesSplit(n_splits=3, test_size=28)
methods = {
    "Seasonal Naive": {
        'meta': 'base',
        'model': BaseFuncModel(snaivef, m=7)},
    "SES": {
        'meta': 'stat',
        'model': StatsModelsWrapper(ETSModel, trend=None, seasonal=None)},
    "Holt's Linear": {
        'meta': 'stat',
        'model': StatsModelsWrapper(ETSModel, trend='add', seasonal=None)},
    "Additive Holt-Winters": {
        'meta': 'stat',
        'model': StatsModelsWrapper(
            ETSModel, seasonal_periods=7, trend='add', seasonal='add')},
    "RecursiveRegressor(LGBMRegressor)": {
        'meta': 'ml_recursive',
        'model': RecursiveRegressor(
            lgb.LGBMRegressor(random_state=1, w=84, h=28, n_jobs=-1))},
    "MultiOutputRegressor(LGBMRegressor)": {
        'meta': 'ml_direct',
        'model': MultiOutputRegressor(
            lgb.LGBMRegressor(random_state=1, n_jobs=-1), n_jobs=-1)},
    "RegressorChain(LGBMRegressor)": {
        'meta': 'ml_direct',
        'model': RegressorChain(
            lgb.LGBMRegressor(random_state=1, n_jobs=-1))},
    "Combo(LGBMRegressor)": {
        'meta': 'combo',
        'model': [
            RecursiveRegressor(
                lgb.LGBMRegressor(random_state=1, w=84, h=28, n_jobs=-1)),
            MultiOutputRegressor(
                lgb.LGBMRegressor(random_state=1, n_jobs=-1), n_jobs=-1)
        ]}
}

In [None]:
for col in train_df_9:
    print(col)
    methods = evaluate_methods(
        methods,
        X=train_df_9[col],
        XOG=trainOG_df_9[col],
        tscv=tscv,
        col=col,
        w=84,
        h=28)

In [10]:
rmse_scores = score_reveal(methods)
rmse_scores

store_id,dept_id,Seasonal Naive,SES,Holt's Linear,Additive Holt-Winters,RecursiveRegressor(LGBMRegressor),MultiOutputRegressor(LGBMRegressor),RegressorChain(LGBMRegressor),Combo(LGBMRegressor)
CA_1,HOBBIES_1,112.5315,102.1342,102.2342,89.0478,88.1818,91.9518,90.5094,89.2338
CA_1,HOBBIES_2,17.1085,13.3313,13.4002,11.8658,13.7598,14.0885,12.5092,13.3713
CA_1,HOUSEHOLD_1,124.7181,182.9850,183.3727,80.5718,77.1174,78.3647,76.8850,76.4375
CA_1,HOUSEHOLD_2,42.9325,54.9331,54.7421,31.1680,32.8088,34.2279,34.3215,33.1396
CA_1,FOODS_1,96.5581,115.3984,113.5375,92.5809,83.4848,71.9233,74.6178,75.9609
...,...,...,...,...,...,...,...,...,...
WI_3,HOUSEHOLD_1,119.6917,151.5976,151.4656,100.3098,96.1385,96.7119,95.9121,93.5357
WI_3,HOUSEHOLD_2,28.4814,38.4835,38.4900,27.9321,26.6128,29.6885,27.0664,27.0069
WI_3,FOODS_1,85.1284,95.9297,96.3109,86.2128,67.6526,65.7408,64.0030,65.1575
WI_3,FOODS_2,192.7013,167.3461,166.0414,180.1888,109.7245,103.9821,111.8733,104.1259


In [11]:
best_models = rmse_scores.idxmin(axis=1)  # lowest-RMSE method for each series
best_models

CA_1  HOBBIES_1        RecursiveRegressor(LGBMRegressor)
      HOBBIES_2                    Additive Holt-Winters
      HOUSEHOLD_1                   Combo(LGBMRegressor)
      HOUSEHOLD_2                  Additive Holt-Winters
      FOODS_1        MultiOutputRegressor(LGBMRegressor)
                                    ...                 
WI_3  HOUSEHOLD_1                   Combo(LGBMRegressor)
      HOUSEHOLD_2      RecursiveRegressor(LGBMRegressor)
      FOODS_1              RegressorChain(LGBMRegressor)
      FOODS_2        MultiOutputRegressor(LGBMRegressor)
      FOODS_3                       Combo(LGBMRegressor)
Length: 70, dtype: object

In [12]:
best_models.value_counts()

Additive Holt-Winters                  27
Combo(LGBMRegressor)                   10
RegressorChain(LGBMRegressor)          10
RecursiveRegressor(LGBMRegressor)      10
MultiOutputRegressor(LGBMRegressor)     9
SES                                     3
Holt's Linear                           1
dtype: int64

In [15]:
model = ensemble2(col_assignment=best_models.to_dict(), methods=methods, w=84, h=28)
model.fit(train_df_9)

In [None]:
df_pred_9_ensemble2 = model.predict(trainOG_df_9)
df_pred_9_ensemble2

store_id,CA_1,CA_1,CA_1,CA_1,CA_1,CA_1,CA_1,CA_2,CA_2,CA_2,...,WI_2,WI_2,WI_2,WI_3,WI_3,WI_3,WI_3,WI_3,WI_3,WI_3
dept_id,HOBBIES_1,HOBBIES_2,HOUSEHOLD_1,HOUSEHOLD_2,FOODS_1,FOODS_2,FOODS_3,HOBBIES_1,HOBBIES_2,HOUSEHOLD_1,...,FOODS_1,FOODS_2,FOODS_3,HOBBIES_1,HOBBIES_2,HOUSEHOLD_1,HOUSEHOLD_2,FOODS_1,FOODS_2,FOODS_3
0,437.5033,39.0957,763.9959,207.237,266.7877,509.9272,1925.101,320.4134,39.31,666.1571,...,320.4731,964.3415,1868.7333,218.7019,29.3937,527.0088,159.2782,271.9992,436.9822,1605.361
1,392.2056,38.7922,654.6803,189.0628,242.2286,405.2741,1777.0478,349.0217,34.758,602.7528,...,333.5456,721.1611,1640.876,208.3781,30.4984,537.2347,139.6441,294.1283,382.8208,1620.5203
2,423.5902,40.4461,619.8833,189.6468,283.8521,348.0841,1754.382,280.4951,41.5079,602.8867,...,343.4532,710.5954,1680.4127,216.6801,31.0335,509.2187,132.3633,242.3489,359.9844,1562.3207
3,406.6826,39.1083,593.2084,197.6096,300.6551,372.3611,1844.6085,350.8923,37.9551,599.6201,...,339.3687,802.8031,1688.3539,224.516,31.2705,552.2354,142.5043,229.3149,354.8844,1490.6989
4,453.2403,43.9026,688.9625,226.5993,344.9472,518.561,2073.3072,427.0443,41.1714,765.073,...,380.7957,713.4821,2152.6675,285.1761,32.4132,661.9132,176.8324,278.5479,468.2096,1740.4981
5,580.0852,50.3332,1028.9997,290.3485,400.7362,592.585,2682.339,467.593,52.6185,1077.7876,...,405.5013,835.2595,2311.1217,314.2479,33.0978,809.0336,206.358,337.2512,538.8162,2519.8603
6,516.6611,52.1988,1084.4108,280.3655,351.8076,662.2098,2926.0347,456.7799,48.7608,1117.9953,...,349.6992,1033.3559,2383.4232,244.084,32.2626,841.5723,206.3033,283.7687,591.0831,2432.8251
7,461.42,39.1899,774.8963,207.5895,266.4044,510.1214,2140.1317,318.742,32.7714,667.6338,...,320.8609,1306.3273,2591.0659,218.81,29.457,625.0542,155.6883,252.6107,579.6351,1792.5438
8,406.5608,38.8864,669.6527,189.4153,274.1547,474.5079,1948.6778,318.5709,37.6144,604.2295,...,333.9334,1380.853,2449.1456,208.4862,30.5617,584.6945,136.9481,278.5121,720.1154,2050.8738
9,439.8678,40.5403,642.2546,189.9993,250.8226,491.0791,1899.5062,319.1154,39.6117,604.3635,...,343.841,1586.7742,2655.3447,216.7883,31.0968,595.5331,141.5213,230.4024,612.6808,1902.4786


In [17]:
df_res_9_ensemble2 = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_ensemble2)['RMSSE']
df_res_9_ensemble2.index = pd.MultiIndex.from_tuples(
    df_res_9_ensemble2.index, names=['Agg_Level_1', 'Agg_Level_2'])
df_res_9_ensemble2

Agg_Level_1  Agg_Level_2
CA_1         HOBBIES_1      0.7412
             HOBBIES_2      0.6931
             HOUSEHOLD_1    0.3584
             HOUSEHOLD_2    0.5141
             FOODS_1        0.5867
                             ...  
WI_3         HOUSEHOLD_1    0.7166
             HOUSEHOLD_2    0.7675
             FOODS_1        1.6214
             FOODS_2        0.8506
             FOODS_3        0.4849
Name: RMSSE, Length: 70, dtype: float64

In [18]:
df_res_9_ensemble2.multiply(weights_df_9.squeeze(), axis=0).sum()

0.7259427322254518