# Time Series Analysis: "The Final Project"

`End? No, the journey doesn't end here. Death is just another path. One that we all must take.
-J.R.R. Tolkien, The Return of the King`

---

## Libraries

In [2]:
%cd "Documents\tsa2021-m5"

C:\Users\aamorado\Documents\tsa2021-m5


In [108]:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
import statsmodels.api as sm
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb
from pandas.plotting import register_matplotlib_converters
from IPython.display import display
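from tsa_tools import *  # course helpers; presumably the source of BaseFuncModel, StatsModelsWrapper, RecursiveRegressor, rateMyForecast, etc.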
import warnings

register_matplotlib_converters()
sns.set_style('darkgrid')

np.set_printoptions(precision=4)
pd.set_option('display.precision', 4)

warnings.filterwarnings("ignore")

---

## M5 Forecasting

For this "final project", we will forecast the <b><u>level 9</u></b> series (unit sales of all products, aggregated for each store and department).

Load `sales_train_evaluation.csv` and use observations from `d_1` to `d_1913` for training and `d_1914` to `d_1941` for testing.

In [43]:
df_calendar = pd.read_csv('../data/m5/calendar.csv')
df_sales = pd.read_csv('../data/m5/sales_train_evaluation.csv')
df_weights = pd.read_csv('../data/m5/weights_validation.csv')
# display(df_calendar, df_sales, df_weights)

In [3]:
train_df = (df_sales.set_index([*df_sales.columns[5::-1]]).T
           .set_index(pd.DatetimeIndex(df_calendar.date)[:1941]).iloc[:-28])
test_df = (df_sales.set_index([*df_sales.columns[5::-1]]).T
           .set_index(pd.DatetimeIndex(df_calendar.date)[:1941]).iloc[-28:])
# display(train_df, test_df)

In [4]:
levels = {
    1: None,
    2: "state_id",
    3: "store_id",
    4: "cat_id",
    5: "dept_id",
    6: ["state_id", "cat_id"],
    7: ["state_id", "dept_id"],
    8: ["store_id", "cat_id"],
    9: ["store_id", "dept_id"],
    10: "item_id",
    11: ["state_id", "item_id"],
    12: ["store_id", "item_id"]
}

---

## Part 1. Baseline Methods (10 pts.)

### Q1. (10 pts.)

Extract all level 9 series from the dataset.

For each series, generate a 28-step forecast using the methods enumerated below and calculate the `RMSSE` against the test set:

1. `Naive`


2. `Seasonal Naive`


3. `SES`


4. `Holt's Linear`


5. `Additive Holt-Winters`

Summarize the metrics in a dataframe and print it.
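
As a point of reference for the two simplest baselines and the metric, here is a minimal pure-Python sketch. The function names and the toy series are made up for illustration and are not the `tsa_tools` helpers used in this notebook. The `rmsse` function scales the forecast MSE by the in-sample MSE of the one-step naive forecast; as a simplification, it does not trim leading zeros the way the M5 definition does.

```python
from math import sqrt


def naive_forecast(train, h):
    """Repeat the last observed value h times."""
    return [train[-1]] * h


def seasonal_naive_forecast(train, h, m=7):
    """Repeat the last full season of length m until h values are produced."""
    last_season = train[-m:]
    return [last_season[i % m] for i in range(h)]


def rmsse(train, test, forecast):
    """Root mean squared scaled error (simplified: no leading-zero trimming)."""
    # Mean squared error of the forecast over the test horizon.
    mse = sum((y - f) ** 2 for y, f in zip(test, forecast)) / len(test)
    # In-sample mean squared error of the one-step naive forecast.
    scale = sum((b - a) ** 2 for a, b in zip(train, train[1:])) / (len(train) - 1)
    return sqrt(mse / scale)


# Toy series with an exact weekly pattern (made-up numbers, not M5 data).
train = [10, 12, 14, 13, 15, 20, 22] * 4
test = [10, 12, 14, 13, 15, 20, 22]

naive_score = rmsse(train, test, naive_forecast(train, len(test)))
snaive_score = rmsse(train, test, seasonal_naive_forecast(train, len(test)))
print(naive_score, snaive_score)
```

Because the toy series repeats exactly every 7 days, the seasonal naive forecast is perfect here while the naive forecast is not, which mirrors the gap between the two columns in the results for strongly weekly series.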

In [5]:
methods = {
    "Naive": BaseFuncModel(naivef),
    "Seasonal Naive": BaseFuncModel(snaivef, m=7),
    "SES": StatsModelsWrapper(ETSModel, trend=None, seasonal=None),
    "Holt's Linear": StatsModelsWrapper(ETSModel, trend='add', seasonal=None),
    "Additive Holt-Winters": StatsModelsWrapper(
        ETSModel, seasonal_periods=7, trend='add', seasonal='add'),
}

trainOG_df_9 = train_df.sum(axis='columns', level=levels[9])
train_df_9 = timeSeriesFiltering(trainOG_df_9, lower=10)
test_df_9 = test_df.sum(axis='columns', level=levels[9])
weights_df_9 = (df_weights
                .loc[df_weights['Level_id'] == 'Level9']
                .set_index(['Agg_Level_1', 'Agg_Level_2'])[['Weight']])

In [6]:
res = {}
for method, model in methods.items():
    forecast_df_9 = pd.DataFrame(
        {label: model.fit(content).forecast(28)
         for label, content in train_df_9.items()}
    )
    res[method] = rateMyForecast(
        trainOG_df_9, test_df_9, forecast_df_9)['RMSSE']

In [7]:
pd.set_option('display.max_rows', None)
df_res_9_base = pd.DataFrame(res)
df_res_9_base.index = pd.MultiIndex.from_tuples(df_res_9_base.index)
df_res_9_base

| store_id | dept_id | Naive | Seasonal Naive | SES | Holt's Linear | Additive Holt-Winters |
|---|---|---|---|---|---|---|
| CA_1 | HOBBIES_1 | 1.4216 | 0.7428 | 0.8598 | 0.8589 | 0.6152 |
| CA_1 | HOBBIES_2 | 1.8601 | 1.0992 | 0.8494 | 0.8414 | 0.6931 |
| CA_1 | HOUSEHOLD_1 | 2.05 | 0.5036 | 1.1212 | 1.1331 | 0.4317 |
| CA_1 | HOUSEHOLD_2 | 2.2407 | 0.5094 | 1.1894 | 1.1486 | 0.5141 |
| CA_1 | FOODS_1 | 0.8768 | 0.688 | 0.8637 | 0.8808 | 0.687 |
| CA_1 | FOODS_2 | 2.0036 | 0.8068 | 2.0035 | 2.2574 | 0.5786 |
| CA_1 | FOODS_3 | 1.6443 | 0.475 | 1.0401 | 0.9787 | 0.4848 |
| CA_2 | HOBBIES_1 | 1.1753 | 0.688 | 1.0901 | 1.0902 | 0.6379 |
| CA_2 | HOBBIES_2 | 1.3073 | 1.4373 | 1.3402 | 1.3308 | 1.1234 |
| CA_2 | HOUSEHOLD_1 | 1.9863 | 0.624 | 1.3668 | 1.3564 | 0.5704 |


---

## Part 2. LightGBM (30 pts.)

### Q2. (10 pts.)

For all series, use an untuned `LightGBM` model with a 56-day lookback and a one-step recursive forecasting strategy to generate a 28-step forecast.

Calculate the `RMSSE` against the test set, then summarize the metrics in a dataframe and print it.
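
The recursive strategy can be sketched independently of the model: a one-step predictor is applied repeatedly, and each forecast is fed back into the lookback window. In the sketch below, `recursive_forecast` and `window_mean` are hypothetical names, and the window mean is only a stand-in for a fitted one-step regressor such as `LGBMRegressor`; this is not the `RecursiveRegressor` implementation from `tsa_tools`.

```python
def recursive_forecast(predict_one, history, w, h):
    """Generate h forecasts one step at a time.

    predict_one: callable mapping the w most recent values to the next value
    (a stand-in for a fitted one-step regressor).
    """
    window = list(history[-w:])
    forecasts = []
    for _ in range(h):
        y_next = predict_one(window)
        forecasts.append(y_next)
        # Slide the window: drop the oldest value, append the new forecast.
        window = window[1:] + [y_next]
    return forecasts


def window_mean(window):
    """Hypothetical one-step 'model': the mean of the lookback window."""
    return sum(window) / len(window)


history = [3.0, 5.0, 4.0, 6.0]
forecasts = recursive_forecast(window_mean, history, w=2, h=3)
print(forecasts)
```

With `w=56` and `h=28`, the later steps of the horizon are predicted mostly from earlier forecasts rather than observations, which is why errors can compound under this strategy.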

In [8]:
model = RecursiveRegressor(
    lgb.LGBMRegressor(random_state=1, n_jobs=-1),
    w=56, h=28)  # Model: recursive forecasting; assumes the wrapper, not LightGBM, takes w and h
pred = {}

for col in train_df_9:
    model.fit(None, train_df_9[col])
    pred[col] = model.predict(trainOG_df_9[col].iloc[-56:]).squeeze()

In [9]:
df_pred_9_rrlgb = pd.DataFrame(pred)
df_pred_9_rrlgb.index = test_df_9.index

df_res_9_rrlgb = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_rrlgb)['RMSSE']
res["RecursiveRegressor(LGBMRegressor)"] = df_res_9_rrlgb

df_res_9_rrlgb.index = pd.MultiIndex.from_tuples(df_res_9_rrlgb.index)
df_res_9_rrlgb.to_frame()

| store_id | dept_id | RMSSE |
|---|---|---|
| CA_1 | HOBBIES_1 | 0.718 |
| CA_1 | HOBBIES_2 | 0.6586 |
| CA_1 | HOUSEHOLD_1 | 0.5073 |
| CA_1 | HOUSEHOLD_2 | 0.5062 |
| CA_1 | FOODS_1 | 0.6677 |
| CA_1 | FOODS_2 | 0.645 |
| CA_1 | FOODS_3 | 0.4652 |
| CA_2 | HOBBIES_1 | 0.6993 |
| CA_2 | HOBBIES_2 | 1.2472 |
| CA_2 | HOUSEHOLD_1 | 0.6587 |


### Q3. (10 pts.)

For all series, use an untuned `LightGBM` model with a 56-day lookback and a direct forecasting strategy to generate a 28-step forecast.

Calculate the `RMSSE` against the test set, then summarize the metrics in a dataframe and print it.
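
Under the direct strategy, each of the 28 horizon steps gets its own target column, so the training data must pair every lookback window with the full block of values that follows it. The sketch below shows that windowing on toy data. `make_windows` is a hypothetical stand-in, not the `TimeseriesGenerator` helper used in this notebook, whose exact behavior is not shown here. Assuming `MultiOutputRegressor` is scikit-learn's, it then fits one estimator per target column.

```python
def make_windows(series, w, h):
    """Split a series into (lookback, horizon) training pairs.

    Each row of X holds w consecutive observations and the matching row of Y
    holds the h observations that immediately follow them.
    """
    X, Y = [], []
    for start in range(len(series) - w - h + 1):
        X.append(series[start:start + w])
        Y.append(series[start + w:start + w + h])
    return X, Y


series = list(range(10))  # toy data: 0, 1, ..., 9
X, Y = make_windows(series, w=4, h=2)
print(len(X), X[0], Y[0], X[-1], Y[-1])
```

Unlike the recursive strategy, no forecast is ever fed back as an input; the price is `h` separate models per series and fewer usable training rows, since each row needs `w + h` consecutive observations.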

In [10]:
model = MultiOutputRegressor(
    lgb.LGBMRegressor(random_state=1))  # Model: direct-forecasting
pred = {}

for col in train_df_9:
    X_train, _, y_train, _ = TimeseriesGenerator(
        X=train_df_9[col],
        y=None,
        w=56,
        h=28)
    model.fit(X_train, y_train)
    pred[col] = model.predict([trainOG_df_9[col].iloc[-56:]]).squeeze()

In [12]:
df_pred_9_morlgb = pd.DataFrame(pred, index=test_df_9.index)

df_res_9_morlgb = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_morlgb)['RMSSE']
res["MultiOutputRegressor(LGBMRegressor)"] = df_res_9_morlgb

df_res_9_morlgb.index = pd.MultiIndex.from_tuples(df_res_9_morlgb.index)
df_res_9_morlgb.to_frame()

| store_id | dept_id | RMSSE |
|---|---|---|
| CA_1 | HOBBIES_1 | 0.687 |
| CA_1 | HOBBIES_2 | 0.6902 |
| CA_1 | HOUSEHOLD_1 | 0.3635 |
| CA_1 | HOUSEHOLD_2 | 0.6683 |
| CA_1 | FOODS_1 | 0.6356 |
| CA_1 | FOODS_2 | 0.7556 |
| CA_1 | FOODS_3 | 0.5491 |
| CA_2 | HOBBIES_1 | 0.6626 |
| CA_2 | HOBBIES_2 | 1.4831 |
| CA_2 | HOUSEHOLD_1 | 0.5651 |


### Q4. (10 pts.)

For all series, generate a 28-step forecast by combining the forecasts generated by the models in Q2 and Q3 (i.e. simple averaging).

Calculate the `RMSSE` against the test set, then summarize the metrics in a dataframe and print it.

In [13]:
df_pred_9_combo = (df_pred_9_morlgb + df_pred_9_rrlgb) / 2

df_res_9_combo = rateMyForecast(
    trainOG_df_9, test_df_9, df_pred_9_combo)['RMSSE']
res["Combo(LGBMRegressor)"] = df_res_9_combo

df_res_9_combo.index = pd.MultiIndex.from_tuples(df_res_9_combo.index)
df_res_9_combo.to_frame()

| store_id | dept_id | RMSSE |
|---|---|---|
| CA_1 | HOBBIES_1 | 0.6904 |
| CA_1 | HOBBIES_2 | 0.6439 |
| CA_1 | HOUSEHOLD_1 | 0.4231 |
| CA_1 | HOUSEHOLD_2 | 0.5778 |
| CA_1 | FOODS_1 | 0.6198 |
| CA_1 | FOODS_2 | 0.6892 |
| CA_1 | FOODS_3 | 0.5021 |
| CA_2 | HOBBIES_1 | 0.6709 |
| CA_2 | HOBBIES_2 | 1.3542 |
| CA_2 | HOUSEHOLD_1 | 0.5993 |


---

## Part 3. WRMSSE (10 pts.)

### Q5. (10 pts.)

Calculate the `WRMSSE` for all the methods described above. The weights can be found in `weights_validation.csv`.

For reference, the M5 benchmarks have the following `WRMSSE` scores at level 9:

- `Naive` = <b>1.764</b>


- `S.Naive` = <b>0.888</b>


- `ES_bu` = <b>0.728</b>

<i>Note: The M5 benchmarks use a bottom-up method for forecasting, so they will not necessarily be equal to your scores.</i>
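
Within a single level, the `WRMSSE` is the weighted sum of the per-series `RMSSE` values. A minimal sketch with made-up scores and weights (the function name `wrmsse` and all numbers are illustrative only; the real weights come from `weights_validation.csv`):

```python
def wrmsse(rmsse_by_series, weight_by_series):
    """Weighted sum of per-series RMSSE values for one aggregation level."""
    return sum(weight_by_series[key] * score
               for key, score in rmsse_by_series.items())


# Made-up scores and weights for three series; the weights sum to 1.
rmsse_by_series = {("CA_1", "FOODS_1"): 0.7,
                   ("CA_1", "FOODS_2"): 0.5,
                   ("CA_1", "FOODS_3"): 0.4}
weight_by_series = {("CA_1", "FOODS_1"): 0.2,
                    ("CA_1", "FOODS_2"): 0.3,
                    ("CA_1", "FOODS_3"): 0.5}

score = wrmsse(rmsse_by_series, weight_by_series)
print(round(score, 4))
```

Keying both dictionaries by the same `(store_id, dept_id)` tuple plays the role that index alignment plays in the pandas `multiply` call: a series only contributes if its score and weight are matched on the same label.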

In [15]:
df_res_9_all = pd.DataFrame(res)
df_res_9_all.index = pd.MultiIndex.from_tuples(df_res_9_all.index)
df_res_9_all

(df_res_9_all.rename_axis(['Agg_Level_1', 'Agg_Level_2'])
 .multiply(weights_df_9.squeeze(), axis=0).sum())

Naive                                  1.5616
Seasonal Naive                         0.8919
SES                                    1.1585
Holt's Linear                          1.1855
Additive Holt-Winters                  0.8200
RecursiveRegressor(LGBMRegressor)      0.7569
MultiOutputRegressor(LGBMRegressor)    0.7814
Combo(LGBMRegressor)                   0.7505
dtype: float64

---

## Part 4. Middle-Out Method (30 pts.)

In [104]:
full_df = (df_sales.set_index([*df_sales.columns[5::-1]]).T
           .set_index(pd.DatetimeIndex(df_calendar.date)[:1941]))
df_weights['Level_no'] = (df_weights['Level_id'].str
                          .extract(r'(\d+)').astype(int))  # raw string avoids an invalid-escape warning
df_weights['TS'] = [x[0] if x[1] == 'X' else x for x in 
                    list(zip(df_weights.Agg_Level_1, df_weights.Agg_Level_2))]

weights1 = df_weights[['Level_no', 'TS', 'Weight']]
w1to8 = (weights1[weights1['Level_no'].isin(range(1,9))]
        .set_index(['Level_no', 'TS']))

# Weights for levels 10 to 12 (range(10, 13) excludes 13).
w10to12 = (weights1[weights1['Level_no'].isin(range(10, 13))]
           .set_index(['Level_no', 'TS']))


### Q6. Bottom-Up (15 pts.)

Using your forecasts from the best-performing method in Q5, use the bottom-up method described in [FPP3](https://otexts.com/fpp3/single-level.html) to generate forecasts for levels 1 to 8.

Calculate the `WRMSSE` for levels 1 to 8 against the test set, then summarize the metrics in a dataframe and print it.

For reference, you can find the benchmark `WRMSSE` scores in the `The M5 Accuracy competition: Results, findings and conclusions` paper.

<i>Note: The M5 benchmarks use a bottom-up method for forecasting, so they will not necessarily be equal to your scores.</i>
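
The bottom-up idea is simply to add the level 9 forecasts that share a parent. The sketch below uses made-up forecasts and a hypothetical `bottom_up` function; it is not the `compute_bottomup` helper called in this notebook.

```python
def bottom_up(base_forecasts, parent_of):
    """Sum base-level forecasts into their parent series, step by step.

    parent_of: callable mapping a base-series key to its parent's key.
    """
    aggregated = {}
    for key, forecast in base_forecasts.items():
        parent = parent_of(key)
        if parent not in aggregated:
            aggregated[parent] = [0.0] * len(forecast)
        aggregated[parent] = [a + f
                              for a, f in zip(aggregated[parent], forecast)]
    return aggregated


# Made-up 3-step forecasts keyed by (store_id, dept_id).
base = {("CA_1", "FOODS_1"): [1.0, 2.0, 3.0],
        ("CA_1", "FOODS_2"): [4.0, 5.0, 6.0],
        ("CA_2", "FOODS_1"): [7.0, 8.0, 9.0]}

by_store = bottom_up(base, parent_of=lambda key: key[0])   # store level
total = bottom_up(base, parent_of=lambda key: "Total")     # grand total
print(by_store, total)
```

Changing only `parent_of` yields each of the coarser levels from the same set of base forecasts, so the aggregated forecasts are coherent by construction.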

In [101]:
df_pred_9_combo.columns = train_df.sum(axis=1, 
                                       level=["store_id", "state_id", 
                                              "cat_id", "dept_id"]).columns
b_up_rmsse = compute_bottomup(full_df, df_pred_9_combo, 9)
bup_rmsse = (pd.DataFrame.from_dict(b_up_rmsse, orient="index")
             .stack().to_frame()).rename(columns={0:'RMSSE'})

In [102]:
bup_res = (pd.concat([bup_rmsse, w1to8], axis=1).reset_index()
           .rename(columns={'level_0': 'Levels'}))
bup_res['WRMSSE'] = bup_res.RMSSE * bup_res.Weight
bup_res.groupby('Levels')[['WRMSSE']].agg('sum')

| Levels | WRMSSE |
|---|---|
| 1 | 0.5601 |
| 2 | 0.5811 |
| 3 | 0.6293 |
| 4 | 0.5691 |
| 5 | 0.6334 |
| 6 | 0.6087 |
| 7 | 0.6804 |
| 8 | 0.6678 |


### Q7. Top-Down (15 pts.)

Using your forecasts from the best-performing method in Q5, use the top-down method with `average historical proportions` described in [FPP3](https://otexts.com/fpp3/single-level.html) to generate forecasts for levels 10 to 12.

Calculate the `WRMSSE` for levels 10 to 12 against the test set, then summarize the metrics in a dataframe and print it.

For reference, you can find the benchmark `WRMSSE` scores in the `The M5 Accuracy competition: Results, findings and conclusions` paper.

<i>Note: The M5 benchmarks use a bottom-up method for forecasting, so they will not necessarily be equal to your scores.</i>
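
With average historical proportions, each child series receives a fixed share of the parent forecast, where the share is the mean over the training period of the child's fraction of the parent total: $p_j = \frac{1}{T}\sum_{t=1}^{T} y_{j,t} / y_t$. The sketch below uses made-up data and hypothetical function names, and is not the `compute_topdown` helper. Skipping periods where the parent total is zero is a choice made for this sketch.

```python
def average_historical_proportions(children_history):
    """Mean over time of each child's share of the parent total."""
    names = list(children_history)
    n_periods = len(children_history[names[0]])
    sums = {name: 0.0 for name in names}
    used = 0
    for t in range(n_periods):
        total = sum(children_history[name][t] for name in names)
        if total == 0:
            continue  # sketch-specific choice: skip periods with no sales
        used += 1
        for name in names:
            sums[name] += children_history[name][t] / total
    return {name: sums[name] / used for name in names}


def top_down(parent_forecast, proportions):
    """Split each step of the parent forecast by the fixed proportions."""
    return {name: [p * f for f in parent_forecast]
            for name, p in proportions.items()}


# Made-up history for two items under one store-department parent.
history = {"item_A": [1, 3, 2, 2], "item_B": [3, 1, 2, 6]}
props = average_historical_proportions(history)
child_forecasts = top_down([10.0, 20.0], props)
print(props, child_forecasts)
```

The proportions sum to 1, so the child forecasts add back up to the parent forecast at every step.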

In [109]:
t_down_rmsse = compute_topdown(full_df, df_pred_9_combo, 9)

  0%|          | 0/3 [00:00<?, ?it/s]


NameError: name 'full_df' is not defined