# Project 2

We're going to continue from where we left off with Project 1. Project 1 left us with a daily time series for every product with no gaps -- exactly what we want for modeling!

In [None]:
data_path = ''

In [None]:
import pandas as pd
import numpy as np

In [None]:
data = pd.read_parquet(f'{data_path}/sales_data.parquet')
data.head()

In [None]:
data.shape

We did some EDA in Project 1, but it was primarily focused on higher-level patterns (i.e., at the department level). This time, spend some time doing EDA at the item level to see what kind of items you're dealing with.

Some questions you may want to explore:

1. How do high-volume items compare to low-volume/intermittent items?
2. What sort of seasonal patterns are at play?
3. Do items from different departments show different patterns?
4. Does the same item show different behavior at different stores?

These questions are just a starting point. Feel free to explore this any way you feel is necessary to make better models. The best EDA is done iteratively, so I encourage you to come back to this once you've started fitting models!
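
As one possible starting point for question 1, the sketch below profiles each item by its average daily sales and its share of zero-sales days. It runs on a small made-up frame (`toy`) rather than the real `data`, and the 50% cutoff for calling an item intermittent is an arbitrary assumption, not a standard threshold.

```python
import pandas as pd

# Hypothetical toy data standing in for `data`: one row per id per day.
toy = pd.DataFrame({
    'id': ['a'] * 6 + ['b'] * 6,
    'sales': [5, 7, 6, 8, 5, 9, 0, 0, 1, 0, 0, 2],
})

# One row per id: average daily sales and the fraction of days with no sales.
item_profile = (
    toy.groupby('id')['sales']
    .agg(mean_sales='mean', zero_share=lambda s: (s == 0).mean())
)

# Assumed, arbitrary cutoff: more than half of the days have zero sales.
item_profile['intermittent'] = item_profile['zero_share'] > 0.5
print(item_profile)
```

On the real data you would likely sort `item_profile` by `mean_sales` and plot a few series from each end of the ranking.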

In [None]:
# do some EDA!

Before we get to modeling, let's create our evaluation setup. The models that we're going to create have a 28-day forecast horizon, and our goal is to best approximate "average" sales.

The first step is to implement our evaluation metric. The original competition used a metric called RMSSE, or "Root Mean Squared Scaled Error." It's similar to the MASE metric that we discussed before, except that it is a squared-error metric, so it is minimized by forecasting "average" sales (MASE, as an absolute-error metric, is minimized by the median instead). The competition actually used a weighted version of RMSSE, which is technically more robust, but we're going to stick to plain RMSSE. Here's what RMSSE looks like:

$RMSSE = \sqrt{\frac{1}{h}\frac{\sum^{n+h}_{t=n+1} (Y_t - \hat{Y}_t)^2}{\frac{1}{n-1}\sum^n_{t=2} (Y_t - Y_{t-1})^2}}$

where $Y_t$ is the actual future value of sales at date $t$, $\hat{Y}_t$ is your forecast for date $t$, $n$ is the number of dates in our training set, and $h$ is our forecast horizon (28 days, in our case).

That looks intimidating! But, similarly to MASE, you can break it down into two parts:
- The numerator: $\frac{1}{h}\sum^{n+h}_{t=n+1} (Y_t - \hat{Y}_t)^2$, which is just the MSE for every prediction in the validation set.
- The denominator: $\frac{1}{n-1}\sum^n_{t=2} (Y_t - Y_{t-1})^2$, which is just the MSE over the entire training set if your forecast was a naive, one-day-ahead forecast. We refer to this as the "scale" since it's really just a benchmark -- errors less than this are better than the benchmark, and errors greater than this are worse.

Of course, the "naive, one-day-ahead forecast" part only works if you calculate both the numerator and denominator separately for each `id`. So, the idea here is that you are effectively calculating an RMSSE value for each `id`, and then averaging those to get the final RMSSE.
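
To see the two parts in action, here is a small worked example for a single series. The numbers are the ones used for id `a` in the test case further down; the variable names are just for illustration.

```python
import numpy as np

train_sales = np.array([3, 2, 5])   # training history for one id
actual = np.array([6, 1, 4])        # actual sales in the validation window
forecast = np.array([1, 2, 3])      # model output for the same dates

# Numerator: MSE of the forecast over the validation window.
numerator = np.mean((actual - forecast) ** 2)

# Denominator (the "scale"): MSE of a naive one-day-ahead forecast on the
# training set, i.e. the mean squared day-over-day change.
denominator = np.mean(np.diff(train_sales) ** 2)

rmsse_single = np.sqrt(numerator / denominator)
print(numerator, denominator, rmsse_single)
```

Here the numerator is 9 and the scale is 5, so this series scores about 1.34, meaning the forecast did worse than the naive benchmark.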

Last comment: there are products in the dataset that don't start showing sales for some time. For those products, the denominator is only supposed to be calculated from the first sale onward. I'd recommend just dropping the records for those products before that first sale, which is straightforward to do using `.cumsum()` over `sales` while grouping by `id`.
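
A minimal sketch of the cumsum idea on a made-up frame, assuming rows are already sorted by date within each `id`: the running total of `sales` stays at zero until the first sale, so keeping rows where it is positive drops the leading zeros only.

```python
import pandas as pd

# Hypothetical toy data: id 'a' has two leading zero-sales days, id 'b' has none.
toy = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'sales': [0, 0, 3, 0, 2, 0, 0, 1],
})

# Running total of sales within each id; it first becomes positive on the first sale.
has_sold = toy.groupby('id')['sales'].cumsum() > 0

# Keep the first sale and everything after it, including later zero-sales days.
trimmed = toy[has_sold]
print(trimmed)
```

Note that zeros after the first sale are kept, since those are real observations of no demand.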

In [None]:
# QUESTION: drop each product's records before its first sale using cumsum

Here's how you should implement your RMSSE:

1. Create a function called `rmsse` that looks like this:

`def rmsse(train, val, y_pred):`

where:
- `train` is the `pd.DataFrame` representing the training set
- `val` is the `pd.DataFrame` representing the validation set
- `y_pred` is either a `pd.Series` or `np.ndarray` that is the output of your model

2. Start by calculating the scale (i.e. denominator from above) for each `id` over the training set.

3. Then, calculate the MSE for each `id` over the validation set.

4. Merge the scale dataframe onto the dataframe that contains your validation MSE values.

5. Use the merged dataframe to calculate the RMSSE for each `id`, and finally return the average of all of those RMSSE values.

Don't worry that you haven't split your data into training and validation sets yet. I gave you a test case below to see if your code is working before you move on. Also, don't be afraid to do this in a simple, looped fashion before refactoring it into more beautiful Pandas code. Take advantage of that test case!
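
If you want something to check your own version against, here is one possible looped sketch of the steps above. The name `rmsse_looped` is hypothetical, and it assumes that `y_pred` is aligned row by row with `val`, that `train` is sorted by date within each `id`, and that no series has a scale of zero (a perfectly flat training history would cause a division by zero).

```python
import numpy as np
import pandas as pd

def rmsse_looped(train, val, y_pred):
    """Average per-id RMSSE, written as an explicit loop for readability."""
    # Squared error per validation row (y_pred assumed aligned with val's rows).
    val = val.assign(sq_err=(val['sales'].to_numpy() - np.asarray(y_pred)) ** 2)

    scores = []
    for series_id, val_group in val.groupby('id'):
        train_sales = train.loc[train['id'] == series_id, 'sales'].to_numpy()
        # Scale: mean squared day-over-day change in the training set.
        scale = np.mean(np.diff(train_sales) ** 2)
        # MSE of the forecast over the validation set for this id.
        mse = val_group['sq_err'].mean()
        scores.append(np.sqrt(mse / scale))

    return float(np.mean(scores))
```

A groupby-and-merge version following steps 2 through 5 should return the same number; the loop is just easier to debug.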

In [None]:
# QUESTION: implement rmsse

In [None]:
def test_rmsse():
    test_train = pd.DataFrame({
        'id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'sales': [3, 2, 5, 100, 150, 60, 10, 20, 30],
    })
    test_val = pd.DataFrame({
        'id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
        'sales': [6, 1, 4, 200, 120, 270, 10, 20, 30],
    })
    test_y_pred = pd.Series([1, 2, 3, 180, 160, 240, 20, 30, 40])

    assert np.abs(rmsse(test_train, test_val, test_y_pred) - 0.92290404515501) < 1e-6

test_rmsse()

# Fitting models

From this point on, the project is a bit of a "choose your own adventure." There's a huge range of skill levels out there, and I want to provide you with a path that will meet you where you're at (but test you a little). At a minimum, though, you'll be fitting a LightGBM model.

1. If you're a beginner, use [`mlforecast`](https://nixtla.github.io/mlforecast/) (the sister package to `statsforecast`). It helps a lot with both feature engineering and model fitting, so you'll be able to try out a lot of options without getting bogged down in writing complex code. Focus your efforts on trying lots of different features/hyperparameters and seeing how they affect your model!

If you want to go this route, here are the steps you should take:

- Familiarize yourself with mlforecast [here](https://nixtla.github.io/mlforecast/)
- Read the code in the below cell. This is your starting point!
- Try adding other `date_features`, like the week of year and day of year.
- Try adding `static_features=['item_id', 'dept_id', 'cat_id']` to `fcst.fit()`
- Try out other rolling mean/std lengths and at different lags to see if they help. (You can import `rolling_std` from `window_ops.rolling`)
- Try adding seasonal rolling means using the following code, which implements a 4 week seasonal rolling mean with a season length of 7 days:

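```
from window_ops.rolling import seasonal_rolling_mean

# Give the wrapper its own name; reusing `seasonal_rolling_mean` would make it call itself.
@njit
def seasonal_rolling_mean_7x4(x):
    return seasonal_rolling_mean(x, season_length=7, window_size=4, min_samples=1)
```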
- Try out some difference and lag features.
- Try adding variables from the other data files, such as price.
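
To make the lag, difference, and rolling-window features in the list above concrete, here is a plain-pandas sketch of what they compute. It is only an illustration on a made-up frame with hypothetical column names; `mlforecast` builds the equivalent features for you, so you don't need this code in the project itself.

```python
import pandas as pd

# Hypothetical toy data: two short series, already sorted by date within id.
toy = pd.DataFrame({
    'id': ['a'] * 5 + ['b'] * 5,
    'sales': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})
grouped = toy.groupby('id')['sales']

# Lag feature: yesterday's sales for the same id.
toy['lag_1'] = grouped.shift(1)

# Difference feature: change in sales versus yesterday.
toy['diff_1'] = grouped.diff(1)

# Rolling feature on lagged data: mean of the previous two days, excluding today.
toy['lag_1_roll_mean_2'] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())
print(toy)
```

Grouping by `id` before shifting matters: without it, the first rows of `b` would pick up values from the end of `a`.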

2. If you feel more comfortable, then I want you to not only try out different features/hyperparameters, but also compare modeling methods! Some things to try:

- Features
    - Benefits from lag features vs. rolling window features
    - Which rolling window aggregations help
    - Comparing seasonal rolling features to non-seasonal
    - Features aggregated at the department/category level (but make sure to only calculate over the training set!)
- Modeling
    - Simple 28-day forecast horizon LightGBM model
    - MLForecast's recursive strategy
    - The multi-horizon strategy (i.e. one model predicting 7 days out, a second model predicting 14 days out, etc.)
    - Deep learning models using [`neuralforecast`](https://nixtla.github.io/neuralforecast/) or `darts`

3. (Optional) No matter which group you fit into, try adding calendar and price features from the other data files that I added!
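
For the multi-horizon strategy mentioned above, one way to think about it is that each model gets its own target column: the sales value `h` days ahead. The sketch below only builds those targets with pandas on a made-up frame; the column names are hypothetical, and fitting one LightGBM model per target is left out.

```python
import pandas as pd

# Hypothetical toy data: 20 days for each of two ids, sorted by date within id.
toy = pd.DataFrame({
    'id': ['a'] * 20 + ['b'] * 20,
    'sales': list(range(20)) + list(range(100, 120)),
})

# One target column per horizon: sales h days in the future for the same id.
for h in [7, 14]:
    toy[f'target_h{h}'] = toy.groupby('id')['sales'].shift(-h)

# Rows near the end of each series have no future value yet, so drop them
# from the training data for that horizon.
train_h7 = toy.dropna(subset=['target_h7'])
print(train_h7.tail())
```

Each horizon's model would then be trained on the same features but a different target column, which avoids feeding predictions back in as inputs the way the recursive strategy does.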

In [None]:
# Don't worry about any error outputs here, unless you get the same "Retrying" error as Project 1
! pip install mlforecast==0.6.0

In [None]:
from mlforecast import MLForecast
from statsforecast import StatsForecast
from sklearn.preprocessing import OrdinalEncoder
from numba import njit
from window_ops.rolling import rolling_mean
import lightgbm as lgb

# split into training and validation sets and conform the column names to what MLForecast expects
val = (
    data
    .reset_index()
    .groupby('id')
    .tail(28)
    .rename(columns={
        'date': 'ds',
        'id': 'unique_id',
        'sales': 'y',
    })
)
train = (
    data
    .reset_index()
    .drop(val.index)
    .rename(columns={
        'date': 'ds',
        'id': 'unique_id',
        'sales': 'y',
    })
)

# label encode categorical features
cat_feats = ['unique_id', 'item_id', 'dept_id', 'cat_id']
enc_cat_feats = [f'{feat}_enc' for feat in cat_feats]

encoder = OrdinalEncoder()
train[enc_cat_feats] = encoder.fit_transform(train[cat_feats])
val[enc_cat_feats] = encoder.transform(val[cat_feats])

reference_cols = ['unique_id', 'ds', 'y']

# add features to this list if you want to use them
features = reference_cols + enc_cat_feats
train = train[features]
val = val[features]

@njit
def rolling_mean_28(x):
    return rolling_mean(x, window_size=28)

# feel free to tweak these parameters!
model_params = {
    'verbose': -1,
    'num_leaves': 256,
    'n_estimators': 50,
    'objective': 'tweedie',
    'tweedie_variance_power': 1.1,
}

models = [
    lgb.LGBMRegressor(**model_params),
]


fcst = MLForecast(
    models=models,
    freq='D',
    # dictionary reads like this:
    # {number of days to lag the feature: [list of functions to apply to the lagged data]}
    lag_transforms={
        7: [rolling_mean_28]
    },
    date_features=['dayofweek'],
)

# don't worry about null value warnings. LightGBM and XGBoost can handle them!
fcst.fit(
    train, 
    id_col='unique_id', 
    time_col='ds', 
    target_col='y', 
    dropna=False
)

predictions = fcst.predict(28)

# plot the last 45 days of the training set, the validation set, and the predictions
plot_data = (
    pd.concat([
        train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']], 
        val[['unique_id', 'ds', 'y']], 
        predictions
    ])
)

# for some reason, MLForecast doesn't have this awesome plotting method!
StatsForecast.plot(plot_data)
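
To score the forecast with your RMSSE function, the predictions need to line up row by row with the validation set, and the columns need their original names back. The sketch below shows one way to do that on made-up frames. It assumes the prediction column is named after the model class (`LGBMRegressor`) and that the predictions frame carries `unique_id` and `ds` as columns; check `predictions.head()` to confirm both before relying on it.

```python
import pandas as pd

# Hypothetical stand-ins for `val` and `predictions`, with rows in different orders.
dates = pd.to_datetime(['2020-01-01', '2020-01-02'] * 2)
val_toy = pd.DataFrame({
    'unique_id': ['a', 'a', 'b', 'b'],
    'ds': dates,
    'y': [1, 2, 3, 4],
})
preds_toy = pd.DataFrame({
    'unique_id': ['b', 'b', 'a', 'a'],
    'ds': dates,
    'LGBMRegressor': [3.5, 4.5, 1.5, 2.5],
})

# A left merge keeps the row order of the validation set.
aligned = val_toy.merge(preds_toy, on=['unique_id', 'ds'], how='left')

# Rename back to the column names the metric expects.
val_for_metric = aligned.rename(columns={'unique_id': 'id', 'y': 'sales'})
y_pred = aligned['LGBMRegressor']
print(y_pred.tolist())
```

With the real frames, you would rename `train` the same way and call `rmsse(train_for_metric, val_for_metric, y_pred)`.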

Write a brief summary of what helped your models and what didn't help. Was it what you expected?