# Feature Engineering

ForecastFlowML includes a preprocessing module for creating features from a time
series dataset. This user guide shows how to create those features in a scalable way
before the modelling phase.

## Imports

In [14]:
from forecastflowml import FeatureExtractor
from forecastflowml import ForecastFlowML
from forecastflowml.data.loader import load_walmart_m5
from pyspark.sql import SparkSession
from lightgbm import LGBMRegressor
import pandas as pd
import sys
import os

os.environ["PYSPARK_PYTHON"] = sys.executable
pd.set_option("display.max_columns", 100)

## Initialize Spark

In [15]:
spark = (
    SparkSession.builder.master("local[4]")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

## Sample Dataset

In [16]:
df = load_walmart_m5(spark).localCheckpoint()
df.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|
+--------------------+-----------+-------+------+--------+--------+----------+-----+
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-15|  3.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-16|  0.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-17|  1.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-18|  0.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-19|  0.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-20|  0.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-21|  0.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-22|  0.0|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      T

## Feature Overview


With ``FeatureExtractor``, we can extract:
- Lag features
- Rolling statistics (mean, standard deviation etc.) with specified lags
- Count of consecutive occurrences of a specific value, which can be used to count the number of out-of-stock periods
- History length that refers to the number of periods from the beginning of the time series
- Date features


## Lags

When extracting the features, we should be careful about the lags we are creating.
In this example, we are going to prepare features for 4 weekly models.

- Model 1 will predict days 1–7, not using the 6 most recent lag features.
- Model 2 will predict days 8–14, not using the 13 most recent lag features.
- Model 3 will predict days 15–21, not using the 20 most recent lag features.
- Model 4 will predict days 22–28, not using the 27 most recent lag features.

For lag features, we are going to extract the sales on the same week day over the past 4 weeks. 

![image info](../_static/lag.svg)

Since each model has a different horizon, each will be allowed to use different lags in the modelling phase. In summary, we need ``lag_7``, ``lag_14``, ``lag_21``, ``lag_28``, ``lag_35``, ``lag_42`` and ``lag_49`` as features. The cell below extracts eight weekly lags, so it also produces ``lag_56``.
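
The lag bookkeeping described above can be checked with a few lines of plain Python. This is only a sketch of the reasoning in this section, not ForecastFlowML code; the constant and function names are made up for illustration.

```python
# Sketch: which same-weekday lags each weekly model may use.
# All names here are illustrative and not part of ForecastFlowML.
MODEL_HORIZON = 7          # days covered by each model
MAX_FORECAST_HORIZON = 28  # total number of days to forecast
WEEKS_OF_HISTORY = 4       # same-weekday observations per model


def usable_lags(model_number):
    """Return the same-weekday lags available to a 1-indexed weekly model."""
    # The last day a model predicts lies model_number * MODEL_HORIZON days
    # ahead, so any smaller lag would point at a value not yet observed.
    min_lag = model_number * MODEL_HORIZON
    return [min_lag + MODEL_HORIZON * k for k in range(WEEKS_OF_HISTORY)]


n_models = MAX_FORECAST_HORIZON // MODEL_HORIZON
lags_per_model = {m: usable_lags(m) for m in range(1, n_models + 1)}
all_lags = sorted({lag for lags in lags_per_model.values() for lag in lags})

for model_number, lags in lags_per_model.items():
    print(f"model {model_number}: {lags}")
print("lags to extract:", all_lags)
```

The union of the per-model lags runs from 7 to 49 in steps of 7, which matches the list of lag features named above.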

In [17]:
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "lag": [7 * (i + 1) for i in range(8)],
    },
)
df_features = feature_extractor.transform(df)
df_features.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|lag_7|lag_14|lag_21|lag_28|lag_35|lag_42|lag_49|lag_56|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-15|  3.0| null|  null|  null|  null|  null|  null|  null|  null|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-16|  0.0| null|  null|  null|  null|  null|  null|  null|  null|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-17|  1.0| null|  null|  null|  null|  null|  null|  null|  null|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-18|  0.0| null|  null|  null|  null|  null|  null|  null|  null|
|FOODS_1_002_TX_1_..

## Rolling Statistics

For rolling statistics, we are going to calculate the mean over **windows** of 7, 14 and 30 days, each combined with the **most recent lag** that a model can use: 7 days for model 1, 14 days for model 2, 21 days for model 3 and 28 days for model 4.

![image info](../_static/lag_window.svg)
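
To make the window and lag combination concrete, here is a minimal pure-Python sketch of one plausible reading: the window of values ends exactly ``lag`` steps before the current row. The function name is hypothetical, and returning ``None`` when the window is incomplete is a choice made for this sketch; the library's exact boundary handling may differ.

```python
def lagged_rolling_mean(values, window, lag):
    """Mean of `window` consecutive values ending `lag` steps before each row.

    Rows without a full window of history yield None (a choice made for this
    sketch; the library may treat incomplete windows differently).
    """
    result = []
    for t in range(len(values)):
        end = t - lag             # index of the most recent value allowed
        start = end - window + 1  # index of the oldest value in the window
        if start < 0:
            result.append(None)
        else:
            chunk = values[start : end + 1]
            result.append(sum(chunk) / window)
    return result


sales = [float(day) for day in range(1, 31)]  # toy series: 1.0, 2.0, ..., 30.0
feature = lagged_rolling_mean(sales, window=7, lag=7)

# The first complete window covers the values 1.0 to 7.0 and is attached
# to the row at index 13 (7 steps after the window ends).
print(feature[13])
```

Because the window never reaches closer than ``lag`` steps to the current row, a model whose horizon ends ``lag`` days ahead does not see values that would be unavailable at prediction time.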

In [18]:
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "mean": [[window, lag] for lag in [7, 14, 21, 28] for window in [7, 14, 30]],
    },
)
df_features = feature_extractor.transform(df)
df_features.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|window_7_lag_7_mean|window_14_lag_7_mean|window_30_lag_7_mean|window_7_lag_14_mean|window_14_lag_14_mean|window_30_lag_14_mean|window_7_lag_21_mean|window_14_lag_21_mean|window_30_lag_21_mean|window_7_lag_28_mean|window_14_lag_28_mean|window_30_lag_28_mean|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+-------------------

## Out-of-stock Periods

Sometimes a product might be out-of-stock for a certain period. We are now going to
count the consecutive periods with zero sales, again using the **most recent lags**
that the models can use.
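
The idea can be sketched in plain Python as a running streak counter followed by a shift. This is an illustrative reading of the feature, not the library's implementation: the helper names are hypothetical, and the last two values of the toy series are made up.

```python
def consecutive_count(values, target=0):
    """Running length of the current streak of `target` at each position."""
    counts, streak = [], 0
    for value in values:
        # Extend the streak on a match, otherwise reset it.
        streak = streak + 1 if value == target else 0
        counts.append(streak)
    return counts


def lagged(values, lag):
    """Shift a list forward by `lag` positions, padding the start with None."""
    return [None] * min(lag, len(values)) + values[: max(len(values) - lag, 0)]


sales = [3, 0, 1, 0, 0, 0, 0, 0, 2, 0]  # toy daily sales
streaks = consecutive_count(sales, target=0)
feature = lagged(streaks, lag=2)  # a small lag keeps the toy example short

print(streaks)
print(feature)
```

With the lags of 7, 14, 21 and 28 used in this guide, each model only sees the streak length as it stood at its most recent usable lag.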

In [19]:
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    count_consecutive_values={
        "value": 0,
        "lags": [7, 14, 21, 28],
    },
)
df_features = feature_extractor.transform(df)
df_features.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+-----------------------------+------------------------------+------------------------------+------------------------------+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|count_consecutive_value_lag_7|count_consecutive_value_lag_14|count_consecutive_value_lag_21|count_consecutive_value_lag_28|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----------------------------+------------------------------+------------------------------+------------------------------+
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-15|  3.0|                         null|                          null|                          null|                          null|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-16|  0.0|                         null|                          null|                       

## History Length

We can also count the total number of periods that have passed since the start of each time series.
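
Conceptually this is a per-series running row count that starts at 1. The sketch below shows that idea on a toy list of ``(id, date)`` pairs; it assumes each series has no missing dates, and the function name is hypothetical rather than part of the library.

```python
from collections import defaultdict
from datetime import date


def add_history_length(rows):
    """Append a 1-based running count per series to (series_id, day) rows."""
    seen = defaultdict(int)
    result = []
    # Sorting by (series_id, day) puts each series in chronological order.
    for series_id, day in sorted(rows):
        seen[series_id] += 1
        result.append((series_id, day, seen[series_id]))
    return result


rows = [
    ("A", date(2015, 1, 15)),
    ("B", date(2015, 1, 16)),
    ("A", date(2015, 1, 16)),
    ("A", date(2015, 1, 17)),
    ("B", date(2015, 1, 17)),
]
for row in add_history_length(rows):
    print(row)
```

Series "B" starts one day later than series "A", so its count also starts at 1 on its own first date.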

In [20]:
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    history_length=True,
)
df_features = feature_extractor.transform(df)
df_features.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+--------------+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|history_length|
+--------------------+-----------+-------+------+--------+--------+----------+-----+--------------+
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-15|  3.0|             1|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-16|  0.0|             2|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-17|  1.0|             3|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-18|  0.0|             4|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-19|  0.0|             5|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-20|  0.0|             6|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-21|  0.0|             7|


## Date Features

Finally, we can also include date-derived features.
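
For intuition, the following standard-library sketch reproduces the values shown in this section's output for 2015-01-15 and 2015-01-17. The formulas are inferred from those rows only: Sunday-first day numbering, ISO week numbers and a week of month computed as blocks of seven days are assumptions, not the library's documented definitions.

```python
from datetime import date


def date_features(day):
    """Derive calendar features from a date (definitions are assumptions)."""
    return {
        "day_of_month": day.day,
        # Sunday=1 ... Saturday=7, which matches the rows shown in this guide.
        "day_of_week": day.isoweekday() % 7 + 1,
        "week_of_year": day.isocalendar()[1],     # ISO 8601 week number
        "week_of_month": (day.day - 1) // 7 + 1,  # days 1-7 -> 1, 8-14 -> 2, ...
        "weekend": int(day.isoweekday() >= 6),    # Saturday or Sunday
        "quarter": (day.month - 1) // 3 + 1,
        "month": day.month,
        "year": day.year,
    }


thursday = date_features(date(2015, 1, 15))
saturday = date_features(date(2015, 1, 17))
print(thursday)
print(saturday)
```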

In [21]:
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    date_features=[
        "day_of_month",
        "day_of_week",
        "week_of_year",
        "week_of_month",
        "weekend",
        "quarter",
        "month",
        "year",
    ],
)
df_features = feature_extractor.transform(df)
df_features.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+------------+-----------+------------+-------------+-------+-------+-----+----+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|day_of_month|day_of_week|week_of_year|week_of_month|weekend|quarter|month|year|
+--------------------+-----------+-------+------+--------+--------+----------+-----+------------+-----------+------------+-------------+-------+-------+-----+----+
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-15|  3.0|          15|          5|           3|            3|      0|      1|    1|2015|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-16|  0.0|          16|          6|           3|            3|      0|      1|    1|2015|
|FOODS_1_002_TX_1_...|FOODS_1_002|FOODS_1| FOODS|    TX_1|      TX|2015-01-17|  1.0|          17|          7|           3|            3|      1|      1|    1|2015|
|FOODS_1_002_TX_

## Combine Features

Let's combine all of the feature extraction steps together.

In [22]:
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "lag": [7 * (i + 1) for i in range(8)],
        "mean": [[window, lag] for lag in [7, 14, 21, 28] for window in [7, 14, 30]],
    },
    date_features=[
        "day_of_month",
        "day_of_week",
        "week_of_year",
        "week_of_month",
        "weekend",
        "quarter",
        "month",
        "year",
    ],
    count_consecutive_values={
        "value": 0,
        "lags": [7, 14, 21, 28],
    },
    history_length=True,
)

In [23]:
df_train = feature_extractor.transform(df).localCheckpoint()
df_train.show(10)

+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------------+-------+-------+-----+----+
|                  id|    item_id|dept_id|cat_id|store_id|state_id|      date|sales|lag_7|lag_14|lag_21|lag_28|lag_35|lag_42|lag_49|lag_56|window_7_lag_7_mean|window_14_lag_7_mean|window_30_lag_7_mean|window_7_lag_14_mean|window_14_lag_14_mean|window_30_lag_14_mean|window_7_lag_21_mean|window_14_lag_21_mean|window_30_lag_21_mean|window_7_lag_28_mean|window_14_lag_28_mean|window_30_la

## Training

We can now pass the features created by ``FeatureExtractor`` to ``ForecastFlowML`` for training. As mentioned in the lag feature creation step, we are going to set ``use_lag_range=28`` so that each model uses lag features up to 28 days beyond its most recent allowed lag. Also, since the models that will be built are small enough not to cause memory problems, we are going to keep them as a class attribute with the ``local_result=True`` argument.
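
One way to picture the effect of ``use_lag_range`` is as a filter on feature names. The sketch below is an assumption about the behaviour, not the library's code: it reads the lag from each column name, keeps columns without a lag, and keeps lagged columns whose lag lies between the last day of the model's horizon and that value plus ``use_lag_range``. The inclusive upper bound is a guess that is consistent with ``lag_56`` appearing for the fourth-week model in the feature importance output in this guide.

```python
import re


def lag_of(feature_name):
    """Return the lag encoded in a feature name, or None if there is none."""
    match = re.search(r"lag_(\d+)", feature_name)
    return int(match.group(1)) if match else None


def select_features(feature_names, horizon, use_lag_range):
    """Keep non-lag features and lag features inside the allowed range."""
    min_lag = max(horizon)             # furthest day the model predicts
    max_lag = min_lag + use_lag_range  # assumed inclusive upper bound
    selected = []
    for name in feature_names:
        lag = lag_of(name)
        if lag is None or min_lag <= lag <= max_lag:
            selected.append(name)
    return selected


features = [
    "day_of_week", "history_length",
    "lag_7", "lag_14", "lag_21", "lag_28",
    "lag_35", "lag_42", "lag_49", "lag_56",
    "window_7_lag_7_mean", "window_7_lag_28_mean",
    "count_consecutive_value_lag_7",
]
week_1 = select_features(features, horizon=range(1, 8), use_lag_range=28)
week_4 = select_features(features, horizon=range(22, 29), use_lag_range=28)
print(week_1)
print(week_4)
```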

In [24]:
forecast_flow = ForecastFlowML(
    group_col="store_id",
    id_col="id",
    date_col="date",
    target_col="sales",
    date_frequency="days",
    model_horizon=7,
    max_forecast_horizon=28,
    model=LGBMRegressor(),
    use_lag_range=28,
)
forecast_flow.train(df_train, local_result=True)
forecast_flow.model_

| | store_id | forecast_horizon | model | start_time | end_time | elapsed_seconds |
|---|---|---|---|---|---|---|
| 0 | CA_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [[128, 4, 149, 236, 1, 0, 0, 0, 0, 0, 0, 140, ... | 19-May-2023 (03:34:43) | 19-May-2023 (03:34:44) | 0.7 |
| 1 | TX_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [[128, 4, 149, 236, 1, 0, 0, 0, 0, 0, 0, 140, ... | 19-May-2023 (03:34:44) | 19-May-2023 (03:34:45) | 0.7 |
| 2 | WI_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [[128, 4, 149, 236, 1, 0, 0, 0, 0, 0, 0, 140, ... | 19-May-2023 (03:34:45) | 19-May-2023 (03:34:46) | 0.9 |
| 3 | TX_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [[128, 4, 149, 236, 1, 0, 0, 0, 0, 0, 0, 140, ... | 19-May-2023 (03:34:42) | 19-May-2023 (03:34:43) | 0.8 |


### Examine Features

Let's examine which features are used for each model.

In [25]:
importance = forecast_flow.get_feature_importance()
importance

| | store_id | forecast_horizon | feature | importance |
|---|---|---|---|---|
| 0 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | day_of_week | 103 |
| 1 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | week_of_year | 126 |
| 2 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | week_of_month | 23 |
| 3 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | month | 32 |
| 4 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | quarter | 0 |
| ... | ... | ... | ... | ... |
| 283 | WI_1 | [22, 23, 24, 25, 26, 27, 28] | lag_56 | 184 |
| 284 | WI_1 | [22, 23, 24, 25, 26, 27, 28] | window_7_lag_28_mean | 278 |
| 285 | WI_1 | [22, 23, 24, 25, 26, 27, 28] | window_14_lag_28_mean | 325 |
| 286 | WI_1 | [22, 23, 24, 25, 26, 27, 28] | window_30_lag_28_mean | 390 |


Here we can see that the minimum lag used by the first-week model is ``lag_7``, and by the second-week model it is ``lag_14``.

In [26]:
importance[importance["store_id"] == "CA_1"].head(36)

| | store_id | forecast_horizon | feature | importance |
|---|---|---|---|---|
| 0 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | day_of_week | 103 |
| 1 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | week_of_year | 126 |
| 2 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | week_of_month | 23 |
| 3 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | month | 32 |
| 4 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | quarter | 0 |
| 5 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | history_length | 199 |
| 6 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | weekend | 69 |
| 7 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | year | 0 |
| 8 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | day_of_month | 182 |
| 9 | CA_1 | [1, 2, 3, 4, 5, 6, 7] | lag_7 | 234 |
