### Time Series Workshop 
# 3. Air Pollutants &#x1F525;: Feature Engineering

In this notebook, we will continue to work with our well-known air-pollutants data set and introduce common feature engineering techniques for time series forecasting.

In [None]:
%config InlineBackend.figure_format='retina'
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from timeseries.data import load_air_quality
from feature_engine.creation import CyclicalFeatures

DATA_DIR = Path("..") / Path("data")

## Load and process data

In [None]:
FILE_PATH = DATA_DIR / "air_quality.csv"
variables = ["co_sensor", "humidity"]

df_in = load_air_quality(FILE_PATH)[variables]
# Keep only the rows in which every selected variable is non-negative
for var in variables:
    df_in = df_in[df_in[var] >= 0]

df_in.head()

## Time related features

In [None]:
df = df_in.copy()

df["month"] = df.index.month
df["week"] = df.index.isocalendar().week
df["day"] = df.index.day
df["day_of_week"] = df.index.day_of_week
df["hour"] = df.index.hour
df["is_weekend"] = np.where(df["day_of_week"] > 4, 1, 0)
df.head()

## Lag features
Lag features are past values of the variable that we can use to predict future values.

Here, we will use the following lag features to predict the next hour's pollutant concentration:
- The pollutant concentration for the previous three hours (t-1, t-2, t-3).
- The pollutant concentration for the same hour on the previous day (t-24).

The reasoning behind this is that pollutant concentrations do not change quickly and, as previously demonstrated, have a 24-hour seasonality.
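
Because rows with negative readings were filtered out above, the hourly index may contain gaps. The minimal sketch below, on a hypothetical four-row toy series (the values and the column names `lag_1_by_row` and `lag_1_by_time` are made up for illustration), shows why a time-based shift followed by a merge on the index can be safer than a purely positional shift in that situation.

```python
import pandas as pd

# Toy hourly series with the 02:00 reading missing (hypothetical values).
idx = pd.to_datetime(
    [
        "2004-04-04 00:00",
        "2004-04-04 01:00",
        "2004-04-04 03:00",
        "2004-04-04 04:00",
    ]
)
toy = pd.DataFrame({"co_sensor": [10.0, 11.0, 13.0, 14.0]}, index=idx)

# Positional shift: takes the previous row, whatever its timestamp is.
toy["lag_1_by_row"] = toy["co_sensor"].shift(1)

# Time-based shift: move the timestamps one hour forward, then align on the index.
lagged = toy[["co_sensor"]].shift(freq=pd.Timedelta(hours=1))
lagged.columns = ["lag_1_by_time"]
toy = toy.merge(lagged, left_index=True, right_index=True, how="left")

print(toy)
```

At 03:00 the positional lag silently picks up the 01:00 reading, which is two hours old, while the time-based lag is missing because no 02:00 reading exists.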

In [None]:
df_processed_0 = df.copy()

for var in variables:
    for h in [1, 2, 3, 24]:
        # Lowercase "h" is the hourly alias; uppercase "H" is deprecated in recent pandas
        tmp = df_processed_0[[var]].shift(freq=f"{h}h")
        tmp.columns = [f"{var}_lag_{h}"]
        df_processed_0 = df_processed_0.merge(
            tmp, left_index=True, right_index=True, how="left"
        )


df_processed_0.head()

In [None]:
# Sanity check for the first 3 hour lags:
df_processed_0[
    ["co_sensor", "co_sensor_lag_1", "co_sensor_lag_2", "co_sensor_lag_3"]
].head()

In [None]:
# Sanity check for the 24 hour lag:
df_processed_0[["co_sensor", "co_sensor_lag_24"]].head(26)

## Window Features
Window features aggregate a variable's values over a pre-defined time window in the past and use the result as a predictor for the current value.

Here, we will
- Use a rolling window of 5 hours
- Compute the mean, min, max, and standard deviation of our variables within this window
- Shift the result forward by one hour to serve as predictors for the next hour
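
To see what the shift does, here is a minimal sketch on a hypothetical eight-hour toy series (the values and the column name `co_sensor_win_mean` are made up, and only the mean is computed). It illustrates that, after shifting, the feature at a given hour is built only from earlier hours.

```python
import pandas as pd

# Gap-free toy series: one reading per hour, values 1..8 (hypothetical).
idx = pd.date_range("2004-04-04 00:00", periods=8, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {"co_sensor": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=idx
)

win = (
    toy[["co_sensor"]]
    .rolling(window=pd.Timedelta(hours=5))  # window covers the last 5 hours up to and including t
    .mean()
    .shift(freq=pd.Timedelta(hours=1))  # make the value computed at t available at t + 1 hour
)
win.columns = ["co_sensor_win_mean"]

toy = toy.merge(win, left_index=True, right_index=True, how="left")
print(toy)
```

The feature at 05:00 is the mean of the readings from 00:00 to 04:00, so the 05:00 reading itself never leaks into its own predictor. The first row has no predecessor and is therefore missing.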

In [None]:
tmp = (
    df_processed_0[variables]
    .rolling(window="5h")
    .agg(
        ["mean", "min", "max", "std"]
    )  # Aggregate functions over the span of the window
    .shift(freq="1h")  # Move the aggregates 1 hour forward
)

tmp.columns = tmp.columns.map("_win_".join)
tmp.head()

df_processed_1 = df_processed_0.copy().merge(
    tmp, left_index=True, right_index=True, how="left"
)
df_processed_1.head()

## Periodic Features

Time-based features are inherently periodic. For example:
- Months: 1 -> 2 -> ... -> 12 -> 1 -> ...
- Week days: 1 -> 2 -> ... -> 7 -> 1 -> ...

and so on.

While some models can handle this periodicity without any difficulty (hint: decision trees!), others, such as linear models, cannot. Additional processing can therefore be very beneficial for model performance.

We can encode periodic features using a sine and cosine transformation with the feature's period. This places the values on a circle, so that values at the two ends of the range end up next to each other. For example, December (12) is closer to January (1) than to June (6), but the plain numerical representation does not capture this relationship. Transforming the variables with sine and cosine restores it.

While this can, of course, be done with some short calculations, we'll resort to a ready-made transformer from the `feature_engine` package here.
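
For reference, the manual calculation could look roughly like the sketch below. It uses only NumPy and pandas and assumes an explicit period of 12 for the months. The helper `encoded_distance` is hypothetical and exists only to compare distances. `CyclicalFeatures` may scale differently (by default it derives its scaling constant from the data), so its output need not match these numbers exactly.

```python
import numpy as np
import pandas as pd

# Months 1..12; the period is passed explicitly (an assumption of this sketch).
months = pd.DataFrame({"month": np.arange(1, 13)})
period = 12

# Map each month to an angle on the unit circle, then take sine and cosine.
angle = 2 * np.pi * months["month"] / period
months["month_sin"] = np.sin(angle)
months["month_cos"] = np.cos(angle)


def encoded_distance(a: int, b: int) -> float:
    """Euclidean distance between months a and b in (sin, cos) space."""
    cols = ["month_sin", "month_cos"]
    point_a = months.loc[months["month"] == a, cols].to_numpy()[0]
    point_b = months.loc[months["month"] == b, cols].to_numpy()[0]
    return float(np.linalg.norm(point_a - point_b))


print("Dec-Jan:", encoded_distance(12, 1))
print("Dec-Jun:", encoded_distance(12, 6))
```

In the encoded space, December sits right next to January, while June lies on the opposite side of the circle.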

In [None]:
time_vars = ["month", "hour"]

cyclical = CyclicalFeatures(
    variables=time_vars,  # The features we want to transform.
    drop_original=False,  # Whether to drop the original features.
)

df_processed_2 = cyclical.fit_transform(df_processed_1)
df_processed_2.head()

In [None]:
_, axs = plt.subplots(1, 2, figsize=(12, 3))
_ = df_processed_2[["month_sin", "month_cos"]].plot(marker=".", ax=axs[0])
_ = df_processed_2["2005-03-15":][["hour_sin", "hour_cos"]].plot(ax=axs[1])

## Remove missing data and export
- The lag and window calculations have introduced missing values.
- There are only a few of them, so we'll simply drop the affected rows.
- Finally, we'll also remove the original "humidity" column. We want to predict the carbon monoxide concentration from past humidity values (the lag and window features), because we assume the current humidity is not known at the time of prediction.
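
As a rough illustration of where these missing values come from, the sketch below builds a hypothetical gap-free two-day series and adds a single 24-hour lag (the values are made up). The first 24 rows have no value 24 hours earlier, so they are lost when the missing values are dropped.

```python
import pandas as pd

# Hypothetical gap-free hourly series covering two days.
idx = pd.date_range("2004-04-04 00:00", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"co_sensor": [float(i) for i in range(48)]}, index=idx)

# With no gaps in the index, a positional shift of 24 rows equals a 24-hour lag.
toy["co_sensor_lag_24"] = toy["co_sensor"].shift(24)

# Count the missing values per column, then drop the incomplete rows.
missing_per_column = toy.isnull().sum()
toy_clean = toy.dropna()

print(missing_per_column)
print(len(toy), "->", len(toy_clean), "rows")
```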

In [None]:
df_processed_2.isnull().sum()

In [None]:
df_final = df_processed_2.dropna().drop("humidity", axis=1)
df_final.to_csv(DATA_DIR / "air_quality_processed.csv", index=True)

Done!