# Feature Engineering

Only the historical AQI and air pollutant data has been downloaded. Time series forecasting works by generating features from historical data, training the model on those features, and then predicting new values from them. In other words, at prediction time we need the historical data to generate the features from which we predict the future AQI. One thing to be extra careful about is leakage: we must train and predict only on features that are known at the time of prediction.

In this notebook, I go through the process of engineering the features that will be used to train an XGBoost classifier. The steps are then combined into a function that I can use in a separate notebook, where I actually create the features and train a model.

### Import Data

In [1]:
import datetime
import pandas as pd
import numpy as np
import os

zip_code = '60603'  # Chicago
country_code = 'US'
city = 'Chicago'

aqi_table_name = f'aqi_{city}_{zip_code}'.lower()

data_path = os.path.join('data', f'{aqi_table_name}.csv')

if os.path.exists(data_path):
    df = pd.read_csv(data_path, index_col='datetime', parse_dates=True)
    start_date = df.index.max()
    start_date_id = df['id'].max()

In [2]:
need_cols = ['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3', 'aqi']
df = df[need_cols]

In [3]:
df.head()

datetime,co,no,no2,o3,so2,pm2_5,pm10,nh3,aqi
2020-11-27 00:00:00,410.56,1.9,41.47,7.78,4.71,9.9,15.92,1.44,2
2020-11-27 01:00:00,377.18,0.72,35.99,12.52,4.65,8.53,13.4,0.97,1
2020-11-27 02:00:00,347.14,0.34,31.19,16.81,4.59,7.75,11.57,0.71,1
2020-11-27 03:00:00,337.12,0.27,29.47,17.88,4.53,7.76,10.98,0.64,1
2020-11-27 04:00:00,337.12,0.31,29.47,16.63,4.65,8.2,11.27,0.64,1


### Features from the Timestamp

With the data imported, we can start making some features. The first ones I want to make come straight from the timestamp:

In [4]:
df['hour'] = df.index.hour
df['dayofweek'] = df.index.dayofweek
df['quarter'] = df.index.quarter
df['month'] = df.index.month
df['year'] = df.index.year
df['dayofyear'] = df.index.dayofyear
df['dayofmonth'] = df.index.day
df['weekofyear'] = df.index.isocalendar().week.astype('int')

In [5]:
df.head()

datetime,co,no,no2,o3,so2,pm2_5,pm10,nh3,aqi,hour,dayofweek,quarter,month,year,dayofyear,dayofmonth,weekofyear
2020-11-27 00:00:00,410.56,1.9,41.47,7.78,4.71,9.9,15.92,1.44,2,0,4,4,11,2020,332,27,48
2020-11-27 01:00:00,377.18,0.72,35.99,12.52,4.65,8.53,13.4,0.97,1,1,4,4,11,2020,332,27,48
2020-11-27 02:00:00,347.14,0.34,31.19,16.81,4.59,7.75,11.57,0.71,1,2,4,4,11,2020,332,27,48
2020-11-27 03:00:00,337.12,0.27,29.47,17.88,4.53,7.76,10.98,0.64,1,3,4,4,11,2020,332,27,48
2020-11-27 04:00:00,337.12,0.31,29.47,16.63,4.65,8.2,11.27,0.64,1,4,4,4,11,2020,332,27,48
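
As a quick check on what these calendar attributes mean, here is a minimal sketch on two hand-picked timestamps (the first and last rows shown in this notebook). The names `idx` and `calendar` are only illustrative. The main things to notice are that `dayofweek` counts from Monday = 0 and that `weekofyear` is the ISO week number.

```python
import pandas as pd

# Two timestamps taken from the head and tail of the dataset shown in this notebook.
idx = pd.DatetimeIndex(['2020-11-27 00:00:00', '2023-01-23 09:00:00'])

calendar = pd.DataFrame({
    'hour': idx.hour,
    'dayofweek': idx.dayofweek,    # Monday=0 ... Sunday=6
    'quarter': idx.quarter,
    'dayofyear': idx.dayofyear,
    # isocalendar() returns a DataFrame; .week is the ISO 8601 week number
    'weekofyear': idx.isocalendar().week.astype('int').to_numpy(),
}, index=idx)

print(calendar)
```

2020-11-27 was a Friday, so `dayofweek` is 4, which matches the table above.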


### Lagging Features

Lag features are information about a previous time step of the time series. We use them because the past values of a variable are likely to be predictive of its future values. Past values of other predictive features can also be useful for our forecast. Thus, in forecasting, it is common practice to create lag features from time series data and use them as input to machine learning algorithms.

In this case, there is a lot we can lag to create features, such as the various pollutant concentrations and the AQI itself. My end goal is to predict the AQI 3 days into the future, which means the minimum lag I can use is 3 days. Later, I'll create some features with window functions, where it is again important to apply the 3-day shift in order to avoid a data leak.

In [6]:
df.columns

Index(['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3', 'aqi', 'hour',
       'dayofweek', 'quarter', 'month', 'year', 'dayofyear', 'dayofmonth',
       'weekofyear'],
      dtype='object')

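We can use the pandas `shift()` method to lag the features by a given amount. Passing `freq` shifts the index by a time offset rather than by a number of rows, and assigning the result back to the dataframe aligns it on the timestamp.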

In [7]:
features_to_lag = ['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3', 'aqi']

for feature in features_to_lag:
    # lag feature by 3 days
    new_feature_name = feature + '_lag3d'
    df[new_feature_name] = df[feature].shift(freq='3D', axis=0)
    
    # lag feature by 5 days
    new_feature_name = feature + '_lag5d'
    df[new_feature_name] = df[feature].shift(freq='5D', axis=0)

    # lag feature by 9 days
    new_feature_name = feature + '_lag9d'
    df[new_feature_name] = df[feature].shift(freq='9D', axis=0)

In [8]:
df.head()

datetime,co,no,no2,o3,so2,pm2_5,pm10,nh3,aqi,hour,...,pm2_5_lag9d,pm10_lag3d,pm10_lag5d,pm10_lag9d,nh3_lag3d,nh3_lag5d,nh3_lag9d,aqi_lag3d,aqi_lag5d,aqi_lag9d
2020-11-27 00:00:00,410.56,1.9,41.47,7.78,4.71,9.9,15.92,1.44,2,0,...,,,,,,,,,,
2020-11-27 01:00:00,377.18,0.72,35.99,12.52,4.65,8.53,13.4,0.97,1,1,...,,,,,,,,,,
2020-11-27 02:00:00,347.14,0.34,31.19,16.81,4.59,7.75,11.57,0.71,1,2,...,,,,,,,,,,
2020-11-27 03:00:00,337.12,0.27,29.47,17.88,4.53,7.76,10.98,0.64,1,3,...,,,,,,,,,,
2020-11-27 04:00:00,337.12,0.31,29.47,16.63,4.65,8.2,11.27,0.64,1,4,...,,,,,,,,,,


In [9]:
df.columns

Index(['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3', 'aqi', 'hour',
       'dayofweek', 'quarter', 'month', 'year', 'dayofyear', 'dayofmonth',
       'weekofyear', 'co_lag3d', 'co_lag5d', 'co_lag9d', 'no_lag3d',
       'no_lag5d', 'no_lag9d', 'no2_lag3d', 'no2_lag5d', 'no2_lag9d',
       'o3_lag3d', 'o3_lag5d', 'o3_lag9d', 'so2_lag3d', 'so2_lag5d',
       'so2_lag9d', 'pm2_5_lag3d', 'pm2_5_lag5d', 'pm2_5_lag9d', 'pm10_lag3d',
       'pm10_lag5d', 'pm10_lag9d', 'nh3_lag3d', 'nh3_lag5d', 'nh3_lag9d',
       'aqi_lag3d', 'aqi_lag5d', 'aqi_lag9d'],
      dtype='object')
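
To see what `shift(freq='3D')` does on a small scale, here is a minimal sketch on a toy hourly series. The data and the names `toy`, `shifted`, and `t` are made up for illustration. Each value is just its own hour offset, so it is easy to read off which past observation ends up on which row.

```python
import pandas as pd

# Toy hourly series: 5 days of values 0, 1, 2, ... (value == hours since the start).
idx = pd.date_range('2023-01-01', periods=5 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'value': range(len(idx))}, index=idx)

# shift(freq='3D') moves the index forward by 3 days and leaves the values untouched.
shifted = toy['value'].shift(freq='3D')

# Assigning back aligns on the index, so each row receives the value from 3 days earlier.
# Shifted timestamps that fall outside the original index are dropped.
toy['value_lag3d'] = shifted

t = pd.Timestamp('2023-01-04 06:00')
print(toy.loc[t])  # value is 78 (3 * 24 + 6); value_lag3d is 6 (2023-01-01 06:00)
```

Because the alignment is done on timestamps and not on row positions, the lag stays a true 3 days even if some hours are missing from the data.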

Notice that there are now a lot of `NaN`s at the beginning of the dataframe. That is because there is no earlier data from which to generate the lag features for those rows. XGBoost handles missing values automatically, so they can stay. Looking at the tail of the data confirms that the lag features are populated:

In [10]:
df.tail()

datetime,co,no,no2,o3,so2,pm2_5,pm10,nh3,aqi,hour,...,pm2_5_lag9d,pm10_lag3d,pm10_lag5d,pm10_lag9d,nh3_lag3d,nh3_lag5d,nh3_lag9d,aqi_lag3d,aqi_lag5d,aqi_lag9d
2023-01-23 05:00:00,283.72,0.01,11.48,59.37,3.79,5.51,6.88,1.54,1,5,...,5.04,1.22,10.39,7.63,0.63,0.79,1.99,1.0,1.0,1.0
2023-01-23 06:00:00,287.06,0.01,13.2,57.22,4.11,5.56,6.94,1.62,1,6,...,7.4,1.14,11.34,10.88,0.62,0.89,2.6,1.0,1.0,1.0
2023-01-23 07:00:00,303.75,0.02,17.65,50.07,4.83,7.03,8.73,1.95,1,7,...,10.59,1.13,13.31,15.28,0.61,1.39,3.42,1.0,2.0,2.0
2023-01-23 08:00:00,360.49,0.2,31.53,34.33,6.08,11.0,13.84,2.85,2,8,...,18.07,1.55,19.95,25.76,0.72,2.94,5.57,1.0,2.0,3.0
2023-01-23 09:00:00,420.57,1.84,45.93,17.7,8.35,15.28,19.42,3.74,2,9,...,25.91,1.85,28.42,36.3,0.81,3.89,7.98,1.0,2.0,3.0
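
The number of missing rows at the start follows directly from the lag. As a rough sketch on made-up, gap-free hourly data (the frame `toy` is hypothetical and not the real dataset), a lag of `d` days leaves the first `24 * d` rows empty:

```python
import pandas as pd

# Ten days of gap-free hourly data with a constant placeholder AQI.
idx = pd.date_range('2023-01-01', periods=10 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'aqi': 1}, index=idx)

# Same lags as in the notebook: 3, 5 and 9 days.
for days in (3, 5, 9):
    toy[f'aqi_lag{days}d'] = toy['aqi'].shift(freq=f'{days}D')

# Count the missing values per column.
missing = toy.isna().sum()
print(missing)  # aqi: 0, aqi_lag3d: 72, aqi_lag5d: 120, aqi_lag9d: 216
```

With the 9-day lag, the first 216 hourly rows carry no information in that column, which is worth keeping in mind when the history is short.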


### Rolling Window Features

Window features are the result of window operations over the variables. Here I calculate the rolling maximum, mean, and standard deviation of the AQI over a window of 12 rows (12 hours, since the data is hourly), and then lag the results by 3 days. I think the standard deviation is particularly important because it helps capture the volatility in the AQI data.

In [11]:
window = 12  # rows, i.e. 12 hours for hourly data
# compute each rolling statistic, then lag it by 3 days to avoid leakage
df['aqi_max_lag_3d'] = df['aqi'].rolling(window=window).max().shift(freq='3D')
df['aqi_mean_lag_3d'] = df['aqi'].rolling(window=window).mean().shift(freq='3D')
df['aqi_std_lag_3d'] = df['aqi'].rolling(window=window).std().shift(freq='3D')

In [12]:
df.tail()

datetime,co,no,no2,o3,so2,pm2_5,pm10,nh3,aqi,hour,...,pm10_lag9d,nh3_lag3d,nh3_lag5d,nh3_lag9d,aqi_lag3d,aqi_lag5d,aqi_lag9d,aqi_max_lag_3d,aqi_mean_lag_3d,aqi_std_lag_3d
2023-01-23 05:00:00,283.72,0.01,11.48,59.37,3.79,5.51,6.88,1.54,1,5,...,7.63,0.63,0.79,1.99,1.0,1.0,1.0,1.0,1.0,0.0
2023-01-23 06:00:00,287.06,0.01,13.2,57.22,4.11,5.56,6.94,1.62,1,6,...,10.88,0.62,0.89,2.6,1.0,1.0,1.0,1.0,1.0,0.0
2023-01-23 07:00:00,303.75,0.02,17.65,50.07,4.83,7.03,8.73,1.95,1,7,...,15.28,0.61,1.39,3.42,1.0,2.0,2.0,1.0,1.0,0.0
2023-01-23 08:00:00,360.49,0.2,31.53,34.33,6.08,11.0,13.84,2.85,2,8,...,25.76,0.72,2.94,5.57,1.0,2.0,3.0,1.0,1.0,0.0
2023-01-23 09:00:00,420.57,1.84,45.93,17.7,8.35,15.28,19.42,3.74,2,9,...,36.3,0.81,3.89,7.98,1.0,2.0,3.0,1.0,1.0,0.0
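
To make the "roll first, then lag" order concrete, here is a small sketch on a toy hourly series whose values are just the hour offset. The names `toy`, `rolled_max`, and `rolled_time` are illustrative. It also shows a time-based window as one possible alternative: an integer window counts rows, so it only means "12 hours" when no hours are missing, while a `Timedelta` window is defined on the timestamps themselves.

```python
import pandas as pd

# Toy hourly series: value == hours since the start, so the max of a window is its newest value.
idx = pd.date_range('2023-01-01', periods=5 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'aqi': range(len(idx))}, index=idx)

window = 12  # rows; equals 12 hours only for gap-free hourly data

# Step 1: rolling statistic over the last 12 rows. Step 2: push it 3 days forward.
rolled_max = toy['aqi'].rolling(window=window).max()
toy['aqi_max_lag_3d'] = rolled_max.shift(freq='3D')

t = pd.Timestamp('2023-01-04 12:00')
# The window behind this feature ends at t - 3 days (2023-01-01 12:00, value 12),
# so the newest observation it uses is 72 hours older than the row itself (value 84).
print(toy.loc[t])

# Alternative (an assumption, not what the notebook uses): a time-based 12-hour window.
# It defaults to min_periods=1, so only rows with a full window are compared here.
rolled_time = toy['aqi'].rolling(pd.Timedelta(hours=12)).max()
print((rolled_time.iloc[window - 1:] == rolled_max.iloc[window - 1:]).all())
```

On complete hourly data the two window definitions agree once a full window is available. They differ only at the very start of the series and wherever hours are missing.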


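Now that I have all the features I want, I can drop the raw pollutant columns. Their values at the forecast timestamp are not known at prediction time, so they cannot be used as features. The `aqi` column stays because it is the value we want to predict: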

In [13]:
df = df.drop(columns=['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3'])
df.tail()

datetime,aqi,hour,dayofweek,quarter,month,year,dayofyear,dayofmonth,weekofyear,co_lag3d,...,pm10_lag9d,nh3_lag3d,nh3_lag5d,nh3_lag9d,aqi_lag3d,aqi_lag5d,aqi_lag9d,aqi_max_lag_3d,aqi_mean_lag_3d,aqi_std_lag_3d
2023-01-23 05:00:00,1,5,0,1,1,2023,23,23,4,243.66,...,7.63,0.63,0.79,1.99,1.0,1.0,1.0,1.0,1.0,0.0
2023-01-23 06:00:00,1,6,0,1,1,2023,23,23,4,243.66,...,10.88,0.62,0.89,2.6,1.0,1.0,1.0,1.0,1.0,0.0
2023-01-23 07:00:00,1,7,0,1,1,2023,23,23,4,247.0,...,15.28,0.61,1.39,3.42,1.0,2.0,2.0,1.0,1.0,0.0
2023-01-23 08:00:00,2,8,0,1,1,2023,23,23,4,253.68,...,25.76,0.72,2.94,5.57,1.0,2.0,3.0,1.0,1.0,0.0
2023-01-23 09:00:00,2,9,0,1,1,2023,23,23,4,257.02,...,36.3,0.81,3.89,7.98,1.0,2.0,3.0,1.0,1.0,0.0


Now let's combine everything into a single function:

In [14]:
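def createFeatures(data: pd.DataFrame) -> pd.DataFrame:
    """Add timestamp, lag, and rolling-window features and drop the raw pollutant columns.

    Expects a DatetimeIndex and the columns listed in `features_to_lag`.
    """
    df = data.copy()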
    # add date features
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    df['weekofyear'] = df.index.isocalendar().week.astype('int')
    
    # add lag features
    features_to_lag = ['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3', 'aqi']

    for feature in features_to_lag:
        # lag feature by 3 days
        new_feature_name = feature + '_lag3d'
        df[new_feature_name] = df[feature].shift(freq='3D', axis=0)

        # lag feature by 5 days
        new_feature_name = feature + '_lag5d'
        df[new_feature_name] = df[feature].shift(freq='5D', axis=0)

        # lag feature by 9 days
        new_feature_name = feature + '_lag9d'
        df[new_feature_name] = df[feature].shift(freq='9D', axis=0)
        
        
    # add rolling window features, lagged by 3 days
    window = 12  # rows, i.e. 12 hours for hourly data
    df['aqi_max_lag_3d'] = df['aqi'].rolling(window=window).max().shift(freq='3D')
    df['aqi_mean_lag_3d'] = df['aqi'].rolling(window=window).mean().shift(freq='3D')
    df['aqi_std_lag_3d'] = df['aqi'].rolling(window=window).std().shift(freq='3D')
    
    # drop the historical features
    df = df.drop(columns=['co', 'no', 'no2', 'o3', 'so2', 'pm2_5', 'pm10', 'nh3'])
    
    return df
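
Since leakage is the main risk in this notebook, a simple sanity check can help. The sketch below is one possible approach and not part of the original pipeline: it overwrites every raw value that would not yet be known 3 days before the last timestamp and confirms that the features for that timestamp do not change. Both `lag_features` (a stand-in for a real feature function) and `uses_only_old_data` are hypothetical helpers, and the data is random.

```python
import numpy as np
import pandas as pd


def lag_features(data: pd.DataFrame, days: int = 3) -> pd.DataFrame:
    """Hypothetical stand-in for a feature function: a single lagged AQI column."""
    out = pd.DataFrame(index=data.index)
    out[f'aqi_lag{days}d'] = data['aqi'].shift(freq=f'{days}D')
    return out


def uses_only_old_data(feature_fn, data: pd.DataFrame, horizon: str = '3D') -> bool:
    """Return True if the features of the last row ignore raw data newer than `horizon`."""
    t = data.index.max()
    cutoff = t - pd.Timedelta(horizon)

    # Corrupt every raw value that would not be known `horizon` before t.
    tampered = data.astype(float)
    tampered.loc[tampered.index > cutoff, :] = -999.0

    # If the features at t are identical, they cannot depend on the corrupted rows.
    before = feature_fn(data).loc[t]
    after = feature_fn(tampered).loc[t]
    return bool(before.equals(after))


# Made-up hourly AQI categories (1 to 5) over ten days.
rng = np.random.default_rng(0)
idx = pd.date_range('2023-01-01', periods=10 * 24, freq=pd.Timedelta(hours=1))
raw = pd.DataFrame({'aqi': rng.integers(1, 6, size=len(idx))}, index=idx)

print(uses_only_old_data(lag_features, raw))                           # True: 3-day lag is safe
print(uses_only_old_data(lambda d: lag_features(d, days=1), raw))      # False: 1-day lag leaks
```

To run the same check against `createFeatures`, the input would need all the pollutant columns, and the returned `aqi` target column would have to be dropped before comparing, since it is the current value by design.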