# Basic Feature Engineering

We will look at three classes of features that we can create from our time series dataset:

- **Date Time Features**: These are components of the time step itself for each observation.
- **Lag Features**: These are values at prior time steps.
- **Window Features**: These are a summary of values over a fixed window of prior time steps.

### The goal of feature engineering
- Provide strong relationships between new input features and the output feature.
- We do not know the underlying inherent functional relationship between inputs and outputs (if we did, we wouldn't need to use machine learning).
- The best default strategy is to use all available knowledge to create many views of the original dataset and let model performance determine which features and views are good.

In [1]:
import pandas as pd

# read_csv no longer accepts squeeze=True (removed in pandas 2.0); squeeze the result instead
series = pd.read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
                  parse_dates=True).squeeze('columns')
df = pd.DataFrame()
df['month'] = [ts.month for ts in series.index]
df['day'] = [ts.day for ts in series.index]
# take values by position; series[i] with an integer on a DatetimeIndex is deprecated
df['temperature'] = series.to_numpy()
df.head()

Unnamed: 0,month,day,temperature
0,1,1,20.7
1,1,2,17.9
2,1,3,18.8
3,1,4,14.6
4,1,5,15.8


In [2]:
df.shape

(3650, 3)

In [3]:
df.tail()

Unnamed: 0,month,day,temperature
3645,12,27,14.0
3646,12,28,13.6
3647,12,29,13.5
3648,12,30,15.7
3649,12,31,13.0


Using only the month and day to predict temperature is not sophisticated and will likely result in a poor model.

Additional features that can be created include:
- Minutes elapsed for the day
- Hour of day
- Business hours or not
- Weekend or not
- Season of the year
- Business quarter of the year
- Daylight saving time or not
- Public holiday or not
- Leap year or not

From these examples, you can likely make predictions on which of these are not going to be valuable, but in a situation where you have limited domain expertise, you can always take a more exhaustive approach to your feature engineering.
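
As a minimal sketch of how a few of the features listed above could be derived, the block below reads them straight off a pandas `DatetimeIndex`. The ten values and the 1981-01-01 start date are stand-ins chosen for illustration (assumptions, not a load of the real file), and the column names are hypothetical.

```python
import pandas as pd

# Stand-in data (assumption): ten daily observations starting 1981-01-01.
index = pd.date_range("1981-01-01", periods=10, freq="D")
series = pd.Series(
    [20.7, 17.9, 18.8, 14.6, 15.8, 15.8, 15.8, 17.4, 21.8, 20.0], index=index
)

features = pd.DataFrame({
    "month": index.month,
    "day": index.day,
    "quarter": index.quarter,            # business quarter of the year
    "is_weekend": index.dayofweek >= 5,  # Monday=0 ... Sunday=6
    "is_leap_year": index.is_leap_year,
    "temperature": series.to_numpy(),
})
print(features.head())
```

Features such as public holidays or business hours need an external calendar or sub-daily timestamps, so they are left out of this sketch.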

Date-time based features are a good start, but it is often more useful to include values at previous time steps. These are called lagged values.

### Lag Values

The pandas library provides the `shift()` function to help create shifted or lag features from a time series dataset.

In [4]:
temps = pd.DataFrame(series.values)
temps.head()

Unnamed: 0,0
0,20.7
1,17.9
2,18.8
3,14.6
4,15.8


In [5]:
df = pd.concat([temps.shift(1), temps], axis=1)
df.columns = ['t', 't+1']
df.head()

Unnamed: 0,t,t+1
0,,20.7
1,20.7,17.9
2,17.9,18.8
3,18.8,14.6
4,14.6,15.8


Here, `t+1` is the original series, still aligned with its original index values. `t` is the same series shifted down one index position, which places a null value in the first row. We would need to discard this first row to train a supervised learning model.

In [6]:
df.dropna()

Unnamed: 0,t,t+1
1,20.7,17.9
2,17.9,18.8
3,18.8,14.6
4,14.6,15.8
5,15.8,15.8
...,...,...
3645,14.6,14.0
3646,14.0,13.6
3647,13.6,13.5
3648,13.5,15.7


We could further create additional lag features based on longer time periods:

In [7]:
df = pd.concat([temps.shift(3), temps.shift(2), df], axis=1)
df.columns=['t-2', 't-1', 't', 't+1']
df.head()

Unnamed: 0,t-2,t-1,t,t+1
0,,,,20.7
1,,,20.7,17.9
2,,20.7,17.9,18.8
3,20.7,17.9,18.8,14.6
4,17.9,18.8,14.6,15.8


As our lag increases, we will need to drop additional rows of data to maintain integrity for a supervised learning process. Identifying the ideal lag is difficult, but a good starting point is a sensitivity analysis: try a suite of different lag widths and see which results in better-performing models.
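
One possible shape for such a sensitivity analysis is sketched below. Everything here is an assumption made for illustration: the series is synthetic (a yearly cycle plus noise), ordinary least squares stands in for whatever model you actually use, and `lag_frame` and `holdout_rmse` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in series (assumption): yearly cycle plus noise, two years of days.
rng = np.random.default_rng(0)
days = np.arange(730)
values = 11 + 4 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1.5, size=days.size)
temps = pd.Series(values)

def lag_frame(series, n_lags):
    """Build columns lag_1..lag_n plus the target, dropping rows that contain NaN."""
    cols = {f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}
    cols["target"] = series
    return pd.DataFrame(cols).dropna()

def holdout_rmse(frame, n_test=100):
    """Fit least squares on the early rows and score on the last n_test rows."""
    X = np.column_stack([np.ones(len(frame)), frame.drop(columns="target").to_numpy()])
    y = frame["target"].to_numpy()
    coef, *_ = np.linalg.lstsq(X[:-n_test], y[:-n_test], rcond=None)
    residuals = y[-n_test:] - X[-n_test:] @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))

# Compare several lag widths on the same held-out tail of the series.
scores = {n: holdout_rmse(lag_frame(temps, n)) for n in (1, 2, 3, 7, 14)}
for n, rmse in scores.items():
    print(f"{n:>2} lags: RMSE = {rmse:.3f}")
```

Because `dropna()` only removes leading rows, every lag width is scored on the same final 100 observations, which keeps the comparison fair.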

Domain expertise can help here: a contiguous run of the most recent lags may be less useful than a handful of lags at time lengths that are likely to have a stronger relationship with the target (last week, last month, last year, etc.).
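
A short sketch of that idea: instead of consecutive lags, pass an explicit list of lags to `shift()`. The helper name `add_lags`, the choice of 1, 7, and 365 days, and the placeholder series are all assumptions for illustration.

```python
import pandas as pd

def add_lags(series, lags):
    """Return a frame with one column per requested lag plus the target."""
    cols = {f"lag_{k}": series.shift(k) for k in lags}
    cols["target"] = series
    return pd.DataFrame(cols)

# Placeholder daily series (assumption): the values 0..399 stand in for temperatures.
index = pd.date_range("1981-01-01", periods=400, freq="D")
series = pd.Series(range(400), index=index, dtype="float64")

# Yesterday, the same day last week, and the same day last year.
frame = add_lags(series, lags=[1, 7, 365]).dropna()
print(frame.head())
```

Note the cost of long lags: a 365-day lag leaves only 35 usable rows out of 400 here, so the longest lag sets how much history is lost.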

### Rolling Window Statistics

We can calculate summary statistics across the values in the sliding window and include these as features in our dataset.

Pandas provides a `rolling()` function that groups the values in a sliding window at each time step; summary statistics such as the mean can then be calculated on each window. First, the series must be shifted so that each window contains only prior observations, then the rolling dataset can be created and the mean values calculated on each window.

In [9]:
series = pd.read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = pd.DataFrame(series.values)
shifted = temps.shift(1)
window = shifted.rolling(window=2)
means = window.mean()
df = pd.concat([means, temps], axis=1)
df.columns = ['mean(t-1, t)', 't+1']
df.head()

Unnamed: 0,"mean(t-1, t)",t+1
0,,20.7
1,,17.9
2,19.3,18.8
3,18.35,14.6
4,16.7,15.8


In this case, two `NaN` values were created because there must be at least two prior time periods before a mean can be calculated. By the third row, we have an input value of **19.3** that can be used to predict the output of **18.8**.

We can control the size of the window, which defines the number of time periods that are used to generate summary statistics. In the following example, we can specify the window width as a named variable and pass it to the `rolling()` function.

In [20]:
series = pd.read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = pd.DataFrame(series.values)
width = 3
original = temps.shift(1)
shifted = temps.shift(width - 1)
window = shifted.rolling(window=width)
df = pd.concat([window.min(), window.mean(), window.max(), shifted, original, temps], axis=1)
df.columns = ['min', 'mean', 'max', 't-1', 't','t+1']
df.head(10)

Unnamed: 0,min,mean,max,t-1,t,t+1
0,,,,,,20.7
1,,,,,20.7,17.9
2,,,,20.7,17.9,18.8
3,,,,17.9,18.8,14.6
4,17.9,19.133333,20.7,18.8,14.6,15.8
5,14.6,17.1,18.8,14.6,15.8,15.8
6,14.6,16.4,18.8,15.8,15.8,15.8
7,14.6,15.4,15.8,15.8,15.8,17.4
8,15.8,15.8,15.8,15.8,17.4,21.8
9,15.8,16.333333,17.4,17.4,21.8,20.0


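In the above dataset, `min`, `mean`, and `max` are each calculated over three consecutive observations. Because the code shifts by `width - 1` before rolling, the window ends at `t-1` rather than `t`: on row 4 it covers the `t` values from rows 1, 2, and 3 (20.7, 17.9, and 18.8 respectively), and the most recent observation (`t` = 14.6) is not included.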

In this case, we can use machine learning to determine the relationship between these input variables and the target variable `t+1`, allowing us to forecast into unknown time periods.
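
If you would rather have the window end at the most recent known observation `t`, one possible variant is to shift by a single step before rolling. This is a sketch of that alternative, not the approach used in the cell above; it reuses the first ten temperature values shown in the output.

```python
import pandas as pd

# First ten temperatures from the output above.
temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8, 15.8, 15.8, 17.4, 21.8, 20.0])

width = 3
prior = temps.shift(1)                # value at time t for a target at t+1
window = prior.rolling(window=width)  # covers t-2, t-1, and t

df = pd.DataFrame({
    "min": window.min(),
    "mean": window.mean(),
    "max": window.max(),
    "t": prior,
    "t+1": temps,
})
print(df.head(6))
```

With this variant only the first three rows contain `NaN` statistics, and on row 4 the window covers 17.9, 18.8, and 14.6, so the latest observation contributes to the summary.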

### Expanding Window Statistics

Another type of window that may be useful includes **all** previous data in the series. This is called an *expanding window* and can help with keeping track of the bounds of observable data. Like the `rolling()` function, Pandas provides an `expanding()` function that collects sets of all prior values for each time step. 

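The expanding statistics are defined from the very first row, because a window holding a single observation is enough to compute them, so no leading rows need to be dropped. The only `NaN` comes from the target: `shift(-1)` leaves the final row without a `t+1` value, and that row must be discarded before training.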

In [22]:
series = pd.read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
                    parse_dates=True).squeeze('columns')
temps = pd.DataFrame(series.values)
window = temps.expanding()
df = pd.concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
df.columns = ['min', 'mean', 'max', 't+1']
df.head()

Unnamed: 0,min,mean,max,t+1
0,20.7,20.7,20.7,17.9
1,17.9,19.3,20.7,18.8
2,17.9,19.133333,20.7,14.6
3,14.6,18.0,20.7,15.8
4,14.6,17.56,20.7,15.8


We can include `t` in our dataframe with a slight modification of our code:

In [23]:
series = pd.read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
                    parse_dates=True).squeeze('columns')
temps = pd.DataFrame(series.values)
window = temps.expanding()
df = pd.concat([window.min(), window.mean(), window.max(), temps, temps.shift(-1)], axis=1)
df.columns = ['min', 'mean', 'max', 't', 't+1']
df.head()

Unnamed: 0,min,mean,max,t,t+1
0,20.7,20.7,20.7,20.7,17.9
1,17.9,19.3,20.7,17.9,18.8
2,17.9,19.133333,20.7,18.8,14.6
3,14.6,18.0,20.7,14.6,15.8
4,14.6,17.56,20.7,15.8,15.8
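
The same construction can be checked on a small scale. The sketch below uses only the first five temperatures from the output above (so its last row is not the last row of the real dataset) and shows where the single missing value ends up before the frame is used for training.

```python
import pandas as pd

# First five temperatures from the output above.
temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

window = temps.expanding()  # each window holds every value up to and including t
df = pd.DataFrame({
    "min": window.min(),
    "mean": window.mean(),
    "max": window.max(),
    "t": temps,
    "t+1": temps.shift(-1),  # next value; NaN on the final row
})

# The statistics are complete; only the final row lacks a target.
train = df.dropna()
print(train)
```

Dropping the final row is the only cleanup needed here, in contrast to the rolling window, which loses rows at the start of the series.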


### Summary

- We identified the types of date-time features that can be extracted from datetime objects.
- We used the `shift()` function to create lag-based features.
- We developed sliding and expanding window summary statistic features using the `rolling()` and `expanding()` functions.