In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

## **Basic Feature Engineering**

We will look at three classes of features that we can create from our
time series dataset:

- **Date Time Features:** these are components of the time step itself for each observation.
- **Lag Features:** these are values at prior time steps.
- **Window Features:** these are a summary of values over a fixed window of prior time steps.

*Date Time Features*

In [5]:
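# `squeeze=True` was removed from read_csv in pandas 2.0; squeeze the one-column frame instead
series = pd.read_csv("daily-minimum-temperatures.csv", header=0, index_col=0,
                     parse_dates=True).squeeze("columns")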
series.head()

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
Name: Temp, dtype: float64

In [6]:
# An empty DataFrame to hold the engineered features
dataframe = pd.DataFrame()

In [7]:
# Creating a year column from the parsed dates  --> optional
# dataframe["year"] = series.index.year

# Creating a month column from the parsed dates
dataframe["month"] = series.index.month
# Creating a day column from the parsed dates
dataframe["day"] = series.index.day
# Copying the temperatures by position (series[i] is label-based on a DatetimeIndex)
dataframe["temperature"] = series.values

dataframe.head()

# print(type(dataframe))
# <class 'pandas.core.frame.DataFrame'>

Unnamed: 0,month,day,temperature
0,1,1,20.7
1,1,2,17.9
2,1,3,18.8
3,1,4,14.6
4,1,5,15.8
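Month and day are not the only components available. As a minimal sketch, the cell below derives a few more calendar features from a `DatetimeIndex`. The five-row `sample` series is a hypothetical stand-in built from the first rows shown above, and the choice of `dayofweek`, `dayofyear`, and `quarter` is only an illustration, not a recommendation for this dataset.

```python
import pandas as pd

# Hypothetical five-day sample mirroring the first rows of the dataset.
sample = pd.Series(
    [20.7, 17.9, 18.8, 14.6, 15.8],
    index=pd.date_range("1981-01-01", periods=5, freq="D"),
    name="Temp",
)

# DatetimeIndex exposes calendar components as vectorized attributes.
features = pd.DataFrame({
    "dayofweek": sample.index.dayofweek,  # Monday=0 ... Sunday=6
    "dayofyear": sample.index.dayofyear,
    "quarter": sample.index.quarter,
    "temperature": sample.values,
})
print(features)
```

Which components help depends on the problem; daily temperatures are more likely to vary with the time of year than with the day of the week.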


*Lag Features*

In [8]:
temps = pd.DataFrame(series.values)
temps.head()

Unnamed: 0,0
0,20.7
1,17.9
2,18.8
3,14.6
4,15.8


In [9]:
temps.shift(1).head()

Unnamed: 0,0
0,
1,20.7
2,17.9
3,18.8
4,14.6


In [18]:
# Try to play around with the shift value
data_concat = [temps.shift(1),temps]
dataframe = pd.concat(data_concat, axis=1)

# Renaming the columns from 0,0 -->  't', 't+1'
dataframe.columns = ['t', 't+1']
dataframe.head(5)

Unnamed: 0,t,t+1
0,,20.7
1,20.7,17.9
2,17.9,18.8
3,18.8,14.6
4,14.6,15.8


- The addition of lag features is called the sliding window method, in this case with a window width of 1.
- We can expand the window width and include more lagged features.

In [11]:
# Try to play around with the shift value
data_concat = [temps.shift(3),temps.shift(2),temps.shift(1),temps]
dataframe = pd.concat(data_concat, axis=1)

# Renaming the columns from 0,0,0,0 -->  't-2', 't-1', 't', 't+1'
dataframe.columns = ['t-2','t-1','t', 't+1']
dataframe.head(5)

Unnamed: 0,t-2,t-1,t,t+1
0,,,,20.7
1,,,20.7,17.9
2,,20.7,17.9,18.8
3,20.7,17.9,18.8,14.6
4,17.9,18.8,14.6,15.8


*As the output shows, the first few rows lack enough prior data and must be discarded before training a supervised model. A difficulty with the sliding window approach is deciding how wide to make the window for your problem. A good starting point is a sensitivity analysis: try a suite of window widths, build a view of the dataset for each, and see which yields the best-performing models. There will be a point of diminishing returns.*
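One way to set up such a comparison is to wrap the lag construction in a function of the window width. The sketch below is an assumption about how that might look: `make_lag_frame` is a hypothetical helper, not part of pandas, and `sample` holds only the first five temperatures shown above.

```python
import pandas as pd

def make_lag_frame(values, width):
    """Return `width` lagged columns plus the target column 't+1'."""
    temps = pd.DataFrame(list(values))
    lags = range(width, 0, -1)  # oldest lag first, so columns read in time order
    frame = pd.concat([temps.shift(lag) for lag in lags] + [temps], axis=1)
    frame.columns = ["t" if lag == 1 else f"t-{lag - 1}" for lag in lags] + ["t+1"]
    # Rows without a full window contain NaN and cannot be used for training.
    return frame.dropna().reset_index(drop=True)

sample = [20.7, 17.9, 18.8, 14.6, 15.8]
for width in (1, 2, 3):
    frame = make_lag_frame(sample, width)
    print(width, frame.shape)
```

Each extra lag costs one usable row, which is part of the trade-off when widening the window.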

*Rolling Window Statistics*

- Pandas provides a rolling() function that collects a window of values at each time step. We can then apply statistical functions to each window, such as calculating the mean.
- First, the series must be shifted by one step so that each window contains only prior values and not the value being predicted. Then the rolling dataset can be created and the mean calculated over each window of two values.

In [16]:
shifted = temps.shift(1)  # shift by one step so each window excludes the value being predicted
# shifted  # Dataframe output

window = shifted.rolling(window=2) # each window holds the two observations before t+1
# window # Rolling [window=2,center=False,axis=0]

means = window.mean()
# means.head() # Dataframe output

dataframe = pd.concat([means,temps], axis=1)
dataframe.columns = ['mean(t-1,t)', 't+1']
dataframe.head(5)

Unnamed: 0,"mean(t-1,t)",t+1
0,,20.7
1,,17.9
2,19.3,18.8
3,18.35,14.6
4,16.7,15.8


Running the example prints the first 5 rows of the new dataset. The first two rows are not usable:
- The first NaN was created by the shift of the series.
- The second NaN appears because the window for that row still contains the NaN from the shift, and by default rolling() requires a full window of non-missing values.
- The third row shows the expected value of 19.30 (the mean of 20.7 and 17.9), used to predict the third value in the series, 18.8.
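The unusable rows can be dropped before modelling. The following sketch reproduces the calculation on a hypothetical five-value `sample` (the first temperatures shown above) and keeps only the complete rows.

```python
import pandas as pd

sample = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

shifted = sample.shift(1)                 # row 0 becomes NaN
means = shifted.rolling(window=2).mean()  # row 1's window still contains that NaN

frame = pd.concat([means, sample], axis=1)
frame.columns = ["mean(t-1,t)", "t+1"]

usable = frame.dropna()  # keep only rows with a complete window
print(usable)
```

The original row labels are kept after `dropna()`, so the first usable row is still labelled 2.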

In [19]:
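width = 3
# Note: shift(width - 1) lags by 2 steps, so the window in row i covers rows i-4..i-2
# and leaves out the most recent observation (row i-1); shift(1) would include it.
shifted = temps.shift(width-1)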
window = shifted.rolling(window=width)
dataframe = pd.concat([window.min(),window.mean(), window.max(), temps],axis=1)
dataframe.columns = ["min","mean","max","t+1"]
dataframe.head(10)
# Listing 5.17: Example of rolling stats features on the Minimum Daily Temperatures dataset.

Unnamed: 0,min,mean,max,t+1
0,,,,20.7
1,,,,17.9
2,,,,18.8
3,,,,14.6
4,17.9,19.133333,20.7,15.8
5,14.6,17.1,18.8,15.8
6,14.6,16.4,18.8,15.8
7,14.6,15.4,15.8,17.4
8,15.8,15.8,15.8,21.8
9,15.8,16.333333,17.4,20.0


*Expanding Window Statistics*

skipped