# Feature Engineering

Feature engineering means modifying, deleting, or combining the existing raw features in our data to create new features.

## Date Time Features

In [1]:
import pandas as pd
import numpy as np

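# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences.
# parse_dates=[0] converts the first column to datetime64, which the .dt accessor needs later.
df2 = pd.read_csv(r'D:\courses\Time_Series\Forecasting_Resources\DataandCode\daily-total-female-births-CA.csv',
                  header=0, parse_dates=[0])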

In [2]:
df2.head()

Unnamed: 0,date,births
0,1959-01-01,35
1,1959-01-02,32
2,1959-01-03,30
3,1959-01-04,31
4,1959-01-05,44


In [3]:
feature = df2.copy()

In [4]:
# Create a column for the year

feature['year'] = df2['date'].dt.year

In [5]:
# Create a column for the month

feature['month'] = df2['date'].dt.month

In [6]:
# Create a column for the day of the month

feature['day'] = df2['date'].dt.day

In [7]:
feature.head()

Unnamed: 0,date,births,year,month,day
0,1959-01-01,35,1959,1,1
1,1959-01-02,32,1959,1,2
2,1959-01-03,30,1959,1,3
3,1959-01-04,31,1959,1,4
4,1959-01-05,44,1959,1,5
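
The `.dt` accessor exposes other calendar fields in the same way. The following is a minimal sketch, assuming a small hand-typed frame (the first three rows of the output above) in place of the CSV file. The `demo` frame and the `dayofweek` and `quarter` columns are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# First three rows of the births data shown above, typed in by hand
# so the example does not depend on the CSV file.
demo = pd.DataFrame({
    "date": pd.to_datetime(["1959-01-01", "1959-01-02", "1959-01-03"]),
    "births": [35, 32, 30],
})

# The .dt accessor only works on datetime64 columns.
demo["year"] = demo["date"].dt.year
demo["month"] = demo["date"].dt.month
demo["day"] = demo["date"].dt.day

# Illustrative extras (assumed useful, not in the original notebook):
demo["dayofweek"] = demo["date"].dt.dayofweek  # Monday=0 ... Sunday=6
demo["quarter"] = demo["date"].dt.quarter      # 1 to 4

print(demo)
```

Whether extra calendar fields like these help depends on the data; they are shown only to illustrate the accessor.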


## Lag Features

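Create a lag column that contains the number of births from an earlier period, for example the previous day. The shift method moves the values down by the given number of periods (rows):

 .shift(periods)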

In [8]:
feature['lag1'] = df2['births'].shift(1)

In [9]:
feature['lag2'] = df2['births'].shift(365)    # 365 rows back: roughly the same day last year (the first 365 rows will be NaN)

In [10]:
feature.head(5)

Unnamed: 0,date,births,year,month,day,lag1,lag2
0,1959-01-01,35,1959,1,1,,
1,1959-01-02,32,1959,1,2,35.0,
2,1959-01-03,30,1959,1,3,32.0,
3,1959-01-04,31,1959,1,4,30.0,
4,1959-01-05,44,1959,1,5,31.0,
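
To see what a shift does in isolation, here is a small sketch that uses the first five `births` values from the output above. The names `demo` and `lag_2_rows` are hypothetical and chosen only for this example.

```python
import pandas as pd

# First five births values from the output above.
births = pd.Series([35, 32, 30, 31, 44], name="births")

demo = pd.DataFrame({
    "births": births,
    "lag1": births.shift(1),        # value from one row earlier
    "lag_2_rows": births.shift(2),  # value from two rows earlier
})

# The first `periods` rows have no earlier value, so they become NaN.
# Because NaN is a float, the lag columns are stored as float64,
# which is why the lagged values print as 35.0 rather than 35.
print(demo)
print(demo.dtypes)
```

Note that `shift` counts rows, not calendar time, so a lag of 365 rows lines up with the same calendar date only when there is exactly one row per day and no leap day in between.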


## Window Feature

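Suppose we want another feature: the average of the current value and the value in the row above it, that is, a rolling mean with a window of 2. The first row has no previous value, so its result will be NaN.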

In [12]:
feature['Roll_mean'] = df2['births'].rolling(window=2).mean()

In [13]:
feature.head()

Unnamed: 0,date,births,year,month,day,lag1,lag2,Roll_mean
0,1959-01-01,35,1959,1,1,,,
1,1959-01-02,32,1959,1,2,35.0,,33.5
2,1959-01-03,30,1959,1,3,32.0,,31.0
3,1959-01-04,31,1959,1,4,30.0,,30.5
4,1959-01-05,44,1959,1,5,31.0,,37.5
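
The rolling mean with a window of 2 can be checked by hand against a shifted copy of the column. The sketch below uses the first five `births` values from the output above. The `past_only` variant is an assumption about a common forecasting setup (features built only from earlier rows), not something the original notebook does.

```python
import pandas as pd

births = pd.Series([35, 32, 30, 31, 44], name="births")

# Window of 2 = current row and the row above it.
roll_mean = births.rolling(window=2).mean()

# The same result computed by hand with shift().
manual = (births + births.shift(1)) / 2

# Variant that excludes the current row: shift first, then roll.
# Row t then averages rows t-1 and t-2 only.
past_only = births.shift(1).rolling(window=2).mean()

print(pd.DataFrame({
    "births": births,
    "roll_mean": roll_mean,
    "manual": manual,
    "past_only": past_only,
}))
```

Shifting before rolling is one way to keep the current value out of its own feature. Whether that matters depends on what is being predicted.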


Suppose we want the maximum value over a window of 3 periods (the current value and the previous two). The first 2 rows will be NaN because the window is not yet full.

In [14]:
feature['Roll_max'] = df2['births'].rolling(window=3).max()

In [15]:
feature.head()

Unnamed: 0,date,births,year,month,day,lag1,lag2,Roll_mean,Roll_max
0,1959-01-01,35,1959,1,1,,,,
1,1959-01-02,32,1959,1,2,35.0,,33.5,
2,1959-01-03,30,1959,1,3,32.0,,31.0,35.0
3,1959-01-04,31,1959,1,4,30.0,,30.5,32.0
4,1959-01-05,44,1959,1,5,31.0,,37.5,44.0
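
The leading NaN values come from the default requirement that a window be full before a result is produced. As a sketch, using the first five `births` values from the output above, the `min_periods` parameter of `rolling` relaxes that requirement. The variable names are illustrative only.

```python
import pandas as pd

births = pd.Series([35, 32, 30, 31, 44], name="births")

# Default: a result only once 3 values are available.
roll_max = births.rolling(window=3).max()
# -> NaN, NaN, 35.0, 32.0, 44.0

# min_periods=1: use however many values exist so far (up to 3).
roll_max_partial = births.rolling(window=3, min_periods=1).max()
# -> 35.0, 35.0, 35.0, 32.0, 44.0

print(pd.DataFrame({
    "births": births,
    "roll_max": roll_max,
    "roll_max_partial": roll_max_partial,
}))
```

With `min_periods=1` the early rows are computed from shorter windows, so they are not directly comparable to the later rows.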


## Expanding Feature

The expanding maximum is the maximum of all values up to and including the current row.

Example:

For the 1st row, the maximum is simply its own value. For the 2nd row, the 1st and 2nd values are compared and the larger one is stored in the 2nd row. For the 3rd row, the 1st, 2nd, and 3rd values are compared and the largest is stored in the 3rd row, and so on.

In [16]:
feature['Expand_max'] = df2['births'].expanding().max()

In [18]:
feature.head(10)

Unnamed: 0,date,births,year,month,day,lag1,lag2,Roll_mean,Roll_max,Expand_max
0,1959-01-01,35,1959,1,1,,,,,35.0
1,1959-01-02,32,1959,1,2,35.0,,33.5,,35.0
2,1959-01-03,30,1959,1,3,32.0,,31.0,35.0,35.0
3,1959-01-04,31,1959,1,4,30.0,,30.5,32.0,35.0
4,1959-01-05,44,1959,1,5,31.0,,37.5,44.0,44.0
5,1959-01-06,29,1959,1,6,44.0,,36.5,44.0,44.0
6,1959-01-07,45,1959,1,7,29.0,,37.0,45.0,45.0
7,1959-01-08,43,1959,1,8,45.0,,44.0,45.0,45.0
8,1959-01-09,38,1959,1,9,43.0,,40.5,45.0,45.0
9,1959-01-10,27,1959,1,10,38.0,,32.5,43.0,45.0
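
As a cross-check, the expanding maximum is the same as a running (cumulative) maximum. The sketch below uses the ten `births` values from the output above and compares `expanding().max()` with `cummax()`. The variable names are illustrative.

```python
import pandas as pd

# The ten births values from the output above.
births = pd.Series([35, 32, 30, 31, 44, 29, 45, 43, 38, 27], name="births")

# Window grows from the first row up to the current row.
expand_max = births.expanding().max()

# cummax() gives the same running maximum but keeps the integer dtype.
running_max = births.cummax()

print(pd.DataFrame({
    "births": births,
    "expand_max": expand_max,
    "running_max": running_max,
}))
```

Unlike the rolling window, the expanding window never drops old values, so the result can only stay the same or increase from one row to the next.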
