# Feature Engineering

In this lesson we'll cover automated ways to select features for modeling.

Feature selection is only one part of feature engineering; this lesson does not cover the rest.

In [18]:
import pandas as pd
import numpy as np
import wrangle_lesson
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [2]:
# Here's the source for the dataset and data dictionary https://archive.ics.uci.edu/ml/datasets/student+performance
path = "https://gist.githubusercontent.com/ryanorsinger/55ccfd2f7820af169baea5aad3a9c60d/raw/da6c5a33307ed7ee207bd119d3361062a1d1c07e/student-mat.csv"

df, X_train_explore, \
    X_train_scaled, y_train, \
    X_validate_scaled, y_validate, \
    X_test_scaled, y_test = wrangle_lesson.wrangle_student_math(path)

In [9]:
X_train_scaled.head()

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 0.0 | 1.0 | 1.0 | 0.0 | 0.666667 | 0.0 | 0.75 | 0.25 | 0.25 | 0.0 | 0.0 | 1.0 | 0.035714 | 0.357143 | 0.578947 |
| 326 | 0.333333 | 0.75 | 0.75 | 0.0 | 0.0 | 0.0 | 0.75 | 0.5 | 1.0 | 0.5 | 1.0 | 1.0 | 0.053571 | 0.714286 | 0.789474 |
| 88 | 0.166667 | 0.5 | 0.5 | 0.333333 | 0.333333 | 0.333333 | 0.75 | 0.75 | 0.25 | 0.0 | 0.0 | 0.5 | 0.214286 | 0.5 | 0.526316 |
| 118 | 0.333333 | 0.25 | 0.75 | 0.666667 | 0.333333 | 0.333333 | 1.0 | 0.25 | 0.75 | 0.0 | 0.75 | 1.0 | 0.357143 | 0.357143 | 0.368421 |
| 312 | 0.666667 | 0.25 | 0.5 | 0.0 | 0.333333 | 0.333333 | 0.75 | 1.0 | 0.25 | 0.25 | 0.25 | 0.75 | 0.053571 | 0.642857 | 0.578947 |


### SelectKBest

Scores each feature on its own with an [F test][1] of how well it linearly predicts the target variable (`f_regression`), then keeps the `k` highest-scoring features.

[1]: https://en.wikipedia.org/wiki/F-test#Formula_and_calculation

In [10]:
from sklearn.feature_selection import SelectKBest, f_regression

f_selector = SelectKBest(score_func=f_regression, k=3)
f_selector.fit(X_train_scaled, y_train)

SelectKBest(k=3, score_func=<function f_regression at 0x7fe72cef70d0>)

In [11]:
f_selector.get_support()

array([False, False, False, False, False,  True, False, False, False,
       False, False, False, False,  True,  True])

In [12]:
# transform is a method; evaluating it without calling it only displays the bound method
f_selector.transform

<bound method SelectorMixin.transform of SelectKBest(k=3, score_func=<function f_regression at 0x7fe72cef70d0>)>

In [13]:
mask = f_selector.get_support()
X_train_scaled.columns[mask]

Index(['failures', 'G1', 'G2'], dtype='object')
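To see why particular columns end up in the mask, it can help to look at the per-feature scores a fitted selector stores. The sketch below uses small synthetic data rather than the student dataset; `X_demo`, `y_demo`, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (not the student dataset): only "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 4 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# One F statistic and one p-value per column, in column order.
scores = pd.DataFrame(
    {"f_score": selector.scores_, "p_value": selector.pvalues_},
    index=X_demo.columns,
).sort_values("f_score", ascending=False)
print(scores)

# The support mask marks the k columns with the largest scores.
selected = list(X_demo.columns[selector.get_support()])
top_k = list(scores.index[:2])
```

Here `selected` and `top_k` name the same columns, which is all that `k` controls: how far down the sorted score table the cut is made.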

In [14]:
X_train_kbest = f_selector.transform(X_train_scaled)

model = LinearRegression().fit(X_train_kbest, y_train)
# ...
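One way the selected columns might be carried through to evaluation is sketched below. It is not the lesson's own continuation: the data is synthetic, and names such as `X_tr`, `X_val`, and `val_r2` are assumptions. The point it illustrates is that the selector is fit on the training split only and the same column choice is then applied to the other splits.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the lesson itself would use the scaled student splits.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(300, 10))
y_all = 3 * X_all[:, 0] - 2 * X_all[:, 1] + X_all[:, 2] + rng.normal(scale=0.5, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X_all, y_all, random_state=1)

# Fit the selector on the training split only ...
selector = SelectKBest(score_func=f_regression, k=3).fit(X_tr, y_tr)

# ... then apply the same column choice to both splits.
X_tr_kbest = selector.transform(X_tr)
X_val_kbest = selector.transform(X_val)

lm = LinearRegression().fit(X_tr_kbest, y_tr)
val_r2 = lm.score(X_val_kbest, y_val)  # R^2 on the held-out split
print(X_tr_kbest.shape, X_val_kbest.shape, round(val_r2, 3))
```

Refitting the selector on the validation or test split would let those splits influence which features are chosen, so the sketch avoids it.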

### Recursive Feature Elimination (RFE)

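Fits a model, drops the feature the model ranks as least important, and repeats on the remaining features until the requested number is left.

Only works for models that can rank features, i.e. ones that expose `coef_` or `feature_importances_` after fitting, as the next two cells show.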

In [15]:
model = LinearRegression().fit(X_train_scaled, y_train)
model.coef_

array([-1.57248067e+00,  5.91784593e-01, -2.21242008e-01,  1.01893087e+00,
       -1.75520671e-02,  5.02414426e-01,  1.07179785e+00,  3.27646012e-02,
        3.27458627e-01, -7.51580441e-01,  2.57014436e-01,  6.14409605e-01,
        2.55747873e+00,  2.57686922e+00,  1.88234926e+01])

In [19]:
# Tree models rank features with feature_importances_ rather than coef_
model = DecisionTreeRegressor().fit(X_train_scaled, y_train)
model.feature_importances_

array([5.02538607e-03, 2.32872865e-04, 1.25735957e-02, 1.31260454e-03,
       2.87310813e-03, 2.19086791e-03, 1.54408630e-02, 1.45995391e-03,
       7.67548962e-04, 6.43992695e-04, 3.81139298e-03, 1.24198861e-03,
       1.31916883e-01, 2.42031648e-02, 7.96305777e-01])
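The elimination loop itself can be sketched by hand for a linear model. This is a simplified illustration on synthetic data, not scikit-learn's implementation; `eliminate_recursively` is a hypothetical helper, and it assumes the features share a common scale so that coefficient magnitudes are comparable.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: only columns 0 and 1 influence the target.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 6))
y_demo = 5 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

def eliminate_recursively(X, y, n_keep):
    """Hypothetical helper: drop the smallest-|coef| column until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit on the surviving columns each round.
        coefs = LinearRegression().fit(X[:, remaining], y).coef_
        weakest = int(np.argmin(np.abs(coefs)))
        del remaining[weakest]  # position within `remaining`, not the original index
    return remaining

manual_keep = eliminate_recursively(X_demo, y_demo, n_keep=2)

# The library version, for comparison on the same data.
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_keep = [int(i) for i in np.flatnonzero(rfe_demo.support_)]
print(manual_keep, rfe_keep)
```

On this data both approaches keep the two informative columns. The refit inside the loop is what distinguishes this from ranking coefficients once: each round's ranking reflects only the features still in play.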

In [20]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [21]:
rfe.support_

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True,  True])

In [22]:
X_train_scaled.columns[rfe.support_]

Index(['G1', 'G2'], dtype='object')

In [23]:
rfe.ranking_

array([ 3,  8, 12,  5, 14,  7,  4, 13, 10,  9, 11,  6,  2,  1,  1])

In [24]:
pd.Series(rfe.ranking_, index=X_train_scaled.columns).sort_values()

G1             1
G2             1
absences       2
age            3
famrel         4
traveltime     5
health         6
failures       7
Medu           8
Dalc           9
goout         10
Walc          11
Fedu          12
freetime      13
studytime     14
dtype: int64

## Recap


- SelectKBest scores each feature against the target in isolation, so it is cheap but can keep redundant features
- RFE judges features together by repeatedly refitting a model (usually a linear model or a decision tree) and dropping the weakest feature
- RFE generally gives more robust results but is more expensive, because it fits one model per elimination round
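The difference between scoring in isolation and scoring jointly can be seen on a small constructed case. The sketch below is synthetic and deliberately contrived: `x0_copy` is a near-duplicate of `x0`, while `x2` carries weaker but independent signal. All names are made up for illustration, and the outcome on real data may differ.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
x0 = rng.normal(size=n)
X_demo = pd.DataFrame({
    "x0": x0,
    "x0_copy": x0 + rng.normal(scale=0.1, size=n),  # near-duplicate of x0
    "x2": rng.normal(size=n),                        # weaker but independent signal
})
y_demo = 3 * X_demo["x0"] + X_demo["x2"] + rng.normal(scale=0.1, size=n)

kbest = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

kbest_cols = list(X_demo.columns[kbest.get_support()])
rfe_cols = list(X_demo.columns[rfe.support_])
print("SelectKBest:", kbest_cols)
print("RFE:        ", rfe_cols)
```

In this setup SelectKBest keeps both copies of the strong feature, because each one looks good by itself, while RFE drops the copy once the model sees that it adds little beyond `x0`.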