In [1]:
%autosave 0

Autosave disabled


In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from pydataset import data
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

In [3]:
df = data('swiss')
df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


Let's look at the range of values in our dataset to see if we need to scale!

In [4]:
df.describe()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
count,47.0,47.0,47.0,47.0,47.0,47.0
mean,70.142553,50.659574,16.489362,10.978723,41.14383,19.942553
std,12.491697,22.711218,7.977883,9.615407,41.70485,2.912697
min,35.0,1.2,3.0,1.0,2.15,10.8
25%,64.7,35.9,12.0,6.0,5.195,18.15
50%,70.4,54.1,16.0,8.0,15.14,20.0
75%,78.45,67.65,22.0,12.0,93.125,21.7
max,92.5,89.7,37.0,53.0,100.0,26.6


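My fields have different scales, so it is wise to scale before continuing! This matters most for RFE, which ranks features by the size of their model coefficients, and coefficient size depends on each feature's units.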

In [5]:
mms = MinMaxScaler()

to_scale = df.drop(columns=['Fertility']).columns

df[to_scale] = mms.fit_transform(df[to_scale])

df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,0.178531,0.352941,0.211538,0.079816,0.721519
Delemont,83.1,0.496045,0.088235,0.153846,0.845069,0.721519
Franches-Mnt,92.5,0.435028,0.058824,0.076923,0.93255,0.594937
Moutier,85.8,0.39887,0.264706,0.115385,0.323148,0.601266
Neuveville,76.9,0.477966,0.411765,0.269231,0.030761,0.620253
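
As a sanity check on what the scaler did, here is a minimal sketch of the min-max formula, (x - min) / (max - min), applied by hand. It uses Courtelary's Agriculture and Examination values together with the column minimums and maximums from the describe() output. The helper name min_max is made up for illustration and is not part of scikit-learn.

```python
def min_max(x, col_min, col_max):
    """Rescale x to the [0, 1] range given the column's min and max."""
    return (x - col_min) / (col_max - col_min)

# Courtelary, Agriculture: value 17.0, column min 1.2, column max 89.7
agri_scaled = min_max(17.0, 1.2, 89.7)

# Courtelary, Examination: value 15, column min 3, column max 37
exam_scaled = min_max(15, 3, 37)

print(round(agri_scaled, 6), round(exam_scaled, 6))
```

Both results should agree with the scaled Courtelary row shown in the output above.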


Let's break our data into X and y subsets.

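We will move forward with SelectKBest. This technique scores each feature on its own with a statistical test (here f_regression, an F-test of the linear relationship between a single feature and the target) and keeps the k features with the highest scores.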

In [6]:
X = df.drop(columns = ['Fertility'])
y = df.Fertility

In [7]:
skb = SelectKBest(f_regression, k = 3)

skb.fit(X, y)

In [8]:
skb_mask = skb.get_support()
X.columns[skb_mask]

Index(['Examination', 'Education', 'Catholic'], dtype='object')

According to SelectKBest, Examination, Education, and Catholic have the strongest individual linear relationships with my target variable, Fertility.
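
To see why a feature made the cut, it can help to look at the scores behind the mask. The sketch below runs on synthetic data, not the swiss dataset, so the column names strong, weak, and noise and all the numbers are assumptions for illustration. It reads the fitted selector's scores_ and pvalues_ attributes and also rebuilds the F-score from the Pearson correlation r as r^2 / (1 - r^2) * (n - 2), which is what f_regression computes for one feature at a time.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (an assumption, not the swiss dataset)
rng = np.random.default_rng(0)
n = 47
X_demo = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y_demo = 3 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(size=n)

skb_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# One row per feature, highest F-score first
scores = pd.DataFrame(
    {'f_score': skb_demo.scores_, 'p_value': skb_demo.pvalues_},
    index=X_demo.columns,
).sort_values('f_score', ascending=False)
print(scores)

# The same F-scores, rebuilt from each feature's correlation with the target
r = X_demo.corrwith(y_demo)
f_manual = r ** 2 / (1 - r ** 2) * (n - 2)
print(f_manual)
```

Because each score depends on only one feature and the target, SelectKBest never sees how the features relate to each other.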

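Now let's do RFE (recursive feature elimination). With RFE, we need a model object that will be used to evaluate the predictive power of our features. RFE fits the model on all of the features, drops the weakest one (for linear regression, the feature with the smallest coefficient magnitude), and repeats until only the requested number of features remains. We'll use a simple linear regression model to do this.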

In [9]:
lm = LinearRegression()

rfe = RFE(lm, n_features_to_select=3)

rfe.fit(X, y)

In [10]:
rfe_mask = rfe.get_support()
X.columns[rfe_mask]

Index(['Agriculture', 'Education', 'Infant.Mortality'], dtype='object')

Our RFE object says that Agriculture, Education, and Infant.Mortality are the three features that survive elimination when the linear regression model is fit on them together.
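
Besides the boolean mask, a fitted RFE object exposes ranking_, which records the order of elimination: selected features get rank 1, and higher numbers were dropped earlier. The sketch below uses synthetic data, not the swiss dataset, so the column names big, medium, and noise are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the swiss dataset)
rng = np.random.default_rng(42)
n = 47
X_demo = pd.DataFrame(
    rng.normal(size=(n, 3)),
    columns=['big', 'medium', 'noise'],
)
# 'noise' has no real effect on the target
y_demo = 3 * X_demo['big'] + 2 * X_demo['medium'] + rng.normal(size=n)

rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

# Rank 1 = kept; larger ranks were eliminated
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

On the notebook's own data, the equivalent would be pd.Series(rfe.ranking_, index=X.columns).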

How does this compare to the features selected by SelectKBest?

In [11]:
X.columns[skb_mask]

Index(['Examination', 'Education', 'Catholic'], dtype='object')

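SelectKBest and RFE returned different results! Only Education appears in both lists. This is ok. The two techniques measure different things: SelectKBest scores each feature independently against the target, while RFE judges features by their contribution to a model that is fit on all of them together.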
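
To make the comparison concrete, the two selections can be treated as sets. This small sketch hardcodes the feature names from the outputs above instead of reading them from skb_mask and rfe_mask, so it runs on its own.

```python
# Feature names copied from the SelectKBest and RFE outputs above
skb_features = {'Examination', 'Education', 'Catholic'}
rfe_features = {'Agriculture', 'Education', 'Infant.Mortality'}

both = sorted(skb_features & rfe_features)      # chosen by both techniques
only_skb = sorted(skb_features - rfe_features)  # chosen by SelectKBest only
only_rfe = sorted(rfe_features - skb_features)  # chosen by RFE only

print('both:', both)
print('SelectKBest only:', only_skb)
print('RFE only:', only_rfe)
```

In the notebook itself, the same sets could be built with set(X.columns[skb_mask]) and set(X.columns[rfe_mask]).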