# Feature selection
Just because your dataset has 30 features (30 variables on the right-hand side of your equation) doesn't mean you have to use all 30 in your model. Can you think of reasons why it might be beneficial to drop certain variables?

Let's use our breast cancer dataset to experiment with feature selection.

In [2]:
import pandas as pd
from sklearn import linear_model
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import seaborn as sns
%matplotlib inline

In [3]:
df = pd.read_csv("../../assets/breast-cancer.csv", header=None)
df.iloc[:,1] = df.iloc[:,1] == 'M'
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,True,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,True,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,True,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,True,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,True,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


# First, perform a logistic regression on all of the features
(Remember, the first column is just the patient ID, so you can ignore it.)

In [4]:
y = df.iloc[:,1]
X = df.iloc[:,2:]
model = linear_model.LogisticRegression()
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

0.95957820738137078

# But do we need all of the features?
What strategy might we use to decide which features to drop? One simple idea is to rank each x variable by the absolute value of its correlation with y and keep only the top few.

In [5]:
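# Correlation matrix of every column (ID, y, and all of the x's)
corr = df.corr()
# Row 1 is y; columns 2 onward are its correlations with each x
yXCorr = corr.iloc[1,2:]
# Rank by strength of correlation, ignoring sign
yXCorr = abs(yXCorr)
yXCorr = pd.DataFrame(yXCorr)
yXCorr.sort_values(by=yXCorr.columns[0],inplace=True)
# Sorted ascending, so the last three rows are the most correlated x's.
# The column labels are integers equal to their positions, so iloc works here.
X = df.iloc[:,yXCorr.index[-3:]]
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)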

0.92091388400702989

In [6]:
yXCorr.tail()

Unnamed: 0,1
4,0.742636
22,0.776454
9,0.776614
24,0.782914
29,0.793566


In [7]:
X.head()

Unnamed: 0,9,24,29
0,0.1471,184.6,0.2654
1,0.07017,158.8,0.186
2,0.1279,152.5,0.243
3,0.1052,98.87,0.2575
4,0.1043,152.2,0.1625
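
The ranking step above can also be wrapped in a small helper. The sketch below is not part of the original notebook: `top_k_by_abs_corr` is a hypothetical name, and the data is synthetic (made up for illustration), not the breast cancer file. It assumes `X` is a DataFrame of numeric features and `y` is a numeric or boolean Series with the same index.

```python
import numpy as np
import pandas as pd

def top_k_by_abs_corr(X, y, k):
    """Return the labels of the k columns of X with the largest |Pearson r| against y."""
    # corrwith computes the correlation of each column of X with y
    abs_corr = X.corrwith(y.astype(float)).abs()
    return list(abs_corr.sort_values(ascending=False).index[:k])

# Synthetic demo data (an assumption for illustration only)
rng = np.random.RandomState(0)
toy_y = pd.Series(rng.randint(0, 2, size=200))
toy_X = pd.DataFrame({
    "strong": toy_y + rng.normal(scale=0.1, size=200),  # tracks y closely
    "weak": toy_y + rng.normal(scale=2.0, size=200),    # tracks y loosely
    "noise": rng.normal(size=200),                      # unrelated to y
})

top_two = top_k_by_abs_corr(toy_X, toy_y, 2)
print(top_two)
```

On the notebook's data, the analogous call would be `top_k_by_abs_corr(df.iloc[:,2:], y, 3)`.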


### Let's look at the correlations between the "three best features" according to our "most correlated with y" approach.
What can you say about how these features are correlated with each other?

In [29]:
X.corr()

Unnamed: 0,9,24,29
9,1.0,0.855923,0.910155
24,0.855923,1.0,0.816322
29,0.910155,0.816322,1.0


### Let's also add our y variable, and look at its correlation numbers, for future reference.

In [34]:
yDf = pd.DataFrame(y)
yX = yDf.join(X)
yX.corr()

Unnamed: 0,1,9,24,29
1,1.0,0.776614,0.782914,0.793566
9,0.776614,1.0,0.855923,0.910155
24,0.782914,0.855923,1.0,0.816322
29,0.793566,0.910155,0.816322,1.0


As the scores show, dropping from 30 features to three costs little: the score falls only from about 0.960 to about 0.921. To drive home the point that correlation with y matters, let's repeat the test with the three least correlated variables.

In [6]:
X = df.iloc[:,yXCorr.index[:3]]
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

0.62741652021089633

In [8]:
yXCorr.head()

Unnamed: 0,1
20,0.006522
13,0.008303
11,0.012838
16,0.067016
21,0.077972


# Use SelectKBest to get the best 3 features using chi2.
Refit the model, predict again, and print the new score.

In [13]:
# Don't forget to reset X to the original variables (all of the features)
X = df.iloc[:,2:]
# chi2 requires non-negative feature values; keep the 3 highest-scoring features
X = SelectKBest(chi2, k=3).fit_transform(X, y)
model.fit(X,y)
yHat = model.predict(X)
model.score(X,y)

0.93321616871704749
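
`fit_transform` returns a plain array, so the cell above does not show which three columns were kept. Here is a minimal sketch of one way to recover them, using `get_support` on the fitted selector. The toy data is synthetic and the variable names are illustrative only; none of it comes from the notebook's dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative features (chi2 requires non-negative values)
rng = np.random.RandomState(0)
toy_y = rng.randint(0, 2, size=200)
toy_X = np.column_stack([
    rng.uniform(0, 1, 200),                # column 0: unrelated to y
    toy_y * 5.0 + rng.uniform(0, 1, 200),  # column 1: strongly tied to y
    rng.uniform(0, 1, 200),                # column 2: unrelated to y
])

# Keep the fitted selector so it can be inspected afterwards
selector = SelectKBest(chi2, k=1).fit(toy_X, toy_y)
kept = selector.get_support(indices=True)  # positions of the kept columns
toy_X_reduced = selector.transform(toy_X)

print(kept)
print(toy_X_reduced.shape)
```

With the notebook's DataFrame, fitting the selector on `df.iloc[:,2:]` and then indexing that DataFrame's `columns` with `selector.get_support()` would give the labels of the kept columns.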

# Can we make an improvement in the score?
Let's try a brute-force (exhaustive) search to see whether some other set of three features gives a better score. Go through every combination of three x variables, fit and score the model on each one, and keep track of the best score and the features that produced it.

In [22]:
# First, reset X to the original variables (all of the features)
X = df.iloc[:,2:]
bestColumnIndices = [-1,-1,-1]
bestScore = 0.0
for i in X.columns:
    for j in X.columns:
        if j <= i:
            continue
        for k in X.columns:
            if (k <= j):
                continue
            XTest = df.loc[:,[i,j,k]]
            model.fit(XTest,y)
            yHat = model.predict(XTest)
            score = model.score(XTest,y)
            if (score > bestScore):
                bestScore = score
                bestColumnIndices = [i,j,k]
print(bestScore)
print(bestColumnIndices)

0.952548330404
[22, 23, 28]
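
The three nested loops above visit each unordered triple of columns exactly once, which for 30 features means 4,060 model fits. A hedged alternative sketch uses `itertools.combinations` to produce the same triples. The helper name `best_subset` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def best_subset(X, y, size, model=None):
    """Fit the model on every `size`-column subset of X; return (best score, columns)."""
    model = model if model is not None else LogisticRegression()
    best_score, best_cols = -1.0, None
    for cols in combinations(X.columns, size):
        subset = X.loc[:, list(cols)]
        # Score on the training data, as the notebook does
        score = model.fit(subset, y).score(subset, y)
        if score > best_score:
            best_score, best_cols = score, list(cols)
    return best_score, best_cols

# Synthetic demo data: only column "a" carries information about y
rng = np.random.RandomState(0)
toy_y = pd.Series(rng.randint(0, 2, size=200))
toy_X = pd.DataFrame({
    "a": toy_y * 4.0 + rng.normal(size=200),
    "b": rng.normal(size=200),
    "c": rng.normal(size=200),
    "d": rng.normal(size=200),
})

toy_score, toy_cols = best_subset(toy_X, toy_y, 2)
print(toy_score, toy_cols)
```

On the notebook's data the equivalent call would be `best_subset(df.iloc[:,2:], y, 3)`. The cost grows quickly with the subset size, so an exhaustive search like this is only practical for small subsets.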


# How do these best features compare with the first three features we found?
Originally, we chose the three x's most correlated with y and scored about 0.921. The brute-force search found three features that score about 0.953, close to the 0.960 we got with all 30 features. Why do you think that is?

## Let's examine the correlation matrix of our new three best features and y

In [42]:
yX = yDf.join(df.loc[:,bestColumnIndices])
yX.corr()

Unnamed: 0,1,22,23,28
1,1.0,0.776454,0.456903,0.65961
22,0.776454,1.0,0.359921,0.573975
23,0.456903,0.359921,1.0,0.368366
28,0.65961,0.573975,0.368366,1.0
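
One way to put a number on the difference between the two feature sets is the average absolute correlation among the features themselves. The sketch below copies the x-to-x correlations printed in the two tables above into arrays. The helper `mean_offdiag_abs_corr` is a hypothetical name, and this summary statistic is an illustrative choice rather than something the notebook computes.

```python
import numpy as np

def mean_offdiag_abs_corr(corr):
    """Average |correlation| over all off-diagonal entries of a square matrix."""
    corr = np.asarray(corr, dtype=float)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return np.abs(corr[off_diagonal]).mean()

# Features 9, 24, 29: the three most correlated with y (values from the table above)
corr_top_correlated = [
    [1.0,      0.855923, 0.910155],
    [0.855923, 1.0,      0.816322],
    [0.910155, 0.816322, 1.0],
]

# Features 22, 23, 28: the brute-force winners (values from the table above)
corr_brute_force = [
    [1.0,      0.359921, 0.573975],
    [0.359921, 1.0,      0.368366],
    [0.573975, 0.368366, 1.0],
]

redundancy_top = mean_offdiag_abs_corr(corr_top_correlated)
redundancy_brute = mean_offdiag_abs_corr(corr_brute_force)
print(redundancy_top)    # about 0.861
print(redundancy_brute)  # about 0.434
```

The first set overlaps heavily, so each additional feature contributes little that the others do not already carry. The brute-force set is individually less correlated with y, but its features are also much less correlated with each other.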
