# Feature selection

ISLP uses its own custom function and score to select features. Here, I will try some of scikit-learn's feature selection tools instead.

In [82]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
heart_attack = pd.read_csv('../02-logistic-regression/data/Medicaldataset.csv')

In [3]:
heart_attack.head()

| | Age | Gender | Heart rate | Systolic blood pressure | Diastolic blood pressure | Blood sugar | CK-MB | Troponin | Result |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 1 | 66 | 160 | 83 | 160.0 | 1.8 | 0.012 | negative |
| 1 | 21 | 1 | 94 | 98 | 46 | 296.0 | 6.75 | 1.06 | positive |
| 2 | 55 | 1 | 64 | 160 | 77 | 270.0 | 1.99 | 0.003 | negative |
| 3 | 64 | 1 | 70 | 120 | 55 | 270.0 | 13.87 | 0.122 | positive |
| 4 | 55 | 1 | 64 | 112 | 65 | 300.0 | 1.08 | 0.003 | negative |


In [4]:
X = heart_attack.drop(columns='Result')
y = heart_attack['Result']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

## Lasso

Lasso (the `l1` penalty) is a good way to eliminate predictors because it can shrink coefficients to exactly zero. I have already practiced with the `l2` penalty in a previous notebook.

In [40]:
lasso_pipeline = make_pipeline(StandardScaler(),
                         LogisticRegression(solver='liblinear', penalty='l1', C=0.1))

In [41]:
lasso_pipeline.fit(X_train, y_train)

In [42]:
model = lasso_pipeline.named_steps['logisticregression']

In [43]:
model.coef_

array([[ 0.43971083,  0.11720989,  0.00818281,  0.        ,  0.        ,
        -0.05408185,  1.75794687,  1.45882945]])

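With the `C` parameter of `LogisticRegression` lowered to 0.1 (a smaller `C` means stronger regularization), the `l1` penalty shrank two coefficients to exactly zero, eliminating systolic and diastolic blood pressure.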
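To read off which predictors were eliminated without counting positions in the array, the coefficients can be paired with the column names. This is a minimal sketch on synthetic data, not the heart attack dataset: `X_demo`, `y_demo`, and the column names `x0` to `x7` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 numeric predictors, 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"x{i}" for i in range(8)])

demo_pipeline = make_pipeline(StandardScaler(),
                              LogisticRegression(solver='liblinear', penalty='l1', C=0.1))
demo_pipeline.fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem, so take the first row
coefs = pd.Series(demo_pipeline.named_steps['logisticregression'].coef_[0],
                  index=X_demo.columns)

eliminated = coefs[coefs == 0].index.tolist()  # shrunk to exactly zero
kept = coefs[coefs != 0].index.tolist()

print(coefs)
print("eliminated:", eliminated)
```

The same `pd.Series(..., index=X.columns)` pattern should work with `lasso_pipeline` and `X` from this notebook.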

In [44]:
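# predict() returns class labels, not probabilities
preds = lasso_pipeline.predict(X_test)

In [45]:
# Fraction of correct predictions (test accuracy)
np.mean(y_test == preds)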

0.75
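Both the number of surviving predictors and the accuracy depend on the choice of `C`. A rough way to see the trade-off is to sweep a few values and record the non-zero coefficient count next to a cross-validated accuracy. The sketch below uses synthetic data and an arbitrary grid of `C` values, both assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 predictors, 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     random_state=0)

results = []
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:  # arbitrary grid, smaller C = stronger penalty
    pipe = make_pipeline(StandardScaler(),
                         LogisticRegression(solver='liblinear', penalty='l1', C=C))
    # Mean 5-fold cross-validated accuracy for this C
    acc = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
    # Refit on all the demo data to count the coefficients that survive
    pipe.fit(X_demo, y_demo)
    n_nonzero = int(np.count_nonzero(pipe.named_steps['logisticregression'].coef_))
    results.append((C, n_nonzero, acc))
    print(f"C={C:<6} non-zero coefficients={n_nonzero} cv accuracy={acc:.3f}")
```

On the real data, the sweep would use `X_train` and `y_train` only, so that the test set stays untouched until the final evaluation.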

## Recursive Feature Elimination

I am still going to use `LogisticRegression`, but without the `liblinear` solver this time: `liblinear` does not support `penalty=None`, while the default `lbfgs` solver does.

In [51]:
rfe_pipeline = make_pipeline(StandardScaler(),
                             RFE(LogisticRegression(penalty=None), n_features_to_select=5))

In [52]:
rfe_pipeline.fit(X_train, y_train)

In [76]:
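# support_ is True for the features RFE kept, so label the column accordingly
pd.DataFrame({'Selected': rfe_pipeline.named_steps["rfe"].support_},
             index=X.columns)

| | Selected |
|---|---|
| Age | True |
| Gender | True |
| Heart rate | True |
| Systolic blood pressure | False |
| Diastolic blood pressure | False |
| Blood sugar | False |
| CK-MB | True |
| Troponin | True |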
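Besides `support_`, a fitted `RFE` also exposes `ranking_`, which shows the order in which features were dropped: selected features get rank 1, and the highest rank belongs to the feature eliminated first. The sketch below uses synthetic data with made-up column names, and it keeps the default `l2` penalty of `LogisticRegression` rather than `penalty=None`, which is a simplification for portability across scikit-learn versions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 predictors named x0..x7
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"x{i}" for i in range(8)])
X_scaled = StandardScaler().fit_transform(X_demo)

# Default step=1: one feature is removed per iteration until 5 remain
rfe = RFE(LogisticRegression(), n_features_to_select=5).fit(X_scaled, y_demo)

summary = pd.DataFrame({'Selected': rfe.support_, 'Rank': rfe.ranking_},
                       index=X_demo.columns).sort_values('Rank')
print(summary)
```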


In [73]:
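# predict() returns class labels, not probabilities
preds = rfe_pipeline.predict(X_test)

In [74]:
# Fraction of correct predictions (test accuracy)
np.mean(preds == y_test)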

0.8087121212121212

## SelectFromModel

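I just realized that `SelectFromModel` is meant to be applied on top of a fitted model, such as the `l1`-penalized logistic regression from the Lasso section: it keeps only the features whose coefficients are large enough to pass a threshold.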

In [124]:
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)

In [125]:
model = LogisticRegression(solver='liblinear', penalty='l1', C=0.1).fit(X_train_scaled, y_train)

In [126]:
sfm_model = SelectFromModel(model, prefit=True)

In [136]:
# transform() only selects columns, so the full, unscaled X can be passed in
X_new = sfm_model.transform(X)



In [139]:
X.shape

(1319, 8)

In [138]:
X_new.shape

(1319, 6)
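The shape only says that two columns were dropped, not which ones. `get_support()` returns the boolean mask behind the selection, which can be used to recover the column names. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the names `x0` to `x7` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 predictors named x0..x7
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"x{i}" for i in range(8)])
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=X_demo.columns)

l1_model = LogisticRegression(solver='liblinear', penalty='l1', C=0.1).fit(X_scaled, y_demo)
selector = SelectFromModel(l1_model, prefit=True)

# Boolean mask: True for every feature the selector keeps
mask = selector.get_support()
kept_columns = X_demo.columns[mask].tolist()

# Indexing with the mask keeps the result a DataFrame with named columns
X_selected = X_demo.loc[:, mask]
print(kept_columns, X_selected.shape)
```

With an `l1`-penalized estimator, the default threshold of `SelectFromModel` is `1e-5`, so in practice the mask marks the non-zero coefficients.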

In [141]:
# Same test_size and random_state as before, so the rows land in the same train/test split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.4, random_state=0)

In [142]:
new_pipeline = make_pipeline(StandardScaler(),
                             LogisticRegression(solver='liblinear',penalty='l2'))

In [144]:
new_pipeline.fit(X_train, y_train)

In [146]:
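# predict() returns class labels, not probabilities
preds = new_pipeline.predict(X_test)

In [147]:
# Fraction of correct predictions (test accuracy)
np.mean(preds == y_test)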

0.7784090909090909

In this case, after `SelectFromModel`, I can use the `l2` penalty instead of `l1`, since the features whose coefficients the earlier `l1` fit shrank to zero have already been removed.
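The scale, select, and refit steps can also be chained in a single pipeline, so the selector is always fitted on the training rows only and no second split is needed. This is a sketch of one possible arrangement on synthetic data, not the workflow used above; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 predictors, 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     n_redundant=0, n_clusters_per_class=1,
                                     random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

sfm_pipeline = make_pipeline(
    StandardScaler(),
    # The l1 model is used only to decide which features survive
    SelectFromModel(LogisticRegression(solver='liblinear', penalty='l1', C=0.1)),
    # The final classifier is refit with l2 on the surviving features
    LogisticRegression(solver='liblinear', penalty='l2'),
)
sfm_pipeline.fit(Xtr, ytr)

n_kept = int(sfm_pipeline.named_steps['selectfrommodel'].get_support().sum())
accuracy = sfm_pipeline.score(Xte, yte)  # mean accuracy on the held-out rows
print(n_kept, accuracy)
```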