# Feature Selection using <font color=red>Embedded Methods</font>

<img src='Data/Reducing Complexity.png' width=500/>

<font color=red>__Embedded Methods__</font>
- Lasso Regressor
- Decision Tree Regressor

- In __Embedded methods__, we build models that __assign an importance to each individual feature__, and we select features based on those importances. The selection is embedded in the model's own training, which is where the name comes from. Note that only some models (such as __decision trees__ and __lasso regression__) have this ability.
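As a minimal sketch of this idea, scikit-learn's `SelectFromModel` can wrap a model that exposes coefficients or importances and keep only the features above a threshold. The synthetic data from `make_regression`, the names `X_demo` and `y_demo`, and the values `alpha=1.0` and `threshold=1e-5` are assumptions for illustration; they are not part of the car dataset used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): 6 features, only 2 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=5.0, random_state=0
)

# SelectFromModel fits the Lasso and keeps the features whose
# absolute coefficient is at least the threshold.
selector = SelectFromModel(Lasso(alpha=1.0), threshold=1e-5)
selector.fit(X_demo, y_demo)

selected_mask = selector.get_support()   # boolean mask, one entry per feature
X_selected = selector.transform(X_demo)  # only the retained columns

print("Selected feature indices:", np.flatnonzero(selected_mask))
print("Shape before / after:", X_demo.shape, X_selected.shape)
```

The same wrapper should work with a tree-based model, in which case it reads `feature_importances_` instead of `coef_`.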

In [1]:
import numpy as np
import pandas as pd

In [2]:
automobile = pd.read_csv('Data/cars_processed.csv')
automobile.head(10)

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin,Age
0,18.0,8,307.0,130,3504,12.0,1,50
1,15.0,8,350.0,165,3693,11.5,1,50
2,16.0,8,304.0,150,3433,12.0,1,50
3,17.0,8,302.0,140,3449,10.5,1,50
4,15.0,8,429.0,198,4341,10.0,1,50
5,14.0,8,454.0,220,4354,9.0,1,50
6,14.0,8,440.0,215,4312,8.5,1,50
7,14.0,8,455.0,225,4425,10.0,1,50
8,15.0,8,390.0,190,3850,8.5,1,50
9,15.0,8,383.0,170,3563,10.0,1,50


In [3]:
X = automobile.drop('MPG', axis=1)
Y = automobile['MPG']

<font color=red>__Lasso Regressor__</font>

In [4]:
from sklearn.linear_model import Lasso

In [5]:
lasso = Lasso(alpha=0.8)
lasso.fit(X, Y)

Lasso(alpha=0.8, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

Note: The goal here is to see which features the model treats as significant, not to evaluate the regression model, so we fit on the full dataset without splitting it into training and testing sets.

In [6]:
predictors = X.columns
coef = pd.Series(lasso.coef_, predictors).sort_values()
print(coef)

Age            -0.682080
Horsepower     -0.007132
Weight         -0.006456
Cylinders      -0.000000
Displacement    0.000000
Acceleration    0.000000
Origin          0.000000
dtype: float64


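The __regularization parameter__ (`alpha`) in Lasso regression controls an L1 penalty that shrinks the __coefficients of unimportant variables__ toward __0__, often to exactly 0, as happens for four of the features above. Because the features are not standardized here, the magnitudes of the non-zero coefficients depend on each feature's units and are not directly comparable.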
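To make the effect of the regularization strength concrete, here is a small sketch that counts how many coefficients are exactly zero for a few values of `alpha`. The synthetic data, the list of `alpha` values, and the standardization step with `StandardScaler` are assumptions chosen for illustration; the cell above fits the raw, unscaled car features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)

# Standardize so that every feature is on the same scale and the
# coefficient magnitudes can be compared with each other.
X_scaled = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y_demo)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha:>7}: {zero_counts[alpha]} of {X_scaled.shape[1]} coefficients are zero")
```

A very large `alpha` removes every feature, so in practice the value is usually tuned (for example with cross-validation) rather than fixed by hand.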

In [7]:
# Two of the features that Lasso kept (non-zero coefficients)
lasso_features = ['Age', 'Weight']
X[lasso_features].head()

Unnamed: 0,Age,Weight
0,50,3504
1,50,3693
2,50,3433
3,50,3449
4,50,4341


<font color=red>__Decision Tree Regressor__</font>

In [8]:
from sklearn.tree import DecisionTreeRegressor

In [9]:
decision_tree = DecisionTreeRegressor(max_depth=4)
decision_tree.fit(X, Y)

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [10]:
predictors = X.columns
# feature_importances_ are impurity-based importances, not regression coefficients
importances = pd.Series(decision_tree.feature_importances_, predictors).sort_values()
importances

Cylinders       0.000000
Acceleration    0.000000
Origin          0.005397
Weight          0.045475
Age             0.110994
Horsepower      0.186934
Displacement    0.651200
dtype: float64

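Features with an importance close to 0 contribute little or nothing to the tree's splits and can be treated as unimportant. Note that these values are feature importances, not coefficients.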
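The importances can also be ranked programmatically instead of being read off by eye. The sketch below assumes synthetic data, hypothetical column names `f0` to `f4`, and an arbitrary cut-off of the top 2 features; for a fitted tree with at least one split, the importances are normalized so that they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption) with hypothetical column names f0..f4.
X_arr, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

# random_state is fixed so that ties between candidate splits are broken the same way on every run.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

importances = pd.Series(tree.feature_importances_, index=X_demo.columns)

# Keep the k features with the largest importance (k=2 is an arbitrary choice).
top_features = importances.sort_values(ascending=False).head(2).index.tolist()

print(importances.round(3))
print("Sum of importances:", importances.sum())
print("Top 2 features:", top_features)
```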

In [11]:
# The two features with the highest decision-tree importances
decision_tree_features = ['Displacement', 'Horsepower']
X[decision_tree_features].head()

Unnamed: 0,Displacement,Horsepower
0,307.0,130
1,350.0,165
2,304.0,150
3,302.0,140
4,429.0,198


Now let us use the __features__ selected with __Lasso__ and with the __Decision Tree__ to train a __Linear Regression model__, and compare the test scores against a model trained on all features.

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [13]:
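def build_model(X, Y, test_frac):
    """Fit a linear regression on a random train/test split and print the test R^2."""
    # No random_state is set, so the split (and the score) changes on every run
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = LinearRegression().fit(x_train, y_train)
    y_pred = model.predict(x_test)

    print('Test score = ', r2_score(y_test, y_pred))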

In [14]:
build_model(X, Y, 0.2)

Test score =  0.8240099050954761


In [15]:
build_model(X[lasso_features], Y, 0.2)

Test score =  0.8263539319746744


In [16]:
build_model(X[decision_tree_features], Y, 0.2)

Test score =  0.6555541288320288
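Each call to `build_model` draws a new random split, so the three scores above come from different test sets and small differences between them may be noise. One way to make the comparison steadier is to average the R² over several folds. The sketch below uses `cross_val_score` on synthetic data; the helper `mean_cv_r2`, the column names, and the two feature subsets are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption) with hypothetical column names f0..f5.
X_arr, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])


def mean_cv_r2(features):
    """Return the mean R^2 of a linear regression over 5 cross-validation folds."""
    scores = cross_val_score(
        LinearRegression(), X_demo[features], y_demo, cv=5, scoring="r2"
    )
    return float(scores.mean())


# Hypothetical subsets: every column versus an arbitrary pair of columns.
subsets = {
    "all features": list(X_demo.columns),
    "two-feature subset": ["f0", "f1"],
}

results = {name: mean_cv_r2(cols) for name, cols in subsets.items()}
for name, score in results.items():
    print(f"{name}: mean CV R^2 = {score:.4f}")
```

With the car data, the same helper could be called with `lasso_features` and `decision_tree_features` in place of the hypothetical subsets.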
