# Feature Selection II - Selecting for Model Accuracy  
In this second chapter on feature selection, you'll learn how to let models help you find the most important features in a dataset for predicting a particular target feature. In the final lesson of this chapter, you'll combine the advice of multiple different models to decide which features are worth keeping.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Selecting features for model performance  
  
**Selecting features for model performance**  
  
In previous chapters, we've always looked at individual or pairwise properties of features to decide on whether we keep them in the dataset or not. Another, more pragmatic, approach is to select features based on how they affect model performance.  
  
**Ansur dataset sample**
  
Consider this sample of the ANSUR dataset with one target variable, "Gender" which we'll try to predict, and five body measurement features to do so.  
  
**Pre-processing the data**
  
To train a model on this data we should first perform a train-test split, and in this case also standardize the training feature dataset X_train to have a mean of zero and a variance of one. Notice that we've used the `.fit_transform()` method to fit the scaler and transform the data in one command.
  
**Creating a logistic regression model**
  
We can then create and fit a logistic regression model on this standardized training data. To see how the model performs on the test set, we first scale these features with the `.transform()` method of the scaler that we just fit on the training set and then make our prediction. We get a test-set accuracy of 99%.  
  
**Inspecting the feature coefficients**  
  
However, when we look at the feature coefficients that the logistic regression model uses in its decision function, we'll see that some values are pretty close to zero. Since these coefficients are multiplied with the feature values when the model makes a prediction, features with coefficients close to zero contribute little to the end result. We can use the `zip()` function to turn the output into a dictionary that shows which feature has which coefficient. If we want to remove a feature from the initial dataset with as little effect on the predictions as possible, we should pick the one whose coefficient is smallest in absolute value, "handlength" in this case. Standardizing the data first is what makes the coefficients comparable to one another.  
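
The steps described so far can be sketched in a few lines. This is a minimal illustration, not the lesson's actual code: the ANSUR sample is not loaded here, so a synthetic dataset from `make_classification` stands in for it, and the column names `feat_0` to `feat_4` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the ANSUR sample (5 features, binary target)
X_arr, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(5)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training features only, then fit the model
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
lr = LogisticRegression()
lr.fit(X_train_std, y_train)

# Pair each feature name with the absolute value of its coefficient
coef_by_feature = dict(zip(X.columns, np.abs(lr.coef_[0]).round(2)))
weakest = min(coef_by_feature, key=coef_by_feature.get)
print(coef_by_feature)
print("Candidate for removal:", weakest)
```

Comparing absolute values matters because a large negative coefficient is just as influential as a large positive one.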
  
**Features that contribute little to a model**  
  
When we remove the "handlength" feature at the start of the process, our model accuracy remains unchanged at 99% while we did reduce our dataset's complexity. We could repeat this process until we have the desired number of features remaining, but it turns out, there's a Scikit-learn function that does just that.  
  
**Recursive Feature Elimination**  
  
`from sklearn.feature_selection import RFE`  
`rfe = RFE(estimator= LogisticRegression(), n_features_to_select= 2, verbose= 1)`  
RFE, short for "Recursive Feature Elimination", is a feature selection algorithm that can be wrapped around any model that produces feature coefficients or feature importance values. We can pass it the model we want to use and the number of features we want to select. While fitting to our data it will repeat a process where it first fits the internal model and then drops the feature with the weakest coefficient. It will keep doing this until the desired number of features is reached. If we set RFE's `verbose=` parameter higher than zero we'll be able to see that features are dropped one by one while fitting. We could also decide to just keep the 2 features with the highest coefficients after fitting the model once, but this recursive method is safer, since dropping one feature will cause the other coefficients to change.  
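
To make the loop concrete, here is a hand-rolled sketch of the same idea: fit, drop the feature with the smallest absolute coefficient, repeat. It is an illustration of the principle on synthetic data (an assumption, since the ANSUR sample is not loaded here), not scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data with 6 features, standardized so coefficients are comparable
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=1)
X_std = StandardScaler().fit_transform(X)

remaining = list(range(X_std.shape[1]))  # column indices still in the dataset
n_features_to_select = 2

while len(remaining) > n_features_to_select:
    # Refit on the remaining columns, because coefficients change after each drop
    model = LogisticRegression().fit(X_std[:, remaining], y)
    weakest_pos = int(np.argmin(np.abs(model.coef_[0])))
    dropped = remaining.pop(weakest_pos)
    print(f"Dropped feature {dropped}, {len(remaining)} left")

print("Kept feature indices:", remaining)
```

The refit inside the loop is the whole point: ranking all features once and cutting the bottom ones would skip it.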
  
**Inspecting the RFE results**  
  
`X.columns[rfe.support_]`  
`print(dict(zip(X.columns, rfe.ranking_)))`  
`print(accuracy_score(y_test, rfe.predict(X_test_std)))`  
Once RFE is done we can check the `.support_` attribute, which contains True/False values, to see which features were kept in the dataset. Using the `zip()` function once more, we can also check out rfe's `.ranking_` attribute to see in which iteration a feature was dropped. Values of 1 mean that the feature was kept in the dataset until the end, while high values mean the feature was dropped early on. Finally, we can test the accuracy of the model with just the two remaining features, 'chestdepth' and 'neckcircumference'. It turns out the accuracy is still untouched at 99%. This means the other 3 features had little to no impact on the model and its predictions.  

NOTE: `pprint` is a Python module that provides a `pprint()` function (pretty-print) for printing complex data structures like dictionaries, lists, and tuples in a more readable and organized way. The `pprint()` function prints the data in a formatted and indented way that makes it easier to read and understand, especially for nested data structures with multiple levels of indentation.

### Building a diabetes classifier  
  
You'll be using the Pima Indians diabetes dataset to predict whether a person has diabetes using logistic regression. There are 8 features and one target in this dataset.  

In [2]:
# Load the dataset
diabetes_df = pd.read_csv('../_datasets/PimaIndians.csv')
diabetes_df.head()

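|   | pregnant | glucose | diastolic | triceps | insulin | bmi | family | age | test |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | negative |
| 1 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | positive |
| 2 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | positive |
| 3 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | positive |
| 4 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | positive |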


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from pprint import pprint


# X/y split
X, y = diabetes_df.iloc[:, :-1], diabetes_df.iloc[:, -1]

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Creating the scaler and the ML instance
scaler = StandardScaler()
lr = LogisticRegression()


In [4]:
# Fit the scaler on the training features and transform them in one go
X_train_std = scaler.fit_transform(X_train)

# Fit the logistic regression model on the scaled training data
lr.fit(X_train_std, y_train)

# Scale the test features
X_test_std = scaler.transform(X_test)

# Predict diabetes presence on the scaled test set
y_pred = lr.predict(X_test_std)

# Print accuracy metrics and feature coefficients
print('{0:.1%} accuracy on test set.'.format(accuracy_score(y_test, y_pred)))
pprint(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

78.0% accuracy on test set.
{'age': 0.44,
 'bmi': 0.4,
 'diastolic': 0.04,
 'family': 0.25,
 'glucose': 1.01,
 'insulin': 0.17,
 'pregnant': 0.26,
 'triceps': 0.11}


We get about 78% accuracy on the test set. Take a look at how much the model coefficients differ between features: 'glucose' dominates, while 'diastolic' is close to zero.

### Manual Recursive Feature Elimination  
  
Now that we've created a diabetes classifier, let's see if we can reduce the number of features without hurting the model accuracy too much.  
  
At the top of each cell below, the features are selected from the original DataFrame. We adjust this selection step by step, dropping the weakest features each time.  

In [5]:
# Start with all eight features to see which one has the lowest coefficient
X = diabetes_df[[
    'pregnant', 'glucose', 'triceps', 'diastolic', 
    'insulin', 'bmi', 'family', 'age'
]]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculate the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print('{0: .1%} accuracy on test set.'.format(acc))
pprint(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))


 79.6% accuracy on test set.
{'age': 0.34,
 'bmi': 0.38,
 'diastolic': 0.03,
 'family': 0.35,
 'glucose': 1.23,
 'insulin': 0.19,
 'pregnant': 0.05,
 'triceps': 0.24}


In [6]:
# Remove the feature with the lowest model coefficient ('diastolic')
X = diabetes_df[[
    'pregnant', 'glucose', 'triceps', 'insulin', 'bmi', 'family', 'age'
]]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculate the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print('{0: .1%} accuracy on test set.'.format(acc))
pprint(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

 80.6% accuracy on test set.
{'age': 0.35,
 'bmi': 0.39,
 'family': 0.34,
 'glucose': 1.24,
 'insulin': 0.2,
 'pregnant': 0.05,
 'triceps': 0.24}


In [7]:
# Remove the 2 features with the lowest model coefficients ('insulin', 'pregnant')
X = diabetes_df[['glucose', 'triceps', 'bmi', 'family', 'age']]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print("{0:.1%} accuracy on test set.".format(acc)) 
pprint(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

79.6% accuracy on test set.
{'age': 0.37, 'bmi': 0.34, 'family': 0.34, 'glucose': 1.13, 'triceps': 0.25}


In [8]:
# Only keep the feature with the highest coefficient
X = diabetes_df[['glucose']]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model to the data
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print("{0:.1%} accuracy on test set.".format(acc)) 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

75.5% accuracy on test set.
{'glucose': 1.28}


Removing all but one feature only reduced the accuracy by about four percentage points, from 79.6% to 75.5%.

### Automatic Recursive Feature Elimination  
  
Now let's automate this recursive process. Wrap a Recursive Feature Eliminator (RFE) around our logistic regression estimator and pass it the desired number of features.

In [9]:
# X/y split
X, y = diabetes_df.iloc[:, :-1], diabetes_df.iloc[:, -1]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Creating LogisticRegression instance
lr = LogisticRegression()

# Fitting the scaler on the training features and transforming them in one go
X_train_std = scaler.fit_transform(X_train)

# Fit the logistic regression model on the scaled training data
lr.fit(X_train_std, y_train)

# Scale the test features
X_test_std = scaler.transform(X_test)

In [10]:
from sklearn.feature_selection import RFE

# Create the RFE with a LogisticRegression estimator and 3 features to select
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, verbose=1)

# Fit the eliminator to the data
rfe.fit(X_train_std, y_train)

# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))

# Print the features that are not eliminated
print(X.columns[rfe.support_])

# Calculate the test set accuracy
acc = accuracy_score(y_test, rfe.predict(X_test_std))
print("{0:.1%} accuracy on test set.".format(acc))

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
{'pregnant': 3, 'glucose': 1, 'diastolic': 6, 'triceps': 4, 'insulin': 5, 'bmi': 1, 'family': 2, 'age': 1}
Index(['glucose', 'bmi', 'age'], dtype='object')
78.0% accuracy on test set.


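When we eliminate all but the 3 most relevant features ('glucose', 'bmi' and 'age') we get 78.0% accuracy on the test set in this run. Because the train-test split above is not seeded, the exact figure varies between runs.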

## Tree-based feature selection  
  
**Tree-based feature selection**  
  
Some models will perform feature selection by design to avoid overfitting.  
  
**Random forest classifier**  
  
`from sklearn.ensemble import RandomForestClassifier`  
`from sklearn.metrics import accuracy_score`  
`rf = RandomForestClassifier()`  
`rf.fit(X_train, y_train)`  
`print(accuracy_score(y_test, rf.predict(X_test)))`  
One of those is the random forest classifier. It's an ensemble model that passes different random subsets of features to a number of decision trees. To make a prediction it aggregates the predictions of the individual trees. The example forest shown here contains four decision trees. While simple in design, random forests often manage to be highly accurate and avoid overfitting even with the default Scikit-learn settings. If we train a random forest classifier on the 93 numeric features of the ANSUR dataset to predict gender, its test set accuracy is 99%. This means it managed to escape the curse of dimensionality and didn't overfit on the many features in the training set.  
  
![Alt text](../_images/rfc.png)  
  
In this illustration of what the trained model could look like, the first decision tree in the forest used the neck circumference feature on its first decision node and hand length later on to determine if a person was male or female. By averaging how often features are used to make decisions inside the different decision trees, and taking into account whether these are important decisions near the root of the tree or less important decisions in the smaller branches of the tree, the random forest algorithm manages to calculate feature importance values.  
  
**Feature importance values**  
  
`rf = RandomForestClassifier()`  
`rf.fit(X_train, y_train)`  
`print(rf.feature_importances_)`  
`print(sum(rf.feature_importances_))`  
These values can be extracted from a trained model using the `feature_importances_` attribute. Just like the coefficients produced by the logistic regressor, these feature importance values can be used to perform feature selection, since for unimportant features they will be close to zero. An advantage of these feature importance values over coefficients is that they are comparable between features by default, since they always sum up to one, which means we don't have to scale our input data first. We can use the feature importance values to create a True/False mask for features that meet a certain importance threshold. Then, we can apply that mask to our feature DataFrame to implement the actual feature selection.  
  
**RFE with random forests**  
  
`mask = rf.feature_importances_ > 0.1`  
`print(mask)`  
`X_reduced = X.loc[:, mask]`  
`print(X_reduced.columns)`  
Remember that dropping one weak feature can make other features relatively more or less important. If you want to play it safe and minimize the risk of dropping the wrong features, you should not drop all of the least important features at once but rather one by one. To do so we can once again wrap a Recursive Feature Eliminator, or `RFE()`, around our model. Here, we've reduced the number of features to six with no reduction in test set accuracy. However, training the model once for each feature we want to drop can result in too much computational overhead.  
  
`from sklearn.feature_selection import RFE`  
`rfe = RFE(estimator= RandomForestClassifier(), n_features_to_select= 6, step= 10, verbose= 1)`  
`rfe.fit(X_train, y_train)`  
`print(accuracy_score(y_test, rfe.predict(X_test)))`  
To speed up the process we can pass the `step=` parameter to `RFE()`. Here, we've set it to 10 so that on each iteration the 10 least important features are dropped. Once the final model is trained, we can use the feature eliminator's `.support_` attribute as a mask to print the remaining column names.  
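
As a rough sketch of that last step, the snippet below runs RFE with a `step` larger than one and then uses `.support_` as a column mask. The data is synthetic (an assumption, standing in for the ANSUR features), and the feature count, `step=5` and `n_estimators=50` are arbitrary choices made to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Assumption: synthetic stand-in with 20 placeholder feature names
X_arr, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(20)])

# Drop 5 features per iteration until 6 remain (20 -> 15 -> 10 -> 6)
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=6, step=5)
rfe.fit(X, y)

# .support_ is a boolean array with True for every feature that survived
mask = rfe.support_
print(X.columns[mask])
```

Note that the last iteration drops fewer than `step` features when needed, so the requested number of features is still reached exactly.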


### Building a random forest model  
  
You'll again work on the Pima Indians dataset to predict whether an individual has diabetes. This time using a random forest classifier. You'll fit the model on the training data after performing the train-test split and consult the feature importance values.  

In [11]:
from sklearn.ensemble import RandomForestClassifier

# Perform a 75% training and 25% test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the random forest model to the training data
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Calculate the accuracy
acc = accuracy_score(y_test, rf.predict(X_test))

# Print the importances per feature
pprint(dict(zip(X.columns, rf.feature_importances_.round(2))))

# Print accuracy
print("{0:.1%} accuracy on test set.".format(acc))

{'age': 0.13,
 'bmi': 0.12,
 'diastolic': 0.09,
 'family': 0.12,
 'glucose': 0.25,
 'insulin': 0.14,
 'pregnant': 0.07,
 'triceps': 0.09}
79.6% accuracy on test set.


The random forest model gets 79.6% accuracy on the test set and 'glucose' is the most important feature (0.25).

### Random forest for feature selection

Now let's use the fitted random forest model to select the most important features from our input dataset X.

In [12]:
# Creating a mask for feature importances above the threshold
mask = rf.feature_importances_ > 0.15

# Displaying the mask
print(mask)

# Applying the mask to the feature dataset, X
reduced_X = X.loc[:, mask]

# Display selected column names
print(reduced_X.columns)

[False  True False False False False False False]
Index(['glucose'], dtype='object')


Only the feature 'glucose' was sufficiently important.

### Recursive Feature Elimination with random forests
  
You'll wrap a Recursive Feature Eliminator around a random forest model to remove features step by step. This method is more conservative than selecting features with a single importance threshold, since dropping one feature can influence the relative importances of the others.

In [13]:
# Wrap the feature eliminator around the random forest model
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask using an attribute of rfe
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Index(['glucose', 'insulin'], dtype='object')


In [14]:
# Wrap the feature eliminator around the random forest model
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=2, step=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask using an attribute of rfe
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)

Fitting estimator with 8 features.
Fitting estimator with 6 features.
Fitting estimator with 4 features.
Index(['glucose', 'bmi'], dtype='object')


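Great! Compared to the quick and dirty single-threshold method from the previous exercise, which kept only 'glucose', RFE keeps a second feature. Note that the two runs above disagree on which one ('insulin' with the default step, 'bmi' with `step=2`); the forests inside RFE are not seeded here, so results can also vary between runs.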

## Regularized linear regression
  
So far, we focused on how to reduce dimensionality using classification algorithms. Let's see what we can do with regressions.
  
**Linear model concept**  
  
To refresh how linear regressions work, we'll build a model that derives the linear function between three input values and a target. However, we'll be creating the feature dataset and linear function ourselves, so that we can control the ground truth that our model tries to derive.
  
**Creating our own dataset**  
  
We create three features x1, x2, and x3 that all follow a simple normal distribution. We can then create our own target y with a function of our choice. Let's say that `y = 20 + 5*x1 + 2*x2 + 0*x3 + error`, where the last term is random noise. The 20 at the start is called the intercept; 5, 2 and 0 are the coefficients of our features, and they determine how big an effect each has on the target. The third feature has a coefficient of zero and will therefore have no effect on the target whatsoever. It would be best to remove it from the dataset, since it could confuse a model and make it overfit. Now that we've set the ground truth for this dataset, let's see if a model can derive it.
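
One way such a dataset could be generated is sketched below. The sample size, the random seed and the choice of standard-normal noise for the error term are assumptions; the lesson does not state them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 1000  # assumption: the lesson does not give a sample size

# Three independent, normally distributed features
X = pd.DataFrame({
    "x1": rng.normal(size=n_samples),
    "x2": rng.normal(size=n_samples),
    "x3": rng.normal(size=n_samples),
})

# Assumption: standard-normal noise as the error term
error = rng.normal(size=n_samples)

# Ground truth: intercept 20, coefficients 5, 2 and 0
y = 20 + 5 * X["x1"] + 2 * X["x2"] + 0 * X["x3"] + error

print(X.head())
print(y.head())
```

Because x3 is multiplied by zero, any relationship a model finds between x3 and y is pure noise.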
  
**Linear regression in Python**  
  
When you fit a `LinearRegression()` model with Scikit-Learn, the model object will have a `.coef_` attribute (for "coefficients") that contains a NumPy array with one element per input feature. These are the three values we just set to 5, 2, and 0, and the model was able to estimate them pretty accurately. The same goes for the intercept. To check how accurate the model's predictions are we can calculate the R-squared value on the test set. This tells us how much of the variance in the target feature our model can predict. Our model scores an impressive 97.6%. However, the third feature, which had no effect whatsoever, was estimated to have a small effect of -0.05. If there were more of these irrelevant features, the model could overfit. To solve this, we'll have to look at what the model actually does while training.  
```
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

#Actual coefficients = [5 2 0]
#Actual intercept = 20
print(lr.coef_)
print(lr.intercept_)

#Terminal
[4.95 1.83 -0.05]
19.8

#R-squared
print(lr.score(X_test, y_test))

#Terminal
0.976
```

  
**Loss function: Mean Squared Error**
  
The model will try to find optimal values for the intercept and coefficients by minimizing a loss function. This function is the mean of the squared differences between actual and predicted values, the gray squares in the plot. Minimizing this MSE makes the model as accurate as possible on the training data. However, we don't want our model to be super accurate on the training set if that means it no longer generalizes to new data.  
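
As a small worked example of the loss itself, the snippet below computes the MSE by hand for four made-up target values and predictions (the numbers are illustrative only).

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Squared differences: 0.25, 0.25, 0.0, 1.0
squared_errors = (y_true - y_pred) ** 2

# Mean squared error: 1.5 / 4
mse = squared_errors.mean()
print(mse)  # 0.375
```

Squaring makes the single large miss (9 vs. 10) account for two thirds of the total loss.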
  
![Alt text](../_images/mse.png)  
  
  
**Adding regularization**  
  
To avoid this we can introduce regularization. The model will then not only try to be as accurate as possible by minimizing the MSE, it will also try to keep the model simple by keeping the coefficients low. The strength of regularization can be tweaked with alpha: when it's too low the model might overfit, and when it's too high the model might become too simple and inaccurate. One linear model that includes this type of regularization is called Lasso, for least absolute shrinkage and selection operator.
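
Written out, the regularized loss is the MSE term plus a penalty on the absolute size of the coefficients. The form below follows the objective documented for scikit-learn's `Lasso`; the symbols $n$ (number of samples), $p$ (number of features), $w$ (coefficients) and $b$ (intercept) are notation introduced here for illustration.

$$
\min_{w,\,b}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - b - x_i^\top w\bigr)^2 \;+\; \alpha \sum_{j=1}^{p} \lvert w_j \rvert
$$

The first term is the mean squared error up to a factor of one half; the second grows with every non-zero coefficient, so a larger $\alpha$ pushes more coefficients to exactly zero. The intercept $b$ is not penalized.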
  
![Alt text](../_images/mse-reg.png)  
  
**Lasso regressor**  
  
Lasso: least absolute shrinkage and selection operator.  
  
When we fit it on our dataset with the default alpha of 1, we see that it indeed reduced the coefficient of the third feature to zero, ignoring it, but also that it shrank the other coefficients, resulting in a lower R squared.
To avoid this we can change the alpha parameter. When we set it to 0.05 the third feature is still ignored, but the other coefficients are reduced less and our R squared is up again.
  
```
from sklearn.linear_model import Lasso
la = Lasso()
la.fit(X_train, y_train)

#Actual coefficients = [5 2 0]
print(la.coef_)

#Terminal
[4.07 0.59 0. ]

#R-squared
print(la.score(X_test, y_test))

#Terminal
0.861


from sklearn.linear_model import Lasso
la = Lasso(alpha=0.05)
la.fit(X_train, y_train)

#Actual coefficients = [5 2 0]
print(la.coef_)

#Terminal
[4.91 1.76 0. ]

#R-squared
print(la.score(X_test, y_test))

#Terminal
0.974
```

### Creating a LASSO regressor

You'll be working on the numeric ANSUR body measurements dataset to predict a person's Body Mass Index (BMI) using the `Lasso()` regressor. BMI is a metric derived from body height and weight, but those two features have been removed from the dataset to give the model a challenge.
  
You'll standardize the data first using the `StandardScaler()` that has been instantiated for you as scaler to make sure all coefficients face a comparable regularizing force trying to bring them down.

In [15]:
# Loading the dataset
ansur_df = pd.read_csv('../_datasets/ANSUR_II_MALE.csv')

# Unused columns in the dataset/illustration
unused = ['Gender', 'Branch', 'Component', 'BMI_class', 'Height_class', 'weight_kg', 'stature_m']

# Drop the non-numeric columns and the weight/height columns BMI is derived from
ansur_df = ansur_df.drop(unused, axis=1)

# X/y split
X, y = ansur_df.drop('BMI', axis=1), ansur_df['BMI']

# Initialize the scaler
scaler = StandardScaler()

In [16]:
from sklearn.linear_model import Lasso

# Set the test size to 30% to get a 70-30% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training features and transform these in one go
X_train_std = scaler.fit_transform(X_train)

# Creating the Lasso model
la = Lasso()

# Fit the model to the standardized training data
la.fit(X_train_std, y_train)


You've fitted the Lasso model to the standardized training data. Now let's look at the results!

### Lasso model results
Now that you've trained the Lasso model, you'll score its predictive capacity (R^2) on the test set and count how many features are ignored because their coefficient is reduced to zero.
  
1. Transform the test set with the pre-fitted scaler.
2. Calculate the R<sup>2</sup> value on the scaled test data.
3. Create a list that has True values when coefficients equal 0.
4. Calculate the total number of features with a coefficient of 0.

In [17]:
# Transform the test set with the pre-fitted scaler
X_test_std = scaler.transform(X_test)

# Calculate the coefficient of determination (R squared) on X_test_std
r_squared = la.score(X_test_std, y_test)
print("The model can predict {0:.1%} of the variance in the test set (R^2).".format(r_squared))

# Create a list that has True values when coefficients equal 0
zero_coef = la.coef_ == 0

# Calculate how many features have a zero coefficient
n_ignored = sum(zero_coef)
print("The model has ignored {} out of {} features.".format(n_ignored, len(la.coef_)))

The model can predict 84.7% of the variance in the test set (R^2).
The model has ignored 82 out of 91 features.


We can predict almost 85% of the variance in the BMI value using just 9 out of 91 of the features. The R<sup>2</sup> could be higher though.

### Adjusting the regularization strength
The current Lasso model has an R<sup>2</sup> score of 84.7%. When a model applies overly powerful regularization it can suffer from high bias, hurting its predictive power.
  
Let's improve the balance between predictive power and model simplicity by tweaking the alpha parameter.
  
Find the highest value for alpha that gives an R<sup>2</sup> value above 98% from the options: 1, 0.5, 0.1, and 0.01.

In [18]:
# List of alphas
alpha_list = [1, 0.5, 0.1, 0.01]

# Placeholder values
max_r = 0
max_alpha = 0

for alpha in alpha_list:
    # Create the Lasso model for this alpha (alphas are tried from high to low)
    la = Lasso(alpha=alpha, random_state=0)

    # Fit the model and calculate performance stats
    la.fit(X_train_std, y_train)
    r_squared = la.score(X_test_std, y_test)
    n_ignored_features = sum(la.coef_ == 0)

    # Print performance stats
    print('The model can predict {0:.1%} of the variance in the test set'.format(r_squared))
    print('{} out of {} features were ignored'.format(n_ignored_features, len(la.coef_)))
    if r_squared > 0.98:
        max_r = r_squared
        max_alpha = alpha
        break

# Display
print('Max R-squared: {:.4}, alpha: {}'.format(max_r, max_alpha))

The model can predict 84.7% of the variance in the test set
82 out of 91 features were ignored
The model can predict 93.8% of the variance in the test set
79 out of 91 features were ignored
The model can predict 98.3% of the variance in the test set
64 out of 91 features were ignored
Max R-squared: 0.9828, alpha: 0.1


With this more appropriate regularization strength we can predict 98% of the variance in the BMI value while ignoring 64 of the 91 features, roughly 70%.

## Combining feature selectors
  
In the previous lesson we saw how Lasso models allow you to tweak the strength of regularization with the alpha parameter.

**Lasso regressor**
  
We manually set this alpha parameter to find a balance between removing as many features as possible and model accuracy. However, manually finding a good alpha value can be tedious. The good news is that there is a way to automate this.
  
**LassoCV regressor**
  
The `LassoCV()` class will use cross validation to try out different alpha settings and select the best one. When we fit this model to our training data it will get an `.alpha_` attribute with the optimal value.
  
To actually remove the features to which the Lasso regressor assigned a zero coefficient, we once again create a mask with True values for all non-zero coefficients. We can then apply it to our feature dataset X with the `.loc[]` indexer.
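
A minimal sketch of these two steps is shown below. The data is synthetic (an assumption, generated with `make_regression` in place of the ANSUR dataset), and `cv=5` is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic regression data, 10 features of which 3 are informative
X_arr, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(10)])
X_std = StandardScaler().fit_transform(X)

# LassoCV tries a range of alpha values with cross-validation
lcv = LassoCV(cv=5, random_state=0)
lcv.fit(X_std, y)
print("Optimal alpha:", lcv.alpha_)

# True for every feature whose coefficient was not shrunk to zero
mask = lcv.coef_ != 0
reduced_X = X.loc[:, mask]
print(reduced_X.columns)
```

The mask has one entry per column of X, in the same order, which is why it can be passed straight to `.loc[]`.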

**Taking a step back**
  
Two lessons ago we talked about random forests. They're an example of an ensemble model, designed on the idea that a lot of weak predictors can combine to form a strong one. When we use models to perform feature selection we could apply the same idea. Instead of trusting a single model to tell us which features are important, we could have multiple models each cast their vote on whether we should keep a feature or not. We could then combine the votes to make a decision.
  
**Feature selection with LassoCV**
  
To do so let's first train the models one by one. We'll be predicting BMI in the ANSUR dataset just like you did in the last exercises. If we use `LassoCV()` we'll get an R squared of 99%, and when we create a mask that tells us whether a feature has a coefficient different from 0, we can see that this is the case for 66 out of 91 features. We'll put this lcv_mask to the side for a moment and move on to the next model.

**Feature selection with random forest**
  
The second model we train is a random forest regressor model. We've wrapped a Recursive Feature Eliminator, or `RFE()`, around the model to have it select the same number of features as the `LassoCV()` regressor did. We can then use the `.support_` attribute of the fitted model to create rf_mask.
  
**Feature selection with gradient boosting**
  
Then, we do the same thing with a gradient boosting regressor. Like random forests, gradient boosting is an ensemble method that will calculate feature importance values. The fitted RFE object again has a `.support_` attribute that we can use to create gb_mask.

**Combining the feature selectors**
  
Finally, we can start counting the votes on whether to select a feature. We use NumPy's `sum()` function, pass it the three masks in a list, and set the axis argument to 0. We'll then get an array with the number of votes that each feature got. What we do with these votes depends on how conservative we want to be. If we want to make sure we don't lose any information, we could select all features with at least one vote. In this example we chose to require at least two models voting for a feature in order to keep it. All that is left now is to actually implement the dimensionality reduction. We do that with the `.loc[]` indexer of our feature DataFrame X.
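
The vote counting can be sketched with three hand-written masks. The masks and the five column names below are made up for illustration; in practice they would come from the fitted models described above.

```python
import numpy as np
import pandas as pd

# Toy feature DataFrame with placeholder column names
X = pd.DataFrame(np.arange(20).reshape(4, 5), columns=["a", "b", "c", "d", "e"])

# Made-up masks standing in for the three models' selections
lcv_mask = np.array([True, True, False, False, True])
rf_mask = np.array([True, False, False, True, True])
gb_mask = np.array([True, False, True, False, False])

# Sum along axis 0 to count the votes per feature
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)
print(votes)  # [3 1 1 1 2]

# Keep features that at least two models voted for
mask = votes >= 2
reduced_X = X.loc[:, mask]
print(list(reduced_X.columns))  # ['a', 'e']
```

Changing the threshold to `votes >= 1` would keep every feature in this toy example, while `votes >= 3` would keep only 'a'.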

### Creating a LassoCV regressor

You'll be predicting biceps circumference on a subsample of the male ANSUR dataset using the `LassoCV()` regressor that automatically tunes the regularization strength (alpha value) using Cross-Validation.

In [19]:
# Creating the X/y split
X = ansur_df[['acromialheight', 'axillaheight', 'bideltoidbreadth', 'buttockcircumference', 'buttockkneelength', 'buttockpopliteallength', 'cervicaleheight', 'chestcircumference', 'chestheight',
       'earprotrusion', 'footbreadthhorizontal', 'forearmcircumferenceflexed', 'handlength', 'headbreadth', 'heelbreadth', 'hipbreadth', 'iliocristaleheight', 'interscyeii',
       'lateralfemoralepicondyleheight', 'lateralmalleolusheight', 'neckcircumferencebase', 'radialestylionlength', 'shouldercircumference', 'shoulderelbowlength', 'sleeveoutseam',
       'thighcircumference', 'thighclearance', 'verticaltrunkcircumferenceusa', 'waistcircumference', 'waistdepth', 'wristheight', 'BMI']]
y = ansur_df['bicepscircumferenceflexed']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fitting and transforming the X_train, transforming the X_test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [20]:
from sklearn.linear_model import LassoCV


# Create and fit the LassoCV
lcv = LassoCV()
lcv.fit(X_train, y_train)
print('Optimal alpha = {0:.3f}'.format(lcv.alpha_))

# Calculate R-squared on the test-set
r_squared = lcv.score(X_test, y_test)
print('The model explains {0:.2%} of the test set variance'.format(r_squared))

# Create a mask for coefficients
lcv_mask = lcv.coef_ != 0
print('{} features out of {} selected'.format(sum(lcv_mask),len(lcv_mask)))

Optimal alpha = 0.035
The model explains 85.62% of the test set variance
24 features out of 32 selected


Great! We got a decent R squared and removed 8 features. We'll save the `lcv_mask` for later on.

### Ensemble models for extra votes

The `LassoCV()` model selected 24 out of 32 features. Not bad, but not a spectacular dimensionality reduction either. Let's use two more models to select the 10 features they consider most important using the Recursive Feature Eliminator (RFE).
  
1. Select 10 features with RFE on a GradientBoostingRegressor and drop 3 features on each step.
  
2. Calculate the R-squared on the test-set
  
3. Assign the support array of the fitted model to gb_mask.
  
4. Modify the first step to select 10 features with RFE on a `RandomForestRegressor()` and drop 3 features on each step.

In [21]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor


# Select 10 features with RFE on a GradientBoostingRegressor, drop 3 features on each step
rfe_gb = RFE(estimator=GradientBoostingRegressor(), n_features_to_select=10, step=3, verbose=1)
rfe_gb.fit(X_train, y_train)

# Calculate the R-squared on the test-set
r_squared = rfe_gb.score(X_test, y_test)
print('The model can explain {0:.2%} of the variance in the test-set'.format(r_squared))

# Assign the support array to the mask
gb_mask = rfe_gb.support_


Fitting estimator with 32 features.
Fitting estimator with 29 features.
Fitting estimator with 26 features.
Fitting estimator with 23 features.
Fitting estimator with 20 features.
Fitting estimator with 17 features.
Fitting estimator with 14 features.
Fitting estimator with 11 features.
The model can explain 83.26% of the variance in the test-set


In [22]:
# Creating the RandomForestRegressor() model with the Recursive Feature Eliminator
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor


# Creating the model and fitting it
rfe_rf = RFE(RandomForestRegressor(), n_features_to_select=10, step=3, verbose=1)
rfe_rf.fit(X_train, y_train)

# Calculate the R-squared on the test-set
r_squared = rfe_rf.score(X_test, y_test)
print('The model can explain {0:.2%} of the variance in the test-set'.format(r_squared))

# Assign the support array to the mask
rf_mask = rfe_rf.support_

Fitting estimator with 32 features.
Fitting estimator with 29 features.
Fitting estimator with 26 features.
Fitting estimator with 23 features.
Fitting estimator with 20 features.
Fitting estimator with 17 features.
Fitting estimator with 14 features.
Fitting estimator with 11 features.
The model can explain 82.30% of the variance in the test-set


Including the Lasso linear model from the previous exercise, we now have the votes from 3 models on which features are important.

### Combining 3 feature selectors

We'll combine the votes of the 3 models you built in the previous exercises into a meta mask that decides which features are important. We'll then use this mask to reduce dimensionality and see how a simple linear regressor performs on the reduced dataset.

In [23]:
# Sum the votes of the three models
votes = np.sum([lcv_mask, gb_mask, rf_mask], axis=0)
print(votes)

# Create a mask for features selected by all 3 models
meta_mask = votes == 3
print(meta_mask)

# Apply the dimensionality reduction on X
X_reduced = X.loc[:, meta_mask]
print(X_reduced.columns)

[0 1 3 3 0 1 1 3 1 1 1 3 1 1 1 0 0 1 0 1 1 1 3 3 0 3 2 2 1 2 0 3]
[False False  True  True False False False  True False False False  True
 False False False False False False False False False False  True  True
 False  True False False False False False  True]
Index(['bideltoidbreadth', 'buttockcircumference', 'chestcircumference',
       'forearmcircumferenceflexed', 'shouldercircumference',
       'shoulderelbowlength', 'thighcircumference', 'BMI'],
      dtype='object')


In [24]:
from sklearn.linear_model import LinearRegression


# Creating the LR
lr = LinearRegression()

# Plug the reduced data into a linear regression pipeline
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3, random_state=0)

# Fit/transform the training features, then fitting the whole model on the scaled features
lr.fit(scaler.fit_transform(X_train), y_train)

# Acquiring the r^2
r_squared = lr.score(scaler.transform(X_test), y_test)
print('The model can explain {0:.1%} of the variance in the test set using {1:} features.'.format(r_squared, len(lr.coef_)))


The model can explain 84.0% of the variance in the test set using 8 features.


Using the votes from 3 models, you were able to select just 8 features that allowed a simple linear model to explain 84% of the variance in the test set!