# Module 6: Exercise B

In this exercise, you will practice feature selection methods for regression.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

from sklearn import metrics

## Data Preprocessing

We will use housing data that contains the sale price of each house along with its attributes and condition.

Let's import the "housing.csv" file and check the first 5 rows.

In [2]:
housing_df = pd.read_csv('housing.csv')
housing_df.head()

| Unnamed: 0 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.5 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.5 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.0 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.5 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |


>__Task 1__
>
>Filter out the outliers in the __price__ column
>
>- Find the min, mean, max, and other quantiles of the column
>- Create a mask for values less than `mean-3*sd` or greater than `mean+3*sd`
>- Use the mask to filter out the outlier rows.

In [None]:
...
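
The three-standard-deviation rule in Task 1 can be sketched on synthetic data. This is an illustration under assumptions, not the exercise solution: `demo_df`, `outlier_mask`, and `filtered_df` are hypothetical names, and the prices are randomly generated rather than taken from `housing.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price column (hypothetical values, not the housing data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"price": rng.normal(500_000, 100_000, size=1_000)})
demo_df.loc[0, "price"] = 5_000_000  # inject one obvious outlier

# Summary statistics: count, mean, std, min, quartiles, max
print(demo_df["price"].describe())

mean = demo_df["price"].mean()
sd = demo_df["price"].std()

# True for rows that fall outside [mean - 3*sd, mean + 3*sd]
outlier_mask = (demo_df["price"] < mean - 3 * sd) | (demo_df["price"] > mean + 3 * sd)

# Keep only the rows that are not flagged as outliers
filtered_df = demo_df[~outlier_mask]
print(demo_df.shape, filtered_df.shape)
```

Note that the two conditions are combined with `|` (or), since no single value can be both below the lower bound and above the upper bound.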

### Train/Test Split

>__Task 2__
>
>- Assign the `price` column to `y`
>- Assign __bedrooms__, __bathrooms__, __sqft_living__, __sqft_lot__, __floors__, __waterfront__, __view__, __condition__, __sqft_above__, __sqft_basement__, __yr_built__, __yr_renovated__ to `X`
>- Split the data with an 80(train):20(test) ratio and set the random seed to 156.

In [None]:
...
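
As a rough sketch of the split itself, the following uses a small synthetic frame. The column subset and the names `demo_df`, `X_demo`, `y_demo`, `X_tr`, and `X_te` are assumptions made for illustration; only the 80:20 ratio and the seed of 156 come from the task.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with two predictors and a target (hypothetical values)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "sqft_living": rng.integers(500, 5000, size=100),
    "bedrooms": rng.integers(1, 6, size=100),
    "price": rng.normal(500_000, 100_000, size=100),
})

y_demo = demo_df["price"]                       # target
X_demo = demo_df[["sqft_living", "bedrooms"]]   # predictors

# test_size=0.2 gives an 80:20 split; random_state makes it reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=156
)
print(X_tr.shape, X_te.shape)
```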

## Filter Methods

In [7]:
X_train.columns

Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'sqft_above', 'sqft_basement',
       'yr_built', 'yr_renovated'],
      dtype='object')

### Variance Threshold

>__Task 3__
>
>Apply `VarianceThreshold` with a 40% threshold and a 90% threshold
>
>- Print the shape of the resulting data
>- Print the selected features
>- Print feature names that were dropped

In [None]:
# 40% threshold
...

In [None]:
# 90% threshold
...
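
A minimal sketch of how `VarianceThreshold` behaves, on a tiny hand-made frame. How the exercise's "40%" maps onto the `threshold` argument is not stated here; this sketch assumes the value is passed directly as a variance of 0.4. The frame `demo_X` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Three hypothetical features with very different variances
demo_X = pd.DataFrame({
    "constant": np.ones(6),                     # variance 0
    "binary": [0, 0, 0, 0, 0, 1],               # variance 5/36, about 0.14
    "spread": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # variance 35/12, about 2.92
})

# Assumption: the threshold is given as a raw variance value
selector = VarianceThreshold(threshold=0.4)
X_reduced = selector.fit_transform(demo_X)

# get_support() returns a boolean mask over the original columns
kept = demo_X.columns[selector.get_support()]
dropped = demo_X.columns[~selector.get_support()]

print(X_reduced.shape)
print("kept:", list(kept))
print("dropped:", list(dropped))
```

Because the variance depends on the scale of each feature, columns measured in large units (such as square feet) tend to pass the threshold more easily than binary flags.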

### Univariate Feature Selection (`SelectKBest`)

#### F-Test

>__Task 4__
>
>- Select the top 6 features using the F-test
>- Print the selected features
>- Print feature names as well as their scores and p-values in a DataFrame

In [None]:
...
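
The same steps can be sketched on generated regression data. This is one possible approach under assumptions: the data come from `make_regression`, `k=2` is chosen only to suit the five synthetic features, and `kbest`, `demo_X`, and `score_table` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem with 5 features, 2 of them informative
X_arr, y_demo = make_regression(
    n_samples=200, n_features=5, n_informative=2, noise=10.0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

# f_regression scores each feature with a univariate F-test against the target
kbest = SelectKBest(score_func=f_regression, k=2)
kbest.fit(demo_X, y_demo)

selected = demo_X.columns[kbest.get_support()]
print("selected:", list(selected))

# Scores and p-values are available for all features, not only the selected ones
score_table = pd.DataFrame({
    "feature": demo_X.columns,
    "f_score": kbest.scores_,
    "p_value": kbest.pvalues_,
}).sort_values("f_score", ascending=False)
print(score_table)
```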

Note that __F-scores are independent of our choice of `k`__.

After selection, the data set contains 6 features. Once the `fit` object has been created and fitted, we can apply it to the training or test set using `.transform`:

In [16]:
X_train_flt = fit.transform(X_train)
X_train_flt.shape

(3652, 6)

In [17]:
# Check first 5 rows of the filtered data
X_train_flt[:5,:] 

array([[3.00e+00, 2.00e+00, 2.12e+03, 2.00e+00, 0.00e+00, 2.12e+03],
       [2.00e+00, 1.75e+00, 1.37e+03, 1.00e+00, 0.00e+00, 1.37e+03],
       [3.00e+00, 1.75e+00, 2.07e+03, 1.00e+00, 0.00e+00, 1.42e+03],
       [4.00e+00, 3.50e+00, 4.14e+03, 2.00e+00, 0.00e+00, 3.16e+03],
       [4.00e+00, 1.00e+00, 1.41e+03, 1.50e+00, 0.00e+00, 1.41e+03]])

In [18]:
X_test_flt = fit.transform(X_test)
X_test_flt.shape

(914, 6)

>__Task 5__
>
>Select features whose p-value is below the 1% threshold
>
>- Find the number of features below the threshold
>- Set `k` to that number of features
>- Apply `SelectKBest` to `X_train` and print its shape
>- Print feature names of the resulting data

In [None]:
...
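
One way to derive `k` from the p-values is sketched below on synthetic data. The feature names `signal_a`, `signal_b`, and `noise` and the coefficients used to build the target are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Two features that drive the target and one that does not (hypothetical data)
rng = np.random.default_rng(0)
n = 300
demo_X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = 3.0 * demo_X["signal_a"] - 2.0 * demo_X["signal_b"] + rng.normal(scale=0.5, size=n)

# f_regression returns one F-score and one p-value per feature
f_scores, p_values = f_regression(demo_X, y_demo)

# Count the features whose p-value is below 1% and use that count as k
k = int((p_values < 0.01).sum())

kbest = SelectKBest(score_func=f_regression, k=k).fit(demo_X, y_demo)
X_sel = kbest.transform(demo_X)
selected = list(demo_X.columns[kbest.get_support()])

print(X_sel.shape)
print(selected)
```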

### Select Percentile

An alternative to selecting the best `k` features is to keep a top percentage of features ranked by score. If we have 20 features, the best 10% is only the top 2, which may not be meaningful, but the approach can be useful when we have hundreds of features. `SelectPercentile` is applied in the same way as `SelectKBest`.

>__Task 6 (optional)__
>
>Select the 30% most effective features using the F-test with `SelectPercentile`
>
>- Print the shape of the resulting data set
>- Print feature names of the resulting data

In [None]:
...
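
A short sketch of `SelectPercentile` on generated data, assuming ten synthetic features so that 30% corresponds to three of them. The names `demo_X` and `selector` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

# Synthetic regression problem with 10 features
X_arr, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Keep the features whose F-score lies in the top 30%
selector = SelectPercentile(score_func=f_regression, percentile=30)
X_sel = selector.fit_transform(demo_X, y_demo)

print(X_sel.shape)
print(list(demo_X.columns[selector.get_support()]))
```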

---

## Wrapper Methods

We will first apply a linear regression to the problem without feature selection.

>__Task 7__
>
>Apply a linear regression without feature selection
>
>- Print the Mean Squared Error (MSE)

In [None]:
...
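
A baseline of this kind might look as follows on synthetic data. The data set and the names `X_tr`, `X_te`, `lr`, and `mse` are assumptions; the exercise itself would use the housing split from Task 2.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=156
)

# Fit on the training set using every feature, then score on the test set
lr = LinearRegression().fit(X_tr, y_tr)
mse = mean_squared_error(y_te, lr.predict(X_te))
print(f"Test MSE with all features: {mse:.2f}")
```

This MSE serves as the reference point for the feature selection methods that follow.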

### Recursive Feature Elimination (RFE)

>__Task 8__
>
>Apply RFE to select the best 8 features
>
>- Print the MSE of linear regression with selected features
>- Print feature names of the resulting data

In [None]:
...
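
The RFE workflow can be sketched like this on synthetic data. The choice of 4 features out of 8 is an assumption that suits the generated data (the task asks for 8 of the housing features), and `demo_X` and `rfe` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with named columns so the selected features can be printed
X_arr, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=15.0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, y_demo, test_size=0.2, random_state=156
)

# RFE repeatedly fits the estimator and drops the weakest feature
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X_tr, y_tr)

selected = list(X_tr.columns[rfe.support_])
print("selected:", selected)

# rfe.predict reduces X to the selected features and uses the final fitted estimator
mse = mean_squared_error(y_te, rfe.predict(X_te))
print(f"Test MSE with RFE features: {mse:.2f}")
```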

### RFE with Cross-Validation (RFECV)

>__Task 9__
>
>Apply RFECV with a minimum of 5 features and 5 folds
>
>- Print the MSE of linear regression with selected features
>- Print feature names of the resulting data

In [None]:
...
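
A sketch of `RFECV` under similar assumptions: synthetic data, hypothetical names, and `neg_mean_squared_error` as the scoring metric, which the task does not specify.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_arr, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=15.0, random_state=0
)
demo_X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, y_demo, test_size=0.2, random_state=156
)

# Cross-validation picks the number of features, but never fewer than 5
rfecv = RFECV(
    estimator=LinearRegression(),
    min_features_to_select=5,
    cv=5,
    scoring="neg_mean_squared_error",  # assumption: scoring is not given in the task
)
rfecv.fit(X_tr, y_tr)

print("number of features chosen:", rfecv.n_features_)
print("selected:", list(X_tr.columns[rfecv.support_]))

mse = mean_squared_error(y_te, rfecv.predict(X_te))
print(f"Test MSE with RFECV features: {mse:.2f}")
```

Unlike plain RFE, the number of retained features is chosen by cross-validated score rather than fixed in advance.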

---

## Embedded Methods

>__Task 10__
>
>Print coefficients of the linear regression model without feature selection as a DataFrame

In [None]:
...
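
Collecting coefficients into a DataFrame could be done roughly as follows. The two features `a` and `b` and the true coefficients 2 and -3 are made up so that the fitted values are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from known coefficients (2 for a, -3 for b)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_demo = 2.0 * demo_X["a"] - 3.0 * demo_X["b"] + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(demo_X, y_demo)

# coef_ is ordered like the columns of the training data
coef_table = pd.DataFrame({"feature": demo_X.columns, "coefficient": lr.coef_})
print(coef_table)
```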

### Lasso (L1) Regularization

>__Task 11__
>
>Compare L1 regularization with `alpha=10` and `alpha=10000`
>
>- Fit a linear lasso model at both `alpha` values
>- Print the MSE of the model
>- Print model coefficients in a table
>
>Are there any features that should be removed according to the results?

In [None]:
# alpha=10
...

In [None]:
# alpha=10000
...
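
The effect of the two `alpha` values can be sketched on synthetic, price-scaled data. Everything about the data is an assumption: the columns `sqft`, `rooms`, and `noise` are hypothetical, and `noise` is built to have no influence on the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data on a price-like scale; "noise" does not affect the target
rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    "sqft": rng.normal(2000, 800, size=n),
    "rooms": rng.integers(1, 7, size=n).astype(float),
    "noise": rng.normal(size=n),
})
y_demo = 200 * demo_X["sqft"] + 20_000 * demo_X["rooms"] + rng.normal(scale=20_000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, y_demo, test_size=0.2, random_state=156
)

lasso_coefs = {}
for alpha in (10, 10_000):
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, lasso.predict(X_te))
    print(f"alpha={alpha}: test MSE = {mse:,.0f}")
    lasso_coefs[f"alpha={alpha}"] = lasso.coef_

# One row per feature, one column per alpha
coef_table = pd.DataFrame(lasso_coefs, index=demo_X.columns)
print(coef_table)
```

A coefficient that is exactly zero in the table marks a feature that the L1 penalty has effectively removed from the model.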

### Ridge (L2) Regularization

>__Task 12__
>
>Compare L2 regularization with `alpha=10` and `alpha=10000` (try to use a for loop this time)
>
>- Fit a linear ridge model at both `alpha` values
>- Print the MSE of the model
>- Print model coefficients
>
>What do you find here compared to lasso regularization?

In [None]:
...
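
A loop-based sketch for ridge, using the same kind of synthetic data as an assumption (hypothetical columns `sqft`, `rooms`, and `noise`, with `noise` unrelated to the target).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data on a price-like scale; "noise" does not affect the target
rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    "sqft": rng.normal(2000, 800, size=n),
    "rooms": rng.integers(1, 7, size=n).astype(float),
    "noise": rng.normal(size=n),
})
y_demo = 200 * demo_X["sqft"] + 20_000 * demo_X["rooms"] + rng.normal(scale=20_000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, y_demo, test_size=0.2, random_state=156
)

ridge_coefs = {}
for alpha in (10, 10_000):
    ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, ridge.predict(X_te))
    print(f"alpha={alpha}: test MSE = {mse:,.0f}")
    ridge_coefs[f"alpha={alpha}"] = ridge.coef_

coef_table = pd.DataFrame(ridge_coefs, index=demo_X.columns)
print(coef_table)
```

The L2 penalty shrinks coefficients toward zero but generally does not set them exactly to zero, so ridge on its own does not perform feature selection.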

### Comparison Between L1 and L2 Regularizations

>__Task 13__
>
>Compare the two regularizations across different `alpha` values
>
>- Try to use a for loop to fit both models with `alpha` values from `range(10, 10000, 1000)`
>- Append the `alpha` values and the MSE values of both models to lists
>- Plot both models with `alpha` on the x-axis and `MSE` on the y-axis
>
>Which model and what value of `alpha` do you recommend in this case? 

In [None]:
...
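
The loop over `alpha` values might be organized as below. The data are synthetic and the names `alphas`, `lasso_mse`, `ridge_mse`, and `results` are hypothetical; the sketch stops at a results table and leaves the plot itself out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data on a price-like scale (hypothetical columns)
rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    "sqft": rng.normal(2000, 800, size=n),
    "rooms": rng.integers(1, 7, size=n).astype(float),
    "noise": rng.normal(size=n),
})
y_demo = 200 * demo_X["sqft"] + 20_000 * demo_X["rooms"] + rng.normal(scale=20_000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, y_demo, test_size=0.2, random_state=156
)

alphas = list(range(10, 10_000, 1000))  # 10, 1010, ..., 9010
lasso_mse, ridge_mse = [], []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_tr, y_tr)
    ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
    lasso_mse.append(mean_squared_error(y_te, lasso.predict(X_te)))
    ridge_mse.append(mean_squared_error(y_te, ridge.predict(X_te)))

# One row per alpha, ready for plotting or inspection
results = pd.DataFrame({"alpha": alphas, "lasso_mse": lasso_mse, "ridge_mse": ridge_mse})
print(results)
```

From here, `results.plot(x="alpha", y=["lasso_mse", "ridge_mse"])` would draw both curves on one set of axes, with `alpha` on the x-axis and MSE on the y-axis.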