# Module 6: Exercise A

In this exercise, you will practice feature selection methods for classification.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

from sklearn import metrics

## Data Preprocessing

We will be using an income data set, which includes demographic data. The target variable indicates whether the income exceeds $50K per year, based on census data.

Let's import the "income_cleaned.csv" file and check the first 5 rows.

In [2]:
income = pd.read_csv('income_cleaned.csv')
income.head()

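| Unnamed: 0 | age | workclass | education-num | occupation | race | sex | capital-gain | capital-loss | hours-per-week | income_50k |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47 | Private | 9 | Prof-specialty | Other | Female | 0 | 0 | 40 | 0 |
| 1 | 27 | Other | 14 | Other | Asian-Pac-Islander | Male | 0 | 0 | 20 | 0 |
| 2 | 39 | Private | 10 | Sales | Asian-Pac-Islander | Female | 0 | 0 | 38 | 0 |
| 3 | 40 | Private | 9 | Exec-managerial | White | Female | 0 | 0 | 40 | 0 |
| 4 | 39 | Private | 9 | Exec-managerial | White | Female | 0 | 0 | 40 | 0 |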


>__Task 1__
>
> Convert the categorical columns __workclass__, __occupation__, __race__, and __sex__ to numerical variables
>
>- Add the encoded columns to the original DataFrame (Hint: set `drop_first` parameter)
>- Drop these four categorical columns

In [None]:
...

income.head()
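
As a minimal sketch of the encoding pattern in Task 1, the snippet below one-hot encodes a small made-up DataFrame (the `toy` frame and its values are hypothetical, not the income data). With `drop_first=True`, `pd.get_dummies` omits the first category level of each encoded column, which avoids a redundant, perfectly collinear dummy. Passing `columns=` also replaces the original categorical columns with their encoded versions.

```python
import pandas as pd

# Hypothetical toy data, not the income data set
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Female", "Male", "Female"],
    "race": ["White", "Black", "Other"],
})

# Encode the listed columns; drop_first=True drops the first
# (alphabetically sorted) level of each encoded column
toy_encoded = pd.get_dummies(toy, columns=["sex", "race"], drop_first=True)

print(toy_encoded.columns.tolist())
```

Here `sex_Female` and `race_Black` are the dropped reference levels, so a row with zeros in all `race_*` columns corresponds to `Black`.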

### Train/Test Split

>__Task 2__
>
>- Assign the __income_50k__ column to `y` and the remaining columns to `X`
>- Split the data with an 80(train):20(test) ratio and set `random_state` to 144

In [None]:
...

## Filter Methods

In [5]:
X_train.columns

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'workclass_Local-gov', 'workclass_Never-worked',
       'workclass_Other', 'workclass_Private', 'workclass_Self-emp-inc',
       'workclass_Self-emp-not-inc', 'workclass_State-gov',
       'workclass_Without-pay', 'occupation_Armed-Forces',
       'occupation_Craft-repair', 'occupation_Exec-managerial',
       'occupation_Farming-fishing', 'occupation_Handlers-cleaners',
       'occupation_Machine-op-inspct', 'occupation_Other',
       'occupation_Other-service', 'occupation_Priv-house-serv',
       'occupation_Prof-specialty', 'occupation_Protective-serv',
       'occupation_Sales', 'occupation_Tech-support',
       'occupation_Transport-moving', 'race_Asian-Pac-Islander', 'race_Black',
       'race_Other', 'race_White', 'sex_Male'],
      dtype='object')

### Variance Threshold

>__Task 3__
>
>Apply `VarianceThreshold` with a 10% threshold and with a 90% threshold
>
>- Print the shape of the resulting data
>- Print the selected features
>- Print feature names that were dropped

In [None]:
# 10% threshold
...

In [None]:
# 90% threshold
...
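
The general `VarianceThreshold` workflow can be sketched on made-up data as follows. The `toy` frame and the threshold of 0.2 are assumptions chosen for illustration only; the point is how `get_support()` yields a boolean mask that can be used to list kept and dropped column names.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "constant" has zero variance, "flag" is mostly 0
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],
    "flag": [0, 0, 0, 0, 0, 1],
    "value": [1.0, 5.0, 2.0, 8.0, 3.0, 9.0],
})

# Illustrative threshold; features with variance at or below it are removed
selector = VarianceThreshold(threshold=0.2)
reduced = selector.fit_transform(toy)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()

print(reduced.shape)
print("kept:", kept)
print("dropped:", dropped)
```

The `flag` column has variance (1/6)(5/6) ≈ 0.14, which is below 0.2, so it is removed together with the constant column.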

### Univariate Feature Selection (`SelectKBest`)

#### ANOVA F-Value

>__Task 4__
>
>- Select the top 5 features using ANOVA F-value
>- Print feature names as well as their scores and p-values in a DataFrame

In [None]:
...

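Note that the __F-scores are independent of our choice of `k`__: each feature is scored on its own, and `k` only determines how many of the top-scoring features are kept.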

The p-value of __age__ is almost 0, so we can reject the null hypothesis that this feature has no explanatory power on the target value.
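
A small sketch on synthetic data may help here. The arrays below are generated for illustration (they are not the income data): one feature depends on the target and one is pure noise. Fitting `SelectKBest` with two different values of `k` shows that the per-feature scores and p-values do not change, and that the informative feature gets a very small p-value.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(scale=0.5, size=n)  # depends on y
noise = rng.normal(size=n)                       # unrelated to y
X = np.column_stack([informative, noise])

fit_k1 = SelectKBest(score_func=f_classif, k=1).fit(X, y)
fit_k2 = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Scores and p-values are computed per feature, so k does not change them
print(np.allclose(fit_k1.scores_, fit_k2.scores_))
print("F-scores:", fit_k1.scores_)
print("p-values:", fit_k1.pvalues_)
```

Only the selection mask (`get_support()`) differs between the two fits, because it depends on how many features `k` asks for.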

Once the `fit` object has been created and trained, we can apply it to the training or test set using `.transform`, which keeps only the 5 selected features:

In [13]:
X_train_flt = fit.transform(X_train)
X_train_flt.shape

(23800, 5)

In [14]:
# Check first 5 rows of the filtered data
X_train_flt[:5,:] 

array([[37., 13., 50.,  0.,  1.],
       [29., 11., 40.,  0.,  0.],
       [62., 13., 50.,  1.,  1.],
       [33., 11., 76.,  0.,  1.],
       [28.,  9., 40.,  0.,  0.]])

In [15]:
X_test_flt = fit.transform(X_test)
X_test_flt.shape

(5950, 5)

>__Task 5__
>
>Select features whose p-value is below the 5% threshold
>
>- Find the number of features below the threshold
>- Set `k` to that number of features
>- Apply `SelectKBest` to `X_train` and print its shape
>- Print feature names of the resulting data

In [None]:
...

#### Chi Squared Test

>__Task 6__
>
>Select features whose p-value is below the 5% threshold using the chi-squared test
>
>- Find the number of features below the threshold
>- Set `k` to that number of features
>- Apply `SelectKBest` to `X_train` and print its shape
>- Print feature names of the resulting data

In [None]:
...

### Select Percentile

An alternative to selecting the best k features is to select a top percentile of features. If we have 20 features, the best 10% is only the top 2, which may not be meaningful, but the approach can be useful if we have hundreds of features. The method is applied in the same way as `SelectKBest`.
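
The 20-feature case mentioned above can be sketched as follows. The data is randomly generated for illustration only, so which two features are kept carries no meaning; the sketch only shows the API shape and the resulting number of columns.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data with 20 hypothetical features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Keep the top 10% of features ranked by ANOVA F-value
selector = SelectPercentile(score_func=f_classif, percentile=10)
X_top = selector.fit_transform(X, y)

print(X_top.shape)
```

As with `SelectKBest`, the fitted selector exposes `scores_`, `pvalues_`, and `get_support()`.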

>__Task 7 (optional)__
>
>Select the top 20% of features using ANOVA F-value with `SelectPercentile`
>
>- Print the shape of the resulting data set
>- Print feature names of the resulting data

In [None]:
...

---

## Wrapper Methods

We will first apply a logistic regression to the problem without feature selection.

>__Task 8__
>
>Apply a logistic regression without feature selection
>
>- Print the accuracy of the model

In [None]:
...

### Recursive Feature Elimination (RFE)

>__Task 9__
>
>Apply RFE to select the best 10 features
>
>- Print the accuracy of logistic regression with selected features
>- Print feature names of the resulting data

In [None]:
...
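
For reference, the general shape of an RFE call is sketched below on a synthetic data set from `make_classification`. The data, the choice of 3 features, and `max_iter=1000` are assumptions for illustration, not the exercise's settings. RFE repeatedly fits the estimator and discards the weakest feature until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier

# transform() keeps only the selected columns
X_selected = rfe.transform(X)
print(X_selected.shape)
```

When the input is a DataFrame, `rfe.support_` can be used to index its `columns` to recover the selected feature names.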

### RFE with Cross-Validation (RFECV)

>__Task 10__
>
>Apply RFECV with a minimum of 8 features and 5 folds
>
>- Print the accuracy of logistic regression with selected features
>- Print feature names of the resulting data

In [None]:
...
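
A minimal RFECV sketch on synthetic data follows. The data set and the parameter values are assumptions for illustration. Unlike RFE, RFECV chooses the number of features itself by cross-validated score, subject to the lower bound given by `min_features_to_select`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    min_features_to_select=3,  # never go below this many features
    cv=5,                      # 5-fold cross-validation
    scoring="accuracy",
)
rfecv.fit(X, y)

print("features chosen:", rfecv.n_features_)
print(rfecv.support_)
```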

---

## Embedded Methods

### Lasso (L1) Regularization

>__Task 11__
>
>Compare L1 regularization with `C=10` and `C=0.001`
>
>- Fit a logistic regression model with `liblinear` solver at both `C` values 
>- Print the accuracy of the model with L1
>- Print model coefficients in a table
>
>What do you observe about the coefficient values when `C=10` and `C=0.001`?

In [None]:
# C=10
...

In [None]:
# C=0.001
...
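
In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty. The sketch below, on synthetic data that is an assumption for illustration, fits an L1-penalized logistic regression at two `C` values and counts how many coefficients are exactly zero in each case. The dictionary name `zero_counts` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)

zero_counts = {}
for C in (10, 0.001):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    # Count coefficients that the L1 penalty has set exactly to zero
    zero_counts[C] = int(np.sum(model.coef_ == 0))
    print(f"C={C}: accuracy={model.score(X, y):.3f}, "
          f"zero coefficients={zero_counts[C]}")
```

Features whose coefficients are exactly zero have effectively been removed from the model, which is why L1 regularization is treated as an embedded feature selection method.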

### Ridge (L2) Regularization

>__Task 12__
>
>Compare L2 regularization with `C=10` and `C=0.001` (try to use a for loop this time)
>
>- Fit a logistic regression model with `liblinear` solver at both `C` values 
>- Print the accuracy of the model with L2
>- Print model coefficients
>
>What is your finding here compared to L1 regularization?

In [None]:
...

### Comparison Between L1 and L2 Regularizations

>__Task 13__
>
>Compare the two regularizations with different `C` values
>
>- Try to use a for loop to fit both models with `C` value range `(10,0.001,-1)`
>- Append the `C` values and accuracy values of both models
>- Plot both models with `C` on the x-axis and `Accuracy` on the y-axis
>- Set `plt.ylim(0.5,1.2)` as y-axis limits
>
>Which model and what value of `C` do you recommend in this case?

In [None]:
...
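
Assuming the range `(10,0.001,-1)` in Task 13 is meant to be passed to `np.arange` (an assumption; the task does not name the function), the sketch below shows which `C` values such a call produces. The variable name `C_values` is hypothetical.

```python
import numpy as np

# start=10, stop=0.001 (exclusive), step=-1
C_values = np.arange(10, 0.001, -1)

print(C_values)
print(len(C_values))
```

Because the step is -1, the sequence runs from 10 down to 1 in whole numbers and never reaches values close to 0.001. Python's built-in `range` would not accept the float stop value.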