# Feature Selection

Learn what feature selection is, its different methods and how it helps improve the working of ML models

## Overview

- Definition

- Examples

- Wrapper Methods

- Filter Methods

- Embedded Methods
 
- PCA

- Feature Selection Checklist



## Pre-requisites

- Basic of Pandas Dataframe
- Basic of Descriptive and Inferential Statistics
- Linear Regression and regularisation
- Basic of Data cleaning and preprocessing

## Learning Objectives

- Importance of feature selection
- Where to use feature selection?
- Implementation of Filter,wrapper,Embedded Method


## Chapter 1 : Introduction to Feature Selection

In this chapter you will learn about what is Feature selection ,what are the various techniques of feature selection and intuition building on dataset.

***

## 1.1 What is Feature Selection?
**Problem Statement**

Let’s continue solving the same problem we encountered in the regression modules.

We have with us the complete [`Iowa housing dataset`](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

Each row in the dataset describes the properties of a single house as well as the amount it was sold for.The data set contains 81 features and 1300 data points.

Let's apply simple Linear Regression model on it:

```python
#Data Loaded

#Data Split

model=LinearRegression()
model.fit(X_train,y_train)
score=model.score(X_test,y_test)
print(score)
```

**Output:**
```python
0.6665487890480106
```

We get `(r2_score)` as `0.66`.

***
Knowing this score to be low, we apply **one Feature Selection technique** and we get a `subset` of 30 features.
***

After subsetting the data to incorporate only those 30 features, we apply Linear Regression model 
```python
#Data Loaded

#Feature Selection method applied to dataset.

#Data Split

model=LinearRegression()
model.fit(X_train,y_train)
score=model.score(X_test,y_test)
print(score)
```

**Output:**
```python
0.768239
```

The `(r2_score)` is now `0.76` .

That is a increase by **10%** despite removal of 50 features.

Q: Doesn't that contradict the ML assumption that more features is equivalent to more information?

Ans: All the features in a dataset might not be useful. In a dataset some features may contribute no information at all, while some features may contribute similar information as the other features. 


So selecting the important features is more important than having a high no. of features and that's what feature selection methods help us do.


**Definition**

Feature selection is the ML process of finding the subset of features that are most relevant for a `better` predictive model.

When presented data with very high dimensionality(large no. of features), models usually choke because

1. Less training time.

2. Risk of overfitting.

Feature selection methods can help identify as well as remove redundant and irrelevant attributes from data that do not contribute to the predictive power of the model.

The objective of feature selection is three-fold: 

1. Improving the prediction performance of the predictors.

2. Providing faster and more cost-effective predictors.

3. Providing a better understanding of the underlying process that generated the data.



**Algorithms of Feature Selection**

Following are the categories, the different Feature Selection attributes are broadly divided into:

`Filter Method` - Filter Methods are used to find the relationship between features and the target variable. This results in computing the importance of features.


![](../images/Filter_method.jpg)

`Wrapper Method` - Wrapper Methods selects best subset of features by iteratively checking model performance.

![Wrapper Method](../images/Wrapper.jpg)

`Embedded Method` - Embedded methods are the methods implemented by algorithms that have a built in feature selection 'embedded' in them. It selects the best subset of features during the building of the model itself.


![Embedded Method](../images/embedded_method.jpg)

Let's dive into detail of each method in the following  chapters.


## Chapter 2 : Filter Method


## 2.1 Correlation Coefficent##

**Overview of filter method**


![](../images/Filter_method.jpg)

    • Filter methods are a set of powerful method of feature selection because selection happens independent of any machine learning algorithms.
    
    • Features are selected on the basis of scores of statistical tests between features & target variable.
 
 
Following are some of the statistical tests:

- Correlation coefficient

- Chi Squared test

- Anova(F-Score)
    

**Correlation Coefficent**

It makes intuitive sense to choose features that are highly correlated to the target variable.
The more correlated the features are to the target, easier it is for machine to predict it

Selecting features having correlation coefficents above a certain threshold will result in the model performing better.

Additionally one can also filter out redundant features by not selecting certain features that are already strongly correlated with other features. 

For eg: If x1 and x2 have strong correlation to the target variable but they are also strongly correlated to each other. In that case, a ML model including both x1 and x2 will give almost the  same results compared to the ML models where either only x1 or x2 was included in the dataset. 

There are different methods to calculate the correlation factor, however, Pearson’s correlation coefficient is most widely used.

The Pearson coefficient is a measure of the strength of association between two continuous variables.

It has already been covered in `Descriptive Statistics`. 

Still, let's refresh, 

Pearson's correlation coefficient is calculated by the formula:

$$ corr = \frac{cov \ (x,y)}{\sigma_x\sigma_y}$$

Where,
- $cov(x, y)$ - covariance between x and y
- $\sigma_x$ - standard deviation of x
- $\sigma_y$ - standard deviation of y


By taking x as our target variable and y as each of the features, we can easily find how much they are correlated to each other.

Setting up a threshold after that(for eg: correlation coefficent >0.5 ), we can identify strongly related features and drop the others.


Let's see how we can implement the same in python.

```python
#Sample Dataframe
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5]})
print("Dataframe:")
print(data)

#Finding the pearson's correlation among variables 
data_corr=data.corr()
print("\nCorrelation Matrix:")
print(data_corr)

#Subseting and only taking those features which are strongly correlated to 'D'
data_corr_d= data_corr[data_corr['D']>0.5]

print("\nFeatures closely related to D(Corr>0.5):")
print(data_corr_d.index.values)

```

Output:

```python
Dataframe:
   A  B   C  D
0  1  2   4  1
1  2  2   4  1
2  3  6   6  1
3  4  6  10  1
4  5  6  10  5

Correlation Matrix:
          A         B         C         D
A  1.000000  0.866025  0.938315  0.707107
B  0.866025  1.000000  0.842701  0.408248
C  0.938315  0.842701  1.000000  0.589768
D  0.707107  0.408248  0.589768  1.000000

Features closely related to D(Corr>0.5):
['A' 'C' 'D']

```

If you compare values of D with A and C you can clearly see correlation between them.

A and D have two same values.

C and D have similar pattern distribution for two values[1 maps to 4]

# Task : Feature selection using correlation

In this task, after loading the housing dataset, we will filter out features based on pearson correlation and then train the model with only the selected features.


## Instructions:

- Load the dataset from path using the `"read_csv()"` method from pandas and store it in a variable called `'ames'`

- Store all the features of `'ames_model_data'` in  a variable called `X`


- Store the target variable (`SalePrice`) of `'ames_model_data'` in a variable called `y`


- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`

**Finding Correlation**

- To simplify the process, create a new column in `'X_train'` called `Class` which stores the value of `'y_train'`.


- Find the correlation dataframe among all the features using `"X_train.corr()"` and store it in a variable called `"t_corr"`. Since we are only interested in correlation of features with target variable, only extract the `SalePrice` column of `'t_corr'` and save it back to `'t_corr'` 


- From `'t_corr'`, extract only those columns whose `absolulte correlation score` is greater than 0.5 using `index` function and store them in a variable called `'corr_columns'`


- Create a subset dataframe from `'X_train'` having only the columns stored in `'corr_columns'`. Save the new subsetted dataframe in `'X_train_new'`

- Create a subset dataframe from `'X_test'` having only the columns stored in `'corr_columns'`. Save the new subsetted dataframe in `'X_test'`


- Initialise a linear regression model with `LinearRegression()` and save it to a variable called `'model'`.


- Fit the model on the training data `'X_train_new'` and `'y_train'` using the `'fit()'` method.


- Find out the r^2 score between `X_test_new` and `'y_test'` using the `'score()'` method and save it in a variable called `'corr_score'`

**Things to ponder**

* Did the accuracy increase from the base score of `0.66`?
* As a side task, see how many features were actually selected on the basis of pearson correlation?


In [22]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Code starts here

#Loading of data
path='../data/Cleaned_Data.csv'
ames = pd.read_csv(path)


X=ames.drop(['SalePrice'],1)
y=ames['SalePrice'].copy()

#Splitting of data
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Creating temp. dataframe
X_train['Class']=y_train
t_corr=X_train.corr()
t_corr=t_corr['Class']

#Selecting columns having correlation higher than 0.5
corr_columns=t_corr[abs(t_corr)>0.5].index

#Dropping the column `Class`
corr_columns=corr_columns.drop('Class')

#Updating train and test dataframes
X_train_new=X_train[corr_columns]

X_test_new=X_test[corr_columns]

#Initialising the model
model=LinearRegression()

#Fitting the model
model.fit(X_train_new,y_train)

#Finding the score of the model
corr_score=model.score(X_test_new,y_test)
print(corr_score)

#Checking how many columns were selected
print(len(X_train_new.columns))

print(X_train.iloc[25][10]==4.0)
print(X_test.iloc[25][5]==1.0)

print(y_train.iloc[10]==214500)
print(y_test.iloc[5]==85000)

0.7227056628201058
13
True
True
214500
85000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


# 2.2 Chi Squared Test

Another way to identify relationship between features and target variable is the chi-square test statistic.

Similar to Pearson correlation, this is another preferred way to check for strongly correlated features.

Chi Squared Test has been covered in `Inferential Statistics`. 

Still let's refresh, 

The chi-squared test of independence is used to detemine whether the two variables are inter-related to each other.

Like any statistical hypothesis test, the Chi-square test has both a null hypothesis and an alternative hypothesis.

The hypothesis for a chi-square test of independence are as follows.

$H_0$: Variable A and Variable B are independent.

$H_1$: Variable A and Variable B are not independent.

In this case we will be calculating a chi-square test statistic as follows

$$\chi ^2 = \sum \frac{(observed-expected)^2}{expected}$$

In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category.

In the Chi-square context, the word “expected” is equivalent to what you’d expect if the null hypothesis is true. If your observed distribution is sufficiently different than the expected distribution (no relationship), you can reject the null hypothesis and infer that the variables are related.

Since the chi-square test measures dependence between variables, using this function “weeds out” the features that are the most likely to be independent of the target variable class and therefore irrelevant for prediction.


Q: In Inferential statistics, we learned chi-square tests is an independence test between two `categorical` variables. How are the numerical variables dealt with?

A: Binning

**Explanation:** Even for categorical variables, the null hypothesis is defined as 

``The `frequency distribution` of certain events observed in a sample is consistent with a particular theoretical distribution.``


As an example look at these sets of variables:

a = ['dog', 'cat', 'dog', 'cat']

b = ['wild', 'trained', 'trained', 'trained']

The categorical variables a and b can be compared by counting the co-occurences, and this is what happens with a chi-squared test:

|- |Dog|Cat|
|-----|-----|-----|
|wild|1|0|
|trained|1|2|

However, you can also binarise the values of 'a' and get the following variables:

a1 = [1, 0, 1, 0]

a2 = [0, 1, 0, 1]

b = ['wild', 'trained', 'trained', 'trained']

Counting the values is now equal to summing the values that correspond to the value of b.

|- |a1|a2|
|-----|-----|-----|
|wild|1|0|
|trained|1|2|

***


For feature selection purposes, it has a different python implementation than the one you learned in Inferential Statistics.


```python
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,4,2,10], 'D':[1,1,1,1,5], 'E': [2,2,2,1,5]})
print('Dataframe:')
print(data)

#Using chi square score to calculate best two features
test = SelectKBest(score_func=chi2, k=2)

#Transforming the data based on chi square(Target variable is E)
data_chi= test.fit_transform(data.iloc[:,:4], data.iloc[:,4]) 

#In the above function, 'data.iloc[:,:4]' are the features, 'data.iloc[:,4]' is the target variable         

print("\nTwo columns having the highest chi square score with respect to 'E'")
print(data_chi)

```

Output:

```python
Dataframe:
   A  B   C  D  E
0  1  2   4  1  2
1  2  2   4  1  2
2  3  6   4  1  2
3  4  6   2  1  1
4  5  6  10  5  5

Two columns having the highest chi square score with respect to 'E'
[[ 4  1]
 [ 4  1]
 [ 4  1]
 [ 2  1]
 [10  5]]
```
If you compare values of E with C and D you can clearly see why Chi-Square test score is highest for them.
Both C and D have the most similar `frequency distribution` when compared to E

# Task : Chi Square test


- Store all the features of `'ames'`(Loaded in the previous task) in  a variable called `X`


- Store the target variable (`SalePrice`) of `'ames'` in a variable called `y`


- Initialise a `"SelectKBest()"` with the parameters `score_func=chi2` & `k=60` and save it to a variable called `'test'`.


- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`



- Fit `'test'` on the training data `'X_train'` and `'y_train'` using the `'fit_transform()'` method. Store the result back into `'X_train'`


- Transform `'X_test'`using the `'transform()'` method of `test` .Store the result back into `'X_test'`


- Initialise a linear regression model with `LinearRegression()` and save it to a variable called `'model'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Find out the r^2 score between `X_test` and `'y_test'` using the `'score()'` method and store it in a variable called `'chi2_score'`


In [23]:
# import packages
from sklearn.feature_selection import chi2

from sklearn.feature_selection import SelectKBest

# Code starts here

X=ames.drop(['SalePrice'],1)
y=ames['SalePrice'].copy()


#Splitting dataframe into test and train
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Initialising the score function
test = SelectKBest(score_func=chi2, k=60)

#Fitting and transforming the model on X_train
X_train= test.fit_transform(X_train, y_train)

#Fitting and transforming the model on X_test
X_test= test.transform(X_test)

#Initialising the Linear Regression model
model=LinearRegression()

#Fitting the model
model.fit(X_train,y_train)

#Finding the model score
chi2_score=model.score(X_test,y_test)

print(chi2_score)
# Code ends here

0.7526152480701574


# 2.3 Anova

Analysis of variance (ANOVA) is another method to check for close relationship between two variables.

Just like any statistical test, it has two hypothesis

$H_{0}:$ The mean(average value) is the same for all groups. (NULL Hypothesis)

$H_{1}:$ The mean is not the same for all groups.


The proportion of variance(or variability) explained by the two variables is calculated using `F-score` to figure out if they are closely related

We won't be delving much into the actual Anova Tests and only focusing on its Feature Selection application.


**Intuition**
Consider a binary classification problem :

~~~python

x1  x2  x3  ... class
0.3 0.5 0.1 ... A
0.1 0.7 0.4 ... B
0.1 0.1 0.2 ... A
0.2 0.4 0.2 ... A
0.5 0.7 0.8 ... B
~~~

Assume the plot for target variable with respect to x1 looks like:

![](../images/x1_features.jpg)

Assume the plot for target variable with respect to x2 looks like:

![](../images/x2_feature.jpg)


Q: If one has to predict say Class A based on x1 and x2, which feature do you think you can predict better?
A: x1

Reason: There is no overlap of target variable with respect to x1 data points.

Though x1 and x2 are two extreme cases, it can be seen that the overlap is reduced when

1. The means of A and B are more separated(with respect to a feature x_i) 

2. The variances of A and B are small(with respect to a feature x_i)


The F-score captures these two properties, such that a high F-score reflects a small overlap.

**Definition**

F-score is defined as 

**variance between the samples(of a feature and target variable)**/ **the variance within the samples**

Where **Variance within** is found by:

$\frac{\sum_{i=1}^m \sum_{j=1}^n (x_{j}- \overline{x_{i}} )^2}{n-m}$ 


where **m** is the number of samples(groups) and **n** is the size of the m samples altogether(observations). 


And **Variance between** is found by:


$\frac{ \sum_{j=1}^n  n_{j} (\overline{x_{j}} - \overline{X})^2 }{m-1}$
($\overline{x_{j}}$ is the mean of sample j, and $\overline{X}$ is the grand mean)


With respect to Feature Selection, Anova uses F-tests to statistically assess among all features which features are more likely to correctly predict the target variable.


**Working**

Let's try to understand Feature Selection using ANOVA with an example:
***
A large car company recently extended its working days to include Sunday from the previous schedule of Mon- Sat. The company now wants to optimise the sales on Sunday.

Large amount of important sales happens during the days of Friday & Saturday and therfore the data science team wants to know between Friday and Saturday Sales, which day will help predict the sales on Sunday.

The company collects data for the number of car sales each day for two months. 

Following is the data:

Fridays: 288, 292, 310, 267, 243, 293, 255, 273

Saturdays: 276, 323, 298, 256, 277, 309, 312, 265, 311

Sundays: 243, 279, 301, 285, 274, 243, 228, 298, 255

Let's find which of the groups explains greater variance of the data:

***
#### Group 1(Sat, Sun):



$mean_{sat}$ = 291.8

$mean_{sun}$ = 267.3

mean of means = 279.55


**Step 1:** Calculating the Sum of Squares within groups:

[(276-291.8)^2+…+(311-291.8)^2+ (243-267.3)^2+…+(255-267.3)^2]=10002.4


**Step 2:** Calculating degrees of freedom:

Here, degrees of freedom= No. of observation - No. of groups= 18-2= 16


**Step 3:** Computing variance within: 

 Sum of squares within groups/degrees of freedom

=[(276-291.8)^2+…+(311-291.8)^2+ (243-267.3)^2+…+(255-267.3)^2/[18-2]=625.185


**Step 4:** Calculating the Sum of Squares between groups:

[9(291.8-279.55)^2 + 9(267.3-279.55)^2]= 2701.125

**Step 5:** Calculating degrees of freedom:

Here, degrees of freedom= No. of groups- 1= 2-1= 1

**Step 6:** Computing variance between:

[9(291.8-279.55)^2+9(267.3-279.55)^2]/[2-1] = 2701.125

**Step 7:** Computing the F-score: 

we will get,

F-score=$\frac {2701.125}{625.18}= 4.32$

***


#### Group 2(Fri, Sun):

$mean_{fri}$ = 277.6

$mean_{sun}$ = 267.3

mean of means = 272.45


**Step 1:** Calculating the Sum of Squares within groups:
[(288-277.6)^2+…+(273-277.6)^2 + (243-267.3)^2+…+(255-267.3)^2] = 8893.8

**Step 2:** Calculating degrees of freedom:

Here, degrees of freedom= No. of observation - No. of groups=17 -2= 15


**Step 3:** Computing variance within: 

 Sum of squares within groups/degrees of freedom

=[(288-277.6)^2+…+(273-277.6)^2 + (243-267.3)^2+…+(255-267.3)^2/[17-2]= 592.92


**Step 4:** Calculating the Sum of Squares between groups:

[8(277.6-272.45)^2 + 9(267.3-272.45)^2]= 450.88

**Step 5:** Calculating degrees of freedom:

Here, degrees of freedom= No. of groups- 1= 2-1= 1

**Step 6:** Computing variance between:

[8(277.6-278.9)^2 + 9(291.8-279.55)^2+ 9(267.3-279.55)2]/[2-1] = 450.88

**Step 7:** Computing the F-score: 

we will get,

F-score=$\frac {450.88}{592.2}= 0.76$


It can be seen from the F-Score that group 1 explains more variance than group 2.

Conclusion: Saturday Sales will be a better predictor than Friday Sales 


Similar to the above example, ANOVA table tells the proportion of variance explained by the features with respect to the target variable

Obviously the features that explain the largest proportion of the variance should be retained.


It has two python implementations in the form of `f_classif` and `f_regression`

Since in our Ames Dataset problem we are dealing with a regression problem, we will learn how to implement f_regression score from sklearn library.

It's implementation is very similar to the implementation of 'chi-square' score

~~~Python
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5], 'E':[2,2,2,1,5]})
print('Dataframe:')
print(data)

#Using ANOVA score to calculate best two features
test = SelectKBest(score_func=f_regression, k=2)

#Transforming the data based on ANOVA(Target variable is E)
data_anova= test.fit_transform(data.iloc[:,:4], data.iloc[:,4])


print("\nTwo columns having the highest ANOVA score with respect to 'E'")
print(data_anova)
   


~~~

Output:

```python
Dataframe:
   A  B   C  D  E
0  1  2   4  1  2
1  2  2   4  1  2
2  3  6   6  1  2
3  4  6  10  1  1
4  5  6  10  5  5

Two columns having the highest ANOVA score with respect to 'E'
[[1 1]
 [2 1]
 [3 1]
 [4 1]
 [5 5]]

```

If you compare values of E with A and D you can clearly see why Anova Score is highest for them.
A has the `highest variance between` when compared to other columns whereas D has the `lowest variance within` when compared to other columns.


# Task : Anova Score

- Store all the features of `'ames'`(Loaded in the first task) in  a variable called `X`


- Store the target variable (`SalePrice`) of `'ames'` in a variable called `y`


- Initialise a `"SelectKBest()"` with the parameters `score_func=f_regression` & `k=60` and save it to a variable called `'test'`.


- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`


- Fit `'test'` on the training data `'X_train'` and `'y_train'` using the `'fit_transform()'` method. Store the result back into `'X_train'`


- Transform `'X_test'`using the `'transform()'` method of `test` .Store the result back into `'X_test'`


- Initialise a linear regression model with `LinearRegression()` and save it to a variable called `'model'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Find out the r^2 score between `X_test` and `'y_test'` using the `'score()'` method and store it in a variable called `'f_regress_score'`

In [24]:
# import packages
import pandas as pd
from sklearn.feature_selection import f_regression

from sklearn.feature_selection import SelectKBest

# Code starts here
X=ames.drop(['Id','SalePrice'],1)
y=ames['SalePrice'].copy()

# Splitting the dataframe into train and test
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Initalising the score function
test = SelectKBest(score_func=f_regression, k=60)


#Fitting and transforming the model on X_train
X_train= test.fit_transform(X_train, y_train)

#Fitting and transforming the model on X_test
X_test= test.transform(X_test)

#Initialising the Linear Regression Model
model=LinearRegression()

#Fitting the model
model.fit(X_train,y_train)

#Finding the model score
f_regress_score=model.score(X_test,y_test)
print(f_regress_score)
# Code ends here

0.7566701199447429


# 3. Wrapper Methods

# 3.1 Wrapper methods

**Overview of wrapper method**

![Wrapper Method](../images/Wrapper.jpg)


     * Wrapper Methods generate models with different subsets of feature and gauge their model performances.

    * Wrapper methods can give high classification accuracy for particular classifiers, but generally they have high computational complexity

Following are the wrapper methods:

* Forward Selection

* Backward Selection

* RFE

Before discussing RFE, let's briefly look at Forward Selection and Backward Selection.


**Forward Selection**

Forward selection is an iterative technique in which we begin with having no features in the model. In every cycle, we continue including features which best enhances our model till an adding of another variable does not enhance the performance of the model.

Consider the following python code:
```python
#Creating a dataframe
data=pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],  'B':[4,4,6,10,10,4,4,6,10,10], 'C':[1,1,1,1,5,1,1,1,1,5], 'D':[2,2,2,1,5,2,2,2,1,5]})

print('Dataframe:')
print(data)

#Only selecting feature B
X=data[['B']]
y=data['D'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [B] is selected:", model.score(X_test,y_test))

```
**Output**

```python
Dataframe:
    A  B   C  D  E
0   1  2   4  1  2
1   2  2   4  1  2
2   3  6   6  1  2
3   4  6  10  1  1
4   5  6  10  5  5
5   6  2   4  1  2
6   7  2   4  1  2
7   8  6   6  1  2
8   9  6  10  1  1
9  10  6  10  5  5

Score when features [B] is selected: 0.06698063840920998
```
***

Let's add `C` to the model as well

***
```python


#Selecting B,C
X=data[['B','C']]
y=data['D'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [B,C] is selected:", model.score(X_test,y_test))
```
**Output**

```python
Score when features [B,C] is selected: 0.9904640813731723

```
***

Let's now add `A` to the model.

***
```python
#Selecting A,B,C
X=data[['A','B','C']]
y=data['D'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [A,B,C] is selected:", model.score(X_test,y_test))



```
**Output**

```python**
Score when features [A,B,C] is selected: 0.9812539002371345
```

The score increased when predicting using `B` & `C`  but adding of `A` decreased it. Therefore we stop the iteration process.



**Backward Selection**

In backward selection, we do the opposite of forward selection. We start with all the features and remove the least significant feature after each iteration which improves the performance of the model. We continue this until no improvement is observed on removal of features.

Consider the following python code:

```python
#Creating a dataframe
data=pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10], 'B':[4,4,6,10,10,4,4,6,10,10], 'C':[1,1,1,1,5,1,1,1,1,5], 'D':[2,2,2,1,5,2,2,2,1,5]})

print('Dataframe:')
print(data)

#Selecting A,B,C
X=data[['A','B','C']]
y=data['D'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [A,B,C] is selected:", model.score(X_test,y_test))


```
**Output**

```python**
Score when features [A,B,C] is selected: 0.9812539002371345
```
***
Let's remove `A` from the model 
***
```python


#Removing A from the dataframe
X=data[['B','C']]
y=data['D'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [B,C] is selected:", model.score(X_test,y_test))
```
**Output**

```python
Score when features [B,C] is selected: 0.9904640813731723

```
***
Removing `C` from the model
***
```python

#Only selecting feature B
X=data[['B']]
y=data['D'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [B] is selected:", model.score(X_test,y_test))

```
**Output**

```python
Score when features [B] is selected: 0.06698063840920998
```

In the above method, the score increased when we droppped `A`  but dropping of `C` resuled in drastic decrease of score. Therefore we stop the iteration process.


***

Though effective in certain cases, these two methods(Forward and Backward) can provide problems when dealing with especially large or highly-dimensional datasets. 


Though not popularly used, you can implement the same using [mlxtend](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/) library


**Recursive Feature Elimination**

RFE method involves repeatedly constructing a specific model and selecting the most impactful (or least impactful feature), setting that feature aside and then repeating the process with the rest of the features. This process is iterated until all features in the dataset are used up(or other stopping criteria are satisfied).

Features are then ranked according to when they were eliminated. In this way, RFE is able to work out the best subset of features that will enhance the performance of the model.


![](../images/rfe.jpg)

Let's see its python implementation.

For this we will use a subset of our original dataset containing only 200 datapoints.

```python
from sklearn.feature_selection import RFE

#Creating a sample of 200 data points
df_new=ames.sample(200,random_state=49)

X = df_new.drop(['SalePrice'],1)
y=df_new['SalePrice'].copy()

X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Selecting model as Linear Regressor
model =LinearRegression()
model.fit(X_train,y_train)
print("Score before RFE:", model.score(X_test,y_test))

#Passing the model along with no. of features you wish to select
selector = RFE(model,13)

#Fitting the data with the above conditions
X_train_rfe = selector.fit_transform(X_train, y_train)
X_test_rfe=selector.transform(X_test)
model.fit(X_train_rfe,y_train)

print("Score after RFE:",model.score(X_test_rfe,y_test))


```

Output:
```python

Score before RFE: 0.599523748467163

Score after RFE: 0.8038362233621985
    
```

Following is a image comparing the 13 features(and their correlation score with the target variable `SalePrice`) selected by the `RFE method` and `Pearson coefficent method` for the above sampled dataset: 

![](../images/pc_v_rfe.png)


Its interesting to note that RFE isn't selecting features which have the max correlation with target variable. It's also inherently identifying `correlation between features` as well.


RFE will always provide the most optimised features and almost always results in good performance, but it has major drawback. This whole process of going back and forth with the features is a very time consuming process. For datasets with large feature space(which is almost always in any ML problem), this method is not a good choice. 


# Task : RFE

Let's now try to implement RFE and see the score in the whole credit dataset.

In this task we will also try to identify the optimum no. of features to use

- Store all the features of `'ames'`(Loaded in the first task) in  a variable called `X`


- Store the target variable (`SalePrice`) of `'ames'` in a variable called `y`


- Three variables `'nof_list'`, `'high_score'` and `'nof'` are already defined for you.


- Run a `n` loop passing through each element of `'nof_list'`.


- Inside the loop, split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`


- Initialise a linear regression model with `LinearRegression()` and save it to a variable called `'model'`.


- Initialise a `RFE()` object with parameters `'model'` & `'n'` and store it to a variable called `'rfe'`.


- Fit `'rfe'` on the training data `'X_train'` and `'y_train'` using the `'fit_transform()'` method. Store the result into `'X_train_rfe'`


- Transform `'X_test'`using the `'transform()'` method of `rfe` .Store the result into `'X_test_rfe'`


- Fit the model on the training data `'X_train_rfe'` and `'y_train'` using the `'fit()'` method.


- Write a condition to store the highest R2 score of all `n`. Store the highest R2 score in `'high score'` and the `n` assosciated with it in `'nof'`

In [4]:
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier

#no of features list
nof_list=[20,30,40,50,60,70,80]

#Variable to store the highest score
high_score=0

#Variable to store the optimum features
nof=0

#Code begins here
X = ames.drop(['SalePrice'],1)
y=ames['SalePrice'].copy()

#Loop to select the optimum features
for n in nof_list:
    X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
    model = LinearRegression()
    rfe = RFE(model, n)
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    if model.score(X_test_rfe,y_test)>high_score:
        high_score=model.score(X_test_rfe,y_test)
        feat=n


#Printing the no. features with the highest score along with the highest score
print("No. of features=",feat, "gives the best score=",high_score)


  from numpy.core.umath_tests import inner1d


No. of features= 30 gives the best score= 0.7627843526086595


# 4. Embedded Methods



## 4.1 : Overview of embedded method


This approach is a hybrid of filter and wrapper methods and is implemented by algorithms that have their own built-in feature selection methods.
These methods are thus `embedded` in the algorithm either as its normal or extended functionality. 

***
**Embedded method is a type of filter method because:**

This procedure of feature selection can be understood as adding a penalty to reduce the degree of overfitting. This results in weights of features become very small(or 0),therefore filtering out unnecessary features.
***
***

**Embedded method is a type of wrapper method because:**

In this approach feature selection is performed during the process of training of the model itself and therefore is specific to the learning algorithms.
***


Embedded methods usually achieve high accuracy characteristic to wrapper methods and high efficiency characteristic to filter methods

Most commonly used embedded feature selection methods are regularization methods namely LASSO and RIDGE.

## 4.2 LASSO/RIDGE

Both types of regularization have already been covered extensively in the `Advanced Linear Regression` module. 

Just to refresh,

**Lasso (L1):** It stands for *Least Absolute Shrinkage and Selection Operator* and adds **absolute value of magnitude of coefficient** as penalty term to the loss function. Mathematically, the new regularized cost function becomes:  $$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2 + \lambda\sum_{i=1}^{n}|\theta_i|$$ 



**Ridge (L2):** Ridge regression adds **squared magnitude of coefficient** as penalty term to the loss function. The new cost function becomes: 

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2 + \lambda\sum_{i=1}^{n}{\theta_i}^2$$ 

In both the above techniques of feature selection, the penalty term helps regularise the feature coefficents and therefore automatically eliminate the unnecessary features and select the relevant ones while building the model itself.

Implementing it is the same as implementing a ML model because feature selection happens internally. 

Let's now apply the above regularization techniques in our ames dataset problem.


# Task : LASSO/RIDGE

In this task we will try to implement Linear Regression with Lasso and Ridge regularisation.

- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`


- Initialise a `"Lasso()"` with the parameter `random_state=0`and save it to a variable called `'lasso'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method of `'lasso'`.


- Find out the r^2 score between `X_test` and `'y_test'` using the `'score()'` method of `'lasso'` and store it in a variable called `'lasso_score'`


- Initialise a `"Ridge()"` with the parameter `random_state=0`and save it to a variable called `'ridge'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method of `'ridge'`.


- Find out the r^2 score between `X_test` and `'y_test'` using the `'score()'` method of `'ridge'` and store it in a variable called `'ridge_score'`


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

# Code starts here
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Initialsing \ lasso model
lasso = Lasso(random_state=0)

# Fitting the model with train
lasso.fit(X_train, y_train)

#Finding the score of model
lasso_score=lasso.score(X_test,y_test)
print(lasso_score)

# checking how many feature coefficients are zero
print(sum(lasso.coef_ == 0))

#Initialising the ridge model
ridge=Ridge(random_state=0)
# Fitting the model with train
ridge.fit(X_train, y_train)

#Finding the score of the model
ridge_score = lasso.score(X_test,y_test)
print(ridge_score)

# checking how many feature coefficients are zero
print(sum(ridge.coef_ == 0))
# Code ends here

0.6671828548745306
0
0.6671828548745306
0


In [6]:
int(165.2)

165

# 5. PCA

Another closely related topic with Feature Selection is Feature Extraction.

Feature Extraction refers to the method of reducing the known variables of the data into lesser number of principal variables holding the same amount of information.

Though it sounds very similar to Feature Selection(because both are just different techniques of dimensionality reduction), one thing to keep in mind is that in this process we are aiming to capture the variance of the data and nothing else.

One of the popular Dimensionality Reduction techniques is Principal Component Analysis(PCA)

**5.1 Intuition**

Consider the following plot:

![2d](../images/2d.png)

It has the following x and y axis configuration:


![2dpoints](../images/2dpoints.png)


It can be intuitively seen that in the above diagram direction pointed by 'red' line looks more important than the 'green'.(The line made in the direction pointed by 'red' is a better fit for the data points)

If we transform our x and y axes to red and green arrows, we get the following :

![pca](../images/pca.png)

If we refer to red as principal component 1 and green as principal component 2, we get the following configuration:

![pca](../images/pcapoints.png)

It can be clearly seen that the by dropping pc2, we don't lose much variation present in the data.

Similar to the above example, for data having multiple features ,we can identify which 'directions' are important & explain the most variance and in effect drop the 'directions' which are least important.


**5.2 Definition** 

Principal Component Analysis or PCA is the method of transforming original variables into a new set of variables such that they are orthogonal (and hence linearly independent) and then ranking according to the variance of data along them. These newly extracted variables are called Principal Components.

Principal components are extracted in such a way that the first principal component explains maximum variance in the dataset.

Second principal component(uncorrelated to the first) tries to explain the remaining variance(not explained by the first).

Third principal component explains the variance not explained by first and second and so on.


**5.3 Working of PCA**

To understand the working of PCA, it's important for you to have a knowledge of eigenvalues and eigenvectors. You can refresh the concept here(link to be pasted).

Follwing are its steps:

Step 1:
Calculate covariance of matrix X(Feature matrix).

Step 2:
Calculate eigen vectors and corresponding eigen values of X.

Step 3:
Sort the eigen vectors according to their eigen value (in decreasing order).

Step 4:
Use Eigenvectors corresponding to the (k)largest eigenvalues to reconstruct a large fraction of variance of the original data.

Let's try to understand the steps better using an example.

***
Consider a sample of our IOWA housing dataset:



It has 80 features and one target variable. 

Before the first step, let's scale the features.

***
**Q:** Why do we need to scale it?

**A:** The end result of PCA is to calculate a new projection of your data set. 

As in many other multivariate procedures one needs to ensure that no extra weight is given to the “larger” variables which otherwise will lead to biased outcomes.

For eg:
In the event we decide to change the scale of one of the features from metres to cm, the resulting variance of the feature would be 100 times more than it previously was. This would result in that feature playing a major role deciding the first PC since the PCA algorithm would be biased towards this new feature, in order to maximize its variance.

Scaling the data, all variables will have the same standard deviation, thus none of the variables will have any bias and PCA can calculate relevant axis.

***

After scaling, our first row of feature matrix will look something like:

```python
[[-0.38780276 -0.88641057 -0.06590224 -1.73003297  5.61934547  0.07088812
   0.02132492 -1.38156245  0.34076569  0.         -1.06331739 -0.17586311
   .................................................. 
   0.09853293  0.5514855  -0.14023183 -0.10655645 -1.24153445 -0.60510586
   0.33428219 -1.43182551]
  .......................................................... 
  ..........................................................

]]
```

Converting the above matrix into a covariance matrix, we get,

```python

Covariance matrix 
[[ 1.00502513  0.0461804  -0.08984672 ... -0.02916568  0.01314686
  -0.08965875]
 .......................................................
 
 [-0.08965875  0.05498568 -0.0809293  ... -0.07523145  0.31478126
   1.00502513]]
```
Finding the eigenvalues and eigenvectors of the above matrix and sorting them in descending order we get, we get,

```python

Eigenvalues 
[ 9.91860031e+00  4.92366237e+00  3.78092267e+00  3.56179086e+00
  3.39293240e+00  2.95161032e+00  2.74565913e+00  2.33694130e+00
  ..............................................................
  9.30686007e-02  9.62727861e-02  1.13943247e-01  1.30156014e-01
  1.34252254e-01 -6.77986291e-16 -7.78746845e-16  0.00000000e+00]

```
***
**Note:**
PCA implementations perform a Singular Vector Decomposition instead of eigendecomposition of the covariance to improve the computational efficiency. 

The $V^{T}$ in the SVD formula $X=USV^{T}$ is nothing but the above eigenvectors matrix(Why?) 

***
After sorting of the eigenpairs, we need to decide "how many principal components are we going to choose for our new feature subspace?". We do this using "explained variance," which can be calculated from the eigenvalues. 

The explained variance tells us how much information (variance) can be attributed to each of the principal components.


Following is the plot for the same:

![PCA_plot](../images/pca_graph.png)


In the plot above we can observe that most of the variance (More than 80% ) can be explained by the first 30 principal component alone. While the remaining components(50) have very less information to give and therefore can be dropped.

Our projection matrix will therefore look like 

```python

 
Projection Matrix M([80 x 30]) :

[[ 0.00448071 -0.01350817  0.12326238 ...  0.19900511  0.05865323
   0.05688455]
 [-0.01759388  0.07024879  0.19174906 ...  0.02688533 -0.06747264
  -0.01217759]
  ..............................................................
 [-0.01299174 -0.01049473  0.03013888 ... -0.03344006  0.00668279
  -0.12497142]
 [ 0.06469133  0.10257877 -0.03185315 ... -0.12730551  0.15506105
  -0.12573368]]


```

Transforming our original matrix X using projection Matrix M, we get:

```python
Y=X.M                       

=[[ 2.19814018  0.34126792 -1.60125569 ...  0.78203228 -0.21044354
  -0.62360616]
 [ 0.35239026 -2.05793391  1.12301188 ...  1.53015364  0.24685173
  -0.58152253]
 ...............................................................
 [-2.51897468 -3.00342066  1.61685692 ... -0.96805527 -0.28230475
   0.19131301]
 [-0.68028629 -2.64923105  1.59040403 ... -0.38116799 -1.86049732
  -0.12103677]]

Dimensions of Y: [200 x 30]
```

**5.4 Scikit implementation**

Though we have learned the step by step process, python has a simpler implementation in scikit-learn with all the steps internally taken care of.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

df = pd.read_csv('../data/Cleaned_Data.csv')
df_new=df.sample(200,random_state=0)

X = df.drop(['SalePrice'],1)

print("\nFirst two rows of X matrix:")
print(X.iloc[0:2,])

scaler=StandardScaler()

X_scaled = scaler.fit_transform(X)
print("\nFirst two rows of scaled X matrix:")
print(X_scaled[0:2,])


pca = PCA(n_components=30)
X_pca = pca.fit_transform(X_scaled)
print("\nFirst two rows of pca transformed X matrix:")
print(X_pca[0:2,])


```
Output

```python

First two rows of X matrix:
    
     MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
0             60         3         65.0     8450       1      1         3   
1             20         3         80.0     9600       1      1         3   

   LandContour  Utilities      ...        ScreenPorch  PoolArea  PoolQC  \
0            3          0      ...                  0         0       3   
1            3          0      ...                  0         0       3   

   Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  
0      4            1        0       2    2008         8              4  
1      4            1        0       5    2007         8              4  

[2 rows x 80 columns]

First two rows of scaled X matrix:
    
[[-1.73086488  0.07337496 -0.04553194  0.2128772  -0.20714171  0.06423821
   0.02469891  0.75073056  0.31466687 -0.02618016  0.60466978 -0.22571613
   ......................................................................
   0.06330477  0.45744736 -0.1859753  -0.08768781 -1.5991111   0.13877749
   0.31386709  0.2085023 ]
 [-1.7284922  -0.87256276 -0.04553194  0.64574726 -0.09188637  0.06423821
   0.02469891  0.75073056  0.31466687 -0.02618016 -0.62831608 -0.22571613
   .....................................................................
   0.06330477  0.45744736 -0.1859753  -0.08768781 -0.48911005 -0.61443862
   0.31386709  0.2085023 ]]

First two rows of pca transformed X matrix:
    
[[ 2.19814018  0.34126792 -1.60125569 ...  0.78203228 -0.21044354
  -0.62360616]
 [ 0.35239026 -2.05793391  1.12301188 ...  1.53015364  0.24685173
  -0.58152253]
 
 ...............................................................
 
 [-2.51897468 -3.00342066  1.61685692 ... -0.96805527 -0.28230475
   0.19131301]
 [-0.68028629 -2.64923105  1.59040403 ... -0.38116799 -1.86049732
  -0.12103677]]

```

Though PCA does help to perform dimensionality reduction but as as rule of thumb consider PCA only if :

1. You want to reduce the data dimensions, but aren’t able to identify variables to remove.
2. You are comfortable making the independent variables less interpretable.


Let's now try and implement PCA on our whole dataset

# PCA Task

In this task we will try to use PCA for dimensionality reduction on our data and then implement linear regression.


- Store all the features of `'ames_model_data'` in  a variable called `X`


- Store the target variable (`SalePrice`) of `'ames_model_data'` in a variable called `y`


- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`


- Initialise a `"StandardScaler()"` object and store it to a variable called `'scaler'`.


- Scale `'X_train'` using the `fit_transform()` method of `'scaler'` and store the scaled output in `'X_train_scaled'`


- Scale `'X_test'`as well using the `transform()` method of `'scaler'` and store the scaled output in `'X_test_scaled'`


- Initialise a `"PCA()"`object with the parameter `n_components=35` and   and store it to a variable called `'pca'`.


- Transform `'X_train_scaled'` using the `fit_transform()` method of `'pca'` and store the scaled output in `'X_train_pca'`


- Transform `'X_test_scaled'`as well using the `transform()` method of `'pca'` and store the scaled output in `'X_test_pca'`


- Initialise a linear regression model with `LinearRegression()` and save it to a variable called `'model'`.


- Fit the model on the training data `'X_train_pca'` and `'y_train'` using the `'fit()'` method of `'model'`.


- Find out the r^2 score between `X_test_pca` and `'y_test'` using the `'score()'` method of `'model'` and store it in a variable called `'pca_score'`


In [46]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

# Code starts here

X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
 
#Initialising standard scaler 
scaler=StandardScaler()

#Scaling the features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Initialising PCA
pca = PCA(n_components=35, random_state=0)

#Transforming the features
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca=pca.transform(X_test_scaled)

#Initialising the model
model=LinearRegression()

#Fitting the model
model.fit(X_train_pca,y_train)

#Scoring the model
pca_score=model.score(X_test_pca,y_test)
print(pca_score)


print(np.round(X_train_scaled[25][4],2))

print(np.round(X_test_scaled[25][2],2))

print(np.round(X_train_pca[25][4],2))
print(np.round(X_test_pca[25][2],2))

print(np.round(X_train_scaled[25][4],2)==0.07)

print(np.round(X_test_scaled[25][2],2)==(1.88))

print(np.round(X_train_pca[25][4],2)==2.92)
print(np.round(X_test_pca[25][2],2)==2.67)

0.7624759749537409
0.07
1.88
2.92
2.67
True
True
False
True


# 6. Conclusion

Let's look at the table storing the results of the different feature selection techniques(and PCA) that we applied on our AMES dataset:


|Feature Selection Method|No. of features selected|r2 Score|
|-----|-----|-----|
|Linear Regression|80|0.66|
|Pearson Correlation|13|0.72|
|Chi-Square|60|0.75|
|Anova|35|0.75|
|RFE|30|0.76|
|LASSO|Internal Selection|0.66|
|RIDGE|Internal Selection|0.66|
|PCA|35|0.76|

Does it mean that RFE or PCA are the best methods for feature selection?
**NO**

The reason there exists so many methods is that there is no single feature selection method that will give best results across all Machine Learning problems.

To make the most of the methods mentioned here, the user should know what kind will work best for him given the domain, data and the problem he is trying to solve.

If at some point, you are too confused to decide what to use, following is a checklist you can refer to, to help you ease up the feature selection process:

**Feature Selection Check list:**
***

* `Do you have domain knowledge?` If yes, construct a better set of “ad hoc” features that will provide more information gain.


* `Are your features all of the same scale?` If no, consider normalizing them.


* `Do you suspect interdependence of features?` If yes, expand your feature set by constructing products of features, as much as your computer resources allow 


* `Do you need to reduce the count of input variables (e.g. for cost, speed or data understanding reasons)?` If no, then construct disjunctive features or weighted sums of features 


*  `Do you need to assess features individually (e.g. to understand their influence on the systemor because their number is so large that you need to do a first filtering)? `If yes, use a variable ranking method ; else, do it anyway to get baseline results.


* `Do you need a predictor?` If no, stop.


* `Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisyoutputs or wrong class labels)? `If yes, detect the outlier examples using the top ranking variables as representation; check and/or discard them.



* `Do you know what to try first?` If no, use a linear predictor. Construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.


* `Do you have new ideas, time, computational resources, and enough examples?` If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods . Use linear and non-linear predictors. Select the best approach with model selection .


* `Do you want a stable solution (to improve performance and/or understanding)?` If yes, subsample your data and redo your analysis for several “subsets” 

***

The above checklist is made by Isabelle Guyon and Andre Elisseeff the authors of [“An Introduction to Variable and Feature Selection” (PDF)](http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf). You can check out the pdf link to learn more about it

**END OF NOTEBOOK**

In [8]:
#Sample Dataframe
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5]})
print("Dataframe:")
print(data)

#Finding the pearson's correlation among variables 
data_corr=data.corr()
print("\nCorrelation Matrix:")
print(data_corr)

#Subseting and only taking those features which are strongly correlated to 'D'
data_corr_d= data_corr[data_corr['D']>0.5]

print("\nFeatures closely related to D(>0.5):")
print(data_corr_d.index.values)

Dataframe:
   A  B   C  D
0  1  2   4  1
1  2  2   4  1
2  3  6   6  1
3  4  6  10  1
4  5  6  10  5

Correlation Matrix:
          A         B         C         D
A  1.000000  0.866025  0.938315  0.707107
B  0.866025  1.000000  0.842701  0.408248
C  0.938315  0.842701  1.000000  0.589768
D  0.707107  0.408248  0.589768  1.000000

Features closely related to D(>0.5):
['A' 'C' 'D']


In [9]:
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5], 'E':[2,3,4,5,1]})
print('Dataframe:')
print(data)

#Using chi square score to calculate best two features
test = SelectKBest(score_func=chi2, k=2)

#Transforming the data based on chi square(Target variable is E)
data_chi= test.fit_transform(data.iloc[:,:4], data.iloc[:,4])


print("\nTwo columns having the highest chi square score with respect to 'E'")
print(data_chi)
   

Dataframe:
   A  B   C  D  E
0  1  2   4  1  2
1  2  2   4  1  3
2  3  6   6  1  4
3  4  6  10  1  5
4  5  6  10  5  1

Two columns having the highest chi square score with respect to 'E'
[[ 4  1]
 [ 4  1]
 [ 6  1]
 [10  1]
 [10  5]]


In [10]:
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5], 'E':[2,3,4,5,1]})
print('Dataframe:')
print(data)

#Using ANOVA score to calculate best two features
test = SelectKBest(score_func=f_regression, k=2)

#Transforming the data based on ANOVA score(Target variable is E)
data_anova= test.fit_transform(data.iloc[:,:4], data.iloc[:,4])


print("\nTwo columns having the highest ANOVA score with respect to 'E'")
print(data_anova)
   

Dataframe:
   A  B   C  D  E
0  1  2   4  1  2
1  2  2   4  1  3
2  3  6   6  1  4
3  4  6  10  1  5
4  5  6  10  5  1

Two columns having the highest ANOVA score with respect to 'E'
[[2 1]
 [2 1]
 [6 1]
 [6 1]
 [6 5]]


In [11]:
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5], 'E':[2,3,4,5,1]})
print('Dataframe:')
print(data)

#Using mutual info score to calculate best two features
test = SelectKBest(score_func=mutual_info_regression, k=2)

#Transforming the data based on mutual info score(Target variable is E)
data_anova= test.fit_transform(data.iloc[:,:4], data.iloc[:,4])


print("\nTwo columns having the highest mutual information score with respect to 'E'")
print(data_anova)
   

Dataframe:
   A  B   C  D  E
0  1  2   4  1  2
1  2  2   4  1  3
2  3  6   6  1  4
3  4  6  10  1  5
4  5  6  10  5  1


NameError: name 'mutual_info_regression' is not defined

In [None]:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE

# #Creating a dummy dataset
# X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# print("X has following no. of features:", X.shape[1])


df = pd.read_csv('../data/Cleaned_Data.csv')
df_new=df.sample(200,random_state=0)

X = df.drop(['SalePrice'],1)
y=df['SalePrice'].copy()

X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Selecting model as Linear Regressor
model =LinearRegression()


model.fit(X_train,y_train)
print("Score before RFE:", model.score(X_test,y_test))


#Passing the model along with no. of features=30
selector = RFE(model,30)

#Fitting the data with the above conditions
X_train_rfe = selector.fit_transform(X_train, y_train)
X_test_rfe=selector.transform(X_test)

model.fit(X_train_rfe,y_train)
print("Score after RFE:",model.score(X_test_rfe,y_test))


Example:
A young bike seller wants to find the optimum bikes to keep at his store each day. He knows that he needs to have more bikes on Fridays than on other days, but he is trying to find if the need for bikes is same also for Mondays. He collects data for the number of transactions each day for two months. Following is the data:

Mondays: 276, 323, 298, 256, 277, 309, 312, 265, 311

Fridays: 243, 279, 301, 285, 274, 243, 228, 298, 255

Our null hypothesis is that across all days the mean value is the same:

$H_{o}: mean_{m}=mean_{f} and he decides to use α – .05. 

He finds:

$mean_{m}$ = 291.8

$mean_{f}$ = 267.3

and the grand mean = 279.55

Computing variance within:

[(276-291.8)2+…+(311-291.8)2+ (243-267.3)2+…+(255-267.3)2/[18-2]=6251.85

Computing variance between:

[9(291.8-279.55)2+9(267.3-279.55)2]/[2-1] = 2701.125

Computing the F-score, we will get,

F-score=$\frac {6251.8}{2701.125}= 2.31$

Critical F-Value based on the degrees of freedom(1,30) would be 4.17(Based on DF Table)

Since the F-score is smaller than the critical F-value(Alternatively he could also have compared the p-value with alpha), he concludes that the mean number of bicycles required is equal on both days.


Similarly, using this score, the predictor(feature) from the model is removed if it is not statistically significant and we end up having features that are statistically similar to target variable.


Example:
A young bike seller wants to find the optimum bikes to keep at his store each day. He knows that he needs to have more bikes on Saturdays than on other days, but he is trying to find if the need for bikes is constant across the rest of the week. He collects data for the number of transactions each day for two months. Following is the data:

Mondays: 276, 323, 298, 256, 277, 309, 312, 265, 311

Tuesdays: 243, 279, 301, 285, 274, 243, 228, 298, 255

Wednesdays: 288, 292, 310, 267, 243, 293, 255, 273

Thursdays: 254, 279, 241, 227, 278, 276, 256, 262


Our null hypothesis is that across all days the mean value is the same:

$H_{o}: mean_{m}=mean_{tu}=mean_{w}=mean_{th}$ 

and decides to use α – .05. 

He finds:

$mean_{m}$ = 291.8

$mean_{tu}$ = 267.3

$mean_{w}$ = 277.6

$mean_{th}$ = 259.1

and the grand mean = 274.3

Computing variance within:

[(276-291.8)2+(323-291.8)2+…+(243-267.6)2+…+(288-277.6)2+…+(254-259.1)2]/[34-4]=15887.6/30=529.6

Computing variance between:

[9(291.8-274.3)2+9(267.3-274.3)2+8(277.6-274.3)2+8(259.1-274.3)2]/[4-1]

= 5151.8/3 = 1717.3

Computing the F-score, we will get,

F-score=$\frac {1717.3}{529.6}= 3.24$

Critical F-Value based on the degrees of freedom(3,30) would be 2.92

Since the F-score is larger than the critical F-value(Alternatively he could also have compared the p-value with alpha), he concludes that the mean number of bicycles required is not equal on different days of the week, or at least there is one day that is different from others.


In [None]:
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls



In [None]:
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.head()

In [None]:
from sklearn.preprocessing import StandardScaler
X = df.ix[:,0:4].values
y = df.ix[:,4].values

X_std = StandardScaler().fit_transform(X)
print(X_std[0:5,])
import numpy as np
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
# print('Covariance matrix \n%s' %cov_mat)

cov_mat = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
# print('\nEigenvalues \n%s' %eig_vals)


u,s,v = np.linalg.svd(X_std.T)
# print(u)

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low

eig_pairs.sort()
print("sadas",eig_pairs)
eig_pairs.reverse()

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
    
    
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1), 
                      eig_pairs[1][1].reshape(4,1)))
print('Projection Matrix M :\n', matrix_w)


Y = X_std.dot(matrix_w)
print('\nY:\n',Y[0:5,])

In [None]:
import matplotlib.pyplot as plt
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

print(cum_var_exp)
print(var_exp)

pca_names=['PCA1','PCA2','PCA3','PCA4']
plt.bar(pca_names,var_exp)
plt.title("Explained Variance by principal components")
# plt.plot(pca_names,cum_var_exp)

In [None]:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) 

X = df.iloc[:,0:4].values
print("\nFirst five rows of X matrix:")
print(X[0:5,])

X_scaled = StandardScaler().fit_transform(X)
print("\nFirst five rows of scaled X matrix:")
print(X_scaled[0:5,])


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("\nFirst five rows of PCA transformed X matrix:")
print(X_pca[0:5,])

# u,s,v = np.linalg.svd(X_scaled)
# print(v)

In [None]:
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls
import pandas as pd

df = pd.read_csv('../data/Cleaned_Data.csv')
df_new=df.sample(200,random_state=0)


In [None]:

from sklearn.preprocessing import StandardScaler
X = df_new.drop('SalePrice',1)
y = df_new['SalePrice']

X_std = StandardScaler().fit_transform(X)
# print(X_std[0:2,])
import numpy as np
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
# print('Covariance matrix \n%s' %cov_mat)

cov_mat = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# print('Eigenvectors \n%s' %eig_vecs)
# print('\nEigenvalues \n%s' %eig_vals)


u,s,v = np.linalg.svd(X_std.T)
# print(u)

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort()

eig_pairs.reverse()

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
# print('Eigenvalues in descending order:')
# for i in eig_pairs:
#     print(i[0])

matrix_w=(eig_pairs[0][1].reshape(80,1))
for i in range(1,30):   
    matrix_w=np.concatenate((matrix_w,eig_pairs[i][1].reshape(80,1)),1)
    

# matrix_w = np.hstack((eig_pairs[0][1].reshape(80,1), 
#                       eig_pairs[1][1].reshape(80,1)))

print('Projection Matrix M :\n', matrix_w)
print(matrix_w.shape)


Y = X_std.dot(matrix_w)
print(Y.shape)
# print(Y)
print('\nTransformed Matrix Y :\n',Y[0:5,])



In [None]:
import matplotlib.pyplot as plt
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
plt.figure(figsize=(18,10))
# print(cum_var_exp)
# print(var_exp)
pca_names=[]
for i in range(80):
    pca_names.append(str(i))
# plt.bar(pca_names,var_exp)
plt.title("Explained Variance by principal components")
plt.plot(pca_names,cum_var_exp)
plt.plot([30]*83, range(83),color='green')
plt.plot(range(0,31), [82]*31,color='green')
plt.xticks([0,25,50,75])

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

df = pd.read_csv('../data/Cleaned_Data.csv')
df_new=df.sample(200,random_state=0)

X = df.drop(['SalePrice'],1)

print("\nFirst two rows of X matrix:")
print(X.iloc[0:2,])

scaler=StandardScaler()

X_scaled = scaler.fit_transform(X)
print("\nFirst two rows of scaled X matrix:")
print(X_scaled[0:2,])


pca = PCA(n_components=30)
X_pca = pca.fit_transform(X_scaled)
print("\nFirst two rows of pca transformed X matrix:")
print(X_pca[0:2,])




How do we know which features help and which don't?

For that we need to understand the information contributed by the features:


1) Entropy of a feature

Entropy is characterized as the proportion of the irregularity in the data being handled. The higher the entropy, the harder it is to reach any inferences from that data. 

For eg: Flipping a coin is an activity that gives data that is irregular. 

In Data Science, entropy of a feature say x1 is figured by excluding x1 and afterward ascertaining the entropy of the remaining features. 

Lower the entropy  (excluding x1) the higher will be the information content of x1. 

At the end, features are selected based on some threshold value. Thus entropy of the features can give good information.


2) Mutual Information between features

The mutual information (MI) of two random variables is defined as the measure of the mutual dependence between the two variables. It quantifies the amount of information obtained about one random variable through observing the other random variable. 

The concept of mutual information is intricately linked to that of entropy of a random variable. 

The features which have high mutual information value with respect to the target variable are considered optimal since they can influence the predictive model towards making the right prediction and hence increase the performance of the model.
 

3443.8
4552.96
5450.01


In [None]:
# import numpy as np
# import pandas as pd
# import seaborn as sns
# %matplotlib inline
# from matplotlib import pyplot as plt
# from sklearn.preprocessing import LabelEncoder
# # Code starts here


# ames = pd.read_csv('../data/train.csv')
# print(ames.head())
# obj_df = ames.select_dtypes(include=['object']).copy()

# for col in obj_df.columns:
#     ames[col].fillna('NA',inplace=True)
#     le=LabelEncoder()
#     ames[col]=le.fit_transform(ames[col]) 

# num_df = ames.select_dtypes(include=['number']).copy()

# for col in num_df.columns:
#     ames[col].fillna(0,inplace=True)
    
    
# ames.to_csv("../images/Cleaned_Data.csv", index=False)    

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Code starts here


ames = pd.read_csv('../data/Cleaned_Data.csv')
# ames_train=ames.sample(600, random_state=0)
# ames_test=ames[~ames['Id'].isin(ames_train['Id'].values)]



X=ames.drop(['Id','SalePrice'],1)
y=ames['SalePrice'].copy()

    
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 


model=LinearRegression()
model.fit(X_train,y_train)
score=model.score(X_test,y_test)
print(score)


# code ends here


In [None]:
data=pd.DataFrame({'A':[1,2,3,4,5], 'B':[2,2,6,6,6], 'C':[4,4,6,10,10], 'D':[1,1,1,1,5], 'E':[2,2,2,1,5]})
print('Dataframe:')
print(data)

#Using ANOVA score to calculate best two features
test = SelectKBest(score_func=f_regression, k=2)

#Transforming the data based on ANOVA(Target variable is E)
data_anova= test.fit_transform(data.iloc[:,:4], data.iloc[:,4])


print("\nTwo columns having the highest ANOVA score with respect to 'E'")
print(data_anova)
   



In [None]:
#Creating a dataframe
data=pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],  'B':[4,4,6,10,10,4,4,6,10,10], 'C':[1,1,1,1,5,1,1,1,1,5], 'D':[2,2,2,1,5,2,2,2,1,5]})

print('Dataframe:')
print(data)


#Selecting A,C,D
X=data[['A','C','D']]
y=data['E'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [A,C,D] is selected:", model.score(X_test,y_test))

####################################################################################

#Selecting C,D
X=data[['C','D']]
y=data['E'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [C,D] is selected:", model.score(X_test,y_test))

############################################################################

#Only selecting feature C
X=data[['C']]
y=data['E'].copy()


X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 
model =LinearRegression()

#Fitting the model
model.fit(X_train,y_train)
print("Score when features [C] is selected:", model.score(X_test,y_test))


In [None]:
from sklearn.feature_selection import RFE

#Creating a sample of 200 data points
df_new=ames.sample(200,random_state=49)

X = df_new.drop(['SalePrice'],1)
y=df_new['SalePrice'].copy()

X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3, random_state=0) 

#Selecting model as Linear Regressor
model =LinearRegression()
model.fit(X_train,y_train)
print("Score before RFE:", model.score(X_test,y_test))

#Passing the model along with no. of features you wish to select
selector = RFE(model,13)

#Fitting the data with the above conditions
X_train_rfe = selector.fit_transform(X_train, y_train)
X_test_rfe=selector.transform(X_test)
model.fit(X_train_rfe,y_train)

print("Score after RFE:",model.score(X_test_rfe,y_test))

# print(selector.support_)  # The mask of selected features.
# print(selector.ranking_)  # The feature ranking.

List=[]
for i in range(len(df_new.columns)-1):
    if selector.support_[i]:
        List.append(df_new.columns[i])
print(List)
score={}
for i in range(len(List)):

    score[List[i]]=abs(y_train.corr(X_train[List[i]],method='pearson'))
    
fig, (ax_1, ax_2) = plt.subplots(1,2, figsize=(20,8))    
# print(score)    
from collections import OrderedDict
from operator import itemgetter   
s1=OrderedDict(sorted(score.items(), key = itemgetter(1), reverse = False))
# plt.bar(range(len(score)), list(score.values()), align='center')
ax_1.bar(list(s1.keys()), list(s1.values()), align='center')
ax_1.set_title('Features selected by RFE method')
# ax_1.set_xticks()

####################################

X_train['Class']=y_train
t_corr=X_train.corr()
t_corr=t_corr['Class']

corr_columns=t_corr[abs(t_corr)>0.5].index
corr_columns=corr_columns.drop('Class')

X_train_new=X_train[corr_columns]

X_test_new=X_test[corr_columns]

# X_train_new=

model=LinearRegression()
model.fit(X_train_new,y_train)
corr_score=model.score(X_test_new,y_test)
print(corr_score)
print(len(X_train_new.columns))


List_2=X_train_new.columns
score_2={}
for i in range(len(List_2)):
#     score_2[List_2[i]]=abs(df_new['SalePrice'].corr(df_new[List_2[i]],method='pearson'))
        score_2[List_2[i]]=abs(y_train.corr(X_train[List_2[i]],method='pearson'))
# print(score_2)

# n=['Utilities','Condition2','PoolArea','PoolQC','Street']
# for i in n:
#     score_2[i]=0
s2=OrderedDict(sorted(score_2.items(), key = itemgetter(1), reverse = False))
print(s2)
# plt.bar(range(len(score_2)), list(score_2.values()), align='center')
ax_2.bar(list(s2.keys()), list(s2.values()), align='center')
ax_2.set_title('Features selected by Correlation method')

ax_2.set_xticks(list(s2.keys()))

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=90)

plt.show()

# print(list(s2.keys()))

# Quiz

1. Which of the following are good reasons to implement PCA?

a) As a method to implement regularisation by reducing features and preventing overfitting

b) To visualise high dimensional data(by choosing k=2 or k=3) [Answer]

c) As a method to speed up a learning algorithm that lags with high dimensional data [Answer]

d) To compress the data to take up less disk space [Answer]

2. If the input features are on very different scales, PCA automatically takes care of it and performs dimensionality reduction.

a) True

b) False [Answer]

3. F-score is defined as 

a) Variance between the samples/ Variance within the samples [Answer]

b) Variance within the samples/ Variance between the samples

c) Variance within the samples * Variance between the samples

d) Variance between the samples * Variance within the samples


4. Which among the following is not a filter method?

a) Correlation Coefficient 

b) Anova

c) L1 Regularization [Answer]


5. PCA is a method of feature selection

a) True

b) False [Answer]



6. Which one of the following methods should be preferred when dealing with large amounts of data?


a) Filter methods [Answer]

b) Wrapper methods


7. Which of the following methods takes into account the information dependency between the evaluated features?

a) Wrapper methods [Answer]

b) Filter methods



8. PCA method is a feature selection method 

a) True 

b) False [Answer]


9. Embedded methods will always perform better than the filter methods and wrapper methods because of its hybrid nature.

a) True

b) False [Answer]


10. You should implement feature selection methods before the data cleaning methods because that would result in lesser features to clean and preprocess

a) False [Answer]

b) True