### Objectives
* Feature Selection vs Feature Engineering vs Dimensionality Reduction
* Feature Selection Methods
* Filter Methods
* Wrapper Methods
* Embedded Methods
* Hybrid Methods
* Advanced Methods

<hr>

### Feature Selection vs Feature Engineering vs Dimensionality Reduction

* Feature Selection - The process of reducing the number of features considered by a machine learning model, keeping the remaining features unchanged.
     Benefits - More interpretable model, shorter training time, reduced overfitting, i.e. a more generalized model
     
* Feature Engineering - Data preprocessing that transforms the data so that the model performs better.
     Benefits - Improved model accuracy
     
* Dimensionality Reduction - Techniques such as SVD/PCA that transform the features into a lower-dimensional space.
     Benefits - Faster model training, improved accuracy
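
To make the distinction concrete, here is a minimal sketch on synthetic data (the arrays, the seed and the choice of `k=2` are assumptions made for illustration, not part of the original notebook). Selection keeps a subset of the original columns, engineering adds a derived column, and PCA builds new columns that mix all the originals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): only column 0 drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

# Feature selection: keep 2 of the original 4 columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Feature engineering: add a derived column (here a product of two features)
X_engineered = np.column_stack([X, X[:, 0] * X[:, 1]])

# Dimensionality reduction: 2 new columns, each a combination of all 4 originals
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(kept, X_selected.shape, X_engineered.shape, X_reduced.shape)
```

The selected matrix is literally a slice of `X`, while the PCA output has the same shape but contains values that do not appear in `X`.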
     
<hr>

### Feature Selection Methods
* Filter Methods - A simple way of choosing features that are likely to have an impact on the target, based on the data alone and without any ML algorithm.

* Wrapper Methods - Use an ML algorithm to identify the subset of features that gives the best predictions. The result depends on the algorithm used.

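* Embedded Methods - Feature selection happens as part of model training itself, for example through regularization or tree-based feature importances.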
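
As a rough sketch of the last two families, the snippet below uses scikit-learn's `RFE` as one example of a wrapper method and `SelectFromModel` with a decision tree as one example of an embedded method. The synthetic dataset and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 8 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Wrapper: repeatedly fit the model and drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print("RFE keeps columns:", rfe.get_support(indices=True))

# Embedded: fit a tree once and keep features whose importance
# is at least the mean importance (the default threshold)
embedded = SelectFromModel(DecisionTreeClassifier(random_state=0))
embedded.fit(X, y)
print("Tree-based selection keeps columns:", embedded.get_support(indices=True))
```

Swapping the estimator passed to `RFE` can change which columns survive, which is the algorithm dependence mentioned above.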

### Filter Methods for Feature Selection
* Select features independently of any ML algorithm.
* Based on the characteristics of the data.
* A simple and quick way of selecting features.

### Advantages
* The selected features can be used with any ML algorithm, so changing the algorithm does not require changing the selected features.
* Computationally inexpensive.

### Types 
* Univariate Filter-Based Methods - Treat each feature independently.
* Multivariate Filter-Based Methods - Take the relationships between features into account.


## Filter Methods

### Basic

* <b>Constant or Quasi Constant Features</b>

* <b> Feature selector that removes all low-variance features </b>

In [1]:
from sklearn.feature_selection import VarianceThreshold

In [4]:
vt = VarianceThreshold(threshold=0.2)

# Features with a training-set variance lower than this threshold will be removed.
# The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.

In [5]:
import pandas as pd
df = pd.DataFrame({'A':['m','f','m','m','m','m','m','m'], 
              'B':[1,2,3,1,2,1,1,1], 
              'C':[1,2,3,1,2,1,1,1]})

In [6]:
from sklearn.preprocessing import OrdinalEncoder

In [7]:
oe = OrdinalEncoder()

In [8]:
df['A'] = oe.fit_transform(df[['A']])

In [9]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 1 | 1 |
| 1 | 0.0 | 2 | 2 |
| 2 | 1.0 | 3 | 3 |
| 3 | 1.0 | 1 | 1 |
| 4 | 1.0 | 2 | 2 |
| 5 | 1.0 | 1 | 1 |
| 6 | 1.0 | 1 | 1 |
| 7 | 1.0 | 1 | 1 |


In [10]:
vt.fit_transform(df)

array([[1., 1.],
       [2., 2.],
       [3., 3.],
       [1., 1.],
       [2., 2.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [11]:
vt.variances_

array([0.109375, 0.5     , 0.5     ])
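
The variances reported above can be reproduced by hand. The sketch below recomputes them with NumPy's population variance for the encoded column `A` and for column `B`; the array literals are copied from the small DataFrame above. For a 0/1 column the variance also equals `p * (1 - p)`, where `p` is the share of ones.

```python
import numpy as np

# Encoded column A (1.0 = 'm', 0.0 = 'f') and column B from the DataFrame above
a = np.array([1, 0, 1, 1, 1, 1, 1, 1], dtype=float)
b = np.array([1, 2, 3, 1, 2, 1, 1, 1], dtype=float)

var_a = np.var(a)  # population variance (ddof=0)
var_b = np.var(b)

# For a binary column, variance = p * (1 - p)
p = a.mean()
bernoulli_var = p * (1 - p)

print(var_a, var_b, bernoulli_var)  # 0.109375 0.5 0.109375
```

Since 0.109375 is below the threshold of 0.2, column `A` is the one that `VarianceThreshold` drops.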

* <b>Dropping duplicated columns</b>
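
The notebook gives no code for this step, so the following is only one possible sketch. It rebuilds the small encoded DataFrame from above (columns `B` and `C` are identical) and drops repeated columns by checking the transposed frame for duplicate rows.

```python
import pandas as pd

# Same values as the encoded DataFrame above; C is an exact copy of B
df = pd.DataFrame({'A': [1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
                   'B': [1, 2, 3, 1, 2, 1, 1, 1],
                   'C': [1, 2, 3, 1, 2, 1, 1, 1]})

# Transpose so columns become rows, then flag rows that repeat an earlier one
duplicated_mask = df.T.duplicated().to_numpy()

# Keep only the first occurrence of each distinct column
deduplicated = df.loc[:, ~duplicated_mask]

print(list(deduplicated.columns))  # ['A', 'B']
```

Transposing copies the data, so on very wide or very large frames a different approach may be preferable.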

### Correlation Filter

* Correlation measures the linear relationship between two quantitative columns. It tells how strongly one variable varies with the other.
* Say we have 3 features A, B, C and one target T. To find the important features for a model predicting T, we measure the correlation between A & T, B & T and C & T.
* What should we do if features A & B are correlated with each other?
* Answer: A & B provide redundant information to the model for predicting T, so one of them should be removed.

Three ways of calculating correlation - Pearson (linear), Spearman and Kendall (both rank-based)
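
To see how the three coefficients differ, here is a small sketch on made-up data (the exponential relationship is an assumption chosen for illustration). The relationship is perfectly monotonic but not linear, so the rank-based coefficients reach 1 while Pearson stays below it.

```python
import numpy as np
import pandas as pd

# Made-up data (assumption): y always increases with x, but not along a line
x = np.arange(1, 21, dtype=float)
data = pd.DataFrame({'x': x, 'y': np.exp(x / 3)})

pearson = data.corr(method='pearson').loc['x', 'y']
spearman = data.corr(method='spearman').loc['x', 'y']
kendall = data.corr(method='kendall').loc['x', 'y']

print(f"pearson={pearson:.3f} spearman={spearman:.3f} kendall={kendall:.3f}")
```

When a feature relates to the target monotonically but non-linearly, a Pearson-only ranking may understate its usefulness.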

In [12]:
import pandas as pd

In [16]:
df = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/winequality-white.csv', sep=';')
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [20]:
corr = df.corr(method='pearson')
corr['quality'].sort_values(ascending=False)

quality                 1.000000
alcohol                 0.435575
pH                      0.099427
sulphates               0.053678
free sulfur dioxide     0.008158
citric acid            -0.009209
residual sugar         -0.097577
fixed acidity          -0.113663
total sulfur dioxide   -0.174737
volatile acidity       -0.194723
chlorides              -0.209934
density                -0.307123
Name: quality, dtype: float64

In [19]:
import numpy as np

np.abs(corr.loc['fixed acidity']).sort_values(ascending=False)#[:6]

fixed acidity           1.000000
pH                      0.425858
citric acid             0.289181
density                 0.265331
alcohol                 0.120881
quality                 0.113663
total sulfur dioxide    0.091070
residual sugar          0.089021
free sulfur dioxide     0.049396
chlorides               0.023086
volatile acidity        0.022697
sulphates               0.017143
Name: fixed acidity, dtype: float64
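
The rule "if two features are highly correlated, drop one of them" can be sketched as follows. The data is synthetic and the 0.9 cut-off is a hypothetical threshold, not a value taken from the notebook. Only the upper triangle of the correlation matrix is inspected so that each pair is considered once.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): B is almost a scaled copy of A, C is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
frame = pd.DataFrame({
    'A': a,
    'B': a * 2 + rng.normal(scale=0.01, size=200),
    'C': rng.normal(size=200),
})

# Absolute feature-to-feature correlations
corr_matrix = frame.corr(method='pearson').abs()

# Keep only entries above the diagonal; everything else becomes NaN
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = frame.drop(columns=to_drop)

print(to_drop)  # ['B']
```

This always keeps the earlier column of a correlated pair. A refinement would be to keep whichever of the two correlates more strongly with the target.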

### Chi-squared Method
* Used for testing the relationship between categorical variables (binary targets, counts, etc.)
* Calculates the relationship between each feature and the target (both categorical)
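
For intuition, the classic chi-squared statistic compares the observed counts in a contingency table with the counts expected if feature and target were independent. The sketch below works through a made-up 2x2 table (the counts are an assumption). It illustrates the textbook statistic and is not meant to reproduce the exact scores returned by scikit-learn's `chi2` in the cells that follow.

```python
import numpy as np

# Made-up contingency table (assumption): rows = feature categories,
# columns = target classes, cells = number of samples
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Expected counts under independence: row total * column total / grand total
expected = row_totals @ col_totals / total

# Chi-squared statistic: sum of (O - E)^2 / E over all cells
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi_square, 4))  # 16.6667
```

The larger the statistic, the further the observed counts are from independence, which is why higher scores are read as stronger feature-target relationships.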

In [21]:
from sklearn.feature_selection import chi2, SelectKBest

In [22]:
cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship'
        ,'race','sex','capital-gain','capital-loss','hours-per-week','native-country','Salary']
adult_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/adult.data.txt', names=cols)

In [30]:
adult_data.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [23]:
cat_adult_data = adult_data.select_dtypes(include=['object'])

In [24]:
oe = OrdinalEncoder()

In [25]:
data_td = oe.fit_transform(cat_adult_data)

In [26]:
df = pd.DataFrame(data_td, columns=list(cat_adult_data.columns.values))

In [27]:
chi_2, pval = chi2(df.drop(columns=['Salary']), df.Salary)

In [28]:
chi_2

array([  47.50811916,  297.94227041, 1123.46981798,  504.5588538 ,
       3659.14312486,   33.03130514,  502.43941948,   13.61925602])

In [29]:
pval

array([5.47766026e-012, 9.24882165e-067, 2.61759457e-246, 9.68421957e-112,
       0.00000000e+000, 9.06868555e-009, 2.80029903e-111, 2.23877386e-004])

In [35]:
feature_importances = pd.Series(chi_2, index=list(df.drop(columns=['Salary']).columns.values))
feature_importances

workclass           47.508119
education          297.942270
marital-status    1123.469818
occupation         504.558854
relationship      3659.143125
race                33.031305
sex                502.439419
native-country      13.619256
dtype: float64

In [33]:
feature_importances.sort_values(ascending=False)#[:4]

relationship      3659.143125
marital-status    1123.469818
occupation         504.558854
sex                502.439419
education          297.942270
workclass           47.508119
race                33.031305
native-country      13.619256
dtype: float64

In [43]:
df[list(feature_importances.sort_values(ascending=False)[:4].index)].head()


|   | relationship | marital-status | occupation | sex |
|---|---|---|---|---|
| 0 | 1.0 | 4.0 | 1.0 | 1.0 |
| 1 | 0.0 | 2.0 | 4.0 | 1.0 |
| 2 | 1.0 | 0.0 | 6.0 | 1.0 |
| 3 | 0.0 | 2.0 | 6.0 | 1.0 |
| 4 | 5.0 | 2.0 | 10.0 | 0.0 |



In [44]:
fs = SelectKBest(k=4, score_func=chi2)

In [45]:
fs.fit_transform(df.drop(columns=['Salary']), df.Salary)

array([[4., 1., 1., 1.],
       [2., 4., 0., 1.],
       [0., 6., 1., 1.],
       ...,
       [6., 1., 4., 0.],
       [4., 1., 3., 1.],
       [2., 4., 5., 0.]])

In [51]:
fs.get_feature_names_out()

array(['marital-status', 'occupation', 'relationship', 'sex'],
      dtype=object)

In [52]:
pd.DataFrame(fs.fit_transform(df.drop(columns=['Salary']), df.Salary), columns = fs.get_feature_names_out())

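|   | marital-status | occupation | relationship | sex |
|---|---|---|---|---|
| 0 | 4.0 | 1.0 | 1.0 | 1.0 |
| 1 | 2.0 | 4.0 | 0.0 | 1.0 |
| 2 | 0.0 | 6.0 | 1.0 | 1.0 |
| 3 | 2.0 | 6.0 | 0.0 | 1.0 |
| 4 | 2.0 | 10.0 | 5.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 32556 | 2.0 | 13.0 | 5.0 | 0.0 |
| 32557 | 2.0 | 7.0 | 0.0 | 1.0 |
| 32558 | 6.0 | 1.0 | 4.0 | 0.0 |
| 32559 | 4.0 | 1.0 | 3.0 | 1.0 |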


In [46]:
fs.scores_

array([  47.50811916,  297.94227041, 1123.46981798,  504.5588538 ,
       3659.14312486,   33.03130514,  502.43941948,   13.61925602])

<b>PS: Correlation is for finding the relationship between a continuous feature & a continuous target. Chi2 is for finding the relationship between categorical features & a categorical target</b>

### ANOVA Univariate Test
* Suited to features that are continuous & normally distributed.
* The target can be discrete/categorical: use f_classif.
* The target can also be continuous: use f_regression.
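
As a sketch of what the ANOVA score measures in the classification case, the snippet below computes the one-way F-statistic by hand (between-class variance divided by within-class variance) and compares it with `f_classif`. The synthetic data and the helper name `one_way_f` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Synthetic data (assumption): 3 classes, one feature whose mean shifts
# with the class and one pure-noise feature
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
informative = rng.normal(loc=y * 2.0, scale=1.0)
noise = rng.normal(size=y.size)
X = np.column_stack([informative, noise])

f_scores, p_values = f_classif(X, y)

def one_way_f(feature, labels):
    """Hypothetical helper: one-way ANOVA F-statistic for a single feature."""
    groups = [feature[labels == k] for k in np.unique(labels)]
    grand_mean = feature.mean()
    # Variation of the class means around the overall mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(feature) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

manual_f = one_way_f(X[:, 0], y)
print(manual_f, f_scores)
```

A feature whose class means are far apart relative to the spread within each class gets a high F-score, which is what `SelectKBest` ranks on.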

In [53]:
df = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/winequality-white.csv', sep=';')

In [77]:
from sklearn.feature_selection import f_classif
fs = SelectKBest(k=5,score_func=f_classif)

In [78]:
fs

In [79]:
feature_data = fs.fit_transform(df.drop(columns=['quality']), df.quality)
feature_data

array([[2.7000e-01, 4.5000e-02, 1.7000e+02, 1.0010e+00, 8.8000e+00],
       [3.0000e-01, 4.9000e-02, 1.3200e+02, 9.9400e-01, 9.5000e+00],
       [2.8000e-01, 5.0000e-02, 9.7000e+01, 9.9510e-01, 1.0100e+01],
       ...,
       [2.4000e-01, 4.1000e-02, 1.1100e+02, 9.9254e-01, 9.4000e+00],
       [2.9000e-01, 2.2000e-02, 1.1000e+02, 9.8869e-01, 1.2800e+01],
       [2.1000e-01, 2.0000e-02, 9.8000e+01, 9.8941e-01, 1.1800e+01]])

In [80]:
pd.DataFrame(feature_data, columns = fs.get_feature_names_out())

|   | volatile acidity | chlorides | total sulfur dioxide | density | alcohol |
|---|---|---|---|---|---|
| 0 | 0.27 | 0.045 | 170.0 | 1.00100 | 8.8 |
| 1 | 0.30 | 0.049 | 132.0 | 0.99400 | 9.5 |
| 2 | 0.28 | 0.050 | 97.0 | 0.99510 | 10.1 |
| 3 | 0.23 | 0.058 | 186.0 | 0.99560 | 9.9 |
| 4 | 0.23 | 0.058 | 186.0 | 0.99560 | 9.9 |
| ... | ... | ... | ... | ... | ... |
| 4893 | 0.21 | 0.039 | 92.0 | 0.99114 | 11.2 |
| 4894 | 0.32 | 0.047 | 168.0 | 0.99490 | 9.6 |
| 4895 | 0.24 | 0.041 | 111.0 | 0.99254 | 9.4 |
| 4896 | 0.29 | 0.022 | 110.0 | 0.98869 | 12.8 |


In [81]:
from sklearn.linear_model import LogisticRegression

In [82]:
lr = LogisticRegression()

In [83]:
from sklearn.model_selection import train_test_split

In [84]:
trainX, testX, trainY, testY = train_test_split(feature_data, df.quality)

In [85]:
lr.fit(trainX, trainY)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [86]:
lr.score(testX,testY)

0.45387755102040817
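
The convergence warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, assuming a `StandardScaler` plus `LogisticRegression` pipeline, is shown below on synthetic data with deliberately mismatched feature scales. The dataset, the scale factors and `max_iter=1000` are assumptions, and no claim is made about how the wine-quality score would change.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), rescaled so the columns have very different ranges
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = X * [1, 10, 100, 1000, 0.01]

trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

# Scale inside the pipeline so the scaler is fitted on the training split only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(trainX, trainY)

accuracy = model.score(testX, testY)
iterations = model.named_steps['logisticregression'].n_iter_[0]
print(accuracy, iterations)
```

With standardized inputs the solver typically needs far fewer iterations than the limit, so the warning should not appear.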

In [87]:
from sklearn.tree import DecisionTreeClassifier

In [88]:
dt = DecisionTreeClassifier()

In [89]:
dt.fit(trainX, trainY)

In [90]:
dt.score(testX,testY)

0.5755102040816327