# Feature Selection
The performance of a machine learning model depends heavily on the data features used to train it. Irrelevant features can degrade the model's performance, while relevant features can improve its accuracy, especially for linear and logistic regression.

The following are some of the benefits of feature selection before modeling the data:

- It can reduce overfitting.
- It can increase the accuracy of the ML model.
- It can reduce the training time.


## Univariate Selection
This technique uses statistical tests to select the features that have the strongest relationship with the target variable. We can implement univariate feature selection with the `SelectKBest` class of the scikit-learn Python library. The example below uses the chi-squared test (`chi2`), which requires non-negative feature values, to keep the four best-scoring features.

In [1]:
from pandas import read_csv
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the dataset and extract its values as a NumPy array
path = 'diabetes.csv'
df = read_csv(path)
array = df.values

Next, we will separate array into input and output components:

In [2]:
# Columns 0-7 are the input features; column 8 (Outcome) is the target
X = array[:, 0:8]
Y = array[:, 8]

In [3]:
# Score every feature with the chi-squared test and keep the 4 best
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
print(fit.scores_)

[ 111.51969064 1411.88704064   17.60537322   53.10803984 2175.56527292
  127.66934333    5.39268155  181.30368904]


In [4]:
# Keep only the selected columns and show the first four rows
featured_data = fit.transform(X)
print("Featured data: \n", featured_data[0:4])

Featured data: 
 [[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]]


In [4]:
df.head(3)

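|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |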
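
Comparing the rows above with the transformed array shows that the four selected columns are Glucose, Insulin, BMI and Age, which also have the four highest chi-squared scores. Instead of matching values by eye, the boolean mask returned by `get_support()` can be applied to the column names. The following is a minimal sketch on a small synthetic DataFrame; the column names `a` to `d` and the data are made up for illustration and are not taken from `diabetes.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data (chi2 rejects negative feature values).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "a": rng.integers(0, 10, size=200),          # noise
    "b": y * 20 + rng.integers(0, 5, size=200),  # strongly tied to y
    "c": rng.integers(0, 10, size=200),          # noise
    "d": y * 10 + rng.integers(0, 5, size=200),  # tied to y
})

selector = SelectKBest(score_func=chi2, k=2).fit(demo.values, y)

# Pair each column name with its chi-squared score, highest first.
scores = pd.Series(selector.scores_, index=demo.columns).sort_values(ascending=False)
print(scores)

# get_support() returns a boolean mask over the input columns.
selected = list(demo.columns[selector.get_support()])
print("Selected columns:", selected)
```

The same two lines at the end should work on the diabetes data by replacing `demo.columns` with the first eight entries of `df.columns`.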


## Recursive Feature Elimination
As the name suggests, recursive feature elimination (RFE) repeatedly fits a model, removes the weakest attributes, and refits on the remaining attributes until the requested number of features is left. We can implement RFE with the `RFE` class of the scikit-learn Python library.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

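# Logistic regression is the estimator whose coefficients rank the features
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3, step=1)
fit = rfe.fit(X, Y)

# n_features_ holds the number of features that were kept
print("\nNumber of Features: ", fit.n_features_)
print("\nSelected Features: ", fit.support_)
print("\nFeature Ranking: ", fit.ranking_)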


Number of Features:  3

Selected Features:  [ True False False False False  True  True False]

Feature Ranking:  [1 2 4 5 6 1 1 3]
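
In the output above, `support_` flags the three kept features and `ranking_` gives them rank 1; the higher the rank of a feature, the earlier it was eliminated. The loop that `RFE` automates can be sketched by hand: fit the model, drop the feature with the smallest squared coefficient, and repeat. This is a simplified illustration on synthetic data from `make_classification` (an assumption for the example, not the diabetes data), not scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of them informative (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_redundant=0, shuffle=False, random_state=0,
)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
eliminated = []                           # indices in order of removal

while len(remaining) > 3:
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[:, remaining], y_demo)
    # Use squared coefficients as importance and drop the weakest feature.
    importance = model.coef_[0] ** 2
    weakest = remaining[int(np.argmin(importance))]
    remaining.remove(weakest)
    eliminated.append(weakest)

print("Kept features:", remaining)
print("Eliminated (first removed first):", eliminated)
```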


Running this cell also produces a scikit-learn convergence warning (`STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.`), which means that the logistic regression solver stopped before it converged. As the warning suggests, increase the number of iterations (`max_iter`) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
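
One way to act on the convergence warning is to standardize the features and raise `max_iter` before running `RFE`. The sketch below uses synthetic data with deliberately mismatched feature scales (an assumption for illustration). On the diabetes data, the selected features may differ from the output shown above once the model converges.

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data whose columns are stretched to very different scales.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_redundant=0, random_state=0,
)
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)

with warnings.catch_warnings():
    # Turn any convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", category=ConvergenceWarning)
    rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
    rfe_demo.fit(X_scaled, y_demo)

print("Selected mask:", rfe_demo.support_)
print("Ranking:", rfe_demo.ranking_)
```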


## Feature Importance
As the name suggests, the feature importance technique selects features according to their importance scores, which are obtained from a trained supervised classifier. We can implement this technique with the `ExtraTreesClassifier` class of the scikit-learn Python library.

In [6]:
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier

path = 'diabetes.csv'
df = read_csv(path)
array = df.values

Next, we will separate array into input and output components:

In [7]:
X = array[:, 0:8]
Y = array[:, 8]

In [8]:
# Fit an ensemble of randomized trees and print one importance score per feature
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[0.10845564 0.23958055 0.09753501 0.08167779 0.07423091 0.13954826
 0.11894709 0.14002475]


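The output contains one score per attribute: the higher the score, the more important the attribute. Here the second attribute (Glucose) scores highest. Because `ExtraTreesClassifier` is randomized and no `random_state` is set in this cell, the exact scores vary from run to run.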
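
To read such scores more easily, they can be paired with column names and sorted. The sketch below does this on synthetic data (the names `f0` to `f5` are hypothetical) and sets `random_state`, an addition that is not in the cell above, so that repeated runs give the same scores.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data; with shuffle=False the informative features come first.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2,
    n_redundant=0, shuffle=False, random_state=0,
)
names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical column names

# random_state makes the randomized trees reproducible across runs.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Importances are non-negative and sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
```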