# Feature Engineering

## 1) What is a Feature?
Consider data stored in SQL database tables. A table is made up of rows and columns
and may contain integers, strings, dates, and so on. Now consider a date column. We
want to do some analysis, but this field is not directly useful. So we first write a
program (or script) that extracts the day of the week from each date and stores it in
a separate column. That new column holds one of seven values (Monday through
Sunday). Next, we write a script to check whether each day falls on a weekend or a
weekday, and store the result in another field, is_weekend, which contains True if the
day is on a weekend and False otherwise. We can use this field in our analysis or to
build a predictive model. A derived column like this is a feature.
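
The two derivation steps described above can be sketched with pandas. This is a minimal illustration, not a prescribed implementation: the `orders` table, its `order_date` column, and the dates themselves are made up for the example.

```python
import pandas as pd

# Hypothetical table with a raw date column (illustrative values only).
orders = pd.DataFrame(
    {"order_date": ["2021-03-05", "2021-03-06", "2021-03-07", "2021-03-08"]}
)
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Step 1: derive the day name from the date.
orders["day_name"] = orders["order_date"].dt.day_name()

# Step 2: derive a boolean feature; dayofweek is 0 for Monday ... 6 for Sunday.
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5

print(orders)
```

Here `day_name` and `is_weekend` are the engineered features; the raw `order_date` column is left untouched.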

## 2) Why Feature Engineering?
In machine learning, success with any modelling technique depends on having good
features in the data. Features are very important for increasing the predictive power
of a model. When we try to solve a real-world problem, we may not always get the best
features: they may come with problems such as missing values, outliers, inconsistent
types, or errors in data collection. Before training any machine learning model, we
have to clean and transform the data and find the right set of features.

Take cooking as an example. We are given a lot of ingredients and have to find the
ones needed for our dish. Without selecting the right ingredients, we cannot make a
perfect meal. Feature engineering plays the same role in the data science process.

Data scientists often spend more than half of their time on selecting the right
features when solving a problem. Some of the benefits of spending time on feature
engineering:
1. It can make a simple model perform much better than a complex model.
2. It reduces model selection time, because good features increase the predictive
power of simple models.
3. It reduces training time by simplifying the model.

It is all about decomposing the data in an intelligent fashion. We will discuss all the
important aspects of feature engineering in this chapter.

## 3) Feature Selection
Feature selection is a set of techniques for selecting a subset of the features in the data.
It differs from feature extraction, where we create new features: here we look for
the most useful features among those already in the data. Feature selection is important
because it:
1. Reduces training time: model creation is faster with fewer features.
2. Makes the model easier to explain and interpret.
3. Improves generalization: the predictive model behaves nearly the same on unseen data.
4. Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
5. Improves accuracy: less misleading data means modeling accuracy improves.

Feature selection is basically a search problem. We have to find ways to select
features that produce better results. The different techniques used in
feature selection are:

#### 1. Filter Methods
These methods score each feature with a statistical test. Each feature is
evaluated against the outcome using a test such as Pearson correlation or the
chi-squared test, and a score is generated. The features are then ranked by
score and the lowest-scoring features are removed.
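
The score-rank-remove loop can be sketched with scikit-learn's `SelectKBest`. The sketch below assumes a synthetic dataset in which only columns 0 and 1 depend on the class label, and uses the ANOVA F-test (`f_classif`) as the scoring function; other tests can be swapped in.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 5 features, of which
# only columns 0 and 1 are shifted by the class label.
rng = np.random.default_rng(0)
n_samples = 200
y_demo = rng.integers(0, 2, size=n_samples)
X_demo = rng.normal(size=(n_samples, 5))
X_demo[:, 0] += 2.0 * y_demo
X_demo[:, 1] += 3.0 * y_demo

# Score every feature against the target, then keep the 2 best.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
ranking = np.argsort(selector.scores_)[::-1]   # best-scoring column first
kept = selector.get_support(indices=True)      # indices of retained columns

print("F-scores:", np.round(selector.scores_, 1))
print("kept columns:", kept)
```

No model is trained at any point, which is why filter methods tend to be cheap compared with the approaches described next.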

#### 2. Wrapper Methods
These methods use a machine learning algorithm to find the best features.
First, different subsets of features are created. The algorithm is then
trained on sample data with each subset and the model performance is
evaluated. The subset that gives the best performance is taken as the
selected features. Wrapper methods are unconcerned with the variable types.
They take more time because they actually train the algorithm on each
subset of features. Recursive feature elimination (RFE) is a good example of
a wrapper feature selection method.
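
The subset-train-evaluate loop can be sketched as an exhaustive search over feature subsets of a fixed size. This is an illustration under assumptions: the data is synthetic (only columns 0 and 2 carry signal), `best_subset` is a hypothetical helper name, and logistic regression with 5-fold cross-validated accuracy stands in for whichever model and metric are actually of interest.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed): 4 features, columns 0 and 2 depend on the label.
rng = np.random.default_rng(0)
n_samples = 300
y_w = rng.integers(0, 2, size=n_samples)
X_w = rng.normal(size=(n_samples, 4))
X_w[:, 0] += 2.0 * y_w
X_w[:, 2] += 2.0 * y_w

def best_subset(X, y, subset_size):
    """Train a model on every subset of the given size; return the best one."""
    best_score, best_cols = -np.inf, None
    for cols in combinations(range(X.shape[1]), subset_size):
        model = LogisticRegression(max_iter=1000)
        score = cross_val_score(model, X[:, list(cols)], y,
                                cv=5, scoring="accuracy").mean()
        if score > best_score:
            best_score, best_cols = score, cols
    return best_cols, best_score

cols, score = best_subset(X_w, y_w, 2)
print("selected columns:", cols, "mean CV accuracy:", round(score, 3))
```

The number of subsets grows combinatorially with the number of features, which is why practical wrapper methods such as RFE search greedily instead of exhaustively.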

#### 3. Intrinsic feature selection methods
Some machine learning algorithms perform feature selection automatically as part of learning the model. We might refer to these techniques as intrinsic feature selection methods.
They include penalized regression models such as Lasso, and decision trees, including ensembles of decision trees such as random forest.

Note: These previous methods are almost always supervised and are evaluated based on the performance of a resulting model on a hold-out dataset.

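#### 4. Embedded methods
These methods select the best features during training itself, which is the same
idea as the intrinsic methods above. An example is lasso regression, which we will
see in later chapters. Ridge regression, by contrast, shrinks coefficients but does
not set them to zero, so on its own it does not remove features.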
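
A small sketch of selection happening during training: lasso's L1 penalty can drive coefficients exactly to zero, while ridge's L2 penalty only shrinks them. The data is synthetic and the penalty strength `alpha=0.5` is an arbitrary choice for illustration, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data (assumed): only columns 0 and 1 drive the target.
rng = np.random.default_rng(0)
X_e = rng.normal(size=(200, 5))
y_e = 3.0 * X_e[:, 0] - 2.0 * X_e[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X_e, y_e)
ridge = Ridge(alpha=0.5).fit(X_e, y_e)

# Features whose lasso coefficient is non-zero are the "selected" ones.
selected = np.flatnonzero(lasso.coef_)

print("lasso coefficients:", np.round(lasso.coef_, 3))
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("columns kept by lasso:", selected)
```

The surviving lasso coefficients are themselves shrunk toward zero, so a common follow-up is to refit an unpenalized model on the selected columns.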


One way to think about feature selection methods is in terms of supervised and unsupervised methods.
The difference is whether or not features are selected based on the target variable. Unsupervised feature selection techniques ignore the target variable, such as methods that remove redundant variables using correlation. Supervised feature selection techniques use the target variable, such as methods that remove irrelevant variables.
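
As a sketch of the unsupervised case, the helper below drops one column from every pair of inputs whose absolute correlation exceeds a threshold, without ever looking at a target. The function name `drop_correlated`, the 0.9 threshold, and the toy data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair with |correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy inputs: "a_scaled" is almost a linear copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 0.01 * rng.normal(size=100),
    "b": rng.normal(size=100),
})

reduced, dropped = drop_correlated(demo)
print("dropped:", dropped)
print("remaining:", list(reduced.columns))
```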

## Statistics for Filter-Based Feature Selection Methods

It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection.

Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:
#### Numerical Variables
- Integer Variables.
- Floating Point Variables.

#### Categorical Variables
- Boolean Variables (dichotomous).
- Ordinal Variables.
- Nominal Variables.

The more that is known about the data type of a variable, the easier it is to choose an appropriate statistical measure for a filter-based feature selection method.

In this section, we consider two broad categories of variable types, numerical and categorical, and two main groups of variables, input and output.

Input variables are those that are provided as input to a model. In feature selection, it is this group of variables that we wish to reduce in size. Output variables are those that a model is intended to predict, often called the response variable.

The type of response variable typically indicates the type of predictive modeling problem being performed. For example, a numerical output variable indicates a regression predictive modeling problem, and a categorical output variable indicates a classification predictive modeling problem.

- Numerical Output: Regression predictive modeling problem.
- Categorical Output: Classification predictive modeling problem.

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures.
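
The pairing of variable types with statistical measures used in the rest of this section can be summarized in a small lookup. This is only a sketch: `suggest_filter_statistic` is a hypothetical helper, and it relies on pandas dtypes, so integer-coded categories (such as a 0/1 column) would need to be cast to a categorical dtype first.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def is_numerical(series):
    # Booleans are treated as categorical; integer-coded categories still
    # look numerical here and must be cast beforehand.
    return is_numeric_dtype(series) and not is_bool_dtype(series)

def suggest_filter_statistic(input_series, output_series):
    """Map the (input type, output type) pair to a univariate measure."""
    num_in = is_numerical(input_series)
    num_out = is_numerical(output_series)
    if num_in and num_out:
        return "Pearson (linear) or Spearman (monotonic) correlation"
    if num_in != num_out:
        return "ANOVA F-test"
    return "Chi-squared test"

# Toy columns, made up for the example.
toy = pd.DataFrame({
    "height": [1.60, 1.75, 1.82, 1.68],
    "weight": [55.0, 72.5, 80.1, 63.2],
    "label": ["a", "b", "b", "a"],
    "flag": [True, False, True, False],
})

print(suggest_filter_statistic(toy["height"], toy["weight"]))
print(suggest_filter_statistic(toy["height"], toy["label"]))
print(suggest_filter_statistic(toy["flag"], toy["label"]))
```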

## Numerical Input, Numerical Output

### Pearson’s correlation coefficient (linear).

In [67]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import chi2_contingency
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
                                         MultiComparison)
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


In [2]:
cars = pd.read_csv("mtcars.csv")
numCols = ["cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
numOutput = cars["mpg"]
cars.head()

Unnamed: 0.1,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [3]:
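def pearsonrCorrelation(df,numInputCols,numOutput,numOfFeatu):
    # Collect (column, Pearson r) pairs against the numerical output.
    cofList = []
    for inputCol in numInputCols:
        pearsonr_cof, p_value = stats.pearsonr(numOutput,df[inputCol])
        cofList.append((inputCol,pearsonr_cof))
    # Rank by absolute correlation so strong positive correlations are kept too.
    cofList.sort(key=lambda col:abs(col[1]),reverse=True)
    return cofList[:numOfFeatu]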

In [4]:
pearsonrCorrelation(cars,numCols,numOutput,3)

[('wt', -0.8676593765172279),
 ('cyl', -0.8521619594266132),
 ('disp', -0.8475513792624787)]

### Spearman’s rank coefficient (nonlinear)

In [5]:
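def spearmanrCorrelation(df,numInputCols,numOutput,numOfFeatu):
    # Collect (column, Spearman rho) pairs against the numerical output.
    cofList = []
    for inputCol in numInputCols:
        spearmanr_cof, p_value = stats.spearmanr(numOutput,df[inputCol])
        cofList.append((inputCol,spearmanr_cof))
    # Rank by absolute correlation so strong positive correlations are kept too.
    cofList.sort(key=lambda col:abs(col[1]),reverse=True)
    return cofList[:numOfFeatu]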

In [6]:
spearmanrCorrelation(cars,numCols,numOutput,3)


[('cyl', -0.9108013108624786),
 ('disp', -0.9088823637364655),
 ('hp', -0.8946646457499626)]
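
To see why the two coefficients can disagree, consider a relationship that is monotonic but not linear. In the sketch below (toy data, chosen only for illustration) `y` is the cube of `x`: Spearman's coefficient works on ranks and reports a perfect association, while Pearson's coefficient is pulled below 1 by the curvature.

```python
import numpy as np
from scipy import stats

# Toy data: y increases with x, but not along a straight line.
x = np.arange(1, 21, dtype=float)
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)

print("Pearson r:   ", round(pearson_r, 4))
print("Spearman rho:", round(spearman_r, 4))
```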

## Numerical Input, Categorical Output

### ANOVA correlation coefficient (linear).

In [7]:
df = pd.read_csv("PlantGrowth.csv")
colInputs = ['weight']
colOutput = "group"
df.sample(10)

Unnamed: 0.1,Unnamed: 0,weight,group
16,17,6.03,trt1
28,29,5.8,trt2
13,14,3.59,trt1
29,30,5.26,trt2
15,16,3.83,trt1
9,10,5.14,ctrl
21,22,5.12,trt2
2,3,5.18,ctrl
3,4,6.11,ctrl
12,13,4.41,trt1


In [8]:
def anovaCorOneWay(df,colInputs,colOutput,numOfFeatu):
    # Collect (column, p-value) pairs from a one-way ANOVA of each input across the output groups.
    cofList = []
    for colInput in colInputs:
        mod = ols(colInput+' ~ '+colOutput,data=df).fit()
        aov_table = sm.stats.anova_lm(mod,typ=2)
        cofList.append((colInput,aov_table['PR(>F)'].iloc[0]))
    # Smallest p-values first, then keep the numOfFeatu most significant inputs.
    cofList.sort(key=lambda col:col[1])
    return cofList[:numOfFeatu]

In [9]:
anovaCorOneWay(df,colInputs,colOutput,1)

In [10]:
d = pd.read_csv("https://reneshbedre.github.io/assets/posts/anova/twowayanova.txt", sep="\t")
d_melt = pd.melt(d, id_vars=['Genotype'], value_vars=['1_year', '2_year', '3_year'])
# replace column names
d_melt.columns = ['Genotype', 'years', 'value']
inputCols = ['Genotype', 'years']
outputCol = "value"
d_melt.sample(5)

Unnamed: 0,Genotype,years,value
8,C,1_year,4.41
51,F,3_year,9.84
9,D,1_year,3.75
6,C,1_year,3.99
13,E,1_year,2.01


In [11]:
def anovaCortwoWay(df,colInputs,colOutput):
    # Fit a model containing only the interaction term C(first input):C(second input).
    model = ols(colOutput+' ~ C('+colInputs[0]+'):C('+colInputs[1]+')', data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    # The p-value in the first row must be lower than 0.05 to count as significant.
    return (colInputs,anova_table['PR(>F)'].iloc[0])

anovaCortwoWay(d_melt,inputCols,outputCol)

(['Genotype', 'years'], 4.293214281878206e-21)

Now we know that the differences across genotype and time (years) are statistically significant, but ANOVA does not tell us which genotypes and times differ significantly from each other. To find the pairs that differ significantly, perform a multiple pairwise comparison (post-hoc comparison) analysis using the Tukey HSD test.

In [12]:
# TODO: to be completed
def testAnova(df,inputCol,outputCol):
    MultiComp = MultiComparison(df[outputCol],
                                df[inputCol])
    return MultiComp.tukeyhsd().summary()

testAnova(d_melt,'Genotype','value')


group1,group2,meandiff,p-adj,lower,upper,reject
A,B,2.04,0.5304,-1.5094,5.5894,False
A,C,2.7333,0.22,-0.816,6.2827,False
A,D,2.56,0.2847,-0.9894,6.1094,False
A,E,0.72,0.9,-2.8294,4.2694,False
A,F,2.5733,0.2793,-0.976,6.1227,False
B,C,0.6933,0.9,-2.856,4.2427,False
B,D,0.52,0.9,-3.0294,4.0694,False
B,E,-1.32,0.8679,-4.8694,2.2294,False
B,F,0.5333,0.9,-3.016,4.0827,False
C,D,-0.1733,0.9,-3.7227,3.376,False


## Categorical Input, Categorical Output

### Chi-Squared test (for independence)

In [13]:
def chiSquared(df,inputCol,outputCol):
    table = pd.crosstab(df[inputCol],df[outputCol])
    return chi2_contingency(table.values)[1]

chiSquared(cars,'cyl','vs')

2.3232347637946903e-05

The p-value obtained from the chi-squared test for independence is significant (p < 0.05), so we conclude that there is a significant association between cyl and vs.
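
Besides the p-value, `chi2_contingency` also returns the test statistic, the degrees of freedom, and the table of expected counts under independence. The sketch below uses a made-up 2x2 table of observed counts; for 2x2 tables SciPy applies Yates' continuity correction by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows and columns are two binary categories.
observed = np.array([[30, 10],
                     [10, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi-squared statistic:", round(chi2, 2))
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under independence:")
print(expected)
```

Comparing `observed` with `expected` shows where the association comes from: every cell here is 10 counts away from the 20 expected under independence.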

## Feature selection with RFE

In [77]:
def evaluate_model(model,X,y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

In [78]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
X = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
Y = "class"
cols = X+[Y]
df = pd.read_csv(url, names=cols)
def RfeMethod(df,inputCols,numFeatures,algorithm,outputCol):
    # Assumes the input columns come first in df, followed by the output column.
    array = df.values
    X = array[:,0:len(inputCols)]
    Y = array[:,len(inputCols)]
    if algorithm == "logisticRegression":
        model = LogisticRegression(solver='lbfgs')
    elif algorithm == "decisionTreeClassifier":
        model = DecisionTreeClassifier()
    elif algorithm == "DecisionTreeRegressor":
        model = DecisionTreeRegressor()
    elif algorithm == "RandomForestClassifier":
        model = RandomForestClassifier()
    elif algorithm == "GradientBoostingClassifier":
        model = GradientBoostingClassifier()
        # Cross-validated accuracy is printed for this estimator only.
        print(evaluate_model(model,df[inputCols],df[outputCol]))
    else:
        raise ValueError("Unknown algorithm: "+algorithm)
    # n_features_to_select must be passed by keyword in recent scikit-learn releases.
    rfe = RFE(model,n_features_to_select=numFeatures)
    fit = rfe.fit(X, Y)
    # support_ is a boolean mask over the input columns.
    selectedFeMask = list(fit.support_)
    selectedFe = [inputCols[i] for i in range(len(inputCols)) if selectedFeMask[i]]
    return selectedFe

print(RfeMethod(df,X,3,"logisticRegression",Y))

print(RfeMethod(df,X,3,"decisionTreeClassifier",Y))

print(RfeMethod(df,X,3,"DecisionTreeRegressor",Y))

print(RfeMethod(df,X,3,"GradientBoostingClassifier",Y))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


['preg', 'mass', 'pedi']
['plas', 'mass', 'pedi']
['plas', 'mass', 'pedi']
[0.74025974 0.81818182 0.74025974 0.81818182 0.76623377 0.81818182
 0.75324675 0.74025974 0.80263158 0.69736842 0.7012987  0.75324675
 0.66233766 0.76623377 0.71428571 0.77922078 0.72727273 0.77922078
 0.76315789 0.76315789 0.68831169 0.76623377 0.76623377 0.7012987
 0.85714286 0.72727273 0.72727273 0.75324675 0.77631579 0.82894737]
['plas', 'mass', 'age']


## Feature Importance with ExtraTreesClassifier

In [57]:
def ExtraTreesClassifierMethod(df,inputCols,numFeatures):
    array = df.values
    X = array[:,0:len(inputCols)]
    Y = array[:,len(inputCols)]
    model = ExtraTreesClassifier(n_estimators=10)
    model.fit(X, Y)
    scores = list(model.feature_importances_)
    listF = []
    for i in range(len(scores)):
        listF.append((inputCols[i],scores[i]))
    listF.sort(key=lambda col:col[1])
    return listF[-numFeatures:]

ExtraTreesClassifierMethod(df,X,3)

[('age', 0.14524514648294043),
 ('mass', 0.14836462684830268),
 ('plas', 0.21677198482406776)]