### FEATURE ENGINEERING IN PYTHON

Feature Engineering is a blanket term for the operations performed on features (variables) to make them suitable for different learning algorithms. Adjusting the features in this way can improve a model's performance, which ultimately affects the final result.



The main types of operations are:

Feature Transformation,
Feature Scaling,
Feature Construction,
Feature Reduction

### Feature Scaling

Feature scaling puts the independent features on a common scale. This is needed because the ranges of raw features can vary widely. Distance-based methods such as KNN and K-means rely on Euclidean distance, so a feature with a large range would otherwise dominate the result.
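
To make the point about Euclidean distance concrete, here is a minimal sketch with three made-up people described by income (in dollars) and age (in years). The data, the `euclidean` helper, and the choice of min-max scaling are assumptions for illustration only.

```python
import numpy as np

# Hypothetical points: (income in dollars, age in years)
a = np.array([20_000.0, 22.0])
b = np.array([21_000.0, 60.0])   # similar income, very different age
c = np.array([22_000.0, 23.0])   # different income, similar age

def euclidean(p, q):
    """Plain Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# On the raw data the income column dominates the distance
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Rescale each column to [0, 1] so both features contribute comparably
points = np.vstack([a, b, c])
scaled = (points - points.min(axis=0)) / (points.max(axis=0) - points.min(axis=0))
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(raw_ab, raw_ac)        # b looks closer to a on raw data
print(scaled_ab, scaled_ac)  # after scaling, the age gap counts too
```

On the raw values, `b` appears closer to `a` even though their ages differ by 38 years, because the income differences are numerically much larger. After scaling, the age gap is no longer drowned out.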

There are two main ways of scaling features:

Min Max Scaler,
    Z Scores (Standardization)
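
For reference, the two transformations can be written as follows. The symbols are notation introduced here, not taken from the original: $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the feature, $\mu$ is its mean, and $\sigma$ is its standard deviation.

$$
x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad z = \frac{x - \mu}{\sigma}
$$

Min-max scaling maps the feature onto the range $[0, 1]$, while the z-score centers it at 0 with unit standard deviation.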

In [47]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [48]:
df=pd.DataFrame({
    'income_in_1000s':[20,25,30,35,38,40,45,50,60,65],
    'Age':[22,23,24,27,28,29,31,33,35,37],
    'Sex':['M','F','F','M','M','M','F','F','F','F']    
})

In [49]:
## Min Max Scaler

In [50]:
from sklearn.preprocessing import MinMaxScaler

In [51]:
scaler= MinMaxScaler()

In [52]:
scaler.fit(df[['income_in_1000s']])

MinMaxScaler(copy=True, feature_range=(0, 1))

In [53]:
df['MinMax_Transformed_income']=scaler.transform(df[['income_in_1000s']])

In [54]:
df

Unnamed: 0,income_in_1000s,Age,Sex,MinMax_Transformed_income
0,20,22,M,0.0
1,25,23,F,0.111111
2,30,24,F,0.222222
3,35,27,M,0.333333
4,38,28,M,0.4
5,40,29,M,0.444444
6,45,31,F,0.555556
7,50,33,F,0.666667
8,60,35,F,0.888889
9,65,37,F,1.0
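
As a cross-check, the same column can be reproduced by hand with NumPy. This is a small sketch assuming the default `feature_range=(0, 1)` shown in the scaler output above; the variable names are illustrative.

```python
import numpy as np

# Same income values as in the DataFrame above
income = np.array([20, 25, 30, 35, 38, 40, 45, 50, 60, 65], dtype=float)

# Min-max scaling: subtract the minimum, divide by the range
manual_minmax = (income - income.min()) / (income.max() - income.min())

print(np.round(manual_minmax, 6))
```

The printed values should agree with the `MinMax_Transformed_income` column.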


### Z Score (Standardization)

In [55]:
from sklearn.preprocessing import StandardScaler

In [56]:
scaler=StandardScaler()
scaler.fit(df[['income_in_1000s']])

StandardScaler(copy=True, with_mean=True, with_std=True)

In [57]:
df['StandardScaler_Transformed_income']=scaler.transform(df[['income_in_1000s']])

In [58]:
df

Unnamed: 0,income_in_1000s,Age,Sex,MinMax_Transformed_income,StandardScaler_Transformed_income
0,20,22,M,0.0,-1.509945
1,25,23,F,0.111111,-1.146977
2,30,24,F,0.222222,-0.78401
3,35,27,M,0.333333,-0.421042
4,38,28,M,0.4,-0.203262
5,40,29,M,0.444444,-0.058075
6,45,31,F,0.555556,0.304893
7,50,33,F,0.666667,0.66786
8,60,35,F,0.888889,1.393795
9,65,37,F,1.0,1.756762


In [59]:
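# pandas .std() uses the sample standard deviation (ddof=1), whereas StandardScaler
# divides by the population standard deviation (ddof=0), so the two columns differ slightly
df['Zscore_manually_calculated'] = (df['income_in_1000s']-df['income_in_1000s'].mean())/df['income_in_1000s'].std()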
df

Unnamed: 0,income_in_1000s,Age,Sex,MinMax_Transformed_income,StandardScaler_Transformed_income,Zscore_manually_calculated
0,20,22,M,0.0,-1.509945,-1.432459
1,25,23,F,0.111111,-1.146977,-1.088118
2,30,24,F,0.222222,-0.78401,-0.743777
3,35,27,M,0.333333,-0.421042,-0.399436
4,38,28,M,0.4,-0.203262,-0.192831
5,40,29,M,0.444444,-0.058075,-0.055095
6,45,31,F,0.555556,0.304893,0.289247
7,50,33,F,0.666667,0.66786,0.633588
8,60,35,F,0.888889,1.393795,1.32227
9,65,37,F,1.0,1.756762,1.666611
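
The manual column does not match the `StandardScaler` column exactly. A likely explanation is the choice of standard deviation: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, using NumPy only and illustrative variable names, computes both versions.

```python
import numpy as np

income = np.array([20, 25, 30, 35, 38, 40, 45, 50, 60, 65], dtype=float)
centered = income - income.mean()

# Population standard deviation (divide by n), as StandardScaler does
z_population = centered / income.std(ddof=0)

# Sample standard deviation (divide by n - 1), the pandas .std() default
z_sample = centered / income.std(ddof=1)

print(np.round(z_population, 6))
print(np.round(z_sample, 6))
```

Passing `ddof=0` to the pandas `.std()` call in the manual calculation should therefore reproduce the `StandardScaler` values.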


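### Encoding Categorical Features

In [60]:
from sklearn.preprocessing import OneHotEncoder
# sparse=False returns a dense array; newer scikit-learn releases (1.2+) name this parameter sparse_output
encoding=OneHotEncoder(sparse=False)
from sklearn.preprocessing import LabelEncoder
label= LabelEncoder()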

In [62]:
label.fit(df['Sex'])

LabelEncoder()

In [63]:
a =label.transform(df['Sex'])
a

array([1, 0, 0, 1, 1, 1, 0, 0, 0, 0])

In [204]:
a = a.reshape(len(a),1)
encoding.fit(a)

Warning emitted by scikit-learn: if a LabelEncoder was used before this OneHotEncoder only to convert the categories to integers, the OneHotEncoder can now be used directly.


OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

In [205]:
# transform returns one column per category; column 0 is the indicator for label 0 ('F')
df['One_hot_encoding']=encoding.transform(a)[:,0]

In [200]:
df

Unnamed: 0,income_in_1000s,Age,Sex,MinMax_Transformed_income,StandardScaler_Transformed_income,Zscore_manually_calculated,One_hot_encoding
0,20,22,M,0.0,-1.509945,-1.432459,0.0
1,25,23,F,0.111111,-1.146977,-1.088118,1.0
2,30,24,F,0.222222,-0.78401,-0.743777,1.0
3,35,27,M,0.333333,-0.421042,-0.399436,0.0
4,38,28,M,0.4,-0.203262,-0.192831,0.0
5,40,29,M,0.444444,-0.058075,-0.055095,0.0
6,45,31,F,0.555556,0.304893,0.289247,1.0
7,50,33,F,0.666667,0.66786,0.633588,1.0
8,60,35,F,0.888889,1.393795,1.32227,1.0
9,65,37,F,1.0,1.756762,1.666611,1.0
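
The `LabelEncoder` step can be skipped, because `OneHotEncoder` accepts string categories directly. The sketch below is one way to do that, not the notebook's original approach. It calls `.toarray()` on the default sparse result so that it does not depend on the `sparse`/`sparse_output` parameter name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same 'Sex' values as in the DataFrame above, shaped as a single column
sex = np.array(['M', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F']).reshape(-1, 1)

encoder = OneHotEncoder()                        # default output is a SciPy sparse matrix
one_hot = encoder.fit_transform(sex).toarray()   # dense array, one column per category
categories = list(encoder.categories_[0])        # categories in sorted order

print(categories)
print(one_hot)
```

Each row has exactly one 1. The first column marks `'F'` and the second marks `'M'`, so for a binary feature one of the two columns is redundant.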


### Feature Reduction

There are various methods of reducing the number of features and in this blog, the methods that are explored are as follows –

Feature Selection,
  Feature Extraction,
  Factor Analysis

#### Feature Selection

As discussed in the Theory of Feature Selection, there are three main ways to do feature selection: Filter Methods, Wrapper Methods, and Embedded Methods.

##### Wrapper Methods

Strictly speaking, wrapper methods are part of modeling and belong in that section. However, because they can also serve as a feature reduction technique, we explore them here.

In a wrapper method, features are selected by repeatedly fitting the model. You can perform forward, backward, or stepwise selection, or recursive feature elimination. With scikit-learn, RFE (Recursive Feature Elimination) is a common choice for selecting and reducing features, and that’s what we are going to use.
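
The idea behind recursive feature elimination can be sketched in a few lines of NumPy: fit a model, drop the feature with the weakest coefficient, and repeat until the desired number remains. The helper `rfe_least_squares` and the synthetic data are hypothetical and only illustrate the loop. It is not scikit-learn's implementation.

```python
import numpy as np

def rfe_least_squares(X, y, n_features_to_select):
    """Hypothetical helper: backward elimination driven by least-squares coefficients."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit ordinary least squares (with an intercept column) on the remaining features
        design = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Drop the feature whose coefficient has the smallest magnitude (skip the intercept)
        weakest = int(np.argmin(np.abs(coef[1:])))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 3 actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected = rfe_least_squares(X, y, n_features_to_select=2)
print(selected)
```

Because the ranking relies on coefficient magnitudes, it is most meaningful when the features are on comparable scales.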

 

In [5]:
df=pd.read_csv('/home/vinay/Downloads/dataset tricks/tanic.csv')

In [6]:
df.dropna(inplace=True)

In [7]:
df

Unnamed: 0.1,Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,0,3,male,22.0,1,0,7.2500,S
1,1,1,1,female,38.0,1,0,71.2833,C
2,2,1,3,female,26.0,0,0,7.9250,S
3,3,1,1,female,35.0,1,0,53.1000,S
4,4,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
885,885,0,3,female,39.0,0,5,29.1250,Q
886,886,0,2,male,27.0,0,0,13.0000,S
887,887,1,1,female,19.0,0,0,30.0000,S
889,889,1,1,male,26.0,0,0,30.0000,C


In [8]:
## first, let's perform some label encoding on the dataset

In [14]:
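from sklearn.preprocessing import OneHotEncoder
# sparse=False returns a dense array; newer scikit-learn releases (1.2+) name this parameter sparse_output
onehotencoder= OneHotEncoder(sparse=False)
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()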

In [15]:
label=labelencoder.fit_transform(df['Sex'])

In [16]:
label= label.reshape(len(label),1)
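# fit_transform returns one column per category; column 0 is the indicator for label 0 ('female')
df['Sex_Female'] = onehotencoder.fit_transform(label)[:,0]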



In [17]:
df

Unnamed: 0.1,Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Sex_Female
0,0,0,3,male,22.0,1,0,7.2500,S,0.0
1,1,1,1,female,38.0,1,0,71.2833,C,1.0
2,2,1,3,female,26.0,0,0,7.9250,S,1.0
3,3,1,1,female,35.0,1,0,53.1000,S,1.0
4,4,0,3,male,35.0,0,0,8.0500,S,0.0
...,...,...,...,...,...,...,...,...,...,...
885,885,0,3,female,39.0,0,5,29.1250,Q,1.0
886,886,0,2,male,27.0,0,0,13.0000,S,0.0
887,887,1,1,female,19.0,0,0,30.0000,S,1.0
889,889,1,1,male,26.0,0,0,30.0000,C,0.0


In [18]:
df.drop(['Sex','Embarked','Unnamed: 0'],inplace=True,axis=1)

In [19]:
df

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_Female
0,0,3,22.0,1,0,7.2500,0.0
1,1,1,38.0,1,0,71.2833,1.0
2,1,3,26.0,0,0,7.9250,1.0
3,1,1,35.0,1,0,53.1000,1.0
4,0,3,35.0,0,0,8.0500,0.0
...,...,...,...,...,...,...,...
885,0,3,39.0,0,5,29.1250,1.0
886,0,2,27.0,0,0,13.0000,0.0
887,1,1,19.0,0,0,30.0000,1.0
889,1,1,26.0,0,0,30.0000,0.0


In [20]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [21]:
linear=LinearRegression()


In [22]:
X= df.iloc[ : ,1: ]

In [23]:
Y=df['Survived']

In [24]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=11)

### Initialise RFE

We first initialize the RFE class, which we import from sklearn.feature_selection. In this step, we specify the number of features we want to keep.

In [22]:
from sklearn.feature_selection import RFE  # import the RFE class
rfe=RFE(linear,n_features_to_select=2)  # initialize RFE and ask for the 2 most relevant features

In [23]:
rfe.fit(X_train,Y_train)  # fit RFE on the training data

RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                               normalize=False),
    n_features_to_select=2, step=1, verbose=0)

In [24]:
list(X_train.columns[rfe.support_])  # names of the features that RFE kept

['Pclass', 'Sex_Female']

Note that RFE works with estimators that follow the scikit-learn estimator interface and expose feature importances after fitting (a coef_ or feature_importances_ attribute).
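
Since `Survived` is a binary target, a classifier is another reasonable estimator to hand to RFE. The sketch below swaps in `LogisticRegression` on synthetic data. The data and variable names are assumptions, and this is an alternative to the `LinearRegression` setup above, not a replacement for it.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic binary target driven only by the first two of five features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (2.0 * X[:, 0] - 3.0 * X[:, 1] > 0).astype(int)

# RFE ranks features by the estimator's coef_ and removes the weakest one per step
selector = RFE(LogisticRegression(), n_features_to_select=2)
selector.fit(X, y)

support = selector.support_   # boolean mask of kept features
ranking = selector.ranking_   # 1 = kept; larger numbers were eliminated earlier

print(support)
print(ranking)
```

`ranking_` is handy when you want to see the order in which features were discarded, not just the final mask.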

### Embedded Methods
Embedded methods rely on regularized models, which select features as part of fitting and can also improve model accuracy. Like wrapper methods, this technique belongs to the modeling stage, but it is covered under Data Preparation here because we are using it for feature reduction. Embedded methods indicate which features to keep through the size of the fitted coefficients. There are two common types of regularization: Lasso (L1) and Ridge (L2). Lasso can shrink coefficients to exactly zero, which is why it is the one used for selection here. In scikit-learn, alpha is the regularization strength and can be changed as required. (Note that alpha in Python is equivalent to lambda in R. In R, alpha instead defines whether to perform Lasso or Ridge regression: it is 0 for Ridge and 1 for Lasso.)
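
To see how the regularization strength drives selection, the sketch below fits scikit-learn's `Lasso` at three values of `alpha` on synthetic data and counts the non-zero coefficients. The data and the particular alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression problem: only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

nonzero_counts = {}
for alpha in (0.01, 1.0, 5.0):
    model = Lasso(alpha=alpha).fit(X, y)
    # Features whose coefficient is exactly zero are effectively dropped
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

A small alpha leaves most coefficients non-zero, while a sufficiently large alpha removes every feature, so in practice alpha is usually tuned (for example by cross-validation) instead of being fixed by hand.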

In [29]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

In [30]:
df

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_Female
0,0,3,22.0,1,0,7.2500,0.0
1,1,1,38.0,1,0,71.2833,1.0
2,1,3,26.0,0,0,7.9250,1.0
3,1,1,35.0,1,0,53.1000,1.0
4,0,3,35.0,0,0,8.0500,0.0
...,...,...,...,...,...,...,...
885,0,3,39.0,0,5,29.1250,1.0
886,0,2,27.0,0,0,13.0000,0.0
887,1,1,19.0,0,0,30.0000,1.0
889,1,1,26.0,0,0,30.0000,0.0


In [41]:
X= df.iloc[ : ,1: ]

In [40]:
Y=df['Survived']

In [47]:
# The bigger the alpha, the fewer features will be selected.
# SelectFromModel from sklearn then keeps the features
# whose coefficients are non-zero

selectformmodel= SelectFromModel(Lasso(alpha=0.005,random_state=0)) 
selectformmodel.fit(X_train,Y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True, fit_intercept=True,
                                max_iter=1000, normalize=False, positive=False,
                                precompute=False, random_state=0,
                                selection='cyclic', tol=0.0001,
                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)

In [49]:
selectformmodel.get_support()

array([ True,  True,  True,  True,  True,  True])

In [53]:
selected_feat = X_train.columns[(selectformmodel.get_support())]
selected_feat 

Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_Female'], dtype='object')

### Select K Best

In [54]:
### another approach is SelectKBest, a filter method that keeps the k highest-scoring features

In [26]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [30]:
best_feature = SelectKBest(score_func=chi2,k=2)
fit=best_feature.fit(X_train,Y_train)

In [38]:
fit.scores_   # scores for all the columns; next we put them in a DataFrame

array([1.88125591e+01, 3.51170059e+01, 7.02656327e-01, 6.54053348e+00,
       2.65849161e+03, 9.62622046e+01])

In [39]:
pd.DataFrame({
    'fit_scores':fit.scores_,
    'coloumns':X_train.columns
    
})

Unnamed: 0,fit_scores,coloumns
0,18.812559,Pclass
1,35.117006,Age
2,0.702656,SibSp
3,6.540533,Parch
4,2658.491608,Fare
5,96.262205,Sex_Female
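
One thing to keep in mind when reading these scores: scikit-learn's `chi2` expects non-negative features (it is designed for counts or frequencies), and the score grows with the magnitude of the feature. The sketch below, on synthetic data with assumed names, shows that multiplying a feature by 10 multiplies its score by 10, which may partly explain why a large-valued column such as `Fare` scores so high.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)

# Synthetic non-negative feature that is loosely related to the class label
feature = rng.poisson(lam=3 + 2 * y).astype(float)

score_original, _ = chi2(feature.reshape(-1, 1), y)
score_rescaled, _ = chi2((feature * 10).reshape(-1, 1), y)

# Same information, different units: the score scales with the feature
ratio = score_rescaled[0] / score_original[0]
print(score_original[0], score_rescaled[0], ratio)
```

Scaling the features to a common range before applying `chi2` is one way to make the scores more comparable across columns.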


In [40]:
## with k=2, the two highest-scoring features (Fare and Sex_Female) are the ones that are kept