# Feature Engineering Techniques Notebook

## Goal: to templatize various feature engineering/data pre-processing techniques used in ML applications

### Context

Feature engineering is performed to improve the accuracy and training speed of machine learning models. It is the process of selecting and transforming attributes as part of data pre-processing, so that the ML algorithm can interpret the input data more easily.
The most commonly used feature engineering techniques include one-hot encoding, binning, normalization, data imputation and standardization.

We will now implement the following feature engineering techniques, each demonstrated with code below:
1. Standardization (Scaling)
2. Normalization (Scaling)
3. One-Hot Encoding (Encoding)
4. Label Encoding (Encoding)
5. Data Imputation
6. Binning

### 1. Standardization

Standardization is a process during which the values of a feature are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature over all examples in the dataset) and σ is the standard deviation. Each value x is replaced by z = (x − μ) / σ.


Let us now look at a use case where we apply the standardization technique to selected numerical variables in a data set.

In [None]:
#sample dataset
import pandas as pd
data = pd.DataFrame({'Name' : ['A', 'V','C'], 'Age' : [18, 92,98], 'Weight' : [68, 59,49], 'Class' : [1, 0,1]})
data.head()

In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [5]:
numerical_features = data[['Age', 'Weight']].copy()

In [6]:
numerical_features.head()

Unnamed: 0,Age,Weight
0,18,68
1,92,59
2,98,49


In [7]:
numerical_features_array = scaler.fit_transform(numerical_features)



In [8]:
numerical_features_array

array([[-1.41100443,  1.20270298],
       [ 0.62304092,  0.04295368],
       [ 0.78796352, -1.24565666]])

In [9]:
numerical_features = pd.DataFrame(numerical_features_array, index=numerical_features.index, columns = numerical_features.columns)
numerical_features.head()

Unnamed: 0,Age,Weight
0,-1.411004,1.202703
1,0.623041,0.042954
2,0.787964,-1.245657


In [10]:
data = data.drop(['Age', 'Weight'], axis = 1)
data.head()

Unnamed: 0,Name,Class
0,A,1
1,V,0
2,C,1


In [11]:
data = pd.concat([data,numerical_features], axis = 1)

In [12]:
data.head()

Unnamed: 0,Name,Class,Age,Weight
0,A,1,-1.411004,1.202703
1,V,0,0.623041,0.042954
2,C,1,0.787964,-1.245657


As seen from the dataframe above, the numerical variables Age and Weight are now standardized.
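
As a sanity check, the scaled values can be reproduced by hand. The sketch below applies z = (x − μ) / σ directly with pandas, assuming the population standard deviation (`ddof=0`), which is what reproduces the `StandardScaler` output shown above. The variable names `manual` and `scaled` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same numerical columns as in the sample dataset above
numerical_features = pd.DataFrame({'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (numerical_features - numerical_features.mean()) / numerical_features.std(ddof=0)

# StandardScaler applied to the same columns, for comparison
scaled = StandardScaler().fit_transform(numerical_features)

print(manual.round(6))
print(np.allclose(manual.to_numpy(), scaled))
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so omitting `ddof=0` would give slightly different numbers on such a small dataset.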

### 2. Normalization

Normalization is another well-known feature processing technique that converts the actual range of values a numerical feature can take into a standard range of values, typically the interval [-1, 1] or [0, 1].

Let us now look at a use case where we apply the normalization technique to selected numerical variables, using the same dataframe as in the previous section.

In [16]:
#sample dataset
import pandas as pd
data = pd.DataFrame({'Name' : ['A', 'V','C'], 'Age' : [18, 92,98], 'Weight' : [68, 59,49], 'Class' : [1, 0,1]})
data.head()

Unnamed: 0,Name,Age,Weight,Class
0,A,18,68,1
1,V,92,59,0
2,C,98,49,1


In [17]:
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler()

In [18]:
numerical_features = data[['Age', 'Weight']].copy()

In [19]:
numerical_features.head()

Unnamed: 0,Age,Weight
0,18,68
1,92,59
2,98,49


In [20]:
numerical_features_array = scaling.fit_transform(numerical_features)



In [21]:
numerical_features_array

array([[0.        , 1.        ],
       [0.925     , 0.52631579],
       [1.        , 0.        ]])

In [22]:
numerical_features = pd.DataFrame(numerical_features_array, index=numerical_features.index, columns = numerical_features.columns)
numerical_features.head()

Unnamed: 0,Age,Weight
0,0.0,1.0
1,0.925,0.526316
2,1.0,0.0


In [23]:
data = data.drop(['Age', 'Weight'], axis = 1)
data.head()

Unnamed: 0,Name,Class
0,A,1
1,V,0
2,C,1


In [24]:
data = pd.concat([data,numerical_features], axis = 1)

In [25]:
data.head()

Unnamed: 0,Name,Class,Age,Weight
0,A,1,0.0,1.0
1,V,0,0.925,0.526316
2,C,1,1.0,0.0


As seen from the dataframe above, the numerical variables Age and Weight are now normalized to the interval [0, 1].
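
The min-max formula behind these numbers is (x − min) / (max − min), computed per column. The sketch below reproduces it by hand and also shows how the [-1, 1] interval mentioned earlier could be obtained through the `feature_range` parameter of `MinMaxScaler`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same numerical columns as in the sample dataset above
numerical_features = pd.DataFrame({'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# Manual min-max scaling to [0, 1], column by column
col_min = numerical_features.min()
col_max = numerical_features.max()
manual = (numerical_features - col_min) / (col_max - col_min)

# The default MinMaxScaler targets [0, 1]; feature_range selects another interval
scaled_01 = MinMaxScaler().fit_transform(numerical_features)
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(numerical_features)

print(manual.round(6))
print(scaled_pm1.round(6))
```

For example, Age 92 maps to (92 − 18) / (98 − 18) = 0.925 in [0, 1], and to 2 × 0.925 − 1 = 0.85 in [-1, 1].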

### 3. One-Hot encoding

One-hot encoding is another popular feature engineering technique used to transform categorical variables. When features such as 'country' or 'colors' are present in the dataset, each can be transformed into several binary features, one per category. Because every row has a single 1 per original feature, the resulting matrix is mostly zeros (sparse).

The code section below demonstrates the one-hot encoding technique in action!

In [30]:
#sample dataset
import pandas as pd
data = pd.DataFrame({'Region' : ['France', 'Spain','Germany'], 'Age' : [18, 92,98], 'Weight' : [68, 59,49], 'Class' : [1, 0,1], 'Gender' : ['Male', 'Female','Female']})
data.head()

Unnamed: 0,Region,Age,Weight,Class,Gender
0,France,18,68,1,Male
1,Spain,92,59,0,Female
2,Germany,98,49,1,Female


In [31]:
#one hot encoding for categorical variables
final_data = pd.get_dummies(data)

In [32]:
final_data.shape

(3, 8)

In [33]:
final_data.head()

Unnamed: 0,Age,Weight,Class,Region_France,Region_Germany,Region_Spain,Gender_Female,Gender_Male
0,18,68,1,1,0,0,0,1
1,92,59,0,0,0,1,1,0
2,98,49,1,0,1,0,1,0


As the dataframe above shows, the categorical variables Region and Gender have been encoded into their equivalent binary representation using one-hot encoding, while the numerical columns are left unchanged.
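
In recent pandas versions `get_dummies` returns boolean (True/False) columns by default, so the 0/1 output shown above may not be reproduced exactly. The sketch below is one way to get integer columns, by passing `dtype=int`, and to name the columns to encode explicitly. The `drop_first=True` variant is an optional extra, shown only to illustrate how the redundant column per variable can be dropped.

```python
import pandas as pd

data = pd.DataFrame({
    'Region': ['France', 'Spain', 'Germany'],
    'Age': [18, 92, 98],
    'Weight': [68, 59, 49],
    'Class': [1, 0, 1],
    'Gender': ['Male', 'Female', 'Female'],
})

# Encode only the named categorical columns, as 0/1 integers
final_data = pd.get_dummies(data, columns=['Region', 'Gender'], dtype=int)
print(final_data.columns.tolist())

# Optional: drop the first category of each variable to avoid redundant columns
reduced = pd.get_dummies(data, columns=['Region', 'Gender'], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```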

### 4. Label Encoding

Label encoding is another encoding technique in which each category of a variable is replaced with a distinct integer.

The code sections below demonstrate a use case where we will use Label encoding to encode a categorical variable

In [71]:
#sample dataset
import pandas as pd
data = pd.DataFrame({'Gender' : ['Male', 'Female','Female'], 'Age' : [18, 92,98], 'Weight' : [68, 59,49], 'Class' : [1, 0,1]})
data.head()

Unnamed: 0,Gender,Age,Weight,Class
0,Male,18,68,1
1,Female,92,59,0
2,Female,98,49,1


In [72]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [76]:
# LabelEncoder expects a 1-D input, so pass the column as a Series
encoder.fit(data['Gender'])

LabelEncoder()

In [77]:
# Replace the Gender column with its integer codes
data['Gender'] = encoder.transform(data['Gender'])

In [78]:
data.head()

Unnamed: 0,Gender,Age,Weight,Class
0,1,18,68,1
1,0,92,59,0
2,0,98,49,1


The transformed data frame above shows the classes 'Female' and 'Male' of the variable 'Gender' encoded as 0 and 1 respectively (LabelEncoder assigns codes in sorted order of the class labels).
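
To see which integer was assigned to which class, the fitted encoder can be inspected through its `classes_` attribute, and the codes can be mapped back with `inverse_transform`. The sketch below uses only the Gender column; the names `mapping` and `decoded` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

gender = pd.Series(['Male', 'Female', 'Female'], name='Gender')

encoder = LabelEncoder()
encoded = encoder.fit_transform(gender)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```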

### 5. Data Imputation 

Data imputation is another very useful feature engineering technique in which the missing values of a particular variable are replaced with its mean or median (for numerical variables) or its mode (for categorical variables).

We will now use data imputation to handle the missing value in the variable 'Age' by replacing it with the median of the observed values of that variable, as shown in the code section below.

In [45]:
#sample dataset
import numpy as np
import pandas as pd
# np.nan (not the string 'NaN') marks the missing Age value
data = pd.DataFrame({'Region' : ['France', 'Spain','Germany'], 'Age' : [np.nan, 92, 98], 'Weight' : [68, 59,49], 'Class' : [1, 0,1], 'Gender' : ['Male', 'Female','Female']})
data.head()

Unnamed: 0,Region,Age,Weight,Class,Gender
0,France,,68,1,Male
1,Spain,92.0,59,0,Female
2,Germany,98.0,49,1,Female


In [46]:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'median')

In [51]:
imputer.fit(data[['Age']])

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)

In [53]:
data[['Age']] = imputer.transform(data[['Age']])

In [54]:
data.head()

Unnamed: 0,Region,Age,Weight,Class,Gender
0,France,95.0,68,1,Male
1,Spain,92.0,59,0,Female
2,Germany,98.0,49,1,Female


The transformed dataframe in the cell above shows that the missing value in the variable 'Age' was replaced with the median of the observed values (95.0, the median of 92 and 98). The same operation can replace missing values with the mean or the mode instead, by setting the 'strategy' parameter to 'mean' or 'most_frequent' when creating the 'imputer' object.
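
A minimal sketch of the other two strategies follows. It uses a small hypothetical dataset (not the one above) chosen so that the mean and median differ, plus a categorical column with a missing entry to illustrate `most_frequent`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
data = pd.DataFrame({
    'Age': [np.nan, 92, 98, 20],
    'Gender': pd.Series(['Male', 'Female', np.nan, 'Female'], dtype=object),
})

# Numerical column: fill with the mean of the observed values
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
age_mean = mean_imputer.fit_transform(data[['Age']]).ravel()

# Categorical column: fill with the most frequent value (the mode)
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
gender_mode = mode_imputer.fit_transform(data[['Gender']]).ravel()

print(age_mean)
print(gender_mode)
```

Here the missing Age becomes (92 + 98 + 20) / 3 = 70, whereas the median strategy would have given 92.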

### 6. Binning

Binning (also known as bucketing) is the process of converting a continuous feature into multiple binary features called 'bins', each based on a range of values. For example, instead of representing age as a single real-valued feature, the range of values of the variable can be divided into 5 bins. In the example below, each score is instead given a single categorical label naming its bin, which carries the same information.

Let us see how we can implement binning on a data frame as shown in the code sections below

In [89]:
#sample dataset
import pandas as pd
data = pd.DataFrame({'Name' : ['A', 'B','C','D','E','F','G','H','I','J','K','L'], 'Score' : [100,80,54,65,25,87,67,98,81,50,25,57]})
data.head()

Unnamed: 0,Name,Score
0,A,100
1,B,80
2,C,54
3,D,65
4,E,25


In [83]:
#defining bins
bins = [0, 25, 50, 75, 100]
# names of group
group_names = ['fail', 'average', 'good', 'brilliant']

In [87]:
data['grade'] = pd.cut(data['Score'], bins, labels = group_names)

In [88]:
data.head()

Unnamed: 0,Name,Score,grade
0,A,100,brilliant
1,B,80,brilliant
2,C,54,good
3,D,65,good
4,E,25,fail
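
To connect this back to the definition of bins as binary features, the sketch below repeats the `pd.cut` step on the full dataset, counts how many scores fall into each bin, and then expands the `grade` column into one 0/1 column per bin with `pd.get_dummies`. The names `counts` and `bin_features` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': list('ABCDEFGHIJKL'),
    'Score': [100, 80, 54, 65, 25, 87, 67, 98, 81, 50, 25, 57],
})

bins = [0, 25, 50, 75, 100]
group_names = ['fail', 'average', 'good', 'brilliant']

# By default each interval includes its right edge: (0, 25], (25, 50], ...
data['grade'] = pd.cut(data['Score'], bins, labels=group_names)

# Number of scores per bin
counts = data['grade'].value_counts(sort=False)
print(counts)

# One binary column per bin
bin_features = pd.get_dummies(data['grade'], prefix='grade', dtype=int)
print(bin_features.head())
```

Because the intervals are open on the left by default, a score of exactly 0 would fall outside the first bin and become NaN; passing `include_lowest=True` to `pd.cut` is one way to include it.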


# THE END