<center><h2>Feature Engineering</h2></center>

<center><img src="https://codesachin.files.wordpress.com/2016/06/the-how-and-why-of-feature-engineering-5-638.jpg" width="85%"/></center>


By The End Of This Session You Should Be Able To:
----

- Explain why Feature Engineering (FE) is important
- List different common FE methods
- Describe the advantages and disadvantages of the common methods

<center><img src="images/new_oil.jpg" width="100%"/></center>

<center><img src="images/refine.png" width="75%"/></center>

“Data is the fuel of machine learning.” This isn’t quite true: data is more like the crude oil of machine learning, which has to be refined into features (predictor variables) before it is useful for training a model.



<center><h2>What is Feature Engineering (FE)?</h2></center>

<center><h2> Representing data in the best way possible for an ML algorithm.</h2></center>

<center><h2>The process of formulating the most appropriate features given the goal, the algorithm, and the raw data.</h2></center>

Why is FE important?
------

<center><img src="images/pipeline.png" width="75%"/></center>

ML algorithms assume digital, numeric inputs.

The most valuable information in the world is often not numeric and might not even be digital.

Image Source: https://www.safaribooksonline.com/library/view/feature-engineering-for/9781491953235/ch01.html

<center><h2>What are the most common data types that are not ML ready?</h2></center>

<center><h2>Brian's experience with matching people to jobs...</h2></center>

There is a field of study called "Measurement Theory" that studies the limits and proper way to measure/encode information.

<center><img src="images/succes.jpg" width="100%"/></center>

Feature Engineering is the work of Data Scientists
----

<center><img src="images/Data_Science_VD.png" width="55%"/></center>

 Source: https://codesachin.wordpress.com/2016/06/25/non-mathematical-feature-engineering-techniques-for-data-science/

Feature Engineering is the work of Data Scientists
----

- Takes the most time
- Has a big impact on modeling and business value
- Has not been automated, requires humans and domain knowledge

Source: https://elitedatascience.com/feature-engineering

3 common approaches to FE
-----

1. Hand crafted rules
2. Learned models
3. Stacking

Hand crafted rules
----

A good place to start

Very common

Requires domain expertise

Examples: The "magic numbers" in filtering / thresholding

For example, electrical signal processing
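A minimal sketch of a hand-crafted rule, assuming a hypothetical sampled voltage signal and an illustrative threshold of 0.8. Both the values and the name `SPIKE_THRESHOLD` are made up for this example; in practice the "magic number" would come from domain expertise.

```python
import numpy as np

# Hypothetical sampled voltage signal (illustrative values only).
signal = np.array([0.1, 0.3, 0.95, 0.2, 1.2, 0.05, 0.85])

# The "magic number": a threshold chosen from domain knowledge, not learned.
SPIKE_THRESHOLD = 0.8

# Hand-crafted features derived from the rule.
is_spike = signal > SPIKE_THRESHOLD        # boolean mask, one entry per sample
spike_count = int(is_spike.sum())          # number of samples above the threshold
spike_fraction = float(is_spike.mean())    # share of samples above the threshold

print(spike_count, round(spike_fraction, 3))
```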

Learned
----

Apply ML to Feature Engineering

Typically unsupervised (e.g., dimension reduction or clustering) 

In NLP, perform topic modeling (i.e., clustering), then classification within clusters
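As one possible illustration of learned, unsupervised feature engineering, the sketch below uses PCA (a dimension reduction method) on scikit-learn's bundled iris data. The choice of dataset and of two components is an assumption made only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# 150 samples with 4 numeric features; the labels are deliberately ignored.
X, _ = load_iris(return_X_y=True)

# Learn 2 new features (principal components) from the 4 original ones.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # unsupervised: fitting uses only X

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The reduced matrix can then be fed to a downstream model in place of the raw features.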

Stacking
-----

<center><img src="https://1.bp.blogspot.com/-S8ss-zVfpRM/V1qKcxfCvNI/AAAAAAAAD0I/8UUFyrE4MqQYYuWSxrOOvX3zRfw93nCLwCLcB/s1600/Stacking.png" width="75%"/></center>

The outputs of one model become the inputs of another model

`Pipeline = [Transformer, Transformer, Transformer]`

More about this next session
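A rough sketch of the chaining idea using scikit-learn's `Pipeline`, assuming the iris dataset and an arbitrary choice of steps. Note that in scikit-learn the last step may be an estimator rather than a transformer, and that full stacking of several models is a separate tool (`StackingClassifier`); this only shows outputs of one step flowing into the next.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step's output becomes the next step's input.
pipeline = Pipeline([
    ("scale", StandardScaler()),                      # transformer
    ("reduce", PCA(n_components=2)),                  # transformer
    ("classify", LogisticRegression(max_iter=1000)),  # final estimator
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions[:5])
```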

What are common FE techniques?
-----

- Handling Missing Values
- Vectorizing
- Filtering / Thresholding
- Binning
- Transforming
    - Rescaling 
- Feature selection

Missing Values
-----

All data has missing values. Just deal with it.



Be very careful that the missingness is not systematically related to the effect you are studying.

Maybe missing data could be a feature (not a bug).

["Absence of evidence is not evidence of absence."](https://en.wikipedia.org/wiki/Evidence_of_absence)

(An incomplete) list of techniques:

- Drop rows (instances)
- Drop columns (features)
- Impute values
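The three techniques, plus the idea of treating missingness itself as a feature, might look like this in pandas. The DataFrame and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two numeric columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50_000.0, 62_000.0, np.nan, 48_000.0],
    "city": ["Oslo", "Lima", "Pune", "Rome"],
})

# Missingness as a feature: record where "age" was absent before changing anything.
df["age_was_missing"] = df["age"].isna()

# 1. Drop rows (instances) that contain any missing value.
dropped_rows = df.dropna(axis="index")

# 2. Drop columns (features) that contain any missing value.
dropped_cols = df.dropna(axis="columns")

# 3. Impute: fill each numeric column with its own mean.
imputed = df.fillna({"age": df["age"].mean(), "income": df["income"].mean()})

print(imputed)
```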


Ways to impute values
-----

1. Go get the missing data!
1. Sample from existing values 
1. Calculate the mean of existing values
1. Fit a model on other features to estimate missing value
    - Regression is often used because it estimates the mean of the missing feature conditional on the other features. 
    - k-NN works well.
1. Deep Learning

Source: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
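A small sketch of two of these options using scikit-learn's `SimpleImputer` (column mean) and `KNNImputer` (k-NN on the other features). The toy matrix and `n_neighbors=2` are assumptions for the example, not recommendations.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy matrix: the second feature is missing in the second row.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Mean imputation: replace the gap with the mean of the observed column values.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# k-NN imputation: average the feature over the 2 rows closest on the observed features.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled[1, 1], knn_filled[1, 1])
```

Here the k-NN estimate uses the neighbouring rows rather than the whole column, so it follows the local trend in the data more closely than the global mean does.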

Vectorizing
-------

<center><img src="https://ds055uzetaobb.cloudfront.net/image_optimizer/898ad3880c0fc382f91462f416bc3d126481aa36.png" width="75%"/></center>

Could be any data, mostly applies to Text & Images. 

word2vec (and variations) are very good approaches for text.

Deep Learning works very well for images.

Filtering / Thresholding
------

Always perform EDA (especially univariate)

Remove out-of-bound values
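One way to remove out-of-bound values is a boolean mask in pandas. The column name and the bounds `AGE_MIN` / `AGE_MAX` below are hypothetical; real bounds should come from EDA and domain knowledge.

```python
import pandas as pd

# Hypothetical ages, two of which are clearly invalid.
readings = pd.DataFrame({"age": [34, -2, 51, 240, 18]})

# Assumed plausible range for a human age.
AGE_MIN, AGE_MAX = 0, 120

# between() is inclusive on both ends by default.
in_bounds = readings["age"].between(AGE_MIN, AGE_MAX)
filtered = readings[in_bounds]

print(filtered)
```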

What is the easiest way to handle out-of-bound values?
-----

Prevent them!

Drop down menus are the best.

Type-ahead is very, very nice

<center><img src="https://runt-of-the-web.com/wordpress/wp-content/uploads/2017/12/onety-one.png" width="75%"/></center>

Before type-ahead, 20-25% of Google search queries had spelling mistakes.

Type-ahead improved the business metrics more than any ML model.

In [3]:
%reset -fs

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

palette = "Dark2"
%matplotlib inline

Demo: Bounding values with closures
----

In [9]:
def make_bound_func(min_value, max_value):
    "Define a bound function with a certain min and max"
    def bound(value):
        "Limit value between the min and max"
        return min(max_value, max(value, min_value))
    return bound

In [10]:
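# Each RGB channel must stay within 0..255
bound_rgb = make_bound_func(min_value=0, max_value=255)

assert bound_rgb(42) == 42     # in range: unchanged
assert bound_rgb(-1) == 0      # below range: raised to the minimum
assert bound_rgb(256) == 255   # above range: lowered to the maximum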

In [11]:
from math import inf

bound_non_negative = make_bound_func(min_value=0, max_value=inf)

assert bound_non_negative(42) == 42
assert bound_non_negative(-1) == 0
assert bound_non_negative(1_000_000) == 1_000_000
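
For whole arrays, the same bounding can be sketched with NumPy's `np.clip`, which applies the min/max limit element-wise. The pixel values are made up for the example.

```python
import numpy as np

# Hypothetical channel values, two of them outside the valid 0..255 range.
pixels = np.array([42, -1, 256, 128])

# Clip every element into [0, 255] in one vectorized call.
bounded = np.clip(pixels, 0, 255)

print(bounded)
```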

Defining Outliers / Anomaly Detection
------

Again, can be done by hand or machine learned.



<center><img src="https://www.researchgate.net/profile/Mustafa_Aljumaily2/publication/321682378/figure/fig1/AS:569320483033088@1512747988945/Figure-1-anomaly-detection.png" width="50%"/></center>

__2 types of outliers__:

1. Generated by the same statistical process as your data (just unusual spread)
2. Generated by a __different__ statistical process than your data
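A hand-crafted way to flag candidate outliers is the common 1.5 × IQR rule, sketched below on invented values. The multiplier 1.5 is a convention, not something prescribed here, and the rule cannot tell which of the two outlier types a flagged point belongs to.

```python
import numpy as np

# Invented measurements with one suspicious value.
values = np.array([10, 12, 11, 13, 12, 11, 95])

# Interquartile range: spread of the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```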

Binning, aka univariate clustering
-----

<center><img src="https://i.stack.imgur.com/XNIQd.jpg" width="75%"/></center>

Discretize continuous values into a smaller number of "bins".

Why would you purposely lose information by downsampling your data?
 -----

1. It makes sense for your goal (e.g., categorize people's age by decade)
1. Improve signal-to-noise ratio (e.g., GPS data)

Fitting a model to bins reduces the impact that small fluctuations in the data have on the model; often small fluctuations are just noise. Each bin "smooths" out the fluctuations/noise in its section of the data.

Source: https://datascience.stackexchange.com/questions/19782/what-is-the-rationale-for-discretization-of-continuous-features-and-when-should/23860#23860
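The age-by-decade example could be sketched with `pandas.cut`. The ages, bin edges, and label format are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([3, 17, 25, 29, 41, 68])

# Bin edges every 10 years: 0, 10, ..., 80.
decade_edges = list(range(0, 81, 10))
# One label per bin, named after its lower edge: "0s", "10s", ..., "70s".
decade_labels = [f"{low}s" for low in decade_edges[:-1]]

# right=False makes each bin [low, high), so an age of 20 falls in "20s".
age_decade = pd.cut(ages, bins=decade_edges, labels=decade_labels, right=False)

print(age_decade.tolist())
```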

Transforming
----

- Linear: 
    - Preserves the operations of addition and scalar multiplication
    - Example: Normalization & Standardization (strictly speaking affine, since they shift as well as scale)
    - Generally fine


- Nonlinear:
    - Does __NOT__ preserve the operations of addition and scalar multiplication 
    - Examples: squaring a variable
    - "With great power, comes great responsibility"
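
The additivity property can be checked numerically. This sketch compares a pure scaling with squaring on two arbitrary vectors; the helper names `scale` and `square` are hypothetical.

```python
import numpy as np

def scale(x):
    """A linear transformation: multiply by a constant."""
    return 3.0 * x

def square(x):
    """A nonlinear transformation: element-wise square."""
    return x ** 2

a = np.array([1.0, 2.0])
b = np.array([4.0, -1.0])

# Linear: transforming the sum equals the sum of the transforms.
scale_is_additive = np.allclose(scale(a + b), scale(a) + scale(b))

# Nonlinear: the same check fails for squaring.
square_is_additive = np.allclose(square(a + b), square(a) + square(b))

print(scale_is_additive, square_is_additive)
```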

Rescaling
------

Often features are orders of magnitude different from each other.

Several ML algorithms are sensitive to feature scaling. 

Source: https://stats.stackexchange.com/questions/244507/what-algorithms-need-feature-scaling-beside-from-svm

Check for understanding
-----

Which specific algorithms are sensitive to feature scaling?

What is common among them?

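k-NN, SVM, and neural networks are sensitive to feature scaling.

They exploit distances or similarities between data samples (or, for neural networks, gradient-based training), so features with large magnitudes dominate.

__Naive Bayes (a graphical model) and tree-based models treat each feature independently, so they are largely insensitive to scaling.__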
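To see why distance-based methods care about scale, consider this sketch with three hypothetical people described by age and income. The values and the assumed feature ranges are invented for the example.

```python
import numpy as np

# Hypothetical people described by (age in years, income in dollars).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # very different age, similar income
c = np.array([26.0, 52_000.0])   # similar age, somewhat different income

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# On raw features, income dominates the distance, so b looks closer to a.
raw_nearest_is_b = euclidean(a, b) < euclidean(a, c)

# Assumed feature ranges used to put both features on a comparable scale.
feature_range = np.array([100.0, 100_000.0])
a_s, b_s, c_s = a / feature_range, b / feature_range, c / feature_range

# After rescaling, age contributes meaningfully and c becomes the nearer neighbour.
scaled_nearest_is_c = euclidean(a_s, c_s) < euclidean(a_s, b_s)

print(raw_nearest_is_b, scaled_nearest_is_c)
```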

What's the difference between Normalization and Standardization?
------

__Normalization__ 
-----

Rescales the values into a range of [0,1].  

$$ X_{changed} = \frac{X - X_{min}}{X_{max}-X_{min}} $$ 

Useful where all values need to be on the same positive scale. 

However, outliers define the min and max, so they squeeze the remaining values into a narrow part of the range.
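A short sketch of the formula, computed by hand and with scikit-learn's `MinMaxScaler`, on a made-up column that contains one extreme value. It shows how that value affects where the other points land in [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature (column) with an extreme value; illustrative data only.
X = np.array([[1.0], [2.0], [3.0], [100.0]])

# Direct application of (X - X_min) / (X_max - X_min), column by column.
by_hand = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same rescaling via scikit-learn.
by_sklearn = MinMaxScaler().fit_transform(X)

print(by_hand.ravel())
```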

__Standardization__
-------

Rescales data to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1 (unit variance).

$$ X_{changed} = \frac{X - \mu}{\sigma} $$        

Retains outlier values
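
A matching sketch for standardization, by hand and with scikit-learn's `StandardScaler`, on the same kind of made-up column. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature (column) with an extreme value; illustrative data only.
X = np.array([[1.0], [2.0], [3.0], [100.0]])

# Direct application of (X - mu) / sigma; np.std defaults to ddof=0.
by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

# The same rescaling via scikit-learn.
by_sklearn = StandardScaler().fit_transform(X)

print(by_hand.ravel())
```

The result is not confined to a fixed range, so the extreme value still stands out as the largest z-score.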



Source: https://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization

Feature Selection: What is one way to remove uninformative features?
-------

<center><h2> L1 regularization</h2></center>

<center><img src="https://i.stack.imgur.com/7YKyum.png" width="55%"/></center> 

 <center><img src="https://i.stack.imgur.com/59fltm.png" width="55%"/></center>

 Source: https://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection
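
A minimal sketch of L1 regularization acting as feature selection, using scikit-learn's `Lasso` on synthetic data. The data-generating process (only the first two of five features matter), the seed, and `alpha=0.2` are all assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, but only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks coefficients and sets uninformative ones exactly to zero.
model = Lasso(alpha=0.2).fit(X, y)

# Indices of the features whose coefficients survived.
selected = np.flatnonzero(model.coef_)

print(model.coef_)
print(selected)
```

A larger `alpha` removes more features, at the cost of shrinking the informative coefficients further.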

scikit-learn's `feature_selection` module
-----

Let's take a look [https://scikit-learn.org/stable/modules/feature_selection.html](https://scikit-learn.org/stable/modules/feature_selection.html)
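
As a taste of that module, here is a small sketch with `VarianceThreshold`, which drops features whose variance does not exceed a threshold (0.0 by default, i.e. constant columns). The toy matrix is invented.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix whose first column is constant and therefore uninformative.
X = np.array([
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 4.0],
    [0.0, 3.0, 2.0],
])

# Default threshold=0.0 removes only zero-variance features.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

# Boolean mask showing which original columns were kept.
kept = selector.get_support()

print(kept, X_reduced.shape)
```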

DL Does Automatic Feature Learning
------

<center><img src="http://adilmoujahid.com/images/traditional-ml-deep-learning-2.png" width="1500"/></center>


 

Source: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/

Summary
------

- Feature Engineering (FE) might have a bigger impact than algorithm selection and model tuning.
- FE has the most impact when working in new domains, especially ones that are not digital-native.
- The goal of FE is to change the raw data into the best possible features for an ML algorithm.
- It is both an art (heuristic-based) and a science (systematic, and it can use ML).

<br>