# Feature Engineering

### Table of Contents
1. [What is feature engineering?](#1.-What-is-feature-engineering?)
2. [Categorical data](#2.-Categorical-data)
3. [Continuous data](#3.-Continuous-data)
4. [Text and image data](#4.-Text-and-image-data)
5. [Missing data](#5.-Missing-data)
6. [Reference](#6.-Reference)

### 1. What is feature engineering?

Feature engineering is the process of generating __meaningful__ and __variable__ attributes in your data so your model can learn from them. Traditional machine learning models expect a single vector for every observation in your sample data, so you can think of features as the elements of those vectors.

Feature engineering is more than just cleaning or prepping your data. It is a __creative__ process that requires __domain knowledge and understanding of the problem__ so that your dataset is capable of yielding accurate models.

Kaggle competition winners are often differentiated by how they engineer features. We have seen winning models fall into one of two kinds:
1. Models made with a deep understanding of the problem and detailed, precise feature engineering
2. Models made with extremely deep neural networks and complex ensemble solutions
   - These models sacrifice explainability by not performing feature engineering

#### Let's brainstorm about what a feature might be / look like.

#### So how do I gain domain knowledge?

#### Date Example

What are some features we could generate with datetime data?
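
As one possible answer, here is a minimal sketch using the pandas `.dt` accessor. The `release_date` column, its values, and the derived column names are hypothetical and only serve as an illustration; they are not part of the movies dataset used later.

```python
import pandas as pd

# Hypothetical release dates, used for illustration only.
df = pd.DataFrame(
    {"release_date": pd.to_datetime(["1994-09-23", "1972-03-24", "2008-07-18"])}
)

# Calendar components pulled out of the timestamp.
df["year"] = df.release_date.dt.year
df["month"] = df.release_date.dt.month
df["day_of_week"] = df.release_date.dt.dayofweek  # Monday=0 ... Sunday=6

# Boolean flags built on top of those components.
df["is_weekend"] = df.day_of_week >= 5
df["is_summer"] = df.month.isin([6, 7, 8])
```

Which of these is useful depends on the problem: a seasonal flag may matter for box-office data and be irrelevant elsewhere.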

#### Movie Feature Brainstorm

Without thinking about what data we have, what data would we want in order to predict whether a movie will be highly rated?

Who should we ask about what makes a good movie?


### 2. Categorical data

With categorical data, we can start with __one-hot encoding__ and __label encoding__. The first step is to determine whether the categories are ordinal or unordered. From there, we can also generate interaction features!

In [7]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
import warnings

warnings.filterwarnings("ignore")

In [10]:
# Google Drive file ID of the movies CSV (named id_ to avoid shadowing the built-in id)
id_ = "109om-_4CDEAYco3vrsklrEDcfHsj0AkX"
url_ = f"https://drive.google.com/uc?export=download&id={id_}"

movies = pd.read_csv(url_)
movies.head()

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |


In [12]:
movies.content_rating.unique()

array(['R', 'PG-13', 'PG', 'G'], dtype=object)

In [11]:
## Keep only movies with ratings R, PG-13, PG, or G
ratings = ['G','PG','PG-13','R']
movies = movies[(movies.content_rating.isin(ratings))].reset_index(drop=True)

In [13]:
lab_enc = LabelEncoder()
lab_enc.fit(ratings)
movies['rating_enc'] = lab_enc.transform(movies.content_rating)
movies[['content_rating','rating_enc']].head(15)

| | content_rating | rating_enc |
|---|---|---|
| 0 | R | 3 |
| 1 | R | 3 |
| 2 | R | 3 |
| 3 | PG-13 | 2 |
| 4 | R | 3 |
| 5 | PG-13 | 2 |
| 6 | R | 3 |
| 7 | R | 3 |
| 8 | PG-13 | 2 |
| 9 | PG-13 | 2 |
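
`LabelEncoder` assigns codes in sorted label order, and for these four ratings alphabetical order happens to match the intended G < PG < PG-13 < R ordering. When that coincidence can't be relied on, one alternative is to spell the order out. The sketch below uses an ordered `pd.Categorical`; the small `content_rating` series is a made-up stand-in for `movies.content_rating`.

```python
import pandas as pd

# Small made-up sample standing in for movies.content_rating.
content_rating = pd.Series(["R", "PG-13", "G", "PG", "R"])

# The order is written out explicitly, from least to most restrictive,
# instead of being inferred from alphabetical sorting.
rating_order = ["G", "PG", "PG-13", "R"]
rating_cat = pd.Categorical(content_rating, categories=rating_order, ordered=True)

# .codes gives the integer position of each value in rating_order.
rating_enc = pd.Series(rating_cat.codes, index=content_rating.index)
```

Here `rating_enc` holds the position of each rating in `rating_order`, so the encoding stays correct even if the labels are renamed or new ones are added.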


In [15]:
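# LabelBinarizer produces one 0/1 column per genre, ordered as in one_hot_enc.classes_
one_hot_enc = LabelBinarizer()
one_hot_enc.fit(movies.genre)
movies_ohe = one_hot_enc.transform(movies.genre)
movies_ohe = pd.DataFrame(movies_ohe)
# Attach the indicator columns (named 0..13) to the original frame
movies_ohe = pd.concat([movies, movies_ohe],axis = 1)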
movies_ohe.head()

| | star_rating | title | content_rating | genre | duration | actors_list | rating_enc | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [16]:
movies.genre.value_counts()

Drama        226
Action       123
Comedy       121
Crime         97
Biography     73
Adventure     63
Animation     55
Horror        19
Mystery       10
Western        6
Sci-Fi         4
Thriller       4
Family         2
Fantasy        1
Name: genre, dtype: int64

In [17]:
# drop_first=True drops the first genre (Action), which is implied when all other columns are 0
movies_one_hot = pd.get_dummies(movies.genre, drop_first=True)
movies = movies.join(movies_one_hot)
movies.head()

| | star_rating | title | content_rating | genre | duration | actors_list | rating_enc | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | Horror | Mystery | Sci-Fi | Thriller | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### 3. Continuous data

Engineering features from continuous data often requires specific domain knowledge.

Common transformations to apply to continuous data, though, include:
1. Square root
2. Log
3. Squaring
4. Interaction (V1 * V2)
5. Sums and differences (V1 `+` V2, V1 `-` V2)
6. Scaling
   * Normalization
   * Standardization
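
The non-scaling transformations in this list can be sketched in a few lines of NumPy and pandas. The three rows below reuse the `duration` and `star_rating` values from the first movies shown earlier; the new column names are illustrative, and `log1p` (log of 1 + x) is an assumption made here so that a zero value would not break the log.

```python
import numpy as np
import pandas as pd

# Three rows copied from the movies table above.
df = pd.DataFrame({"duration": [142, 175, 200], "star_rating": [9.3, 9.2, 9.1]})

df["duration_sqrt"] = np.sqrt(df.duration)            # 1. square root
df["duration_log"] = np.log1p(df.duration)            # 2. log, computed as log(1 + x)
df["duration_sq"] = df.duration ** 2                  # 3. squaring
df["duration_x_rating"] = df.duration * df.star_rating  # 4. interaction (V1 * V2)
df["duration_minus_mean"] = df.duration - df.duration.mean()  # 5. difference
```

Whether any of these helps is an empirical question; domain knowledge usually suggests which ones are worth trying.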

In [18]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [19]:
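scaler_standard = StandardScaler()  # standardization: zero mean, unit variance
scaler_normalize = MinMaxScaler()   # normalization: rescale to the [0, 1] range

# Scalers expect 2-D input, hence the double brackets around 'duration'
movies['duration_ss'] = scaler_standard.fit_transform(movies[['duration']])
movies['duration_no'] = scaler_normalize.fit_transform(movies[['duration']])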
movies[['duration','duration_ss','duration_no']].head(20)

| | duration | duration_ss | duration_no |
|---|---|---|---|
| 0 | 142 | 0.790552 | 0.421965 |
| 1 | 175 | 2.088826 | 0.612717 |
| 2 | 200 | 3.072367 | 0.757225 |
| 3 | 152 | 1.183968 | 0.479769 |
| 4 | 154 | 1.262651 | 0.491329 |
| 5 | 201 | 3.111708 | 0.763006 |
| 6 | 195 | 2.875659 | 0.728324 |
| 7 | 139 | 0.672527 | 0.404624 |
| 8 | 178 | 2.206851 | 0.630058 |
| 9 | 148 | 1.026602 | 0.456647 |
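
To make the two scalers less of a black box, the sketch below computes the same quantities by hand with NumPy: standardization as (x - mean) / std and normalization as (x - min) / (max - min). The three-value array is a toy sample, so the numbers will not match the table above, which was fitted on the full `duration` column.

```python
import numpy as np

# Toy sample; the scalers above were fitted on the whole duration column instead.
x = np.array([142.0, 175.0, 200.0])

# Standardization: subtract the mean, divide by the population standard
# deviation (ddof=0), which is what StandardScaler uses.
standardized = (x - x.mean()) / x.std()

# Normalization (min-max scaling): map the smallest value to 0 and the largest to 1.
normalized = (x - x.min()) / (x.max() - x.min())
```

Standardized values are centered on 0 with unit variance, while normalized values always fall between 0 and 1.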


#### Binning

Binning cuts a continuous variable at chosen thresholds and assigns each value to the group (bin) that represents the interval it falls in.

Why would we want to bin data?

In [20]:
pd.cut(movies.duration, bins=3, labels=['short', 'medium', 'long']).head(15)

0     medium
1     medium
2       long
3     medium
4     medium
5       long
6       long
7     medium
8     medium
9     medium
10     short
11    medium
12    medium
13    medium
14    medium
Name: duration, dtype: category
Categories (3, object): ['short' < 'medium' < 'long']

In [21]:
pd.cut(movies.duration, [0,90,120,180,movies.duration.max()], labels=False).head(15)

0     2
1     2
2     3
3     2
4     2
5     3
6     3
7     2
8     2
9     2
10    2
11    2
12    2
13    2
14    2
Name: duration, dtype: int64
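
One detail worth seeing on a small example is how `pd.cut` treats the bin edges: by default each interval includes its right edge and excludes its left one. The durations, the fixed upper edge of 240, and the label names below are made up for illustration.

```python
import pandas as pd

# Made-up durations chosen to sit on or near the bin edges.
durations = pd.Series([85, 90, 119, 150, 200])

# Intervals are (0, 90], (90, 120], (120, 180], (180, 240]:
# right edge included, left edge excluded.
edges = [0, 90, 120, 180, 240]
labels = ["short", "standard", "long", "epic"]

binned = pd.cut(durations, bins=edges, labels=labels)
```

A 90-minute movie therefore lands in the first bin together with the 85-minute one, not in the second.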

#### Data Selection

Can we use feature engineering to get at data that we want but don't have access to?

What about in a scenario where we have to come to our client with data requests?

### 4. Text and image data

Text and images are __unstructured__ data - they don't follow our normal representation of rows and columns (at least out of the box, anyway).

To handle this type of data, we have to figure out how to represent them as __vectors__. The goal is to have a vector representation for each observation.

What is an observation for text data?

What is an observation for visual data?
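
As a minimal sketch of what "one vector per observation" can look like, the example below treats each movie title as a text observation and turns it into word counts with scikit-learn's `CountVectorizer` (a bag-of-words representation, chosen here as the simplest option). For images, it flattens a tiny made-up 2x2 RGB pixel array into a single vector. Both choices are assumptions for illustration, not the only way to vectorize such data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Text: each title is one observation, each vocabulary word is one feature.
titles = ["The Godfather", "The Godfather: Part II", "The Dark Knight"]
vectorizer = CountVectorizer()
title_vectors = vectorizer.fit_transform(titles).toarray()

# Column order of title_vectors follows the sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)

# Image: a made-up 2x2 image with 3 color channels, flattened to 12 features.
image = np.arange(12).reshape(2, 2, 3)
image_vector = image.reshape(-1)
```

Each row of `title_vectors` is the vector for one title, and `image_vector` is the vector for one image.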

### 5. Missing data

Missing data can be handled in many different ways. Each comes with pros and cons: you're sacrificing something (e.g., explainability) by filling in missing data.

The first and __most important__ question when filling missing data is __why__ it was missing in the first place.
- Does a missing value mean the value should have been `0`?
- Does a missing value mean that the record is an outlier? 
- Are missing values correlated with other columns (e.g., no ratings are ever taken for this genre of movie)?

Some machine learning models can't handle missing data, so once you understand why data is missing, you can handle it with one of the following:
1. Replace with `0`
2. Drop rows of data with missing values
3. Drop columns of data with missing values
4. Replace missing values in a column with the mean, median, mode, or another summary metric (e.g., 25th percentile) 
5. Build a model to predict what the value should have been given all of the other data available
6. MICE (Multiple Imputation by Chained Equations)
7. Factorization

and more!
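
The first four options in the list can be sketched with pandas and scikit-learn's `SimpleImputer`. The toy frame below is made up for illustration; it also records which values were missing before filling, since that information is lost once the gaps are replaced.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps; the values are illustrative only.
cols = ["duration", "star_rating"]
df = pd.DataFrame({
    "duration": [142.0, np.nan, 200.0, 152.0],
    "star_rating": [9.3, 9.2, np.nan, 9.0],
})

# Keep a record of where values were missing before filling anything in.
was_missing = df[cols].isna()

zero_filled = df[cols].fillna(0)       # 1. replace with 0
rows_dropped = df.dropna(subset=cols)  # 2. drop rows with any missing value
cols_dropped = df.dropna(axis=1)       # 3. drop columns with any missing value

# 4. replace with a summary statistic (here, the column median)
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
```

Note how aggressive option 3 is on this toy frame: both columns contain a gap, so dropping columns leaves nothing behind.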

### 6. Reference

https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114

https://www.kdnuggets.com/2018/12/feature-engineering-explained.html

https://www.kdnuggets.com/2019/02/quick-guide-feature-engineering.html