# Techniques of Feature Engineering


## Intro

The first question to ask is: what is feature engineering, and why does a data scientist need it? It is hard to imagine a machine learning algorithm without data. Every machine learning algorithm takes input data and produces some output from it. The data usually comes as structured columns in Excel or CSV files, and the values in those columns are treated as features. Algorithms need these features to work properly. Say you have data on travelers and you want to predict whether they will come back to the same airport.
 <img src="trav.png" />
This data set has two given features, "date arrival" and "date departure". On their own these dates may not be very useful to an algorithm, but if we calculate how long a traveler spent in the city and the day of departure, we may give the algorithm (or, as we call it, the model) more meaningful information about that traveler. Maybe travelers who did not stay long will come back for more sightseeing, or, on the contrary, maybe they did not enjoy the stay and returned earlier than planned. That is why it is important to select and create features from the data for your model.
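As a minimal sketch of the idea above, the snippet below derives a length-of-stay feature and a departure-weekday feature from the two date columns. The rows are made up, and the column names `date_arrival` and `date_departure` are assumptions, since the actual traveler data is only shown as an image; reading "day of departure" as the day of the week is also an assumption.

```python
import pandas as pd

# Hypothetical traveler records; names and values are for illustration only.
travelers = pd.DataFrame({
    "date_arrival": ["2019-03-01", "2019-03-10", "2019-03-15"],
    "date_departure": ["2019-03-04", "2019-03-11", "2019-03-29"],
})

# Parse the raw strings into datetimes so we can do date arithmetic.
travelers["date_arrival"] = pd.to_datetime(travelers["date_arrival"])
travelers["date_departure"] = pd.to_datetime(travelers["date_departure"])

# Engineered feature 1: length of stay in whole days.
travelers["stay_days"] = (
    travelers["date_departure"] - travelers["date_arrival"]
).dt.days

# Engineered feature 2: day of week of departure (Monday=0 ... Sunday=6).
travelers["departure_weekday"] = travelers["date_departure"].dt.dayofweek

print(travelers)
```

The model now sees two plain numeric columns instead of two raw dates, which most algorithms can use directly.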


"The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering."
— Luca Massaron

A definition from Wikipedia:
"Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning." 


To understand how important feature engineering is, we can look at a <a href="https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#70a816a06f63">Forbes</a> survey, according to which about 80% of a data scientist's time is spent on data massaging.

<img src="survey.png" />


This work is also called "data munging" or data preparation, and it may take up to 95% of your time as a data scientist. It can be very boring, which is why we will look at some techniques for automating it. But first let's understand the basics and the main techniques of feature engineering.
We will be using some Python libraries.

In [2]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

# Let's get a basic hands-on feel for manual feature engineering. We will use the Home Credit Default Risk data
* bureau: information about client's previous loans with other financial institutions reported to Home Credit. Each previous loan has its own row.
* bureau_balance: monthly information about the previous loans. Each month has its own row.

Manual feature engineering can be a tedious process and often relies on domain expertise. As I am not a domain expert in loans and what causes a default, I will try to generate a good number of features and let the model decide. Later we can use PCA, or reduce the features using the feature importances from the model.

In [5]:
# Read in bureau
bureau = pd.read_csv('input/bureau.csv')
bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


In [8]:
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()

Unnamed: 0,SK_ID_CURR,previous_loan_counts
0,100001,7
1,100002,8
2,100003,4
3,100004,2
4,100005,3


In [9]:
# Join to the training dataframe
train = pd.read_csv('input/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')

# Fill the missing values with 0 
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,previous_loan_counts
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,8.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,,,,,,,0.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
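The loan count above is a single aggregate. To generate a larger number of candidate features in the same spirit, one option is to compute several statistics per client at once. The sketch below uses a small made-up stand-in for the `bureau` table, and the `bureau_` prefix in the new column names is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the bureau table; the values are made up.
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "DAYS_CREDIT": [-500, -200, -100, -300, -800],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0, 1500.0, 4000.0],
})

# Compute several statistics per client for each numeric column.
agg = bureau_toy.groupby("SK_ID_CURR").agg(["mean", "max", "min"])

# Flatten the two-level column index into names such as
# "bureau_DAYS_CREDIT_mean".
agg.columns = [f"bureau_{col}_{stat}" for col, stat in agg.columns]
agg = agg.reset_index()

print(agg)
```

The resulting frame has one row per client and can be merged into the training data on `SK_ID_CURR` in the same way as `previous_loan_counts`.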



# Feature engineering techniques
## Among feature engineering methods, we can list some of the most common practices:

1. Missing values
2. One-Hot encoding
3. Extreme cases (visualize the data)
4. Bucketing
5. Apply a log function
6. Row aggregation 
7. Data manipulation
8. Embedding


1. Missing values are one of the most common problems you encounter when preparing data for machine learning. They may be caused by human error, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, missing values affect the performance of machine learning models.
Most machine learning algorithms do not accept missing data, so we have to handle it. The simplest solution is to drop the rows or columns in which a large share of the values is missing, for example 75% or more.

In [5]:
# Work on a copy of the training dataframe loaded earlier
data = train.copy()
threshold = 0.75

# Drop columns whose missing-value rate is at or above the threshold
data = data.loc[:, data.isnull().mean() < threshold]

# Drop rows whose missing-value rate is at or above the threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
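Items 2, 4, and 5 from the list of techniques (one-hot encoding, bucketing, and applying a log function) can be sketched in a few lines of pandas. This is only an illustration under assumptions: the column names mirror the application table, but the rows, the bin edges, and the bucket labels are made up.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "AMT_INCOME_TOTAL": [67500.0, 135000.0, 270000.0],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["NAME_CONTRACT_TYPE"], prefix="CONTRACT")

# Bucketing: map a continuous value into ordered bins.
# The bin edges and labels here are arbitrary assumptions.
df["income_bucket"] = pd.cut(
    df["AMT_INCOME_TOTAL"],
    bins=[0, 100_000, 200_000, np.inf],
    labels=["low", "medium", "high"],
)

# Log transform: compress a right-skewed distribution.
# log1p computes log(1 + x), which stays defined at x = 0.
df["income_log"] = np.log1p(df["AMT_INCOME_TOTAL"])

# Attach the indicator columns to the original frame.
df = pd.concat([df, one_hot], axis=1)

print(df)
```

Which of these transformations actually helps depends on the model: tree-based models are largely insensitive to a log transform, while linear models often benefit from it.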


In [11]:
!pip install featuretools


In [15]:
import featuretools as ft

ImportError: cannot import name 'future_set_exc_info'