## **Automatic Feature Engineering**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Running this (by clicking run or pressing Shift+Enter) will list the files in the data directory
import os
print(os.listdir("data"))

['train_u6lujuX_CVtuZ9i.csv', 'Datasets-master', 'test_ds.csv', 'nyc-social-media-usage', '.ipynb_checkpoints', 'max_ent.RDS', 'nyc-social-media-usage.csv']


## Automatic Feature Creation using featuretools:

In [2]:
#!pip install featuretools

In [3]:
import featuretools as ft
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
customers_df.head()

Unnamed: 0,customer_id,zip_code,join_date,date_of_birth
0,1,60091,2011-04-17 10:48:33,1994-07-18
1,2,13244,2012-04-15 23:31:04,1986-08-18
2,3,13244,2011-08-13 15:42:34,2003-11-21
3,4,60091,2011-04-08 20:08:14,2006-08-15
4,5,60091,2010-07-17 05:27:50,1984-07-28


In [4]:
sessions_df = data['sessions']
transactions_df = data["transactions"]
transactions_df.head(5)

Unnamed: 0,transaction_id,session_id,transaction_time,product_id,amount
0,298,1,2014-01-01 00:00:00,5,127.64
1,2,1,2014-01-01 00:01:05,2,109.48
2,308,1,2014-01-01 00:02:10,3,95.06
3,116,1,2014-01-01 00:03:15,4,78.92
4,371,1,2014-01-01 00:04:20,3,31.54


In [5]:
# Create new entityset
es = ft.EntitySet(id = 'customers')

In [6]:
# Create an entity from the customers dataframe.
# Dataframes can be added to the entityset in any order.
# To add a dataframe to an existing entityset, call entity_from_dataframe.

## PARAMETERS

#* entity_id: The name of the entity; here, "customers".
#* dataframe: The dataframe to add; here, customers_df.
#* index: The column that holds the primary key of the table.
#* time_index: The first time at which any information from a row can be used. For customers, it is the joining date. For transactions, it is the transaction time.
#* variable_types: Overrides the inferred type of a variable. Here zip_code should be treated as a ZIP code rather than a plain number, so it is set explicitly.

#These are the different variable types we could use:
#[featuretools.variable_types.variable.Datetime,
# featuretools.variable_types.variable.Numeric,
# featuretools.variable_types.variable.Timedelta,
# featuretools.variable_types.variable.Categorical,
# featuretools.variable_types.variable.Text,
# featuretools.variable_types.variable.Ordinal,
# featuretools.variable_types.variable.Boolean,
# featuretools.variable_types.variable.LatLong,
# featuretools.variable_types.variable.ZIPCode,
# featuretools.variable_types.variable.IPAddress,
# featuretools.variable_types.variable.EmailAddress,
# featuretools.variable_types.variable.URL,
# featuretools.variable_types.variable.PhoneNumber,
# featuretools.variable_types.variable.DateOfBirth,
# featuretools.variable_types.variable.CountryCode,
# featuretools.variable_types.variable.SubRegionCode,
# featuretools.variable_types.variable.FilePath]

In [7]:
es = es.entity_from_dataframe(entity_id="customers",
                              dataframe=customers_df,
                              index="customer_id",
                              time_index="join_date",
                              variable_types={"zip_code": ft.variable_types.ZIPCode})

In [8]:
# Transactions
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=transactions_df,
                              index="transaction_id",
                              time_index="transaction_time",
                              variable_types={"product_id": ft.variable_types.Categorical})

In [9]:
# Sessions
es = es.entity_from_dataframe(entity_id="sessions",
                              dataframe=sessions_df,
                              index="session_id",
                              time_index="session_start")
es

Entityset: customers
  Entities:
    customers [Rows: 5, Columns: 4]
    transactions [Rows: 500, Columns: 5]
    sessions [Rows: 35, Columns: 4]
  Relationships:
    No relationships

## **Relationships**

All three dataframes are now in the entityset, but it has no relationships yet. By relationships, I mean that the entityset doesn't yet know that customer_id in customers_df and customer_id in sessions_df are the same column.
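As a rough illustration of what such a relationship means, here is a minimal pandas sketch. The two miniature tables are hypothetical stand-ins for customers_df and sessions_df, and the checks are only one plausible way to express a parent-child link by hand; this is not how featuretools stores relationships internally.

```python
import pandas as pd

# Hypothetical miniature versions of the customers and sessions tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "zip_code": ["60091", "13244", "13244"]})
sessions = pd.DataFrame({"session_id": [10, 11, 12, 13],
                         "customer_id": [2, 1, 2, 3],
                         "device": ["desktop", "mobile", "tablet", "mobile"]})

# A parent-child relationship needs a unique key on the parent side ...
parent_key_unique = customers["customer_id"].is_unique
# ... and every child key should point at an existing parent row.
children_have_parent = sessions["customer_id"].isin(customers["customer_id"]).all()
print(parent_key_unique, children_have_parent)

# Once the link is known, child rows can be aggregated or joined per parent.
sessions_per_customer = sessions.groupby("customer_id").size()
joined = sessions.merge(customers, on="customer_id", how="left")
print(sessions_per_customer)
print(joined)
```

With the link declared, aggregations such as "number of sessions per customer" become well defined, which is what the feature generation step relies on.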

In [10]:
cust_relationship = ft.Relationship(es["customers"]["customer_id"],
                       es["sessions"]["customer_id"])

# Add the relationship to the entity set
es = es.add_relationship(cust_relationship)

In [11]:
sess_relationship = ft.Relationship(es["sessions"]["session_id"],
                       es["transactions"]["session_id"])

# Add the relationship to the entity set
es = es.add_relationship(sess_relationship)

In [12]:
es

Entityset: customers
  Entities:
    customers [Rows: 5, Columns: 4]
    transactions [Rows: 500, Columns: 5]
    sessions [Rows: 35, Columns: 4]
  Relationships:
    sessions.customer_id -> customers.customer_id
    transactions.session_id -> sessions.session_id

## **Create Features**

In [13]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      max_depth=3)  # three-level aggregation
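To make the generated feature names less opaque, here is a minimal pandas sketch of what a stacked aggregation such as SUM(sessions.MAX(transactions.amount)) and a direct aggregation such as SUM(transactions.amount) compute. The tiny tables and the column names (txn_amount_max, sum_of_session_max, and so on) are made up for illustration, and this is not how featuretools is implemented internally.

```python
import pandas as pd

# Hypothetical toy data: two customers, three sessions, five transactions.
sessions = pd.DataFrame({"session_id": [1, 2, 3],
                         "customer_id": [1, 1, 2]})
transactions = pd.DataFrame({"transaction_id": [1, 2, 3, 4, 5],
                             "session_id": [1, 1, 2, 3, 3],
                             "amount": [10.0, 20.0, 30.0, 5.0, 15.0]})

# First level: aggregate each session's transactions.
per_session = (transactions.groupby("session_id")["amount"]
               .agg(["sum", "mean", "max"])
               .add_prefix("txn_amount_")
               .reset_index())
sessions_feat = sessions.merge(per_session, on="session_id", how="left")

# Stacked level: aggregate the per-session aggregates for each customer,
# analogous to SUM(sessions.MAX(transactions.amount)).
stacked = sessions_feat.groupby("customer_id").agg(
    count_sessions=("session_id", "count"),
    sum_of_session_max=("txn_amount_max", "sum"),
    mean_of_session_mean=("txn_amount_mean", "mean"),
)

# Direct aggregation across both relationships,
# analogous to SUM(transactions.amount).
txn_with_customer = transactions.merge(sessions, on="session_id", how="left")
direct = (txn_with_customer.groupby("customer_id")["amount"]
          .sum()
          .rename("sum_txn_amount"))

customer_features = stacked.join(direct)
print(customer_features)
```

Each extra level of stacking multiplies the number of candidate features, which is why a small max_depth already yields a wide feature matrix.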

In [14]:
feature_matrix

customer_id,zip_code,COUNT(sessions),NUM_UNIQUE(sessions.device),MODE(sessions.device),SUM(transactions.amount),STD(transactions.amount),MAX(transactions.amount),SKEW(transactions.amount),MIN(transactions.amount),MEAN(transactions.amount),...,MEAN(sessions.NUM_UNIQUE(transactions.WEEKDAY(transaction_time))),MEAN(sessions.NUM_UNIQUE(transactions.YEAR(transaction_time))),NUM_UNIQUE(sessions.MODE(transactions.WEEKDAY(transaction_time))),NUM_UNIQUE(sessions.MODE(transactions.MONTH(transaction_time))),NUM_UNIQUE(sessions.MODE(transactions.DAY(transaction_time))),NUM_UNIQUE(sessions.MODE(transactions.YEAR(transaction_time))),MODE(sessions.MODE(transactions.WEEKDAY(transaction_time))),MODE(sessions.MODE(transactions.MONTH(transaction_time))),MODE(sessions.MODE(transactions.DAY(transaction_time))),MODE(sessions.MODE(transactions.YEAR(transaction_time)))
5,60091,6,3,mobile,6349.66,44.09563,149.02,-0.025941,7.55,80.375443,...,1,1,1,1,1,1,2,1,1,2014
4,60091,8,3,mobile,8727.68,45.068765,149.95,-0.036348,5.73,80.070459,...,1,1,1,1,1,1,2,1,1,2014
1,60091,8,3,mobile,9025.62,40.442059,139.43,0.019698,5.81,71.631905,...,1,1,1,1,1,1,2,1,1,2014
3,13244,6,3,desktop,6236.62,43.683296,149.15,0.41823,5.89,67.06043,...,1,1,1,1,1,1,2,1,1,2014
2,13244,7,3,desktop,7200.28,37.705178,146.81,0.098259,8.73,77.422366,...,1,1,1,1,1,1,2,1,1,2014


In [15]:
len(feature_defs)

117

In [16]:
feature_defs

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: NUM_UNIQUE(sessions.device)>,
 <Feature: MODE(sessions.device)>,
 <Feature: SUM(transactions.amount)>,
 <Feature: STD(transactions.amount)>,
 <Feature: MAX(transactions.amount)>,
 <Feature: SKEW(transactions.amount)>,
 <Feature: MIN(transactions.amount)>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: COUNT(transactions)>,
 <Feature: NUM_UNIQUE(transactions.product_id)>,
 <Feature: MODE(transactions.product_id)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: DAY(join_date)>,
 <Feature: YEAR(date_of_birth)>,
 <Feature: YEAR(join_date)>,
 <Feature: MONTH(date_of_birth)>,
 <Feature: MONTH(join_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: WEEKDAY(join_date)>,
 <Feature: SUM(sessions.MAX(transactions.amount))>,
 <Feature: SUM(sessions.MIN(transactions.amount))>,
 <Feature: SUM(sessions.SKEW(transactions.amount))>,
 <Feature: SUM(sessions.MEAN(transactions.amount))>,
 <Feature: SUM(sessions.NUM_UNIQUE(transactions.produc

# **Let's talk about categorical features**

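The feature matrix above still holds categorical variables as raw values (for example, MODE(sessions.device) is mobile or desktop), so they have to be encoded separately before most models can use them.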

In [17]:
sessions_df.head()

Unnamed: 0,session_id,customer_id,device,session_start
0,1,2,desktop,2014-01-01 00:00:00
1,2,5,mobile,2014-01-01 00:17:20
2,3,4,mobile,2014-01-01 00:28:10
3,4,1,mobile,2014-01-01 00:44:25
4,5,4,mobile,2014-01-01 01:11:30


## One hot encoding

In [18]:
pd.get_dummies(sessions_df['device'],drop_first=True).head()

Unnamed: 0,mobile,tablet
0,0,0
1,1,0
2,1,0
3,1,0
4,1,0
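Row 0 above is all zeros because drop_first=True drops the first level, desktop, which then becomes the implicit baseline. The following sketch, on a small hypothetical device column, compares the full and the reduced encodings; dtype=int is passed only so the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical device column with the three levels seen in sessions_df.
device = pd.Series(["desktop", "mobile", "mobile", "tablet"], name="device")

full = pd.get_dummies(device, dtype=int)                      # one column per level
reduced = pd.get_dummies(device, drop_first=True, dtype=int)  # first level dropped

print(list(full.columns))
print(list(reduced.columns))

# Rows where every remaining dummy is 0 belong to the dropped level.
baseline_rows = device[reduced.sum(axis=1) == 0].tolist()
print(baseline_rows)
```

Dropping one level avoids perfectly collinear columns, which matters mostly for linear models.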


## Ordinal encoding

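Ordinal encoding is appropriate when there is a natural order between the categories. Each category is replaced by an integer that reflects its position in that order.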

In [19]:
map_dict = {'mobile':0,'tablet':1,'desktop':2}
def map_values(x):
    return map_dict[x]
sessions_df['device'] = sessions_df['device'].apply(map_values)

sessions_df.head()

Unnamed: 0,session_id,customer_id,device,session_start
0,1,2,2,2014-01-01 00:00:00
1,2,5,0,2014-01-01 00:17:20
2,3,4,0,2014-01-01 00:28:10
3,4,1,0,2014-01-01 00:44:25
4,5,4,0,2014-01-01 01:11:30
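As an alternative sketch, pandas ordered categoricals can carry the same mapping without a hand-written dictionary. The order mobile < tablet < desktop is taken from map_dict above and is an assumption of this example, not an inherent property of the data.

```python
import pandas as pd

# Hypothetical device column with raw string values.
device = pd.Series(["desktop", "mobile", "mobile", "tablet"])

# Assumed order for illustration only: mobile < tablet < desktop.
order = ["mobile", "tablet", "desktop"]
device_cat = pd.Categorical(device, categories=order, ordered=True)

# .codes gives the position of each value in the declared order;
# values that are not in `order` would get the code -1.
codes = pd.Series(device_cat.codes, index=device.index)
print(codes.tolist())
```

Unlike the dictionary lookup, this does not raise on an unexpected value, so it is worth checking for -1 codes afterwards.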


## LabelEncoder

A label encoder maps each distinct value in the column to an integer from 0 to n_classes - 1. scikit-learn's LabelEncoder assigns these integers in sorted order of the distinct values, not in order of first appearance.

In [20]:
from sklearn.preprocessing import LabelEncoder
# create a labelencoder object
le = LabelEncoder()
# fit and transform on the data
sessions_df['device_le'] = le.fit_transform(sessions_df['device'])
sessions_df.head()

Unnamed: 0,session_id,customer_id,device,session_start,device_le
0,1,2,2,2014-01-01 00:00:00,2
1,2,5,0,2014-01-01 00:17:20,0
2,3,4,0,2014-01-01 00:28:10,0
3,4,1,0,2014-01-01 00:44:25,0
4,5,4,0,2014-01-01 01:11:30,0
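Because the device column was already mapped to 0, 1 and 2 in the previous step, device_le simply repeats those values. The sketch below runs LabelEncoder on a small hypothetical list of raw strings instead, to show that the codes follow the sorted order of the classes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw values; "tablet" appears first but sorts last.
devices = ["tablet", "desktop", "mobile", "tablet"]

le = LabelEncoder()
codes = le.fit_transform(devices)

# classes_ holds the distinct values in sorted order;
# the code of a value is its position in classes_.
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps codes back to the original labels.
print(list(le.inverse_transform([0, 1, 2])))
```

Since the integers only reflect alphabetical order, they should not be read as a meaningful ranking of the categories.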


## BinaryEncoder

BinaryEncoder is another method for encoding categorical variables. It works well when a column has many levels. One-hot encoding a column with 1024 levels takes 1023 columns (with the first level dropped), whereas binary encoding needs only about ten, since 2^10 = 1024. The exact number of columns can vary by implementation.

In [23]:
#!pip install category_encoders

In [25]:
from category_encoders.binary import BinaryEncoder
# create a Binaryencoder object
be = BinaryEncoder(cols = ['device'])
# fit and transform on the data
players = be.fit_transform(sessions_df)

In [27]:
players.head()

Unnamed: 0,session_id,customer_id,device_0,device_1,device_2,session_start,device_le
0,1,2,0,0,1,2014-01-01 00:00:00,2
1,2,5,0,1,0,2014-01-01 00:17:20,0
2,3,4,0,1,0,2014-01-01 00:28:10,0
3,4,1,0,1,0,2014-01-01 00:44:25,0
4,5,4,0,1,0,2014-01-01 01:11:30,0
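The idea behind binary encoding can be sketched by hand: give each level an integer code and write that code in binary, one column per bit. The helper binary_encode below is hypothetical and uses zero-based codes, so its column count may differ from what category_encoders produces (the output above uses three columns for three levels).

```python
import pandas as pd

def binary_encode(values: pd.Series, prefix: str) -> pd.DataFrame:
    """Encode each category as the binary digits of a zero-based integer code."""
    # codes run from 0 to n_levels - 1, in order of first appearance
    codes, uniques = pd.factorize(values)
    # number of bits needed to represent the largest code
    n_bits = max(1, (len(uniques) - 1).bit_length())
    # extract one bit per column, most significant bit first
    bits = {f"{prefix}_{i}": (codes >> (n_bits - 1 - i)) & 1
            for i in range(n_bits)}
    return pd.DataFrame(bits, index=values.index)

# Hypothetical column with 1024 distinct levels.
levels = pd.Series([f"level_{i}" for i in range(1024)])
encoded = binary_encode(levels, "lvl")
print(encoded.shape)

# Three levels fit in two bits under this zero-based scheme.
small = binary_encode(pd.Series(["desktop", "mobile", "tablet"]), "device")
print(small)
```

Every level still gets a distinct row pattern, but the columns no longer have a per-level interpretation as they do with one-hot encoding.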


## HashingEncoder
One can think of the hashing encoder as a black-box function that converts a string to a number between 0 and some prespecified value.

In [28]:
from category_encoders.hashing import HashingEncoder
# create a HashingEncoder object
he = HashingEncoder(cols = ['device'])
# fit and transform on the data
players = he.fit_transform(sessions_df)

In [29]:
players.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,session_id,customer_id,session_start,device_le
0,0,0,0,0,1,0,0,0,1,2,2014-01-01 00:00:00,2
1,0,0,1,0,0,0,0,0,2,5,2014-01-01 00:17:20,0
2,0,0,1,0,0,0,0,0,3,4,2014-01-01 00:28:10,0
3,0,0,1,0,0,0,0,0,4,1,2014-01-01 00:44:25,0
4,0,0,1,0,0,0,0,0,5,4,2014-01-01 01:11:30,0
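A minimal sketch of the hashing idea, assuming eight buckets to match the eight col_ columns above. The helper hash_bucket and the choice of MD5 are assumptions made for this illustration and are not necessarily what category_encoders does internally.

```python
import hashlib
import pandas as pd

def hash_bucket(value: str, n_buckets: int = 8) -> int:
    """Map a string to a bucket index in [0, n_buckets) with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical device column with raw string values.
device = pd.Series(["desktop", "mobile", "mobile", "tablet"])
buckets = device.map(hash_bucket)

# One indicator column per bucket; each row has exactly one 1.
hashed = pd.DataFrame(
    {f"col_{b}": (buckets == b).astype(int) for b in range(8)},
    index=device.index,
)
print(hashed)
```

The number of columns is fixed in advance and does not grow with the number of levels, at the cost of possible collisions where two different values share a bucket.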


## **Target/Mean Encoding**
Target encoding replaces each categorical value with the mean of the target variable over the rows that have that value.
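A minimal pandas sketch of this, assuming a hypothetical binary target column named converted. The category means are learned on the training rows only, and categories that were never seen fall back to the global mean; both choices are common practice rather than anything prescribed by the references below.

```python
import pandas as pd

# Hypothetical training data with a binary target.
train = pd.DataFrame({
    "device": ["mobile", "mobile", "desktop", "desktop", "tablet", "mobile"],
    "converted": [1, 0, 1, 1, 0, 1],
})

# Mean of the target per category, learned from training rows only.
category_means = train.groupby("device")["converted"].mean()
global_mean = train["converted"].mean()

train["device_te"] = train["device"].map(category_means)

# Unseen categories at prediction time fall back to the global mean.
test = pd.DataFrame({"device": ["mobile", "smart_tv"]})
test["device_te"] = test["device"].map(category_means).fillna(global_mean)

print(train)
print(test)
```

Because the encoding uses the target itself, computing it on the same rows a model is evaluated on leaks information; in practice it is usually combined with cross-validation folds or smoothing.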

**References:**
* http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html
* https://www.kdnuggets.com/2019/06/hitchhikers-guide-feature-extraction.html

# Other Examples of Feature Engineering

* https://becominghuman.ai/good-feature-building-techniques-tricks-for-kaggle-my-kaggle-code-repository-c953b934f1e6
* https://www.kdnuggets.com/2019/06/hitchhikers-guide-feature-extraction.html