# Automated Feature Engineering with Featuretools

_Automated feature engineering aims to help the data scientist by automatically creating many candidate features out of a dataset from which the best can be selected and used for training._ In this notebook we use [Featuretools](https://docs.featuretools.com/). 

Featuretools is designed to generate features from relational datasets. Let's use Featuretools to engineer features from the Home Credit dataset.

In [4]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from quilt.data.avare import homecredit
import featuretools as ft
import pandas as pd
pd.set_option('display.max_columns', 125)

# Overview: Featuretools

To use Featuretools, we must encode the data types for each table and relationships among the tables. The tables and relationships are depicted here:

![featuretools.png](attachment:featuretools.png)

# Load

In [5]:
from quilt.data.avare import homecredit

# Validate Data Types

Featuretools infers types using pandas.

Sometimes the inferred types are not what you expect. We could avoid this problem by passing the types as an argument to `read_csv()`, but we did not use `read_csv()` with Quilt.

To handle the mismatch, we perform a bit of housekeeping: we build an explicit type mapping and use it later with Featuretools.
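
As a minimal sketch of this housekeeping, assuming a toy table and a hypothetical description frame with the same `Row`/`Type` layout as the metadata file (all values below are made up), the declared types can be applied in one `astype` call:

```python
import pandas as pd

# Toy stand-in for one Home Credit table (values are made up for illustration).
toy = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "FLAG_OWN_CAR": ["Y", "N", "Y"],
    "CNT_CHILDREN": [0, 2, 1],
})

# pandas infers an integer dtype for the ID column, although it is really a key.
print(toy.dtypes)

# Hypothetical description table: one row per column with its declared type.
toy_description = pd.DataFrame({
    "Row": ["SK_ID_CURR", "FLAG_OWN_CAR", "CNT_CHILDREN"],
    "Type": ["object", "object", "float64"],
})

# Build a {column: dtype} mapping and apply it to the whole frame.
dtype_map = dict(zip(toy_description["Row"], toy_description["Type"]))
typed = toy.astype(dtype_map)

print(typed.dtypes)
```

The original frame is left untouched; `astype` returns a new frame with the overridden dtypes.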

In [10]:
# Read the metadata file; white space in some columns was removed manually.
description = pd.read_excel('data/HomeCredit_columns_description.xlsx', sheet_name='Sheet1',usecols=[2,3,4])
description.head()

Unnamed: 0,Table,Row,Type
0,application_train,SK_ID_CURR,categorical
1,application_train,TARGET,categorical
2,application_train,NAME_CONTRACT_TYPE,categorical
3,application_train,CODE_GENDER,categorical
4,application_train,FLAG_OWN_CAR,categorical


In [36]:
overlap_df = frames["application_train"].merge(frames["bureau"], on="SK_ID_CURR", how='inner')
skids = pd.unique(overlap_df['SK_ID_CURR'])
lst = skids.tolist()

In [37]:
import random
random.seed(a=1)
# draw 100 IDs without replacement from the overlapping keys
asample = random.sample(lst, 100)
print(asample)

[195194, 144723, 280662, 183553, 451266, 418732, 434780, 369080, 248882, 166482, 445821, 120195, 376237, 406668, 101532, 415911, 288687, 262147, 172446, 325143, 121788, 115925, 118138, 106586, 370211, 253524, 398930, 120688, 257177, 410455, 451530, 265179, 345090, 263619, 255085, 425761, 305341, 115329, 394719, 170882, 231828, 310065, 185667, 335953, 455010, 398924, 234608, 315115, 301313, 454019, 378747, 124594, 440405, 272006, 386278, 393387, 222805, 360173, 365589, 161258, 411244, 176496, 216192, 378688, 362593, 447133, 121073, 432651, 130891, 318797, 378898, 220890, 219680, 456030, 260840, 108766, 241428, 264554, 386422, 343759, 350450, 425464, 290781, 104080, 372000, 191528, 245662, 401768, 139790, 441021, 358510, 241656, 392740, 443758, 352933, 393463, 345363, 101134, 334822, 424704]
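
A small sketch of why the seed matters here, assuming a hypothetical pool of IDs in place of the real `SK_ID_CURR` values: two generators seeded identically draw the same sample, which keeps the subset stable across notebook runs.

```python
import random

# Hypothetical ID pool standing in for the list of overlapping SK_ID_CURR values.
id_pool = list(range(100000, 100500))

# Dedicated Random instances keep the seed local instead of reseeding the global generator.
rng_a = random.Random(1)
rng_b = random.Random(1)

sample_a = rng_a.sample(id_pool, 100)
sample_b = rng_b.sample(id_pool, 100)

# Same seed and same population give the same draw; sampling is without replacement.
print(sample_a == sample_b)
print(len(set(sample_a)))
```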


In [38]:
# rename types in data description to python types
python_cat_dtype = 'object'
python_num_dtype = 'float64'
description.replace('categorical', python_cat_dtype, inplace=True)
description.replace('numerical', python_num_dtype, inplace=True)

merged = {}
frames = {}
lst = ['POS_CASH_balance','application_train','bureau','bureau_balance',
       'credit_card_balance',
       'installments_payments','previous_application']

for key in lst:
    
    print(key)
    
    df = homecredit[key]().copy(deep=True)
     
    if ( key == 'previous_application' ):
        dropcols = ['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY']
        df.drop(dropcols, axis=1, inplace=True)
    
    # select types for the target cols
    types = description[(description.Table == key)]
    
    # select the target columns 
    targetcols = pd.DataFrame(df.columns, columns=['Row'])
    
    #print(targetcols)
    #print(types)
    
    # perform join:
    targetcols = targetcols.merge(types, how='left')

    #print(merged)
    
    # batch override inferred categoricals
    catcols = targetcols.loc[(targetcols.Type == python_cat_dtype),'Row'].values.tolist()
    df[catcols] = df[catcols].astype(python_cat_dtype)

    # batch override inferred numericals
    numcols = targetcols.loc[(targetcols.Type == python_num_dtype),'Row'].values.tolist()
    df[numcols] = df[numcols].astype(python_num_dtype)

    frames[key] = df
    merged[key] = targetcols

POS_CASH_balance
application_train
bureau
bureau_balance
credit_card_balance
installments_payments
previous_application


# Merge and Subsample

Ensure that every subsampled table is aligned with the primary-key table (`application_train`), which contains the target.
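
The idea can be sketched on toy data (made-up values, hypothetical variable names): sample keys only from the overlap of parent and child, as the inner merge earlier in the notebook does, then filter both tables with the same key list.

```python
import pandas as pd

# Toy parent table (holds the target) and child table; values are made up.
parent = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4], "TARGET": [0, 1, 0, 0]})
child = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12, 13, 14],
    "SK_ID_CURR": [1, 1, 2, 4, 5],
})

# Keys present in both tables.
overlap_ids = sorted(set(parent["SK_ID_CURR"]) & set(child["SK_ID_CURR"]))

# Deterministic pick for illustration only; the notebook samples at random.
sampled_ids = overlap_ids[:2]

# Filter parent and child with the same key list.
parent_sub = parent.loc[parent["SK_ID_CURR"].isin(sampled_ids)]
child_sub = child.loc[child["SK_ID_CURR"].isin(sampled_ids)]

# Each sampled parent row should keep at least one child row.
children_per_parent = child_sub.groupby("SK_ID_CURR").size()
print(children_per_parent)
```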

## EntitySet

An [EntitySet](https://docs.featuretools.com/generated/featuretools.EntitySet.entity_from_dataframe.html#featuretools-entityset-entity-from-dataframe) represents a set of database tables, as shown in the image above.


In [39]:
# create an entity set
es = ft.EntitySet(id="homecredit_data")

## helper function: create a dictionary of Featuretools variable types
def as_dict_featuretools(df):
    
    # df has two columns: Row(column name) Type (a python dtype)
    categorical = 'object'
    numeric = 'float64'

    # rename types
    df.replace(numeric, ft.variable_types.Numeric, inplace=True)
    df.replace(categorical, ft.variable_types.Categorical, inplace=True)

    # convert to dict
    tuples = dict(zip(df.Row.values, df.Type.values))
    return tuples
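
Note that the helper above replaces values in `df` in place, so the frames stored in `merged` change after the first call. A non-mutating variant might look like the sketch below. The function name `as_type_dict` and the string labels are hypothetical placeholders for the `ft.variable_types` classes, chosen so the sketch runs without Featuretools installed. Dropping columns with no declared type is a design choice of this sketch, not something the notebook does.

```python
import pandas as pd

# Placeholder labels standing in for ft.variable_types.Numeric / Categorical.
TYPE_LABELS = {"float64": "Numeric", "object": "Categorical"}

def as_type_dict(df, labels=TYPE_LABELS):
    """Return {column name: type label} without modifying df."""
    mapped = df["Type"].map(labels)
    # Columns whose declared type is missing or unmapped are left out of the dict.
    keep = mapped.notna()
    return dict(zip(df.loc[keep, "Row"], mapped[keep]))

# Toy type table; the last column has no declared type.
toy_types = pd.DataFrame({
    "Row": ["SK_ID_CURR", "AMT_CREDIT", "UNDOCUMENTED_COL"],
    "Type": ["object", "float64", None],
})

result = as_type_dict(toy_types)
print(result)
```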

## (A) : Application Entity

In [None]:

#x = homecredit["application_train"]()
# create a subset of the data
#x.loc[x["SK_ID_CURR"].isin(skids)]

In [42]:
table_name = 'application_train'
index = 'SK_ID_CURR'

variable_types = as_dict_featuretools(merged[table_name])

x = frames[table_name]
asubset = x.loc[x["SK_ID_CURR"].isin(asample)]

es = es.entity_from_dataframe(dataframe=asubset,
                              entity_id=table_name,
                              index=index,
                              variable_types = variable_types)

## (B) : Bureau  Entity

In [43]:
table_name = 'bureau'
index = 'SK_ID_BUREAU'

variable_types = as_dict_featuretools(merged[table_name])
 
x = frames[table_name]
asubset = x.loc[x["SK_ID_CURR"].isin(asample)]

es = es.entity_from_dataframe(dataframe=asubset,
                              entity_id=table_name,
                              index=index,
                              variable_types=variable_types)


In [44]:
es["bureau"].variables[0:20]

[<Variable: SK_ID_BUREAU (dtype = index)>,
 <Variable: SK_ID_CURR (dtype = categorical)>,
 <Variable: CREDIT_ACTIVE (dtype = categorical)>,
 <Variable: CREDIT_CURRENCY (dtype = categorical)>,
 <Variable: DAYS_CREDIT (dtype = numeric)>,
 <Variable: CREDIT_DAY_OVERDUE (dtype = numeric)>,
 <Variable: DAYS_CREDIT_ENDDATE (dtype = numeric)>,
 <Variable: DAYS_ENDDATE_FACT (dtype = numeric)>,
 <Variable: AMT_CREDIT_MAX_OVERDUE (dtype = numeric)>,
 <Variable: CNT_CREDIT_PROLONG (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM_DEBT (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM_LIMIT (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM_OVERDUE (dtype = numeric)>,
 <Variable: CREDIT_TYPE (dtype = categorical)>,
 <Variable: DAYS_CREDIT_UPDATE (dtype = numeric)>,
 <Variable: AMT_ANNUITY (dtype = numeric)>]

## (C) : Bureau Balance Entity

In [157]:
frames['bureau_balance'].head()

Unnamed: 0,INDEX,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
4044793,0,5897855.0,-53.0,X
22812152,1,5992461.0,-83.0,0
25741451,2,6112224.0,-61.0,0
24716366,3,5552491.0,-61.0,X
899375,4,6591883.0,-29.0,C


In [119]:
table_name = 'bureau_balance'
index = 'INDEX'

variable_types = as_dict_featuretools(merged[table_name])
es = es.entity_from_dataframe(dataframe=frames[table_name],
                              entity_id=table_name,
                              index=index,
                              make_index=True,
                              variable_types=variable_types)

print(len(variable_types))

4


In [175]:
table_name = 'previous_application'
index = 'SK_ID_PREV'

variable_types = as_dict_featuretools(merged[table_name])
es = es.entity_from_dataframe(dataframe=frames[table_name],
                              entity_id=table_name,
                              index=index,
                              variable_types=variable_types)

print(len(variable_types))

35


# Relations

In [45]:
## Relation A-B
new_relationship = ft.Relationship(es["application_train"]["SK_ID_CURR"],
                                    es["bureau"]["SK_ID_CURR"])
es = es.add_relationship(new_relationship)

## Relation B-C
#new_relationship = ft.Relationship(es["bureau"]["SK_ID_BUREAU"],
#es["bureau_balance"]["INDEX"])
es

Entityset: homecredit_data
  Entities:
    application_train [Rows: 100, Columns: 122]
    bureau [Rows: 609, Columns: 17]
  Relationships:
    bureau.SK_ID_CURR -> application_train.SK_ID_CURR
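
Before declaring a parent-child relationship, it can help to confirm that the data really is one-to-many. The following is a pandas-only sketch on made-up data; `check_one_to_many` is a hypothetical helper, not part of Featuretools.

```python
import pandas as pd

def check_one_to_many(parent, child, key):
    """Return (parent_key_is_unique, child_keys_missing_from_parent)."""
    parent_unique = parent[key].is_unique
    missing = sorted(set(child[key].dropna()) - set(parent[key]))
    return parent_unique, missing

# Toy tables; values are made up for illustration.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau_rows = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12],
    "SK_ID_CURR": [1, 1, 9],   # 9 has no matching application
})

unique_ok, missing_keys = check_one_to_many(applications, bureau_rows, "SK_ID_CURR")
print(unique_ok, missing_keys)
```

A unique parent key and an empty list of missing keys suggest the relationship is safe to add.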

In [46]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="application_train")
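
Conceptually, for a single parent-child relationship, the aggregation columns in the resulting matrix (such as `MEAN(bureau.AMT_CREDIT_SUM)` or `COUNT(bureau)`) resemble a grouped aggregation of the child table joined back to the parent. The sketch below reproduces that idea in plain pandas on made-up data. It illustrates the output shape only and is not how Featuretools is implemented.

```python
import pandas as pd

# Toy parent and child tables; values are made up for illustration.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1000.0, 2000.0]})
bureau_rows = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
})

grouped = bureau_rows.groupby("SK_ID_CURR")

# Aggregate one numeric child column per parent key.
aggs = grouped["AMT_CREDIT_SUM"].agg(["mean", "max", "min"])

# Rename to mimic the PRIMITIVE(entity.column) naming seen in the feature matrix.
aggs.columns = [f"{name.upper()}(bureau.AMT_CREDIT_SUM)" for name in aggs.columns]

# Row count per parent, independent of any single column.
aggs["COUNT(bureau)"] = grouped.size()

# Join on the key so that every parent row is kept.
features = applications.set_index("SK_ID_CURR").join(aggs)
print(features)
```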

In [48]:
feature_matrix
#feature_matrix.loc[feature_matrix["SK_ID_CURR"].isin(asample)]

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,...,STD(bureau.CNT_CREDIT_PROLONG),STD(bureau.AMT_CREDIT_SUM),STD(bureau.AMT_CREDIT_SUM_DEBT),STD(bureau.AMT_CREDIT_SUM_LIMIT),STD(bureau.AMT_CREDIT_SUM_OVERDUE),STD(bureau.DAYS_CREDIT_UPDATE),STD(bureau.AMT_ANNUITY),MAX(bureau.DAYS_CREDIT),MAX(bureau.CREDIT_DAY_OVERDUE),MAX(bureau.DAYS_CREDIT_ENDDATE),MAX(bureau.DAYS_ENDDATE_FACT),MAX(bureau.AMT_CREDIT_MAX_OVERDUE),MAX(bureau.CNT_CREDIT_PROLONG),MAX(bureau.AMT_CREDIT_SUM),MAX(bureau.AMT_CREDIT_SUM_DEBT),MAX(bureau.AMT_CREDIT_SUM_LIMIT),MAX(bureau.AMT_CREDIT_SUM_OVERDUE),MAX(bureau.DAYS_CREDIT_UPDATE),MAX(bureau.AMT_ANNUITY),SKEW(bureau.DAYS_CREDIT),SKEW(bureau.CREDIT_DAY_OVERDUE),SKEW(bureau.DAYS_CREDIT_ENDDATE),SKEW(bureau.DAYS_ENDDATE_FACT),SKEW(bureau.AMT_CREDIT_MAX_OVERDUE),SKEW(bureau.CNT_CREDIT_PROLONG),SKEW(bureau.AMT_CREDIT_SUM),SKEW(bureau.AMT_CREDIT_SUM_DEBT),SKEW(bureau.AMT_CREDIT_SUM_LIMIT),SKEW(bureau.AMT_CREDIT_SUM_OVERDUE),SKEW(bureau.DAYS_CREDIT_UPDATE),SKEW(bureau.AMT_ANNUITY),MIN(bureau.DAYS_CREDIT),MIN(bureau.CREDIT_DAY_OVERDUE),MIN(bureau.DAYS_CREDIT_ENDDATE),MIN(bureau.DAYS_ENDDATE_FACT),MIN(bureau.AMT_CREDIT_MAX_OVERDUE),MIN(bureau.CNT_CREDIT_PROLONG),MIN(bureau.AMT_CREDIT_SUM),MIN(bureau.AMT_CREDIT_SUM_DEBT),MIN(bureau.AMT_CREDIT_SUM_LIMIT),MIN(bureau.AMT_CREDIT_SUM_OVERDUE),MIN(bureau.DAYS_CREDIT_UPDATE),MIN(bureau.AMT_ANNUITY),MEAN(bureau.DAYS_CREDIT),MEAN(bureau.CREDIT_DAY_OVERDUE),MEAN(bureau.DAYS_CREDIT_ENDDATE),MEAN(bureau.DAYS_ENDDATE_FACT),MEAN(bureau.AMT_CREDIT_MAX_OVERDUE),MEAN(bureau.CNT_CREDIT_PROLONG),MEAN(bureau.AMT_CREDIT_SUM),MEAN(bureau.AMT_CREDIT_SUM_DEBT),MEAN(bureau.AMT_CREDIT_SUM_LIMIT),MEAN(bureau.AMT_CREDIT_SUM_OVERDUE),MEAN(bureau.DAYS_CREDIT_UPDATE),MEAN(bureau.AMT_ANNUITY),COUNT(bureau),NUM_UNIQUE(bureau.CREDIT_ACTIVE),NUM_UNIQUE(bureau.CREDIT_CURRENCY),NUM_UNIQUE(bureau.CREDIT_TYPE),MODE(bureau.CREDIT_ACTIVE),MODE(bureau.CREDIT_CURRENCY),MODE(bureau.CREDIT_TYPE)
101134,0,Cash loans,F,Y,N,0.0,135000.0,450000.0,23107.5,450000.0,Unaccompanied,State servant,Secondary / secondary special,Married,House / apartment,0.004849,-15175.0,-3046.0,-3141.0,-3160.0,12.0,1,1,0,1,1,0,Managers,2.0,2,2,FRIDAY,16.0,0,0,0,0,1,1,Hotel,0.493555,0.410219,0.513694,,,,,,,,,,,,,,,,,,,,...,0.333333,1.505046e+05,150980.878492,0.000000,0.0,484.906721,12346.058156,-345.0,0.0,30983.0,-309.0,0.000,1.0,532192.500,398205.000,0.000,0.0,-4.0,21384.000,-0.460411,0.0,2.885833,-0.609373,,3.0,0.603410,1.366175,0.000000,0.0,-1.077227,1.732051,-2440.0,0.0,-1435.0,-1435.0,0.000,0.0,29295.000,0.000,0.000,0.0,-1429.0,0.000,-1215.333333,0.0,3562.111111,-777.166667,0.000000,0.111111,2.273226e+05,98521.312500,0.000000,0.0,-453.555556,7128.000,9,2,1,2,Closed,currency 1,Consumer credit
101532,0,Cash loans,M,Y,Y,1.0,225000.0,152820.0,16587.0,135000.0,Unaccompanied,Working,Higher education,Married,House / apartment,0.030755,-14814.0,-630.0,-9081.0,-3649.0,9.0,1,1,0,1,0,0,Laborers,3.0,2,2,FRIDAY,11.0,0,0,0,0,1,1,Self-employed,,0.248579,0.656158,,,,,,,,,,,,,,,,,,,,...,0.000000,3.135553e+05,256214.239992,0.000000,0.0,525.726692,,-224.0,0.0,31086.0,-648.0,0.000,0.0,878904.000,849766.500,0.000,0.0,-9.0,,-0.614623,0.0,3.604765,-1.212303,0.000000,0.0,1.402204,3.316625,0.000000,0.0,-0.124185,,-2686.0,0.0,-2593.0,-2593.0,0.000,0.0,18432.000,0.000,0.000,0.0,-1496.0,,-1272.142857,0.0,1526.642857,-1306.272727,0.000000,0.000000,2.287385e+05,77251.500000,0.000000,0.0,-680.357143,,14,2,1,2,Closed,currency 1,Consumer credit
104080,0,Cash loans,F,Y,N,0.0,90000.0,450000.0,24412.5,450000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.020246,-13846.0,-2497.0,-1907.0,-531.0,11.0,1,1,0,1,0,0,Drivers,2.0,3,3,THURSDAY,7.0,0,0,0,0,0,0,Construction,0.619078,0.285126,0.199771,,,,,,,,,,,,,,,,,,,,...,0.000000,9.782020e+04,45117.272944,0.000000,0.0,583.774104,4717.327700,-500.0,0.0,160.0,-317.0,17787.915,0.0,405803.565,150750.000,0.000,0.0,-26.0,8950.500,-0.262649,0.0,-0.591880,-0.800864,0.263532,0.0,1.490129,2.834686,0.000000,0.0,-0.580774,-0.271052,-2120.0,0.0,-1754.0,-1754.0,0.000,0.0,28309.500,0.000,0.000,0.0,-1638.0,0.000,-1203.833333,0.0,-643.916667,-878.500000,7607.649375,0.000000,1.534344e+05,17326.500000,0.000000,0.0,-704.750000,4972.500,12,2,1,2,Closed,currency 1,Consumer credit
106586,0,Cash loans,F,Y,Y,0.0,225000.0,644427.0,27301.5,576000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.030755,-15044.0,-627.0,-195.0,-421.0,14.0,1,1,0,1,0,0,Sales staff,2.0,2,2,TUESDAY,15.0,0,0,0,0,0,0,Self-employed,0.616086,0.465707,0.504681,,,,,,,,,,,,,,,,,,,,...,0.000000,1.183465e+05,24469.889864,46744.605000,0.0,482.488083,,-434.0,0.0,1348.0,-294.0,0.000,0.0,328500.000,84766.185,140233.815,0.0,-27.0,,0.353019,0.0,0.739205,1.246134,0.000000,0.0,1.446286,3.464102,3.000000,0.0,-0.501331,,-1854.0,0.0,-1779.0,-1779.0,0.000,0.0,0.000,0.000,0.000,0.0,-1773.0,,-1170.500000,0.0,-540.545455,-1200.750000,0.000000,0.000000,8.216400e+04,7063.848750,15581.535000,0.0,-752.750000,,12,2,1,2,Closed,currency 1,Consumer credit
108766,0,Cash loans,F,N,Y,0.0,202500.0,1436850.0,42142.5,1125000.0,Unaccompanied,Working,Lower secondary,Single / not married,House / apartment,0.011657,-15441.0,-2453.0,-7186.0,-3941.0,,1,1,0,1,1,0,Sales staff,1.0,1,1,TUESDAY,11.0,0,0,0,0,0,0,Self-employed,,0.209170,0.562060,,,,,,,,,,,,,,,,,,,,...,0.000000,1.417467e+05,134979.073870,0.000000,0.0,474.576033,,-490.0,0.0,854.0,-603.0,0.000,0.0,427500.000,381100.500,0.000,0.0,-16.0,,-0.924841,0.0,-0.657225,-1.197314,,0.0,-0.323067,2.421833,0.000000,0.0,-0.537472,,-2687.0,0.0,-1348.0,-1413.0,0.000,0.0,45000.000,0.000,0.000,0.0,-1297.0,,-1324.500000,0.0,-74.750000,-851.400000,0.000000,0.000000,2.498921e+05,61818.187500,0.000000,0.0,-500.875000,,8,2,1,2,Closed,currency 1,Consumer credit
115329,1,Cash loans,M,Y,N,0.0,180000.0,252000.0,20038.5,252000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,With parents,0.003069,-12056.0,-1655.0,-1044.0,-4249.0,12.0,1,1,0,1,0,0,Sales staff,2.0,3,3,WEDNESDAY,13.0,0,0,0,0,0,0,Trade: type 3,,0.512939,0.105473,,,,,,,,,,,,,,,,,,,,...,0.000000,1.003724e+05,116813.686699,,0.0,0.000000,,-15.0,0.0,593.0,,,0.0,271323.000,283680.000,,0.0,-10.0,,,,,,,,,,,,,,-138.0,0.0,532.0,,,0.0,129375.000,118480.500,,0.0,-10.0,,-76.500000,0.0,562.500000,,,0.000000,2.003490e+05,201080.250000,,0.0,-10.000000,,2,1,1,1,Active,currency 1,Consumer credit
115925,0,Revolving loans,M,N,Y,1.0,63000.0,270000.0,13500.0,270000.0,Unaccompanied,Working,Secondary / secondary special,Married,With parents,0.020246,-12243.0,-777.0,-9187.0,-1922.0,,1,1,0,1,0,0,Laborers,3.0,3,3,MONDAY,6.0,0,0,0,0,0,0,Business Entity Type 3,0.256007,0.619750,0.474051,,,,,,,,,,,,,,,,,,,,...,0.000000,3.594900e+04,41365.746699,0.000000,0.0,2.828427,,-30.0,0.0,336.0,-24.0,0.000,0.0,109339.560,58500.000,0.000,0.0,-24.0,,,,,,,,,,,,,,-754.0,0.0,-24.0,-24.0,0.000,0.0,58500.000,0.000,0.000,0.0,-28.0,,-392.000000,0.0,156.000000,-24.000000,0.000000,0.000000,8.391978e+04,29250.000000,0.000000,0.0,-26.000000,,2,2,1,1,Active,currency 1,Consumer credit
118138,0,Cash loans,F,N,Y,1.0,135000.0,765261.0,32422.5,684000.0,Unaccompanied,Commercial associate,Higher education,Married,House / apartment,0.035792,-12823.0,-100.0,-182.0,-2635.0,,1,1,0,1,1,0,Sales staff,3.0,2,2,TUESDAY,16.0,0,0,0,0,1,1,Business Entity Type 3,0.504858,0.180976,0.595456,,,,,,,,,,,,,,,,,,,,...,0.000000,1.078279e+06,691030.347919,0.000000,0.0,901.121110,,-32.0,0.0,1325.0,-187.0,8820.000,0.0,2826000.000,2095933.500,0.000,0.0,-8.0,,-0.254306,0.0,-0.158897,0.027433,2.645751,0.0,1.053620,2.838835,0.000000,0.0,-0.558881,,-2520.0,0.0,-2184.0,-2213.0,0.000,0.0,34901.550,0.000,0.000,0.0,-2213.0,,-1237.272727,0.0,-486.818182,-1204.625000,1260.000000,0.000000,8.323142e+05,281673.990000,0.000000,0.0,-896.636364,,11,2,1,3,Closed,currency 1,Consumer credit
120195,0,Cash loans,M,N,Y,1.0,225000.0,539230.5,28363.5,409500.0,Unaccompanied,Working,Higher education,Married,House / apartment,0.006852,-10525.0,-78.0,-2159.0,-2993.0,,1,1,0,1,0,0,Managers,3.0,3,3,MONDAY,8.0,0,0,0,0,0,0,Business Entity Type 3,0.168925,0.465990,0.493863,0.0351,0.0559,0.9613,0.4696,,0.00,0.1724,0.1250,0.1667,0.0146,,0.0478,,0.0000,0.0357,0.0580,0.9613,0.4904,,...,0.000000,1.038093e+06,849516.677913,3530.229754,0.0,181.081596,,-203.0,0.0,4999.0,-118.0,16308.000,0.0,2430000.000,2252376.000,9340.110,0.0,-14.0,,-1.093696,0.0,1.247336,-2.552403,0.866912,0.0,0.569876,2.644811,2.645751,0.0,-0.719915,,-2442.0,0.0,-2076.0,-2078.0,0.000,0.0,45000.000,0.000,0.000,0.0,-530.0,,-950.400000,0.0,716.555556,-509.714286,6960.375000,0.000000,9.404991e+05,326017.285714,1334.301429,0.0,-188.900000,,10,2,1,3,Closed,currency 1,Consumer credit
120688,0,Cash loans,F,N,N,2.0,315000.0,1159515.0,37534.5,1012500.0,Family,Working,Secondary / secondary special,Married,House / apartment,0.046220,-14211.0,-866.0,-4439.0,-4147.0,,1,1,0,1,0,0,Laborers,4.0,1,1,WEDNESDAY,10.0,0,1,1,0,0,0,Government,0.501227,0.746871,0.456110,0.0082,,0.9707,,,0.00,0.0690,0.0417,,0.0454,,,,,0.0084,,0.9707,,,...,0.000000,6.905936e+05,658484.900371,0.000000,0.0,973.737724,,-223.0,0.0,1603.0,-31.0,105.300,0.0,2250000.000,2082312.090,0.000,0.0,-9.0,,0.782370,0.0,1.445132,1.139669,3.162278,0.0,3.151799,3.162278,0.000000,0.0,0.748779,,-2894.0,0.0,-2682.0,-2711.0,0.000,0.0,27193.095,0.000,0.000,0.0,-2711.0,,-1780.800000,0.0,-1324.500000,-1682.000000,10.530000,0.000000,2.862819e+05,208231.209000,0.000000,0.0,-1514.600000,,10,2,1,1,Closed,currency 1,Consumer credit


In [28]:
feature_matrix.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 100145 to 456226
Data columns (total 199 columns):
NAME_CONTRACT_TYPE                     object
CODE_GENDER                            object
FLAG_OWN_CAR                           object
FLAG_OWN_REALTY                        object
CNT_CHILDREN                           float64
AMT_INCOME_TOTAL                       float64
AMT_CREDIT                             float64
AMT_ANNUITY                            float64
AMT_GOODS_PRICE                        float64
NAME_TYPE_SUITE                        object
NAME_INCOME_TYPE                       object
NAME_EDUCATION_TYPE                    object
NAME_FAMILY_STATUS                     object
NAME_HOUSING_TYPE                      object
REGION_POPULATION_RELATIVE             float64
DAYS_BIRTH                             float64
DAYS_EMPLOYED                          float64
DAYS_REGISTRATION                      float64
DAYS_ID_PUBLISH                        float64
O

Run time on all data:

CPU times: user 27min 32s, sys: 32.1 s, total: 28min 4s
Wall time: 29min 56s

Problem: missing data must be handled before features can be computed.

If we impute missing values for real-valued columns, we could introduce bias.
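
One concrete form of that bias can be seen on a toy column (made-up values): filling gaps with the observed mean leaves the mean unchanged but shrinks the spread, so aggregates such as `STD` computed afterwards are distorted. This sketch uses plain pandas and covers mean imputation only.

```python
import pandas as pd

# Toy numeric column with missing values; values are made up.
amounts = pd.Series([100.0, 200.0, None, 400.0, None])

# pandas reductions skip NaN by default, so these use the 3 observed values.
observed_mean = amounts.mean()
observed_std = amounts.std()

# Mean imputation keeps the mean but adds points with zero deviation.
imputed = amounts.fillna(observed_mean)
imputed_std = imputed.std()

print(observed_mean, observed_std, imputed_std)
```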

In [397]:
# reset
table_name = 'bureau_balance'
#frames[table_name].reset_index()
#frames[table_name].drop('INDEX', axis=1, inplace=True)
frames[table_name]

Unnamed: 0,INDEX,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,0,5715448,0,C
1,1,5715448,-1,C
2,2,5715448,-2,C
3,3,5715448,-3,C
4,4,5715448,-4,C
5,5,5715448,-5,C
6,6,5715448,-6,C
7,7,5715448,-7,C
8,8,5715448,-8,C
9,9,5715448,-9,0


In [393]:
# reset
table_name = 'bureau_balance'
homecredit[table_name]().reset_index()
homecredit[table_name]().drop('INDEX', axis=1, inplace=True)
homecredit[table_name]()

KeyError: "['INDEX'] not found in axis"
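
The error above is consistent with two pandas behaviours, shown here on a toy frame with made-up values. `reset_index()` returns a new frame rather than modifying the caller, and `drop` raises `KeyError` for a column that is absent unless `errors="ignore"` is passed. Whether each `homecredit[table_name]()` call returns a fresh frame is an assumption of this explanation, not something the notebook confirms.

```python
import pandas as pd

# Toy frame without an INDEX column; values are made up.
balance = pd.DataFrame({"SK_ID_BUREAU": [5715448, 5715448], "STATUS": ["C", "0"]})

# reset_index() returns a new frame; the result must be assigned to be kept.
with_index = balance.reset_index().rename(columns={"index": "INDEX"})

# Dropping a column that is absent raises KeyError by default.
try:
    balance.drop("INDEX", axis=1)
    raised = False
except KeyError:
    raised = True

# errors="ignore" makes the drop a no-op when the column is missing.
safe = balance.drop("INDEX", axis=1, errors="ignore")
print(raised, list(with_index.columns), list(safe.columns))
```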

In [396]:
table_name = 'bureau_balance'
homecredit[table_name]()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C
5,5715448,-5,C
6,5715448,-6,C
7,5715448,-7,C
8,5715448,-8,C
9,5715448,-9,0


In [None]:
data = ft.demo.load_mock_customer()
data['transactions'].head()

In [None]:
# specify entities
entities = {
    "application_train" : (frames["application_train"], "SK_ID_CURR"),
    "bureau" : (frames["bureau"], "SK_ID_BUREAU") 
}
# specify relations
relationships = [("application_train", "SK_ID_CURR", "bureau", "SK_ID_CURR")]

# feature matrix
feature_matrix_customers, features_defs = ft.dfs(entities=entities,
                                                 relationships=relationships,
                                                 target_entity="application_train")

In [24]:
import featuretools as ft

data = ft.demo.load_mock_customer()

transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])


transactions_df['transaction_id'].sort_values()

63       1
1        2
484      3
154      4
305      5
485      6
347      7
276      8
27       9
9       10
222     11
120     12
178     13
486     14
76      15
298     16
431     17
270     18
348     19
147     20
34      21
97      22
226     23
449     24
238     25
183     26
58      27
90      28
10      29
115     30
      ... 
362    471
462    472
429    473
263    474
62     475
404    476
335    477
424    478
372    479
450    480
395    481
454    482
437    483
285    484
108    485
5      486
230    487
299    488
421    489
148    490
420    491
116    492
488    493
261    494
13     495
172    496
470    497
244    498
46     499
472    500
Name: transaction_id, Length: 500, dtype: int64