# Automated Feature Engineering with Featuretools

_Automated feature engineering aims to help the data scientist by automatically creating many candidate features out of a dataset from which the best can be selected and used for training._ In this notebook we use [Featuretools](https://docs.featuretools.com/). 

Featuretools is designed to generate features from relational datasets. Let's use Featuretools to engineer features from the Home Credit data set.

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from quilt.data.avare import homecredit
import featuretools as ft
import pandas as pd
pd.set_option('display.max_columns', 125)

# Overview: Featuretools

To use Featuretools, we must encode the data types for each table and relationships among the tables. The tables and relationships are depicted here:

![featuretools.png](attachment:featuretools.png)
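
The same layout can be written down as plain data and checked before any Featuretools call. The sketch below is only an illustration: the key names are the ones used in the cells later in this notebook, while the `SCHEMA` dictionary layout, the `check_schema` helper, and the toy frames are hypothetical.

```python
import pandas as pd

# Index and foreign-key columns as used later in this notebook.
# The dictionary layout itself is a hypothetical convention.
SCHEMA = {
    "application_train": {"index": "SK_ID_CURR", "parents": {}},
    "bureau": {"index": "SK_ID_BUREAU",
               "parents": {"SK_ID_CURR": "application_train"}},
    "bureau_balance": {"index": "INDEX",
                       "parents": {"SK_ID_BUREAU": "bureau"}},
    "previous_application": {"index": "SK_ID_PREV",
                             "parents": {"SK_ID_CURR": "application_train"}},
}


def check_schema(frames, schema):
    """List the key columns that the schema names but a frame lacks."""
    problems = []
    for name, spec in schema.items():
        present = set(frames[name].columns)
        for col in [spec["index"], *spec["parents"]]:
            if col not in present:
                problems.append(f"{name}: missing column {col}")
    return problems


# Tiny stand-in frames with made-up values.
toy_frames = {
    "application_train": pd.DataFrame({"SK_ID_CURR": [1, 2]}),
    "bureau": pd.DataFrame({"SK_ID_BUREAU": [10], "SK_ID_CURR": [1]}),
    "bureau_balance": pd.DataFrame({"SK_ID_BUREAU": [10]}),  # no INDEX column yet
    "previous_application": pd.DataFrame({"SK_ID_PREV": [7], "SK_ID_CURR": [2]}),
}

schema_problems = check_schema(toy_frames, SCHEMA)
print(schema_problems)
```

On the toy frames the only reported problem is the missing `INDEX` column of `bureau_balance`, which has no natural unique key.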

# Load and Sample Data

In [3]:
from quilt.data.avare import homecredit

In [4]:
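frames = {}
size = 5000    # rows sampled from application_train
factor = 2     # every other table is sampled at size * factor rows
rand = 1       # random_state, for reproducible samples

for key, val in homecredit._items():

    # work on a deep copy rather than on the object returned by Quilt
    frames[key] = val().copy(deep=True)

    # drop these two columns before dropna(), which would otherwise
    # discard every row in which they are missing
    if key == 'previous_application':
        dropcols = ['RATE_INTEREST_PRIVILEGED', 'RATE_INTEREST_PRIMARY']
        frames[key].drop(dropcols, axis=1, inplace=True)

    # keep only rows without missing values
    frames[key].dropna(inplace=True)

    #print(len(frames[key]))

    if key == 'application_train':
        frames[key] = frames[key].sample(n=size, random_state=rand)
    else:
        frames[key] = frames[key].sample(n=size * factor, random_state=rand)

    #print(len(frames[key]))

# remove the label column from the application table and keep it separately
popped = frames['application_train'].pop('TARGET')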

10000
5000
10000
10000
10000
10000
10000
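
Sampling each table independently, as in the cell above, does not guarantee that a sampled `bureau` row belongs to a sampled application. One possible alternative, shown here on made-up rows with a hypothetical helper `sample_with_children`, is to sample the parent table first and keep only the child rows that reference it:

```python
import pandas as pd


def sample_with_children(parent, child, key, n, random_state=1):
    """Sample n parent rows, then keep only child rows whose key matches them."""
    parent_sample = parent.sample(n=n, random_state=random_state)
    child_sample = child[child[key].isin(parent_sample[key])]
    return parent_sample, child_sample


# Made-up stand-ins for application_train and bureau.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4, 5, 6]})
bureau = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12, 13, 14, 15],
    "SK_ID_CURR": [1, 1, 2, 4, 5, 6],
})

apps_s, bureau_s = sample_with_children(apps, bureau, "SK_ID_CURR", n=3)

# Every retained child row points at a sampled parent.
orphans = ~bureau_s["SK_ID_CURR"].isin(apps_s["SK_ID_CURR"])
print(len(apps_s), int(orphans.sum()))
```

The trade-off is that the child sample size is no longer fixed; it depends on how many rows the sampled parents happen to have.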


# Override Pandas-Inferred Data Types

Although pandas infers a dtype for every column, the inferred types need to be validated, since they may not be what we expect. Passing explicit types through the `dtype` argument of `read_csv()` would avoid the problem, but the data here is loaded through Quilt rather than with `read_csv()`.
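
For comparison, this is roughly what declaring types at load time looks like when `read_csv()` is available. The two-row CSV is made up; only the column names come from the Home Credit tables.

```python
import io

import pandas as pd

# Made-up two-row extract; column names follow the Home Credit tables.
csv_text = "SK_ID_CURR,CNT_CHILDREN,FLAG_OWN_CAR\n100145,1,Y\n100190,0,N\n"

# Without dtype, pandas infers integer types for the first two columns.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the types are fixed when the file is read.
declared = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "SK_ID_CURR": "object",
        "CNT_CHILDREN": "float64",
        "FLAG_OWN_CAR": "object",
    },
)

print(inferred.dtypes.astype(str).to_dict())
print(declared.dtypes.astype(str).to_dict())
```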

It turns out that some of the columns in the [Metadata file](data/HomeCredit_columns_description.txt) do not match, one-to-one, with the columns in the data files.

To handle the mismatch, we perform a bit of housekeeping.
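
One way to see which columns have no metadata entry is a left merge with `indicator=True`. The sketch assumes the metadata has `Table`, `Row` and `Type` columns, as the next cell does, and uses made-up rows.

```python
import pandas as pd

# Made-up stand-ins: metadata rows and the column names of one data frame.
description = pd.DataFrame({
    "Table": ["bureau", "bureau", "bureau"],
    "Row": ["SK_ID_CURR", "CREDIT_ACTIVE", "AMT_ANNUITY"],
    "Type": ["object", "object", "float64"],
})
frame_columns = ["SK_ID_CURR", "CREDIT_ACTIVE", "AMT_ANNUITY", "INDEX"]

targetcols = pd.DataFrame(frame_columns, columns=["Row"])
types = description[description.Table == "bureau"]

# indicator=True adds a _merge column; for a left join it is
# 'both' when metadata was found and 'left_only' when it was not.
checked = targetcols.merge(types, how="left", on="Row", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Row"].tolist()
print(unmatched)
```

Columns listed in `unmatched` end up with a missing `Type` after the merge, so neither batch override touches them.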

In [5]:
# read the metadata file (stray whitespace in some column names was removed manually)
description = pd.read_excel('data/HomeCredit_columns_description.xlsx', sheet_name='Sheet1',usecols=[2,3,4])

# rename types in data description to python types
python_cat_dtype = 'object'
python_num_dtype = 'float64'
description.replace('categorical', python_cat_dtype, inplace=True)
description.replace('numerical', python_num_dtype, inplace=True)

merged = {}
for table in frames.keys():
    
    df = frames[table]
    
    # select types for the target cols
    types = description[(description.Table == table)]
    # select the target columns 
    targetcols = pd.DataFrame(df.columns, columns=['Row'])
   
    # perform join:
    targetcols = targetcols.merge(types, how='left')

    merged[table] = targetcols
    
    # batch override inferred categoricals
    catcols = targetcols.loc[(targetcols.Type == python_cat_dtype),'Row'].values.tolist()
    df[catcols] = df[catcols].astype(python_cat_dtype)

    # batch override inferred numericals
    numcols = targetcols.loc[(targetcols.Type == python_num_dtype),'Row'].values.tolist()
    df[numcols] = df[numcols].astype(python_num_dtype)

    

In [217]:
#frames['application_train'].SK_ID_CURR = frames['application_train'].SK_ID_CURR.astype('category')

#frames['bureau'].SK_ID_CURR = frames['bureau'].SK_ID_CURR.astype('category')
#frames['bureau'].SK_ID_BUREAU = frames['bureau'].SK_ID_BUREAU.astype('category')

#frames['bureau_balance'].SK_ID_BUREAU = frames['bureau_balance'].SK_ID_BUREAU.astype('category')

#frames['previous_application'].SK_ID_CURR = frames['previous_application'].SK_ID_CURR.astype('category')
#frames['previous_application'].SK_ID_PREV = frames['previous_application'].SK_ID_PREV.astype('category')

  


# Entity Set and Entities

## Entity Set

An [Entity Set](https://docs.featuretools.com/generated/featuretools.EntitySet.entity_from_dataframe.html#featuretools-entityset-entity-from-dataframe) represents a set of database tables, as shown in the image above.


In [223]:
# create an entity set
es = ft.EntitySet(id="application_entity_set")

## helper function to create a dictionary of column: data type mappings
# df has two columns: Row(column name) Type (a python dtype)
def as_dict_featuretools(df):
 
    categorical = 'object'
    numeric = 'float64'

    # rename types (inplace=True modifies the DataFrame that was passed in)
    df.replace(numeric, ft.variable_types.Numeric, inplace=True)
    df.replace(categorical, ft.variable_types.Categorical, inplace=True)

    # convert to dict
    tuples = dict([*zip(df.Row.values, df.Type.values)])
    return tuples
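
Because of `inplace=True`, the helper above rewrites the `Type` column of the frame it receives. A non-mutating variant might look like the sketch below. `as_type_dict` and `TYPE_MAP` are hypothetical names, and plain strings stand in for `ft.variable_types.Numeric` and `ft.variable_types.Categorical` so that the example runs without Featuretools.

```python
import pandas as pd

# Stand-in labels; in the notebook these would be
# ft.variable_types.Numeric and ft.variable_types.Categorical.
TYPE_MAP = {"float64": "Numeric", "object": "Categorical"}


def as_type_dict(df, type_map):
    """Build {column name: mapped type} without modifying df.

    Columns whose Type is missing or unmapped are left out.
    """
    mapped = df["Type"].map(type_map)
    known = mapped.notna()
    return dict(zip(df.loc[known, "Row"], mapped[known]))


cols = pd.DataFrame({
    "Row": ["AMT_ANNUITY", "CREDIT_ACTIVE", "INDEX"],
    "Type": ["float64", "object", None],  # INDEX had no metadata entry
})

type_dict = as_type_dict(cols, TYPE_MAP)
print(type_dict)
```

Leaving unmapped columns out of the dictionary is a design choice of this sketch, not something the original helper does.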

## (A) : Application Entity

In [224]:
table_name = 'application_train'
index = 'SK_ID_CURR'

variable_types = as_dict_featuretools(merged[table_name])

es = es.entity_from_dataframe(dataframe=frames[table_name],
                              entity_id=table_name,
                              index=index,
                              variable_types = variable_types)

## (B) : Bureau Entity

In [225]:
table_name = 'bureau'
index = 'SK_ID_BUREAU'
variable_types = as_dict_featuretools(merged[table_name])
es = es.entity_from_dataframe(dataframe=frames[table_name],
                              entity_id=table_name,
                              index=index,
                              variable_types=variable_types)


In [227]:
frames['bureau'].head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
273570,190381,5058520,Closed,currency 1,-1898.0,0.0,-1716.0,-1716.0,0.0,0.0,36162.27,0.0,0.0,0.0,Consumer credit,-753.0,14400.0
130127,219289,6570959,Closed,currency 1,-1006.0,0.0,-945.0,-945.0,0.0,0.0,565326.63,0.0,0.0,0.0,Consumer credit,-945.0,0.0
784230,318510,6088973,Closed,currency 1,-397.0,0.0,-32.0,-32.0,0.0,0.0,187092.0,0.0,0.0,0.0,Consumer credit,-32.0,45629.64
1631089,298111,6693169,Closed,currency 1,-1063.0,0.0,-940.0,-1002.0,0.0,0.0,33615.0,0.0,0.0,0.0,Consumer credit,-1001.0,0.0
1181103,161713,5522443,Closed,currency 1,-1477.0,0.0,-1265.0,-1293.0,0.0,0.0,79228.8,0.0,0.0,0.0,Consumer credit,-1293.0,23382.0


## (C) : Bureau Balance Entity

In [157]:
frames['bureau_balance'].head()

Unnamed: 0,INDEX,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
4044793,0,5897855.0,-53.0,X
22812152,1,5992461.0,-83.0,0
25741451,2,6112224.0,-61.0,0
24716366,3,5552491.0,-61.0,X
899375,4,6591883.0,-29.0,C


In [174]:
table_name = 'bureau_balance'
index = 'INDEX'

variable_types = as_dict_featuretools(merged[table_name])
es = es.entity_from_dataframe(dataframe=frames[table_name],
                              entity_id=table_name,
                              index=index,
                              make_index=True,
                              variable_types=variable_types)

print(len(variable_types))

4
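
`make_index=True` asks Featuretools to create the index column itself, which suits a table such as `bureau_balance` where no existing column is unique. A rough pandas equivalent on made-up rows (an illustration, not Featuretools' implementation):

```python
import pandas as pd

# Made-up rows shaped like bureau_balance: SK_ID_BUREAU repeats,
# so it cannot serve as the index.
balance = pd.DataFrame({
    "SK_ID_BUREAU": [5715448, 5715448, 5715449],
    "MONTHS_BALANCE": [0, -1, 0],
    "STATUS": ["C", "C", "0"],
})

# A surrogate key: one unique integer per row, placed as the first column.
with_index = balance.reset_index(drop=True)
with_index.insert(0, "INDEX", range(len(with_index)))

print(with_index["INDEX"].is_unique, balance["SK_ID_BUREAU"].is_unique)
```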


## (D) : Previous Application Entity

In [175]:
table_name = 'previous_application'
index = 'SK_ID_PREV'

variable_types = as_dict_featuretools(merged[table_name])
es = es.entity_from_dataframe(dataframe=frames[table_name],
                              entity_id=table_name,
                              index=index,
                              variable_types=variable_types)

print(len(variable_types))

35


# Parent-Child Relations

In [228]:
## Relation A-B
new_relationship = ft.Relationship(es["application_train"]["SK_ID_CURR"],
                                    es["bureau"]["SK_ID_CURR"])
es = es.add_relationship(new_relationship)

## Relation B-C (disabled): the child side is the foreign key SK_ID_BUREAU, not the child's own INDEX
#new_relationship = ft.Relationship(es["bureau"]["SK_ID_BUREAU"],
#                                   es["bureau_balance"]["SK_ID_BUREAU"])
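
A relationship pairs the parent's index with the child's foreign-key column. Before adding one, it can help to measure how many child rows actually find a parent. The helper `key_coverage` and the rows below are made up for illustration.

```python
import pandas as pd


def key_coverage(parent, parent_key, child, child_key):
    """Fraction of child rows whose key value exists in the parent's key column."""
    if len(child) == 0:
        return 0.0
    return float(child[child_key].isin(parent[parent_key]).mean())


# Made-up rows shaped like bureau (parent) and bureau_balance (child).
bureau = pd.DataFrame({"SK_ID_BUREAU": [10, 11, 12]})
bureau_balance = pd.DataFrame({
    "INDEX": [0, 1, 2, 3],
    "SK_ID_BUREAU": [10, 10, 11, 99],  # 99 has no parent row
})

good = key_coverage(bureau, "SK_ID_BUREAU", bureau_balance, "SK_ID_BUREAU")
bad = key_coverage(bureau, "SK_ID_BUREAU", bureau_balance, "INDEX")
print(good, bad)
```

Pairing the parent key with the child's foreign key matches three of the four rows here, while pairing it with the child's surrogate `INDEX` matches none.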




In [229]:
%time feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="application_train")

CPU times: user 8.86 s, sys: 55.8 ms, total: 8.92 s
Wall time: 8.92 s
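
For a single parent-child relationship, the aggregate features that `ft.dfs` returns resemble a group-by on the child table followed by a join onto the parent. The pandas sketch below imitates the feature naming on made-up rows; it is an illustration, not Featuretools' implementation.

```python
import pandas as pd

# Made-up parent and child rows.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]}).set_index("SK_ID_CURR")
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_ANNUITY": [100.0, 300.0, 50.0],
})

# Aggregate one numeric child column per parent key.
agg = bureau.groupby("SK_ID_CURR")["AMT_ANNUITY"].agg(["mean", "max", "min"])
agg.columns = [f"{name.upper()}(bureau.AMT_ANNUITY)" for name in agg.columns]

# Number of child rows per parent key.
counts = bureau.groupby("SK_ID_CURR").size().rename("COUNT(bureau)")

# Left-join the aggregates onto the parent index.
features = apps.join(agg).join(counts)

# Parents without children get a count of 0; their other aggregates stay NaN.
features["COUNT(bureau)"] = features["COUNT(bureau)"].fillna(0)
print(features)
```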


In [230]:
feature_matrix

SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,...,STD(bureau.CNT_CREDIT_PROLONG),STD(bureau.AMT_CREDIT_SUM),STD(bureau.AMT_CREDIT_SUM_DEBT),STD(bureau.AMT_CREDIT_SUM_LIMIT),STD(bureau.AMT_CREDIT_SUM_OVERDUE),STD(bureau.DAYS_CREDIT_UPDATE),STD(bureau.AMT_ANNUITY),MAX(bureau.DAYS_CREDIT),MAX(bureau.CREDIT_DAY_OVERDUE),MAX(bureau.DAYS_CREDIT_ENDDATE),MAX(bureau.DAYS_ENDDATE_FACT),MAX(bureau.AMT_CREDIT_MAX_OVERDUE),MAX(bureau.CNT_CREDIT_PROLONG),MAX(bureau.AMT_CREDIT_SUM),MAX(bureau.AMT_CREDIT_SUM_DEBT),MAX(bureau.AMT_CREDIT_SUM_LIMIT),MAX(bureau.AMT_CREDIT_SUM_OVERDUE),MAX(bureau.DAYS_CREDIT_UPDATE),MAX(bureau.AMT_ANNUITY),SKEW(bureau.DAYS_CREDIT),SKEW(bureau.CREDIT_DAY_OVERDUE),SKEW(bureau.DAYS_CREDIT_ENDDATE),SKEW(bureau.DAYS_ENDDATE_FACT),SKEW(bureau.AMT_CREDIT_MAX_OVERDUE),SKEW(bureau.CNT_CREDIT_PROLONG),SKEW(bureau.AMT_CREDIT_SUM),SKEW(bureau.AMT_CREDIT_SUM_DEBT),SKEW(bureau.AMT_CREDIT_SUM_LIMIT),SKEW(bureau.AMT_CREDIT_SUM_OVERDUE),SKEW(bureau.DAYS_CREDIT_UPDATE),SKEW(bureau.AMT_ANNUITY),MIN(bureau.DAYS_CREDIT),MIN(bureau.CREDIT_DAY_OVERDUE),MIN(bureau.DAYS_CREDIT_ENDDATE),MIN(bureau.DAYS_ENDDATE_FACT),MIN(bureau.AMT_CREDIT_MAX_OVERDUE),MIN(bureau.CNT_CREDIT_PROLONG),MIN(bureau.AMT_CREDIT_SUM),MIN(bureau.AMT_CREDIT_SUM_DEBT),MIN(bureau.AMT_CREDIT_SUM_LIMIT),MIN(bureau.AMT_CREDIT_SUM_OVERDUE),MIN(bureau.DAYS_CREDIT_UPDATE),MIN(bureau.AMT_ANNUITY),MEAN(bureau.DAYS_CREDIT),MEAN(bureau.CREDIT_DAY_OVERDUE),MEAN(bureau.DAYS_CREDIT_ENDDATE),MEAN(bureau.DAYS_ENDDATE_FACT),MEAN(bureau.AMT_CREDIT_MAX_OVERDUE),MEAN(bureau.CNT_CREDIT_PROLONG),MEAN(bureau.AMT_CREDIT_SUM),MEAN(bureau.AMT_CREDIT_SUM_DEBT),MEAN(bureau.AMT_CREDIT_SUM_LIMIT),MEAN(bureau.AMT_CREDIT_SUM_OVERDUE),MEAN(bureau.DAYS_CREDIT_UPDATE),MEAN(bureau.AMT_ANNUITY),COUNT(bureau),NUM_UNIQUE(bureau.CREDIT_ACTIVE),NUM_UNIQUE(bureau.CREDIT_CURRENCY),NUM_UNIQUE(bureau.CREDIT_TYPE),MODE(bureau.CREDIT_ACTIVE),MODE(bureau.CREDIT_CURRENCY),MODE(bureau.CREDIT_TYPE)
100145,Cash loans,F,Y,Y,1.0,202500.0,260725.5,16789.5,198000.0,Family,Working,Secondary / secondary special,Separated,House / apartment,0.018850,-16282.0,-4375.0,-762.0,-1494.0,8.0,1,1,0,1,0,0,Laborers,2.0,2,2,TUESDAY,11.0,0,0,0,0,0,0,Self-employed,0.647045,0.746486,0.739412,0.0928,0.1000,0.9801,0.7280,0.0463,0.0000,0.2069,0.1667,0.2083,0.0437,0.0756,0.0903,0.0000,0.0000,0.0945,0.1038,0.9801,0.7387,0.0467,0.0000,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100190,Cash loans,M,Y,N,0.0,162000.0,263686.5,24781.5,238500.0,Unaccompanied,Commercial associate,Higher education,Married,House / apartment,0.022625,-13972.0,-4472.0,-464.0,-4529.0,3.0,1,1,0,1,1,0,Laborers,2.0,2,2,THURSDAY,16.0,0,0,0,0,0,0,Government,0.534999,0.585859,0.788681,0.3093,0.1973,0.9891,0.8504,0.0000,0.4000,0.2414,0.4583,0.5000,0.4101,0.2522,0.3564,0.0000,0.0168,0.3151,0.2047,0.9891,0.8563,0.0000,0.4028,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100295,Cash loans,M,Y,N,1.0,225000.0,1019205.0,31032.0,774000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,0.072508,-11356.0,-602.0,-335.0,-3224.0,9.0,1,1,0,1,0,0,Laborers,2.0,1,1,MONDAY,14.0,1,0,1,0,0,0,Business Entity Type 3,0.262005,0.302394,0.463275,0.2402,0.1098,0.9916,0.8844,0.4682,0.4000,0.1724,0.5417,0.5000,0.0223,0.1942,0.2270,0.0077,0.0075,0.2447,0.1139,0.9916,0.8889,0.4724,0.4028,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100341,Cash loans,M,Y,Y,0.0,76500.0,545040.0,20677.5,450000.0,Unaccompanied,Working,Secondary / secondary special,Widow,House / apartment,0.031329,-20348.0,-7115.0,-1799.0,-2780.0,28.0,1,1,0,1,0,0,Laborers,1.0,2,2,TUESDAY,10.0,0,0,0,0,0,0,Industry: type 2,0.660390,0.647373,0.315472,0.0485,0.0328,0.9816,0.7484,0.0216,0.0000,0.0345,0.1667,0.2083,0.0120,0.0395,0.0218,0.0000,0.0000,0.0494,0.0340,0.9816,0.7583,0.0218,0.0000,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100343,Cash loans,M,Y,Y,0.0,315000.0,90000.0,4504.5,90000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,With parents,0.022800,-10935.0,-422.0,-5251.0,-3364.0,2.0,1,1,0,1,1,0,Drivers,1.0,2,2,SATURDAY,11.0,0,0,0,0,0,0,Business Entity Type 3,0.259823,0.581980,0.537070,0.2959,0.1433,0.9871,0.8232,0.0706,0.3200,0.2759,0.3333,0.3750,0.1440,0.2404,0.3213,0.0039,0.0086,0.3015,0.1487,0.9871,0.8301,0.0712,0.3222,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100363,Cash loans,F,Y,Y,1.0,360000.0,493497.0,48942.0,454500.0,Unaccompanied,Commercial associate,Higher education,Married,House / apartment,0.006629,-14882.0,-436.0,-1140.0,-4606.0,6.0,1,1,0,1,0,1,Core staff,3.0,2,2,MONDAY,11.0,0,0,0,0,0,0,Bank,0.735443,0.462205,0.540654,0.0784,0.0000,0.9950,0.9320,0.0194,0.0800,0.0690,0.2500,0.0417,0.0979,0.0630,0.1019,0.0618,0.0544,0.0798,0.0000,0.9950,0.9347,0.0195,0.0806,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100376,Cash loans,M,Y,Y,0.0,360000.0,254700.0,20250.0,225000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.018801,-18831.0,-1342.0,-8691.0,-2067.0,17.0,1,1,0,1,0,0,Drivers,2.0,2,2,MONDAY,10.0,0,0,0,0,0,0,Transport: type 3,0.791412,0.445987,0.461482,0.4402,0.2371,0.9831,0.7688,0.1488,0.3200,0.2759,0.3333,0.3750,0.1312,0.3589,0.3408,0.0000,0.0000,0.4485,0.2460,0.9831,0.7779,0.1502,0.3222,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100395,Cash loans,M,Y,Y,0.0,225000.0,888840.0,32053.5,675000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,Municipal apartment,0.072508,-16719.0,-687.0,-8892.0,-240.0,12.0,1,1,0,1,0,0,Drivers,2.0,1,1,FRIDAY,17.0,0,0,0,0,0,0,Business Entity Type 2,0.634742,0.775615,0.740799,0.0907,0.0701,0.9732,0.6328,0.0000,0.0000,0.1607,0.1667,0.0417,0.0000,0.0740,0.0673,0.0000,0.0180,0.0756,0.0457,0.9737,0.6537,0.0000,0.0000,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100425,Cash loans,F,Y,Y,2.0,90000.0,688500.0,20259.0,688500.0,Family,Working,Higher education,Married,House / apartment,0.025164,-12982.0,-2850.0,-3476.0,-5139.0,64.0,1,1,0,1,0,0,Core staff,4.0,2,2,SATURDAY,8.0,0,0,0,0,0,0,Kindergarten,0.340206,0.335586,0.710674,0.0082,0.0000,0.9608,0.4628,0.0008,0.0000,0.0345,0.0833,0.1250,0.0086,0.0067,0.0060,0.0000,0.0000,0.0084,0.0000,0.9608,0.4838,0.0008,0.0000,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,
100479,Revolving loans,M,Y,Y,0.0,90000.0,180000.0,9000.0,180000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9496.0,-189.0,-4182.0,-2153.0,64.0,1,1,0,1,0,0,Laborers,1.0,2,2,SUNDAY,7.0,0,0,0,0,1,1,Business Entity Type 3,0.084902,0.573584,0.452534,0.0495,0.0525,0.9742,0.6464,0.0072,0.0000,0.1034,0.1250,0.0417,0.0130,0.0403,0.0397,0.0000,0.0000,0.0504,0.0545,0.9742,0.6602,0.0072,0.0000,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,


Run time on all data:

CPU times: user 27min 32s, sys: 32.1 s, total: 28min 4s
Wall time: 29min 56s

Problem: missing data must be handled before the features can be computed. In the rows shown above, the `bureau` aggregates are all NaN and `COUNT(bureau)` is 0.0.

Imputing missing numeric values, however, could introduce bias.
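
One common way to limit that concern is to record which values were missing before filling them, so that a downstream model can still tell imputed values from observed ones. The sketch uses made-up numbers and median imputation purely as an illustration; it is not something this notebook does.

```python
import pandas as pd

# Made-up values for one aggregate feature; None marks a missing value.
feature = pd.Series([10.0, None, 30.0, None, 50.0],
                    name="MEAN(bureau.AMT_ANNUITY)")

# Keep the missingness as its own 0/1 feature before filling.
was_missing = feature.isna().astype(int).rename(feature.name + "_missing")

# Fill with the median of the observed values (10, 30 and 50).
filled = feature.fillna(feature.median())

out = pd.concat([filled, was_missing], axis=1)
print(out)
```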

In [231]:
feature_defs

[<Feature: NAME_CONTRACT_TYPE>,
 <Feature: CODE_GENDER>,
 <Feature: FLAG_OWN_CAR>,
 <Feature: FLAG_OWN_REALTY>,
 <Feature: CNT_CHILDREN>,
 <Feature: AMT_INCOME_TOTAL>,
 <Feature: AMT_CREDIT>,
 <Feature: AMT_ANNUITY>,
 <Feature: AMT_GOODS_PRICE>,
 <Feature: NAME_TYPE_SUITE>,
 <Feature: NAME_INCOME_TYPE>,
 <Feature: NAME_EDUCATION_TYPE>,
 <Feature: NAME_FAMILY_STATUS>,
 <Feature: NAME_HOUSING_TYPE>,
 <Feature: REGION_POPULATION_RELATIVE>,
 <Feature: DAYS_BIRTH>,
 <Feature: DAYS_EMPLOYED>,
 <Feature: DAYS_REGISTRATION>,
 <Feature: DAYS_ID_PUBLISH>,
 <Feature: OWN_CAR_AGE>,
 <Feature: FLAG_MOBIL>,
 <Feature: FLAG_EMP_PHONE>,
 <Feature: FLAG_WORK_PHONE>,
 <Feature: FLAG_CONT_MOBILE>,
 <Feature: FLAG_PHONE>,
 <Feature: FLAG_EMAIL>,
 <Feature: OCCUPATION_TYPE>,
 <Feature: CNT_FAM_MEMBERS>,
 <Feature: REGION_RATING_CLIENT>,
 <Feature: REGION_RATING_CLIENT_W_CITY>,
 <Feature: WEEKDAY_APPR_PROCESS_START>,
 <Feature: HOUR_APPR_PROCESS_START>,
 <Feature: REG_REGION_NOT_LIVE_REGION>,
 <Feature: REG

In [None]:
# plot() is a method and must be called to draw the entity set
es.plot()

In [308]:
homecredit.bureau_balance().head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS,INDEX
0,5715448,0,C,0
1,5715448,-1,C,1
2,5715448,-2,C,2
3,5715448,-3,C,3
4,5715448,-4,C,4



In [397]:
# reset
table_name = 'bureau_balance'
#frames[table_name].reset_index()
#frames[table_name].drop('INDEX', axis=1, inplace=True)
frames[table_name]

Unnamed: 0,INDEX,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,0,5715448,0,C
1,1,5715448,-1,C
2,2,5715448,-2,C
3,3,5715448,-3,C
4,4,5715448,-4,C
5,5,5715448,-5,C
6,6,5715448,-6,C
7,7,5715448,-7,C
8,8,5715448,-8,C
9,9,5715448,-9,0


In [393]:
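# reset
table_name = 'bureau_balance'
# reset_index() returns a new DataFrame; the result is discarded here
homecredit[table_name]().reset_index()
# drop() raises KeyError when the frame has no 'INDEX' column (see the error below)
homecredit[table_name]().drop('INDEX', axis=1, inplace=True)
homecredit[table_name]()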

KeyError: "['INDEX'] not found in axis"
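
The `KeyError` is consistent with how these pandas calls behave: `reset_index()` returns a new DataFrame rather than changing the one it is called on, and `drop` raises when the named column is absent. A small sketch on made-up rows, working on an explicit local copy:

```python
import pandas as pd

# Made-up rows shaped like bureau_balance, without an INDEX column.
source = pd.DataFrame({
    "SK_ID_BUREAU": [5715448, 5715448],
    "MONTHS_BALANCE": [0, -1],
    "STATUS": ["C", "C"],
})

# Work on a named copy so that every step applies to the same object.
local = source.copy()

# reset_index returns a new frame; the result must be assigned to take effect.
local = local.reset_index(drop=True)

# errors='ignore' makes the drop a no-op when the column is not there.
local = local.drop(columns=["INDEX"], errors="ignore")

# Without errors='ignore', dropping a missing column raises KeyError.
try:
    source.drop("INDEX", axis=1)
    raised = False
except KeyError:
    raised = True

print(list(local.columns), raised)
```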

In [396]:
table_name = 'bureau_balance'
homecredit[table_name]()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C
5,5715448,-5,C
6,5715448,-6,C
7,5715448,-7,C
8,5715448,-8,C
9,5715448,-9,0


In [None]:
data = ft.demo.load_mock_customer()
data['transactions'].head()