# Using Impyute fastKNN to Impute

KNN (k-nearest neighbors) is an algorithm that matches a point with its k closest neighbors in a multi-dimensional space. It is useful for dealing with different types of missing data, because a missing value can be estimated from the values of nearby points.

If you have a relatively small amount of low-dimensional data, KNN imputation is worth trying.
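To make the idea concrete, here is a minimal NumPy sketch of KNN imputation. It is not Impyute's implementation: the function name `knn_impute_sketch`, the toy array, and the choice of an unweighted neighbor mean are all assumptions made for illustration.

```python
import numpy as np

def knn_impute_sketch(data, k=3):
    """Fill each NaN with the mean of that column over the k nearest rows."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()
    for i, j in zip(*np.where(missing)):
        # Donor rows must have column j observed.
        donors = np.where(~missing[:, j])[0]
        if donors.size == 0:
            continue  # nothing to borrow from; leave the NaN in place
        dists = np.full(donors.size, np.inf)
        for n, d in enumerate(donors):
            # Compare rows only on columns observed in both of them.
            shared = ~missing[i] & ~missing[d]
            if shared.any():
                diff = data[i, shared] - data[d, shared]
                dists[n] = np.sqrt(np.mean(diff ** 2))
        nearest = donors[np.argsort(dists)[:k]]
        filled[i, j] = data[nearest, j].mean()
    return filled

# Toy data (made up): row 1 is missing its second value.
toy = np.array([[1.0, 2.0],
                [1.1, np.nan],
                [5.0, 6.0],
                [1.2, 2.2],
                [5.1, 6.1]])
result = knn_impute_sketch(toy, k=2)
print(result)
```

The nested loop compares every incomplete row with every donor row, which is one reason the approach suits small, low-dimensional data.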

The Impyute package provides an implementation called [fastKNN](https://impyute.readthedocs.io/en/master/_modules/impyute/imputation/cs/fast_knn.html#fast_knn), which is intended to be faster than running fit+transform for each subset.

Try it out with the Home Credit data set and compare the result to [DataWig](https://datawig.readthedocs.io/en/latest/source/userguide.html#introduction-to-imputer) or [MICE](https://www.statsmodels.org/stable/imputation.html).

In [8]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from quilt.data.avare import homecredit

import random

import pandas as pd
import featuretools as ft  # EntitySet, Relationship and dfs are used below
from impyute import fast_knn

pd.set_option('display.max_columns', 125)

# Load Data 

In [2]:
from quilt.data.avare import homecredit

# Create a Sample for Analysis
 
Randomly select a set of ids from the `application_train` table. The sample is drawn so that every id is present in both the parent table and its child table (here, `bureau`). For practice purposes, this avoids null results from joins when a primary key in the parent table does not exist as a foreign key in the child table.

In [3]:
# Random subset of SK_ID_CURR ids
# TODO: create a list so this can be performed dynamically

sample_table_parent = 'application_train'
sample_table_child = 'bureau'
index_parent = 'SK_ID_CURR'

overlap_skidcurr = homecredit[sample_table_parent]().merge(homecredit[sample_table_child](), 
                                                           on=index_parent, how='inner')
print('Num. overlapping sk_id_curr: {}'.format(len((overlap_skidcurr))))
skids = pd.unique(overlap_skidcurr[index_parent])
random.seed(a=1)

sample_ids = random.sample(skids.tolist(), 100)
sample_df = pd.DataFrame({index_parent: sample_ids})
sample_df.head()

Num. overlapping sk_id_curr: 1465325


| | SK_ID_CURR |
|---|---|
| 0 | 195194 |
| 1 | 144723 |
| 2 | 280662 |
| 3 | 183553 |
| 4 | 451266 |
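As a sanity check on the sampling idea, the following toy sketch repeats the same steps on two made-up tables and confirms that every sampled id exists in both. The table contents are invented for illustration; only the column names follow the notebook.

```python
import random
import pandas as pd

# Made-up parent and child tables; id 9 exists only in the child.
parent = pd.DataFrame({'SK_ID_CURR': [1, 2, 3, 4, 5],
                       'TARGET': [0, 1, 0, 0, 1]})
child = pd.DataFrame({'SK_ID_BUREAU': [10, 11, 12, 13],
                      'SK_ID_CURR': [2, 2, 4, 9]})

# The inner join keeps only ids present in both tables.
overlap = parent.merge(child, on='SK_ID_CURR', how='inner')
ids = pd.unique(overlap['SK_ID_CURR']).tolist()

random.seed(1)
sample_ids = random.sample(ids, 2)
sample = pd.DataFrame({'SK_ID_CURR': sample_ids})

in_parent = sample['SK_ID_CURR'].isin(parent['SK_ID_CURR']).all()
in_child = sample['SK_ID_CURR'].isin(child['SK_ID_CURR']).all()
print(len(overlap), len(ids), in_parent, in_child)
```

Note that `len(overlap)` counts joined rows rather than distinct ids, so it can be larger than the number of ids available for sampling.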


# Validate Data Types

The data types inferred by Pandas may not be what you expect. 

* The file `new_data_description_file` is used to override the data types inferred by Pandas.
* The file contains manually assigned Python data types for each column, in all the tables in the homecredit data set.

## Manually Assign Data Types

In [4]:
description = pd.read_csv('data/new_data_description_file.csv')
description.head()

| | Row | Table | Type |
|---|---|---|---|
| 0 | SK_ID_PREV | POS_CASH_balance | object |
| 1 | SK_ID_CURR | POS_CASH_balance | object |
| 2 | MONTHS_BALANCE | POS_CASH_balance | float64 |
| 3 | CNT_INSTALMENT | POS_CASH_balance | float64 |
| 4 | CNT_INSTALMENT_FUTURE | POS_CASH_balance | float64 |


## Override Inferred Data Types

In [5]:
python_cat_dtype = 'object'
python_num_dtype = 'float64'
overide_dtypes = {}

for table, node in homecredit._items():
    
    print(table)
    
    df = node()
    
    condtable = description.Table == table
    condcat = description.Type == python_cat_dtype
    condnum = description.Type == python_num_dtype
        
    catcols = description.loc[(condtable & condcat),'Row'].values.tolist()
    numcols = description.loc[(condtable & condnum),'Row'].values.tolist()
    
    df[catcols] = df[catcols].astype(python_cat_dtype) 
    df[numcols] = df[numcols].astype(python_num_dtype)
    
    # Collect the categorical rows followed by the numeric rows for this table
    overide_dtypes[table] = pd.concat(
        [description.loc[(condtable & condcat), ['Row', 'Type']],
         description.loc[(condtable & condnum), ['Row', 'Type']]],
        ignore_index=True)

POS_CASH_balance
application_train
bureau
bureau_balance
credit_card_balance
installments_payments
previous_application


In [6]:
homecredit.application_train().dtypes[0:6]

SK_ID_CURR            object
TARGET                object
NAME_CONTRACT_TYPE    object
CODE_GENDER           object
FLAG_OWN_CAR          object
FLAG_OWN_REALTY       object
dtype: object
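The same kind of override can also be written by passing a column-to-dtype mapping to `DataFrame.astype`. The sketch below uses a made-up table named `toy_table` and a two-row description frame in the same `Row`/`Table`/`Type` layout; it is an illustration, not the notebook's data.

```python
import pandas as pd

# Made-up description rows in the same layout as the description file.
description = pd.DataFrame({
    'Row': ['SK_ID_CURR', 'AMT_CREDIT'],
    'Table': ['toy_table', 'toy_table'],
    'Type': ['object', 'float64'],
})

# Pandas infers int64 for both columns here.
df = pd.DataFrame({'SK_ID_CURR': [100001, 100002],
                   'AMT_CREDIT': [1000, 2500]})

# Build {column name: dtype} for this table and apply it in one call.
rows = description[description['Table'] == 'toy_table']
dtype_map = dict(zip(rows['Row'], rows['Type']))
df = df.astype(dtype_map)
print(df.dtypes)
```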

# EntitySet

An [EntitySet](https://docs.featuretools.com/generated/featuretools.EntitySet.entity_from_dataframe.html#featuretools-entityset-entity-from-dataframe) represents a set of database tables, as shown in the image above.


In [7]:
# create an entity set
es = ft.EntitySet(id="homecredit_data")

## (A) : Application Entity
Are the types in the application entity overridden correctly? If we did not override the types, Featuretools would infer the types for us.

If the types are inferred incorrectly, then the operations applied to the column would be incorrect too. For example, an id column read as numeric could be averaged as if it were a quantity.
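The cells below call a helper named `as_dict_featuretools`, which is not defined in this notebook. The following is only a guess at what such a helper might do: turn the `Row`/`Type` table into a `{column: variable type}` dictionary. The function name `dtype_table_to_variable_types`, the `type_map` argument, and the string placeholders are hypothetical; with Featuretools releases before 1.0 the mapped values would presumably be classes such as `ft.variable_types.Categorical` and `ft.variable_types.Numeric`.

```python
import pandas as pd

def dtype_table_to_variable_types(dtype_table, type_map):
    """Hypothetical helper: map each column name to a variable type."""
    return {row: type_map[dtype]
            for row, dtype in zip(dtype_table['Row'], dtype_table['Type'])}

# Made-up rows in the same layout as overide_dtypes[table_name].
dtype_table = pd.DataFrame({
    'Row': ['CREDIT_ACTIVE', 'DAYS_CREDIT'],
    'Type': ['object', 'float64'],
})

# Strings stand in for the Featuretools variable type classes (assumption).
type_map = {'object': 'Categorical', 'float64': 'Numeric'}
variable_types_sketch = dtype_table_to_variable_types(dtype_table, type_map)
print(variable_types_sketch)
```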

In [43]:
table_name = 'application_train'
index = 'SK_ID_CURR'

variable_types = as_dict_featuretools(overide_dtypes[table_name])
df = homecredit[table_name]().merge(sample_df, on=index)

es = es.entity_from_dataframe(dataframe=df,
                              entity_id=table_name,
                              index=index,
                              variable_types = variable_types)

es[table_name].variables[0:20]

[<Variable: SK_ID_CURR (dtype = index)>,
 <Variable: TARGET (dtype = categorical)>,
 <Variable: NAME_CONTRACT_TYPE (dtype = categorical)>,
 <Variable: CODE_GENDER (dtype = categorical)>,
 <Variable: FLAG_OWN_CAR (dtype = categorical)>,
 <Variable: FLAG_OWN_REALTY (dtype = categorical)>,
 <Variable: NAME_TYPE_SUITE (dtype = categorical)>,
 <Variable: NAME_INCOME_TYPE (dtype = categorical)>,
 <Variable: NAME_EDUCATION_TYPE (dtype = categorical)>,
 <Variable: NAME_FAMILY_STATUS (dtype = categorical)>,
 <Variable: NAME_HOUSING_TYPE (dtype = categorical)>,
 <Variable: FLAG_MOBIL (dtype = categorical)>,
 <Variable: FLAG_EMP_PHONE (dtype = categorical)>,
 <Variable: FLAG_WORK_PHONE (dtype = categorical)>,
 <Variable: FLAG_CONT_MOBILE (dtype = categorical)>,
 <Variable: FLAG_PHONE (dtype = categorical)>,
 <Variable: FLAG_EMAIL (dtype = categorical)>,
 <Variable: OCCUPATION_TYPE (dtype = categorical)>,
 <Variable: REGION_RATING_CLIENT (dtype = categorical)>,
 <Variable: REGION_RATING_CLIENT_W_C

## (B) : Bureau  Entity

In [47]:
# If bureau has no rows after this merge, the sample ids do not match the child table.
# sample_df is built so that every sampled id in the parent (which holds TARGET) also appears in bureau.
table_name = 'bureau'
index = 'SK_ID_BUREAU'
index_parent = 'SK_ID_CURR'

variable_types = as_dict_featuretools(overide_dtypes[table_name])
df = homecredit[table_name]().merge(sample_df, on=index_parent)

print(df.shape)

es = es.entity_from_dataframe(dataframe=df,
                              entity_id=table_name,
                              index=index,
                              variable_types = variable_types)
es[table_name].variables[0:20]

(609, 17)
  SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY  DAYS_CREDIT  \
0     398924      5725624        Closed      currency 1      -1150.0   
1     398924      5725625        Closed      currency 1       -790.0   
2     398924      5725626        Active      currency 1       -190.0   
3     343759      5202643        Closed      currency 1      -1868.0   
4     343759      5202644        Active      currency 1       -541.0   

   CREDIT_DAY_OVERDUE  DAYS_CREDIT_ENDDATE  DAYS_ENDDATE_FACT  \
0                 0.0               -790.0             -771.0   
1                 0.0               -190.0             -190.0   
2                 0.0                918.0                NaN   
3                 0.0              -1564.0            -1591.0   
4                 0.0              12917.0                NaN   

   AMT_CREDIT_MAX_OVERDUE  CNT_CREDIT_PROLONG  AMT_CREDIT_SUM  \
0                     NaN                 0.0        234000.0   
1                     0.0           

[<Variable: SK_ID_BUREAU (dtype = index)>,
 <Variable: SK_ID_CURR (dtype = categorical)>,
 <Variable: CREDIT_ACTIVE (dtype = categorical)>,
 <Variable: CREDIT_CURRENCY (dtype = categorical)>,
 <Variable: CREDIT_TYPE (dtype = categorical)>,
 <Variable: DAYS_CREDIT (dtype = numeric)>,
 <Variable: CREDIT_DAY_OVERDUE (dtype = numeric)>,
 <Variable: DAYS_CREDIT_ENDDATE (dtype = numeric)>,
 <Variable: DAYS_ENDDATE_FACT (dtype = numeric)>,
 <Variable: AMT_CREDIT_MAX_OVERDUE (dtype = numeric)>,
 <Variable: CNT_CREDIT_PROLONG (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM_DEBT (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM_LIMIT (dtype = numeric)>,
 <Variable: AMT_CREDIT_SUM_OVERDUE (dtype = numeric)>,
 <Variable: DAYS_CREDIT_UPDATE (dtype = numeric)>,
 <Variable: AMT_ANNUITY (dtype = numeric)>]

# Relations

In [48]:
## Relation A-B
new_relationship = ft.Relationship(es["application_train"]["SK_ID_CURR"],
                                    es["bureau"]["SK_ID_CURR"])
es = es.add_relationship(new_relationship)
es



Entityset: homecredit_data
  Entities:
    application_train [Rows: 100, Columns: 122]
    bureau [Rows: 609, Columns: 17]
  Relationships:
    bureau.SK_ID_CURR -> application_train.SK_ID_CURR

# Generate Features

In [52]:
table = "application_train"
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity=table)

In [53]:
feature_matrix.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,...,STD(bureau.CNT_CREDIT_PROLONG),STD(bureau.AMT_CREDIT_SUM),STD(bureau.AMT_CREDIT_SUM_DEBT),STD(bureau.AMT_CREDIT_SUM_LIMIT),STD(bureau.AMT_CREDIT_SUM_OVERDUE),STD(bureau.DAYS_CREDIT_UPDATE),STD(bureau.AMT_ANNUITY),MAX(bureau.DAYS_CREDIT),MAX(bureau.CREDIT_DAY_OVERDUE),MAX(bureau.DAYS_CREDIT_ENDDATE),MAX(bureau.DAYS_ENDDATE_FACT),MAX(bureau.AMT_CREDIT_MAX_OVERDUE),MAX(bureau.CNT_CREDIT_PROLONG),MAX(bureau.AMT_CREDIT_SUM),MAX(bureau.AMT_CREDIT_SUM_DEBT),MAX(bureau.AMT_CREDIT_SUM_LIMIT),MAX(bureau.AMT_CREDIT_SUM_OVERDUE),MAX(bureau.DAYS_CREDIT_UPDATE),MAX(bureau.AMT_ANNUITY),SKEW(bureau.DAYS_CREDIT),SKEW(bureau.CREDIT_DAY_OVERDUE),SKEW(bureau.DAYS_CREDIT_ENDDATE),SKEW(bureau.DAYS_ENDDATE_FACT),SKEW(bureau.AMT_CREDIT_MAX_OVERDUE),SKEW(bureau.CNT_CREDIT_PROLONG),SKEW(bureau.AMT_CREDIT_SUM),SKEW(bureau.AMT_CREDIT_SUM_DEBT),SKEW(bureau.AMT_CREDIT_SUM_LIMIT),SKEW(bureau.AMT_CREDIT_SUM_OVERDUE),SKEW(bureau.DAYS_CREDIT_UPDATE),SKEW(bureau.AMT_ANNUITY),MIN(bureau.DAYS_CREDIT),MIN(bureau.CREDIT_DAY_OVERDUE),MIN(bureau.DAYS_CREDIT_ENDDATE),MIN(bureau.DAYS_ENDDATE_FACT),MIN(bureau.AMT_CREDIT_MAX_OVERDUE),MIN(bureau.CNT_CREDIT_PROLONG),MIN(bureau.AMT_CREDIT_SUM),MIN(bureau.AMT_CREDIT_SUM_DEBT),MIN(bureau.AMT_CREDIT_SUM_LIMIT),MIN(bureau.AMT_CREDIT_SUM_OVERDUE),MIN(bureau.DAYS_CREDIT_UPDATE),MIN(bureau.AMT_ANNUITY),MEAN(bureau.DAYS_CREDIT),MEAN(bureau.CREDIT_DAY_OVERDUE),MEAN(bureau.DAYS_CREDIT_ENDDATE),MEAN(bureau.DAYS_ENDDATE_FACT),MEAN(bureau.AMT_CREDIT_MAX_OVERDUE),MEAN(bureau.CNT_CREDIT_PROLONG),MEAN(bureau.AMT_CREDIT_SUM),MEAN(bureau.AMT_CREDIT_SUM_DEBT),MEAN(bureau.AMT_CREDIT_SUM_LIMIT),MEAN(bureau.AMT_CREDIT_SUM_OVERDUE),MEAN(bureau.DAYS_CREDIT_UPDATE),MEAN(bureau.AMT_ANNUITY),COUNT(bureau),NUM_UNIQUE(bureau.CREDIT_ACTIVE),NUM_UNIQUE(bureau.CREDIT_CURRENCY),NUM_UNIQUE(bureau.CREDIT_TYPE),MODE(bureau.CREDIT_ACTIVE),MODE(bureau.CREDIT_CURRENCY),MODE(bureau.CREDIT_TYPE)
101134,0,Cash loans,F,Y,N,Unaccompanied,State servant,Secondary / secondary special,Married,House / apartment,1,1,0,1,1,0,Managers,2,2,FRIDAY,0,0,0,0,1,1,Hotel,,,,,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,135000.0,450000.0,23107.5,450000.0,0.004849,-15175.0,-3046.0,-3141.0,-3160.0,12.0,...,0.333333,150504.59325,150980.878492,0.0,0.0,484.906721,12346.058156,-345.0,0.0,30983.0,-309.0,0.0,1.0,532192.5,398205.0,0.0,0.0,-4.0,21384.0,-0.460411,0.0,2.885833,-0.609373,,3.0,0.60341,1.366175,0.0,0.0,-1.077227,1.732051,-2440.0,0.0,-1435.0,-1435.0,0.0,0.0,29295.0,0.0,0.0,0.0,-1429.0,0.0,-1215.333333,0.0,3562.111111,-777.166667,0.0,0.111111,227322.6,98521.3125,0.0,0.0,-453.555556,7128.0,9,2,1,2,Closed,currency 1,Consumer credit
101532,0,Cash loans,M,Y,Y,Unaccompanied,Working,Higher education,Married,House / apartment,1,1,0,1,0,0,Laborers,2,2,FRIDAY,0,0,0,0,1,1,Self-employed,,,,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,225000.0,152820.0,16587.0,135000.0,0.030755,-14814.0,-630.0,-9081.0,-3649.0,9.0,...,0.0,313555.304578,256214.239992,0.0,0.0,525.726692,,-224.0,0.0,31086.0,-648.0,0.0,0.0,878904.0,849766.5,0.0,0.0,-9.0,,-0.614623,0.0,3.604765,-1.212303,0.0,0.0,1.402204,3.316625,0.0,0.0,-0.124185,,-2686.0,0.0,-2593.0,-2593.0,0.0,0.0,18432.0,0.0,0.0,0.0,-1496.0,,-1272.142857,0.0,1526.642857,-1306.272727,0.0,0.0,228738.4875,77251.5,0.0,0.0,-680.357143,,14,2,1,2,Closed,currency 1,Consumer credit
104080,0,Cash loans,F,Y,N,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,1,1,0,1,0,0,Drivers,3,3,THURSDAY,0,0,0,0,0,0,Construction,,,,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,90000.0,450000.0,24412.5,450000.0,0.020246,-13846.0,-2497.0,-1907.0,-531.0,11.0,...,0.0,97820.202012,45117.272944,0.0,0.0,583.774104,4717.3277,-500.0,0.0,160.0,-317.0,17787.915,0.0,405803.565,150750.0,0.0,0.0,-26.0,8950.5,-0.262649,0.0,-0.59188,-0.800864,0.263532,0.0,1.490129,2.834686,0.0,0.0,-0.580774,-0.271052,-2120.0,0.0,-1754.0,-1754.0,0.0,0.0,28309.5,0.0,0.0,0.0,-1638.0,0.0,-1203.833333,0.0,-643.916667,-878.5,7607.649375,0.0,153434.38125,17326.5,0.0,0.0,-704.75,4972.5,12,2,1,2,Closed,currency 1,Consumer credit
106586,0,Cash loans,F,Y,Y,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,1,1,0,1,0,0,Sales staff,2,2,TUESDAY,0,0,0,0,0,0,Self-employed,,,,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,225000.0,644427.0,27301.5,576000.0,0.030755,-15044.0,-627.0,-195.0,-421.0,14.0,...,0.0,118346.48356,24469.889864,46744.605,0.0,482.488083,,-434.0,0.0,1348.0,-294.0,0.0,0.0,328500.0,84766.185,140233.815,0.0,-27.0,,0.353019,0.0,0.739205,1.246134,0.0,0.0,1.446286,3.464102,3.0,0.0,-0.501331,,-1854.0,0.0,-1779.0,-1779.0,0.0,0.0,0.0,0.0,0.0,0.0,-1773.0,,-1170.5,0.0,-540.545455,-1200.75,0.0,0.0,82164.0,7063.84875,15581.535,0.0,-752.75,,12,2,1,2,Closed,currency 1,Consumer credit
108766,0,Cash loans,F,N,Y,Unaccompanied,Working,Lower secondary,Single / not married,House / apartment,1,1,0,1,1,0,Sales staff,1,1,TUESDAY,0,0,0,0,0,0,Self-employed,,,,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,202500.0,1436850.0,42142.5,1125000.0,0.011657,-15441.0,-2453.0,-7186.0,-3941.0,,...,0.0,141746.720911,134979.07387,0.0,0.0,474.576033,,-490.0,0.0,854.0,-603.0,0.0,0.0,427500.0,381100.5,0.0,0.0,-16.0,,-0.924841,0.0,-0.657225,-1.197314,,0.0,-0.323067,2.421833,0.0,0.0,-0.537472,,-2687.0,0.0,-1348.0,-1413.0,0.0,0.0,45000.0,0.0,0.0,0.0,-1297.0,,-1324.5,0.0,-74.75,-851.4,0.0,0.0,249892.115625,61818.1875,0.0,0.0,-500.875,,8,2,1,2,Closed,currency 1,Consumer credit
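The feature matrix above still contains missing values, which is where KNN imputation would come in. The sketch below shows one possible way to run an imputer over the numeric columns only and leave the categorical columns untouched. The helper names `impute_numeric` and `column_mean_fill` are hypothetical, the toy frame is made up, and a column-mean imputer stands in for `fast_knn` so that the example runs without Impyute installed.

```python
import numpy as np
import pandas as pd

def impute_numeric(df, impute_fn):
    """Apply impute_fn to the numeric columns of df; leave other columns as-is."""
    numeric = df.select_dtypes(include='number')
    filled = impute_fn(numeric.to_numpy(dtype=float))
    out = df.copy()
    out[numeric.columns] = filled
    return out

def column_mean_fill(values):
    """Stand-in imputer: replace each NaN with its column mean."""
    values = values.copy()
    means = np.nanmean(values, axis=0)
    rows, cols = np.where(np.isnan(values))
    values[rows, cols] = means[cols]
    return values

# Toy frame (made up) with one numeric and one categorical column.
toy = pd.DataFrame({'AMT_ANNUITY': [100.0, np.nan, 300.0],
                    'CODE_GENDER': ['F', 'M', 'F']})
imputed = impute_numeric(toy, column_mean_fill)
print(imputed)

# Assuming fast_knn accepts a 2-D float array and a k argument, as the linked
# source suggests, the call on the real data would presumably look like:
# imputed = impute_numeric(feature_matrix, lambda values: fast_knn(values, k=5))
```

Passing the imputer as an argument makes it easy to swap in DataWig or MICE later and compare results on the same columns.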


# Discussion

* Could we have performed this just as easily with a set of joins and group-by? Practice in the Group-by notebook.
* How well does this approach scale when using a larger sample size?
* How well does the approach scale when including additional tables as entities?
* Have a look at the [Lessons Learned](https://medium.com/dataexplorations/tool-review-can-featuretools-simplify-the-process-of-feature-engineering-5d165100b0c3) by others.
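One way to explore the first question is to rebuild a few of the aggregate features by hand. The sketch below does this with a pandas group-by and a join on two made-up tables; the output column names imitate the ones produced above, but only `COUNT`, `MEAN` and `MAX` are covered.

```python
import pandas as pd

# Made-up parent and child tables using the notebook's column names.
app = pd.DataFrame({'SK_ID_CURR': ['a', 'b'], 'TARGET': [0, 1]})
bureau = pd.DataFrame({
    'SK_ID_BUREAU': ['x1', 'x2', 'x3'],
    'SK_ID_CURR': ['a', 'a', 'b'],
    'AMT_CREDIT_SUM': [100.0, 300.0, 50.0],
})

# Aggregate the child table per parent id.
agg = bureau.groupby('SK_ID_CURR').agg(**{
    'COUNT(bureau)': ('SK_ID_BUREAU', 'count'),
    'MEAN(bureau.AMT_CREDIT_SUM)': ('AMT_CREDIT_SUM', 'mean'),
    'MAX(bureau.AMT_CREDIT_SUM)': ('AMT_CREDIT_SUM', 'max'),
}).reset_index()

# Join the aggregates back onto the parent table.
features = app.merge(agg, on='SK_ID_CURR', how='left').set_index('SK_ID_CURR')
print(features)
```

Each additional child table or aggregation adds another group-by and join, which gives a rough sense of the bookkeeping that `ft.dfs` takes over.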

In [None]:

#x = homecredit["application_train"]()
# create a subset of the data
#x.loc[x["SK_ID_CURR"].isin(skids)]