# Generating a Feature Matrix with Dask

This notebook walks through an example of how to generate a feature matrix using Featuretools and an entityset created from Dask dataframes. This example uses the Home Credit Default Risk dataset which can be obtained from [Kaggle](https://www.kaggle.com/c/home-credit-default-risk/data).

Before running this notebook, download the data and save all of the CSV files from the dataset into a directory called `data/home-credit-default-risk`. If you place the data in a different location, you will need to update the code that reads the CSV files so they can be found.

Set the `version` variable in the following cell to one of the keys in the primitive dictionary that follows to select a set of primitives to use. This selection will also set the number of workers in the Dask client as well as the blocksize used to create the Dask dataframes.

In [None]:
version = "v1"

In [None]:
# Dict: {version_key: ([trans_primitives], [agg_primitives], num_workers, blocksize)}
primitive_dict = {
    "v1": (["and"], ["sum", "max"], 4, "40MB"),  # 937 features
    "v2": (["and"], ["sum", "max", "min", "mean"], 1, "100MB"),  # 1545 features
    "v3": (["and", "add_numeric", "negate"], [], 4, "1MB"),  # 5946 features
    "v4": (["and", "negate"], ["sum", "max", "min", "mean", "count", "any", "all"], 1, "100MB"),  # 2083 features
}
trans_primitives = primitive_dict[version][0]
agg_primitives = primitive_dict[version][1]
num_workers = primitive_dict[version][2]
blocksize = primitive_dict[version][3]
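
Indexing the tuple by position works, but the slots are easy to mix up. As a minimal sketch, the same lookup can use a named tuple and fail early on an unknown key. `PrimitiveConfig` and `get_config` are hypothetical helpers, not part of Featuretools, and `example_dict` repeats only two of the entries above so that the sketch runs on its own.

```python
from collections import namedtuple

# Hypothetical container for one entry of the primitive dictionary
PrimitiveConfig = namedtuple(
    "PrimitiveConfig",
    ["trans_primitives", "agg_primitives", "num_workers", "blocksize"],
)

# Two of the entries from the cell above, repeated so this sketch is self-contained
example_dict = {
    "v1": (["and"], ["sum", "max"], 4, "40MB"),
    "v3": (["and", "add_numeric", "negate"], [], 4, "1MB"),
}


def get_config(configs, key):
    """Return the named configuration for `key`, or raise a readable error."""
    if key not in configs:
        raise KeyError(f"Unknown version {key!r}; choose one of {sorted(configs)}")
    return PrimitiveConfig(*configs[key])


config = get_config(example_dict, "v1")
print(config.agg_primitives, config.num_workers, config.blocksize)
```

With this approach, later cells could read `config.num_workers` and `config.blocksize` instead of relying on tuple positions.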

In [None]:
import math
import os
from datetime import datetime

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client

import featuretools as ft
import featuretools.variable_types as vtypes

from tqdm import tqdm

Create a Dask client with the correct number of workers for the primitives being used.

In [None]:
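try:
    # Close any client left over from a previous run of this cell
    client.close()
except NameError:
    # No client has been created yet
    pass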
client = Client(n_workers=num_workers)
client

Create variables for the input and output directories. If you have placed your data in a different location, update the values in the following cell accordingly.

In [None]:
input_dir = os.path.join("data", "home-credit-default-risk")
output_dir = os.path.join("data", "home-credit-default-risk", "output")

Read in the raw CSV data and store the data in Dask dataframes. The blocksize is set from the value contained in `primitive_dict` above. These values were found to work well on a MacBook Pro with 4 cores and 16GB of memory, but may need to be adjusted based on the available memory in your system.

In [None]:
%%time
# Read in the datasets, replacing the anomalous value 365243 with NaN
def read_table(filename):
    return dd.read_csv(os.path.join(input_dir, filename), blocksize=blocksize).replace({365243: np.nan})

app_train = read_table("application_train.csv")
app_test = read_table("application_test.csv")
bureau = read_table("bureau.csv")
bureau_balance = read_table("bureau_balance.csv")
cash = read_table("POS_CASH_balance.csv")
credit = read_table("credit_card_balance.csv")
previous = read_table("previous_application.csv")
installments = read_table("installments_payments.csv")
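
When tuning the blocksize, it can help to estimate how many partitions a given value produces for a file. The following is a rough sketch using only the standard library. `parse_size` and `estimate_partitions` are hypothetical helpers, the decimal interpretation of "MB" (10**6 bytes) is an assumption to check against your Dask version, and the 160 MB file size is made up for illustration.

```python
import math
import re

# Assumption: decimal units (1 MB = 10**6 bytes)
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text):
    """Convert a string such as '40MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2).upper() not in UNITS:
        raise ValueError(f"Cannot parse size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2).upper()])


def estimate_partitions(file_size_bytes, blocksize):
    """Approximate number of blocks needed to cover one file (at least one)."""
    return max(1, math.ceil(file_size_bytes / parse_size(blocksize)))


# A hypothetical 160 MB file, not a measured size from the dataset
example_size = 160 * 10**6
for size in ("1MB", "40MB", "100MB"):
    print(size, estimate_partitions(example_size, size))
```

For a real file, `os.path.getsize(path)` would supply the size in bytes. Smaller blocks mean more, lighter partitions, which trades scheduling overhead for lower memory use per task.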

Perform a few cleanup operations on the data.

In [None]:
%%time
# The test set has no TARGET column, so add an empty one before combining the two
app_test['TARGET'] = np.nan
app = dd.concat([app_train, app_test[app_train.columns]])

for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:
    for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
        if index in list(dataset.columns):
            dataset[index] = dataset[index].fillna(0).astype(np.int64)

# These tables are linked to `previous` through SK_ID_PREV, so SK_ID_CURR is not needed
installments = installments.drop(columns=['SK_ID_CURR'])
credit = credit.drop(columns=['SK_ID_CURR'])
cash = cash.drop(columns=['SK_ID_CURR'])

The current implementation of Dask entities does not support variable type inference, so you must specify the Featuretools variable type for every column in the dataframes used to create the entities. The following cell defines these variable types for each entity.

In [None]:
app_vtypes = {
    'SK_ID_CURR': ft.variable_types.variable.Index,
    'AMT_ANNUITY': ft.variable_types.variable.Numeric,
    'AMT_CREDIT': ft.variable_types.variable.Numeric,
    'AMT_GOODS_PRICE': ft.variable_types.variable.Numeric,
    'AMT_INCOME_TOTAL': ft.variable_types.variable.Numeric,
    'AMT_REQ_CREDIT_BUREAU_DAY': ft.variable_types.variable.Numeric,
    'AMT_REQ_CREDIT_BUREAU_HOUR': ft.variable_types.variable.Numeric,
    'AMT_REQ_CREDIT_BUREAU_MON': ft.variable_types.variable.Numeric,
    'AMT_REQ_CREDIT_BUREAU_QRT': ft.variable_types.variable.Numeric,
    'AMT_REQ_CREDIT_BUREAU_WEEK': ft.variable_types.variable.Numeric,
    'AMT_REQ_CREDIT_BUREAU_YEAR': ft.variable_types.variable.Numeric,
    'APARTMENTS_AVG': ft.variable_types.variable.Numeric,
    'APARTMENTS_MEDI': ft.variable_types.variable.Numeric,
    'APARTMENTS_MODE': ft.variable_types.variable.Numeric,
    'BASEMENTAREA_AVG': ft.variable_types.variable.Numeric,
    'BASEMENTAREA_MEDI': ft.variable_types.variable.Numeric,
    'BASEMENTAREA_MODE': ft.variable_types.variable.Numeric,
    'CNT_CHILDREN': ft.variable_types.variable.Numeric,
    'CNT_FAM_MEMBERS': ft.variable_types.variable.Numeric,
    'CODE_GENDER': ft.variable_types.variable.Categorical,
    'COMMONAREA_AVG': ft.variable_types.variable.Numeric,
    'COMMONAREA_MEDI': ft.variable_types.variable.Numeric,
    'COMMONAREA_MODE': ft.variable_types.variable.Numeric,
    'DAYS_BIRTH': ft.variable_types.variable.Numeric,
    'DAYS_EMPLOYED': ft.variable_types.variable.Numeric,
    'DAYS_ID_PUBLISH': ft.variable_types.variable.Numeric,
    'DAYS_LAST_PHONE_CHANGE': ft.variable_types.variable.Numeric,
    'DAYS_REGISTRATION': ft.variable_types.variable.Numeric,
    'DEF_30_CNT_SOCIAL_CIRCLE': ft.variable_types.variable.Numeric,
    'DEF_60_CNT_SOCIAL_CIRCLE': ft.variable_types.variable.Numeric,
    'ELEVATORS_AVG': ft.variable_types.variable.Numeric,
    'ELEVATORS_MEDI': ft.variable_types.variable.Numeric,
    'ELEVATORS_MODE': ft.variable_types.variable.Numeric,
    'EMERGENCYSTATE_MODE': ft.variable_types.variable.Categorical,
    'ENTRANCES_AVG': ft.variable_types.variable.Numeric,
    'ENTRANCES_MEDI': ft.variable_types.variable.Numeric,
    'ENTRANCES_MODE': ft.variable_types.variable.Numeric,
    'EXT_SOURCE_1': ft.variable_types.variable.Numeric,
    'EXT_SOURCE_2': ft.variable_types.variable.Numeric,
    'EXT_SOURCE_3': ft.variable_types.variable.Numeric,
    'FLAG_CONT_MOBILE': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_10': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_11': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_12': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_13': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_14': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_15': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_16': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_17': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_18': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_19': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_2': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_20': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_21': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_3': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_4': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_5': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_6': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_7': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_8': ft.variable_types.variable.Boolean,
    'FLAG_DOCUMENT_9': ft.variable_types.variable.Boolean,
    'FLAG_EMAIL': ft.variable_types.variable.Boolean,
    'FLAG_EMP_PHONE': ft.variable_types.variable.Boolean,
    'FLAG_MOBIL': ft.variable_types.variable.Boolean,
    'FLAG_OWN_CAR': ft.variable_types.variable.Categorical,
    'FLAG_OWN_REALTY': ft.variable_types.variable.Categorical,
    'FLAG_PHONE': ft.variable_types.variable.Boolean,
    'FLAG_WORK_PHONE': ft.variable_types.variable.Boolean,
    'FLOORSMAX_AVG': ft.variable_types.variable.Numeric,
    'FLOORSMAX_MEDI': ft.variable_types.variable.Numeric,
    'FLOORSMAX_MODE': ft.variable_types.variable.Numeric,
    'FLOORSMIN_AVG': ft.variable_types.variable.Numeric,
    'FLOORSMIN_MEDI': ft.variable_types.variable.Numeric,
    'FLOORSMIN_MODE': ft.variable_types.variable.Numeric,
    'FONDKAPREMONT_MODE': ft.variable_types.variable.Categorical,
    'HOUR_APPR_PROCESS_START': ft.variable_types.variable.Numeric,
    'HOUSETYPE_MODE': ft.variable_types.variable.Categorical,
    'LANDAREA_AVG': ft.variable_types.variable.Numeric,
    'LANDAREA_MEDI': ft.variable_types.variable.Numeric,
    'LANDAREA_MODE': ft.variable_types.variable.Numeric,
    'LIVE_CITY_NOT_WORK_CITY': ft.variable_types.variable.Boolean,
    'LIVE_REGION_NOT_WORK_REGION': ft.variable_types.variable.Boolean,
    'LIVINGAPARTMENTS_AVG': ft.variable_types.variable.Numeric,
    'LIVINGAPARTMENTS_MEDI': ft.variable_types.variable.Numeric,
    'LIVINGAPARTMENTS_MODE': ft.variable_types.variable.Numeric,
    'LIVINGAREA_AVG': ft.variable_types.variable.Numeric,
    'LIVINGAREA_MEDI': ft.variable_types.variable.Numeric,
    'LIVINGAREA_MODE': ft.variable_types.variable.Numeric,
    'NAME_CONTRACT_TYPE': ft.variable_types.variable.Categorical,
    'NAME_EDUCATION_TYPE': ft.variable_types.variable.Categorical,
    'NAME_FAMILY_STATUS': ft.variable_types.variable.Categorical,
    'NAME_HOUSING_TYPE': ft.variable_types.variable.Categorical,
    'NAME_INCOME_TYPE': ft.variable_types.variable.Categorical,
    'NAME_TYPE_SUITE': ft.variable_types.variable.Categorical,
    'NONLIVINGAPARTMENTS_AVG': ft.variable_types.variable.Numeric,
    'NONLIVINGAPARTMENTS_MEDI': ft.variable_types.variable.Numeric,
    'NONLIVINGAPARTMENTS_MODE': ft.variable_types.variable.Numeric,
    'NONLIVINGAREA_AVG': ft.variable_types.variable.Numeric,
    'NONLIVINGAREA_MEDI': ft.variable_types.variable.Numeric,
    'NONLIVINGAREA_MODE': ft.variable_types.variable.Numeric,
    'OBS_30_CNT_SOCIAL_CIRCLE': ft.variable_types.variable.Numeric,
    'OBS_60_CNT_SOCIAL_CIRCLE': ft.variable_types.variable.Numeric,
    'OCCUPATION_TYPE': ft.variable_types.variable.Categorical,
    'ORGANIZATION_TYPE': ft.variable_types.variable.Categorical,
    'OWN_CAR_AGE': ft.variable_types.variable.Numeric,
    'REGION_POPULATION_RELATIVE': ft.variable_types.variable.Numeric,
    'REGION_RATING_CLIENT': ft.variable_types.variable.Numeric,
    'REGION_RATING_CLIENT_W_CITY': ft.variable_types.variable.Numeric,
    'REG_CITY_NOT_LIVE_CITY': ft.variable_types.variable.Boolean,
    'REG_CITY_NOT_WORK_CITY': ft.variable_types.variable.Boolean,
    'REG_REGION_NOT_LIVE_REGION': ft.variable_types.variable.Boolean,
    'REG_REGION_NOT_WORK_REGION': ft.variable_types.variable.Boolean,
    'TARGET': ft.variable_types.variable.Numeric,
    'TOTALAREA_MODE': ft.variable_types.variable.Numeric,
    'WALLSMATERIAL_MODE': ft.variable_types.variable.Categorical,
    'WEEKDAY_APPR_PROCESS_START': ft.variable_types.variable.Categorical,
    'YEARS_BEGINEXPLUATATION_AVG': ft.variable_types.variable.Numeric,
    'YEARS_BEGINEXPLUATATION_MEDI': ft.variable_types.variable.Numeric,
    'YEARS_BEGINEXPLUATATION_MODE': ft.variable_types.variable.Numeric,
    'YEARS_BUILD_AVG': ft.variable_types.variable.Numeric,
    'YEARS_BUILD_MEDI': ft.variable_types.variable.Numeric,
    'YEARS_BUILD_MODE': ft.variable_types.variable.Numeric
}

bureau_vtypes = {
    'SK_ID_BUREAU': ft.variable_types.variable.Index,
    'SK_ID_CURR': ft.variable_types.variable.Id,
    'CREDIT_ACTIVE': ft.variable_types.variable.Categorical,
    'CREDIT_CURRENCY': ft.variable_types.variable.Categorical,
    'DAYS_CREDIT': ft.variable_types.variable.Numeric,
    'CREDIT_DAY_OVERDUE': ft.variable_types.variable.Numeric,
    'DAYS_CREDIT_ENDDATE': ft.variable_types.variable.Numeric,
    'DAYS_ENDDATE_FACT': ft.variable_types.variable.Numeric,
    'AMT_CREDIT_MAX_OVERDUE': ft.variable_types.variable.Numeric,
    'CNT_CREDIT_PROLONG': ft.variable_types.variable.Numeric,
    'AMT_CREDIT_SUM': ft.variable_types.variable.Numeric,
    'AMT_CREDIT_SUM_DEBT': ft.variable_types.variable.Numeric,
    'AMT_CREDIT_SUM_LIMIT': ft.variable_types.variable.Numeric,
    'AMT_CREDIT_SUM_OVERDUE': ft.variable_types.variable.Numeric,
    'CREDIT_TYPE': ft.variable_types.variable.Categorical,
    'DAYS_CREDIT_UPDATE': ft.variable_types.variable.Numeric,
    'AMT_ANNUITY': ft.variable_types.variable.Numeric
}

previous_vtypes = {
    'SK_ID_PREV': ft.variable_types.variable.Index,
    'SK_ID_CURR': ft.variable_types.variable.Id,
    'NAME_CONTRACT_TYPE': ft.variable_types.variable.Categorical,
    'AMT_ANNUITY': ft.variable_types.variable.Numeric,
    'AMT_APPLICATION': ft.variable_types.variable.Numeric,
    'AMT_CREDIT': ft.variable_types.variable.Numeric,
    'AMT_DOWN_PAYMENT': ft.variable_types.variable.Numeric,
    'AMT_GOODS_PRICE': ft.variable_types.variable.Numeric,
    'WEEKDAY_APPR_PROCESS_START': ft.variable_types.variable.Categorical,
    'HOUR_APPR_PROCESS_START': ft.variable_types.variable.Numeric,
    'FLAG_LAST_APPL_PER_CONTRACT': ft.variable_types.variable.Categorical,
    'NFLAG_LAST_APPL_IN_DAY': ft.variable_types.variable.Boolean,
    'RATE_DOWN_PAYMENT': ft.variable_types.variable.Numeric,
    'RATE_INTEREST_PRIMARY': ft.variable_types.variable.Numeric,
    'RATE_INTEREST_PRIVILEGED': ft.variable_types.variable.Numeric,
    'NAME_CASH_LOAN_PURPOSE': ft.variable_types.variable.Categorical,
    'NAME_CONTRACT_STATUS': ft.variable_types.variable.Categorical,
    'DAYS_DECISION': ft.variable_types.variable.Numeric,
    'NAME_PAYMENT_TYPE': ft.variable_types.variable.Categorical,
    'CODE_REJECT_REASON': ft.variable_types.variable.Categorical,
    'NAME_TYPE_SUITE': ft.variable_types.variable.Categorical,
    'NAME_CLIENT_TYPE': ft.variable_types.variable.Categorical,
    'NAME_GOODS_CATEGORY': ft.variable_types.variable.Categorical,
    'NAME_PORTFOLIO': ft.variable_types.variable.Categorical,
    'NAME_PRODUCT_TYPE': ft.variable_types.variable.Categorical,
    'CHANNEL_TYPE': ft.variable_types.variable.Categorical,
    'SELLERPLACE_AREA': ft.variable_types.variable.Numeric,
    'NAME_SELLER_INDUSTRY': ft.variable_types.variable.Categorical,
    'CNT_PAYMENT': ft.variable_types.variable.Numeric,
    'NAME_YIELD_GROUP': ft.variable_types.variable.Categorical,
    'PRODUCT_COMBINATION': ft.variable_types.variable.Categorical,
    'DAYS_FIRST_DRAWING': ft.variable_types.variable.Numeric,
    'DAYS_FIRST_DUE': ft.variable_types.variable.Numeric,
    'DAYS_LAST_DUE_1ST_VERSION': ft.variable_types.variable.Numeric,
    'DAYS_LAST_DUE': ft.variable_types.variable.Numeric,
    'DAYS_TERMINATION': ft.variable_types.variable.Numeric,
    'NFLAG_INSURED_ON_APPROVAL': ft.variable_types.variable.Numeric
}

bureau_balance_vtypes = {
    'bureaubalance_index': ft.variable_types.variable.Index,
    'SK_ID_BUREAU': ft.variable_types.variable.Id,
    'MONTHS_BALANCE': ft.variable_types.variable.Numeric,
    'STATUS': ft.variable_types.variable.Categorical
}

cash_vtypes = {
    'cash_index': ft.variable_types.variable.Index,
    'SK_ID_PREV': ft.variable_types.variable.Id,
    'MONTHS_BALANCE': ft.variable_types.variable.Numeric,
    'CNT_INSTALMENT': ft.variable_types.variable.Numeric,
    'CNT_INSTALMENT_FUTURE': ft.variable_types.variable.Numeric,
    'NAME_CONTRACT_STATUS': ft.variable_types.variable.Categorical,
    'SK_DPD': ft.variable_types.variable.Numeric,
    'SK_DPD_DEF': ft.variable_types.variable.Numeric
}

installments_vtypes = {
    'installments_index': ft.variable_types.variable.Index,
    'SK_ID_PREV': ft.variable_types.variable.Id,
    'NUM_INSTALMENT_VERSION': ft.variable_types.variable.Numeric,
    'NUM_INSTALMENT_NUMBER': ft.variable_types.variable.Numeric,
    'DAYS_INSTALMENT': ft.variable_types.variable.Numeric,
    'DAYS_ENTRY_PAYMENT': ft.variable_types.variable.Numeric,
    'AMT_INSTALMENT': ft.variable_types.variable.Numeric,
    'AMT_PAYMENT': ft.variable_types.variable.Numeric
}

credit_vtypes = {
    'credit_index': ft.variable_types.variable.Index,
    'SK_ID_PREV': ft.variable_types.variable.Id,
    'MONTHS_BALANCE': ft.variable_types.variable.Numeric,
    'AMT_BALANCE': ft.variable_types.variable.Numeric,
    'AMT_CREDIT_LIMIT_ACTUAL': ft.variable_types.variable.Numeric,
    'AMT_DRAWINGS_ATM_CURRENT': ft.variable_types.variable.Numeric,
    'AMT_DRAWINGS_CURRENT': ft.variable_types.variable.Numeric,
    'AMT_DRAWINGS_OTHER_CURRENT': ft.variable_types.variable.Numeric,
    'AMT_DRAWINGS_POS_CURRENT': ft.variable_types.variable.Numeric,
    'AMT_INST_MIN_REGULARITY': ft.variable_types.variable.Numeric,
    'AMT_PAYMENT_CURRENT': ft.variable_types.variable.Numeric,
    'AMT_PAYMENT_TOTAL_CURRENT': ft.variable_types.variable.Numeric,
    'AMT_RECEIVABLE_PRINCIPAL': ft.variable_types.variable.Numeric,
    'AMT_RECIVABLE': ft.variable_types.variable.Numeric,
    'AMT_TOTAL_RECEIVABLE': ft.variable_types.variable.Numeric,
    'CNT_DRAWINGS_ATM_CURRENT': ft.variable_types.variable.Numeric,
    'CNT_DRAWINGS_CURRENT': ft.variable_types.variable.Numeric,
    'CNT_DRAWINGS_OTHER_CURRENT': ft.variable_types.variable.Numeric,
    'CNT_DRAWINGS_POS_CURRENT': ft.variable_types.variable.Numeric,
    'CNT_INSTALMENT_MATURE_CUM': ft.variable_types.variable.Numeric,
    'NAME_CONTRACT_STATUS': ft.variable_types.variable.Categorical,
    'SK_DPD': ft.variable_types.variable.Numeric,
    'SK_DPD_DEF': ft.variable_types.variable.Numeric
}
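
Because the types are listed by hand, a column can easily be left out or misspelled. As a minimal sketch, a small check can compare a dataframe's columns against its type dictionary before the entity is created. `compare_columns` is a hypothetical helper, and the example data below is made up, with strings standing in for the Featuretools type classes.

```python
def compare_columns(columns, vtypes, created_index=None):
    """Return (missing, extra) column names for one dataframe and its type dict.

    missing: columns in the dataframe that have no variable type
    extra:   keys in the type dict that are not dataframe columns
    """
    expected = set(vtypes)
    if created_index is not None:
        # An index created with make_index=True is not in the raw dataframe
        expected.discard(created_index)
    actual = set(columns)
    return sorted(actual - expected), sorted(expected - actual)


# Small made-up example; strings stand in for the Featuretools type classes
example_columns = ["SK_ID_BUREAU", "MONTHS_BALANCE", "STATUS", "UNLISTED"]
example_vtypes = {
    "bureaubalance_index": "Index",
    "SK_ID_BUREAU": "Id",
    "MONTHS_BALANCE": "Numeric",
    "STATUS": "Categorical",
}
missing, extra = compare_columns(
    example_columns, example_vtypes, created_index="bureaubalance_index"
)
print("missing:", missing, "extra:", extra)
```

In this notebook the equivalent call would be `compare_columns(bureau_balance.columns, bureau_balance_vtypes, created_index="bureaubalance_index")`, with both returned lists expected to be empty.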

Next, we create the entityset from the Dask dataframes prepared above. The process is the same as for pandas dataframes, with one exception: the `variable_types` parameter must be supplied when creating entities from Dask dataframes.

In [None]:
%%time
es = ft.EntitySet(id='clients')

# Entities with a unique index
es = es.entity_from_dataframe(entity_id='app', dataframe=app, index='SK_ID_CURR',
                              variable_types=app_vtypes)

es = es.entity_from_dataframe(entity_id='bureau', dataframe=bureau, index='SK_ID_BUREAU',
                              variable_types=bureau_vtypes)

es = es.entity_from_dataframe(entity_id='previous', dataframe=previous, index='SK_ID_PREV',
                              variable_types=previous_vtypes)

# Entities that do not have a unique index
es = es.entity_from_dataframe(entity_id='bureau_balance', dataframe=bureau_balance,
                              make_index=True, index='bureaubalance_index',
                              variable_types=bureau_balance_vtypes)

es = es.entity_from_dataframe(entity_id='cash', dataframe=cash,
                              make_index=True, index='cash_index',
                              variable_types=cash_vtypes)

es = es.entity_from_dataframe(entity_id='installments', dataframe=installments,
                              make_index=True, index='installments_index',
                              variable_types=installments_vtypes)

es = es.entity_from_dataframe(entity_id='credit', dataframe=credit,
                              make_index=True, index='credit_index',
                              variable_types=credit_vtypes)

print("Adding relationships...")
# Relationship between app and bureau
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# Relationship between bureau and bureau balance
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# Relationship between current app and previous apps
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# Relationships between previous apps and cash, installments, and credit
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

# Add in the defined relationships
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                           r_previous_cash, r_previous_installments, r_previous_credit])
# Print out the EntitySet
print(es)
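
The six relationships form a tree rooted at `app`. As a plain-Python sketch that does not use Featuretools, the same structure can be written as (parent, child, key) tuples and printed as an indented outline. `print_tree` and `depth_below` are hypothetical helpers for illustration only.

```python
# (parent_entity, child_entity, shared_key) for each relationship defined above
relationships = [
    ("app", "bureau", "SK_ID_CURR"),
    ("bureau", "bureau_balance", "SK_ID_BUREAU"),
    ("app", "previous", "SK_ID_CURR"),
    ("previous", "cash", "SK_ID_PREV"),
    ("previous", "installments", "SK_ID_PREV"),
    ("previous", "credit", "SK_ID_PREV"),
]

# Map each parent entity to the entities that reference it
children = {}
for parent, child, _key in relationships:
    children.setdefault(parent, []).append(child)


def print_tree(entity, level=0):
    """Print an entity and, indented beneath it, the entities that reference it."""
    print("  " * level + entity)
    for child in children.get(entity, []):
        print_tree(child, level + 1)


def depth_below(entity):
    """Number of relationship hops from `entity` down to its deepest descendant."""
    return max((1 + depth_below(c) for c in children.get(entity, [])), default=0)


print_tree("app")
```

Every other entity sits at most two relationship hops below `app`, which is the target entity used when generating features.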

Create a `cutoff_times` dataframe to pass to `ft.dfs()`. This step is optional; the `cutoff_time` parameter can be omitted when calling DFS. With the current implementation, supplying a cutoff time dataframe is slightly faster than supplying a single cutoff time value, although both approaches should produce the same result. This should be improved in future releases of Featuretools.

In [None]:
%%time
cutoff_times = app["SK_ID_CURR"].to_frame().rename(columns={"SK_ID_CURR": "instance_id"})
cutoff_times["time"] = datetime.now()
cutoff_times = cutoff_times.compute()

Now, we can run `ft.dfs()` to generate the feature matrix. The feature matrix will be returned as a Dask dataframe.

In [None]:
%%time
fm, features = ft.dfs(entityset=es, target_entity="app",
                      trans_primitives=trans_primitives,
                      agg_primitives=agg_primitives,
                      where_primitives=[], seed_features=[],
                      max_depth=2, verbose=0,
                      cutoff_time=cutoff_times)

The feature matrix can now be saved to disk. Note that this process may take several minutes to complete, depending on the size of the feature matrix that was generated.

At times this process may fail due to memory issues. These issues can sometimes be resolved by using a smaller `blocksize` when reading in the original CSV data, so that Dask has smaller chunks of data to work with. Another potential solution is to use workers with more available memory.

In [None]:
%%time
fm.to_csv(os.path.join(output_dir, f"fm_{version}-*.csv"), index=False)
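
Because of the `*` in the file name, Dask writes one CSV file per partition instead of a single file. As a small sketch, the resulting part files can be collected with the standard library. `list_parts` is a hypothetical helper, and the file names in the demonstration are empty stand-ins whose numbering is an assumption, not actual Dask output.

```python
import glob
import os
import tempfile


def list_parts(directory, version):
    """Return the sorted paths of the part files written for one version."""
    pattern = os.path.join(directory, f"fm_{version}-*.csv")
    return sorted(glob.glob(pattern))


# Demonstration in a temporary directory with empty stand-in files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fm_v1-0.csv", "fm_v1-1.csv", "fm_v2-0.csv"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_parts(tmp, "v1")]

print(found)
```

In this notebook the call would be `list_parts(output_dir, version)`, and the same glob pattern can be passed to `dd.read_csv` to load the saved feature matrix back as a Dask dataframe.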

Alternatively, the feature matrix can be brought into memory by calling `.compute()` on the Dask feature matrix returned from `ft.dfs()`. Note that this may fail, depending on the size of the generated feature matrix and the available system memory.

In [None]:
%%time
fm_computed = fm.compute()
print("Shape: {}".format(fm_computed.shape))
print("Memory: {} MB".format(fm_computed.memory_usage().sum() / 1000000))

In [None]:
fm_computed.head()

Now that we are finished, we can close our Dask client.

In [None]:
client.close()