# Featuretools Implementation with Dask

In this notebook we use Dask to run deep feature synthesis on the entire dataset and generate the feature matrix. Running this in a single pass over the whole Kaggle Home Credit dataset is not feasible on a personal laptop, but by working on partitions of the data and letting Dask process them in parallel, we can complete the computation on a laptop in a reasonable amount of time.

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

import featuretools.variable_types as vtypes

import sys
import psutil

import os

#### Load in Feature names

We already calculated the feature names, so we can simply read them in. This avoids having to regenerate the feature definitions for each partition.

In [9]:
featurenames = ft.load_features('../input/feature_names.txt')
print(len(featurenames))

1820


# Variable Types

The following are the variable types defined in the Automated Feature Engineering notebook. Defining them once here saves us from having to define them separately each time an entity is created.

In [3]:
app_types = {
    'FLAG_CONT_MOBILE': vtypes.Boolean,
    'FLAG_DOCUMENT_10': vtypes.Boolean,
    'FLAG_DOCUMENT_11': vtypes.Boolean,
    'FLAG_DOCUMENT_12': vtypes.Boolean,
    'FLAG_DOCUMENT_13': vtypes.Boolean,
    'FLAG_DOCUMENT_14': vtypes.Boolean,
    'FLAG_DOCUMENT_15': vtypes.Boolean,
    'FLAG_DOCUMENT_16': vtypes.Boolean,
    'FLAG_DOCUMENT_17': vtypes.Boolean,
    'FLAG_DOCUMENT_18': vtypes.Boolean,
    'FLAG_DOCUMENT_19': vtypes.Boolean,
    'FLAG_DOCUMENT_2': vtypes.Boolean,
    'FLAG_DOCUMENT_20': vtypes.Boolean,
    'FLAG_DOCUMENT_21': vtypes.Boolean,
    'FLAG_DOCUMENT_3': vtypes.Boolean,
    'FLAG_DOCUMENT_4': vtypes.Boolean,
    'FLAG_DOCUMENT_5': vtypes.Boolean,
    'FLAG_DOCUMENT_6': vtypes.Boolean,
    'FLAG_DOCUMENT_7': vtypes.Boolean,
    'FLAG_DOCUMENT_8': vtypes.Boolean,
    'FLAG_DOCUMENT_9': vtypes.Boolean,
    'FLAG_EMAIL': vtypes.Boolean,
    'FLAG_EMP_PHONE': vtypes.Boolean,
    'FLAG_MOBIL': vtypes.Boolean,
    'FLAG_PHONE': vtypes.Boolean,
    'FLAG_WORK_PHONE': vtypes.Boolean,
    'LIVE_CITY_NOT_WORK_CITY': vtypes.Boolean,
    'LIVE_REGION_NOT_WORK_REGION': vtypes.Boolean,
    'REG_CITY_NOT_LIVE_CITY': vtypes.Boolean,
    'REG_CITY_NOT_WORK_CITY': vtypes.Boolean,
    'REG_REGION_NOT_LIVE_REGION': vtypes.Boolean,
    'REG_REGION_NOT_WORK_REGION': vtypes.Boolean,
    'REGION_RATING_CLIENT': vtypes.Ordinal,
    'REGION_RATING_CLIENT_W_CITY': vtypes.Ordinal,
    'HOUR_APPR_PROCESS_START': vtypes.Ordinal,
}

In [4]:
previous_types = {'NFLAG_LAST_APPL_IN_DAY': vtypes.Boolean, 
             'NFLAG_INSURED_ON_APPROVAL': vtypes.Boolean}

## Function to Create EntitySet from Partition 

The data has already been broken into 50 partitions, so we will write a function that takes a single partition and creates the entity set for that data. This can then be passed into a function that calculates the feature matrix from the entity set.

In [5]:
def entityset_from_partition(path):
    """Create an EntitySet from a partition of data specified as a path"""
    
    app = pd.read_csv('%s/app.csv' % path)
    bureau = pd.read_csv('%s/bureau.csv' % path)
    bureau_balance = pd.read_csv('%s/bureau_balance.csv' % path)
    previous = pd.read_csv('%s/previous.csv' % path)
    credit = pd.read_csv('%s/credit.csv' % path)
    installments = pd.read_csv('%s/installments.csv' % path)
    cash = pd.read_csv('%s/cash.csv' % path)
    
    # Empty entityset
    es = ft.EntitySet(id = 'clients')
    
    # Entities with a unique index
    es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR',
                                  variable_types = app_types)

    es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')

    es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV',
                                  variable_types = previous_types)

    # Entities that do not have a unique index
    es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                                  make_index = True, index = 'bureaubalance_index')

    es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                                  make_index = True, index = 'cash_index')

    es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                                  make_index = True, index = 'installments_index')

    es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                                  make_index = True, index = 'credit_index')
    
    # Relationship between app_train and bureau
    r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

    # Relationship between bureau and bureau balance
    r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

    # Relationship between current app and previous apps
    r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

    # Relationships between previous apps and cash, installments, and credit
    r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
    r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
    r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
    
    # Add in the defined relationships
    es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                               r_previous_cash, r_previous_installments, r_previous_credit])

    return es

Let's test the function to make sure it can make an `EntitySet` from a data partition.

In [6]:
es1 = entityset_from_partition('../input/partitions/p1')
es1

Entityset: clients
  Entities:
    app [Rows: 7125, Columns: 122]
    bureau [Rows: 33478, Columns: 17]
    previous [Rows: 32795, Columns: 37]
    bureau_balance [Rows: 346658, Columns: 4]
    cash [Rows: 197046, Columns: 8]
    installments [Rows: 264809, Columns: 8]
    credit [Rows: 74219, Columns: 23]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV

That works! The next step is to create a feature matrix for the partition from the `EntitySet` and the `features`. 

## Function to Create Featurematrix from EntitySet 

In [16]:
def feature_matrix_from_entityset(es, feature_names):
    """Run deep feature synthesis from an entityset and feature names"""

    feature_matrix = ft.calculate_feature_matrix(feature_names, 
                                                 entityset=es, 
                                                 n_jobs = 1, 
                                                 verbose = 0,
                                                 chunk_size = 1000)
    
    return feature_matrix

In [15]:
fm1 = feature_matrix_from_entityset(es1, featurenames)
fm1.shape

EntitySet scattered to workers in 10.319 seconds

Elapsed: 00:00 | Remaining: ? | Progress:   0%|          | Calculated: 0/8 chunks
Elapsed: 03:53 | Remaining: 27:17 | Progress:  12%|█▎        | Calculated: 1/8 chunks
Elapsed: 05:14 | Remaining: 18:46 | Progress:  25%|██▌       | Calculated: 2/8 chunks
Elapsed: 05:14 | Remaining: 10:58 | Progress:  38%|███▊      | Calculated: 3/8 chunks
Elapsed: 05:16 | Remaining: 06:10 | Progress:  50%|█████     | Calculated: 4/8 chunks
Elapsed: 05:16 | Remaining: 02:09 | Progress:  75%|███████▌  | Calculated: 6/8 chunks
Elapsed: 05:17 | Remaining: 00:45 | Progress:  88%|████████▊ | Calculated: 7/8 chunks
Elapsed: 05:17 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 8/8 chunks

(7125, 1820)
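
The eight chunks in the progress output are consistent with the partition size and the `chunk_size` argument. As a quick check, assuming the rows are simply split into groups of at most `chunk_size` (an assumption about how the count arises, not a statement about featuretools internals):

```python
import math

n_rows = 7125       # rows in the app entity for partition p1
chunk_size = 1000   # value passed to calculate_feature_matrix above

# Number of groups of at most chunk_size rows needed to cover n_rows
n_chunks = math.ceil(n_rows / chunk_size)
print(n_chunks)
```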

We have all the parts needed to create our feature matrices. The last step is to get Dask to run this in parallel. For the Dask implementation, we set `n_jobs = 1` and `verbose = 0` because Dask already runs the calculation in parallel and provides its own progress bar.

# Dask

We will use the Dask utility `delayed` to parallelize the operation. We iterate through each path in the list of partitions and tell Dask to first create the entity set from the partition, then create the feature matrix from the entity set, and finally append the feature matrix to a list of feature matrices. The last step is to `concat`enate all of the feature matrices into one final matrix that we save to disk. Because `delayed` is lazy, none of this work runs until we call `compute`.
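
To make the pattern concrete, here is a minimal sketch of the same structure on toy data. The functions `load_partition` and `summarize` are hypothetical stand-ins for the two functions defined earlier; only `delayed` and `compute` come from Dask.

```python
from dask import delayed

def load_partition(n):
    """Hypothetical stand-in for entityset_from_partition."""
    return list(range(n))

def summarize(values):
    """Hypothetical stand-in for feature_matrix_from_entityset."""
    return sum(values)

# Nothing runs here: each call only records a task in the graph
results = [delayed(summarize)(delayed(load_partition)(n)) for n in (3, 4, 5)]

# Final task combines the per-partition results, like pd.concat in the notebook
total = delayed(sum)(results)

# The whole graph executes only when compute() is called
answer = total.compute()
print(answer)
```

The per-partition tasks do not depend on one another, which is what lets Dask schedule them in parallel.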

In [18]:
from dask import delayed
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, CacheProfiler

from timeit import default_timer as timer

In [19]:
paths = ['../input/partitions/%s' % file for file in os.listdir('../input/partitions/')]
paths[:8]

['../input/partitions/p4',
 '../input/partitions/p3',
 '../input/partitions/p2',
 '../input/partitions/p5',
 '../input/partitions/p19',
 '../input/partitions/p26',
 '../input/partitions/p21',
 '../input/partitions/p28']
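
Since `os.listdir` returns every entry in the folder, stray files can end up in the list. A minimal sketch of a stricter listing, assuming each partition is a subdirectory (the helper name `list_partition_paths` is hypothetical):

```python
import os
import tempfile

def list_partition_paths(root):
    """Return sorted paths of the subdirectories of root, skipping plain files."""
    paths = []
    for name in sorted(os.listdir(root)):
        full_path = os.path.join(root, name)
        # Partitions are assumed to be directories; files such as .DS_Store are skipped
        if os.path.isdir(full_path):
            paths.append(full_path)
    return paths

# Small demonstration on a temporary folder
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'p2'))
    os.mkdir(os.path.join(root, 'p1'))
    open(os.path.join(root, '.DS_Store'), 'w').close()
    found = [os.path.basename(p) for p in list_partition_paths(root)]

print(found)
```

The sort is lexicographic, so `p10` comes before `p2`. That only affects the row order of the final matrix.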

In [21]:
# Empty list of feature matrices
fms = []

# Iterate through the paths
for path in paths:
    # Skip macOS metadata files picked up by os.listdir
    if '.DS_Store' in path:
        continue

    # Make the entityset
    es = delayed(entityset_from_partition)(path)

    # Make the feature matrix
    fm = delayed(feature_matrix_from_entityset)(es, feature_names = featurenames)
    fms.append(fm)
    
# Final operation will be to concatenate together all of the feature matrices
X = delayed(pd.concat)(fms, axis = 0)
X

Delayed('concat-4fe40087-a5b1-4847-8f5e-b4026d2b2554')

In [22]:
start = timer()

# Context for progress and profiling
with ProgressBar(), Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
    feature_matrix = X.compute()

end = timer()

[######                                  ] | 15% Completed |  3hr 23min  9.4s



[########################################] | 100% Completed | 21hr 13min  7.1s
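
The timing cell records `start` and `end` but never reports the difference. A small sketch of how the elapsed time could be printed in the same style as the progress bar (`format_elapsed` is a hypothetical helper, and the timed statement is a placeholder for `X.compute()`):

```python
from timeit import default_timer as timer

def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return '%dhr %dmin %.1fs' % (hours, minutes, secs)

start = timer()
placeholder = sum(range(1000))  # stands in for the real computation
end = timer()

print('Elapsed: %s' % format_elapsed(end - start))
```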


In [23]:
feature_matrix.shape

(356255, 1820)

In [24]:
# Save the feature matrix
feature_matrix.to_csv('../input/feature_matrix.csv', chunksize = 1000)

# Visualizations of Run

We can use Bokeh and the built-in plotting capabilities of Dask to plot the resources used during the computation. This is not necessary for the project, but it can be interesting for understanding how Dask operates.

In [53]:
import bokeh
from bokeh.io import output_notebook
output_notebook()

In [50]:
prof.visualize()

ImportError: cannot import name '_state'

In [39]:
with open('../input/prof.txt', 'w') as f:
    f.write(str(prof.results))
    
with open('../input/rprof.txt', 'w') as f:
    f.write(str(rprof.results))
    
with open('../input/cprof.txt', 'w') as f:
    f.write(str(cprof.results))
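
Writing `str(...)` of the results produces text that is awkward to load back for later analysis. A minimal sketch of an alternative, assuming each profiler result is a namedtuple-like record (`TaskRecord` and `records_to_csv` are hypothetical names used for illustration, not part of Dask):

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type standing in for the profiler's result rows
TaskRecord = namedtuple('TaskRecord', ['key', 'start_time', 'end_time'])

def records_to_csv(records, handle):
    """Write a list of namedtuples to handle as CSV with a header row."""
    if not records:
        return
    writer = csv.writer(handle)
    writer.writerow(records[0]._fields)
    writer.writerows(records)

# Demonstration with made-up records and an in-memory buffer
records = [TaskRecord('concat-1', 0.0, 1.5), TaskRecord('concat-2', 1.5, 4.0)]
buffer = io.StringIO()
records_to_csv(records, buffer)

# Read the CSV text back to confirm it round-trips
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[0])
```

With a real file, the same helper could be called inside `with open(path, 'w', newline='') as f:`.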

# Conclusions

In this notebook we used Dask to complete an operation that would normally require far more resources than are available on a personal computer. Although featuretools supports parallel processing, there are still some issues that need to be worked out, so we took matters into our own hands to get the job done. This approach allows anyone with reasonable hardware to take advantage of featuretools. The next notebook is Sampling and Feature Selection, where we reduce the number of features to allow for reasonable modeling times.