![](../../images/featuretools.png)

# Featuretools Implementation with Dask

A simple run of Deep Feature Synthesis from the Automated Loan Repayment notebook takes about 25 hours on an AWS machine with 64 GB of RAM! Clearly we need a better approach for practical implementations of calculating a large feature matrix.

Featuretools does have support for parallel processing if you have multiple cores (which nearly every single laptop now does), but it currently sends the entire EntitySet to each process which means you might exhaust the memory on any one core. For example, that AWS machine has 8 GB per core, which might seem like a lot until you realize the EntitySet takes up about 11 GB and setting `n_jobs=-1` will cause an out of memory error. Therefore, we cannot use the parallel processing in Featuretools and instead have to build our own implementation with Dask. 

Fortunately, options such as [Dask](https://dask.pydata.org/en/latest/) make it easy to take advantage of multiple cores on our own machine. In this notebook, we'll see how to run Deep Feature Synthesis in about 3 hours on a personal laptop with 16 GB of RAM. 

<p align = "center">
    <img src = "../../images/dask_logo.png" width = "400">
</p>


## Roadmap

The following is our plan of action for implementing Dask:

1. Convert `object` data types to `category`
    * This reduces memory consumption significantly
2. Create 104 partitions of data and save to disk
    * Each partition will contain data from all seven tables for 1/104 of the client ids, `SK_ID_CURR`
    * Each partition can be used to make an EntitySet and hence a feature matrix
3. Write a function to take a partition and create an `EntitySet`
4. Write a function to take an `EntitySet` and calculate a `feature_matrix`
    * Since we already have the feature names, we can use `ft.calculate_feature_matrix`
5. Use Dask with system processes to generate feature matrices for 8 partitions at a time
    * Save these subset feature matrices to disk
    * Using processes will start 8 workers, one for each core, with 2 GB of memory each
    * We can't generate the entire feature matrix at once using processes because the final feature matrix is too large to fit on a single core
6. Use Dask with threads to read in subset feature matrices and create final feature matrix
    * Using threads will start 1 worker with 16 GB of memory, enough to hold the entire feature matrix
    * Can save this feature matrix to disk for later use in a machine learning pipeline
    
This might seem like a lot of tasks, but each one is a relatively simple step. At the end, we'll have a working implementation of Dask that lets us take full advantage of our computing resources. While we could solve this whole problem by just renting a larger machine, this approach will give us a chance to learn about how to work with constraints and engineer a solution. Sometimes having too many resources can limit your creativity, and working with constraints forces us to be innovative! 

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft
import featuretools.variable_types as vtypes

# Utilities
import sys
import psutil
import os

from timeit import default_timer as timer

In [2]:
def convert_types(df):
    # Iterate through each column
    for c in df:
        
        # Convert ids to integers (missing ids are filled with 0)
        if ('SK_ID' in c):
            df[c] = df[c].fillna(0).astype(np.int32)
            
        # Convert objects to category
        elif (df[c].dtype == 'object') and (df[c].nunique() < df.shape[0]):
            df[c] = df[c].astype('category')
        
        # Booleans mapped to integers
        elif set(df[c].unique()) == {0, 1}:
            df[c] = df[c].astype(bool)
        
        # Float64 to float32
        elif df[c].dtype == float:
            df[c] = df[c].astype(np.float32)
            
        # Int64 to int32
        elif df[c].dtype == int:
            df[c] = df[c].astype(np.int32)
        
    return df

In [3]:
# Read in the datasets and replace the anomalous values
app_train = pd.read_csv('../input/application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('../input/application_test.csv').replace({365243: np.nan})
bureau = pd.read_csv('../input/bureau.csv').replace({365243: np.nan})
bureau_balance = pd.read_csv('../input/bureau_balance.csv').replace({365243: np.nan})
cash = pd.read_csv('../input/POS_CASH_balance.csv').replace({365243: np.nan})
credit = pd.read_csv('../input/credit_card_balance.csv').replace({365243: np.nan})
previous = pd.read_csv('../input/previous_application.csv').replace({365243: np.nan})
installments = pd.read_csv('../input/installments_payments.csv').replace({365243: np.nan})

app_test['TARGET'] = np.nan

# Join together training and testing
app = app_train.append(app_test, ignore_index = True, sort = True)

# Need `SK_ID_CURR` in every dataset
bureau_balance = bureau_balance.merge(bureau[['SK_ID_CURR', 'SK_ID_BUREAU']], 
                                      on = 'SK_ID_BUREAU', how = 'left')

print(f"""Total memory before converting types: \
{round(np.sum([x.memory_usage().sum() / 1e9 for x in 
[app, bureau, bureau_balance, cash, credit, previous, installments]]), 2)} gb.""")

# Convert types to reduce memory usage
app = convert_types(app)
bureau = convert_types(bureau)
bureau_balance = convert_types(bureau_balance)
cash = convert_types(cash)
credit = convert_types(credit)
previous = convert_types(previous)
installments = convert_types(installments)

print(f"""Total memory after converting types: \
{round(np.sum([x.memory_usage().sum() / 1e9 for x in 
[app, bureau, bureau_balance, cash, credit, previous, installments]]), 2)} gb.""")

# Set the index for locating
for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
    dataset.set_index('SK_ID_CURR', inplace = True)

Total memory before converting types: 4.38 gb.
Total memory after converting types: 2.06 gb.


In [4]:
print('Object memory usage.')
print(bureau['CREDIT_TYPE'].astype('object').memory_usage() / 1e9, 'gb')

print('Category memory usage.')
print(bureau['CREDIT_TYPE'].astype('category').memory_usage() / 1e9, 'gb')

print('Length of data: ', bureau.shape[0])
print('Number of unique categories: ', bureau['CREDIT_TYPE'].nunique())

Object memory usage.
0.027462848 gb
Category memory usage.
0.015448612 gb
Length of data:  1716428
Number of unique categories:  15
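
One caveat about the comparison above: by default `memory_usage()` counts only the 8-byte pointers of an `object` column, not the Python strings behind them. The 0.027 gb figure works out to 16 bytes per row for 1,716,428 rows, which is consistent with an 8-byte index entry plus an 8-byte pointer, so the real saving from `category` is probably larger than shown. The sketch below uses a small synthetic column (the values are made up for illustration) and `deep=True` to count the strings as well.

```python
import pandas as pd

# Synthetic stand-in for a low-cardinality text column such as CREDIT_TYPE
values = ["Consumer credit", "Credit card", "Car loan", "Mortgage"] * 25_000
as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# Shallow: counts the index plus one pointer per row
shallow_object_bytes = as_object.memory_usage()
# Deep: also counts the Python string objects the pointers refer to
deep_object_bytes = as_object.memory_usage(deep=True)
# Category: small integer codes plus one copy of each unique string
category_bytes = as_category.memory_usage(deep=True)

print(f"object (shallow): {shallow_object_bytes / 1e6:.2f} MB")
print(f"object (deep):    {deep_object_bytes / 1e6:.2f} MB")
print(f"category (deep):  {category_bytes / 1e6:.2f} MB")
```

The same `deep=True` argument could be passed in the memory totals printed earlier to get a more realistic before-and-after comparison.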


# Partitioning Data

Next, we partition the data into 104 separate datasets based on the client id, `SK_ID_CURR`. Each partition by itself can be used to make an `EntitySet` and later a feature matrix. One partition will contain seven data tables, each with only the data associated with that set of clients. The choice of 104 partitions is somewhat arbitrary (it divides evenly into 13 batches of 8, one partition per core), and it might be worth exploring other options to see which works best.

In [5]:
def create_partition(user_list, partition):
    """Creates and saves a dataset with only the users in `user_list`."""
    
    # Make the directory
    directory = '../input/partitions/p%d' % (partition + 1)
    if os.path.exists(directory):
        return
    
    else:
        os.makedirs(directory)
        
        # Subset based on user list
        app_subset = app[app.index.isin(user_list)].copy().reset_index()
        bureau_subset = bureau[bureau.index.isin(user_list)].copy().reset_index()

        # Drop SK_ID_CURR from bureau_balance, cash, credit, and installments
        bureau_balance_subset = bureau_balance[bureau_balance.index.isin(user_list)].copy().reset_index(drop = True)
        cash_subset = cash[cash.index.isin(user_list)].copy().reset_index(drop = True)
        credit_subset = credit[credit.index.isin(user_list)].copy().reset_index(drop = True)
        previous_subset = previous[previous.index.isin(user_list)].copy().reset_index()
        installments_subset = installments[installments.index.isin(user_list)].copy().reset_index(drop = True)
        

        # Save data to the directory
        app_subset.to_csv('%s/app.csv' % directory, index = False)
        bureau_subset.to_csv('%s/bureau.csv' % directory, index = False)
        bureau_balance_subset.to_csv('%s/bureau_balance.csv' % directory, index = False)
        cash_subset.to_csv('%s/cash.csv' % directory, index = False)
        credit_subset.to_csv('%s/credit.csv' % directory, index = False)
        previous_subset.to_csv('%s/previous.csv' % directory, index = False)
        installments_subset.to_csv('%s/installments.csv' % directory, index = False)

        if partition % 10 == 0:
            print('Saved all files in partition {} to {}.'.format(partition + 1, directory))

In [6]:
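# Break into 104 chunks: integer division by 103 leaves a remainder,
# so stepping through the ids in strides of `chunk_size` yields one extra, smaller chunk
chunk_size = app.shape[0] // 103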

# Construct an id list
id_list = [list(app.iloc[i:i+chunk_size].index) for i in range(0, app.shape[0], chunk_size)]

In [7]:
from itertools import chain

# Sanity check that we have not missed any ids
print('Number of ids in id_list:         {}.'.format(len(list(chain(*id_list)))))
print('Total length of application data: {}.'.format(len(app)))

Number of ids in id_list:         356255.
Total length of application data: 356255.
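
It may not be obvious why dividing by 103 produces 104 chunks. The arithmetic below reproduces the chunk boundaries using only the row count reported above; no data is needed.

```python
n_clients = 356_255                # length of `app` reported above
chunk_size = n_clients // 103      # same expression as in the notebook

# Start offsets produced by range(0, n_clients, chunk_size)
starts = list(range(0, n_clients, chunk_size))
# Size of each chunk; the last one holds whatever is left over
sizes = [min(chunk_size, n_clients - s) for s in starts]

print(chunk_size)    # 3458
print(len(sizes))    # 104
print(sizes[-1])     # 81
```

So 103 partitions hold 3458 clients each and the final one holds the remaining 81.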


In [8]:
start = timer()
for i, ids in enumerate(id_list):
    # Create a partition based on the ids
    create_partition(ids, i)
    
end = timer()
print(f'Partitioning took {round(end - start)} seconds.')

Partitioning took 0 seconds.


__I already had the partitions made, but running the above cell took 1300 seconds (about 22 minutes) the first time.__

We can independently generate the feature matrix for each partition because the partition contains all the data for that group of clients. These partitioned feature matrices can then be joined together into larger feature matrices, and eventually one single matrix with all of the clients.
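
To see why this works, here is a minimal sketch with toy data. The table, column names, and the `client_features` helper are hypothetical stand-ins, and a simple `groupby` aggregation plays the role of a feature matrix. Because every child row for a client lives in the same partition, aggregating per partition and concatenating gives the same result as aggregating everything at once.

```python
import pandas as pd

# Toy child table: several rows per client (hypothetical data)
child = pd.DataFrame({
    "client_id": [1, 1, 2, 3, 3, 3, 4],
    "amount": [10.0, 20.0, 5.0, 1.0, 2.0, 3.0, 7.0],
})

def client_features(child_rows):
    """Aggregate child rows to one row per client (stand-in for a feature matrix)."""
    return child_rows.groupby("client_id")["amount"].agg(["sum", "mean", "count"])

# Features computed on the whole table at once
full = client_features(child)

# Features computed partition by partition, then stacked
id_chunks = [[1, 2], [3, 4]]
parts = [client_features(child[child["client_id"].isin(ids)]) for ids in id_chunks]
combined = pd.concat(parts, axis=0).sort_index()

print(combined.equals(full))
```

This only holds for features that depend on a single client's own rows, which is the case for aggregations up the parent-child relationships used here.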

#### Load in Feature names

We already calculated the feature names, so we can read them in. This avoids having to recalculate the features on each partition. Instead of using `ft.dfs`, if we have the feature names, we can use `ft.calculate_feature_matrix` and pass in the `EntitySet` and the feature names.

In [9]:
featurenames = ft.load_features('../input/features.txt')
print(len(featurenames))

1820


For each feature matrix, we'll make 1820 features! 

#### Variable Types

In the Automated notebook, we specified the variable types when adding entities to the entityset. However, since we have already properly defined the data types for each column, Featuretools will now infer the correct variable type. For example, while before we had Booleans mapped to integers, which would be interpreted as numeric, now the Booleans are represented as Booleans and hence will be correctly inferred by Featuretools.

In [10]:
# app_types = {'FLAG_CONT_MOBILE': vtypes.Boolean, 'FLAG_DOCUMENT_10': vtypes.Boolean, 'FLAG_DOCUMENT_11': vtypes.Boolean, 'FLAG_DOCUMENT_12': vtypes.Boolean, 'FLAG_DOCUMENT_13': vtypes.Boolean, 'FLAG_DOCUMENT_14': vtypes.Boolean, 'FLAG_DOCUMENT_15': vtypes.Boolean, 'FLAG_DOCUMENT_16': vtypes.Boolean, 'FLAG_DOCUMENT_17': vtypes.Boolean, 'FLAG_DOCUMENT_18': vtypes.Boolean, 'FLAG_DOCUMENT_19': vtypes.Boolean, 'FLAG_DOCUMENT_2': vtypes.Boolean, 'FLAG_DOCUMENT_20': vtypes.Boolean, 'FLAG_DOCUMENT_21': vtypes.Boolean, 'FLAG_DOCUMENT_3': vtypes.Boolean, 'FLAG_DOCUMENT_4': vtypes.Boolean, 'FLAG_DOCUMENT_5': vtypes.Boolean, 'FLAG_DOCUMENT_6': vtypes.Boolean, 'FLAG_DOCUMENT_7': vtypes.Boolean, 'FLAG_DOCUMENT_8': vtypes.Boolean, 'FLAG_DOCUMENT_9': vtypes.Boolean, 'FLAG_EMAIL': vtypes.Boolean, 'FLAG_EMP_PHONE': vtypes.Boolean, 'FLAG_MOBIL': vtypes.Boolean, 'FLAG_PHONE': vtypes.Boolean, 'FLAG_WORK_PHONE': vtypes.Boolean, 'LIVE_CITY_NOT_WORK_CITY': vtypes.Boolean, 'LIVE_REGION_NOT_WORK_REGION': vtypes.Boolean, 'REG_CITY_NOT_LIVE_CITY': vtypes.Boolean, 'REG_CITY_NOT_WORK_CITY': vtypes.Boolean, 'REG_REGION_NOT_LIVE_REGION': vtypes.Boolean, 'REG_REGION_NOT_WORK_REGION': vtypes.Boolean, 'REGION_RATING_CLIENT': vtypes.Ordinal, 'REGION_RATING_CLIENT_W_CITY': vtypes.Ordinal, 'HOUR_APPR_PROCESS_START': vtypes.Ordinal}
# previous_types = {'NFLAG_LAST_APPL_IN_DAY': vtypes.Boolean, 
#              'NFLAG_INSURED_ON_APPROVAL': vtypes.Boolean}

## Function to Create EntitySet from Partition 

The next function takes a single partition of data and makes an `EntitySet`. We won't save these entity sets to disk, but instead will use them in Dask. Therefore, if we want to make any changes to the `EntitySet`, such as adding in interesting values or seed features, we can alter this function and remake the `EntitySet` without having to rewrite all the entity sets on disk. Writing the entity sets to disk would be another option if we are sure that they won't ever change. For greater flexibility, we write the data partitions to disk (as done above).

In [11]:
def entityset_from_partition(path):
    """Create an EntitySet from a partition of data specified as a path."""
    
    # Read in data
    app = pd.read_csv('%s/app.csv' % path)
    bureau = pd.read_csv('%s/bureau.csv' % path)
    bureau_balance = pd.read_csv('%s/bureau_balance.csv' % path)
    previous = pd.read_csv('%s/previous.csv' % path)
    credit = pd.read_csv('%s/credit.csv' % path)
    installments = pd.read_csv('%s/installments.csv' % path)
    cash = pd.read_csv('%s/cash.csv' % path)
    
    # Empty entityset
    es = ft.EntitySet(id = 'clients')
    
    # Entities with a unique index
    es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR')

    es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')

    es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV')

    # Entities that do not have a unique index
    es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                                  make_index = True, index = 'bureaubalance_index')

    es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                                  make_index = True, index = 'cash_index')

    es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                                  make_index = True, index = 'installments_index')

    es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                                  make_index = True, index = 'credit_index')
    
    # Relationship between app_train and bureau
    r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

    # Relationship between bureau and bureau balance
    r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

    # Relationship between current app and previous apps
    r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

    # Relationships between previous apps and cash, installments, and credit
    r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
    r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
    r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
    
    # Add in the defined relationships
    es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                               r_previous_cash, r_previous_installments, r_previous_credit])

    return es

Let's test the function to make sure it can make an `EntitySet` from a data partition.

In [12]:
es1 = entityset_from_partition('../input/partitions/p1')
es1

Entityset: clients
  Entities:
    app [Rows: 3458, Columns: 122]
    bureau [Rows: 16097, Columns: 17]
    previous [Rows: 16204, Columns: 37]
    bureau_balance [Rows: 166374, Columns: 4]
    cash [Rows: 96632, Columns: 8]
    installments [Rows: 129130, Columns: 8]
    credit [Rows: 35694, Columns: 23]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV

The function works as intended. The next step is to write a function that can take a single `EntitySet` and the `features` we want to build, and make a feature matrix. This is simple using `ft.calculate_feature_matrix`. 

## Function to Create Feature Matrix from EntitySet 

With the entity set and the feature names, generating the feature matrix is a one-liner in Featuretools. Since we are going to use Dask for parallelizing the operation, we'll set the number of jobs to 1. The `chunk_size` is an extremely important parameter, and I'd suggest experimenting with this to find the optimal value. Since we aren't getting any updates, it might make sense to set the chunk size as large as possible. We can try setting it to all of the instances in a given partition at once using the length of the dataset. This is probably the fastest way to make the feature matrix provided each one can fit entirely in memory. 

In [13]:
def feature_matrix_from_entityset(es, feature_names):
    """Calculate a feature matrix from an entityset and previously defined features."""

    feature_matrix = ft.calculate_feature_matrix(feature_names, 
                                                 entityset=es, 
                                                 n_jobs = 1, 
                                                 verbose = 0,
                                                 chunk_size = es['app'].df.shape[0])
    
    return feature_matrix

Below we test the function using the entityset from the first partition.

In [14]:
fm1 = feature_matrix_from_entityset(es1, featurenames)
fm1.shape

(3458, 1820)

We now have both parts needed to go from a data partition on disk to a feature matrix made using 1/104 of the data. Since we're going to be making 8 feature matrices at once, each one utilizing one core on our system, and doing this 13 times, we might expect the whole process to take around 13 times as long as a single partition takes.

The next step is to get Dask to run this in parallel using all 8 cores on our machine (or however many cores you have available).

# Dask

We will use the Dask utility `delayed` to parallelize the operation. First, we'll import delayed and set up a `Client` using processes, which will create one worker for each core on the machine. The memory limit of each worker will be the total system memory (16 gb) divided by the number of cores (8). 

We iterate through each path in a list of the partitions and tell Dask to first create the entity set from the partition, then create the feature matrix from the entity set, and append the feature matrix to a list of feature matrices. Once we have the complete list of feature matrices, we can `concat` them all together to get one single feature matrix. 
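
The pattern just described can be sketched with toy functions. Here `load_partition` and `build_features` are hypothetical stand-ins for `entityset_from_partition` and `feature_matrix_from_entityset`, and the sketch uses Dask's threaded scheduler rather than a `Client` so that it runs without starting a cluster.

```python
import pandas as pd
from dask import delayed

def load_partition(i):
    """Hypothetical stand-in for entityset_from_partition."""
    return pd.DataFrame({"client_id": [2 * i, 2 * i + 1], "value": [i, i + 1]})

def build_features(df):
    """Hypothetical stand-in for feature_matrix_from_entityset."""
    out = df.set_index("client_id")
    out["value_squared"] = out["value"] ** 2
    return out

# Nothing runs yet: each call only records a task in the graph
lazy_parts = [delayed(build_features)(delayed(load_partition)(i)) for i in range(4)]

# One final task that depends on every partition's result
lazy_total = delayed(pd.concat)(lazy_parts, axis=0)

# compute() walks the graph; independent partitions can run in parallel
result = lazy_total.compute(scheduler="threads")
print(result.shape)
```

Each partition's two tasks form an independent chain, which is what lets the scheduler hand different partitions to different workers.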

Unfortunately, we can't build all the feature matrices in one operation because the final feature matrix is too large to fit in the memory of a single worker. Therefore, we'll make 13 feature matrices (104 / 8), each from 8 partitions at once. Each of these is small enough to fit in a worker's memory and can be saved to disk. After making the 13 subset feature matrices and saving them all to disk, we can start a new `Client` using threads. This will create a single worker utilizing all of the system memory. We can use this single worker to concatenate the individual feature matrices into a single matrix without exhausting the worker memory. Finally, this feature matrix is saved to disk for modeling.

The end result will be a single feature matrix on disk that we can load in and use in any standard machine learning pipeline. We also have the option to use the subset feature matrices in a machine learning pipeline using a method such as [Scikit-Learn's partial_fit](http://scikit-learn.org/stable/modules/scaling_strategies.html) if the classifier in question supports the method (Random Forests do not allow for incremental learning). 
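
As a rough illustration of the incremental option, the sketch below trains a scikit-learn `SGDClassifier` one chunk at a time with `partial_fit`. The data is synthetic and the choice of classifier is an assumption made for the example; in practice each chunk would be one of the subset feature matrices read from disk.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)

# partial_fit needs the full set of labels up front,
# because any single chunk might not contain every class
classes = np.array([0, 1])

# Pretend each loop iteration loads one of the 13 subset feature matrices
for _ in range(13):
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)   # synthetic target for illustration
    model.partial_fit(X, y, classes=classes)

# Evaluate on fresh synthetic data
X_test = rng.normal(size=(500, 5))
y_test = (X_test[:, 0] > 0).astype(int)
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.2f}")
```

Only one chunk is in memory at a time, so this approach never needs the full feature matrix.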

In [15]:
from dask import delayed
from dask.distributed import Client

# Use all 8 cores
client = Client(processes = True)

In [16]:
client.ncores()

{'tcp://127.0.0.1:49329': 1,
 'tcp://127.0.0.1:49332': 1,
 'tcp://127.0.0.1:49334': 1,
 'tcp://127.0.0.1:49337': 1,
 'tcp://127.0.0.1:49338': 1,
 'tcp://127.0.0.1:49339': 1,
 'tcp://127.0.0.1:49340': 1,
 'tcp://127.0.0.1:49345': 1}

## Visualizations of Dask

After starting a `Client`, if you have `Bokeh` installed, you can navigate to http://localhost:8787/ to view the status of the workers. Doing this on my machine (8 cores with 16 gb total RAM) gives me:

![](../images/process_workers.png)

Right now we aren't taxing our system very much! 

As the tasks complete, you will be able to see more information about the status of Dask and your machine. 

Next, let's create a list of paths of our partitions.

In [17]:
paths = ['../input/partitions/%s' % file for file in os.listdir('../input/partitions/') if '.' not in file]
paths[:8]

['../input/partitions/p100',
 '../input/partitions/p4',
 '../input/partitions/p3',
 '../input/partitions/p101',
 '../input/partitions/p2',
 '../input/partitions/p5',
 '../input/partitions/p19',
 '../input/partitions/p26']
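
Note that `os.listdir` returns entries in arbitrary order, as the output above shows. The order does not affect the final result, but if a reproducible order is wanted (for example, to know which partitions ended up in which batch), the paths could be sorted by partition number. The `partition_number` helper below is a hypothetical addition.

```python
import re

def partition_number(path):
    """Return the integer after the final 'p' in a path such as '.../p101'."""
    match = re.search(r"p(\d+)$", path)
    if match is None:
        raise ValueError(f"Unexpected partition path: {path}")
    return int(match.group(1))

# A few of the paths from the listing above
listed = ['../input/partitions/p100', '../input/partitions/p4',
          '../input/partitions/p3', '../input/partitions/p101',
          '../input/partitions/p2']

# Sort numerically rather than alphabetically (so p4 comes before p100)
ordered = sorted(listed, key=partition_number)
print(ordered)
```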

We'll run through 8 partitions at once (because we have 8 cores), for each partition making the Entity Set and the Feature Matrix using one worker. This lets us take full advantage of all the cores on our machine. At any one time, there will be 8 feature matrices under construction. We made the partitions small enough that none of the feature matrices will be too large for the worker. 

The 8 feature matrices will then be concatenated into a single feature matrix that is saved to disk. We repeat this process 13 times (104 / 8) until we have accounted for all of the partitions. Then, we can restart a `Client` using threads and create a single feature matrix (which can fit into all of our system memory). 

In [27]:
list(range(9, len(paths) + 5, 8))

[9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105]
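
One detail worth knowing: because slicing starts at index 0, using these end indices gives a first batch of 9 paths and a final batch of 7, not 8 each. All 104 partitions are still covered. If evenly sized batches are preferred, a small generator is one alternative. The `batches` helper and the generated path list below are illustrative assumptions, not part of the original notebook.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 104 hypothetical partition paths
paths = [f"../input/partitions/p{i}" for i in range(1, 105)]

# Batch sizes implied by the end indices shown above
ends = list(range(9, len(paths) + 5, 8))
starts = [0] + ends[:-1]
notebook_sizes = [len(paths[s:e]) for s, e in zip(starts, ends)]

# Batch sizes from the generator
even_sizes = [len(b) for b in batches(paths, 8)]

print(notebook_sizes)
print(even_sizes)
```

Both schemes produce 13 batches that together cover every path; they differ only in how the paths are split among the batches.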

In [18]:
start_index = 0
overall_start = timer()

# Iterate through 8 paths at a time
for i, end_index in enumerate(range(9, len(paths) + 5, 8)):
    
    # Subset to the next batch of paths (9 in the first batch, 8 after that, the remainder in the last)
    if end_index > len(paths):
        subset_paths = paths[start_index:]
    else:
        subset_paths = paths[start_index: end_index]
    
    # Empty list of feature matrices
    fms = []

    # Iterate through the paths
    for path in subset_paths:

        # Make the entityset
        es = delayed(entityset_from_partition)(path)

        # Make the feature matrix and add to the list
        fm = delayed(feature_matrix_from_entityset)(es, feature_names = featurenames)
        fms.append(fm)

    # Final operation will be to concatenate together all of the feature matrices
    X = delayed(pd.concat)(fms, axis = 0)
    
    print(f"Starting feature matrix {i}")
    start = timer()
    feature_matrix = X.compute()
    end = timer()
    
    print(f"Feature Matrix {i} complete, Time Elapsed: {round(end - start, 2)} seconds.")
    
    # Save the feature matrix to disk
    feature_matrix.to_csv('../input/fm/%s.csv' % i, index = True)
    
    # Start index becomes previous ending index
    start_index = end_index

Starting feature matrix 0
Feature Matrix 0 complete, Time Elapsed: 572.87 seconds.
Starting feature matrix 1
Feature Matrix 1 complete, Time Elapsed: 599.75 seconds.
Starting feature matrix 2
Feature Matrix 2 complete, Time Elapsed: 575.26 seconds.
Starting feature matrix 3
Feature Matrix 3 complete, Time Elapsed: 550.69 seconds.
Starting feature matrix 4
Feature Matrix 4 complete, Time Elapsed: 564.22 seconds.
Starting feature matrix 5
Feature Matrix 5 complete, Time Elapsed: 544.63 seconds.
Starting feature matrix 6
Feature Matrix 6 complete, Time Elapsed: 488.82 seconds.
Starting feature matrix 7
Feature Matrix 7 complete, Time Elapsed: 537.68 seconds.
Starting feature matrix 8
Feature Matrix 8 complete, Time Elapsed: 518.94 seconds.
Starting feature matrix 9
Feature Matrix 9 complete, Time Elapsed: 532.65 seconds.
Starting feature matrix 10
Feature Matrix 10 complete, Time Elapsed: 594.18 seconds.
Starting feature matrix 11
Feature Matrix 11 complete, Time Elapsed: 681.11 seconds.


If you have Bokeh installed, you can see quite a bit of system information. For example, here is the Directed Acyclic Graph of the tasks:

![](../images/task_graph.png)


We can also make sure that we're using all system resources if we take a look at the status:

![](../images/status.png)

Now our workers are being used. You can take some time to look over the profile to see what operations took the most time.

![](../images/profile.png)

Now we'll create a list of feature matrix paths. There should be a total of 13 stored feature matrices on disk.

In [19]:
# Base directory for feature matrices
base = '../input/fm/'
fm_paths = [base + p for p in os.listdir(base) if '.csv' in p]

For joining together all of these matrices, we can start a new `Client` and use `delayed` to read in all the matrices and append them to a list. These can then be concatenated together into a final feature matrix. 

In [20]:
# Start a new client with threads (a single worker)
client = Client(processes = False)

client.ncores()

{'inproc://192.168.1.113/58123/2': 8}

By specifying `processes=False`, we tell Dask to use threads. This has the result of creating one worker with all of the cores on our machine. Therefore, we don't have to worry about exceeding the memory of a worker on just one core.
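
The trade-off between the two setups comes down to how the total memory is divided. The sketch below makes that arithmetic explicit with a hypothetical helper, `local_client_kwargs`. To my knowledge, `n_workers`, `threads_per_worker`, `processes`, and `memory_limit` are keyword arguments that `Client` passes through when it starts a local cluster, but the notebook itself relies on the defaults, so treat the explicit version as an assumption.

```python
def local_client_kwargs(total_memory_gb, n_cores, use_processes):
    """Build keyword arguments for dask.distributed.Client (hypothetical helper)."""
    if use_processes:
        # One single-threaded worker process per core; memory is split between them
        n_workers, threads = n_cores, 1
    else:
        # One worker with a thread per core; it can use all of the memory
        n_workers, threads = 1, n_cores
    memory_per_worker = total_memory_gb / n_workers
    return {
        "processes": use_processes,
        "n_workers": n_workers,
        "threads_per_worker": threads,
        "memory_limit": f"{memory_per_worker:g}GB",   # limit applies per worker
    }

print(local_client_kwargs(16, 8, use_processes=True))
print(local_client_kwargs(16, 8, use_processes=False))

# Usage (not run here): client = Client(**local_client_kwargs(16, 8, True))
```

With 16 GB and 8 cores this gives 2 GB per worker for processes and 16 GB for the single threaded worker, matching the numbers in the roadmap.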

In [21]:
# Empty list for feature matrices
fms = []

# Iterate through the feature matrices
for path in fm_paths:
    # Read in each dataframe and append to list
    X = delayed(pd.read_csv)(path, index_col = 0)
    fms.append(X)

# Concatenate all the matrices together (append rows)
fm_out = delayed(pd.concat)(fms, axis = 0)

# Time how long the operation takes
start = timer()
feature_matrix = fm_out.compute()
end = timer()

print(f'Time elapsed: {round(end - start, 2)} seconds.')
overall_end = timer()

Time elapsed: 126.1 seconds.


In [22]:
feature_matrix.shape

(352797, 1820)

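The final feature matrix has one row per client and one column for each of the 1820 features. Note, however, that it has 352,797 rows while `app` has 356,255 clients. The difference of 3,458 rows is exactly the size of one full partition, which suggests that one partition did not make it into this run's output and should be checked before modeling.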
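
A simple way to verify the row count is to compare the client ids in the feature matrix against those in `app`. The `missing_ids` helper below is a hypothetical sketch shown on toy ids; in the notebook it could be called as `missing_ids(app.index, feature_matrix.index)`, since both are indexed by `SK_ID_CURR`.

```python
def missing_ids(expected_ids, observed_ids):
    """Return the sorted ids present in `expected_ids` but absent from `observed_ids`."""
    return sorted(set(expected_ids) - set(observed_ids))

# Toy example: two of ten ids never made it into the output
expected = range(100, 110)
observed = [100, 101, 102, 104, 105, 106, 107, 109]
print(missing_ids(expected, observed))

# Row counts reported in this notebook: `app` versus the final feature matrix
row_gap = 356_255 - 352_797
print(row_gap)   # 3458, the size of one full partition
```

An empty list from `missing_ids` would confirm that every client is present.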

In [33]:
print(f'Total Time for Feature Matrix Calculation: {round(overall_end - overall_start, 2)} seconds.')

Total Time for Feature Matrix Calculation: 8135.78 seconds.
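
To put that number in context, the short calculation below converts it to hours and compares it with the roughly 25 hours quoted at the start of the notebook. The comparison is only approximate: the baseline ran on a different machine, and the one-time partitioning step (about 1300 seconds) is not included in the total.

```python
total_seconds = 8135.78     # total time reported above
baseline_hours = 25         # single run quoted in the introduction

hours = total_seconds / 3600
speedup = baseline_hours / hours

print(f"{hours:.2f} hours, about {speedup:.1f}x faster")   # 2.26 hours, about 11.1x faster
```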


# Conclusions

Working with constraints, such as limited computing power, leads to innovation. In this notebook, we had to engineer a solution to calculating the feature matrix in Dask to complete the task in a reasonable amount of time on a personal machine. Our approach was as follows:

1. Partition the data into sets based on the clients
2. Write a function to generate an `EntitySet` from a partition
3. Write a function to create a `feature_matrix` from an `EntitySet`
4. Set up Dask to use all 8 cores to make a feature matrix from 8 partitions at once
5. Add the individual feature matrices together to create one feature matrix with all clients
6. Save the resulting feature matrix for a machine learning pipeline

Parallel processing allows us to take full advantage of our system's resources. Thanks to libraries such as Dask, we can run operations in parallel and reduce computation time by 10x or more. The same framework developed in this notebook can be applied to other data science and machine learning problems.