![](../../images/featuretools.png)

# Featuretools Implementation with Dask

A calculation of Deep Feature Synthesis from the Automated Loan Repayment notebook running on a single core takes about 25 hours on an AWS EC2 machine! Clearly we need a better approach for practical implementations of calculating a large feature matrix, one that allows us to use all cores of whatever machine we are using. 

Featuretools does have support for parallel processing if you have multiple cores (which nearly every single laptop now does), but it currently sends the entire EntitySet to each process which means you might exhaust the memory on any one core. For example, that AWS machine has 8 GB per core, which might seem like a lot until you realize the EntitySet takes up about 11 GB and setting `n_jobs=-1` will cause an out of memory error. Therefore, we cannot use the parallel processing in Featuretools and instead have to build our own implementation with Dask.  

Fortunately, options such as [Dask](https://dask.pydata.org/en/latest/) make it easy to take advantage of multiple cores on our own machine. In this notebook, we'll see how to run Deep Feature Synthesis in about 3 hours on a personal laptop with 8 cores and 16 GB of RAM. 

<p align = "center">
    <img src = "../../images/dask_logo.png" width = "400">
</p>


## Roadmap

The following is our plan of action for implementing Dask with Featuretools:

1. Convert `object` data types to `category`
    * This reduces memory consumption significantly
2. Create 104 partitions of data and save to disk
    * Each partition will contain data from all seven tables for roughly 1/104 of the client ids, `SK_ID_CURR`
    * Each partition can be used to make an EntitySet and then a feature matrix
3. Write a function to take a partition and create an `EntitySet`
4. Write a function to take an `EntitySet` and calculate a `feature_matrix` that is saved to disk
5. Use Dask to parallelize 3. and 4. to create 104 feature matrices saved on disk
6. (Optionally) read in the individual feature matrices and combine into a single feature matrix
    
The general idea is to __take advantage of all our system resources by breaking one large problem into many smaller ones.__ Each of these smaller problems can be completed on one processor which means we can run multiple (8) of these problems at a time. 

At the end, we'll have a working implementation of Dask that lets us take full advantage of our computing resources. While a naive approach to this problem would just be renting a larger machine, that won't solve our problem because the bottleneck is not RAM, but using multiple cores at a time. __Sometimes having too many resources can limit your creativity, and working with constraints forces us to be innovative.__ 

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft
import featuretools.variable_types as vtypes

# Utilities
import sys
import psutil
import os

from timeit import default_timer as timer

## Convert Data Types

The first step is to convert all the data types we can. Using `category` instead of `object` can significantly reduce memory usage if the number of unique categories is much less than the number of observations. For more on the `category` type in Pandas, look at [the documentation.](https://pandas.pydata.org/pandas-docs/stable/categorical.html)

While this isn't specific to Dask, it's a good practice in general. The function below can be modified for different problems as required.

In [2]:
def convert_types(df):
    """Convert pandas data types for memory reduction."""
    
    # Iterate through each column
    for c in df:
        
        # Convert ids to 32-bit integers (missing ids are filled with 0)
        if ('SK_ID' in c):
            df[c] = df[c].fillna(0).astype(np.int32)
            
        # Convert objects to category
        elif (df[c].dtype == 'object') and (df[c].nunique() < df.shape[0]):
            df[c] = df[c].astype('category')
        
        # Columns containing only 0 and 1 are converted to booleans
        # (set comparison so the order of the unique values does not matter)
        elif set(df[c].unique()) == {0, 1}:
            df[c] = df[c].astype(bool)
        
        # Float64 to float32
        elif df[c].dtype == float:
            df[c] = df[c].astype(np.float32)
            
        # Int64 to int32
        elif df[c].dtype == int:
            df[c] = df[c].astype(np.int32)
        
    return df
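
One detail worth checking when adapting the Boolean test to other data: `Series.unique()` returns values in order of first appearance, so a comparison against a list such as `[1, 0]` depends on which value happens to come first in the column. The sketch below uses two made-up columns (`flag_a` and `flag_b` are hypothetical names, not columns from this dataset) to show the effect and an order-independent alternative based on a set.

```python
import pandas as pd

# Two toy columns holding the same 0/1 values, starting with different values
df = pd.DataFrame({'flag_a': [1, 0, 1, 0], 'flag_b': [0, 1, 0, 1]})

# unique() preserves the order of first appearance
order_a = list(df['flag_a'].unique())  # 1 appears first
order_b = list(df['flag_b'].unique())  # 0 appears first

# A set comparison ignores that order
is_binary = {c: set(df[c].unique()) == {0, 1} for c in df}

# Binary columns can then be stored as booleans
for c, binary in is_binary.items():
    if binary:
        df[c] = df[c].astype(bool)

print(order_a, order_b, is_binary)
print(df.dtypes)
```

Note that a column containing missing values would not match the set `{0, 1}` and would be left unchanged by this check.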

Now we'll read in the datasets and apply the convert types function. 

In [3]:
# Read in the datasets and replace the anomalous values
app_train = pd.read_csv('../input/application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('../input/application_test.csv').replace({365243: np.nan})
bureau = pd.read_csv('../input/bureau.csv').replace({365243: np.nan})
bureau_balance = pd.read_csv('../input/bureau_balance.csv').replace({365243: np.nan})
cash = pd.read_csv('../input/POS_CASH_balance.csv').replace({365243: np.nan})
credit = pd.read_csv('../input/credit_card_balance.csv').replace({365243: np.nan})
previous = pd.read_csv('../input/previous_application.csv').replace({365243: np.nan})
installments = pd.read_csv('../input/installments_payments.csv').replace({365243: np.nan})

app_test['TARGET'] = np.nan

# Join together training and testing
app = pd.concat([app_train, app_test], ignore_index = True, sort = True)
number_clients = app.shape[0]

# Need `SK_ID_CURR` in every dataset
bureau_balance = bureau_balance.merge(bureau[['SK_ID_CURR', 'SK_ID_BUREAU']], 
                                      on = 'SK_ID_BUREAU', how = 'left')

print(f"""Total memory before converting types: \
{round(np.sum([x.memory_usage().sum() / 1e9 for x in 
[app, bureau, bureau_balance, cash, credit, previous, installments]]), 2)} gb.""")

# Convert types to reduce memory usage
app = convert_types(app)
bureau = convert_types(bureau)
bureau_balance = convert_types(bureau_balance)
cash = convert_types(cash)
credit = convert_types(credit)
previous = convert_types(previous)
installments = convert_types(installments)

print(f"""Total memory after converting types: \
{round(np.sum([x.memory_usage().sum() / 1e9 for x in 
[app, bureau, bureau_balance, cash, credit, previous, installments]]), 2)} gb.""")

# Set the index for locating
for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
    dataset.set_index('SK_ID_CURR', inplace = True)

Total memory before converting types: 4.38 gb.
Total memory after converting types: 2.06 gb.


In [4]:
print('Object memory usage.')
print(bureau['CREDIT_TYPE'].astype('object').memory_usage() / 1e9, 'gb')

print('Category memory usage.')
print(bureau['CREDIT_TYPE'].astype('category').memory_usage() / 1e9, 'gb')

print('Length of data: ', bureau.shape[0])
print('Number of unique categories: ', bureau['CREDIT_TYPE'].nunique())

Object memory usage.
0.027462848 gb
Category memory usage.
0.015448612 gb
Length of data:  1716428
Number of unique categories:  15


We can see the significant difference in memory usage depending on the data type. Since we are looking to get the most from our machine, any step that can reduce computational overhead is beneficial.

# Partitioning Data

Next, we partition the data into 104 separate datasets based on the client id, `SK_ID_CURR`, and save the partitions to disk. Every partition will contain the data associated with a subset of the clients and therefore will have 7 smaller csv files. 

* Each partition by itself contains all the data needed to make an `EntitySet` for the clients 
* This `EntitySet` can then be used to create a feature matrix 
* Partitioning and saving the raw data allows for more flexibility when we create the entity set and feature matrix

104 partitions is sort of an arbitrary number and it might be worth exploring other options to see which works best.
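
To see how the chunk size determines the number of partitions, here is a small arithmetic sketch. It assumes the 356,255 clients in the combined application data and mirrors the slicing used to build the id list; the variable names `starts` and `sizes` are introduced only for this illustration.

```python
# Number of clients in the combined train and test application data
n_clients = 356255

# Integer division by 103 leaves a remainder, which becomes one extra chunk
chunk_size = n_clients // 103

# Start position of each chunk, mirroring range(0, n, chunk_size) slicing
starts = list(range(0, n_clients, chunk_size))
sizes = [min(chunk_size, n_clients - s) for s in starts]

print('Chunk size:', chunk_size)
print('Number of chunks:', len(sizes))
print('Size of last chunk:', sizes[-1])
```

Changing the divisor is the simplest way to experiment with a different number of partitions.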

In [5]:
def create_partition(user_list, partition):
    """Creates and saves a dataset with only the users in `user_list`."""
    
    # Make the directory
    directory = '../input/partitions/p%d' % (partition + 1)
    if os.path.exists(directory):
        return
    
    else:
        os.makedirs(directory)
        
        # Subset based on user list
        app_subset = app[app.index.isin(user_list)].copy().reset_index()
        bureau_subset = bureau[bureau.index.isin(user_list)].copy().reset_index()

        # Drop SK_ID_CURR from bureau_balance, cash, credit, and installments
        bureau_balance_subset = bureau_balance[bureau_balance.index.isin(user_list)].copy().reset_index(drop = True)
        cash_subset = cash[cash.index.isin(user_list)].copy().reset_index(drop = True)
        credit_subset = credit[credit.index.isin(user_list)].copy().reset_index(drop = True)
        previous_subset = previous[previous.index.isin(user_list)].copy().reset_index()
        installments_subset = installments[installments.index.isin(user_list)].copy().reset_index(drop = True)
        

        # Save data to the directory
        app_subset.to_csv('%s/app.csv' % directory, index = False)
        bureau_subset.to_csv('%s/bureau.csv' % directory, index = False)
        bureau_balance_subset.to_csv('%s/bureau_balance.csv' % directory, index = False)
        cash_subset.to_csv('%s/cash.csv' % directory, index = False)
        credit_subset.to_csv('%s/credit.csv' % directory, index = False)
        previous_subset.to_csv('%s/previous.csv' % directory, index = False)
        installments_subset.to_csv('%s/installments.csv' % directory, index = False)

        if partition % 10 == 0:
            print('Saved all files in partition {} to {}.'.format(partition + 1, directory))

In [6]:
# Break into 104 chunks: dividing by 103 gives 103 full chunks plus 1 smaller remainder chunk
chunk_size = app.shape[0] // 103

# Construct an id list
id_list = [list(app.iloc[i:i+chunk_size].index) for i in range(0, app.shape[0], chunk_size)]

In [7]:
from itertools import chain

# Sanity check that we have not missed any ids
print('Number of ids in id_list:         {}.'.format(len(list(chain(*id_list)))))
print('Total length of application data: {}.'.format(len(app)))

Number of ids in id_list:         356255.
Total length of application data: 356255.


In [8]:
start = timer()
for i, ids in enumerate(id_list):
    # Create a partition based on the ids
    create_partition(ids, i)
    
end = timer()
print(f'Partitioning took {round(end - start)} seconds.')

Partitioning took 0 seconds.


__I already had the partitions made, but running the above cell took 1300 seconds (about 22 minutes) the first time.__

We can independently generate the feature matrix for each partition because the partition contains all the data for that group of clients. Moreover, each subset of data is small enough for the feature matrix calculation to fit entirely on one core.

#### Load in Feature names

We already calculated the feature names, so we can read them in instead of recalculating the features on each partition. With the feature names available, we can skip `ft.dfs` and instead call `ft.calculate_feature_matrix`, passing in the `EntitySet` and the feature names.

In [9]:
featurenames = ft.load_features('../input/features.txt')
print(len(featurenames))

1820


__For each feature matrix, we'll make 1820 features drawing from all 7 tables.__

#### Variable Types

In the Automated notebook, we specified the variable types when adding entities to the entityset. However, since we have now properly defined the data types for each column, Featuretools will infer the correct variable type. For example, whereas before we had Booleans mapped to integers, which would be interpreted as numeric, the Booleans are now represented as Booleans and hence will be correctly inferred by Featuretools.

In [10]:
# app_types = {'FLAG_CONT_MOBILE': vtypes.Boolean, 'FLAG_DOCUMENT_10': vtypes.Boolean, 'FLAG_DOCUMENT_11': vtypes.Boolean, 'FLAG_DOCUMENT_12': vtypes.Boolean, 'FLAG_DOCUMENT_13': vtypes.Boolean, 'FLAG_DOCUMENT_14': vtypes.Boolean, 'FLAG_DOCUMENT_15': vtypes.Boolean, 'FLAG_DOCUMENT_16': vtypes.Boolean, 'FLAG_DOCUMENT_17': vtypes.Boolean, 'FLAG_DOCUMENT_18': vtypes.Boolean, 'FLAG_DOCUMENT_19': vtypes.Boolean, 'FLAG_DOCUMENT_2': vtypes.Boolean, 'FLAG_DOCUMENT_20': vtypes.Boolean, 'FLAG_DOCUMENT_21': vtypes.Boolean, 'FLAG_DOCUMENT_3': vtypes.Boolean, 'FLAG_DOCUMENT_4': vtypes.Boolean, 'FLAG_DOCUMENT_5': vtypes.Boolean, 'FLAG_DOCUMENT_6': vtypes.Boolean, 'FLAG_DOCUMENT_7': vtypes.Boolean, 'FLAG_DOCUMENT_8': vtypes.Boolean, 'FLAG_DOCUMENT_9': vtypes.Boolean, 'FLAG_EMAIL': vtypes.Boolean, 'FLAG_EMP_PHONE': vtypes.Boolean, 'FLAG_MOBIL': vtypes.Boolean, 'FLAG_PHONE': vtypes.Boolean, 'FLAG_WORK_PHONE': vtypes.Boolean, 'LIVE_CITY_NOT_WORK_CITY': vtypes.Boolean, 'LIVE_REGION_NOT_WORK_REGION': vtypes.Boolean, 'REG_CITY_NOT_LIVE_CITY': vtypes.Boolean, 'REG_CITY_NOT_WORK_CITY': vtypes.Boolean, 'REG_REGION_NOT_LIVE_REGION': vtypes.Boolean, 'REG_REGION_NOT_WORK_REGION': vtypes.Boolean, 'REGION_RATING_CLIENT': vtypes.Ordinal, 'REGION_RATING_CLIENT_W_CITY': vtypes.Ordinal, 'HOUR_APPR_PROCESS_START': vtypes.Ordinal}
# previous_types = {'NFLAG_LAST_APPL_IN_DAY': vtypes.Boolean, 
#              'NFLAG_INSURED_ON_APPROVAL': vtypes.Boolean}

# Function to Create EntitySet from Partition 

The next function takes a single partition of data and makes an `EntitySet`. We won't save these entitysets to disk, but instead will keep them in memory while calculating the feature matrices. Therefore, if we want to make any changes to the `EntitySet`, such as adding in interesting values or seed features, we can alter this function and remake the `EntitySet` without having to rewrite all the entity sets on disk. Writing the entity sets to disk would be another option if we are sure that they won't ever change. For greater flexibility, we write the data partitions to disk (as done above). 

In [11]:
def entityset_from_partition(path):
    """Create an EntitySet from a partition of data specified as a path.
       Returns a dictionary with the entityset and the number used for saving the feature matrix."""
    
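    # Partition number is the integer after the 'p' in the directory name (e.g. 'p12' -> 12)
    partition_num = int(os.path.basename(os.path.normpath(path))[1:])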
    
    # Read in data
    app = pd.read_csv('%s/app.csv' % path)
    bureau = pd.read_csv('%s/bureau.csv' % path)
    bureau_balance = pd.read_csv('%s/bureau_balance.csv' % path)
    previous = pd.read_csv('%s/previous.csv' % path)
    credit = pd.read_csv('%s/credit.csv' % path)
    installments = pd.read_csv('%s/installments.csv' % path)
    cash = pd.read_csv('%s/cash.csv' % path)
    
    # Empty entityset
    es = ft.EntitySet(id = 'clients')
    
    # Entities with a unique index
    es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR')

    es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')

    es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV')

    # Entities that do not have a unique index
    es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                                  make_index = True, index = 'bureaubalance_index')

    es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                                  make_index = True, index = 'cash_index')

    es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                                  make_index = True, index = 'installments_index')

    es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                                  make_index = True, index = 'credit_index')
    
    # Relationship between app_train and bureau
    r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

    # Relationship between bureau and bureau balance
    r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

    # Relationship between current app and previous apps
    r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

    # Relationships between previous apps and cash, installments, and credit
    r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
    r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
    r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
    
    # Add in the defined relationships
    es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                               r_previous_cash, r_previous_installments, r_previous_credit])

    return ({'es': es, 'num': partition_num})

Let's test the function to make sure it can make an `EntitySet` from a data partition.

In [12]:
es1_dict = entityset_from_partition('../input/partitions/p1')
es1_dict['es']

Entityset: clients
  Entities:
    app [Rows: 3458, Columns: 122]
    bureau [Rows: 16097, Columns: 17]
    previous [Rows: 16204, Columns: 37]
    bureau_balance [Rows: 166374, Columns: 4]
    cash [Rows: 96632, Columns: 8]
    installments [Rows: 129130, Columns: 8]
    credit [Rows: 35694, Columns: 23]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV

The function works as intended. The next step is to write a function that can take a single `EntitySet` and the `features` we want to build, and make a feature matrix. (`entityset_from_partition` returns a dictionary with the partition number so we can save the feature matrix based on this number. It's a minor clerical detail).
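
Since the partition number is only used to name the output file, it can be recovered from the directory name itself rather than from a fixed character offset in the path. The helper below is a hypothetical sketch (`partition_number` is not part of the notebook) that assumes partition directories are named `p` followed by digits, as in `../input/partitions/p1`.

```python
import os

def partition_number(path):
    """Return the integer after the leading 'p' in the last path component."""
    # normpath strips any trailing slash so basename returns the directory name
    name = os.path.basename(os.path.normpath(path))
    if not name.startswith('p') or not name[1:].isdigit():
        raise ValueError('Unexpected partition directory name: %r' % name)
    return int(name[1:])

print(partition_number('../input/partitions/p1'))
print(partition_number('../input/partitions/p104/'))
```

A helper like this keeps working if the base directory is moved or renamed, which a hard-coded offset would not.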

# Function to Create Feature Matrix from EntitySet 

With the entity set and the feature names, generating the feature matrix is a one-liner in Featuretools. Since we are going to use Dask for parallelizing the operation, we'll set the number of jobs to 1. The `chunk_size` is an extremely important parameter, and I'd suggest experimenting with this to find the optimal value. What I've found works best is setting the `chunk_size` to the length of the entire dataset. This might be the fastest way to make the feature matrix provided each one can fit entirely in memory. 

The last step in the function is to save the feature matrix to disk using the name of the partition of data. Using the `featurenames` ensures that we create the exact same set of features for each partition.

In [13]:
def feature_matrix_from_entityset(es_dict, feature_names, return_fm = False):
    """Run deep feature synthesis from an entityset and feature names. Saves feature matrix based on partition.""" 
    
    # Extract the entityset
    es = es_dict['es']
    
    # Calculate the feature matrix and save
    feature_matrix = ft.calculate_feature_matrix(feature_names, 
                                                 entityset=es, 
                                                 n_jobs = 1, 
                                                 verbose = 0,
                                                 chunk_size = es['app'].df.shape[0])
    
    feature_matrix.to_csv('../input/fm/p%d.csv' % es_dict['num'], index = True)
    
    if return_fm:
        return feature_matrix

Below we test the function using the entityset from the first partition.

In [14]:
start = timer()
fm1 = feature_matrix_from_entityset(es1_dict, featurenames, return_fm = True)
end = timer()
fm1.shape

(3458, 1820)

In [15]:
print(f'Computing one feature matrix took {round(end - start, 2)} seconds.')

Computing one feature matrix took 225.58 seconds.


__We now have both parts needed to go from a data partition on disk to a feature matrix made using 1/104 of the data.__ All we have to do is repeat this operation 104 times and we will have all of our features. Since we have eight cores, we can make eight feature matrices at once (your number may differ). This gets around the fundamental bottleneck in the calculation: __running on a single core is inefficient__ especially when we have 8 available. 
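
As a rough back-of-the-envelope check, the timing of the single test partition can be scaled up. The sketch below assumes every partition takes the same time as the first one and that the 8 workers do not slow each other down. Both are idealized assumptions, so treat the result as a lower bound rather than a prediction; a real run will likely take longer because workers compete for memory and disk.

```python
import math

n_partitions = 104
n_workers = 8
seconds_per_partition = 225.58  # timing of the single test partition above

# With perfect scaling, partitions are processed in rounds of n_workers
rounds = math.ceil(n_partitions / n_workers)
ideal_seconds = rounds * seconds_per_partition

# The same work done one partition at a time
serial_seconds = n_partitions * seconds_per_partition

print(f'Rounds: {rounds}')
print(f'Idealized parallel time: {ideal_seconds / 60:.1f} minutes')
print(f'Serial time at the same rate: {serial_seconds / 3600:.1f} hours')
```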

To actually run this in parallel, we use the Dask library.

# Dask

We will use Dask to parallelize the calculation of feature matrices. First, we'll import and set up a `Client` using processes, which will create one worker for each core on the machine. The memory limit of each worker will be the total system memory (16 gb) divided by the number of cores (8). 

Then we'll use the `db.from_sequence` method to create a "Dask bag" from the partition paths. A [Dask bag](http://dask.pydata.org/en/latest/bag-overview.html) is a parallel collection of Python objects; operations applied to it, such as `map`, are evaluated lazily and can run in parallel across workers. We then `map` the paths to the `entityset_from_partition` function, which will create the `EntitySets`. These in turn are `map`ped to `feature_matrix_from_entityset` to make the `feature_matrix` for each of the 104 partitions. 
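
The pattern can be tried on a toy example before running it on real partitions. In the sketch below, `load` and `process` are hypothetical stand-ins for `entityset_from_partition` and `feature_matrix_from_entityset`, and the synchronous scheduler is used only so the example runs in a single process without a `Client`.

```python
import dask.bag as db

def load(path):
    # Stand-in for building an EntitySet: keep the path and its partition number
    return {'path': path, 'num': int(path.rsplit('p', 1)[1])}

def process(record, scale):
    # Stand-in for computing a feature matrix: extra arguments are passed by keyword
    return record['num'] * scale

paths = ['partitions/p%d' % i for i in range(1, 5)]

# Building the bag is lazy: nothing runs until compute() is called
bag = db.from_sequence(paths).map(load).map(process, scale=10)

# Run in the current process; a distributed Client would run the same graph on workers
results = bag.compute(scheduler='synchronous')
print(results)
```

Because each element flows through both functions independently of the others, the same graph can be spread across as many workers as are available.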

Each individual feature matrix is saved to disk. We have the option to use the subset feature matrices in a machine learning pipeline using a method such as [Scikit-Learn's partial_fit](http://scikit-learn.org/stable/modules/scaling_strategies.html) if the classifier in question supports the method (Random Forests do not allow for incremental learning). If we want to make the full feature matrix, we can read in the feature matrices and then `concat` all of them together. This operation cannot be parallelized because the final feature matrix is too large to fit on a single core, but it does not represent a significant bottleneck in the overall process. Once we have the final feature matrix, it can be used in any standard machine learning pipeline.
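
For the incremental-learning option, a minimal sketch of the `partial_fit` pattern is shown here. It uses synthetic data in place of the per-partition feature matrices, and `SGDClassifier` is chosen purely for illustration because it supports `partial_fit`; it is not the model used in this notebook.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
model = SGDClassifier(random_state=0)

# All class labels must be declared up front for incremental learning
classes = np.array([0, 1])

# Each loop iteration stands in for reading one partition's feature matrix
for _ in range(4):
    X = rng.normal(size=(50, 3))
    y = (X[:, 0] > 0).astype(int)
    model.partial_fit(X, y, classes=classes)

# Predict on a few new synthetic rows
preds = model.predict(rng.normal(size=(5, 3)))
print(preds)
```

With this approach only one partition needs to be in memory at a time, at the cost of being limited to estimators that implement `partial_fit`.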

Below we clear the system memory for a full run of Dask.

In [16]:
import gc

# Free up all system memory
gc.enable()
del app, bureau, bureau_balance, previous, credit, cash, installments
gc.collect()

322

The code below starts up 8 workers, each using one of our cores. The memory limit per worker will be the total system memory divided by the number of cores. We use `processes` instead of threads because we are doing computationally heavy work and we would not be able to run this in parallel using threads. The issue with Python and threads is that threads share memory - processes do not - and because of the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock) there are few operations that can run in parallel using threads. For more on the topic of processes, threads, and the global interpreter lock in Python, I recommend [this article](https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b).

In [17]:
import dask.bag as db
from dask.distributed import Client

# Use all 8 cores
client = Client(processes = True)

In [18]:
client.ncores()

{'tcp://127.0.0.1:61482': 1,
 'tcp://127.0.0.1:61484': 1,
 'tcp://127.0.0.1:61486': 1,
 'tcp://127.0.0.1:61487': 1,
 'tcp://127.0.0.1:61489': 1,
 'tcp://127.0.0.1:61490': 1,
 'tcp://127.0.0.1:61494': 1,
 'tcp://127.0.0.1:61500': 1}

## Visualizations of Dask

After starting a `Client`, if you have `Bokeh` installed, you can navigate to http://localhost:8787/ to view the status of the workers. Doing this on my machine (8 cores with 16 gb total RAM) gives me:

![](../images/process_workers.png)

Right now we aren't taxing our system very much! 

Next, let's create a list of paths of our partitions.

In [19]:
paths = ['../input/partitions/p%d' %  i for i in range(1, 105)]
paths[:8]

['../input/partitions/p1',
 '../input/partitions/p2',
 '../input/partitions/p3',
 '../input/partitions/p4',
 '../input/partitions/p5',
 '../input/partitions/p6',
 '../input/partitions/p7',
 '../input/partitions/p8']

We made the partitions small enough that none of the feature matrices will be too large for an individual worker. 

The next step is the heart of the code. We create a "Dask bag" from the paths, map this to the `EntitySet` creating function, and then map the result to the `feature_matrix` create and save function. The `EntitySet` is never saved and only exists in working memory. The cell below does not actually execute the code, but only creates the `bag` of tasks that Dask will then be able to allocate. 

In [20]:
# Create a bag object from the partition paths
b = db.from_sequence(paths)

# Lazily map each path to an entityset
b = b.map(entityset_from_partition)

# Lazily map each entityset to a feature matrix that is saved to disk
b = b.map(feature_matrix_from_entityset, feature_names = featurenames)
    
b

dask.bag<map-fea..., npartitions=104>

If we look at the task graph in the Bokeh dashboard, we can see the tasks that are ongoing. From the structure of this [Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph), it's clear that this problem is highly parallelizable!

![](../images/task_graph_start.png)

The cell below carries out the computation. Nothing is returned since each feature matrix is saved as a `csv`.

In [None]:
overall_start = timer()
b.compute()
overall_end = timer()

print(f"Total Time Elapsed: {round(overall_end - overall_start, 2)} seconds.")

If you have Bokeh installed, you can see quite a bit of system information during the run. 


We can also make sure that we're using all system resources if we take a look at the status:

![](../images/status.png)

Now our workers are being used. You can take some time to look over the profile to see what operations took the most time.

![](../images/profile.png)

## Optional Final Feature Matrix

If we want one final matrix, we can read in the individual feature matrices and join them together. This could be done in Dask using threads, but it just as easily can be done in pure Python with Pandas.

In [None]:
# Base directory for feature matrices
base = '../input/fm/'
fm_paths = [base + p for p in os.listdir(base) if '.csv' in p]

In [None]:
read_start = timer()
# The first column of each file holds the saved index (`SK_ID_CURR`)
fms = [pd.read_csv(path, index_col = 0) for path in fm_paths]
read_end = timer()

print(f'Reading in {len(fms)} feature matrices took {round(read_end - read_start)} seconds.')

In [None]:
concat_start = timer()
# Stack the partitions row-wise: each one holds different clients with the same features
feature_matrix = pd.concat(fms, axis = 0)
concat_end = timer()

print('Final Feature Matrix Shape:', feature_matrix.shape)

In [None]:
print(f"Concatenation time: {round(concat_end - concat_start, 2)} seconds.")

The final feature matrix should have the expected shape: the number of clients in `app` by the number of features. 
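
The shape logic can be checked on a toy example. The sketch below builds two small stand-in feature matrices (the column names `feat_1` and `feat_2` and the client ids are made up) and stacks them row-wise, since each partition holds different clients but the same feature columns.

```python
import pandas as pd

# Two toy "partition" feature matrices: different clients, identical feature columns
fm_a = pd.DataFrame({'feat_1': [0.1, 0.2], 'feat_2': [1, 2]},
                    index=pd.Index([100001, 100002], name='SK_ID_CURR'))
fm_b = pd.DataFrame({'feat_1': [0.3], 'feat_2': [3]},
                    index=pd.Index([100003], name='SK_ID_CURR'))

# axis=0 stacks rows (clients); the feature columns stay the same
combined = pd.concat([fm_a, fm_b], axis=0)

# Rows add up across partitions while the column count is unchanged
expected_shape = (len(fm_a) + len(fm_b), fm_a.shape[1])
print(combined.shape, expected_shape)
```

Checking that the combined index has no duplicate client ids is a cheap way to confirm that no partition was read twice.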

If you don't already have the feature matrix, you can use the following line to save it to disk. This is now ready for feature selection and modeling! 

In [None]:
# feature_matrix.reset_index(inplace = True)
# feature_matrix.to_csv('../input/feature_matrix.csv', index = False)
feature_matrix.head()

# Conclusions

Working with constraints, such as limited computing power, leads to innovation. In this notebook, we had to engineer a solution to calculating the feature matrix in Dask to complete the task in a reasonable amount of time on a personal machine. Our approach was as follows:

1. Partition the data into sets based on the clients
2. Write a function to generate an `EntitySet` from a partition
3. Write a function to create a `feature_matrix` from an `EntitySet`
4. Set up Dask to use all 8 cores to make a feature matrix from 8 partitions at once
5. Save the resulting feature matrices to disk
6. Read in the individual feature matrices to create one final feature matrix.

Parallel processing allows us to take full advantage of our system's resources. Thanks to libraries such as Dask, we can run operations in parallel and reduce computation time substantially (here, from about 25 hours on a single core to about 3 hours on 8 cores). The same framework developed in this notebook can be applied to other data science and machine learning problems.