# Partitioning Data

In this notebook, we partition the data into separate datasets (about 50 with the chunk size used below) based on the client id, `SK_ID_CURR`. Each partition holds every row, from every table, that belongs to its group of clients. These partitions can then be used with Dask to take advantage of all the resources on our machine when generating the feature matrix.

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

import featuretools.variable_types as vtypes

import sys
import psutil

import os

In [2]:
# Read in the datasets and replace the anomalous values
app_train = pd.read_csv('../input/application_train.csv').replace({365243: np.nan})
app_test = pd.read_csv('../input/application_test.csv').replace({365243: np.nan})
bureau = pd.read_csv('../input/bureau.csv').replace({365243: np.nan})
bureau_balance = pd.read_csv('../input/bureau_balance.csv').replace({365243: np.nan})
cash = pd.read_csv('../input/POS_CASH_balance.csv').replace({365243: np.nan})
credit = pd.read_csv('../input/credit_card_balance.csv').replace({365243: np.nan})
previous = pd.read_csv('../input/previous_application.csv').replace({365243: np.nan})
installments = pd.read_csv('../input/installments_payments.csv').replace({365243: np.nan})

app_test['TARGET'] = np.nan

# Join together training and testing
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
app = pd.concat([app_train, app_test], ignore_index = True, sort = True)

# All ids should be integers
for index in ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']:
    for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
        if index in list(dataset.columns):
            # Convert to integers after filling in missing values (not sure why values are missing)
            dataset[index] = dataset[index].fillna(0).astype(np.int64)

# Need `SK_ID_CURR` in every dataset
bureau_balance = bureau_balance.merge(bureau[['SK_ID_CURR', 'SK_ID_BUREAU']], 
                                      on = 'SK_ID_BUREAU', how = 'left')


# Set the index for locating
for dataset in [app, bureau, bureau_balance, cash, credit, previous, installments]:
    dataset.set_index('SK_ID_CURR', inplace = True)
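
The merge that adds `SK_ID_CURR` to `bureau_balance` can be illustrated on toy data. The sketch below uses made-up ids and tiny stand-in frames (`bureau_toy` and `balance_toy` are hypothetical names, not part of the notebook). It shows that a left merge attaches the client id to each balance row, and that a balance row whose `SK_ID_BUREAU` does not appear in `bureau` ends up with a missing client id.

```python
import pandas as pd

# Toy stand-ins for `bureau` and `bureau_balance`; all ids are made up
bureau_toy = pd.DataFrame({'SK_ID_CURR': [100001, 100001, 100002],
                           'SK_ID_BUREAU': [1, 2, 3]})
balance_toy = pd.DataFrame({'SK_ID_BUREAU': [1, 1, 3, 9],
                            'MONTHS_BALANCE': [0, -1, 0, 0]})

# Same kind of left merge as above: each balance row picks up its client id
merged_toy = balance_toy.merge(bureau_toy[['SK_ID_CURR', 'SK_ID_BUREAU']],
                               on='SK_ID_BUREAU', how='left')

# Rows whose SK_ID_BUREAU has no match in `bureau_toy` (id 9 here) get NaN
n_unmatched = int(merged_toy['SK_ID_CURR'].isna().sum())
print(merged_toy)
print('Unmatched rows:', n_unmatched)
```

Rows without a client id cannot be assigned to any partition by `SK_ID_CURR`, so it may be worth counting them on the real data before partitioning.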

In [8]:
def create_partition(user_list, partition):
    """Saves csv files containing only the rows for the users in `user_list`. 
       Main purpose is partitioning data"""
    
    # Subset based on user list
    app_subset = app[app.index.isin(user_list)].copy().reset_index()
    bureau_subset = bureau[bureau.index.isin(user_list)].copy().reset_index()
    
    # Drop SK_ID_CURR from bureau_balance, cash, credit, and installments
    bureau_balance_subset = bureau_balance[bureau_balance.index.isin(user_list)].copy().reset_index(drop = True)
    cash_subset = cash[cash.index.isin(user_list)].copy().reset_index(drop = True)
    credit_subset = credit[credit.index.isin(user_list)].copy().reset_index(drop = True)
    previous_subset = previous[previous.index.isin(user_list)].copy().reset_index()
    installments_subset = installments[installments.index.isin(user_list)].copy().reset_index(drop = True)
    
    # Make the directory
    directory = '../input/partitions/p%d' % (partition + 1)
    os.makedirs(directory, exist_ok=True)
    
    # Save data to the directory
    app_subset.to_csv('%s/app.csv' % directory, index = False)
    bureau_subset.to_csv('%s/bureau.csv' % directory, index = False)
    bureau_balance_subset.to_csv('%s/bureau_balance.csv' % directory, index = False)
    cash_subset.to_csv('%s/cash.csv' % directory, index = False)
    credit_subset.to_csv('%s/credit.csv' % directory, index = False)
    previous_subset.to_csv('%s/previous.csv' % directory, index = False)
    installments_subset.to_csv('%s/installments.csv' % directory, index = False)

    print('Saved all files in partition {} to {}.'.format(partition + 1, directory))
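
The subsetting pattern inside `create_partition` can be seen on a small example. The sketch below uses a toy frame with invented values (`cash_toy` is a hypothetical stand-in for `cash`). It shows the difference between `reset_index()`, which keeps `SK_ID_CURR` as a column, and `reset_index(drop = True)`, which discards it.

```python
import pandas as pd

# Toy child table indexed by client id, mirroring set_index('SK_ID_CURR');
# all values are made up
cash_toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 2, 3],
                         'SK_ID_PREV': [10, 10, 20, 30],
                         'MONTHS_BALANCE': [-1, -2, -1, -1]}).set_index('SK_ID_CURR')

user_list = [1, 3]
mask = cash_toy.index.isin(user_list)

# reset_index() moves SK_ID_CURR back into an ordinary column
kept = cash_toy[mask].copy().reset_index()

# reset_index(drop=True) throws the client id away; SK_ID_PREV remains as the link
dropped = cash_toy[mask].copy().reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

Tables saved without `SK_ID_CURR` still belong to the right partition, because the rows were selected by client id before the column was dropped.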

In [9]:
# Chunk size for roughly 50 chunks (any remainder rows form one extra, smaller chunk)
chunk_size = app.shape[0] // 50

# Construct an id list
id_list = [list(app.iloc[i:i+chunk_size].index) for i in range(0, app.shape[0], chunk_size)]
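
As a worked check of the chunking arithmetic, the sketch below repeats the floor-division scheme using the row count printed by the sanity check (356255) and counts the resulting chunks. It also shows `np.array_split` as one possible alternative that yields exactly the requested number of chunks. That alternative is an assumption for illustration, not what this notebook uses.

```python
import numpy as np

n_rows = 356255   # length of `app` as reported by the sanity check
n_target = 50     # divisor used for the chunk size

# Floor-division chunking, as in the cell above
chunk_size = n_rows // n_target
starts = list(range(0, n_rows, chunk_size))
last_chunk_rows = n_rows - starts[-1]
print(chunk_size, len(starts), last_chunk_rows)

# Alternative (not the notebook's method): exactly n_target nearly equal chunks
chunks = np.array_split(np.arange(n_rows), n_target)
sizes = sorted({len(c) for c in chunks})
print(len(chunks), sizes)
```

Because 356255 is not a multiple of 50, the floor-division scheme produces one more chunk than the divisor, and the final chunk holds only the leftover rows.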

In [10]:
from itertools import chain

# Sanity check that we have not missed any ids
print('Number of ids in id_list:         {}.'.format(len(list(chain(*id_list)))))
print('Total length of application data: {}.'.format(len(app)))

Number of ids in id_list:         356255.
Total length of application data: 356255.
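
Matching counts show that no row was skipped, but a count alone would not reveal an id that landed in two chunks. A slightly stricter check might look like the sketch below, where `check_partition` is a hypothetical helper and the ids are toy values.

```python
from itertools import chain

def check_partition(id_chunks, all_ids):
    """Return True if the chunks cover every id exactly once."""
    flat = list(chain.from_iterable(id_chunks))
    same_length = len(flat) == len(all_ids)
    no_duplicates = len(set(flat)) == len(flat)
    same_members = set(flat) == set(all_ids)
    return same_length and no_duplicates and same_members

# Toy ids standing in for the index of `app`
all_ids_toy = [1, 2, 3, 4, 5, 6, 7]

good = check_partition([[1, 2, 3], [4, 5, 6], [7]], all_ids_toy)
bad = check_partition([[1, 2, 3], [3, 4, 5], [6]], all_ids_toy)  # 3 repeated, 7 missing
print(good, bad)
```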


In [11]:
for i, ids in enumerate(id_list):
    # Create a partition based on the ids
    create_partition(ids, i)

Saved all files in partition 1 to ../input/partitions/p1.
Saved all files in partition 2 to ../input/partitions/p2.
Saved all files in partition 3 to ../input/partitions/p3.
Saved all files in partition 4 to ../input/partitions/p4.
Saved all files in partition 5 to ../input/partitions/p5.
Saved all files in partition 6 to ../input/partitions/p6.
Saved all files in partition 7 to ../input/partitions/p7.
Saved all files in partition 8 to ../input/partitions/p8.
Saved all files in partition 9 to ../input/partitions/p9.
Saved all files in partition 10 to ../input/partitions/p10.
Saved all files in partition 11 to ../input/partitions/p11.
Saved all files in partition 12 to ../input/partitions/p12.
Saved all files in partition 13 to ../input/partitions/p13.
Saved all files in partition 14 to ../input/partitions/p14.
Saved all files in partition 15 to ../input/partitions/p15.
Saved all files in partition 16 to ../input/partitions/p16.
Saved all files in partition 17 to ../input/partitions/p17

# Conclusions

Now that the data has been partitioned, we can use Dask to parallelize computing the feature matrix. We can generate the feature matrix for each partition independently because each partition contains all the data for its group of clients. The feature matrix itself will be generated on a personal machine using Dask. This is implemented in the Featuretools Implementation with Dask notebook.
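
To give a rough idea of how independent partitions map onto Dask, here is a minimal sketch using `dask.delayed` and `dask.compute`. It assumes Dask is installed, writes two toy partitions to a temporary directory instead of `../input/partitions`, and uses a hypothetical placeholder function, `count_clients`, where the real feature matrix computation would go. The actual implementation lives in the other notebook and may differ.

```python
import os
import tempfile

import pandas as pd
from dask import delayed, compute

def count_clients(directory):
    """Placeholder for per-partition work: read app.csv and count the clients."""
    app_part = pd.read_csv(os.path.join(directory, 'app.csv'))
    return int(app_part['SK_ID_CURR'].nunique())

with tempfile.TemporaryDirectory() as base:
    # Write two toy partitions laid out like p1/app.csv, p2/app.csv (ids are made up)
    paths = []
    for i, ids in enumerate([[1, 2, 3], [4, 5]]):
        directory = os.path.join(base, 'p%d' % (i + 1))
        os.makedirs(directory, exist_ok=True)
        pd.DataFrame({'SK_ID_CURR': ids}).to_csv(os.path.join(directory, 'app.csv'), index=False)
        paths.append(directory)

    # Build one lazy task per partition, then run them all together
    tasks = [delayed(count_clients)(path) for path in paths]
    results = compute(*tasks)

print(results)
```

Each task touches only its own directory, which is why the partitions can be processed in any order and on separate workers.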