# Introduction: Partitioning 

In this notebook, we will partition the data by customer id. Partitioning allows a framework such as Dask or Spark (through PySpark) to parallelize computations across the partitions.

In [1]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## List of All Files

Below are all the files provided by the competition. There are several versions of multiple files because the competition released new data several times.

In [2]:
from pathlib import Path

PATH = Path('/data/churn/')
list(PATH.iterdir())

[PosixPath('/data/churn/train_v2.csv'),
 PosixPath('/data/churn/user_logs_v2.csv'),
 PosixPath('/data/churn/transactions.csv'),
 PosixPath('/data/churn/WSDMChurnLabeller.scala'),
 PosixPath('/data/churn/user_logs.csv'),
 PosixPath('/data/churn/transactions_v2.csv'),
 PosixPath('/data/churn/members_v3.csv'),
 PosixPath('/data/churn/train.csv'),
 PosixPath('/data/churn/sample_submission_v2.csv'),
 PosixPath('/data/churn/sample_submission_zero.csv')]

## Training data

Here we read in both versions of the training data (the labels) and concatenate them into a single dataframe.

In [9]:
train_1 = pd.read_csv(PATH/'train.csv')
train_2 = pd.read_csv(PATH/'train_v2.csv')
all_train = pd.concat([train_1, train_2], axis = 0)
all_train.shape

(1963891, 2)

In [11]:
all_train.to_csv(PATH/'all_train.csv', index = False)

## Transactions Data

We take the same approach with the transactions data.

In [3]:
trans_1 = pd.read_csv(PATH/'transactions.csv')
trans_2 = pd.read_csv(PATH/'transactions_v2.csv')

In [6]:
trans_1.head()
trans_2.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,YyO+tlZtAXYXoZhNr3Vg3+dfVQvrBVGO8j1mfqe4ZHc=,41,30,129,129,1,20150930,20151101,0
1,AZtu6Wl0gPojrEQYB8Q3vBSmE2wnZ3hi1FbK1rQQ0A4=,41,30,149,149,1,20150930,20151031,0
2,UkDFI97Qb6+s2LWcijVVv4rMAsORbVDT2wNXF0aVbns=,41,30,129,129,1,20150930,20160427,0
3,M1C56ijxozNaGD0t2h68PnH2xtx5iO5iR2MVYQB6nBI=,39,30,149,149,1,20150930,20151128,0
4,yvj6zyBUaqdbUQSrKsrZ+xNDVM62knauSZJzakS9OW4=,39,30,149,149,1,20150930,20151121,0


Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0


In [7]:
all_trans = pd.concat([trans_1, trans_2], axis = 0)
all_trans.shape

(22978755, 9)

In [8]:
all_trans.to_csv(PATH/'all_trans.csv', index = False)

## Members Data

The members data can be the parent of the other datasets. It includes the basic metadata about each member.

In [10]:
members = pd.read_csv(PATH/'members_v3.csv')
members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,1,0,,11,20110911
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,1,0,,7,20110914
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,1,0,,11,20110915
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,1,0,,11,20110915
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,6,32,female,9,20110915


## Approach to EntitySet

* Members is the parent
* Transactions is a child
* User logs is another child

Partition data based on user ids, `msno`.
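
The parent/child layout above implies that each `msno` in a child table should also appear in the members table. The following is a minimal sketch of how that could be checked, using small made-up dataframes (`toy_members` and `toy_trans` are illustrative stand-ins, not the competition data).

```python
import pandas as pd

# Toy parent and child tables; ids and amounts are made up for illustration.
toy_members = pd.DataFrame({'msno': ['a', 'b', 'c']})
toy_trans = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'z'],
    'actual_amount_paid': [149, 149, 99, 129],
})

# Boolean mask: does each child row have a matching parent id?
has_parent = toy_trans['msno'].isin(toy_members['msno'])

# Share of child rows with a parent, and the ids that have none
coverage = has_parent.mean()
orphans = toy_trans.loc[~has_parent, 'msno'].unique().tolist()

print(coverage, orphans)
```

The same mask applied to the real transactions or logs would show how much child data has no parent row in members.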

In [14]:
all_train['msno'].nunique()

1082190

In [15]:
len(all_train)

1963891

In [16]:
all_train.head()

Unnamed: 0,msno,is_churn
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,1
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,1
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,1
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,1
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,1


Some customers have multiple labels, as shown below. A value of 2 in the `is_churn` column of the output means that the same customer has both a churn label and a did-not-churn label.

In [18]:
all_train.groupby(['msno'])['is_churn'].nunique().to_frame().sort_values('is_churn').tail()

Unnamed: 0_level_0,is_churn
msno,Unnamed: 1_level_1
VLmKab6JByryjhBzawWgIXnsDn/p8gtuKNUDEIrALIw=,2
PX+bAAD+a5wZ0qGOiZbydu8SNbFZrhiCjz+z0q6jlLM=,2
ArZiQi90rqg0aBc5YeXAm6ux0CkHlyziFa03DKAtjVc=,2
3x8Nugz7JdfTmexKza79oBIly8isF7PTReWCXyAr9QU=,2
DeskHs66YWKq1JhCzVtfj0z8P5B+Y2Xrtu0mByqUufQ=,2
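
To go from the per-customer counts above to the actual conflicting rows, something along these lines may help. It is a sketch on a made-up dataframe (`toy_train` stands in for `all_train`), not a statement about how the labels should be resolved.

```python
import pandas as pd

# Toy stand-in for all_train; ids and labels are made up for illustration.
toy_train = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'c', 'c'],
    'is_churn': [0, 1, 0, 1, 1],
})

# Number of distinct labels per customer
labels_per_id = toy_train.groupby('msno')['is_churn'].nunique()

# Customers that carry both a 0 and a 1 label
conflicting_ids = labels_per_id[labels_per_id > 1].index.tolist()

# All label rows belonging to those customers
conflicting_rows = toy_train[toy_train['msno'].isin(conflicting_ids)]

print(conflicting_ids)
```

Customer `c` has two rows but only one distinct label, so it is not flagged; only `a` is.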


In [19]:
all_trans.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,YyO+tlZtAXYXoZhNr3Vg3+dfVQvrBVGO8j1mfqe4ZHc=,41,30,129,129,1,20150930,20151101,0
1,AZtu6Wl0gPojrEQYB8Q3vBSmE2wnZ3hi1FbK1rQQ0A4=,41,30,149,149,1,20150930,20151031,0
2,UkDFI97Qb6+s2LWcijVVv4rMAsORbVDT2wNXF0aVbns=,41,30,129,129,1,20150930,20160427,0
3,M1C56ijxozNaGD0t2h68PnH2xtx5iO5iR2MVYQB6nBI=,39,30,149,149,1,20150930,20151128,0
4,yvj6zyBUaqdbUQSrKsrZ+xNDVM62knauSZJzakS9OW4=,39,30,149,149,1,20150930,20151121,0


In [20]:
all_trans['msno'].nunique()

2426143

# Using Dask Dataframe for Large Datasets

For the user logs, we'll have to use Dask dataframes, which work with large datasets without loading the entire file into memory at once. Instead, Dask loads data as needed (lazy loading) and supports many of the same operations as Pandas dataframes. This makes it feasible to work with datasets that do not fit in the memory of a single machine.

## User Logs

The user logs are quite large (up to 29 GB) and must be dealt with using Dask dataframes. For example, to compute the shape of the log file, we cannot load in the entire file at once and instead must let Dask handle the loading behind the scenes.

In [21]:
from dask import dataframe as dd

logs1 = dd.read_csv(PATH/'user_logs.csv')

Calling a method or attribute of a Dask dataframe generally does not return the value immediately. Instead, it creates a lazy object describing work to be executed later. To actually execute the task, we have to call the `compute` method.

In [31]:
shape1 = logs1.shape[0]
shape1.compute()

392106543

In [32]:
logs2 = dd.read_csv(PATH/'user_logs_v2.csv')
shape2 = logs2.shape[0]
shape2.compute()

18396362

In [36]:
logs1['msno'].nunique().compute()

5234111

In [37]:
logs2['msno'].nunique().compute()

1103894

The concatenation below does not actually join the datasets together until necessary. Instead, it creates a lazy object that is evaluated when we run calculations on the log data.

In [49]:
all_logs = dd.concat([logs1, logs2], axis = 0)

## Submission Data

The following are the submission files (test data) for the competition. These also come in two parts because of the multiple issues with data provenance encountered in the competition.

In [38]:
sub1 = pd.read_csv(PATH/'sample_submission_zero.csv')
sub2 = pd.read_csv(PATH/'sample_submission_v2.csv')
all_subs = pd.concat([sub1, sub2], axis = 0)
all_subs['msno'].nunique()

1076941

### Ids

I'm not sure how many ids there really are because the datasets have differing numbers of unique ids. For partitioning the data, I'll create a list of all the ids in the training and testing dataframes (both versions). It's possible that much of the data in the other dataframes is not related to any of these customers.

In [39]:
all_msno = list(all_subs['msno'].unique()) + list(all_train['msno'].unique())
print(f'There are {len(all_msno)} unique ids in the training and testing data.')

There are 2159131 unique ids in the training and testing data.
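
The count printed above equals 1,076,941 + 1,082,190, the numbers of unique ids in the testing and training data, so the list is a plain concatenation. Any id present in both sets would appear twice, and its rows could end up in two partitions. Whether such overlap exists is not shown here. If it does, one possible way to remove repeats while keeping the original order is sketched below on made-up id lists.

```python
# Toy id lists; in the notebook these would be the unique ids
# from all_subs and all_train.
test_ids = ['a', 'b', 'c']
train_ids = ['b', 'c', 'd']

# Plain concatenation keeps ids that appear in both lists twice.
combined = test_ids + train_ids

# dict.fromkeys drops repeats while preserving first-seen order.
deduplicated = list(dict.fromkeys(combined))

print(len(combined), len(deduplicated))  # 6 4
```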


#### Chunking Data

To keep the size of the individual chunks manageable, we'll create 1000 partitions. Each partition will contain all the data associated with a subset of customers. 

In [43]:
n_chunks = 1000
chunk_size = len(all_msno) // (n_chunks - 1)

The following code splits the ids into a list of 1000 lists, each holding a subset of the customer ids.

In [45]:
id_list = [list(all_msno[i:i+chunk_size]) for i in range(0, len(all_msno), chunk_size)]
len(id_list)

1000
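
The arithmetic behind the chunk size can be checked without building the lists. The sketch below uses the id count reported earlier. Dividing by `n_chunks - 1` gives 1000 slices for this count, but that is not guaranteed for every length, so a ceiling-division alternative is shown for comparison (`even_size` is an illustrative name, not part of the notebook).

```python
n_ids = 2159131   # length of all_msno reported above
n_chunks = 1000

# Chunk size as computed in the notebook
chunk_size = n_ids // (n_chunks - 1)

# Number of slices produced by range(0, n_ids, chunk_size)
n_slices = len(range(0, n_ids, chunk_size))

# Size of the final, shorter slice
last_size = n_ids - chunk_size * (n_slices - 1)

print(chunk_size, n_slices, last_size)  # 2161 1000 292

# Alternative: ceiling division never yields more than n_chunks slices
# and leaves a less lopsided final slice.
even_size = -(-n_ids // n_chunks)
even_slices = len(range(0, n_ids, even_size))

print(even_size, even_slices)  # 2160 1000
```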

In [47]:
from itertools import chain
all_ids = list(chain(*id_list))
len(all_ids)

2159131

We should make sure that all the ids from the training and testing data are represented in the list of lists.

In [48]:
all_ids == all_msno

True

### Example of Subsetting the Dask Dataframe

The Dask dataframe cannot be subsetted all at once. Instead, we can create the call and then compute the result using Dask. Behind the scenes, Dask will load in the dataframe a little at a time and find the correct rows. Then, it will return all the rows we need which we can save in the partition. Without Dask, this operation would not be feasible.

In [56]:
logs_subset = all_logs.loc[all_logs['msno'].isin(id_list[0]), :].copy().compute()

#### Save Data 

Each partition will be saved in a separate directory on disk. At the end of partitioning, there will be 1000 directories, each with all the data associated with a subset of customers.

In [52]:
import os

partition = 0
id_section = id_list[0]
directory = PATH/f'partitions/p{partition}'

os.makedirs(directory, exist_ok=True)

#### Example Application

We'll create one trial partition for the first subset of customers. Each subset holds about 2,160 customer ids.

In [57]:
logs_subset = all_logs.loc[all_logs['msno'].isin(id_section), :].copy().compute()
trans_subset = all_trans.loc[all_trans['msno'].isin(id_section), :].copy()
members_subset = members.loc[members['msno'].isin(id_section), :].copy()
train_subset = all_train.loc[all_train['msno'].isin(id_section), :].copy()
test_subset = all_subs.loc[all_subs['msno'].isin(id_section), :].copy()

In [58]:
logs_subset.to_csv(f'{directory}/logs.csv', index = False)
trans_subset.to_csv(f'{directory}/transactions.csv', index = False)
members_subset.to_csv(f'{directory}/members.csv', index = False)
train_subset.to_csv(f'{directory}/train.csv', index = False)
test_subset.to_csv(f'{directory}/test.csv', index = False)

In [59]:
os.listdir(directory)

['transactions.csv', 'members.csv', 'test.csv', 'train.csv', 'logs.csv']

That works as intended, so we can wrap the code in a function and then calculate all the partitions. The following function takes in a list of customers (`id_section`) and creates a partition for those customers. The partition is then saved using the number of the partition (`partition`) as the directory identifier.

In [70]:
def create_partition(id_section: list, partition: int) -> None:
    """Create a partition from a list of ids. Save to disk using the partition number."""
    
    # Log subset must be computed because it is a dask dataframe
    logs_subset = all_logs.loc[all_logs['msno'].isin(id_section), :].copy().compute()
    # All other subsets can simply be selected
    trans_subset = all_trans.loc[all_trans['msno'].isin(id_section), :].copy()
    members_subset = members.loc[members['msno'].isin(id_section), :].copy()
    train_subset = all_train.loc[all_train['msno'].isin(id_section), :].copy()
    test_subset = all_subs.loc[all_subs['msno'].isin(id_section), :].copy()
    
    # Make the partition directory
    directory = PATH/f'partitions/p{partition}'
    os.makedirs(directory, exist_ok = True)
    
    # Save the subset to disk
    logs_subset.to_csv(f'{directory}/logs.csv', index = False)
    trans_subset.to_csv(f'{directory}/transactions.csv', index = False)
    members_subset.to_csv(f'{directory}/members.csv', index = False)
    train_subset.to_csv(f'{directory}/train.csv', index = False)
    test_subset.to_csv(f'{directory}/test.csv', index = False)
    
    # Progress
    if partition % 10 == 0:
        print(f'Partition {partition} saved to {directory}.')

In [71]:
from timeit import default_timer as timer

The following code runs all of the partitions. This may take quite a while.

In [None]:
start = timer()
for i, id_subset in enumerate(id_list):
    create_partition(id_subset, i)
end = timer()
print(f'{round(end - start)} seconds elapsed.')

Partition 0 saved to /data/churn/partitions/p0.
Partition 10 saved to /data/churn/partitions/p10.
Partition 20 saved to /data/churn/partitions/p20.
Partition 30 saved to /data/churn/partitions/p30.
Partition 40 saved to /data/churn/partitions/p40.
Partition 50 saved to /data/churn/partitions/p50.
Partition 60 saved to /data/churn/partitions/p60.
Partition 70 saved to /data/churn/partitions/p70.
Partition 80 saved to /data/churn/partitions/p80.
Partition 90 saved to /data/churn/partitions/p90.
Partition 100 saved to /data/churn/partitions/p100.
Partition 110 saved to /data/churn/partitions/p110.
Partition 120 saved to /data/churn/partitions/p120.
Partition 130 saved to /data/churn/partitions/p130.
Partition 140 saved to /data/churn/partitions/p140.
Partition 150 saved to /data/churn/partitions/p150.
Partition 160 saved to /data/churn/partitions/p160.
Partition 170 saved to /data/churn/partitions/p170.
Partition 180 saved to /data/churn/partitions/p180.
Partition 190 saved to /data/churn
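
Because the full run may take quite a while, it can be useful to restart it without redoing finished partitions. The following is a minimal sketch of one way to detect them. `partition_is_complete` is a hypothetical helper, and the demonstration uses a temporary directory rather than `/data/churn`. Note that it only checks that the files exist, not that they were fully written.

```python
import tempfile
from pathlib import Path

# File names written by create_partition in the notebook
EXPECTED_FILES = {'logs.csv', 'transactions.csv', 'members.csv',
                  'train.csv', 'test.csv'}

def partition_is_complete(directory: Path) -> bool:
    """Hypothetical helper: True if the directory holds all five files."""
    if not directory.is_dir():
        return False
    return EXPECTED_FILES <= {p.name for p in directory.iterdir()}

# Demonstration on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'partitions'

    # p0 is complete
    done = root / 'p0'
    done.mkdir(parents=True)
    for name in EXPECTED_FILES:
        (done / name).touch()

    # p1 is only partly written; p2 does not exist at all
    half = root / 'p1'
    half.mkdir()
    (half / 'logs.csv').touch()

    # Partitions that would still need to be created
    todo = [i for i in range(3) if not partition_is_complete(root / f'p{i}')]

print(todo)  # [1, 2]
```

In the loop above, the same check could be used to skip index `i` when `PATH/f'partitions/p{i}'` is already complete.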

# Next Steps

Now that the data has been partitioned, we can run calculations on the partitions in parallel. After building a processing pipeline by experimenting with a single partition, we can scale to all the data by running the same operations on every partition in parallel. This can be done using Dask or a similar framework such as Spark (through PySpark).

The next notebook will create a processing pipeline for a single partition.
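
As a rough illustration of the parallel step described above, the sketch below maps a function over several partition directories at once. It uses the standard library's thread pool purely to stay self-contained; it is a stand-in for Dask or Spark, not the pipeline itself. `count_log_rows` is a hypothetical per-partition task and the partitions are tiny made-up files in a temporary directory.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_log_rows(directory: Path) -> int:
    """Hypothetical per-partition task: count data rows in logs.csv."""
    with open(directory / 'logs.csv') as f:
        return sum(1 for _ in f) - 1  # subtract the header row

with tempfile.TemporaryDirectory() as tmp:
    # Build three tiny stand-in partitions with 1, 2 and 3 data rows
    dirs = []
    for i in range(3):
        d = Path(tmp) / f'p{i}'
        d.mkdir()
        (d / 'logs.csv').write_text('msno,value\n' + 'x,1\n' * (i + 1))
        dirs.append(d)

    # Each partition is independent, so the tasks can run concurrently.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=3) as pool:
        row_counts = list(pool.map(count_log_rows, dirs))

print(row_counts)  # [1, 2, 3]
```

For CPU-heavy processing, a process-based pool or a Dask scheduler would likely be a better fit than threads.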