# Introduction: Partition Pipeline

In this notebook, we work with a single partition to develop a pipeline for processing the data. The end goal is code that can take a partition on disk and generate a feature matrix from it, which will then be parallelized using Spark through PySpark. This notebook covers the first step: making the labels and cutoff times for each partition, which is run over all partitions with Dask.

In [1]:
import pandas as pd 
import numpy as np

import featuretools as ft

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
PARTITION = '500'
directory = '/data/churn/partitions/p' + PARTITION

import os
os.listdir(directory)

['logs.csv', 'members.csv', 'train.csv', 'test.csv', 'transactions.csv']

In [3]:
all_partitions = os.listdir('/data/churn/partitions/')
len(all_partitions)

1000

In [4]:
members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], infer_datetime_format = True)
trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)
logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])
train = pd.read_csv(f'{directory}/train.csv')
test = pd.read_csv(f'{directory}/test.csv')

In [5]:
members.head()
trans.head()
logs.head()
train.head()
test.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,IPcy704aIqoa4MY5NBAKhVw1qZCWvQcYICBVMufSbcg=,5,0,male,3,2014-11-02
1,N7VphdA9MRD/ojyO/jSWydNrQqfZMe2d1eDl5kwB+vg=,5,17,female,4,2016-12-26
2,wnOtVWT2Hi28usrU9Yb0JCdl/TGO48HUfJlgehG0kDw=,1,0,,4,2017-01-20
3,DEIygRcw0Soz4FguDgJQnSrlHoTYHmlvTcoOLB9dF2Y=,1,0,,4,2017-01-21
4,q4k48ZA18embL69OlVhGpT/8sB5nhETBpH5B6Ud+JXI=,1,0,,4,2016-08-15


Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,/4OzeklvQKOIr804cYEbcsy4xbpWHQFF40oeMTMuTak=,41,30,129,129,1,2016-02-29,2016-03-31,0
1,1lhQM//dvJCyWLTaCw7x+aDrCFNhNk/8QzlMwiRgB4Y=,41,30,149,149,1,2015-12-31,2016-01-31,0
2,bSgrbAUbyZDpkoQgVxeH4dQ7v8yEoucUK0lB0x6F2R0=,21,30,149,149,1,2015-12-02,2016-01-08,0
3,c3HjpBgEcGfa+mkJVtC47gE2CaW+KTBUxijgvrnBUuY=,28,30,150,150,0,2016-07-02,2016-08-01,0
4,vWLvk74sFSINQPmCbcIMqAh1MDdzxroTKIjaxKWEQHA=,41,30,99,99,1,2016-09-13,2016-10-13,0


Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,1T6cC9wlTNDxYh+ikIsljHO3LJ62pdNxeo0uC6b9iUk=,2017-03-17,17,1,1,0,55,51,12650.427
1,Qbj5QJcK+N/z9h4fR82QYmABCS9g3EIbGijYxqOAw3M=,2017-03-01,4,2,2,1,38,46,10247.052
2,tk3KXVctKu4yERExEwFvMMOrpU88K083pDNRONhpMzY=,2017-03-30,0,0,0,0,18,18,4565.533
3,q9u6CM2lMNSyc0mHPnH9O/yWvMGqeTcMqBHRnS7s0MI=,2017-03-20,0,0,0,0,21,21,5523.67
4,a/vnjfU45TFglx+JFOPBWQHOaQdEY/lYUw8cxLurbwA=,2017-03-10,4,1,1,0,12,11,3670.509


Unnamed: 0,msno,is_churn
0,ZUSJqkHx/1fHxi6uRqt7OYQ40pe2yWz695QUeCsaW+0=,1
1,L4oh94wx7Q77MwjON4+V+zL0jDYc3WqYirPaK+B7cxo=,1
2,lEgrHpAlsYj6yDiPFXOWAR9afI1rbu9/iGAmzJarhFI=,1
3,f//RYNN5Hyx3mT7tNS54h5YFCpzbBIGzrwy2qkpqTi0=,1
4,5pbXRfeGBsY0aecqUGMlLmxmKp7LoQKvIUiRW8feIuk=,1


Unnamed: 0,msno,is_churn
0,5R4p/Be8S+16d/K/aCJt39p/H73shezcMgPLANMLDLM=,0
1,MhdSuw4UBWK+xPXTUmlJIEghwlmmnqd+4bI03KUup5I=,0
2,nJLo+BP1rrR8XzVU1C0zg395X0C82ECOOidNr29nIAY=,0
3,zJmQZsz5bCdZcEoW0cMiNc0/G2DwOUHLdB/hLBhwqbo=,0
4,8AlCmBO/ax97R+c/XW0gTsXax22dde5dOYjGlJtfjXc=,0


## How Many Unique Members are There? 

Who do we need to find data for? The best choice is probably only the customers in the transactions dataframe since we can make labels for them. 

The definition of a label will be: within 30 days of cancelling, does a customer resubscribe? Given this definition, we can write a function to generate labels. We'll start with a single customer and then work out how to write a function for any customer.

In [6]:
trans = trans.loc[trans['membership_expire_date'] >= trans['transaction_date']]
trans = trans.sort_values(['msno', 'transaction_date', 'membership_expire_date'])
trans.groupby('msno').count().sort_values('is_cancel').tail()

msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
go1TddRe7x2XYZSQrvtMljLfzuyVckrvuy20g+x0NFY=,39,39,39,39,39,39,39,39
DD/G10hDcepG6zbvcqw1BN5m769aEIYc9pdKx0dpeRQ=,39,39,39,39,39,39,39,39
xYuQQy+4yBl/9xuUrmit5y4beSVf6LRgrZWQERAF9y8=,40,40,40,40,40,40,40,40
punXIeuAwM+W+pi9qpTkBNdnYORR4+/gOXgZG3fB8Q4=,41,41,41,41,41,41,41,41
ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,50,50,50,50,50,50,50,50


In [7]:
cust = trans.loc[trans['msno'] == 'ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY='].copy()
cust.iloc[:, 1:].tail()

Unnamed: 0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
15631,35,7,0,0,0,2015-12-29,2015-12-29,0
10056,35,7,0,0,0,2015-12-30,2015-12-30,0
13049,35,7,0,0,0,2015-12-30,2015-12-31,0
3426,38,410,1788,1788,0,2016-01-03,2017-02-16,0
22115,32,90,298,298,0,2017-02-17,2017-05-21,0


In [8]:
cancel = cust.loc[cust['is_cancel'] == 1].copy()
cancel

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
15813,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,37,30,149,149,1,2015-07-21,2015-08-11,1


## Potential Outcomes from a Transaction

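At the time of a transaction, we want to predict the customer's behavior following it. We'll establish four possible outcomes (five including no data); the labeling code in the following sections collapses outcomes 0 through 2 into a label of 0 (renewed) and maps outcome 3 to a label of 1 (churned):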

* 0: Sign another deal before membership expires
* 1: Sign another deal on the day membership expires
* 2: Sign another deal within 30 days of membership expiring
* 3: Do not sign another deal within 30 days of membership expiring
* NaN: no information in 30 days following expiration
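
The outcomes listed above can be sketched as a small function. This is only an illustration: `classify_outcome` and its parameters are hypothetical names, and "no information" is assumed to mean the customer has no later transaction at all, which mirrors how the labeling code in this notebook treats missing data.

```python
import numpy as np
import pandas as pd

def classify_outcome(expire_date, next_date, window_days=30):
    """Map the customer's next transaction date to outcome 0, 1, 2, 3, or NaN.

    Hypothetical helper: `next_date` is the date of the customer's next deal,
    or None/NaT if there is no later transaction in the data.
    """
    # Last day on which a new deal still counts as a renewal
    renew_by_date = expire_date + pd.Timedelta(window_days, 'D')

    if pd.isna(next_date):
        return np.nan   # no information after this transaction
    if next_date < expire_date:
        return 0        # new deal before the membership expires
    if next_date == expire_date:
        return 1        # new deal on the expiration day
    if next_date <= renew_by_date:
        return 2        # new deal within the 30-day window
    return 3            # next deal came after the window closed

expire = pd.Timestamp('2016-01-31')
next_dates = [pd.Timestamp('2016-01-15'), pd.Timestamp('2016-01-31'),
              pd.Timestamp('2016-02-10'), pd.Timestamp('2016-04-01'), None]
outcomes = [classify_outcome(expire, d) for d in next_dates]
print(outcomes)  # [0, 1, 2, 3, nan]
```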

In [9]:
non_cancels = cust.loc[cust['is_cancel'] == 0].copy()
cancels = cust.loc[cust['is_cancel'] == 1].copy()

## Non Cancellations

First we'll work through making labels for the non-cancellation transactions. A customer is defined as churned if they do not renew within 30 days of the end of the membership. 

In [10]:
labels = []
label_times = []

# Iterate through each non-cancellation transaction
for i, transaction in non_cancels.iterrows():
    
    # Find the transaction date and membership expiration date
    transaction_date = transaction['transaction_date']
    expire_date = transaction['membership_expire_date']
        
    label_times.append(transaction_date)
    
    # Customer has 30 days to renew 
    renew_by_date = expire_date + pd.Timedelta(30, 'D')
    
    # Subset to transactions within the renewal period
    renewal_trans = cust.loc[(cust['transaction_date'] >= transaction_date) & 
                             (cust['transaction_date'] <= renew_by_date) & 
                             (cust['is_cancel'] == 0) & (cust.index != i)].copy()
    
    # Data after the end of renewal period
    post_renewal = cust.loc[cust['transaction_date'] > expire_date]
    
    if len(renewal_trans) > 0:
        churned = 0
        
    # No data
    elif (len(renewal_trans) == 0) and (len(post_renewal) == 0):
        churned = np.nan
    
    # Data exists but after renewal period
    elif len(post_renewal) > 0:
        churned = 1
        
    labels.append(churned)

cutoff_times = pd.DataFrame({'cutoff': label_times, 'label': labels})
np.where(np.array(labels) == 1)

(array([6]),)

In [11]:
non_cancels.iloc[6:8, 1:]

Unnamed: 0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
5098,37,30,149,149,1,2015-07-12,2015-08-11,0
20104,35,7,0,0,0,2015-11-06,2015-11-06,0
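
To see why the transaction at position 6 is labeled as churn, the renewal window can be worked out by hand from the two rows above. The snippet below is a small check using only the dates shown in that output; the variable names are chosen for illustration.

```python
import pandas as pd

# Dates taken from the two rows displayed above
expire_date = pd.Timestamp('2015-08-11')        # expiration of the 2015-07-12 deal
next_transaction = pd.Timestamp('2015-11-06')   # next non-cancellation transaction

# The customer has 30 days after expiration to renew
renew_by_date = expire_date + pd.Timedelta(30, 'D')
renewed_in_time = next_transaction <= renew_by_date

print(renew_by_date.date())                    # 2015-09-10
print((next_transaction - expire_date).days)   # 87
print(renewed_in_time)                         # False
```

The next deal arrives 87 days after expiration, well past the 30-day window, so the label is 1.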


## Cancellations

We'll do the same thing with the cancellations: for a given cancellation, will the customer sign another contract within 30 days?

In [12]:
labels = []
label_times = []

# Iterate through each cancellation
for i, transaction in cancels.iterrows():
    
    # Find the transaction date and membership expiration date
    transaction_date = transaction['transaction_date']
    expire_date = transaction['membership_expire_date']
    
    label_times.append(transaction_date)
    
    # Customer has 30 days to renew 
    renew_by_date = expire_date + pd.Timedelta(30, 'D')
    
    # Subset to transactions within the renewal period
    renewal_trans = cust.loc[(cust['transaction_date'] >= transaction_date) & 
                             (cust['transaction_date'] <= renew_by_date) & 
                             (cust['is_cancel'] == 0) & (cust.index != i)].copy()
    
    # Data after the end of renewal period
    post_renewal = cust.loc[cust['transaction_date'] > expire_date]
    
    if len(renewal_trans) > 0:
        churned = 0
        
    # No data
    elif (len(renewal_trans) == 0) and (len(post_renewal) == 0):
        churned = np.nan
    
    # Data exists but after renewal period
    elif len(post_renewal) > 0:
        churned = 1
        
    labels.append(churned)
    
cancel_cutoff_times = pd.DataFrame({'cutoff': label_times, 'label': labels})
np.where(np.array(labels) == 1)

(array([0]),)

In [13]:
cancels.iloc[0, 1:]

payment_method_id                          37
payment_plan_days                          30
plan_list_price                           149
actual_amount_paid                        149
is_auto_renew                               1
transaction_date          2015-07-21 00:00:00
membership_expire_date    2015-08-11 00:00:00
is_cancel                                   1
Name: 15813, dtype: object

In [14]:
cutoff_times.head()
cancel_cutoff_times

Unnamed: 0,cutoff,label
0,2015-01-11,0.0
1,2015-02-11,0.0
2,2015-03-11,0.0
3,2015-04-12,0.0
4,2015-05-12,0.0


Unnamed: 0,cutoff,label
0,2015-07-21,1


# Function to Generate Labels for One Customer

Now we need to take those individual blocks of code and combine them into one function. The input is a customer id and the output is two dataframes: a label cutoff time dataframe for the non-cancellation transactions, and a label cutoff time dataframe for the cancellation transactions.

In [15]:
trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)

trans = trans.sort_values(['msno', 'transaction_date', 'membership_expire_date'])
trans = trans.loc[trans['membership_expire_date'] >= trans['transaction_date']]

In [16]:
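def make_labels(df, trans_df, customer_id):
    """Make a label cutoff time dataframe for one set of transactions.
    
    Params
    --------
        df (dataframe): transactions to label (the customer's cancellations or non-cancellations)
        trans_df (dataframe): all transactions for the customer
        customer_id (str): customer id (`msno`) as a string
    
    Returns
    --------
        cutoff_times (dataframe): label dataframe with columns 'msno', 'cutoff', and 'label'
    """
    labels = []
    label_times = []
    
    # Iterate through each transaction to label
    for i, transaction in df.iterrows():

        # Find the transaction date and membership expiration date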
        transaction_date = transaction['transaction_date']
        expire_date = transaction['membership_expire_date']

        label_times.append(transaction_date)

        # Customer has 30 days to renew 
        renew_by_date = expire_date + pd.Timedelta(30, 'D')

        # Subset to transactions within the renewal period
        renewal_trans = trans_df.loc[(trans_df['transaction_date'] >= transaction_date) & 
                                     (trans_df['transaction_date'] <= renew_by_date) & 
                                     (trans_df['is_cancel'] == 0) & (trans_df.index != i)].copy()

        # Data after the end of renewal period
        post_renewal = trans_df.loc[trans_df['transaction_date'] > expire_date]

        if len(renewal_trans) > 0:
            churned = 0

        # No data
        elif (len(renewal_trans) == 0) and (len(post_renewal) == 0):
            churned = np.nan

        # Data exists but after renewal period
        elif len(post_renewal) > 0:
            churned = 1

        labels.append(churned)

    cutoff_times = pd.DataFrame({'msno': customer_id,'cutoff': label_times, 'label': labels})
    return cutoff_times

In [17]:
def make_customer_labels(customer_id, transactions):
    """Make label cutoff time dataframes for a customer. 
    
    Params
    --------
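        customer_id (str): customer id (`msno`) as a string
        transactions (dataframe): transactions with columns 'msno', 'transaction_date',
                                  'membership_expire_date', and 'is_cancel'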
    
    Returns
    --------
        cutoff_times (dataframe): label dataframe with columns 'msno', 'cutoff', and 'label' for the non-cancellation
                                  transactions
        cancel_cutoff_times (dataframe): label dataframe with columns 'msno', 'cutoff', and 'label' for the
                                         cancellation transactions
    """
    # Subset to the customer data
    cust = transactions.loc[transactions['msno'] == customer_id].copy()
    
    non_cancels = cust.loc[cust['is_cancel'] == 0].copy()
    cancels = cust.loc[cust['is_cancel'] == 1].copy()
    
    cutoff_times = make_labels(non_cancels, cust, customer_id)
    cancel_cutoff_times = make_labels(cancels, cust, customer_id)
    
    return cutoff_times, cancel_cutoff_times

In [18]:
cutoff_times.head()

Unnamed: 0,cutoff,label
0,2015-01-11,0.0
1,2015-02-11,0.0
2,2015-03-11,0.0
3,2015-04-12,0.0
4,2015-05-12,0.0


In [19]:
ct, cancel_ct = make_customer_labels('ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=', trans)
ct.head()
cancel_ct

Unnamed: 0,msno,cutoff,label
0,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,2015-01-11,0.0
1,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,2015-02-11,0.0
2,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,2015-03-11,0.0
3,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,2015-04-12,0.0
4,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,2015-05-12,0.0


Unnamed: 0,msno,cutoff,label
0,ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=,2015-07-21,1


## Function to Make Labels for Entire Partition

The next step is to take the function from a single individual to an entire partition. Then, we can run this over every partition.

In [20]:
def make_partition_labels(partition, save_data = True):
    """Make label cutoff time dataframe for one partition.
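    
    Parameters
    --------
        partition (int): number of partition
        save_data (bool): if True, save the results to disk; if False, return them instead
        
    Returns
    --------
        If save_data is True, no return. Saves results to the partition directory as 
        cutoff_times.csv and cancel_cutoff_times.csv.
        If save_data is False, returns two lists of dataframes (one dataframe per customer): 
        cutoff_times and cancel_cutoff_times."""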
    
    # Read in transactions
    transactions = pd.read_csv(f'/data/churn/partitions/p{partition}/transactions.csv', 
                              parse_dates=['transaction_date', 'membership_expire_date'], 
                               infer_datetime_format = True)
    # Sort and filter
    transactions = transactions.sort_values(['msno', 'transaction_date', 'membership_expire_date'])
    transactions = transactions[transactions['membership_expire_date'] >= transactions['transaction_date']]
    
    # Unique customers
    customer_ids = list(transactions['msno'].unique())
    
    # Lists to hold dataframes
    cutoff_times = []
    cancel_cutoff_times = []
    
    for customer_id in customer_ids:
        ct, cancel_ct = make_customer_labels(customer_id, transactions)
        
        cutoff_times.append(ct)
        cancel_cutoff_times.append(cancel_ct)
    
    # Optionally return the data
    if not save_data:
        return cutoff_times, cancel_cutoff_times
    
    # By default save the data as a single dataframe
    cutoff_times = pd.concat(cutoff_times)
    cutoff_times.to_csv(f'/data/churn/partitions/p{partition}/cutoff_times.csv', index = False)

    cancel_cutoff_times = pd.concat(cancel_cutoff_times)
    cancel_cutoff_times.to_csv(f'/data/churn/partitions/p{partition}/cancel_cutoff_times.csv', index = False)
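
The loop above filters the full transactions dataframe once per customer inside `make_customer_labels`. One possible alternative, sketched below on made-up toy data, is to let `groupby` split the dataframe a single time and hand each customer's rows to the per-customer function. `count_by_type` is a hypothetical stand-in for `make_customer_labels`, used only to keep the example self-contained.

```python
import pandas as pd

# Toy transactions (made-up values) with the columns used in this notebook
transactions = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b', 'b'],
    'transaction_date': pd.to_datetime(
        ['2016-01-01', '2016-02-01', '2016-01-05', '2016-03-01', '2016-03-02']),
    'membership_expire_date': pd.to_datetime(
        ['2016-01-31', '2016-03-01', '2016-02-04', '2016-03-01', '2016-04-01']),
    'is_cancel': [0, 0, 0, 1, 0],
})

def count_by_type(customer_id, cust):
    """Hypothetical stand-in for make_customer_labels: summarize one customer."""
    return {'msno': customer_id,
            'n_non_cancel': int((cust['is_cancel'] == 0).sum()),
            'n_cancel': int((cust['is_cancel'] == 1).sum())}

# groupby yields (customer id, that customer's rows) pairs in one pass
results = [count_by_type(customer_id, cust)
           for customer_id, cust in transactions.groupby('msno')]
summary = pd.DataFrame(results)
print(summary)
```

Because each group already contains only one customer's rows, the same pattern should work with `make_customer_labels(customer_id, cust)`, though that has not been timed here.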

In [21]:
ct, cancel_ct = make_partition_labels(500, save_data = False)

In [22]:
cutoff_times_partition = pd.concat(ct)
cutoff_times_partition.head()

cancel_cutoff_times_partition = pd.concat(cancel_ct)
cancel_cutoff_times_partition.head()

Unnamed: 0,msno,cutoff,label
0,++qj3R3B417SL86dDpxCFbn32bg5sDnXe/Xq5vainPc=,2015-09-15,
0,+0c//ipo6m6vtrBrIwjCTjfKJO0pLnYM85tlXpatHRc=,2017-01-20,0.0
1,+0c//ipo6m6vtrBrIwjCTjfKJO0pLnYM85tlXpatHRc=,2017-02-19,0.0
2,+0c//ipo6m6vtrBrIwjCTjfKJO0pLnYM85tlXpatHRc=,2017-03-19,
0,+3iItQu4ny+tpUkvPkT3eR8GJIRVS1Fm7KO08TEUVKA=,2015-06-14,


Unnamed: 0,msno,cutoff,label
0,++qj3R3B417SL86dDpxCFbn32bg5sDnXe/Xq5vainPc=,2015-10-13,
0,+FXysf97i1BwwAPzHMRJPXmy7A8QnkKFzbx2ano9RLk=,2016-05-17,
0,+JOhl9HGa8ui2BvCWjFqAJ3yCcZSykfXNTuhqLPMj6I=,2015-12-13,
0,+WG2dC1meP8st31uaePeWyE9hMQTGTQj/jsvpQey8go=,2016-02-18,
0,+a48TFPXVLU0+duM97qrwq3uC2OUCYsZ0K0UcxoDIYg=,2017-03-16,


In [23]:
cutoff_times_partition.shape
cancel_cutoff_times_partition.shape

(21987, 3)

(722, 3)

In [24]:
len(trans) == (len(cutoff_times_partition) + len(cancel_cutoff_times_partition))

True
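
Several labels in the partition output above are missing, so it can be worth checking the label distribution before using the cutoff times. The sketch below uses a made-up toy dataframe with the same three columns; whether rows with missing labels should be dropped is an assumption for illustration, not something decided in this notebook.

```python
import numpy as np
import pandas as pd

# Toy cutoff times (made-up values) with the columns produced by make_labels
toy_cutoff_times = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b', 'c'],
    'cutoff': pd.to_datetime(
        ['2016-01-01', '2016-02-01', '2016-01-05', '2016-03-02', '2016-05-01']),
    'label': [0.0, 0.0, 1.0, np.nan, np.nan],
})

# Count each label value, including the missing ones
label_counts = toy_cutoff_times['label'].value_counts(dropna=False)
print(label_counts)

# Keep only rows with a known label (an assumed preprocessing choice)
usable = toy_cutoff_times.dropna(subset=['label'])
n_missing = int(toy_cutoff_times['label'].isna().sum())
print(f'{len(usable)} usable labels, {n_missing} missing')
```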

# Use Dask to Parallelize Making Cutoff Times

In [25]:
partitions = list(range(len(os.listdir('/data/churn/partitions'))))
len(partitions)

1000

In [26]:
import dask.bag as db
from dask.distributed import Client

# Use all cores
client = Client(processes = True)

In [27]:
client.ncores()

{'tcp://127.0.0.1:33070': 1,
 'tcp://127.0.0.1:33644': 1,
 'tcp://127.0.0.1:33917': 1,
 'tcp://127.0.0.1:34035': 1,
 'tcp://127.0.0.1:35414': 1,
 'tcp://127.0.0.1:35563': 1,
 'tcp://127.0.0.1:35692': 1,
 'tcp://127.0.0.1:35808': 1,
 'tcp://127.0.0.1:36866': 1,
 'tcp://127.0.0.1:37122': 1,
 'tcp://127.0.0.1:37947': 1,
 'tcp://127.0.0.1:38477': 1,
 'tcp://127.0.0.1:40177': 1,
 'tcp://127.0.0.1:42193': 1,
 'tcp://127.0.0.1:42682': 1,
 'tcp://127.0.0.1:45913': 1}

In [28]:
# Create a bag object
b = db.from_sequence(partitions, npartitions=len(partitions))

# Map partition making function
b = b.map(make_partition_labels)
    
b

dask.bag<map-mak..., npartitions=1000>

In [29]:
%%capture
from timeit import default_timer as timer

start = timer()
b.compute()
end = timer()





In [30]:
print(f'{round(end - start)} seconds elapsed.')

7038 seconds elapsed.
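
Because `make_partition_labels` saves its results and returns nothing, `b.compute()` yields one `None` per partition, which says little about whether any partition failed. A possible refinement, sketched below, wraps the function so every call reports a status. `with_status` and `fake_partition_job` are hypothetical names, and the example uses the built-in `map` rather than Dask so that it runs on its own.

```python
def with_status(func):
    """Wrap a partition function so each call returns (partition, status)."""
    def wrapped(partition):
        try:
            func(partition)
            return partition, 'ok'
        except Exception as exc:
            # Record the failure instead of stopping the whole run
            return partition, f'{type(exc).__name__}: {exc}'
    return wrapped

def fake_partition_job(partition):
    """Hypothetical stand-in for make_partition_labels; fails on partition 2."""
    if partition == 2:
        raise FileNotFoundError(f'no directory for partition {partition}')

statuses = list(map(with_status(fake_partition_job), range(4)))
failed = [partition for partition, status in statuses if status != 'ok']
print(failed)  # [2]
```

The wrapped function could presumably be passed to `b.map` in place of `make_partition_labels`, so that the computed list holds a status for each partition; this has not been run against the full dataset.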


# Conclusions

In this notebook, we generated labels and cutoff time dataframes for each partition. We can then use these cutoff times in a call to deep feature synthesis to make features for each label. 

The next notebook is `Feature Engineering`. 