# Introduction: Partition Labeling

In this notebook, we will work with a single partition to develop a labeling function. The end goal is code that can take a partition on disk and generate labels with the cutoff time dataframe. This will then be parallelized using Dask.

In [1]:
import pandas as pd 
import numpy as np

import featuretools as ft

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
PARTITION = '100'
directory = '/data/churn/partitions/p' + PARTITION

import os
os.listdir(directory)

['logs.csv',
 'members.csv',
 'train.csv',
 'cancel_cutoff_times.csv',
 'test.csv',
 'transactions.csv',
 'cutoff_times.csv']

In [3]:
all_partitions = os.listdir('/data/churn/partitions/')
len(all_partitions)

1000

### Read in Data Files

The only data we need for making the labels is the transactions. However, we'll read in all the data here.

In [4]:
members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], infer_datetime_format = True)
trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)
logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])

In [5]:
members.head()
trans.head()
logs.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,46r0VsNbxboshxpjyJ91oWdGX8SHzxFuS2mWkEBujD4=,13,29,female,9,2007-04-29
1,0C2o0WrDEiGkjbQtOR8x3U05OVCVLYFKHVRjgPCA0mM=,10,20,male,9,2007-11-07
2,d1UX5bu9bMtb8mzId7VbHFIUa46UO+IElVh9CxbonNQ=,1,0,,9,2015-01-28
3,mF6w9kCGjtrI0PtOomXQTmN027pGL21K8E2Jvitb0RE=,1,0,,9,2016-12-31
4,EjxD7eoFZ1+/jsVZc+8J79lRbHpZK+ZtyRGZFtZ0at4=,1,0,,4,2017-01-17


Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,G7TmHc9Gg2t8ovG/KFaB53We/0CQPELhZ5UUN2Ol3AQ=,39,30,149,149,1,2015-09-30,2015-11-13,0
1,LPbp8N7VRuqEISEVim8ppTaeYJG/rWS/t4g/dEFuWjw=,34,30,149,149,1,2016-02-29,2016-03-31,0
2,xvYqULBWzJvN8heyFtY3hbY3egyQNbXuDx0igtsoi00=,29,30,180,180,1,2017-01-31,2017-03-01,0
3,UR4iin4mAkajoa7o+AyTTmz5k3N2GR3/rZY8a4KwADI=,41,30,99,99,1,2017-01-31,2017-02-28,0
4,ax8CRhY8BMRA/ZvT1wI+2N/EdPXiSPGxa9y7bntA1Uc=,40,30,149,149,1,2016-05-04,2016-06-08,0


Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,zraYK27dg2odAmzM1z3r4DKqx/9X6P/O7f3VQbTF3zU=,2017-03-18,9,0,1,0,22,26,5714.52
1,Cv+9XMTbZ6Ua6TZoJSJJl7gtkstlXa3R/LHl2zdK+PY=,2017-03-09,4,0,0,0,28,30,3701.484
2,6bL/KHOSEXgB9yeqNpwAZDc9LYd7I/JbUUpXpTAyyKI=,2017-03-21,1,2,1,2,15,16,4288.633
3,UL9bm0eoKgK6YQWCGwz8CtD/9ySFy1fbFBzqUF2jmEs=,2017-03-30,13,3,0,0,30,37,8551.879
4,6cPyxd/lWz24stfnNlsWbW839l5GVWCQ/oJzr7ZT4a4=,2017-03-24,45,9,6,3,32,81,11342.642


## Remove Anomalies

A number of records in the transactions have a `membership_expire_date` that falls before the `transaction_date`. We will remove these to prevent them from creating issues with our labeling.

In [6]:
trans = trans.loc[trans['membership_expire_date'] >= trans['transaction_date']]
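
To see what this filter does, here is a minimal sketch on a few made-up rows (the `toy` dataframe and its dates are illustrative only, not taken from the partition). Rows where the membership expires on the transaction date itself are kept, since the comparison is `>=`.

```python
import pandas as pd

# Made-up rows: the second one expires before it was purchased
toy = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-03-01']),
    'membership_expire_date': pd.to_datetime(['2017-01-31', '2016-12-31', '2017-03-01']),
})

# Keep rows whose expiration is on or after the transaction date
valid = toy.loc[toy['membership_expire_date'] >= toy['transaction_date']]
print(f'{len(toy) - len(valid)} of {len(toy)} rows removed')
```

Printing the number of removed rows is a cheap way to check how much data the filter discards in a real partition.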

Some of the transactions occur on the same day for the same customer. In these cases, we will assume that the transaction with the later `membership_expire_date` happened second. 

In [7]:
trans = trans.sort_values(['msno', 'transaction_date', 'membership_expire_date'])

# View value counts of customers
trans.groupby('msno').count().sort_values('is_cancel').tail()

Unnamed: 0_level_0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
msno,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,39,39,39,39,39,39,39,39
CNVTd5cNoA6WRxMf7VveVwYyrOoDgFs5xmm1122qQNc=,39,39,39,39,39,39,39,39
kA7JMc6Q6nAFw1fgcs0hejCw4xXaGUzB/eq+M5n21wQ=,40,40,40,40,40,40,40,40
mFsXs71TCuJOnHKWQQ271BdecYkrXgPIRU7VUfmtqAY=,42,42,42,42,42,42,42,42
/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,51,51,51,51,51,51,51,51


We'll work with the data from one customer at first. 

In [37]:
cust = trans.loc[trans['msno'] == '/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM='].copy()
cust.iloc[:, 1:].tail(6)

Unnamed: 0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
12519,41,30,129,129,1,2016-11-08,2016-12-08,0
7776,41,30,129,129,1,2016-11-14,2016-11-14,1
1504,41,30,99,99,1,2016-11-14,2016-12-13,0
5462,41,30,99,99,1,2016-12-13,2017-01-13,0
10717,41,30,99,99,1,2017-01-13,2017-02-13,0
21899,41,30,99,99,1,2017-03-13,2017-04-13,0


In [38]:
cancel = cust.loc[cust['is_cancel'] == 1].copy()
cancel

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
7776,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,129,129,1,2016-11-14,2016-11-14,1


## Find Churns

To find a churn, we'll set a variable `periods` that defines the number of days required without a membership before a customer is considered churned. Then, if there are any gaps in membership longer than this period, we will mark it as a churn. To find the time period without a membership, we'll look at each `membership_expire_date` and calculate the time to the next transaction that is not a cancel. If this value is greater than `periods`, then the customer has churned.
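
As a rough sketch of this rule, the gap can also be computed without an explicit loop. The example below uses made-up rows (the `toy` dataframe is hypothetical) and assumes the rows are already sorted by transaction date for a single customer. It is one possible vectorized formulation, not the approach used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Made-up transactions for one customer, already sorted by date
toy = pd.DataFrame({
    'transaction_date': pd.to_datetime(
        ['2017-01-01', '2017-02-01', '2017-02-01', '2017-04-15']),
    'membership_expire_date': pd.to_datetime(
        ['2017-01-31', '2017-02-01', '2017-03-03', '2017-05-15']),
    'is_cancel': [0, 1, 0, 0],
})

periods = 30

# Start dates of non-cancel transactions only (NaT for cancellations)
starts = toy['transaction_date'].where(toy['is_cancel'] == 0)

# For each row, the start date of the next non-cancel transaction
next_start = starts.shift(-1).bfill()

# Days between this membership expiring and the next non-cancel transaction
toy['gap'] = (next_start - toy['membership_expire_date']).dt.days

# A gap longer than `periods` counts as a churn; a missing gap does not
toy['churn'] = toy['gap'] > periods
print(toy)
```

The last row has no later transaction, so its gap is missing and it is not flagged as a churn here. Whether missing gaps should be treated differently is a labeling decision.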

In [39]:
cust.reset_index(inplace = True, drop = True)
periods = 30

In [42]:
for tup in cust.itertuples():
    break
tup

Pandas(Index=0, msno='/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=', payment_method_id=41, payment_plan_days=30, plan_list_price=149, actual_amount_paid=149, is_auto_renew=1, transaction_date=Timestamp('2015-01-13 00:00:00'), membership_expire_date=Timestamp('2016-12-15 00:00:00'), is_cancel=0)

In [52]:
gaps = []

# Iterate through each entry
for tup in cust.itertuples():
    i = tup.Index
    expire_date = tup.membership_expire_date
    
    # Find the next transaction that is not a cancellation
    j = i + 1
    while j < len(cust) and cust.loc[j, 'is_cancel'] == 1:
        j += 1
        
    if j >= len(cust):
        # No later non-cancel transaction (also covers a trailing cancellation)
        gaps.append(np.nan)
    else:
        next_start_date = cust.loc[j, 'transaction_date']
        gaps.append((next_start_date - expire_date).days)

In [51]:
(next_start_date - expire_date).days

-694

In [54]:
cust['gap'] = gaps
cust

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,gap
0,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-01-13,2016-12-15,0,-694.0
1,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,119,1,2015-01-21,2017-01-15,0,-720.0
2,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-01-26,2017-02-15,0,-733.0
3,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-02-13,2017-03-15,0,-753.0
4,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,119,1,2015-02-21,2017-04-12,0,-776.0
5,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-02-26,2017-05-10,0,-789.0
6,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-03-13,2017-06-10,0,-812.0
7,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,119,1,2015-03-21,2017-07-11,0,-838.0
8,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-03-26,2017-08-11,0,-851.0
9,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,41,30,149,149,1,2015-04-13,2017-09-10,0,-873.0


In [33]:
next_length_days

30

In [34]:
next_trans_date

Timestamp('2015-01-21 00:00:00')

In [35]:
expire_date

149

In [30]:
next_start_date - expire_date

ValueError: Cannot add integral value to Timestamp without freq.

In [29]:
next_start_date

Timestamp('2015-01-21 00:00:00')

## Potential Outcomes from a Transaction

At the time of the transaction, we want to make a prediction of the behavior following the transaction. We'll establish four different possibilities (five including no data):

* 0: Sign another deal before membership expires
* 1: Sign another deal on the day membership expires
* 2: Sign another deal within 30 days of membership expiring
* 3: Do not sign another deal within 30 days of membership expiring
* NaN: no information in 30 days following expiration
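
One way to express these outcomes in code is a small helper that maps a gap, measured in days from the membership expiration to the next non-cancel transaction, to an outcome code. The function name `outcome_from_gap` and the choice to count a gap of exactly 30 days as "within 30 days" are assumptions made for illustration. A missing gap is used here as a stand-in for the no-information case.

```python
import numpy as np

def outcome_from_gap(gap, window=30):
    """Map a gap in days to an outcome code (hypothetical helper).

    gap: days from membership expiration to the next non-cancel
         transaction; NaN when there is no later transaction.
    """
    if np.isnan(gap):
        return np.nan   # no information
    if gap < 0:
        return 0        # new deal before the membership expires
    if gap == 0:
        return 1        # new deal on the day the membership expires
    if gap <= window:
        return 2        # new deal within the window after expiration
    return 3            # no new deal within the window

for g in [-5, 0, 12, 45, np.nan]:
    print(g, outcome_from_gap(g))
```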

In [None]:
non_cancels = cust.loc[cust['is_cancel'] == 0].copy()
cancels = cust.loc[cust['is_cancel'] == 1].copy()

## Non Cancellations

First we'll work through making labels for the non-cancellation transactions. A customer is defined as churned if they do not renew within 30 days of the end of the membership. 
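
As a quick worked example of this definition, take the customer's last two transactions shown earlier: a membership bought on 2017-01-13 that expires on 2017-02-13, followed by a transaction on 2017-03-13. The variable names below are chosen for illustration.

```python
import pandas as pd

expire_date = pd.Timestamp('2017-02-13')
next_transaction_date = pd.Timestamp('2017-03-13')

# The customer has 30 days after expiration to renew
renew_by_date = expire_date + pd.Timedelta(30, 'D')

# Renewed in time if the next transaction falls on or before the deadline
renewed_in_time = next_transaction_date <= renew_by_date
churned = int(not renewed_in_time)
print(renew_by_date.date(), churned)
```

The deadline works out to 2017-03-15, so the transaction on 2017-03-13 counts as a renewal and the label is 0.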

In [None]:
labels = []
label_times = []

# Iterate through each non-cancellation transaction
for i, transaction in non_cancels.iterrows():
    
    # Find the transaction date and membership expiration date
    transaction_date = transaction['transaction_date']
    expire_date = transaction['membership_expire_date']
        
    label_times.append(transaction_date)
    
    # Customer has 30 days to renew 
    renew_by_date = expire_date + pd.Timedelta(30, 'D')
    
    # Subset to transactions within the renewal period
    renewal_trans = cust.loc[(cust['transaction_date'] >= transaction_date) & 
                             (cust['transaction_date'] <= renew_by_date) & 
                             (cust['is_cancel'] == 0) & (cust.index != i)].copy()
    
    # All transactions after the membership expires
    post_renewal = cust.loc[cust['transaction_date'] > expire_date]
    
    if len(renewal_trans) > 0:
        churned = 0
        
    # No data
    elif (len(renewal_trans) == 0) and (len(post_renewal) == 0):
        churned = np.nan
    
    # Data exists but after renewal period
    elif len(post_renewal) > 0:
        churned = 1
        
    labels.append(churned)

cutoff_times = pd.DataFrame({'cutoff': label_times, 'label': labels})
np.where(np.array(labels) == 1)

In [None]:
non_cancels.iloc[6:8, 1:]

## Cancellations

We'll do the same thing with the cancellations: for a given cancellation, will the customer sign another contract within 30 days?

In [None]:
labels = []
label_times = []

# Iterate through each cancellation
for i, transaction in cancels.iterrows():
    
    # Find the transaction date and membership expiration date
    transaction_date = transaction['transaction_date']
    expire_date = transaction['membership_expire_date']
    
    label_times.append(transaction_date)
    
    # Customer has 30 days to renew 
    renew_by_date = expire_date + pd.Timedelta(30, 'D')
    
    # Subset to transactions within the renewal period
    renewal_trans = cust.loc[(cust['transaction_date'] >= transaction_date) & 
                             (cust['transaction_date'] <= renew_by_date) & 
                             (cust['is_cancel'] == 0) & (cust.index != i)].copy()
    
    # All transactions after the membership expires
    post_renewal = cust.loc[cust['transaction_date'] > expire_date]
    
    if len(renewal_trans) > 0:
        churned = 0
        
    # No data
    elif (len(renewal_trans) == 0) and (len(post_renewal) == 0):
        churned = np.nan
    
    # Data exists but after renewal period
    elif len(post_renewal) > 0:
        churned = 1
        
    labels.append(churned)
    
cancel_cutoff_times = pd.DataFrame({'cutoff': label_times, 'label': labels})
np.where(np.array(labels) == 1)

In [None]:
cancels.iloc[0, 1:]

In [None]:
cutoff_times.head()
cancel_cutoff_times

# Function to Generate Labels for One Customer

Now we need to take those individual blocks of code and combine them into one function. The inputs are a customer id and the transactions dataframe, and the output is two dataframes: a label cutoff time dataframe for the non-cancellation transactions, and a label cutoff time dataframe for the cancellation transactions.

In [None]:
trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)

trans = trans.sort_values(['msno', 'transaction_date', 'membership_expire_date'])
trans = trans.loc[trans['membership_expire_date'] >= trans['transaction_date']]

In [None]:
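def make_labels(df, trans_df, customer_id):
    """Label each transaction in `df` using all of the customer's transactions in `trans_df`.
    
    Returns a dataframe with columns 'msno', 'cutoff', and 'label'."""
    
    labels = []
    label_times = []
    
    # Iterate through each transaction in df
    for i, transaction in df.iterrows():

        # Find the transaction date and membership expiration date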
        transaction_date = transaction['transaction_date']
        expire_date = transaction['membership_expire_date']

        label_times.append(transaction_date)

        # Customer has 30 days to renew 
        renew_by_date = expire_date + pd.Timedelta(30, 'D')

        # Subset to transactions within the renewal period
        renewal_trans = trans_df.loc[(trans_df['transaction_date'] >= transaction_date) & 
                                     (trans_df['transaction_date'] <= renew_by_date) & 
                                     (trans_df['is_cancel'] == 0) & (trans_df.index != i)].copy()

        # All transactions after the membership expires
        post_renewal = trans_df.loc[trans_df['transaction_date'] > expire_date]

        if len(renewal_trans) > 0:
            churned = 0

        # No data
        elif (len(renewal_trans) == 0) and (len(post_renewal) == 0):
            churned = np.nan

        # Data exists but after renewal period
        elif len(post_renewal) > 0:
            churned = 1

        labels.append(churned)

    cutoff_times = pd.DataFrame({'msno': customer_id,'cutoff': label_times, 'label': labels})
    return cutoff_times

In [None]:
def make_customer_labels(customer_id, transactions):
    """Make label cutoff time dataframes for a customer. 
    
    Params
    --------
        customer_id (str): customer id (`msno`) as a string
        transactions (dataframe): transactions for the partition, sorted and with anomalies removed
    
    Returns
    --------
        cutoff_times (dataframe): label dataframe with columns 'msno', 'cutoff', and 'label' for the non-cancellation
                                  transactions
        cancel_cutoff_times (dataframe): label dataframe with columns 'msno', 'cutoff', and 'label' for the
                                         cancellation transactions
    """
    # Subset to the customer data
    cust = transactions.loc[transactions['msno'] == customer_id].copy()
    
    non_cancels = cust.loc[cust['is_cancel'] == 0].copy()
    cancels = cust.loc[cust['is_cancel'] == 1].copy()
    
    cutoff_times = make_labels(non_cancels, cust, customer_id)
    cancel_cutoff_times = make_labels(cancels, cust, customer_id)
    
    return cutoff_times, cancel_cutoff_times

In [None]:
cutoff_times.head()

In [None]:
ct, cancel_ct = make_customer_labels('ineWmhVlalmqq5V4rNEd8ECLRSdVSwYX7yuxmVSzIsY=', trans)
ct.head()
cancel_ct

## Function to Make Labels for Entire Partition

The next step is to take the function from a single individual to an entire partition. Then, we can run this over every partition.

In [None]:
def make_partition_labels(partition, save_data = True):
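    """Make label cutoff time dataframe for one partition.
    
    Parameters
    --------
        partition (int): number of partition
        save_data (bool): if True, save the labels to disk; if False, return them instead
        
    Returns
    --------
        If save_data is True, nothing is returned and the results are saved to the partition
        directory as cutoff_times.csv and cancel_cutoff_times.csv. If save_data is False,
        returns two lists of per-customer dataframes: cutoff_times and cancel_cutoff_times."""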
    
    # Read in transactions
    transactions = pd.read_csv(f'/data/churn/partitions/p{partition}/transactions.csv', 
                              parse_dates=['transaction_date', 'membership_expire_date'], 
                               infer_datetime_format = True)
    # Sort and filter
    transactions = transactions.sort_values(['msno', 'transaction_date', 'membership_expire_date'])
    transactions = transactions[transactions['membership_expire_date'] >= transactions['transaction_date']]
    
    # Unique customers
    customer_ids = list(transactions['msno'].unique())
    
    # Lists to hold dataframes
    cutoff_times = []
    cancel_cutoff_times = []
    
    for customer_id in customer_ids:
        ct, cancel_ct = make_customer_labels(customer_id, transactions)
        
        cutoff_times.append(ct)
        cancel_cutoff_times.append(cancel_ct)
    
    # Optionally return the data
    if not save_data:
        return cutoff_times, cancel_cutoff_times
    
    # By default save the data as a single dataframe
    cutoff_times = pd.concat(cutoff_times)
    cutoff_times.to_csv(f'/data/churn/partitions/p{partition}/cutoff_times.csv', index = False)

    cancel_cutoff_times = pd.concat(cancel_cutoff_times)
    cancel_cutoff_times.to_csv(f'/data/churn/partitions/p{partition}/cancel_cutoff_times.csv', index = False)
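
The loop above filters the full transactions dataframe once per customer. A possible alternative, sketched here on made-up data, is to let `groupby` split the dataframe a single time and hand each customer's rows to the labeling step. The `count_rows` function is a hypothetical stand-in for that step and the `toy` dataframe is illustrative only.

```python
import pandas as pd

def count_rows(customer_id, cust):
    # Hypothetical stand-in for the per-customer labeling step
    return pd.DataFrame({'msno': [customer_id],
                         'n_transactions': [len(cust)]})

# Made-up transactions for three customers
toy = pd.DataFrame({
    'msno': ['a', 'b', 'a', 'c', 'b', 'a'],
    'is_cancel': [0, 0, 1, 0, 0, 0],
})

# groupby yields (customer id, that customer's rows) pairs
results = [count_rows(customer_id, cust)
           for customer_id, cust in toy.groupby('msno')]

summary = pd.concat(results, ignore_index=True)
print(summary)
```

Whether this is noticeably faster depends on the number of customers per partition, so it is worth timing on a real partition before switching.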

In [None]:
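# Use the same partition that `trans` was read from so the row-count check below is meaningful
ct, cancel_ct = make_partition_labels(PARTITION, save_data = False)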

In [None]:
cutoff_times_partition = pd.concat(ct)
cutoff_times_partition.head()

cancel_cutoff_times_partition = pd.concat(cancel_ct)
cancel_cutoff_times_partition.head()

In [None]:
cutoff_times_partition.shape
cancel_cutoff_times_partition.shape

In [None]:
len(trans) == (len(cutoff_times_partition) + len(cancel_cutoff_times_partition))

# Use Dask to Parallelize Making Cutoff Times

In [None]:
partitions = list(range(len(os.listdir('/data/churn/partitions'))))
len(partitions)

In [None]:
import dask.bag as db
from dask.distributed import Client

# Use all cores
client = Client(processes = True)

In [None]:
client.ncores()

In [None]:
# Create a bag object
b = db.from_sequence(partitions, npartitions=len(partitions))

# Map partition making function
b = b.map(make_partition_labels)
    
b

In [None]:
%%capture
from timeit import default_timer as timer

start = timer()
b.compute()
end = timer()



In [None]:
print(f'{round(end - start)} seconds elapsed.')

# Conclusions

In this notebook, we generated labels and cutoff time dataframes for each partition. We can then use these cutoff times in a call to deep feature synthesis to make features for each label. 

The next notebook is `Feature Engineering`. 