# Introduction: Partition Labeling

In this notebook, we will work with a single partition to develop a labeling function. The end goal is code that can take a partition on disk and generate labels for a cutoff time dataframe. 

## Definition of Churn

In this notebook, a customer churns when they do not renew a membership within a certain number of days after the membership expires. The number of days is left as an adjustable parameter.
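As a minimal sketch of this definition, the check below compares the gap between a membership's expiration and the next membership's start against the churn period. The function name `is_churn` and its arguments are hypothetical and used only for illustration.

```python
import pandas as pd

def is_churn(expire_date, next_start_date, churn_days=30):
    """Hypothetical helper: True if the gap before the next membership exceeds churn_days."""
    gap = (next_start_date - expire_date).days
    return gap > churn_days

expire = pd.Timestamp('2017-02-13')
next_start = pd.Timestamp('2017-03-13')

# The same 28-day gap is a churn under a 14-day period but not under a 30-day period
churn_14 = is_churn(expire, next_start, churn_days=14)
churn_30 = is_churn(expire, next_start, churn_days=30)
```

Leaving `churn_days` as a parameter means the same transactions can be relabeled under a different churn period without changing any other code.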

### Prediction Problem

Given the definition of churn and the available transactions data, there are a number of prediction problems we can pose. For example, we can choose to make predictions at different points in time. In this notebook, we will look at making predictions for two different sets of cutoff times:

* At the start of every month
* On the first and fifteenth day of every month

We can also vary the number of days required for a customer to be considered a churn. Finally, we can predict the churn itself (a binary yes or no) or the number of days until the customer churns (a regression problem). We could even segment the number of days until a churn into multiple groups, say 1-7 days, 8-14 days, 15-21 days, and longer, and then make this a multiclass problem.
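To illustrate the multiclass framing, here is a small sketch that bins a days-until-churn value into the groups mentioned above using `pd.cut`. The sample values and the class names are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up days-until-churn values
days_to_churn = pd.Series([3, 7, 8, 14, 15, 21, 22, 60])

# Right-inclusive bins: (0, 7], (7, 14], (14, 21], (21, inf)
bins = [0, 7, 14, 21, np.inf]
names = ['1-7', '8-14', '15-21', '22+']

churn_class = pd.cut(days_to_churn, bins=bins, labels=names)
print(churn_class.tolist())
```

Because the regression target is saved alongside the binary label, this kind of binning can be applied later without recomputing the labels.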

To leave the prediction problem open, in this notebook, we'll find the labels - churn or not - as well as the number of days until the next churn for two different scenarios:

1. Making a prediction on the first of the month with the churn period set at 30 days.
2. Making a prediction on the first and fifteenth of the month with the churn period set at 14 days. 

Once we calculate and save these labels for the two different time frames, we can change the exact prediction problem from classification to regression. Moreover, the same features can be used for both classification and regression because they will be calculated based on the cutoff time. 

In [1]:
import pandas as pd 
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Each partition has the 5 data files provided by the competition, containing only the customers in that partition. Customers were assigned to partitions by hashing the customer id to an integer and taking the remainder of that integer divided by the number of partitions.
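The assignment scheme can be sketched as follows. The notebook does not say which hash function was used, so the use of MD5 here is an assumption, and `id_to_partition` is a hypothetical name.

```python
import hashlib

N_PARTITIONS = 1000

def id_to_partition(customer_id, n_partitions=N_PARTITIONS):
    """Hypothetical helper: map a customer id to a partition number."""
    # Assumption: MD5 is used only as a stable hash, not for security
    digest = hashlib.md5(customer_id.encode('utf-8')).hexdigest()
    # Interpret the hex digest as an integer and take the remainder
    return int(digest, 16) % n_partitions

partition = id_to_partition('/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=')
```

A stable hash keeps every record for a given customer in the same partition, so each partition can be labeled independently.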

In [2]:
PARTITION = '100'
base_dir = '/data/churn/partitions/'
directory = base_dir + 'p' + PARTITION

import os
os.listdir(directory)

['logs.csv',
 'members.csv',
 'train.csv',
 'cancel_cutoff_times.csv',
 'month_labels_30.csv',
 'monthly_labels_30.csv',
 'test.csv',
 'month_labels.csv',
 'bimonthly_labels_14.csv',
 'transactions.csv',
 'cutoff_times.csv']

In [3]:
all_partitions = os.listdir('/data/churn/partitions/')
len(all_partitions)

1000

### Read in Data Files

The only data we need for making the labels is the transactions, although the cell below also reads in the members and logs.

In [4]:
members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], infer_datetime_format = True)
trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)
logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])

In [5]:
trans.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,G7TmHc9Gg2t8ovG/KFaB53We/0CQPELhZ5UUN2Ol3AQ=,39,30,149,149,1,2015-09-30,2015-11-13,0
1,LPbp8N7VRuqEISEVim8ppTaeYJG/rWS/t4g/dEFuWjw=,34,30,149,149,1,2016-02-29,2016-03-31,0
2,xvYqULBWzJvN8heyFtY3hbY3egyQNbXuDx0igtsoi00=,29,30,180,180,1,2017-01-31,2017-03-01,0
3,UR4iin4mAkajoa7o+AyTTmz5k3N2GR3/rZY8a4KwADI=,41,30,99,99,1,2017-01-31,2017-02-28,0
4,ax8CRhY8BMRA/ZvT1wI+2N/EdPXiSPGxa9y7bntA1Uc=,40,30,149,149,1,2016-05-04,2016-06-08,0


## Remove Anomalies

There are a number of records in the transactions where the `membership_expire_date` is before the `transaction_date`. We will remove these to prevent them from creating issues with our labeling.

In [6]:
trans = trans.loc[trans['membership_expire_date'] >= trans['transaction_date']]

Some of the transactions occur on the same day for the same customer. In these cases, we will assume that the transaction with the later `membership_expire_date` happened second. My guess is that these same-day transactions represent a customer signing up for one plan, then receiving a new offer (one they consider a better deal) on the same day and switching to a different plan.

In [7]:
trans = trans.sort_values(['msno', 'transaction_date', 'membership_expire_date'])

In [8]:
# View value counts of customers
trans.groupby('msno').count().sort_values('is_cancel').tail()

msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,39,39,39,39,39,39,39,39
CNVTd5cNoA6WRxMf7VveVwYyrOoDgFs5xmm1122qQNc=,39,39,39,39,39,39,39,39
kA7JMc6Q6nAFw1fgcs0hejCw4xXaGUzB/eq+M5n21wQ=,40,40,40,40,40,40,40,40
mFsXs71TCuJOnHKWQQ271BdecYkrXgPIRU7VUfmtqAY=,42,42,42,42,42,42,42,42
/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,51,51,51,51,51,51,51,51


We'll work with the data from one customer at first. 

In [9]:
cust = trans.loc[trans['msno'] == '/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM='].copy()
cust.iloc[:, 1:].tail(6)

Unnamed: 0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
12519,41,30,129,129,1,2016-11-08,2016-12-08,0
7776,41,30,129,129,1,2016-11-14,2016-11-14,1
1504,41,30,99,99,1,2016-11-14,2016-12-13,0
5462,41,30,99,99,1,2016-12-13,2017-01-13,0
10717,41,30,99,99,1,2017-01-13,2017-02-13,0
21899,41,30,99,99,1,2017-03-13,2017-04-13,0


## Find Churns

To find a churn, we'll need to find the number of days from the end of one membership to the start of the next. We'll iterate through each transaction for a customer (using `itertuples`), find the `membership_expire_date`, and calculate the number of days until the next transaction that is not a cancellation. This gap represents the number of days the customer was not an active member. 

Once we have the gaps, we can compare them to any definition of churn (for example, 14 days) to determine which transactions represent churns and when each churn occurs.
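One possible vectorized way to compute these gaps while skipping cancellations is sketched below. The toy transactions are made up, and this is only an alternative illustration, not the code used in the rest of the notebook.

```python
import pandas as pd

# Made-up transactions for one customer, already sorted by date
toy = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2017-01-01', '2017-01-25', '2017-02-20', '2017-04-15']),
    'membership_expire_date': pd.to_datetime(['2017-01-31', '2017-01-25', '2017-03-20', '2017-05-15']),
    'is_cancel': [0, 1, 0, 0],
})

# Keep only the dates of non-cancellation transactions (cancellations become NaT),
# look one row ahead, then back-fill so each row sees the next non-cancellation date
next_start = toy['transaction_date'].where(toy['is_cancel'] == 0).shift(-1).bfill()

# Days between the end of this membership and the start of the next one
toy['gap'] = (next_start - toy['membership_expire_date']).dt.days
print(toy[['membership_expire_date', 'is_cancel', 'gap']])
```

The final row has no later transaction, so its gap stays missing, which mirrors the "unknown" case for the last entry.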

First we'll want to make sure to sort the dataframe as described previously.

In [10]:
cust = cust.sort_values(['transaction_date', 'membership_expire_date']).reset_index(drop = True)

We will iterate through the transactions using `itertuples`. This is much faster than `iterrows` because it does not package each row as a series. 
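The difference between the two iterators can be seen on a toy dataframe (the data below is made up). With `itertuples`, each row is a namedtuple whose first field is the index, so fields can be read by name or by position.

```python
import pandas as pd

toy = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2017-01-13', '2017-03-13']),
    'is_cancel': [0, 1],
})

# itertuples yields lightweight namedtuples: position 0 is the index,
# and the remaining fields follow the column order
tuple_rows = [(row.Index, row.transaction_date, row[2]) for row in toy.itertuples()]

# iterrows builds a full pandas Series for every row
series_types = [type(row).__name__ for _, row in toy.iterrows()]
```

Positional access such as `row[2]` depends on the column order, so access by name is less fragile when columns may be added or reordered.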

In [11]:
for tup in cust.itertuples():
    break
tup

Pandas(Index=0, msno='/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=', payment_method_id=41, payment_plan_days=30, plan_list_price=149, actual_amount_paid=149, is_auto_renew=1, transaction_date=Timestamp('2015-01-13 00:00:00'), membership_expire_date=Timestamp('2016-12-15 00:00:00'), is_cancel=0)

In [22]:
cust['gap'] = (cust['transaction_date'].shift(-1) - cust['membership_expire_date']).dt.days 
cust.iloc[:, 1:]

Unnamed: 0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,gap
0,41,30,149,149,1,2015-01-13,2016-12-15,0,-694.0
1,41,30,149,119,1,2015-01-21,2017-01-15,0,-720.0
2,41,30,149,149,1,2015-01-26,2017-02-15,0,-733.0
3,41,30,149,149,1,2015-02-13,2017-03-15,0,-753.0
4,41,30,149,119,1,2015-02-21,2017-04-12,0,-776.0
5,41,30,149,149,1,2015-02-26,2017-05-10,0,-789.0
6,41,30,149,149,1,2015-03-13,2017-06-10,0,-812.0
7,41,30,149,119,1,2015-03-21,2017-07-11,0,-838.0
8,41,30,149,149,1,2015-03-26,2017-08-11,0,-851.0
9,41,30,149,149,1,2015-04-13,2017-09-10,0,-873.0


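The code below finds the gaps between all membership ends in the data. For each transaction, we record the number of days from the `membership_expire_date` to the next transaction that is not a cancellation.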

In [None]:
gaps = []

# Iterate through each entry
for tup in cust.itertuples():
    # Find the index and expiration date
    # (position 0 is the index, position 8 is membership_expire_date)
    i = tup[0]
    expire_date = tup[8]
    # For last entry, the gap is unknown
    if i == len(cust) - 1:
        gaps.append(np.nan)
    # Find the gap between membership renewals
    else:
        j = 1
        next_trans = cust.loc[i + j, :].copy()
        last_cancelled = False
        # Find the next transaction that is not a cancellation
        while next_trans['is_cancel'] == 1:
            # Handle the case where the last transaction is a cancellation
            if i + j == len(cust) - 1:
                last_cancelled = True
                gaps.append(np.nan)
                break
            # Otherwise keep searching for next transaction that is not a cancellation
            else:
                j += 1
                next_trans = cust.loc[i + j, :].copy()

        # Find start of next membership
        next_start_date = next_trans['transaction_date']

        # Calculate the gap between end of membership and start of next
        if not last_cancelled:
            gaps.append((next_start_date - expire_date).days)

cust['gap'] = gaps

In [None]:
cust['gap'] = gaps

This customer does not have any churns if we define a churn as a period of 30 days without renewing. However, they would have one churn if we used a shorter period, for example two weeks.

The next step is to find out when the churn actually occurs. We'll use two weeks as the churn period so we actually have one churn.

In [None]:
# Days without membership to be considered a churn
periods = 14

# Determine whether a churn occurred
cust['churn'] = cust['gap'] > periods
cust['potential_churn_date'] = cust['membership_expire_date'] + pd.Timedelta(periods, unit = 'd')

# If customer did churn set the churn date
cust.loc[cust['churn'] == 1, 'churn_date'] = cust.loc[cust['churn'] == 1, 'potential_churn_date']
cust.head()
cust.tail()

# Make Labels

The actual labels are going to occur either on a monthly or two-week basis. For the labels, at the `cutoff_time`, we'll ask two questions: does the customer churn in the next [two weeks / month] and how long until the customer churns.
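Both questions can be sketched with a forward `pd.merge_asof`, which matches each cutoff time to the first churn date on or after it. The dates below are made up, and this is only one possible way to express the calculation.

```python
import pandas as pd

# Made-up monthly cutoff times and churn dates for one customer
labels = pd.DataFrame({'cutoff_time': pd.to_datetime(
    ['2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01', '2016-09-01'])})
labels['next_cutoff_time'] = labels['cutoff_time'].shift(-1)
churns = pd.DataFrame({'churn_date': pd.to_datetime(['2016-06-28', '2016-08-28'])})

# For every cutoff time, find the first churn date on or after it
labels = pd.merge_asof(labels, churns, left_on='cutoff_time',
                       right_on='churn_date', direction='forward')

# How long until the customer churns?
labels['days_to_next_churn'] = (labels['churn_date'] - labels['cutoff_time']).dt.days

# Does the customer churn before the next cutoff time?
labels['churn'] = (labels['churn_date'] < labels['next_cutoff_time']).astype(int)
print(labels[['cutoff_time', 'churn', 'days_to_next_churn']])
```

Cutoff times after the last churn get a missing `days_to_next_churn`, and the final cutoff has no following window, so its `churn` value is 0.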

## Month Labels

We'll start by making labels for each month that the customer was a member.

In [None]:
first_trans = cust['transaction_date'].min()
last_trans = cust['membership_expire_date'].max()
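# pd.datetime is no longer available in recent pandas releases, so use pd.Timestamp
start_date = pd.Timestamp(year = first_trans.year, month = first_trans.month, day = 1)
end_date = pd.Timestamp(year = last_trans.year, month = last_trans.month, day = 1)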

# Create a range of months
date_range = pd.date_range(start_date, end_date, freq = 'MS')
date_range[:5]
date_range[-5:]

In [None]:
labels = pd.DataFrame({'cutoff_time': date_range})
labels['next_cutoff_time'] = labels['cutoff_time'].shift(-1)

previous_churn = None

# Iterate through the churn dates
for churn_date in cust.loc[cust['churn_date'].notnull(), 'churn_date']:
    print(churn_date)
    # Assign the label 1 if the customer churned during the cutoff_time period
    labels.loc[(labels['cutoff_time'] <= churn_date) & (labels['next_cutoff_time'] > churn_date), 'churn'] = 1
    
    # If there was a previous churn
    if previous_churn is not None:
        # Subset to cutoff times after the previous churn but before the current churn
        # Calculate the days until the churn
        labels.loc[(labels['cutoff_time'] > previous_churn) & 
                   (labels['cutoff_time'] <= churn_date), 
                   'days_to_next_churn'] = (churn_date - labels.loc[(labels['cutoff_time'] > previous_churn) & 
                                                                   (labels['cutoff_time'] <= churn_date), 'cutoff_time']).dt.days
    # No previous churn
    else:
        # Subset to cutoff times before the current churn and calculate days until the churn
        labels.loc[labels['cutoff_time'] <= churn_date, 
                   'days_to_next_churn'] = (churn_date - labels.loc[labels['cutoff_time'] <= churn_date,
                                                                     'cutoff_time']).dt.days
    previous_churn = churn_date

In [None]:
labels[labels['churn'] == 1]

Those are the month labels. Let's write a function that calculates the month labels for one customer.

In [23]:
def generate_labels(customer_id, trans, label_type, churn_period = 30, return_cust = False):
    """Make labels for one customer for one period
    Params
    --------
        customer_id (str): string used to select customer
        trans (dataframe): transactions for customers
        label_type (str): either 'MS' for monthly labels at the start of the month or 
                          'SMS' for twice a month labels (on 1 and 15 of month)
        churn_period (int): number of days without membership required for a churn [default 30 days]
        return_cust (bool): whether or not to return the customer dataframe. Useful for debugging
        
    Return
    --------
        labels (dataframe): labels for all months in customer history
                            columns are ['msno', 'cutoff_time', 'churn', 'days_to_next_churn']
        cust (dataframe): if return_cust == True, a dataframe of the customers transactions
    """
    assert label_type in ['MS', 'SMS'], 'label_type must be either "MS" or "SMS"'
    
    cust = trans.loc[trans['msno'] == customer_id].copy()
    
    # Make sure to sort transactions and drop the index
    cust = cust.sort_values(['transaction_date', 'membership_expire_date']).reset_index(drop = True)
    
    # Find gap between membership expiration and start of next membership
    cust['gap'] = (cust['transaction_date'].shift(-1) - cust['membership_expire_date']).dt.days 
    
    # Determine whether a churn occurred
    cust.loc[cust['gap'] > churn_period, 'churn']  = 1
    cust.loc[cust['gap'] <= churn_period, 'churn'] = 0
    
    # Calculate date range for labels
    first_trans = cust['transaction_date'].min()
    last_trans = cust['membership_expire_date'].max()
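    # pd.datetime is no longer available in recent pandas releases, so use pd.Timestamp
    start_date = pd.Timestamp(year = first_trans.year, month = first_trans.month, day = 1)
    
    # Handle case where last transaction month was December
    if last_trans.month == 12:
        end_date = pd.Timestamp(year = last_trans.year + 1, month = 1, day = 1)
    else:
        end_date = pd.Timestamp(year = last_trans.year, month = last_trans.month + 1, day = 1)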

    # Create a range of dates for labels 
    # 'MS' = month start, 'SMS' = semi-month start (1st and 15th of each month)
    date_range = pd.date_range(start_date, end_date, freq = label_type)
    
    # Create a label dataframe
    labels = pd.DataFrame({'cutoff_time': date_range})
    labels['next_cutoff_time'] = labels['cutoff_time'].shift(-1)
    labels['msno'] = customer_id
    
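    # Handle case where there are no churns
    if not np.any(cust['churn'] == 1):
        labels['churn'] = 0
        labels['days_to_next_churn'] = np.nan
        labels = labels[['msno', 'cutoff_time', 'churn', 'days_to_next_churn']]
        # Honor return_cust here as well so callers can always unpack two values
        if return_cust:
            return cust, labels
        return labels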
    
    # If customer did churn set the churn date
    cust['potential_churn_date'] = cust['membership_expire_date'] + pd.Timedelta(churn_period, unit = 'd')
    cust.loc[cust['churn'] == 1, 'churn_date'] = cust.loc[cust['churn'] == 1, 'potential_churn_date']
    
    previous_churn = None

    # Iterate through the churn dates
    for churn_date in cust.loc[cust['churn_date'].notnull(), 'churn_date']:
        
        # Assign the label 1 if the customer churned during the cutoff_time period
        labels.loc[(labels['cutoff_time'] <= churn_date) & (labels['next_cutoff_time'] > churn_date), 'churn'] = 1

        if previous_churn is not None:
            # Subset to cutoff times after the previous churn but before the current churn
            # Calculate the days until the churn
            labels.loc[(labels['cutoff_time'] > previous_churn) & 
                       (labels['cutoff_time'] <= churn_date), 
                       'days_to_next_churn'] = (churn_date - labels.loc[(labels['cutoff_time'] > previous_churn) & 
                                                                       (labels['cutoff_time'] <= churn_date), 
                                                                        'cutoff_time']).dt.days
        # No previous churn
        else:
            # Subset to cutoff times before the current churn and calculate days until the churn
            labels.loc[labels['cutoff_time'] <= churn_date, 
                       'days_to_next_churn'] = (churn_date - labels.loc[labels['cutoff_time'] <= churn_date,
                                                                         'cutoff_time']).dt.days
        previous_churn = churn_date
    
    labels['churn'] = labels['churn'].fillna(0)
    
    # Sometimes want to return customer information for debugging
    if return_cust:
        return cust, labels[['msno', 'cutoff_time', 'churn', 'days_to_next_churn']]
    
    # Subset to relevant columns
    return labels[['msno', 'cutoff_time', 'churn', 'days_to_next_churn']]
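The two `label_type` values are pandas frequency aliases passed straight to `pd.date_range`. A quick sketch of what each one produces over an arbitrary three-month window:

```python
import pandas as pd

# 'MS': first day of every month
monthly = pd.date_range('2017-01-01', '2017-03-01', freq='MS')

# 'SMS': first and fifteenth day of every month
twice_monthly = pd.date_range('2017-01-01', '2017-03-01', freq='SMS')

monthly_days = [d.strftime('%Y-%m-%d') for d in monthly]
twice_monthly_days = [d.strftime('%Y-%m-%d') for d in twice_monthly]
print(monthly_days)
print(twice_monthly_days)
```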

In [24]:
cutoff_times = generate_labels('/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=', trans, 
                               label_type = 'MS', churn_period = 14)
cutoff_times[cutoff_times['churn'] == 1].head()

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
25,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,2017-02-01,1.0,26.0


In [25]:
cutoff_times = generate_labels('/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=',  trans,
                               label_type = 'MS', churn_period = 30)
cutoff_times[cutoff_times['churn'] == 1].head()

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn


In [27]:
cutoff_times = generate_labels('/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=',  trans,
                               label_type = 'SMS', churn_period = 14)
cutoff_times[cutoff_times['churn'] == 1].head()

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
51,/7/KMLZlMBnmWtb9NNkm3bYMQHWrt0C1BChb62EiQLM=,2017-02-15,1.0,12.0


In [29]:
cutoff_times = generate_labels('5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=', trans,
                               label_type = 'MS', churn_period = 30)
cutoff_times.head()

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
0,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-01-01,0,
1,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-02-01,0,
2,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-03-01,0,
3,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-04-01,0,
4,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-05-01,0,


In [32]:
cutoff_times = generate_labels('5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=', trans,
                               label_type = 'SMS', churn_period = 14)
cutoff_times.head()

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
0,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-01-01,0,
1,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-01-15,0,
2,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-02-01,0,
3,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-02-15,0,
4,5fPXqLcScoC93rH/gCPK+5Soj+XdNMXX9S3LhV5dJjM=,2015-03-01,0,


In [33]:
cutoff_list = []

# Iterate through every customer
for customer_id in trans['msno'].unique():
    cutoff_list.append(generate_labels(customer_id, trans, label_type = 'MS', churn_period = 30))
    
cutoff_times = pd.concat(cutoff_list)
cutoff_times.head()
cutoff_times.tail()

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
0,+/6nRSzfF+CIynhnBM5xz8J6ArlSdLY74gsNK09dbes=,2015-06-01,0.0,
1,+/6nRSzfF+CIynhnBM5xz8J6ArlSdLY74gsNK09dbes=,2015-07-01,0.0,
0,+1TKL6EWVDuKFAOvWZOsoGTILy2POMnxxvUgP7PPCy8=,2016-11-01,0.0,
1,+1TKL6EWVDuKFAOvWZOsoGTILy2POMnxxvUgP7PPCy8=,2016-12-01,0.0,
0,+58aOzMPOZSi0END5IUKzK009k/iGY9mB+k9s5qetAI=,2016-11-01,0.0,


Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
24,zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=,2017-01-01,0.0,
25,zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=,2017-02-01,0.0,
26,zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=,2017-03-01,0.0,
27,zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=,2017-04-01,0.0,
28,zzm2UvJnzuTRkXaiaZHtbJwPG9jZQZkZxG0n4PYDTvw=,2017-05-01,0.0,


In [34]:
cust, c = generate_labels('0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=', trans, 
                          label_type = 'MS', churn_period = 30, return_cust = True)

In [35]:
c[c['days_to_next_churn'] < 30]

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
17,0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=,2016-06-01,1.0,27.0
19,0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=,2016-08-01,1.0,27.0
24,0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=,2017-01-01,1.0,6.0


In [36]:
cust[cust['gap'] > 30]

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,gap,churn,potential_churn_date,churn_date
14,0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=,36,30,150,150,0,2016-03-30,2016-05-29,0,31.0,1.0,2016-06-28,2016-06-28
15,0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=,38,30,149,149,0,2016-06-29,2016-07-29,0,102.0,1.0,2016-08-28,2016-08-28
16,0NRMdOljNJEsUC6WtOVzjqdSAdIIdZ1G0Ye6pIlms5U=,36,30,180,180,0,2016-11-08,2016-12-08,0,94.0,1.0,2017-01-07,2017-01-07


In [37]:
cust, c = generate_labels('+6UN6VJD8u9vZm4lZRAREpzBRM4YoeOSWhEX0c5JBAU=', trans, 
                          label_type = 'MS', churn_period = 30, return_cust = True)

In [38]:
c[c['days_to_next_churn'] < 30]

Unnamed: 0,msno,cutoff_time,churn,days_to_next_churn
2,+6UN6VJD8u9vZm4lZRAREpzBRM4YoeOSWhEX0c5JBAU=,2016-10-01,1.0,24.0


In [39]:
cust[cust['gap'] > 30]

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,gap,churn,potential_churn_date,churn_date
0,+6UN6VJD8u9vZm4lZRAREpzBRM4YoeOSWhEX0c5JBAU=,36,30,180,180,0,2016-08-26,2016-09-25,0,93.0,1.0,2016-10-25,2016-10-25


# All Labels for Partition 

The next step is to write a function that can make the month labels for a partition. It will accept a partition number and generate the partition labels. These will then be saved to the partition directory.

In [40]:
partitions = list(range(len(os.listdir('/data/churn/partitions/'))))
partitions[-1]

999

In [41]:
def partition_to_labels(partition, label_type, churn_period):
    """Make labels for all customers in one partition
    Either for one month or twice a month
    
    Params
    --------
        partition (int): number of partition
        label_type (str): either 'monthly' for monthly labels or
                          'bimonthly' for twice a month labels
        churn_period (int): number of days required without a membership for a churn
    
    Returns
    --------
        None: saves the label dataframes with the appropriate name to the partition directory
    """
    
    # Read in data and filter anomalies
    trans = pd.read_csv(f'{base_dir}p{partition}/transactions.csv',
                        parse_dates=['transaction_date', 'membership_expire_date'], 
                        infer_datetime_format = True)
    trans = trans.loc[trans['membership_expire_date'] >= trans['transaction_date']]
    
    cutoff_list = []

    if label_type == 'monthly':
        # Iterate through every customer
        for customer_id in trans['msno'].unique():
            cutoff_list.append(generate_labels(customer_id, trans, label_type = 'MS', churn_period = churn_period))
        cutoff_times = pd.concat(cutoff_list)
        cutoff_times.to_csv(f'{base_dir}p{partition}/monthly_labels_{churn_period}.csv', index = False)
        
    
    elif label_type == 'bimonthly':
        for customer_id in trans['msno'].unique():
            cutoff_list.append(generate_labels(customer_id, trans, label_type = 'SMS', churn_period = churn_period))
        cutoff_times = pd.concat(cutoff_list)
        cutoff_times.to_csv(f'{base_dir}p{partition}/bimonthly_labels_{churn_period}.csv', index = False)

In [42]:
# label_type and churn_period are required arguments
partition_to_labels(1, label_type = 'monthly', churn_period = 30)

In [None]:
partition_to_labels(50, label_type = 'bimonthly', churn_period = 14)

# Two Week Labels

Next we'll work on making labels in two-week increments. This time, the question is not whether the customer churned during the subsequent month, but whether the customer churned in the two weeks following the `cutoff_time`.

In [None]:
import dask.bag as db
from dask.distributed import Client

# Use all cores
client = Client(processes = True)

In [None]:
client.ncores()

In [None]:
# Create a bag object
b = db.from_sequence(partitions, npartitions=len(partitions))

# Map partition making function
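# (extra keyword arguments are passed through to the function)
b = b.map(partition_to_labels, label_type = 'monthly', churn_period = 30)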
    
b

In [None]:
%%capture
from timeit import default_timer as timer

start = timer()
b.compute()
end = timer()



In [None]:
print(f'{round(end - start)} seconds elapsed.')

# Conclusions

In this notebook, we generated labels and cutoff time dataframes for each partition. We can then use these cutoff times in a call to deep feature synthesis to make features for each label. 

The next notebook is `Feature Engineering`. 

In [None]:
pd.date_range('2010-01-01', '2012-01-01', freq = '2W-MON')