### Clean Transaction file description:

|                                       transactions.csv                                        |
|-----------------------------------------------------------------------------------------------|
| msno                   | user id  (letters, digits and special characters)                    |
| payment_method_id      | payment method   (masked)                                            |
| payment_plan_days      | length of membership plan in days                                    |
| plan_list_price        | in New Taiwan Dollar (NTD)                                           |
| actual_amount_paid     | in New Taiwan Dollar (NTD)                                           |
| is_auto_renew          | true when the customer opted to renew the subscription automatically |
| transaction_date       | format %Y%m%d                                                        |
| membership_expire_date | format %Y%m%d                                                        |
| is_cancel              | whether or not the user canceled the membership in this transaction. |
| plan_duration          | interval (in days) into which payment_plan_days falls                |

### List of features:
- Fraction of number of days left in expiration month (s_frac_day)
- Number of days in or out of churning zone at the end of prediction month (s_churn_zone)
- Ratio of active cancellation to number of transactions (s_cancel_ratio)
- Ratio of auto renew to number of transactions (s_autorenew_ratio)
- Number of uninterrupted days of membership (s_uninterusers)
- Last Plan duration which is an added column during EDA (s_planduration)
- Last Payment ID (s_payID)

#### Necessary imports

In [1]:
import os
import pandas as pd
import numpy as np
from time import time
from datetime import timedelta

#### Retrieve eligible users for training

In [2]:
# eligible users are provided in a csv file
train_dir = os.path.join(os.pardir, 'data', 'raw', 'train.csv')
s_users = pd.read_csv(train_dir, dtype = {'is_churn' : bool})

In [3]:
current_users = s_users.msno.values
print('Number of eligible users for churn prediction', len(current_users))

In [5]:
# Just checking for duplicates. There are none.
s_users.msno.value_counts().count()

992931

#### Get cleaned transaction histories

In [8]:
transaction_dir = os.path.join(os.pardir, 'data', 'interim', 'transactions_clean.csv')
df_transac = pd.read_csv(transaction_dir, parse_dates=['transaction_date', 'membership_expire_date'])

In [9]:
# Total distinct users
Num_distinct_users = df_transac.msno.value_counts().count()
print('Number of distinct users in transaction history =', Num_distinct_users)

Number of distinct users in transaction history = 2328615


In [10]:
# only keep eligible users as provided by KKBOX
df_transac = df_transac[df_transac.msno.isin(current_users)]

In [11]:
# some eligible users were removed during data wrangling of transaction data!
# they were removed because I couldn't find a satisfying replacement for missing values
unique_users = df_transac.msno.value_counts()
print('Number of unique users:', unique_users.count())

Number of unique users: 972332


#### Recover discarded transactions (I didn't keep transactions for ~20k eligible users)

In [12]:
# load raw transaction data
transaction_origin_dir = os.path.join(os.pardir, 'data', 'raw', 'transactions.csv')
df_transac_origin = pd.read_csv(transaction_origin_dir, parse_dates=['transaction_date', 'membership_expire_date'])

In [13]:
# only keep eligible users
df_transac_origin = df_transac_origin[df_transac_origin.msno.isin(current_users)]

In [14]:
# now we have all transactions from eligible users
# df_transac_origin.msno.value_counts()

In [15]:
# keep discarded transactions not already present in cleaned data
df_transac_origin = df_transac_origin[~df_transac_origin.msno.isin(unique_users.index)]

The recovered raw transactions have no plan_duration column yet, so it is derived here.

In [16]:
crit_30 = (df_transac_origin.payment_plan_days >= 28) & (df_transac_origin.payment_plan_days <= 32)
df_transac_origin.loc[crit_30, 'payment_plan_days'] = 30

In [17]:
# create custom intervals (bin edges will be left inclusive) in increasing order
days_plan = [0, 8, 30, 90, 180, 365, 485 ]

# compile labels
days_plan_upperbounds = [d-1 for d in days_plan[1:-1] ]
days_plan_upperbounds.append(days_plan[-1])
days_plan_labels = [ "{} - {}".format(l,u) for l,u in zip(days_plan[:-1], days_plan_upperbounds) ]

print('Bin edges = {}'.format(days_plan))
print('Associated labels = {}'.format(days_plan_labels))

Bin edges = [0, 8, 30, 90, 180, 365, 485]
Associated labels = ['0 - 7', '8 - 29', '30 - 89', '90 - 179', '180 - 364', '365 - 485']


In [18]:
# create new column with plan duration category
df_transac_origin['plan_duration'] = pd.cut(df_transac_origin.payment_plan_days, days_plan, right=False, labels=days_plan_labels)
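
To see how the left-inclusive bins behave at their edges, here is a minimal sketch on made-up plan lengths (`toy_days` and `toy_duration` are hypothetical names, not part of the pipeline). It reuses the bin edges and labels printed above.

```python
import pandas as pd

# same bin edges and labels as in the cells above
days_plan = [0, 8, 30, 90, 180, 365, 485]
days_plan_labels = ['0 - 7', '8 - 29', '30 - 89', '90 - 179', '180 - 364', '365 - 485']

# hypothetical plan lengths sitting on or next to the bin edges
toy_days = pd.Series([7, 8, 29, 30, 364, 365, 485])

# right=False makes every bin half-open: [lower, upper)
toy_duration = pd.cut(toy_days, days_plan, right=False, labels=days_plan_labels)
print(toy_duration.tolist())
```

With `right=False` the last bin is also half-open, so a value of exactly 485 falls outside it and comes out as NaN, even though the label reads '365 - 485'. Whether any plan in the data is exactly 485 days long is not checked here.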

#### Concatenate previously discarded transactions to our main dataframe

In [19]:
# pd.concat may reorder columns, so save the column order before concatenating
col_order = df_transac.columns.tolist()

# do concatenation
df_transac = pd.concat([df_transac, df_transac_origin], ignore_index=True)

# re-assign column order
df_transac = df_transac[col_order]

In [20]:
# only keep transaction up to January 2017 (We have transactions up to February 2018 here)
# When we make predictions for month X, we only know transactions up to month X-1
df_transac = df_transac[df_transac.transaction_date < '2017-02-01']

In [21]:
# sort dataframe by msno, transaction date then expiration date
df_transac.sort_values(['msno', 'transaction_date', 'membership_expire_date'], inplace=True)

In [22]:
df_transac.tail()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,plan_duration
15397993,zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=,41,30.0,99,99,1,2016-09-04,2016-10-04,0,30 - 89
15397994,zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=,41,30.0,99,99,1,2016-10-04,2016-11-04,0,30 - 89
15397995,zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=,41,30.0,99,99,1,2016-11-04,2016-12-04,0,30 - 89
15397996,zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=,41,30.0,99,99,1,2016-12-04,2017-01-04,0,30 - 89
15397997,zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ=,41,30.0,99,99,1,2017-01-04,2017-02-04,0,30 - 89


In [23]:
# df_transac[df_transac.msno == '+12RZ32vDGh9930rExes758LIwn51H0ADIBtbQ1ysNc=']

# 1st feature
- Fraction of number of days left in expiration month.

The idea is that people may be more inclined to churn when their expiration date is closer to the end of the month.

Expiration dates from active cancellations are not taken into account. When people actively cancel, their expiration date changes to whatever date they want to stop. Most people stop the same day or the day after they choose to cancel.

In [24]:
# filter active cancellation and only keep last transaction of each user
df_frac_day = df_transac[df_transac.is_cancel == False].groupby('msno', sort=False).nth(-1)

# NOTE: dataframe is pre-sorted so we don't need to sort after groupby

In [25]:
# compute fraction
s_frac_day = (df_frac_day.membership_expire_date.dt.daysinmonth - df_frac_day.membership_expire_date.dt.day) / \
                            df_frac_day.membership_expire_date.dt.daysinmonth
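
As a quick check of the formula, here is a minimal sketch on three made-up expiration dates (`toy_expire` and `toy_frac` are hypothetical names). An expiration on the last day of the month gives 0, and an expiration on the first day gives a value close to 1.

```python
import pandas as pd

# hypothetical expiration dates: end, middle and start of a month
toy_expire = pd.Series(pd.to_datetime(['2017-02-28', '2017-02-15', '2017-01-01']))

# fraction of the month that remains after the expiration day
toy_frac = (toy_expire.dt.daysinmonth - toy_expire.dt.day) / toy_expire.dt.daysinmonth
print(toy_frac.round(6).tolist())
```

February 2017 has 28 days, so the three values are 0/28, 13/28 and 30/31.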

In [26]:
# total number of eligible users: 992931, Two reasons for not having all users:
# 1) There are 5 users who have only one transaction up to January 31st and it is an active cancellation
# 2) Also there are ~2,000 users who have no transactions at all prior to February 1st
s_frac_day.shape

(990827,)

In [27]:
def add_missing_users(s_in, default_val):
    """
    Given an input Series indexed by user id, add the users from s_users that it does not contain.
    s_users must be defined earlier. Unseen users are assigned default_val.
    """
    # find remaining users
    other_users = set(s_users.msno).difference(s_in.index)

    # create a series holding the default value for every remaining user
    s_others = pd.Series([default_val]*len(other_users), index=other_users)

    # concatenate with the input series
    s_complete = pd.concat([s_in, s_others])
    # check that the series is complete
    assert len(s_users) == len(s_complete)
    
    return s_complete
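
For comparison, the same filling step can be sketched with `Series.reindex`. This is an alternative, not what the notebook uses, and `toy_users`, `toy_feature` and `toy_complete` are hypothetical stand-ins for `s_users.msno` and a feature series.

```python
import pandas as pd

# hypothetical stand-ins: four eligible users, a feature known for two of them
toy_users = pd.Series(['a', 'b', 'c', 'd'], name='msno')
toy_feature = pd.Series([0.5, 0.25], index=['a', 'c'])

# reindex on the full user list; users without a value receive the default
toy_complete = toy_feature.reindex(toy_users.values, fill_value=0)
print(toy_complete)
```

Unlike `add_missing_users`, `reindex` returns the users in the order of `toy_users`, silently drops any index label that is not an eligible user, and requires the input index to be unique.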

In [28]:
# add remaining users
s_frac_day = add_missing_users(s_frac_day, 0)
assert len(s_frac_day) == len(s_users)

In [130]:
# give the series a name
s_frac_day.name = 'time_of_month'

# 2nd feature
- Number of days in or out of churning zone at the end of prediction month

Based on transactions up to January 31st, we know each user's latest membership expiration date. Here we derive the number of days between that expiration date and the end of February. The value is negative when the membership expires after the end of February.

In [29]:
# keep last expiration date for each user
df_churn_zone = df_transac.groupby('msno', sort=False).nth(-1)

In [30]:
# time difference with end of February
s_churn_zone = pd.Timestamp('2017-02-28') - df_churn_zone.membership_expire_date 

# convert time stamp to integer (number of days)
s_churn_zone = s_churn_zone.dt.days
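
The sign convention is easiest to see on a few made-up dates. This is a minimal sketch; `toy_expire` and `toy_zone` are hypothetical names.

```python
import pandas as pd

# hypothetical expiration dates: before, on and after the end of February
toy_expire = pd.Series(pd.to_datetime(['2017-02-15', '2017-02-28', '2017-03-19']))

# days elapsed between the expiration date and the end of the prediction month
toy_zone = (pd.Timestamp('2017-02-28') - toy_expire).dt.days
print(toy_zone.tolist())
```

A positive value means the membership has already expired by the end of February, and a negative value means it still has that many days left.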

In [31]:
# Again we need to include new users
s_churn_zone.shape

(990832,)

In [32]:
# Assign the best-case scenario to new users, which is 31 days until they enter the churning zone
s_churn_zone = add_missing_users(s_churn_zone, s_churn_zone.min())
assert len(s_churn_zone) == len(s_users)

In [129]:
# give the series a name
s_churn_zone.name = 'churn_zone_days'

# 3rd feature
- Ratio of active cancellation to number of transactions

The idea is that a long-time user may simply sign up for another plan after an active cancellation, whereas a recent member may just not enjoy this music streaming service and will churn.

In [33]:
# number of transactions per user
s_num_trans = df_transac.groupby('msno', sort=False).is_cancel.count()

In [34]:
# number of active cancellation per user
s_num_cancel = df_transac[df_transac.is_cancel == True].groupby('msno', sort=False).is_cancel.count()

In [35]:
# ratio of active cancellation to total number of transactions
s_cancel_ratio = s_num_cancel.divide(s_num_trans, fill_value=0)
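
The role of `fill_value=0` shows up for users who never cancelled: they are absent from the cancellation counts, and the fill turns the missing numerator into 0. Here is a minimal sketch on a made-up frame (`toy`, `num_trans`, `num_cancel`, `ratio` and `ratio_alt` are hypothetical names).

```python
import pandas as pd

# hypothetical history: user 'a' cancelled once in four transactions, 'b' never cancelled
toy = pd.DataFrame({'msno': ['a', 'a', 'a', 'a', 'b', 'b'],
                    'is_cancel': [0, 1, 0, 0, 0, 0]})

num_trans = toy.groupby('msno', sort=False).is_cancel.count()
num_cancel = toy[toy.is_cancel == True].groupby('msno', sort=False).is_cancel.count()

# 'b' is missing from num_cancel; fill_value=0 makes its ratio 0 instead of NaN
ratio = num_cancel.divide(num_trans, fill_value=0)

# for a 0/1 column the group mean gives the same ratio in a single step
ratio_alt = toy.groupby('msno').is_cancel.mean()
print(ratio.to_dict(), ratio_alt.to_dict())
```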

In [36]:
# add unseen users
s_cancel_ratio = add_missing_users(s_cancel_ratio, 0)
assert len(s_cancel_ratio) == len(s_users)

In [128]:
# give the series a name
s_cancel_ratio.name = 'cancel_ratio'

# 4th feature
- Ratio of auto renew to number of transactions

In [37]:
# number of auto-renew transactions per user
s_num_autorenew = df_transac[df_transac.is_auto_renew == True].groupby('msno', sort=False).is_auto_renew.count()

In [38]:
# ratio of auto-renew transactions to total number of transactions
s_autorenew_ratio = s_num_autorenew.divide(s_num_trans, fill_value=0)

In [39]:
# add unseen users
s_autorenew_ratio = add_missing_users(s_autorenew_ratio, 0)
assert len(s_autorenew_ratio) == len(s_users)

In [127]:
# give the series a name
s_autorenew_ratio.name = 'auto_renew_ratio'

# 5th feature
- Number of uninterrupted days of membership

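A membership counts as interrupted when the user actively cancels (even if the user signs up again afterwards) or when a gap appears between two plans. For a transaction to count as continuous, its transaction date must fall on or before the previous expiration date, and its expiration date must not be earlier than the previous one.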

In [40]:
df_uninter = df_transac[['msno', 'transaction_date', 'membership_expire_date']].copy()

In [41]:
# shift the expiration date forward by one row within each user
df_uninter['prev_exp_date'] = df_uninter.groupby('msno', sort=False).membership_expire_date.shift(periods=1)

# NOTE: it is faster to use shift and make the difference rather than doing it at once like this:
# df_transac['delta_exp_date'] = df_transac.groupby('msno', sort=False).membership_expire_date.diff(periods=1)

In [42]:
# remove user's first transaction (no prior data for first transactions!)
df_uninter = df_uninter[df_uninter.prev_exp_date.notnull()]

In [43]:
# compute number of days between successive expiration dates
df_uninter['delta_exp_date'] = (df_uninter.membership_expire_date - \
                                df_uninter.prev_exp_date).astype('timedelta64[D]').astype('int64')

# NOTE: use .astype('timedelta64[D]') as opposed to .dt.days which is much slower
# (recent pandas versions reject the 'D' unit in astype, so .dt.days is needed there)
# # SIDE NOTE: we could have done it this way:
# # compute difference in membership expiration date (dataframe must be pre-sorted by users and date)
# df_uninter['delta_exp_date'] = df_uninter.membership_expire_date.diff(periods=1)
# # remove user's overlap
# df_uninter = df_uninter[df_uninter.msno == df_uninter.msno.shift(periods=1)]
# similar time complexity

In [44]:
# determine overlap between transaction date and prior expiration date
df_uninter['membership_overlap'] = (df_uninter.prev_exp_date - \
                                df_uninter.transaction_date).astype('timedelta64[D]').astype('int64')

In [45]:
# label uninterrupted membership transactions
# delta_exp_date catches active cancellation
# membership_overlap catches gap in membership
df_uninter['is_continuous'] = (df_uninter.delta_exp_date >= 0) & (df_uninter.membership_overlap >= 0)

# convert bool to integer for subsequent shift
# df_uninter.is_continuous = df_uninter.is_continuous.astype('int64')
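
The two conditions can be traced on a single made-up user who renews on time, then cancels, then comes back after a gap. This is a minimal sketch: the frame `toy` is hypothetical, and `.dt.days` is used here for the day counts instead of the `astype` conversion from the cells above.

```python
import pandas as pd

# hypothetical history of one user: on-time renewal, active cancellation, late return
toy = pd.DataFrame({
    'msno': ['a'] * 4,
    'transaction_date': pd.to_datetime(
        ['2016-01-01', '2016-02-01', '2016-02-10', '2016-06-01']),
    'membership_expire_date': pd.to_datetime(
        ['2016-02-01', '2016-03-01', '2016-02-10', '2016-07-01']),
})

# previous expiration date within each user, then drop the first transaction
toy['prev_exp_date'] = toy.groupby('msno', sort=False).membership_expire_date.shift(periods=1)
toy = toy[toy.prev_exp_date.notnull()].copy()

# negative delta: expiration moved backwards (active cancellation)
toy['delta_exp_date'] = (toy.membership_expire_date - toy.prev_exp_date).dt.days
# negative overlap: transaction happened after the previous expiration (gap)
toy['membership_overlap'] = (toy.prev_exp_date - toy.transaction_date).dt.days

toy['is_continuous'] = (toy.delta_exp_date >= 0) & (toy.membership_overlap >= 0)
print(toy[['delta_exp_date', 'membership_overlap', 'is_continuous']])
```

Only the on-time renewal is flagged as continuous. The cancellation fails the first condition and the late return fails the second.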

#### Process continuous membership first.

In [46]:
# some users have been continuously renewing their membership on time
# this behavior can be detected easily by comparing sum() to count()
df_contusers = df_uninter.groupby('msno', sort=False).is_continuous.agg(['count', 'sum'])

In [47]:
# retrieve loyal members
continuous_users = df_contusers[df_contusers['count'] == df_contusers['sum']].index
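
The count-versus-sum comparison works because `sum()` on a boolean column counts the True values. Here is a minimal sketch on made-up flags (`toy`, `stats` and `loyal` are hypothetical names).

```python
import pandas as pd

# hypothetical flags: every transaction of 'a' is continuous, 'b' has one interruption
toy = pd.DataFrame({'msno': ['a', 'a', 'b', 'b', 'b'],
                    'is_continuous': [True, True, True, False, True]})

# count = number of transactions, sum = number of continuous transactions
stats = toy.groupby('msno', sort=False).is_continuous.agg(['count', 'sum'])

# a user is fully continuous when both numbers agree
loyal = stats[stats['count'] == stats['sum']].index
print(list(loyal))
```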

In [48]:
# keep first and last transaction of each user
df_firstlast = df_uninter.loc[df_uninter.msno.isin(continuous_users), ['msno', 'membership_expire_date', 'prev_exp_date']]
df_firstlast = df_firstlast.groupby('msno', sort=False).nth([0, -1])

In [49]:
# we need to make the difference between the earliest previous expiration date and the latest expiration date
# do a shift on prev_exp_date and make the difference with membership_expire_date
df_firstlast['uninterrupted_days'] = \
(df_firstlast.membership_expire_date - df_firstlast.groupby('msno', sort=False).prev_exp_date.shift()).astype('timedelta64[D]')

In [74]:
# df_firstlast.head()

In [51]:
# remove NaN rows as they are no longer useful
s_contdays = df_firstlast.loc[df_firstlast.uninterrupted_days.notnull(),'uninterrupted_days'].astype('int64')

#### Take care of non-continuous transaction history

In [52]:
# process users with non-continuous transactions
df_noncontusers = df_uninter.loc[~df_uninter.msno.isin(continuous_users), :]

In [53]:
df_noncontusers.msno.value_counts().count()

458800

In [54]:
df_noncontusers.shape

(7247363, 7)

In [55]:
# sort by most recent transaction date
df_noncontusers = df_noncontusers.sort_values(['msno', 'transaction_date', 'membership_expire_date'], ascending=False)

In [58]:
def derive_uninterrupted_days(rows):
    """
    derive number of uninterrupted days for one user
    rows : dataframe of one user sorted by transaction date, most recent first
    """
    # flag for change of state
    cont_up = False
    
    # variable to compute date range
    stop_date = 0
    start_date = 0

    for l in rows.itertuples(index=False):

        # use current expiration date as stop date when it is an active cancellation
        # and as long as we didn't detect a continuous transaction
        if not cont_up and l.delta_exp_date < 0:
            stop_date = l.membership_expire_date
            
        # look for first continuous transaction
        if l.is_continuous:
            # continuous transaction found
            cont_up = True
            # set stop date if it wasn't set already
            if stop_date == 0:
                stop_date = l.membership_expire_date
            # update start date
            start_date = l.prev_exp_date
            
        # condition not fulfilled
        elif cont_up:
            break
            
    # fall back to zero days when no continuous transaction was found
    if start_date == 0 or stop_date == 0:
        membership_loyalty = pd.Timedelta(days=0)
    else:
        membership_loyalty = stop_date - start_date
    
    # return number of days elapsed
    return membership_loyalty
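
To illustrate what the loop returns, here is a minimal sketch that runs the function on one made-up user whose history contains a gap. The function body is repeated in condensed form so the example runs on its own, and `toy_rows` and `toy_days` are hypothetical names.

```python
import pandas as pd

def derive_uninterrupted_days(rows):
    # condensed copy of the function above, repeated so this example is standalone
    cont_up = False
    stop_date = 0
    start_date = 0
    for l in rows.itertuples(index=False):
        if not cont_up and l.delta_exp_date < 0:
            stop_date = l.membership_expire_date
        if l.is_continuous:
            cont_up = True
            if stop_date == 0:
                stop_date = l.membership_expire_date
            start_date = l.prev_exp_date
        elif cont_up:
            break
    if start_date == 0 or stop_date == 0:
        return pd.Timedelta(days=0)
    return stop_date - start_date

# hypothetical rows of one user, most recent transaction first;
# the third row follows a gap in membership, so it is not continuous
toy_rows = pd.DataFrame({
    'prev_exp_date': pd.to_datetime(
        ['2016-08-01', '2016-07-01', '2016-03-01', '2016-02-01']),
    'membership_expire_date': pd.to_datetime(
        ['2016-09-01', '2016-08-01', '2016-07-01', '2016-03-01']),
    'is_continuous': [True, True, False, True],
})
toy_rows['delta_exp_date'] = (toy_rows.membership_expire_date - toy_rows.prev_exp_date).dt.days

toy_days = derive_uninterrupted_days(toy_rows)
print(toy_days)
```

The loop walks back from the most recent transaction and stops at the first interruption, so only the latest streak (2016-07-01 to 2016-09-01) is counted and the older continuous transaction before the gap is ignored.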

In [59]:
# compute number of uninterrupted days for each user
start_time = time()
s_noncontusers = df_noncontusers.groupby('msno', sort=False).apply(derive_uninterrupted_days)
print("\n--- %s seconds elapsed ---" % (timedelta(seconds = time() - start_time)))


--- 0:08:03.091467 seconds elapsed ---


In [86]:
# convert timedelta to integer
s_noncontusers = s_noncontusers.astype('timedelta64[D]').astype('int64')

In [95]:
# combine continuous and non-continuous cases
s_uninterusers = pd.concat([s_contdays, s_noncontusers])

In [99]:
# add unseen users and assign them 0 days of uninterrupted membership
s_uninterusers = add_missing_users(s_uninterusers, 0)
assert len(s_uninterusers) == len(s_users)

In [126]:
# give the series a name
s_uninterusers.name = 'uninterrupted_days'

# 6th feature
- Last Plan duration (added column during EDA)

In [102]:
# last plan duration (active cancellations were filtered out as plan_duration may be missing)
s_planduration = df_frac_day.plan_duration

In [111]:
# add unseen users and assign them the most popular plan duration
s_planduration = add_missing_users(s_planduration, s_planduration.value_counts().index[0])
assert len(s_planduration) == len(s_users)

In [125]:
# give the series a name
s_planduration.name = 'plan_duration'

# 7th feature
- Last Payment ID

In [114]:
# last payment ID (taken from the last transaction that is not an active cancellation)
s_payID = df_frac_day.payment_method_id

In [115]:
# add unseen users and assign them the most popular payment method
s_payID = add_missing_users(s_payID, s_payID.value_counts().index[0])
assert len(s_payID) == len(s_users)

In [120]:
# give the series a name
s_payID.name = 'pay_id'

# Combine all features into one dataframe

In [131]:
df_transfeatures = pd.concat([s_frac_day, s_churn_zone, s_cancel_ratio,\
           s_autorenew_ratio, s_uninterusers, s_planduration, s_payID], axis = 1)
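
Because each feature series lists the users in its own order (the users appended by `add_missing_users` come from a set), the concatenation has to match rows by user id rather than by position. `pd.concat` with `axis=1` aligns on the index, as this minimal sketch on two made-up series shows (`s_a`, `s_b` and `toy_features` are hypothetical names).

```python
import pandas as pd

# hypothetical feature series listing the same two users in different orders
s_a = pd.Series([0.5, 0.0], index=['u1', 'u2'], name='time_of_month')
s_b = pd.Series([13, -19], index=['u2', 'u1'], name='churn_zone_days')

# axis=1 places the series side by side, matching rows on the index labels
toy_features = pd.concat([s_a, s_b], axis=1)
print(toy_features)
```

Each series name becomes a column name, which is why the series were named before this step.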

In [132]:
df_transfeatures.head()

Unnamed: 0,time_of_month,churn_zone_days,cancel_ratio,auto_renew_ratio,uninterrupted_days,plan_duration,pay_id
+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,0.464286,13,0.0,1.0,62,30 - 89,41
+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,0.387097,-19,0.0,1.0,181,30 - 89,39
+++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc=,0.071429,2,0.0,1.0,731,30 - 89,41
++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk=,0.464286,13,0.0,1.0,306,30 - 89,41
++/UDNo9DLrxT8QVGiDi1OnWfczAdEwThaVyD0fXO50=,0.258065,-23,0.0,1.0,181,30 - 89,39


In [133]:
# save dataframe to file (pickle)
trans_proc_dir = os.path.join(os.pardir, 'data', 'processed', 'transactions_February2017.p34')
df_transfeatures.to_pickle(trans_proc_dir)

# Draft

In [63]:
# # Delay-substract the whole column
# df_noncontusers['is_contineous_diff'] = df_noncontusers.is_contineous.diff()

# # correct for propagation from one user to another
# df_noncontusers.loc[df_noncontusers.msno != df_noncontusers.msno.shift(periods=1), 'is_contineous_diff'] = np.nan

In [None]:
# # assign rank to is_contineous_diff
# df_noncontusers['is_contineous_rank'] = df_noncontusers.groupby('msno', sort=False).is_contineous_diff.rank(method='first')