### Clean Transaction file description:

| transactions.csv       | Description                                                          |
|------------------------|----------------------------------------------------------------------|
| msno                   | user id (letters, digits and special characters)                     |
| payment_method_id      | payment method (masked)                                              |
| payment_plan_days      | length of membership plan in days                                    |
| plan_list_price        | in New Taiwan Dollar (NTD)                                           |
| actual_amount_paid     | in New Taiwan Dollar (NTD)                                           |
| is_auto_renew          | true when the customer opted to renew the subscription automatically |
| transaction_date       | format %Y%m%d                                                        |
| membership_expire_date | format %Y%m%d                                                        |
| is_cancel              | whether or not the user canceled the membership in this transaction. |
| plan_duration          | intervals in days that fit payment_plan_days                         |

### List of features:
- Fraction of number of days left in expiration month (s_frac_day)
- Ratio of active cancellation to number of transactions (s_cancel_ratio)
- Ratio of auto renew to number of transactions (s_autorenew_ratio)
- Number of uninterrupted days of membership (s_uninterusers)
- Last Plan duration which is an added column during EDA (s_planduration)
- Is Last Payment ID number 41 or not (s_payID)

#### Necessary imports

In [1]:
import os
import pandas as pd
import numpy as np
from time import time
from datetime import timedelta

#### Retrieve eligible users for training

In [2]:
# eligible users are provided in a csv file
train_dir = os.path.join(os.pardir, 'data', 'processed', 'train.csv')
s_users = pd.read_csv(train_dir, usecols = ['msno'])

In [3]:
current_users = s_users.msno.values
print('Number of eligible users for churn prediction', len(current_users))

Number of eligible users for churn prediction 987814


In [4]:
# Just checking for duplicates. There are none.
s_users.msno.value_counts().count()

987814
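
The same duplicate check can be written more directly. The sketch below uses a toy frame (`s_users_demo` and its ids are hypothetical stand-ins for `s_users`) and is only meant to show the two pandas attributes involved.

```python
import pandas as pd

# toy stand-in for s_users (hypothetical ids)
s_users_demo = pd.DataFrame({'msno': ['a', 'b', 'c']})

# number of distinct ids, the same quantity as value_counts().count()
n_distinct = s_users_demo.msno.nunique()

# True when no id appears more than once
no_duplicates = s_users_demo.msno.is_unique

print(n_distinct, no_duplicates)
```

If `n_distinct` equals the number of rows, or `is_unique` is true, there are no duplicated users.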

#### Get cleaned transaction histories

In [5]:
transaction_dir = os.path.join(os.pardir, 'data', 'interim', 'transactions_clean.csv')
df_transac = pd.read_csv(transaction_dir, parse_dates=['transaction_date', 'membership_expire_date'])

In [6]:
# Total distinct users
Num_distinct_users = df_transac.msno.value_counts().count()
print('Number of distinct users in transaction history = {:,}'.format(Num_distinct_users))

Number of distinct users in transaction history = 2,363,626


In [7]:
# only keep eligible users as provided by KKBOX
df_transac = df_transac[df_transac.msno.isin(current_users)]

In [8]:
# unique eligible users
unique_users = df_transac.msno.value_counts()
print('Number of unique users = {:,}'.format(unique_users.count()))

Number of unique users = 987,814


In [9]:
# only keep transactions up to January 2017 (we have transactions up to February 2017 here)
# When we make predictions for month X, we only know transactions up to month X-1
df_transac = df_transac[df_transac.transaction_date < '2017-02-01']
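
As a minimal sketch of the "month X versus month X-1" rule, the cutoff can be derived from the prediction month instead of being hard-coded. The names `prediction_month`, `cutoff` and `demo` are illustrative and not part of the notebook.

```python
import pandas as pd

# prediction month X (assumed here to be February 2017, as in this notebook)
prediction_month = pd.Period('2017-02', freq='M')

# first instant of month X: every transaction strictly before it belongs to month X-1 or earlier
cutoff = prediction_month.start_time

# toy transactions (hypothetical dates)
demo = pd.DataFrame({'transaction_date': pd.to_datetime(['2017-01-31', '2017-02-01', '2017-02-15'])})

# only the January transaction is known at prediction time
known = demo[demo.transaction_date < cutoff]
print(known)
```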

In [10]:
# sort dataframe by msno, transaction date then expiration date
df_transac.sort_values(['msno', 'transaction_date', 'membership_expire_date'], inplace=True)

In [11]:
df_transac.tail()

|          | msno | payment_plan_days | plan_list_price | actual_amount_paid | is_auto_renew | transaction_date | membership_expire_date | is_cancel | is_payid_41 | plan_duration |
|----------|------|-------------------|-----------------|--------------------|---------------|------------------|------------------------|-----------|-------------|---------------|
| 21547738 | zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ= | 30.0 | 99.0 | 99.0 | True | 2016-09-04 | 2016-10-04 | False | True | 30 - 89 |
| 21547739 | zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ= | 30.0 | 99.0 | 99.0 | True | 2016-10-04 | 2016-11-04 | False | True | 30 - 89 |
| 21547740 | zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ= | 30.0 | 99.0 | 99.0 | True | 2016-11-04 | 2016-12-04 | False | True | 30 - 89 |
| 21547741 | zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ= | 30.0 | 99.0 | 99.0 | True | 2016-12-04 | 2017-01-04 | False | True | 30 - 89 |
| 21547742 | zzzN9thH22os1dRS0VHReY/8FTfGHOi86//d+wGGFsQ= | 30.0 | 99.0 | 99.0 | True | 2017-01-04 | 2017-02-04 | False | True | 30 - 89 |


# Fraction of number of days left in expiration month.

The idea is that people may be more inclined to churn when their expiration date is closer to the end of the month.

Expiration dates coming from active cancellations are not taken into account. When people actively cancel, their expiration date changes to whatever date they want to stop. Most people stop the same day or the day after they choose to cancel.

In [12]:
# filter active cancellation and only keep last transaction of each user
df_frac_day = df_transac[df_transac.is_cancel == False].groupby('msno', sort=False).nth(-1)

# NOTE: dataframe is pre-sorted so we don't need to sort after groupby, it saves time
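
Whether `groupby(...).nth(-1)` returns a frame indexed by `msno` or keeps the original row index depends on the pandas version, and the later steps rely on an `msno` index. The sketch below is one possible alternative that sets the index explicitly. It runs on a toy frame (`demo` and its values are hypothetical) and assumes, like the cell above, that rows are already sorted by user and date.

```python
import pandas as pd

# toy, pre-sorted transaction history (hypothetical values)
demo = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b'],
    'is_cancel': [False, True, False, False],
    'membership_expire_date': pd.to_datetime(['2017-01-10', '2017-01-05', '2016-12-20', '2017-01-20']),
})

# drop active cancellations, keep the last remaining row of each user, index by user id
last_rows = (demo[~demo.is_cancel]
             .drop_duplicates('msno', keep='last')
             .set_index('msno'))

print(last_rows)
```

User `a` keeps the 2017-01-10 row because the later row is an active cancellation.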

In [13]:
# compute fraction
s_frac_day = (df_frac_day.membership_expire_date.dt.daysinmonth - df_frac_day.membership_expire_date.dt.day) / \
                            df_frac_day.membership_expire_date.dt.daysinmonth
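
To make the formula concrete, here is a small worked example on made-up expiration dates (the series `exp` is hypothetical). An expiration on the last day of the month gives 0, and one on the first day gives a value close to 1.

```python
import pandas as pd

# hypothetical expiration dates
exp = pd.Series(pd.to_datetime(['2017-02-15', '2017-01-31', '2017-03-01']))

days_in_month = exp.dt.daysinmonth

# (days in month - day of month) / days in month
frac = (days_in_month - exp.dt.day) / days_in_month

print(frac)
```

The three values are 13/28, 0/31 and 30/31 respectively.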

In [14]:
s_frac_day.shape

(987812,)

Total number of eligible users: 987,814. <br>
Two users are missing here: their only transaction up to January 31st is an active cancellation.

In [23]:
def add_missing_users(s_in, default_val):
    """
    Given an input Series indexed by user id, add the users from the prediction Series s_users
    that are missing from it. s_users must be defined earlier. Missing users are assigned default_val.
    """
    # find remaining users (converted to a list because a set cannot be used as an index)
    other_users = list(set(s_users.msno).difference(s_in.index))

    # create a series holding default_val for every missing user
    s_others = pd.Series([default_val]*len(other_users), index=other_users)

    # concatenate with the input series
    s_complete = pd.concat([s_in, s_others])
    # check that the series is complete
    assert len(s_users) == len(s_complete)
    
    return s_complete

In [24]:
# add remaining users
s_frac_day = add_missing_users(s_frac_day, s_frac_day.mean())
assert len(s_frac_day) == len(s_users)
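
For comparison, a similar fill can be sketched with `Series.reindex`. This is an alternative, not the notebook's method, and the names `all_users` and `s_feature` are hypothetical. Unlike the concatenation above, it also reorders the result to follow the user list.

```python
import pandas as pd

# stand-in for s_users.msno
all_users = pd.Series(['a', 'b', 'c'], name='msno')

# feature computed for only two of the three users
s_feature = pd.Series([0.2, 0.6], index=['a', 'c'])

# missing users receive the mean of the observed values
s_complete = s_feature.reindex(all_users.to_numpy(), fill_value=s_feature.mean())

print(s_complete)
```

Note that `reindex` would silently drop any user of `s_feature` that is absent from `all_users`, so the length assertion used in the notebook remains a useful safeguard.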

In [25]:
# give the series a name
s_frac_day.name = 'time_of_month'

# Ratio of active cancellation to number of transactions

The idea is that a long-time user may simply sign up for another plan after an active cancellation, whereas a recent member may just not enjoy this music streaming service and will churn.

In [26]:
# number of transactions per user
s_num_trans = df_transac.groupby('msno', sort=False).is_cancel.count()

In [27]:
# number of active cancellation per user
s_num_cancel = df_transac[df_transac.is_cancel == True].groupby('msno', sort=False).is_cancel.count()

In [28]:
# ratio of active cancellation to total number of transactions
s_cancel_ratio = s_num_cancel.divide(s_num_trans, fill_value=0)
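
The role of `fill_value=0` is easier to see on a toy example (the frame `demo` is hypothetical). Users without any cancellation are absent from the cancellation counts, and `fill_value=0` treats them as having zero cancellations instead of producing a missing ratio. The mean of the boolean column per user gives the same result and is shown as a cross-check.

```python
import pandas as pd

# toy history: user 'a' cancelled once out of three transactions, user 'b' never cancelled
demo = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b', 'b'],
    'is_cancel': [False, True, False, False, False],
})

n_trans = demo.groupby('msno', sort=False).is_cancel.count()
# user 'b' is absent from this series
n_cancel = demo[demo.is_cancel].groupby('msno', sort=False).is_cancel.count()

# 'b' is filled with 0 before the division, so its ratio is 0 rather than NaN
ratio = n_cancel.divide(n_trans, fill_value=0)

# cross-check: the mean of a boolean column is the fraction of True values
ratio_alt = demo.groupby('msno', sort=False).is_cancel.mean()

print(ratio)
```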

In [29]:
s_cancel_ratio.shape

(987814,)

In [30]:
# # no missing users for this feature
# #add unseen users 
# s_cancel_ratio = add_missing_users(s_cancel_ratio, 0)
# assert len(s_cancel_ratio) == len(s_users)

In [31]:
# give the series a name
s_cancel_ratio.name = 'cancel_ratio'

# Ratio of auto renew to number of transactions

In [32]:
# number of auto-renew transactions per user
s_num_autorenew = df_transac[df_transac.is_auto_renew == True].groupby('msno', sort=False).is_auto_renew.count()

In [33]:
# ratio of auto-renew transactions to total number of transactions
s_autorenew_ratio = s_num_autorenew.divide(s_num_trans, fill_value=0)

In [34]:
s_autorenew_ratio.shape

(987814,)

In [35]:
# # no missing users for this feature
# # add unseen users
# s_autorenew_ratio = add_missing_users(s_autorenew_ratio, 0)
# assert len(s_autorenew_ratio) == len(s_users)

In [36]:
# give the series a name
s_autorenew_ratio.name = 'auto_renew_ratio'

# Number of uninterrupted days of membership

A membership is interrupted either by an active cancellation (even if the user signs up again afterwards) or by a gap in coverage. For a transaction to count as continuous, its transaction date must not come after the previous expiration date.

In [37]:
df_uninter = df_transac[['msno', 'transaction_date', 'membership_expire_date']].copy()

In [38]:
# forward transaction by one index for each user
df_uninter['prev_exp_date'] = df_uninter.groupby('msno', sort=False).membership_expire_date.shift(periods=1)

# NOTE: it is faster to use shift and make the difference rather than doing it in one line like this:
# df_transac['delta_exp_date'] = df_transac.groupby('msno', sort=False).membership_expire_date.diff(periods=1)
# operation on groupby() are expensive and diff() takes a long time compared to shift.

A user's first transaction has no prior data. We can use its transaction date as an estimate of the previous expiration date.<br>
Thus, first transactions are assumed to be continuous. This helps for users who have only one transaction.

In [39]:
# use transaction date as previous expiration for first transaction
crit_first_transaction = df_uninter.prev_exp_date.isnull()
df_uninter.loc[crit_first_transaction, 'prev_exp_date'] = df_uninter.loc[crit_first_transaction, 'transaction_date']

In [40]:
# compute number of days between successive expiration dates
df_uninter['delta_exp_date'] = (df_uninter.membership_expire_date - df_uninter.prev_exp_date) \
                                .astype('timedelta64[D]').astype('int64')

# NOTE1: .astype('timedelta64[D]') is used here instead of .dt.days, which was much slower

# NOTE2: we could have done it in two steps:
# compute difference in membership expiration date (dataframe must be pre-sorted by users and date)
# df_uninter['delta_exp_date'] = df_uninter.membership_expire_date.diff(periods=1)
# remove user's overlap
# df_uninter = df_uninter[df_uninter.msno == df_uninter.msno.shift(periods=1)]
# similar time complexity because diff() operates on a Dataframe (i.e. not groupby object)
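
The `'timedelta64[D]'` cast depends on the pandas version, and recent releases may reject day resolution. As a hedged sketch, two conversions that do not rely on that cast are shown below on hypothetical dates. Their relative speed on the full dataset has not been measured here.

```python
import pandas as pd

# hypothetical expiration dates and previous expiration dates
exp = pd.Series(pd.to_datetime(['2016-02-01', '2016-04-10']))
prev = pd.Series(pd.to_datetime(['2016-01-01', '2016-04-15']))

delta = exp - prev

# option 1: the days component of each timedelta
days_a = delta.dt.days

# option 2: floor division by a one-day Timedelta
days_b = delta // pd.Timedelta(days=1)

print(days_a.tolist(), days_b.tolist())
```

Both options agree for whole-day differences such as the ones in this notebook, including negative ones.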

In [41]:
# determine overlap between transaction date and prior expiration date
df_uninter['membership_overlap'] = (df_uninter.prev_exp_date - df_uninter.transaction_date) \
                                .astype('timedelta64[D]').astype('int64')

In [42]:
# label transactions as continuous if there is no interruption in membership from the previous one.
# delta_exp_date catches active cancellation and membership_overlap catches gap in membership
df_uninter['is_continuous'] = (df_uninter.delta_exp_date >= 0) & (df_uninter.membership_overlap >= 0)

# convert bool to integer for subsequent shift (optional)
# df_uninter.is_continuous = df_uninter.is_continuous.astype('int64')
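
The steps above can be traced on a single made-up user (all dates are hypothetical). The history contains an on-time renewal, a renewal after a 14-day gap, and an active cancellation that moves the expiration date backwards. The sketch uses `.dt.days` for the day conversion, which is an implementation choice of this example.

```python
import pandas as pd

df = pd.DataFrame({
    'msno': ['u1'] * 4,
    'transaction_date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-15', '2016-04-10']),
    'membership_expire_date': pd.to_datetime(['2016-02-01', '2016-03-01', '2016-04-15', '2016-04-10']),
})

# previous expiration date of the same user
df['prev_exp_date'] = df.groupby('msno', sort=False).membership_expire_date.shift(1)
# a first transaction has no predecessor: fall back to its own transaction date
df['prev_exp_date'] = df.prev_exp_date.fillna(df.transaction_date)

# negative when the expiration date moves backwards (active cancellation)
delta_exp_date = (df.membership_expire_date - df.prev_exp_date).dt.days
# negative when the user paid after the previous plan had expired (gap)
membership_overlap = (df.prev_exp_date - df.transaction_date).dt.days

df['is_continuous'] = (delta_exp_date >= 0) & (membership_overlap >= 0)
print(df[['transaction_date', 'membership_expire_date', 'is_continuous']])
```

The first two rows are continuous, the third fails on the gap (overlap of -14 days) and the fourth fails on the cancellation (expiration moved back by 5 days).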

#### Process fully continuous membership first. (no lapse or cancellation in membership)

In [43]:
# some users have been continuously renewing their membership on time
# this behavior can be detected easily by comparing sum() to count() on is_continuous
df_contusers = df_uninter.groupby('msno', sort=False).is_continuous.agg(['count', 'sum'])

In [44]:
# retrieve loyal members
continuous_users = df_contusers[df_contusers['count'] == df_contusers['sum']].index

In [45]:
# keep first and last transaction of each user
df_firstlast = df_uninter.loc[df_uninter.msno.isin(continuous_users), ['msno', 'membership_expire_date', 'prev_exp_date']]
df_firstlast = df_firstlast.groupby('msno', sort=False).nth([0, -1])

In [46]:
# we need to make the difference between the earliest previous expiration date and the latest expiration date
# do a shift on prev_exp_date and make the difference with membership_expire_date
df_firstlast['uninterrupted_days'] = (df_firstlast.membership_expire_date -
                                     df_firstlast.groupby('msno', sort=False).prev_exp_date.shift()) \
                                     .astype('timedelta64[D]')

In [47]:
# remove NaN rows as it is not useful anymore
s_contdays = df_firstlast.loc[df_firstlast.uninterrupted_days.notnull(),'uninterrupted_days'].astype('int64')
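
For fully continuous users, the quantity of interest is the latest expiration date minus the earliest previous expiration date. A minimal alternative sketch with named aggregation is shown below on hypothetical data. It is not the notebook's method, and one difference is that it also yields a value for users with a single transaction.

```python
import pandas as pd

# toy continuous histories: u1 has two transactions, u2 has one
demo = pd.DataFrame({
    'msno': ['u1', 'u1', 'u2'],
    'prev_exp_date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-05-01']),
    'membership_expire_date': pd.to_datetime(['2016-02-01', '2016-03-01', '2016-05-31']),
})

# earliest previous expiration and latest expiration per user (rows are pre-sorted by date)
bounds = demo.groupby('msno', sort=False).agg(
    start=('prev_exp_date', 'first'),
    stop=('membership_expire_date', 'last'),
)

uninterrupted_days = (bounds['stop'] - bounds['start']).dt.days
print(uninterrupted_days)
```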

#### Take care of non-continuous transaction history
In this case, the history contains at least one active cancellation or a membership that expired on its own.

In [48]:
# process users with non-continuous transactions
df_noncontusers = df_uninter.loc[~df_uninter.msno.isin(continuous_users), :]

In [49]:
df_noncontusers.msno.value_counts().count()

505587

In [50]:
# sort by most recent transaction date
df_noncontusers = df_noncontusers.sort_values(['msno', 'transaction_date', 'membership_expire_date'], ascending=False)

In [51]:
def derive_uninterrupted_days(rows):
    """
    Derive the most recent number of uninterrupted days for one user.
    rows : dataframe of one user sorted by transaction date, most recent first
    """
    # flag set once a continuous transaction has been found
    cont_up = False
    
    # bounds of the date range (0 means "not set yet")
    stop_date = 0
    start_date = 0

    for row in rows.itertuples(index=False):

        # use current expiration date as stop date when it is an active cancellation
        # and as long as we didn't detect a continuous transaction
        if not cont_up and row.delta_exp_date < 0:
            stop_date = row.membership_expire_date
            
        # look for first continuous transaction
        if row.is_continuous:
            # continuous transaction found
            cont_up = True
            # set stop date if it wasn't set already
            if stop_date == 0:
                stop_date = row.membership_expire_date
            # update start date
            start_date = row.prev_exp_date
            
        # the continuous run has ended
        elif cont_up:
            break
            
    # return a zero Timedelta when no continuous period was found
    if start_date == 0 or stop_date == 0:
        membership_loyalty = pd.Timedelta(days=0)
    else:
        membership_loyalty = stop_date - start_date
    
    # return the elapsed time as a Timedelta
    return membership_loyalty
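
To illustrate what the function returns, the sketch below runs a condensed copy of it on one made-up user (the frame `demo` is hypothetical). The user let the first plan lapse, then renewed on time twice. Rows are ordered most recent first, as in `df_noncontusers`.

```python
import pandas as pd

def derive_uninterrupted_days(rows):
    """Condensed copy of the notebook function; rows are sorted most recent first."""
    cont_up = False
    stop_date = None
    start_date = None
    for row in rows.itertuples(index=False):
        # an active cancellation seen before any continuous row fixes the stop date
        if not cont_up and row.delta_exp_date < 0:
            stop_date = row.membership_expire_date
        if row.is_continuous:
            cont_up = True
            if stop_date is None:
                stop_date = row.membership_expire_date
            start_date = row.prev_exp_date
        elif cont_up:
            # the most recent continuous run has ended
            break
    if start_date is None or stop_date is None:
        return pd.Timedelta(days=0)
    return stop_date - start_date

# hypothetical user, most recent transaction first
demo = pd.DataFrame({
    'membership_expire_date': pd.to_datetime(['2016-06-01', '2016-05-01', '2016-04-01', '2016-02-01']),
    'prev_exp_date': pd.to_datetime(['2016-05-01', '2016-04-01', '2016-02-01', '2016-01-01']),
    'delta_exp_date': [31, 30, 60, 31],
    'is_continuous': [True, True, False, True],
})

result = derive_uninterrupted_days(demo)
print(result)
```

The loop stops at the third row, which is not continuous, so only the two most recent renewals count: from 2016-04-01 to 2016-06-01.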

In [52]:
# compute number of uninterrupted days for each user
start_time = time()

s_noncontusers = df_noncontusers.groupby('msno', sort=False).apply(derive_uninterrupted_days)

print("\n--- %s seconds elapsed ---" % (timedelta(seconds = time() - start_time)))


--- 0:08:47.931366 seconds elapsed ---


In [53]:
# convert timedelta to integer
s_noncontusers = s_noncontusers.astype('timedelta64[D]').astype('int64')

In [54]:
# combine continuous and non-continuous cases
s_uninterusers = pd.concat([s_contdays, s_noncontusers])

In [55]:
s_uninterusers.shape

(952235,)

In [56]:
# add users without this feature and assign them 0 days of uninterrupted membership
s_uninterusers = add_missing_users(s_uninterusers, 0)
assert len(s_uninterusers) == len(s_users)

In [57]:
# give the series a name
s_uninterusers.name = 'uninterrupted_days'

# Last Plan duration (added column during EDA)

In [58]:
# last plan duration (active cancellation were filtered out as plan_duration may be missing)
s_planduration = df_frac_day.plan_duration

In [59]:
s_planduration.shape

(987812,)

In [60]:
# add unseen users and assign them the most popular plan duration
s_planduration = add_missing_users(s_planduration, s_planduration.value_counts().index[0])
assert len(s_planduration) == len(s_users)

In [61]:
# give the series a name
s_planduration.name = 'plan_duration'

# Last Payment ID

In [62]:
# last payment ID flag (taken from df_frac_day, where active cancellations were filtered out)
s_payID = df_frac_day.is_payid_41

In [63]:
s_payID.shape

(987812,)

In [64]:
# add unseen users and assign them payment ID 41
s_payID = add_missing_users(s_payID, True)
assert len(s_payID) == len(s_users)

In [65]:
# give the series a name
s_payID.name = 'pay_id'

# Combine all features into one dataframe

In [66]:
df_transfeatures = pd.concat([s_frac_day,
                              s_cancel_ratio,
                              s_autorenew_ratio,
                              s_uninterusers,
                              s_planduration,
                              s_payID], axis = 1)
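
Column-wise concatenation aligns the series on their index, so the features do not need to list users in the same order. A tiny sketch with two hypothetical series:

```python
import pandas as pd

s_a = pd.Series([0.5, 0.1], index=['u1', 'u2'], name='time_of_month')
# same users, listed in a different order
s_b = pd.Series([30, 91], index=['u2', 'u1'], name='uninterrupted_days')

# rows are matched on the user id, column names come from the series names
demo = pd.concat([s_a, s_b], axis=1)
print(demo)
```

This is also why each series is given a name before this step: the names become the column headers.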

In [67]:
df_transfeatures.head()

|      | time_of_month | cancel_ratio | auto_renew_ratio | uninterrupted_days | plan_duration | pay_id |
|------|---------------|--------------|------------------|--------------------|---------------|--------|
| +++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o= | 0.464286 | 0.0 | 1.0 | 91 | 30 - 89 | True |
| +++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= | 0.387097 | 0.0 | 1.0 | 181 | 30 - 89 | False |
| +++snpr7pmobhLKUgSHTv/mpkqgBT0tQJ0zQj6qKrqc= | 0.071429 | 0.0 | 1.0 | 762 | 30 - 89 | True |
| ++/9R3sX37CjxbY/AaGvbwr3QkwElKBCtSvVzhCBDOk= | 0.464286 | 0.0 | 1.0 | 337 | 30 - 89 | True |
| ++/UDNo9DLrxT8QVGiDi1OnWfczAdEwThaVyD0fXO50= | 0.258065 | 0.0 | 1.0 | 181 | 30 - 89 | False |


In [68]:
# save dataframe to file (pickle)
trans_proc_dir = os.path.join(os.pardir, 'data', 'processed', 'transactions_February2017.p34')
df_transfeatures.to_pickle(trans_proc_dir)

# Draft

In [None]:
# # Delay-subtract the whole column
# df_noncontusers['is_continuous_diff'] = df_noncontusers.is_continuous.diff()

# # correct for propagation from one user to another
# df_noncontusers.loc[df_noncontusers.msno != df_noncontusers.msno.shift(periods=1), 'is_continuous_diff'] = np.nan

In [None]:
# # assign rank to is_continuous_diff
# df_noncontusers['is_continuous_rank'] = df_noncontusers.groupby('msno', sort=False).is_continuous_diff.rank(method='first')

# 2nd feature
- Number of days in or out of churning zone at the end of prediction month

Based on transactions up to January 31st, we know each user's membership expiration date. Here we derive how many days elapse between the membership expiration date and the end of February. Some users have an expiration date past the end of February, in which case the value is negative.

In [None]:
# keep last expiration date for each user
df_churn_zone = df_transac.groupby('msno', sort=False).nth(-1)

In [None]:
# time difference with end of February
s_churn_zone = pd.Timestamp('2017-02-28') - df_churn_zone.membership_expire_date 

# convert time stamp to integer (number of days)
s_churn_zone = s_churn_zone.astype('timedelta64[D]').astype('int64')
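
A short worked example of this difference, on hypothetical expiration dates, shows the sign convention. The sketch uses `.dt.days` for the conversion to integers.

```python
import pandas as pd

# hypothetical last expiration dates
exp = pd.Series(pd.to_datetime(['2017-02-04', '2017-03-15', '2016-12-01']))

# days elapsed between expiration and the end of the prediction month
churn_zone_days = (pd.Timestamp('2017-02-28') - exp).dt.days

print(churn_zone_days.tolist())
```

The second user expires after February 28th and therefore gets a negative value, while the third expired well before January.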

In [None]:
# Again we need to include new users
s_churn_zone.shape

In [None]:
# users who have churned as of January 31st (no need to add them!)
churned_users = s_churn_zone[s_churn_zone > 58].index
len(churned_users)

In [None]:
# people who are signing up during prediction month (February)
# we have no information on them!
other_users = set(s_users.msno).difference(s_churn_zone.index)
len(other_users)

In [None]:
# combination of the two above
non_eligible = other_users.union(set(churned_users))
len(non_eligible)

In [None]:
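train_dir = os.path.join(os.pardir, 'data', 'processed', 'train.csv')
# s_users was loaded with the msno column only, so reload the full file to get is_churn
df_train = pd.read_csv(train_dir)
new_users_s = df_train[~df_train.msno.isin(non_eligible)].copy()
new_users_s['is_churn'] = new_users_s.is_churn.astype('int64')
new_users_s.to_csv(train_dir, index=False)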

In [None]:
# # Assign best case scenario to new users which is 31 days until they get in the churning zone
# s_churn_zone = add_missing_users(s_churn_zone, s_churn_zone.min())
# assert len(s_churn_zone) == len(s_users)

In [None]:
# give the series a name
s_churn_zone.name = 'churn_zone_days'