# KKBox Customer Lifetime Value Analysis

---

# Part I: <font color=green>*Extraction, Transformation, and Loading*</font>

---

In [22]:
# General Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import datetime 

## Import and Prep Data

In [23]:
# Import the five transaction files and concatenate them into one DataFrame
transaction_paths = ['D:/J-5 Local/transaction{}.csv'.format(i) for i in range(5)]
transactions = pd.concat([pd.read_csv(path) for path in transaction_paths])

# Import Churn Files
churn_cluster = pd.read_csv('D:/J-5 Local/DRV_Feb2016_With_Cluster')

# Import Members Files
members = pd.read_csv('D:/J-5 Local/members.csv')

In [24]:
# Convert Date columns into DateTime Object
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date'])
transactions['membership_expire_date'] = pd.to_datetime(transactions['membership_expire_date'])
members['registration_init_time'] = pd.to_datetime(members['registration_init_time'])

As this is the 3rd project with this dataset, we will only explore the data with respect to the use case of Survival Analysis and Customer Lifetime Value. Please refer to the previous projects if you wish to know more about the dataset as a whole.

The goal of this section is to prepare and format the dataset for our Survival Analysis.

### <font color=purple>Create Master DF</font>

In [25]:
# Create Master Dataset
clv_data_master = pd.merge(members, churn_cluster[['msno','is_churn','Cluster','city_agg']], on='msno')

In [26]:
del members
del churn_cluster

In [27]:
clv_data_master.head()

| | msno | city | bd | gender | registered_via | registration_init_time | is_churn | Cluster | city_agg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P/Jw4MNLvfODOLBMXnuprsWoTDk2Tvez9k9uYPUDOH4= | 20 | 0 | | 3 | 2013-03-31 | 0 | 2 | 0 |
| 1 | BZugFTI+gk693KTVjzn4H+uENNjuOfoafXc73bkehQ0= | 10 | 0 | | 3 | 2013-05-02 | 0 | 2 | 0 |
| 2 | /84Qsm0+k8byRbnq4Uhcneu/Zp/BgicaGLgJ+wuZLXU= | 21 | 0 | | 3 | 2014-03-25 | 0 | 1 | 0 |
| 3 | TRdjEyUSvy9ou8lgD5/LPuOZSJwKTiwoNXNJu6CQD0k= | 3 | 0 | | 3 | 2013-02-17 | 0 | 1 | 0 |
| 4 | XXvj8r0kqkT10F0mRnFCThXHZCra/BRuGvwXCiym+VI= | 17 | 0 | | 3 | 2013-11-08 | 0 | 1 | 0 |
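
Because `pd.merge` defaults to an inner join, any member without a matching row in the churn file is silently dropped from the master DF. The sketch below shows one way to check how many keys would be lost. It uses small hypothetical frames (`members_toy`, `churn_toy`) with made-up values, not the actual KKBox files.

```python
import pandas as pd

# Hypothetical toy frames standing in for `members` and `churn_cluster`
members_toy = pd.DataFrame({'msno': ['a', 'b', 'c'], 'city': [1, 5, 13]})
churn_toy = pd.DataFrame({'msno': ['a', 'b', 'd'], 'is_churn': [0, 1, 0]})

# An outer join with indicator=True labels which side each key came from
check = pd.merge(members_toy, churn_toy, on='msno', how='outer', indicator=True)
merge_counts = check['_merge'].value_counts()

# Only the keys labelled 'both' survive the default inner join
inner = pd.merge(members_toy, churn_toy, on='msno')

print(merge_counts)
print(len(inner))
```

Running the same check on the real frames before deleting them would show how many members lack churn or cluster labels.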


### <font color=purple>Inspect Payment Plan Days</font>

In contractual settings, how often a customer pays is critical in determining lifetime value. Let's look at our current use case to see which payment plan periods our users are on. As this project runs alongside our Churn and initial Customer Segmentation projects, we will only observe transactions dated before February 28th 2016.

In [28]:
# Payment plan days distribution
transactions[transactions['transaction_date'] < datetime.datetime(2016,2,28)]['payment_plan_days'].value_counts().head(10)

30     8017313
0       870121
31      766608
7       221883
410      79508
195      75600
10       32737
180      30904
100      11051
395      10329
Name: payment_plan_days, dtype: int64

Before 2016, KKBox switched from 31-day to 30-day payment plans. For simplicity, we will combine these values.

In [29]:
# Keep only transactions dated before February 28th 2016
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
transactions = transactions[transactions['transaction_date'] < datetime.datetime(2016,2,28)].copy()

# Convert 31-day plans to 30-day plans
transactions['payment_plan_days'] = transactions['payment_plan_days'].replace(31, 30)

## Feature Engineering

### <font color=purple>*Do all users have a single unique Payment Plan Period?*</font>

Next we want to determine whether each user has kept a single recurring payment plan period throughout their lifetime. Aside from comparing the individual payment plan periods to each other, it would also be interesting to determine whether users who have had multiple payment plans have a higher LTV than those who have not.

In [30]:
# Members vs # of Unique payment_plan_days
temp = transactions.groupby('msno')['payment_plan_days'].nunique().reset_index()
temp['payment_plan_days'].value_counts()

1    1048019
2     544206
3      39834
4       1823
5        155
6         16
7          4
8          1
Name: payment_plan_days, dtype: int64

Here we see that some users do not have an exclusive payment plan and have switched from plan to plan over their lifetime. In order to have an accurate analysis we will segment across users with single plans vs users with various plans. Let's add these values as a new feature.

In [31]:
# Add unique_payment_plan_days to Master DF
temp.columns = ['msno', 'unique_payment_plans']
clv_data_2016 = pd.merge(clv_data_master, temp, on='msno', how='inner')
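
To make the single-plan vs multiple-plan segmentation concrete, here is a minimal sketch on hypothetical toy transactions. The `single_plan` column is an illustrative helper and is not part of the master DF built above.

```python
import pandas as pd

# Hypothetical transactions: user 'a' keeps one plan, user 'b' switches plans
tx = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b', 'b'],
    'payment_plan_days': [30, 30, 30, 7, 30],
})

# Count the distinct plan lengths each user has paid for
plans = (tx.groupby('msno')['payment_plan_days']
           .nunique()
           .reset_index(name='unique_payment_plans'))

# Illustrative flag: True when the user never changed plan length
plans['single_plan'] = plans['unique_payment_plans'] == 1

print(plans)
```

A boolean flag like this could later be used as a grouping variable when comparing the two segments.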

### <font color=purple>*Add Payment Plan: Days, List Price, Discount*</font>

In [32]:
# Add payment plan days to master df
temp = transactions.groupby('msno')['payment_plan_days'].median().reset_index()
clv_data_2016 = pd.merge(clv_data_2016, temp, on='msno')

In [33]:
# Add payment plan price to master df
temp = transactions.groupby('msno')['plan_list_price'].median().reset_index()
clv_data_2016 = pd.merge(clv_data_2016, temp, on='msno')

In [34]:
# Add discount categorical to master df
temp = transactions.groupby('msno')[['plan_list_price','actual_amount_paid']].sum().reset_index()
temp['discount'] = temp['plan_list_price'] - temp['actual_amount_paid']
temp['discount'] = temp['discount'].apply(lambda x: 'Discount' if x > 0 else 'No Discount')
clv_data_2016 = pd.merge(clv_data_2016, temp[['msno','discount']], on='msno')
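
The discount flag compares lifetime totals rather than individual transactions. The sketch below works through that logic on hypothetical toy data with made-up prices. It uses a boolean comparison with `map` in place of the `apply` call above, which should yield the same labels.

```python
import pandas as pd

# Hypothetical transactions: user 'b' paid less than list price once
tx = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'b'],
    'plan_list_price': [149, 149, 149, 149],
    'actual_amount_paid': [149, 149, 149, 99],
})

# Sum list price and amount paid over each user's lifetime
totals = (tx.groupby('msno')[['plan_list_price', 'actual_amount_paid']]
            .sum()
            .reset_index())

# A user is labelled 'Discount' when total list price exceeds total paid
paid_less = totals['plan_list_price'] > totals['actual_amount_paid']
totals['discount'] = paid_less.map({True: 'Discount', False: 'No Discount'})

print(totals)
```

One consequence of working on totals is that an overpayment in one transaction can offset a discount in another, so the label describes the net effect over the user's history.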

### <font color=purple>*Calculate Tenure and Amount Spent Per Day Over Tenure*</font>

Now we will calculate membership tenure. Although the full dataset runs through March 31st 2017, we have restricted our transactions to those dated before February 28th 2016, so we will calculate tenure as ***February 28th 2016 - Earliest Transaction Date***.

In [35]:
# Add tenure
temp = transactions.groupby('msno')['transaction_date'].min().reset_index()
temp['tenure'] = (datetime.datetime(2016,2,28) - temp['transaction_date']).dt.days

# Add paid_per_day
temp2 = transactions.groupby('msno')['actual_amount_paid'].sum().reset_index()
temp = pd.merge(temp[['msno','tenure']], temp2[['msno','actual_amount_paid']], on='msno')
temp['avg_paid_per_day'] = temp['actual_amount_paid'] / temp['tenure']

# Add both features to df
clv_data_2016 = pd.merge(clv_data_2016, temp[['msno','tenure','avg_paid_per_day']], on='msno')
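
As a worked example of the tenure and spend-per-day arithmetic, the sketch below applies the same steps to hypothetical toy transactions. The dates and amounts are made up, and the named aggregation is simply a compact alternative to the two `groupby` calls above.

```python
import pandas as pd

# Same cutoff date as used in the tenure calculation above
cutoff = pd.Timestamp('2016-02-28')

# Hypothetical transactions for two users
tx = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'transaction_date': pd.to_datetime(['2016-01-29', '2016-02-27', '2015-02-28']),
    'actual_amount_paid': [149, 149, 1788],
})

# Earliest transaction and total amount paid per user
summary = tx.groupby('msno').agg(
    first_transaction=('transaction_date', 'min'),
    total_paid=('actual_amount_paid', 'sum'),
).reset_index()

# Tenure in days up to the cutoff, then average spend per day of tenure
summary['tenure'] = (cutoff - summary['first_transaction']).dt.days
summary['avg_paid_per_day'] = summary['total_paid'] / summary['tenure']

print(summary)
```

User `a` first transacted 30 days before the cutoff and paid 298 in total, giving roughly 9.93 per day. User `b` has a tenure of 365 days and averages roughly 4.90 per day.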

In [36]:
clv_data_2016.head()

| | msno | city | bd | gender | registered_via | registration_init_time | is_churn | Cluster | city_agg | unique_payment_plans | payment_plan_days | plan_list_price | discount | tenure | avg_paid_per_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P/Jw4MNLvfODOLBMXnuprsWoTDk2Tvez9k9uYPUDOH4= | 20 | 0 | | 3 | 2013-03-31 | 0 | 2 | 0 | 1 | 30.0 | 149.0 | No Discount | 234 | 5.094017 |
| 1 | BZugFTI+gk693KTVjzn4H+uENNjuOfoafXc73bkehQ0= | 10 | 0 | | 3 | 2013-05-02 | 0 | 2 | 0 | 2 | 30.0 | 149.0 | No Discount | 410 | 5.087805 |
| 2 | /84Qsm0+k8byRbnq4Uhcneu/Zp/BgicaGLgJ+wuZLXU= | 21 | 0 | | 3 | 2014-03-25 | 0 | 1 | 0 | 1 | 30.0 | 149.0 | No Discount | 40 | 7.45 |
| 3 | TRdjEyUSvy9ou8lgD5/LPuOZSJwKTiwoNXNJu6CQD0k= | 3 | 0 | | 3 | 2013-02-17 | 0 | 1 | 0 | 2 | 30.0 | 149.0 | No Discount | 419 | 4.97852 |
| 4 | XXvj8r0kqkT10F0mRnFCThXHZCra/BRuGvwXCiym+VI= | 17 | 0 | | 3 | 2013-11-08 | 0 | 1 | 0 | 1 | 30.0 | 149.0 | No Discount | 407 | 5.125307 |


## Export Data

In [41]:
clv_data_2016.to_csv('D:/J-5 Local/CLV_Feb2016.csv')