# WSDM - KKBox's Churn Prediction Challenge
*Can you predict when subscribers will churn?*


**Outline**

* [Read Data](#read)
* [Exploratory Data Analysis](#eda)
* [Feature Creation and Preprocessing](#preprocess)
* [Model and Score](#model) 
* [Prediction](#predict)
* [Reference](#reference)

---

In [1]:
%load_ext watermark

In [2]:
import os
import pandas as pd
import numpy as np

In [3]:
%watermark -a 'PredictiveII' -d -t -v -p pandas,numpy,sklearn,watermark

PredictiveII 2018-01-29 22:58:13 

CPython 3.6.3
IPython 6.1.0

pandas 0.20.3
numpy 1.13.3
sklearn 0.19.1
watermark 1.6.0


## <a id="read">Read Data</a>

### Train
The train data consists of users whose subscription expires within the month of February 2017. When we merge these user ids with their transaction records, each user's latest expire date should therefore fall in February 2017. Merging with the transaction dataset should also give us more data on their previous behavior.

Note that this description applies to the original `train.csv`. As far as the competition's data page describes it, `train_v2.csv` (the file loaded below) contains the churn data for March 2017, so the expected month may differ.
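
The expectation above can be checked by taking each user's latest `membership_expire_date` and testing which month it falls in. The following is a minimal sketch on a toy frame; the rows and the names `toy`, `latest_expire` and `in_feb_2017` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for the merged train/transaction data (hypothetical rows).
toy = pd.DataFrame({
    'msno': ['a', 'a', 'b', 'c'],
    'membership_expire_date': pd.to_datetime(
        ['2017-01-15', '2017-02-20', '2017-04-11', '2017-02-03']),
})

# Latest expiry date per user.
latest_expire = toy.groupby('msno')['membership_expire_date'].max()

# True where the latest expiry falls in February 2017.
in_feb_2017 = (latest_expire.dt.year == 2017) & (latest_expire.dt.month == 2)
print(in_feb_2017)
```

On the real data, the share of `True` values would show how many users actually match the stated month.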

In [4]:
# read data
data_dir = os.path.join('..', 'data')

train_path = os.path.join(data_dir, 'train_v2.csv')
transactions_path = os.path.join(data_dir, 'transactions_v2.csv') 
user_logs_path = os.path.join(data_dir, 'user_logs_v2.csv') 
members_path = os.path.join(data_dir, 'members_v3.csv') 
sample_submission_path = os.path.join(data_dir, 'sample_submission_v2.csv')

train = pd.read_csv(train_path)
transaction = pd.read_csv(transactions_path)
user_log = pd.read_csv(user_logs_path)
member = pd.read_csv(members_path)
sample_submission = pd.read_csv(sample_submission_path)

In [5]:
train.head()

Unnamed: 0,msno,is_churn
0,ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=,1
1,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,1
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1
3,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1
4,K6fja4+jmoZ5xG6BypqX80Uw/XKpMgrEMdG2edFOxnA=,1


In [6]:
train.shape

(970960, 2)

### Transaction

* **transaction_date**: presumably the date on which the user made a payment
* **membership_expire_date**: the date the membership expires; is this the expiry date before the payment?

In [7]:
transaction['membership_expire_date'] = pd.to_datetime(transaction['membership_expire_date'], format='%Y%m%d', errors='ignore')
# Date-part columns referenced by later cells (e.g. membership_expire_year); uncomment to recreate them.
#transaction['membership_expire_year'] = transaction['membership_expire_date'].dt.year
#transaction['membership_expire_month'] = transaction['membership_expire_date'].dt.month
#transaction['membership_expire_day'] = transaction['membership_expire_date'].dt.day

In [8]:
transaction['transaction_date'] = pd.to_datetime(transaction['transaction_date'], format='%Y%m%d', errors='ignore')
# Date-part columns that appear in later outputs (e.g. transaction_year); uncomment to recreate them.
#transaction['transaction_year'] = transaction['transaction_date'].dt.year
#transaction['transaction_month'] = transaction['transaction_date'].dt.month
#transaction['transaction_day'] = transaction['transaction_date'].dt.day
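
A note on the parsing above: with `errors='ignore'`, `pd.to_datetime` returns the input unchanged when a value cannot be parsed, so a bad date can go unnoticed. A minimal sketch of an alternative, using `errors='coerce'` on hypothetical values (the series `raw` is made up for illustration), makes failures visible as `NaT`:

```python
import pandas as pd

# Hypothetical raw values in YYYYMMDD form; the last one is deliberately invalid.
raw = pd.Series(['20170131', '20170504', 'bad'])

# errors='coerce' turns unparseable entries into NaT instead of keeping the raw input.
parsed = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')

# Number of values that failed to parse.
n_failed = parsed.isnull().sum()
print(parsed.dtype, n_failed)
```

Whether coercing is preferable here depends on how clean the date columns are; this is a suggestion, not what the cells above do.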

In [9]:
transaction.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,2017-01-31,2017-05-04,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,2015-08-09,2019-04-12,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,2017-03-03,2017-04-22,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,2017-03-29,2017-03-31,1
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,2017-03-23,2017-04-23,0


In [10]:
transaction.shape

(1431009, 9)

**Questions**

* How is the submission generated? Where do the 907471 `msno` values in the sample submission come from?
* The train data consists of users whose subscription expires within the month of February 2017. After merging with the transaction data, these ids show many different `membership_expire_date` values. What does that mean? Shouldn't every id in the train data have a `membership_expire_date` in February 2017?
* I don't understand what is being discussed [here](https://www.kaggle.com/c/kkbox-churn-prediction-challenge/discussion/39756).
* Should spend more time reading [Should I stay or should I go? - KKBox EDA](https://www.kaggle.com/headsortails/should-i-stay-or-should-i-go-kkbox-eda) and [Churn or No Churn - Exploration Data Analysis](https://www.kaggle.com/rastaman/churn-or-no-churn-exploration-data-analysis).
* For some records in transaction, the `membership_expire_date` is far in the future, such as in the year 2020, yet the users are labeled as churn. What does that mean? Recall that the criterion for "churn" is **no new valid service subscription within 30 days after the current membership expires.**
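
To make the churn criterion in the last question concrete, here is a simplified sketch on toy data. The columns `current_expire_date` and `next_transaction_date` and the label `is_churn_sketch` are hypothetical. The sketch ignores cancellations and any other rule the organizers may apply, so it is not the official labeling procedure.

```python
import pandas as pd

# Toy users: one renews 15 days after expiry, one after 38 days, one never.
toy = pd.DataFrame({
    'msno': ['a', 'b', 'c'],
    'current_expire_date': pd.to_datetime(
        ['2017-02-10', '2017-02-10', '2017-02-10']),
    'next_transaction_date': pd.to_datetime(
        ['2017-02-25', '2017-03-20', None]),
})

# Days between the current expiry and the next transaction (NaT if none).
gap = toy['next_transaction_date'] - toy['current_expire_date']

# Renewed if a next transaction exists and falls within 30 days of expiry.
# Assumption: a transaction made before the expiry date also counts as renewed.
renewed = gap.notnull() & (gap <= pd.Timedelta(days=30))

# Churn is the complement of "renewed" under this simplified reading.
toy['is_churn_sketch'] = (~renewed).astype(int)
print(toy[['msno', 'is_churn_sketch']])
```

Under this reading, a user with an expiry date in 2020 could only be labeled churn if the label refers to an earlier expiry or to a cancellation, which is worth checking against the `is_cancel` column.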


In [10]:
transaction[(transaction['membership_expire_year']==2017) & (transaction['membership_expire_month']==2)].shape

(350, 15)

In [27]:
temp = pd.merge(train, transaction, on='msno', how='left')

In [32]:
temp.shape

(1169418, 16)

In [72]:
temp2 = temp[temp['payment_method_id'].notnull()]

In [73]:
temp2.shape

(1132036, 16)

In [74]:
len(train.msno.unique())

970960

In [75]:
len(temp2.msno.unique())

933578
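
The counts above imply that 970960 - 933578 = 37382 train users have no rows in the transaction file, which matches the 1169418 - 1132036 = 37382 merged rows with a null `payment_method_id`. A minimal sketch of how to list such users directly, using the `indicator` option of `pd.merge` on toy frames (the frames `toy_train` and `toy_transaction` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for train and transaction (hypothetical rows).
toy_train = pd.DataFrame({'msno': ['a', 'b', 'c'], 'is_churn': [1, 0, 1]})
toy_transaction = pd.DataFrame({
    'msno': ['a', 'a', 'c'],
    'payment_method_id': [41, 41, 36],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(toy_train, toy_transaction, on='msno', how='left',
                  indicator=True)

# Users present in train but absent from transaction.
no_transactions = merged.loc[merged['_merge'] == 'left_only', 'msno'].unique()
print(no_transactions)
```

This avoids relying on a null check of a particular column such as `payment_method_id`.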

In [56]:
temp2.head()

Unnamed: 0,msno,is_churn,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,membership_expire_year,membership_expire_month,membership_expire_day,transaction_year,transaction_month,transaction_day
1,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,1,36.0,30.0,180.0,180.0,0.0,2017-03-11,2017-04-11,0.0,2017.0,4.0,11.0,2017.0,3.0,11.0
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1,17.0,60.0,0.0,0.0,0.0,2017-03-11,2017-03-14,0.0,2017.0,3.0,14.0,2017.0,3.0,11.0
3,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1,15.0,90.0,300.0,300.0,0.0,2017-03-14,2017-06-15,0.0,2017.0,6.0,15.0,2017.0,3.0,14.0
4,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-09-08,2017-06-08,0.0,2017.0,6.0,8.0,2015.0,9.0,8.0
5,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-09-08,2017-07-08,0.0,2017.0,7.0,8.0,2015.0,9.0,8.0


In [86]:
temp2.query('msno=="8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ="').sort_values(by='transaction_date',ascending=True)

Unnamed: 0,msno,is_churn,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,membership_expire_year,membership_expire_month,membership_expire_day,transaction_year,transaction_month,transaction_day
11,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-08-08,2017-05-09,0.0,2017.0,5.0,9.0,2015.0,8.0,8.0
13,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-08-08,2017-04-08,0.0,2017.0,4.0,8.0,2015.0,8.0,8.0
4,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-09-08,2017-06-08,0.0,2017.0,6.0,8.0,2015.0,9.0,8.0
5,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-09-08,2017-07-08,0.0,2017.0,7.0,8.0,2015.0,9.0,8.0
8,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-10-08,2017-09-08,0.0,2017.0,9.0,8.0,2015.0,10.0,8.0
9,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-10-08,2017-08-08,0.0,2017.0,8.0,8.0,2015.0,10.0,8.0
7,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-11-08,2017-10-08,0.0,2017.0,10.0,8.0,2015.0,11.0,8.0
10,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-11-08,2017-11-07,0.0,2017.0,11.0,7.0,2015.0,11.0,8.0
6,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-12-08,2017-12-08,0.0,2017.0,12.0,8.0,2015.0,12.0,8.0
12,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-12-08,2018-01-08,0.0,2018.0,1.0,8.0,2015.0,12.0,8.0
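
The output above shows many rows per user, some sharing the same `transaction_date`. If only the most recent record per user is needed, one option is to sort and keep the last row per `msno`. This is a sketch on toy rows, and the choice of tie-breaking by `membership_expire_date` is an assumption, not something established in this notebook.

```python
import pandas as pd

# Toy transaction history (hypothetical rows).
toy = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'transaction_date': pd.to_datetime(
        ['2015-08-08', '2015-12-08', '2017-03-11']),
    'membership_expire_date': pd.to_datetime(
        ['2017-05-09', '2018-01-08', '2017-04-11']),
})

# Sort so the most recent transaction (ties broken by expiry date) comes last,
# then keep that last row for each user.
latest = (toy.sort_values(['msno', 'transaction_date', 'membership_expire_date'])
             .drop_duplicates(subset='msno', keep='last')
             .reset_index(drop=True))
print(latest)
```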


In [76]:
temp2.head(30)

Unnamed: 0,msno,is_churn,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,membership_expire_year,membership_expire_month,membership_expire_day,transaction_year,transaction_month,transaction_day
1,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,1,36.0,30.0,180.0,180.0,0.0,2017-03-11,2017-04-11,0.0,2017.0,4.0,11.0,2017.0,3.0,11.0
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1,17.0,60.0,0.0,0.0,0.0,2017-03-11,2017-03-14,0.0,2017.0,3.0,14.0,2017.0,3.0,11.0
3,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1,15.0,90.0,300.0,300.0,0.0,2017-03-14,2017-06-15,0.0,2017.0,6.0,15.0,2017.0,3.0,14.0
4,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-09-08,2017-06-08,0.0,2017.0,6.0,8.0,2015.0,9.0,8.0
5,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-09-08,2017-07-08,0.0,2017.0,7.0,8.0,2015.0,9.0,8.0
6,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-12-08,2017-12-08,0.0,2017.0,12.0,8.0,2015.0,12.0,8.0
7,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-11-08,2017-10-08,0.0,2017.0,10.0,8.0,2015.0,11.0,8.0
8,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-10-08,2017-09-08,0.0,2017.0,9.0,8.0,2015.0,10.0,8.0
9,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-10-08,2017-08-08,0.0,2017.0,8.0,8.0,2015.0,10.0,8.0
10,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41.0,30.0,149.0,149.0,1.0,2015-11-08,2017-11-07,0.0,2017.0,11.0,7.0,2015.0,11.0,8.0


In [19]:
temp2 = temp.groupby(['msno']).agg({'payment_method_id': 'count'}).reset_index()

In [20]:
temp2[temp2.payment_method_id>1].sort_values(by='payment_method_id', ascending=False).head()

Unnamed: 0,msno,payment_method_id
137444,72gJqt1O31E/WoxAEYFn9LHNI6mAZFGera5Q6gvsFkA=,208
119705,5ty4nZkq54z93wQtBN7RHVYj8rNghBDCVBH+3xmxf0I=,172
398175,OGKDrZQDB3yewZhoSd5qqvmG5A1GcNTYMexO95NlH+g=,148
520336,WHsCtkOVsauvqBL0ULuG38887y7aU8GXdCmJMjw6hjQ=,145
461063,SNlFRAsmUqnXKPofSXA8WYUc5DtmLcUMy4pXSJ3Ohz0=,131


In [58]:
sample_merge = pd.merge(sample_submission,transaction, on='msno',how='left')

In [68]:
sample_submission.shape

(907471, 2)

In [69]:
sample_merge.head()

Unnamed: 0,msno,is_churn,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,membership_expire_year,membership_expire_month,membership_expire_day,transaction_year,transaction_month,transaction_day
0,4n+fXlyJvfQnTeKXTWT507Ll4JVYGrOC8LHCfwBmPE4=,0,41.0,30.0,99.0,99.0,1.0,2017-03-18,2017-04-18,0.0,2017.0,4.0,18.0,2017.0,3.0,18.0
1,aNmbC1GvFUxQyQUidCVmfbQ0YeCuwkPzEdQ0RwWyeZM=,0,34.0,30.0,149.0,149.0,1.0,2017-03-31,2017-04-30,0.0,2017.0,4.0,30.0,2017.0,3.0,31.0
2,rFC9eSG/tMuzpre6cwcMLZHEYM89xY02qcz7HL4//jc=,0,41.0,30.0,99.0,99.0,1.0,2017-03-15,2017-04-15,0.0,2017.0,4.0,15.0,2017.0,3.0,15.0
3,WZ59dLyrQcE7ft06MZ5dj40BnlYQY7PHgg/54+HaCSE=,0,41.0,30.0,99.0,99.0,1.0,2017-03-27,2017-04-27,0.0,2017.0,4.0,27.0,2017.0,3.0,27.0
4,aky/Iv8hMp1/V/yQHLtaVuEmmAxkB5GuasQZePJ7NU4=,0,30.0,30.0,129.0,129.0,1.0,2017-03-22,2017-04-21,0.0,2017.0,4.0,21.0,2017.0,3.0,22.0


In [70]:
sample_merge.shape

(973285, 16)

In [71]:
len(sample_merge[sample_merge['payment_method_id'].notnull()]['msno'].unique())

907470
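
Since the sample submission has 907471 rows and 907470 of its `msno` values have transactions, exactly one id has no transaction record. A minimal sketch for locating such ids with `isin`, on toy frames (`toy_submission` and `toy_transaction` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for sample_submission and transaction (hypothetical rows).
toy_submission = pd.DataFrame({'msno': ['a', 'b', 'c'], 'is_churn': [0, 0, 0]})
toy_transaction = pd.DataFrame({'msno': ['a', 'c', 'c']})

# Submission ids that never appear in the transaction data.
missing = toy_submission[~toy_submission['msno'].isin(toy_transaction['msno'])]
print(missing)
```

Any id found this way would need features from another source, or a default prediction.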

## <a id="eda">Exploratory Data Analysis</a>