# Exploratory Data Analysis
##### _Author: Calvin Chi_

---

# Introduction

Viva Slots is a mobile app featuring slot machines with the option of credit purchases. Its producer, Rocket Games, is interested in predicting player churn, defined as failing to make another purchase within a given time period after the previous purchase. Predicting which players are unlikely to pay again within that time frame allows Rocket Games engineers to target the right customers with incentives to pay again, potentially improving business.

<img src="http://i.imgur.com/x4bzboV.jpg" width="500" height="500">

To perform this data analysis we have collected features about individual customers as well as time-series data on in-game features such as number of level-ups within a time frame. Since an individual customer may have multiple transactions, we will define each transaction as a sample in our dataset. 

Our problem is thus: given that a purchase has been made, what is the probability that the player purchases again within the next 7, 14, or 30 days? We will choose the exact window based on our analysis. Our time-series data and user features will be constructed up to the point of purchase, and the test set will likewise have features only up to the point of purchase. Note also that once the modeling is done, the data we predict on cannot be generated more than `n` days after the last purchase; otherwise the churn labels will already be known.
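
The label definition above can be sketched on a toy frame. This is only an illustration under stated assumptions: `toy` and `lapse_label` are hypothetical names, `hours_until` is taken to be measured in hours, and a missing next purchase is represented here as `NaN` (the real file may encode it differently).

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions; NaN stands in for "no later purchase observed".
toy = pd.DataFrame({"hours_until": [12.0, 200.0, 400.0, np.nan]})

def lapse_label(hours_until, days):
    """Return 1 (churn) when no purchase follows within `days` days, else 0."""
    purchased_in_window = hours_until <= days * 24  # NaN compares as False
    return (~purchased_in_window).astype(int)

for days in (7, 14, 30):
    toy["lapse%d" % days] = lapse_label(toy["hours_until"], days)
print(toy)
```

A longer window can only turn churn labels into non-churn labels, never the reverse, which is why the three label columns are nested.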

Let's start with exploratory data analysis! First load the necessary packages:

In [None]:
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import sklearn.preprocessing as pp
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from datetime import datetime
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22
import sys
import pickle
import copy
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import recall_score
from sklearn import model_selection as grid_search  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix

Define file locations:

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(data, y, test_size=0.2)
clf = DecisionTreeClassifier(max_depth=8, class_weight="balanced")
clf.fit(Xtrain, ytrain)
pred = clf.predict(Xtest)
print("Prediction accuracy on test dataset: ")
print(clf.score(Xtest, ytest))

Determine the recall score at the default 50% decision threshold:

In [None]:
pred = clf.predict(Xtest)
print(recall_score(ytest, pred))
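
To see how recall and precision move with the decision threshold, here is a minimal NumPy sketch. The helper `precision_recall_at` and the toy arrays are hypothetical, not part of the notebook; in the notebook the probabilities would come from `clf.predict_proba(Xtest)[:, 1]`.

```python
import numpy as np

def precision_recall_at(y_true, prob, threshold=0.5):
    """Precision and recall when samples with prob >= threshold are called churn."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(prob) >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy labels and predicted churn probabilities (made up for illustration).
y_toy = [1, 1, 1, 0, 0, 0]
p_toy = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
print(precision_recall_at(y_toy, p_toy, 0.5))
print(precision_recall_at(y_toy, p_toy, 0.35))
```

On this toy data, lowering the threshold raises recall because one more true churn sample is flagged; in general a lower threshold can also admit more false positives.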

Confusion Matrix

In [None]:
pd.crosstab(ytest, pred, rownames=['True'], colnames=['Predicted'], margins=True)

Area under the precision-recall curve

In [None]:
prob = clf.predict_proba(Xtest)[:, 1]
area = average_precision_score(ytest, prob)
print("Area under PR Curve")
print(area)

The analysis so far shows significant overlap between the positive and negative classes, so we should expect a significant tradeoff between recall and precision. Let us save our split dataset.

In [None]:
pickle.dump(Xtrain, open(dataDir + "Xtrain.p", 'wb'))
pickle.dump(ytrain, open(dataDir + "ytrain.p", 'wb'))
pickle.dump(Xtest, open(dataDir + "Xtest.p", 'wb'))
pickle.dump(ytest, open(dataDir + "ytest.p", 'wb'))

Load our modified dataset

In [None]:
Xtrain = pickle.load(open(dataDir + "Xtrain.p", "rb"))
ytrain = pickle.load(open(dataDir + "ytrain.p", "rb"))
Xtest = pickle.load(open(dataDir + "Xtest.p", "rb"))
ytest = pickle.load(open(dataDir + "ytest.p", "rb"))

Import Calvin's decision tree, which will output all the top features used to separate the classes.

Save the data

In [None]:
pickle.dump(data, open(dataDir + "data.p", 'wb'))

Load the data

In [None]:
data = pickle.load(open(dataDir + "data.p", 'rb'))

# Data Visualization

First set the labels

In [None]:
y1 = data['lapse7']
y2 = data['lapse14']
y3 = data['lapse30']
del data['lapse7']
del data['lapse14']
del data['lapse30']

Determine the dimensions of the data matrix

In [None]:
data.shape

Scale each feature in the data matrix so that it has zero mean and unit variance. This step is almost always necessary before PCA, because otherwise features measured on large numeric scales dominate the principal components.

In [None]:
dataScale = pp.scale(data)
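
As a quick illustration of what this scaling does, the following sketch standardizes a toy matrix by hand with NumPy. `X_toy` is a made-up example; `pp.scale` performs the same column-wise operation using the population standard deviation.

```python
import numpy as np

X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Subtract each column's mean and divide by its (population) standard deviation.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After scaling, both columns contribute on the same footing even though the second one was originally two orders of magnitude larger.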

Let us whiten the data and plot a PCA of the total data

In [None]:
pca = PCA(n_components=135, whiten=True)
pca_transformed = pca.fit_transform(dataScale)

Plotting...

In [None]:
%matplotlib inline
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
posIdx = np.where(y1 == 1)[0]
negIdx = np.where(y1 == 0)[0]
ax.plot(pca_transformed[negIdx,0], pca_transformed[negIdx, 1], pca_transformed[negIdx, 2], '^', markersize=8, 
        alpha=0.7, color='red', label='Non-Churn')
ax.plot(pca_transformed[posIdx, 0], pca_transformed[posIdx, 1], pca_transformed[posIdx, 2], 'o', markersize=8, 
        color='blue', alpha=0.7, label='Churn')
ax.set_xlabel('PC1 (%.2f)' % (pca.explained_variance_ratio_[0]))
ax.set_ylabel('PC2 (%.2f)'% (pca.explained_variance_ratio_[1]))
ax.set_zlabel('PC3 (%.2f)' % (pca.explained_variance_ratio_[2]))
plt.title("PCA (Whole Dataset)")
ax.legend(loc='upper right')
plt.show()

Based on the plot alone, there appears to be a great deal of overlap between our classes. The overlap is not complete, so it should be possible to separate out some of the true negatives. Still, plenty of false positives and false negatives are expected, because many samples from different classes share very similar features. Let us see how well our three components "capture" the total structure of the data.

In [None]:
plt.figure(figsize=(30,10))
plt.title("Variance Explained vs PCs (Whole Dataset)")
plt.bar(list(range(1, len(pca.explained_variance_ratio_) + 1)), pca.explained_variance_ratio_,
       color="g", align="center")
plt.xticks(list(range(1, len(pca.explained_variance_ratio_) + 1)), list(range(1, len(pca.explained_variance_ratio_) + 1)))
plt.show()
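
A cumulative view often makes it easier to judge how much structure the first few components capture. The sketch below computes explained-variance ratios from the singular values of a centered toy matrix; the random data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)
# Toy data whose columns have very different spreads.
X_toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X_centered = X_toy - X_toy.mean(axis=0)

# Squared singular values of the centered data are proportional to the
# variance along each principal component.
s = np.linalg.svd(X_centered, compute_uv=False)
ratio = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(ratio)
print(np.round(cumulative, 3))
```

With the fitted `pca` object from the notebook, the equivalent quantity would be `np.cumsum(pca.explained_variance_ratio_)`.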

It looks like our first three PCs do a reasonable job in capturing the structure of the data compared with the rest of the components. Let us perform PCA again, but this time without the time series data to assess the performance of the non-time series data alone. We need to first subset the data so that only non-time series data are included.

To summarize the features we have: 

1. idfa: player id
2. rn: transaction identifier
3. rev: purchase amount
4. hasemail: boolean as to whether player provided email. Included because providing email provides a player with extra credits.
5. fb_friends: number of facebook friends playing. Included because friends can send gift credits.
6. e_viptier: vip tier
7. event_time: event time
8. e_purchaseamount: number of credits purchased
9. credits: credit balance prior to purchase
10. e_level: player level
11. hours_until: hours between this purchase and next
12. hours_prior: hours between last purchase and this purchase
13. lapse7: boolean churn label, equal to 1 when no next purchase was made within the next 7 days. `lapse14` and `lapse30` are defined accordingly
14. ooc: out of credit dialogs in unit time
15. ss: number of session starts in unit time
16. hb: number of heartbeats in unit time
17. qw: number of quality wins in unit time
18. sp: number of spins in unit time
19. lu: number of level ups in unit time
20. pv: number of purchases in unit time
21. rev: sum of revenue in unit time
22. chb: number of hourly bonus collections

Note that for the time series data, `ooc_5_4d` means number of out of credit dialogs between day 4 and day 5 prior to current purchase.
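
The naming convention can be unpacked programmatically. The helper below is a hypothetical sketch that assumes every time-series column follows the `<feature>_<far day>_<near day>d` pattern seen in `ooc_5_4d`.

```python
import re

def parse_ts_column(name):
    """Split a name such as 'ooc_5_4d' into (feature, far_day, near_day)."""
    match = re.match(r"^([a-z]+)_(\d+)_(\d+)d$", name)
    if match is None:
        raise ValueError("not a time-series column: %r" % name)
    feature, far, near = match.groups()
    return feature, int(far), int(near)

print(parse_ts_column("ooc_5_4d"))
print(parse_ts_column("chb_21_14d"))
```

Here `far_day - near_day` gives the interval length in days, which distinguishes the 1-day intervals from the 7-day ones.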

# Class Distribution
Let us view the distribution of classes under each label definition:

1. `lapse7`
2. `lapse14`
3. `lapse30`


In [None]:
%matplotlib inline
labels = 'Not Purchased', 'Purchased'
positive = sum(data['lapse7'] == 1)
negative = sum(data['lapse7'] == 0)
sizes = [positive, negative]
colors = ['yellowgreen', 'lightcoral']
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.title("Class Distribution")
plt.show()

In [None]:
positive = sum(data['lapse14'] == 1)
negative = sum(data['lapse14'] == 0)
sizes = [positive, negative]
colors = ['yellowgreen', 'lightcoral']
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=30)
plt.axis('equal')
plt.title("Class Distribution")
plt.show()

In [None]:
positive = sum(data['lapse30'] == 1)
negative = sum(data['lapse30'] == 0)
sizes = [positive, negative]
colors = ['yellowgreen', 'lightcoral']
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=0)
plt.axis('equal')
plt.title("Class Distribution")
plt.show()
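
The same class proportions can be read off numerically without pie charts. This is a small sketch on a made-up frame; `toy` is hypothetical and the labels follow the convention that 1 means churn.

```python
import pandas as pd

toy = pd.DataFrame({
    "lapse7":  [1, 1, 1, 0, 1, 0],
    "lapse14": [1, 1, 0, 0, 1, 0],
    "lapse30": [1, 0, 0, 0, 0, 0],
})
# The mean of a 0/1 column is the proportion of rows labeled 1 (churn).
churn_rate = toy[["lapse7", "lapse14", "lapse30"]].mean()
print(churn_rate)
```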

We see that as we lengthen the lapse period, the number of transactions labeled as churn decreases, since a longer window gives players more time to purchase again.

# Feature Engineering 1
Features that represent unique IDs do not generalize for classification and should be removed.

In [None]:
del data['idfa']
del data['rn']

We may want to convert `event_time` to a time stamp. For example, `event_time` could be converted to a feature that represents the number of minutes from or to midnight, whichever is less. This time stamp may be potentially useful because it is possible more frequent transactions may occur at a different time of the day than less frequent transactions. 

### Date Time
First check what data structure `event_time` is stored as

In [None]:
print(data['event_time'][0])
print(type(data['event_time'][0]))

It looks like it is a string, so we need to convert it to datetime format, which can in turn be used to calculate the time to/from midnight.

In [None]:
pickle.dump(newData, open(dataDir + "data.p", 'wb'))

### Change of Time Series Data

Given that time series data seem to be more volatile, we may be interested in the change in time-series over time. For example we may create a feature representing the difference between `chb_2_1d` and `chb_1_0d`.

Let's subset the data to get only the time-series data

In [None]:
subset = data.iloc[:, 8:134]

Now calculate the difference between the columns

In [None]:
originalFeatures = subset.columns.values.tolist()
newFeatureName = [x + "Diff" for x in originalFeatures]
newFeatures = pd.DataFrame()

for i in range(0, 126, 14):
    sub = subset.iloc[:, i:i+14]
    sub = sub.diff(axis=1)
    newFeatures = pd.concat([newFeatures, sub], axis=1)

newFeatures.columns = newFeatureName
newFeatures.dropna(inplace=True, axis=1)
newFeatures.head()
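
To make the block-wise differencing concrete, here is a toy version with two features and three intervals each. The column names and values are made up, and the block size of 3 stands in for the 14 intervals per feature used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "chb_3_2d": [1, 4], "chb_2_1d": [2, 4], "chb_1_0d": [5, 1],
    "ooc_3_2d": [0, 2], "ooc_2_1d": [0, 3], "ooc_1_0d": [1, 3],
})
block = 3  # columns per feature in this toy frame (the notebook uses 14)

pieces = []
for start in range(0, toy.shape[1], block):
    sub = toy.iloc[:, start:start + block].diff(axis=1)
    # The first column of every block has no predecessor, so drop it.
    pieces.append(sub.iloc[:, 1:])
diffs = pd.concat(pieces, axis=1)
diffs.columns = [c + "Diff" for c in diffs.columns]
print(diffs)
```

Differencing within each block, rather than across the whole frame, avoids subtracting the last interval of one feature from the first interval of the next.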

Let us perform PCA on these differences to see how well separated the two classes are

In [None]:
diff = pp.scale(newFeatures)
pca = PCA(n_components=113, whiten=True)
pca_transformed = pca.fit_transform(diff)

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
posIdx = np.where(y1 == 1)[0]
negIdx = np.where(y1 == 0)[0]
ax.plot(pca_transformed[negIdx,0], pca_transformed[negIdx, 1], pca_transformed[negIdx, 2], '^', 
        markersize=7, alpha=0.7, color='red', label='Non-churn')
ax.plot(pca_transformed[posIdx, 0], pca_transformed[posIdx, 1], pca_transformed[posIdx, 2], 'o', 
        markersize=7, color='blue', alpha=0.7, label='Churn')
ax.set_xlabel('PC1 (%.2f)' % (pca.explained_variance_ratio_[0]))
ax.set_ylabel('PC2 (%.2f)'% (pca.explained_variance_ratio_[1]))
ax.set_zlabel('PC3 (%.2f)' % (pca.explained_variance_ratio_[2]))
plt.title("PCA (Partial Dataset)")
ax.legend(loc='upper right')
plt.show()

In [None]:
newData = pd.concat([data, newFeatures], axis=1)
print(data.shape)
print(newData.shape)

It looks like these features can separate the two classes well.

Add the labels back to the data matrix

In [None]:
newData['lapse7'] = y1
newData['lapse14'] = y2
newData['lapse30'] = y3
pickle.dump(newData, open(dataDir + "data.p", 'wb'))

# New Feature Visualization

To gauge how well the best engineered features perform in separating classes, we will perform PCA. First load the data

In [None]:
data7 = pickle.load(open(dataDir + "data.p", 'rb'))
y7 = data7['lapse7']
del data7['lapse7']
del data7['lapse14']
del data7['lapse30']

Then subset the features to include only the best engineered features, determined to be `hours_prior` and the range features.

In [None]:
dataSubset = dataScale[:, list(range(8)) + [134]]
pca = PCA(n_components=9, whiten=True)
pca_transformed = pca.fit_transform(dataSubset)

Plot:

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
posIdx = np.where(y1 == 1)[0]
negIdx = np.where(y1 == 0)[0]
ax.plot(pca_transformed[negIdx,0], pca_transformed[negIdx, 1], pca_transformed[negIdx, 2], '^', 
        markersize=7, alpha=0.7, color='red', label='Non-Churn')
ax.plot(pca_transformed[posIdx, 0], pca_transformed[posIdx, 1], pca_transformed[posIdx, 2], 'o', 
        markersize=7, color='blue', alpha=0.7, label='Churn')
ax.set_xlabel('PC1 (%.2f)' % (pca.explained_variance_ratio_[0]))
ax.set_ylabel('PC2 (%.2f)'% (pca.explained_variance_ratio_[1]))
ax.set_zlabel('PC3 (%.2f)' % (pca.explained_variance_ratio_[2]))
plt.title("PCA (Partial Dataset)")
ax.legend(loc='upper right')
plt.show()

Keeping the time series data makes the data look more separable. Let's look at the data two components at a time.

In [None]:
fig = plt.figure(figsize=(12,4))
PCs = [(0, 1), (0, 2), (1, 2)]
i = 1
for pc in PCs:
    ax = fig.add_subplot(130 + i)
    ax.plot(pca_transformed[posIdx, pc[0]], pca_transformed[posIdx, pc[1]], 'o', markersize=7, color='blue', 
            alpha=0.7, label='Churn')
    ax.plot(pca_transformed[negIdx,pc[0]], pca_transformed[negIdx, pc[1]], '^', markersize=7, alpha=0.7, 
            color='red', label='Non-Churn')
    ax.set_xlabel('PC' + str(pc[0] + 1) + ' (%.2f)' % (pca.explained_variance_ratio_[pc[0]]))
    ax.set_ylabel('PC' + str(pc[1] + 1) + ' (%.2f)'% (pca.explained_variance_ratio_[pc[1]]))
    plt.rcParams['legend.fontsize'] = 7
    plt.title("PC" + str(pc[0] + 1) + " vs " + "PC" + str(pc[1] + 1) + " (Partial Dataset)")
    ax.legend(loc='upper right')
    i += 1
plt.show()

Indeed, the non-time series features alone will not be very useful in distinguishing the two classes. 

Let us plot the mean of each time-series feature against time to see if there are any differences between churn and non-churn. In the plots below, increasing time values indicate intervals closer to purchase (e.g. `ooc_2_1d` is closer to purchase than `ooc_21_14d`). Keep in mind that for each time-series feature, there are two groups of intervals: intervals spanning 7 days and intervals spanning 1 day. Intervals closer to purchase span one day. Error bars represent the standard deviation.

In [None]:
dataSubset = data.iloc[:, 8:134]  # .ix was removed from pandas; this is a positional slice
fig = plt.figure(figsize = (15, 45))
titles = ["Out of Credit", "Session Starts", "HeartBeats", "Quality Wins", "Spins", "Levelups", "Purchases", 
         "Revenue", "Number of Hourly Bonus"]
counter = 0
for i in range(0, 126, 14):
    ax = fig.add_subplot(910 + counter + 1)
    pos = dataSubset.iloc[posIdx, i:i+14]
    neg = dataSubset.iloc[negIdx, i:i+14]
    posMean = pos.apply(np.mean)
    negMean = neg.apply(np.mean)
    posStd = pos.apply(np.std)
    negStd = neg.apply(np.std)
    ax.errorbar(list(range(1, 15)), posMean, yerr=posStd, color='blue', label="Churn")
    ax.errorbar(list(range(1, 15)), negMean, yerr=negStd, color='red', label="Non-Churn")
    ax.set_xlabel("time")
    ax.set_title(titles[counter])
    ax.legend(loc='upper right')
    plt.rcParams['legend.fontsize'] = 15
    counter += 1

A couple of observations can be made from these plots:

1. In general, non-churn players have higher mean feature values than churn players, although the significance of the difference may not be high.
2. In general, non-churn players have larger standard deviations than churn players.

These observations suggest that churn transactions in general are more diverse than non-churn transactions.

# Feature Engineering 2
### Number of Times Exceeding 1-2 SD
Non-churn players seem to have more volatile in-game experiences. We will leverage this to create new features as the number of times a feature value exceeds or goes below 1-2 standard deviations from the mean value over time, and see if it is a good feature.

In [None]:
newFeatures = {}
features = ["oocNew", "ssNew", "hbNew", "qwNew", "spinsNew", "luNew", "purchaseNew", 
         "revNew", "bonusNew"]
posIdx = np.where(y1 == 1)[0]
negIdx = np.where(y1 == 0)[0]
counter = 0

for i in range(0, 126, 14):
    sub = dataSubset.iloc[:, i:i+14].values
    means = np.mean(sub, axis=1).reshape(sub.shape[0], 1)
    stds = np.std(sub, axis=1).reshape(sub.shape[0], 1)
    truth = (sub > means + 1*stds) | (sub < means - 1*stds)
    newFeatures[counter] = np.sum(truth, axis=1)
    counter += 1

newFeatures = pd.DataFrame(newFeatures)
newFeatures.columns = features
newFeatures.head()
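
The counting rule can be checked on a tiny array. This sketch uses made-up rows and a hypothetical multiplier `k` for the number of standard deviations; it mirrors the row-wise logic above.

```python
import numpy as np

sub = np.array([[1.0, 1.0, 1.0, 9.0],
                [2.0, 2.0, 2.0, 2.0],
                [0.0, 10.0, 0.0, 10.0]])

means = sub.mean(axis=1, keepdims=True)
stds = sub.std(axis=1, keepdims=True)
k = 1  # number of standard deviations
# Strict inequalities: values exactly k SDs from the mean are not counted.
outside = (sub > means + k * stds) | (sub < means - k * stds)
counts = outside.sum(axis=1)
print(counts)
```

The third row shows a limitation of the rule: a series that alternates between two values sits exactly one standard deviation from its mean, so it scores zero despite being highly variable.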

Let us see how well our newly created features separate out churn players from non-churn players.

In [None]:
newFeatures = newFeatures.values
pos = newFeatures[posIdx, :]
neg = newFeatures[negIdx, :]
posMeans = np.mean(pos, axis=0)
posStds = np.std(pos, axis=0)
negMeans = np.mean(neg, axis=0)
negStds = np.std(neg, axis=0)
plt.figure(figsize = (10, 5))
plt.errorbar(list(range(len(features))), posMeans, yerr=posStds, color="blue", label="Churn")
plt.errorbar(list(range(len(features))), negMeans, yerr=negStds, color="red", label="Non-Churn")
plt.xlim([-0.5, 8.5])
plt.legend(loc="upper right")
plt.show()

Unfortunately this new feature doesn't seem to capture the differences well. However, there is little harm to adding this new feature.

We should delete `hours_until`

In [None]:
del data['hours_until']

### Lapse 30
Impute the `hours_prior` feature

In [None]:
data30 = copy.deepcopy(data)
pos = np.squeeze(data30.loc[(data30['hours_prior'] != '(null)') & (data30['lapse30'] == 1), 
                          ['hours_prior']]).astype('int')
print(pos.shape)
neg = np.squeeze(data30.loc[(data30['hours_prior'] != '(null)') & (data30['lapse30'] == 0), 
                              ['hours_prior']]).astype('int')
print(neg.shape)

In [None]:
posMean = int(np.mean(pos))
negMean = int(np.mean(neg))
data30.loc[(data30['lapse30'] == 1) & (data30['hours_prior'] == '(null)'), ['hours_prior']] = posMean
data30.loc[(data30['lapse30'] == 0) & (data30['hours_prior'] == '(null)'), ['hours_prior']] = negMean

In [None]:
y30 = data30['lapse30']
del data30['lapse7']
del data30['lapse14']
del data30['lapse30']

### Comparison

We will use a decision tree to roughly evaluate the area under the precision-recall curve when the label is defined according to `lapse7`, `lapse14` and `lapse30` respectively. We will pick the definition that allows us to achieve maximum area under the precision-recall curve. 

#### Lapse 7

Set the label for lapse7 and delete all the other labels

In [None]:
data7 = copy.deepcopy(data)
y7 = data7['lapse7']
del data7['lapse7']
del data7['lapse14']
del data7['lapse30']

We use the 80-20 split rule for train and test sets

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(data7, y7, test_size=0.20)

Let us first do hyperparameter tuning to determine the optimal tree depth for our decision tree. We choose the `class_weight` to be "balanced" because we have a class imbalance with a majority negative class. 

In [None]:
dt = DecisionTreeClassifier(class_weight="balanced")
depths = {'max_depth':list(range(4, 26, 2))}
clf = grid_search.GridSearchCV(dt, depths, scoring='average_precision', cv=3, verbose=True)
clf.fit(Xtrain, ytrain)

Look at all the scores

In [None]:
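# Newer scikit-learn exposes `cv_results_`; very old releases only had `grid_scores_`
clf.cv_results_ if hasattr(clf, "cv_results_") else clf.grid_scores_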

It looks like the optimal maximum tree depth is 6. 

In [None]:
clf = DecisionTreeClassifier(max_depth=6, class_weight="balanced")
clf.fit(Xtrain, ytrain)
print("Prediction accuracy on test dataset: ")
print(clf.score(Xtest, ytest))

Now evaluate area under precision-recall (PR) curve

Let us finally plot the precision-recall curve.

In [None]:
precision, recall, thresholds = precision_recall_curve(ytest, prob)
plt.plot(recall, precision, "o-", color="black")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Lapse7")
plt.show()

#### Lapse14

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(data14, y14, test_size=0.20)

Let us first do hyperparameter tuning to determine the optimal tree depth for our decision tree. We choose the `class_weight` to be "balanced" because we have a class imbalance with a majority negative class. Based on the previous run for `lapse7`, a deep decision tree appears to give a suboptimal area under the PR curve, so we can narrow the range of depths searched this time.

In [None]:
dt = DecisionTreeClassifier(class_weight="balanced")
depths = {'max_depth':list(range(4, 16, 2))}
clf = grid_search.GridSearchCV(dt, depths, scoring='average_precision', cv=3, verbose=True)
clf.fit(Xtrain, ytrain)

It looks like the optimal maximum tree depth is 4. 

Look at the scores

The optimal depth is 4.

In [None]:
clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced")
clf.fit(Xtrain, ytrain)
print("Prediction accuracy on test dataset: ")
print(clf.score(Xtest, ytest))

Plot the precision-recall curve

It appears that lapse7 gives us the best area under the PR curve. 

# Feature Selection

We are interested in which features are the best at discriminating classes. Since a decision tree automatically performs feature selection, we will leverage this property to help us determine that. 

In [None]:
from DecisionTree import DecisionTree

Train the decision tree

In [None]:
clf2 = DecisionTree(stop=0.35, output=True, minSize=1000)
clf2.train(Xtrain, ytrain)

Let's see how well Calvin's decision tree performs on the test set.

There doesn't appear to be any difference in preferred purchase time between churn and non-churn transactions. However, it is noteworthy that in general there is a preference for purchasing close to midnight (within ~3 hours of midnight). A possible explanation is that Viva Slots customers have more time to play and pay during the evenings, and less time during the day due to work.

### Email Status

Convert email status to binary so that we can work with numbers

In [None]:
data['hasemail'] = data['hasemail'].astype(int)
print(data['hasemail'][:5])

Whether the user associated with a transaction has provided an email is included because providing an email awards the user in-game credits. However, this is a one-time event and may not be useful in separating churn from non-churn transactions.

In [None]:
print("Proportion of churn transactions with email: ")
print(np.mean(data.loc[(data['lapse7'] == 1), ['hasemail']]))

In [None]:
print("Proportion of non-churn transactions with email: ")
print(np.mean(data.loc[(data['lapse7'] == 0), ['hasemail']]))

Indeed, the proportion of transactions providing email is similar in both classes at around 50%, almost like tossing a coin on each transaction to decide whether the user for the transaction provided email! Some additional features will be constructed based on which lapse we choose, and we choose `lapse7` as the default. First save the data so far.

Load the data back up

### Hours Prior to Current Purchase

We need to throw away `hours_until` because that feature is used to determine our label (for `lapse7`, the label is 1 if `hours_until` exceeds 7 days, i.e. 168 hours, and 0 otherwise). However, we can keep `hours_prior`, the hours between the previous purchase and the current one.

We want to see if we should keep `hours_prior`, which contains `(null)` values corresponding to no purchases prior to current purchase. Inclusion of `hours_prior` requires a strategy for treating `(null)` values. First see how many `(null)` values there are as a percentage of our data.

In [None]:
print("Percentage null: ")
print(100 * np.mean(data['hours_prior'] == '(null)'))

A sizeable number of samples have `(null)`, hence we continue on to see if `hours_prior` is worth including:

In [None]:
pos = np.squeeze(data.loc[(data['hours_prior'] != '(null)') & (data['lapse7'] == 1), ['hours_prior']]).astype('int')
print(pos.shape)
neg = np.squeeze(data.loc[(data['hours_prior'] != '(null)') & (data['lapse7'] == 0), ['hours_prior']]).astype('int')
print(neg.shape)

In [None]:
print(pos.describe())
print(neg.describe())

In [None]:
plt.hist([pos])
plt.title("Churn")
plt.show()

In [None]:
plt.hist([neg])
plt.title("Non-Churn")
plt.show()

It appears that churn transactions have a larger within-class proportion of transactions with more than 250 hours between the previous payment and the current one. A possible explanation is that churn transactions in general have longer time spans between successive purchases. Thus, `hours_prior` may be useful and should be included.

To treat the `(null)` values, we are going to impute them with the mean `hours_prior` values of transactions within the same class.

In [None]:
posMean = int(np.mean(pos))
negMean = int(np.mean(neg))
data.loc[(data['lapse7'] == 1) & (data['hours_prior'] == '(null)'), ['hours_prior']] = posMean
data.loc[(data['lapse7'] == 0) & (data['hours_prior'] == '(null)'), ['hours_prior']] = negMean
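
Because class labels are not known at prediction time, it may also be worth keeping a label-free variant of this imputation on hand. The sketch below is one possible alternative, not the approach used in this notebook: it fills `(null)` with the overall median and records missingness in a hypothetical `hours_prior_missing` indicator column.

```python
import pandas as pd

toy = pd.DataFrame({"hours_prior": ["10", "(null)", "30", "(null)", "50"]})

# Coerce the '(null)' strings to NaN so the column becomes numeric.
hours = pd.to_numeric(toy["hours_prior"], errors="coerce")
toy["hours_prior_missing"] = hours.isna().astype(int)
# The fill value uses no class labels; in practice it would be
# computed on the training split only and reused on the test split.
toy["hours_prior"] = hours.fillna(hours.median())
print(toy)
```

The indicator column lets a tree-based model still learn from the fact that a transaction had no prior purchase.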

In [None]:
fileDir = "../data/payerChurnData_20160722.csv"
dataDir = "data/"
outputDir = "output/"

Read data:

In [None]:
data = pd.read_csv(fileDir)

Determine the data structure holding the data:

In [None]:
print(type(data))

Take a peek at the data...

In [None]:
data.head()

Assess the dimensions and print total list of features

In [None]:
print("Dimensions")
print(data.shape)
print("\n")
print(data.columns.values[:20])

In [None]:
pred = clf2.predict(Xtest)
print("Test set accuracy:")
print(np.mean(pred == ytest))

In [None]:
attributes = clf2.attributes
print(attributes)

In [None]:
pickle.dump(attributes, open(outputDir + "attributes.p", 'wb'))

In [None]:
attributes = pickle.load(open(outputDir + "attributes.p", 'rb'))

It looks like the best features are a combination of newly created range features and time series features. 

In [None]:
colNames = Xtrain.columns.values
idx = np.where(np.in1d(colNames, attributes))[0]
print(len(idx))
dataSubset = Xtrain.iloc[:, idx]
print(dataSubset.shape)

Let us see how separable the two classes are with these selected 34 features.

In [None]:
data = pickle.load(open(dataDir + "data.p", 'rb'))
y = data['lapse7']
del data['lapse7']
del data['lapse14']
del data['lapse30']

Let us first split the dataset and train on a simple decision tree to gauge its performance again.

In [None]:
temp = np.squeeze(data[['event_time']].values)
# Transform datetime.strptime function so that it can vectorize
func = np.vectorize(datetime.strptime)
dtVec = func(temp, "%Y-%m-%d %H:%M:%S")
print(dtVec[:5])

Let us define a function that converts our event times to the closest time from/to midnight in minutes. For example, 23:56 would result in 4 minutes, 11:55 would result in 715 minutes, and 13:00 would result in 660 minutes. There are 1440 minutes in a day.

In [None]:
def TimeToFromMidnight(dt):
    h = dt.hour
    m = dt.minute
    mTotal = h*60 + m
    if (1440 - mTotal > mTotal):
        return mTotal
    else:
        return (1440 - mTotal)
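
The same transformation can be sketched in vectorized form with pandas, reproducing the three examples given in the text. The timestamps below are made up and assume the `%Y-%m-%d %H:%M:%S` format; `stamps` and `time_mn` are hypothetical names.

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime(pd.Series(["2016-07-01 23:56:00",
                                   "2016-07-01 11:55:00",
                                   "2016-07-01 13:00:00"]))
# Minutes elapsed since the previous midnight.
minutes = stamps.dt.hour * 60 + stamps.dt.minute
# Distance to the nearer midnight, previous or next.
time_mn = np.minimum(minutes, 1440 - minutes)
print(time_mn.tolist())
```

This avoids `np.vectorize`, which loops in Python under the hood, although for a dataset of this size either approach should be adequate.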

Perform transformation from event times to minutes to/from midnight.

In [None]:
TimeToFromMidnightVec = np.vectorize(TimeToFromMidnight)
times = TimeToFromMidnightVec(dtVec)
print(times[:10])

Add this new timestamp feature to our dataframe

In [None]:
del data['event_time']

Let's visualize the distribution of this time stamp for churn and non-churn transactions to see whether the two classes differ in their preferred purchase times.

In [None]:
%matplotlib inline
posTime = np.squeeze(data.loc[(data['lapse7'] == 1), ['timeMN']]).astype('int')
negTime = np.squeeze(data.loc[(data['lapse7'] == 0), ['timeMN']]).astype('int')
plt.hist([posTime])
plt.title("Churn")
plt.show()

In [None]:
plt.hist([negTime])
plt.title("Non-Churn")
plt.show()

In [None]:
data['timeMN'] = times

Delete original `event_time` data

Impute the `hours_prior` feature

In [None]:
data14 = copy.deepcopy(data)
pos = np.squeeze(data14.loc[(data14['hours_prior'] != '(null)') & (data14['lapse14'] == 1), 
                          ['hours_prior']]).astype('int')
print(pos.shape)
neg = np.squeeze(data14.loc[(data14['hours_prior'] != '(null)') & (data14['lapse14'] == 0), 
                              ['hours_prior']]).astype('int')
print(neg.shape)

In [None]:
posMean = int(np.mean(pos))
negMean = int(np.mean(neg))
data14.loc[(data14['lapse14'] == 1) & (data14['hours_prior'] == '(null)'), ['hours_prior']] = posMean
data14.loc[(data14['lapse14'] == 0) & (data14['hours_prior'] == '(null)'), ['hours_prior']] = negMean

In [None]:
y14 = data14['lapse14']
del data14['lapse7']
del data14['lapse30']
del data14['lapse14']

In [None]:
data = pickle.load(open(dataDir + "data.p", 'rb'))
newFeatures = pd.DataFrame(newFeatures)
newFeatures.columns = features
data = pd.concat([data, newFeatures], axis=1)
pickle.dump(data, open(dataDir + "data.p", 'wb'))

### Range
Given that churn transactions are more volatile, let us use the range (a measure of dispersion) over time of each of the 9 time-series features as new features.

In [None]:
newFeatures = {}
features = ["oocRange", "ssRange", "hbRange", "qwRange", "spinsRange", "luRange", "purchaseRange", 
         "revRange", "bonusRange"]
counter = 0

for i in range(0, 126, 14):
    sub = dataSubset.iloc[:, i:i+14].values
    maxVal = np.max(sub, axis=1)
    minVal = np.min(sub, axis=1)
    r = maxVal - minVal
    newFeatures[counter] = r
    counter += 1

newFeatures = pd.DataFrame(newFeatures)
newFeatures.columns = features
newFeatures.head()
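
For reference, the max-minus-min computation per row is what NumPy calls "peak to peak". A tiny sketch on a made-up array:

```python
import numpy as np

sub = np.array([[3, 7, 5],
                [2, 2, 2]])
# Row-wise range: maximum minus minimum.
row_range = np.ptp(sub, axis=1)
print(row_range)
```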

Let's visualize how well this new feature does at separating classes using histograms

In [None]:
fig = plt.figure(figsize = (15, 30))
posIdx = np.where(y1 == 1)[0]
negIdx = np.where(y1 == 0)[0]
pos = newFeatures.iloc[posIdx, :]
neg = newFeatures.iloc[negIdx, :]
counter = 1
n = 20
for feature in features:
    ax = fig.add_subplot(910 + counter)
    posF = pos[feature]
    negF = neg[feature]
    ax.hist([posF, negF], color=['blue', 'red'], label=['Churn', 'Non-churn'])
    ax.set_title(feature)
    counter += 1

It looks like this feature may capture the class differences, as non-churn transactions generally have lower maximum values for each of the 9 features.

Plot

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
posIdx = np.where(y7 == 1)[0]
negIdx = np.where(y7 == 0)[0]
ax.plot(pca_transformed[negIdx,0], pca_transformed[negIdx, 1], pca_transformed[negIdx, 2], '^', 
        markersize=7, alpha=0.7, color='red', label='Non-churn')
ax.plot(pca_transformed[posIdx, 0], pca_transformed[posIdx, 1], pca_transformed[posIdx, 2], 'o', 
        markersize=7, color='blue', alpha=0.7, label='Churn')
ax.set_xlabel('PC1 (%.2f)' % (pca.explained_variance_ratio_[0]))
ax.set_ylabel('PC2 (%.2f)'% (pca.explained_variance_ratio_[1]))
ax.set_zlabel('PC3 (%.2f)' % (pca.explained_variance_ratio_[2]))
plt.title("PCA (Partial Dataset)")
ax.legend(loc='upper right')
plt.show()

The separation is a little better, which suggests we also need our existing time-series data.

# Which Lapse is Better?

Rocket Games is interested in high recall of churn transactions, so the objective is to maximize the area under the precision-recall curve. However, this depends on how we define churn transactions: we have the options of 7 days, 14 days, or even 30 days. Which definition is chosen will depend on the tradeoff between the resulting areas under the curve (AUCs) and how frequently Rocket Games would like purchases to occur. Let us first construct the `hours_prior` feature for `lapse14` and `lapse30` respectively, and compare all the AUCs to see which lapse allows the best combination of recall and precision.

### Lapse14

In [None]:
data7 = data7.loc[:, ['hours_prior', "oocRange", "ssRange", "hbRange", "qwRange", "spinsRange", "luRange", 
                      "purchaseRange", "revRange", "bonusRange"]]
print(data7.shape)

Scale and fit the data

It looks like these new features can provide some degree of class separation. Let us visualize the PCA plot with all features.

In [None]:
data7 = pp.scale(data7)
pca = PCA(n_components=153, whiten=True)
pca_transformed = pca.fit_transform(data7)