# *DOCUMENTATION - Modeling Churn*

<font color=blue>Wednesday, 25 - 2017</font>

## Introduction

The preferred way to approach this challenge is to first understand the problem at hand. The definition of churn is specific to each industry and company. Churn can be analyzed with different techniques, some of which take time into consideration. What we are trying to predict is a binary value, expressed as the probability that a customer will churn *(is_churn == 1)*. Therefore we will most likely rely on classification algorithms. We have also seen that only around 6% of customers in the training data have churned, so the classes are heavily imbalanced and we may need to oversample the churned customers to capture enough signal to train the algorithms.
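
As a rough illustration of the oversampling idea mentioned above, the sketch below randomly duplicates rows of the minority class until both classes have the same size. The function name `oversample_minority` and the 50/50 target ratio are assumptions made for this example, not something prescribed by the dataset or the challenge. It should be applied to the training split only, so that duplicated rows do not leak into validation data.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes are balanced.

    Hypothetical helper: the 50/50 target ratio is an assumption.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)

    pos = np.flatnonzero(y == 1)  # indices of churned customers
    neg = np.flatnonzero(y == 0)  # indices of retained customers
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)

    # Sample minority indices with replacement to fill the gap.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)

    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Other options, such as class weights or undersampling the majority class, address the same imbalance and may be worth comparing.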


#### What Can Cause Churn?
The entire premise of this long exercise is to be able to predict churn; that is, to understand the variables influencing a customer to not make a subscription within a period of time. Therefore we are modeling the risk of a customer not "coming back". Below are a few thoughts on why people would leave:
- There is a better service from competitors;
- Customers do not find utility in the service and decide not to renew (in this case, the evolution of the user logs should provide an indication, either compared to the customer's own past [autoregressive] or to average users);
- Customers who forget to renew and do not have auto-renewal;


#### When Do Customers Churn?
Even though we have a clear definition for the time where churn is **registered**, the customer might have decided not to continue with KKBox services **before** the event is marked. In an ideal world, we would have, or know, on a timely basis the likelihood of a customer churning. In our case the only two indicators are the transaction and user logs.


#### Churn Defined:
For KKBox and this exercise, a member is considered to have churned if they do not make a new service subscription **within 30 days after the expiration date** of the current subscription. Therefore it becomes vital to understand the difference between *inactive* and *churned*. A customer might be inactive for 28 days and then make a service subscription, without churning.
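
The definition above can be sketched as a small labeling function. The names `is_churn` and `CHURN_WINDOW_DAYS` are hypothetical, and treating a renewal made exactly on day 30 as "not churned" is an assumption about the boundary.

```python
from datetime import date, timedelta

CHURN_WINDOW_DAYS = 30  # window taken from the churn definition above

def is_churn(expiration_date, next_transaction_date):
    """Return 1 if the member churned, 0 otherwise.

    next_transaction_date is None when no later subscription exists.
    Assumption: a renewal on day 30 exactly still counts as not churned.
    """
    if next_transaction_date is None:
        return 1
    deadline = expiration_date + timedelta(days=CHURN_WINDOW_DAYS)
    return int(next_transaction_date > deadline)
```

For instance, a member whose subscription expires on 2017-02-28 and who renews 28 days later is inactive for a while but is labeled 0, while one who renews after 45 days is labeled 1.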

#### Subscriptions Canceled
Whenever subscriptions are canceled, the expiration dates in the dataset are updated. This is tricky because if a member has a subscription that is supposed to last until 2017-02-28 and actively cancels it on 2017-02-10 *(Transactions.is_cancel == True)*, then the new expiration date is 2017-02-11.


## Data Manipulation
We have one dimension file (**members.csv**) and two fact files (**transactions.csv** and **user_logs.csv**).

- **Members**:
    -  *Gender*: Too many NaN values, more than half of the entire dataset;
    -  *BD*: Plenty of outliers (<= 0 and > 100); 
    -  *City*: Categorical number. It would be interesting to understand the relationship of this data to the actual listening behavior;
    -  *Registered Via*: 7 options, out of which 4 are predominant;
    -  *Tenure*: *Expiration Date* - *Registration Init time*. It is potentially a strong predictor of churn; we might decide to use '2017-02-28' - *Registration Init time* instead, so that all members are measured against the same fixed point in time;


-  **Transactions**:
    -  *Payment Method Id*:
    -  *Payment Plan Days*:
    -  *Plan List Price*:
    -  *Actual Amount Paid*:
    -  *Is Auto Renew*:
    -  *Transaction Date*:
    -  *Membership Expire Date*:

- **User Logs**:
    -  *25%*:
    -  *50%*:
    -  *75%*:
    -  *98.5%*:
    -  *100%*:
    -  *Unique Songs*:
    -  *Total Seconds*: It might make sense to use log scale given how right-skewed the data is.
    -  *Date*:
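
A few of the manipulations noted in the list above (tenure against a fixed date, masking implausible *BD* values, and a log scale for *Total Seconds*) could look like the sketch below. The column names `registration_init_time`, `bd`, and `total_secs`, as well as the `YYYYMMDD` integer date format, are assumptions about how the CSV files are laid out; the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

# Fixed reference point, as suggested in the Tenure note above.
REFERENCE_DATE = pd.Timestamp("2017-02-28")

def add_member_features(members):
    """Add tenure (in days) and a cleaned age column to the members table."""
    out = members.copy()
    # Assumption: registration dates are stored as integers like 20170128.
    registered = pd.to_datetime(
        out["registration_init_time"].astype(str), format="%Y%m%d"
    )
    out["tenure_days"] = (REFERENCE_DATE - registered).dt.days
    # Ages <= 0 or > 100 are treated as missing rather than dropped.
    out["bd_clean"] = out["bd"].where((out["bd"] > 0) & (out["bd"] <= 100))
    return out

def add_log_features(user_logs):
    """Add a log-scaled version of total listening seconds."""
    out = user_logs.copy()
    # log1p handles zeros; negative values are clipped to 0 first.
    out["log_total_secs"] = np.log1p(out["total_secs"].clip(lower=0))
    return out
```

Masking outliers as missing, rather than removing the rows, keeps those members available for the other features.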
    

## Modeling

For classification there are a few techniques we can explore (see below). Each has its pros and cons, and while most are easy to implement, it is good to know a thing or two about how they work.

- **Logistic Regression**

<center>
$\begin{equation*}\sigma(t) = \frac{1}{1+e^{-t}}\end{equation*}$
where
$\begin{equation*}t = \beta_0+\beta_1x\end{equation*}$
</center>

- **Support Vector Machine**


- **Decision Trees**

    Using Entropy - $H(T)$
<center>$\begin{equation*} H(T) = -\sum_{i=1}^j p_i\log_2 p_i  \end{equation*}$</center>

- **Random Forest**

    It is an ensemble of decision trees. That is, the prediction is obtained by aggregating multiple decision trees (the mean of the predictions for regression, the mode of the predicted classes for classification).


- **Neural Networks**
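
To make the logistic function and the entropy criterion from the list above more concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and `beta_0` and `beta_1` stand for coefficients that would normally be fitted by a library rather than chosen by hand.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def churn_probability(x, beta_0, beta_1):
    """Single-feature logistic regression: sigma(beta_0 + beta_1 * x)."""
    return sigmoid(beta_0 + beta_1 * x)

def entropy(class_probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in class_probabilities if p > 0)
```

A perfectly balanced binary node has an entropy of 1 bit, whereas a node with the roughly 6% / 94% split seen in the training data has an entropy of about 0.33 bits.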


## Evaluation

The evaluation will be based on the Log Loss function (see below). It is therefore important to understand how it works and how it will measure the performance of the models applied.

<center>$logloss = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)$</center>

In Python we can use the function [sklearn.metrics.log_loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) or simply calculate the log loss directly with vectorization.
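
A direct vectorized calculation might look like the following sketch. The name `binary_log_loss` and the clipping constant `eps` are choices made for this example; clipping is a common guard against taking `log(0)` when a predicted probability is exactly 0 or 1.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss computed with NumPy vectorization."""
    y = np.asarray(y_true, dtype=float)
    # Keep probabilities strictly inside (0, 1) so the logs stay finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

Predicting 0.5 for every customer gives a loss of about 0.693 (that is, ln 2), which is a useful baseline. Confident wrong predictions are penalized heavily, so well-calibrated probabilities matter more than the hard class labels.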