# Predicting Customer Churn

**Credit:** http://blog.yhathq.com/posts/predicting-customer-churn-with-sklearn.html

We first import the required packages: 

1. NumPy for scientific computing;
2. pandas for data processing.

In [6]:
import numpy as np
import pandas as pd

We then import the scikit-learn package. In this demo, we will try three machine learning algorithms:

1. *k*-Nearest Neighbors;
2. Naive Bayes;
3. Support Vector Machine.

In [7]:
from sklearn import neighbors, naive_bayes, svm

We also import the functions for evaluating the algorithms. The preprocessing class is imported later, in the cell where it is used.

In [8]:
from sklearn.metrics import accuracy_score, classification_report

## Preprocessing Data

We use Pandas to load the data from a CSV file.

In [12]:
churn_df = pd.read_csv('data/churn.csv')
col_names = churn_df.columns.tolist()

print("Column names:")
print(col_names)

Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']


In [13]:
print("Sample data:")
churn_df.head(6)

Sample data:


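| | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | ... | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False. |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False. |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |
| 5 | AL | 118 | 510 | 391-8027 | yes | no | 0 | 223.4 | 98 | 37.98 | ... | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.7 | 0 | False. |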


Separate the classification target from the data. The `Churn?` column holds the strings `'True.'` and `'False.'`, which we encode as 1 and 0.

In [14]:
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.', 1, 0)
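The comparison has to match the string exactly, including the trailing period in `'True.'`. A minimal sketch on made-up labels (the array `labels` is hypothetical and not read from the CSV file):

```python
import numpy as np

# Hypothetical label strings in the same format as the 'Churn?' column
labels = np.array(['False.', 'True.', 'False.', 'True.'])

# 1 where the customer churned, 0 otherwise
encoded = np.where(labels == 'True.', 1, 0)
print(encoded)  # [0 1 0 1]

# Without the trailing period nothing matches, so every label becomes 0
wrong = np.where(labels == 'True', 1, 0)
print(wrong)    # [0 0 0 0]
```

Checking `np.unique(y)` after the encoding, as done further down, is a cheap way to catch this kind of mismatch.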

Drop the columns we will not use as features (`State`, `Area Code`, and `Phone`), along with the target column `Churn?`.

In [15]:
to_drop = ['State', 'Area Code', 'Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop, axis=1)

Some columns contain 'yes' or 'no' strings. We convert them to Boolean values so that they can be cast to numbers along with the other features.

In [16]:
yes_no_cols = ["Int'l Plan", "VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

Convert the data frame to a NumPy array of floats, because scikit-learn estimators operate on numeric arrays.

In [20]:
print(type(churn_feat_space))
# as_matrix() and np.float were removed from recent pandas and NumPy releases
X = churn_feat_space.to_numpy().astype(float)
print(type(X))

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>


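Standardizing the features is important here. `StandardScaler` rescales each column to zero mean and unit variance, so that features measured in large units (such as minutes) do not dominate distance-based methods such as *k*-nearest neighbors and SVMs.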

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("Unique target labels:", np.unique(y))

Feature space holds 3333 observations and 17 features
Unique target labels: [0 1]
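For reference, `StandardScaler` with its default settings subtracts each column's mean and divides by its population standard deviation. A minimal NumPy sketch of that computation, using a made-up matrix `X_demo` rather than the churn features:

```python
import numpy as np

# Hypothetical feature matrix: two features on very different scales
X_demo = np.array([[100.0, 1.0],
                   [200.0, 2.0],
                   [300.0, 3.0],
                   [400.0, 6.0]])

# Column-wise mean and population standard deviation (ddof=0)
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)

X_scaled = (X_demo - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contribute on a comparable scale to a Euclidean distance, which is what *k*-nearest neighbors relies on.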


## Preparing Training and Testing Data

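Shuffle the rows with a fixed random seed, then use the first 80% of the shuffled observations (2666) for training and hold out the remaining 667 for testing.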

In [22]:
np.random.seed(0)

indices = np.random.permutation(len(X))

split_index = -(len(X) - int(len(X) * 0.8))

X_train = X[indices[:split_index]]
y_train = y[indices[:split_index]]

X_test = X[indices[split_index:]]
y_test = y[indices[split_index:]]
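An alternative to the manual shuffle above is scikit-learn's `train_test_split`, which can also keep the class ratio equal in both parts through its `stratify` argument. The sketch below uses made-up arrays `X_demo` and `y_demo` as stand-ins for `X` and `y`; it will not reproduce the exact split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 100 rows, 20% positive labels
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# Hold out 20% for testing; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)    # (80, 3) (20, 3)
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```

Stratifying matters most when one class is rare, since a purely random split can leave the test set with too few examples of it.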

## Building Predictive Model

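Define the class names used in the classification reports below. `classification_report` assigns `target_names` to the labels in sorted order, so with this list `'yes'` labels class 0 (customers who stayed) and `'no'` labels class 1 (customers who churned).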

In [23]:
target_names = ['yes', 'no']

### k-Nearest Neighbors

In [25]:
n_neighbors = 15

knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=target_names))

0.866566716642
             precision    recall  f1-score   support

        yes       0.87      0.99      0.93       562
         no       0.79      0.21      0.33       105

avg / total       0.86      0.87      0.83       667
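One way to read the accuracy is to compare it with the share of the larger class in the test set. A small sketch using only the support counts and the accuracy printed above (the variable names are ours):

```python
# Support counts from the report above
n_majority = 562
n_minority = 105
n_test = n_majority + n_minority

# Accuracy of always predicting the larger class
baseline = n_majority / n_test
print(round(baseline, 4))  # 0.8426

# Accuracy printed for k-nearest neighbors
knn_accuracy = 0.866566716642
print(knn_accuracy > baseline)  # True
```

Because the classes are imbalanced, the per-class recall in the report is a more informative figure than accuracy alone: the smaller class has a recall of only 0.21.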



### Naive Bayes

In [26]:
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=target_names))

0.845577211394
             precision    recall  f1-score   support

        yes       0.90      0.92      0.91       562
         no       0.51      0.46      0.48       105

avg / total       0.84      0.85      0.84       667
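To make the model less of a black box, here is a rough NumPy sketch of the Gaussian naive Bayes decision rule: for each class, add the log prior to the sum of per-feature Gaussian log-likelihoods and pick the class with the highest score. The function `gaussian_nb_predict` and the toy data are hypothetical illustrations, not scikit-learn's implementation, which handles variance smoothing differently.

```python
import numpy as np

# Hypothetical toy data: one feature, two well-separated classes
X_demo = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

def gaussian_nb_predict(X_train, y_train, X_new):
    """Return the class with the highest log prior + Gaussian log-likelihood."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9  # small constant avoids division by zero
        log_prior = np.log(len(Xc) / len(X_train))
        # Features are treated as independent, so their log-likelihoods add up
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var)
        scores.append(log_prior + log_lik.sum(axis=1))
    return classes[np.argmax(np.array(scores), axis=0)]

pred = gaussian_nb_predict(X_demo, y_demo, np.array([[1.1], [4.9]]))
print(pred)  # [0 1]
```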



### Support Vector Machines (SVMs)

In [27]:
clf = svm.SVC()
clf.fit(X_train, y_train)

# Predict with the SVM (clf), not the Naive Bayes model (gnb)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=target_names))
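To compare the three classifiers on equal footing, one option is to wrap each in a pipeline with the scaler and score it with cross-validation. The sketch below is an assumption-laden illustration: it runs on synthetic data from `make_classification` rather than the churn data, and the class weights and `cv=5` are arbitrary choices, so its scores say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import neighbors, naive_bayes, svm

# Synthetic, imbalanced stand-in for the churn data (not the real data set)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0)

models = {
    'knn': neighbors.KNeighborsClassifier(n_neighbors=15),
    'naive_bayes': naive_bayes.GaussianNB(),
    'svm': svm.SVC(),
}

scores = {}
for name, model in models.items():
    # The scaler is refit inside each fold, so held-out folds never influence it
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()
    print(name, round(scores[name], 3))
```

Keeping one variable per model, as the dictionary does here, also avoids accidentally reusing a previously fitted classifier when predicting.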

