# Comparing classifiers for the churn problem

The churn problem is of great interest to companies that rely on user subscriptions: will a customer renew their contract or not? We'll load a small public churn dataset using the magic of the <a href="http://pandas.pydata.org/">Pandas package</a>, then compare the performance of a variety of classifiers on it.

As you'll see, in domains like this one it can be quite useful to model subtle interactions between variables. Personally, I find it somewhat remarkable that an ensemble of decision trees (0.942 test accuracy for the random forest in the run below) performs so much better than a linear model (0.838 for logistic regression). It would be interesting to investigate further: which interactions are so useful, and what do they reveal about churn? You might think about what analysis you would do to study these interactions.

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Import a bunch of libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics

Set up the data.

In [None]:
# Use pandas to read the csv.
churn_df = pd.read_csv('../Data/churn.csv')

# Isolate target data.
churn_result = churn_df['Churn?']
Y = np.where(churn_result == 'True.', 1, 0)

# Remove target and sparse features.
to_drop = ['State', 'Area Code', 'Phone', 'Churn?']
churn_feat_space = churn_df.drop(to_drop, axis = 1)

# 'yes'/'no' has to be converted to boolean values
# NumPy converts these from boolean to 1. and 0. later
yes_no_cols = ["Int'l Plan", "VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

# Get the feature names.
feature_names = np.array(churn_feat_space.columns)

# Thanks pandas! We'll use numpy from here on out.
X = churn_feat_space.to_numpy(dtype=float)

# Shuffle data.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

test_data, test_labels = X[2833:], Y[2833:]
dev_data, dev_labels = X[2333:2833], Y[2333:2833]
train_data, train_labels = X[:2333], Y[:2333]

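# Baseline: the accuracy of always predicting the majority class (no churn).
print("Majority class (no churn) for dev data: %.3f" % np.mean(dev_labels == 0))
print("Majority class (no churn) for test data: %.3f" % np.mean(test_labels == 0))

# List the feature indices and names.
for i, name in enumerate(feature_names):
    print("%d %s" % (i, name))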
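
The majority-class fractions printed above are the bar every classifier has to clear: a model that always predicts "no churn" already scores that well. As a minimal sketch of the idea, with made-up toy labels and a hypothetical helper name (`majority_baseline_accuracy` is not part of the notebook):

```python
import numpy as np

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return majority, float(np.mean(eval_labels == majority))

# Toy labels, invented for illustration: 1 = churn, 0 = no churn.
toy_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])
toy_eval = np.array([0, 1, 0, 0, 0])

majority, acc = majority_baseline_accuracy(toy_train, toy_eval)
print('majority class: %d, baseline accuracy: %.2f' % (majority, acc))
```

Picking the majority label from the training set rather than the evaluation set keeps the baseline honest, in the same spirit as fitting the scaler on training data only.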

Scale the data to have zero mean and unit standard deviation. Note that the feature means and standard deviations are computed on the training data only (using `fit_transform`). We then subtract those same training means and divide by those same training standard deviations to get the scaled dev and test data (using `transform`). Fitting the scaler on the evaluation data would be a subtle form of cheating!

In [None]:
scaler = StandardScaler()

# This overwrites the data, so be careful!
train_data = scaler.fit_transform(train_data)
dev_data = scaler.transform(dev_data)
test_data = scaler.transform(test_data)
print(train_data.shape)
print(dev_data.shape)
print(test_data.shape)
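
To make explicit what `fit_transform` and `transform` are doing, here is a rough NumPy-only sketch of the same arithmetic on synthetic data. The `toy_*` arrays are invented for illustration and stand in for the real train and dev splits.

```python
import numpy as np

# Synthetic stand-ins for the train and dev splits (not the churn data).
rng = np.random.RandomState(0)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
toy_dev = rng.normal(loc=5.0, scale=3.0, size=(50, 2))

# "fit": estimate per-feature statistics from the training data only.
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to every split.
toy_train_scaled = (toy_train - mu) / sigma
toy_dev_scaled = (toy_dev - mu) / sigma

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_dev_scaled.mean(axis=0), toy_dev_scaled.std(axis=0))
```

The scaled training columns come out with mean 0 and standard deviation 1 up to floating-point error, while the dev columns are only approximately standardized. That small mismatch is expected: it is the price of not peeking at the evaluation data.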

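Try a bunch of different classifiers. Each one is fit twice: once on the scaled features as they are, and once inside a pipeline that first expands them into all degree-2 polynomial terms (squares and pairwise products). Experiment with different sorts of non-linear models.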

In [4]:
classifiers = [
    (GaussianNB(),             'Gaussian NB'),
    (LogisticRegression(),     'Logistic Regression'),
    (SVC(kernel='linear'),     'SVM, linear kernel'),
    (SVC(kernel='poly'),       'SVM, poly kernel'),
    (DecisionTreeClassifier(), 'Decision Tree'),
    (RandomForestClassifier(n_estimators=25), 'Random Forest'),
]

# So we always get the same result.
np.random.seed(0)

for clf, name in classifiers:
    # Fit on the scaled features as they are.
    clf.fit(train_data, train_labels)
    print('%40s accuracy: %.3f (dev) %.3f (test)' % (
        name,
        clf.score(dev_data, dev_labels),
        clf.score(test_data, test_labels)))

    # Expand to all degree-2 terms: the original features, their squares,
    # and all pairwise products.
    polynomial_features = PolynomialFeatures(degree=2, include_bias=False)

    # Refit the same classifier on the polynomial feature space.
    pipeline = Pipeline([
        ('polynomial_features', polynomial_features),
        (name, clf)
    ])
    pipeline.fit(train_data, train_labels)
    print('%40s accuracy: %.3f (dev) %.3f (test)' % (
        '(with degree 2) ',
        pipeline.score(dev_data, dev_labels),
        pipeline.score(test_data, test_labels)))
    print('')

                             Gaussian NB accuracy: 0.848 (dev) 0.844 (test)
                        (with degree 2)  accuracy: 0.796 (dev) 0.808 (test)

                     Logistic Regression accuracy: 0.864 (dev) 0.838 (test)
                        (with degree 2)  accuracy: 0.906 (dev) 0.890 (test)

                      SVM, linear kernel accuracy: 0.866 (dev) 0.834 (test)
                        (with degree 2)  accuracy: 0.916 (dev) 0.898 (test)

                        SVM, poly kernel accuracy: 0.916 (dev) 0.908 (test)
                        (with degree 2)  accuracy: 0.898 (dev) 0.884 (test)

                           Decision Tree accuracy: 0.910 (dev) 0.898 (test)
                        (with degree 2)  accuracy: 0.886 (dev) 0.900 (test)

                           Random Forest accuracy: 0.952 (dev) 0.942 (test)
                        (with degree 2)  accuracy: 0.948 (dev) 0.932 (test)

