# Building decision trees for classification

In this activity we will build a decision tree for a churn dataset. Let's first import the data.

## Dataset

In [None]:
##### added line to ensure plots are showing
%matplotlib inline
#####

import pandas as pd
import numpy as np

df = pd.read_csv('churn_ibm.csv')
df.head()

We see various variables: demographic ones such as gender, Partner (Yes/No) and tenure (length of contract), as well as service-related ones such as DeviceProtection, PaymentMethod, Contract and MonthlyCharges. Most variables are relatively self-explanatory.

Let's split the data into the independent variables (X) and the dependent variable (y). We drop customerID because it is only an identifier and should not be used for prediction:

In [None]:
y = df['Churn']
X = df.drop(['Churn','customerID'],axis=1)

Something else to consider is that scikit-learn's decision tree implementation can only deal with numeric variables. Let's check whether we need to convert any variables:

In [None]:
X.dtypes

It appears there are a lot of categorical variables of type 'object'. Let's convert those:

In [None]:
# one-hot encode every categorical (object) column and drop the original
for column in X.columns:
    if X[column].dtype == object:
        print('Converting', column)
        dummies = pd.get_dummies(X[column], prefix=column, drop_first=True)
        X = pd.concat([X, dummies], axis=1).drop(columns=[column])
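
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the toy values are an assumption for illustration and are not taken from the churn file). A column with three categories becomes two dummy columns, and the dropped first category is represented by both dummies being zero.

```python
import pandas as pd

# Hypothetical toy data, not taken from churn_ibm.csv
toy = pd.DataFrame({
    'tenure': [1, 24, 60],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
})

# drop_first=True removes the dummy for the first (alphabetically sorted) category
encoded = pd.get_dummies(toy, columns=['Contract'], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one dummy per variable avoids a redundant column, since the last category can always be inferred from the others.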

The dependent variable is also still categorical. This is fine for fitting the tree, but it can't be used to calculate the AUC later. We convert it as well:

In [None]:
y = pd.get_dummies(y, prefix='churn', drop_first=True)
y.head()

## Building the tree

We can very easily fit a decision tree with the default parameters, using a training and a test set:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,y_train)
prediction = decision_tree.predict(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction))

Our tree has the following size:

In [None]:
print('Number of nodes:', decision_tree.tree_.node_count)

We can also plot the ROC curve (because we pass hard class predictions rather than probabilities, the curve has only one threshold):

In [None]:
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, prediction)
plt.plot(fpr, tpr, lw=1, label='ROC')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend()
plt.show()
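
A decision tree can also return class probabilities through `predict_proba`, which gives one score per leaf instead of a hard 0/1 label. The sketch below compares both approaches on synthetic data from `make_classification` (an assumption, used as a stand-in so that the example is self-contained; the variable names are illustrative and not part of the notebook above).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

hard_labels = shallow_tree.predict(X_te)          # 0/1 class labels
scores = shallow_tree.predict_proba(X_te)[:, 1]   # estimated probability of class 1

# roc_curve returns one threshold per distinct score it keeps
fpr_hard, tpr_hard, thr_hard = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, thr_soft = roc_curve(y_te, scores)

print('Thresholds from hard labels:', len(thr_hard))
print('Thresholds from probabilities:', len(thr_soft))
print('AUC from hard labels:', roc_auc_score(y_te, hard_labels))
print('AUC from probabilities:', roc_auc_score(y_te, scores))
```

For a fully grown tree the leaves are usually pure, so the probabilities are themselves 0 or 1 and the two curves coincide. The difference only shows up for shallower trees with mixed leaves.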

We can also change the parameters, using a tree with a maximum depth of 3 and a minimum of 10 samples per leaf. This time, we also use entropy (information gain) as the split criterion instead of the default Gini index:

In [None]:
decision_tree2 = DecisionTreeClassifier(max_depth=3,min_samples_leaf=10,criterion='entropy')
decision_tree2.fit(X_train,y_train)
prediction = decision_tree2.predict(X_test)

print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction))
print('Number of nodes:', decision_tree2.tree_.node_count)

In our run, both accuracy and AUC went up, possibly because the smaller tree with fewer leaves overfits less. Because the train/test split is not seeded, the exact numbers will vary between runs.
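
To see what the two split criteria measure, here is a small worked sketch that computes the Gini index and the entropy of a node from its class counts. The helper functions and the example counts are hypothetical and written for illustration only; scikit-learn computes these internally.

```python
import numpy as np

def gini(class_counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    """Entropy in bits: sum of p * log2(1 / p) over the non-empty classes."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # an empty class contributes nothing
    return np.sum(p * np.log2(1.0 / p))

# Hypothetical node class counts: [non-churners, churners]
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, 'gini =', round(gini(counts), 3), 'entropy =', round(entropy(counts), 3))
```

Both measures are largest for a 50/50 node (Gini 0.5, entropy 1 bit) and zero for a pure node, so either criterion rewards splits that produce purer children.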

## Visualising the tree

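We can also easily visualise our tree with the following code. Note that rendering the figure requires the Graphviz binaries to be installed on the system, in addition to the pydotplus package: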

In [None]:
!pip install pydotplus
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image  
import pydotplus

# create an in-memory text buffer to hold the DOT description of the tree
dot_data = StringIO()

export_graphviz(decision_tree2, out_file=dot_data,
                filled=True, rounded=True, class_names=['churn_no', 'churn_yes'],
                special_characters=True, feature_names=list(X.columns))

# parse the DOT description and render it as a PNG image
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
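
If Graphviz is not available, a plain-text view of the rules is a lightweight alternative. The following sketch uses `export_text` from `sklearn.tree` on synthetic data; the data and the `feature_i` names are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with made-up feature names (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f'feature_{i}' for i in range(X_demo.shape[1])]

small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the tree as an indented string of if/else rules
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

In the notebook above, the same call should work with `decision_tree2` and `list(X.columns)` as the feature names.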

The value list gives us the number of samples of each class present in each node. We can see that most leaves predict the class churn_no.
Note that I used the second tree for this figure. The first one was very big (more than 1,000 nodes), which would be hard to display (and read) anyway.
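Besides limiting the depth and leaf size up front, scikit-learn (version 0.22 and later) also offers post-pruning through the `ccp_alpha` parameter, which applies minimal cost-complexity pruning.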
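
As a minimal sketch of cost-complexity pruning with the `ccp_alpha` parameter (available in scikit-learn 0.22 and later), the example below grows a full tree on synthetic data, asks it for the candidate alpha values, and refits with one of them. The data is a stand-in, and picking the middle alpha is an arbitrary choice for illustration; in practice the value would be selected by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Grow an unpruned tree and compute the candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Arbitrary illustrative choice: the alpha in the middle of the path,
# clipped at zero to guard against tiny negative rounding errors
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print('Nodes before pruning:', full_tree.tree_.node_count)
print('Nodes after pruning:', pruned_tree.tree_.node_count)
print('Test accuracy before:', full_tree.score(X_te, y_te))
print('Test accuracy after:', pruned_tree.score(X_te, y_te))
```

Larger values of `ccp_alpha` remove more of the tree, so the parameter trades tree size against fit to the training data.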