### LSE Data Analytics Online Career Accelerator 

# DA301:  Advanced Analytics for Organisational Impact

## Demonstration video: Create a decision tree

#### *This is a possible solution to the demonstration video.*

In this tutorial, I’ll demonstrate, step by step, how to fit a classification decision tree model to the data set from the Telcom National study.

A decision tree is a type of machine learning (ML) algorithm. It is a tree-like model of questions and decisions with their possible consequences, including outcomes, resource costs, and utility. It is a graphical representation of the alternative solutions available at a certain point in time. Simply stated, a decision tree, which mirrors how humans reason, is built by asking a series of questions, each following from the yes or no answer to the previous one, until a final choice can be made.

For example, the decision-making process for deciding whether to stay living at your current location or move to another country may lead you to consider what to do with your possessions, and then, for example, to a range of more or less affordable (and reliable) options.

Consider the following scenario:

Telcom National (TN) wants to determine whether a customer is likely to churn; in other words, it wants to predict the likelihood of a new customer 'churning', that is, leaving the service provider. As data analysts, we will explore the decision tree model and how we can use it to get a satisfactory answer.
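To make the idea of "a series of yes/no questions" concrete, here is a minimal hand-written sketch of a churn decision tree. The questions, their order, and the field names (`has_loan`, `is_married`, `owns_house`) are invented for illustration; they are not learned from the TN data and are not the tree fitted later in this notebook.

```python
# Hypothetical hand-written decision tree; the questions are invented
# for illustration and are not learned from the TN data.
def predict_churn(customer):
    # Question 1: does the customer have a loan?
    if customer["has_loan"]:
        # Question 2 (yes branch): is the customer married?
        if customer["is_married"]:
            return "stay"
        return "churn"
    # Question 2 (no branch): does the customer own a house?
    if customer["owns_house"]:
        return "stay"
    return "churn"


# An invented customer record to walk through the tree.
example = {"has_loan": True, "is_married": False, "owns_house": True}
print(predict_churn(example))
```

A fitted model differs in that the algorithm chooses the questions and their order from the training data, rather than a person writing them by hand.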

# 

# 1. Prepare your workstation

In [None]:
# Import all necessary libraries.
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn
from sklearn import metrics

# Provides classes and functions to estimate many different statistical methods.
import statsmodels.api as sm  

# Note: SMOTE oversamples the minority class; train_test_split splits the data into train and test sets.
from imblearn.over_sampling import SMOTE  
from sklearn.model_selection import train_test_split

# Note: Suppress warning messages, which flag situations that aren't necessarily exceptions.
import warnings  
warnings.filterwarnings('ignore')  

# Read the provided CSV file/data set.
df = pd.read_csv('customer_data.csv')  

# Print a summary of the DataFrame to sense-check it.
df.info()

# 

# 2. Update variables

In [None]:
# Count how often each value occurs in the education column.
df['Edu'].value_counts() 

In [None]:
# Update all the details of the education column.
df.loc[df['Edu'].str.contains('basic'),'Edu' ] = 'pre-school'
df.loc[df['Edu'].str.contains('university'),'Edu' ] = 'uni'
df.loc[df['Edu'].str.contains('high'),'Edu' ] = 'high-school'
df.loc[df['Edu'].str.contains('professional') ,'Edu'] = 'masters'
df.loc[df['Edu'].str.contains('illiterate'),'Edu' ] = 'other'
df.loc[df['Edu'].str.contains('unknown'),'Edu' ] = 'other'

# Display all the unique values/check changes.
df['Edu'].unique() 

# 

# 3. Create dummy variables

In [None]:
# List the categorical columns to convert to dummy variables.
cat_vars=['Occupation','Status','Edu','House','Loan',
          'Comm','Month','DOW','Last_out']

# Loop over the categorical columns and create dummy variables for each one.
for var in cat_vars:
    # Create one indicator column per category, prefixed with the column name.
    cat_list = pd.get_dummies(df[var], prefix=var)
    # Join the indicator columns to the original DataFrame.
    df = df.join(cat_list)

# Drop the original categorical columns now that they are encoded.
df_fin = df.drop(cat_vars, axis=1)

# We have already specified the column names. 
# This code snippet is only used in the video for explanation purposes.
# Specify the column names:
# cat_vars=['Occupation','Status','Edu','House','Loan',
#           'Comm','Month','DOW','Last_out']

# List all the column names in the DataFrame.
df_vars = df.columns.values.tolist()

# Keep only the columns that are not in cat_vars.
to_keep = [i for i in df_vars if i not in cat_vars]

# Define the final DataFrame (equivalent to the drop() call above).
df_fin = df[to_keep]

# View the column names.
df_fin.columns.values
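As a small illustration of what `pd.get_dummies()` produces, the sketch below applies the same join-then-drop pattern to a toy DataFrame. The toy values are invented and only loosely mirror the `Edu` column.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({"Edu": ["uni", "high-school", "uni", "other"]})

# Create one indicator column per distinct category, named '<prefix>_<value>'.
dummies = pd.get_dummies(toy["Edu"], prefix="Edu")

# Join the indicator columns and drop the original text column.
toy_fin = toy.join(dummies).drop("Edu", axis=1)

print(toy_fin.columns.tolist())
print(toy_fin)
```

Each row has exactly one indicator switched on, so the text category is fully represented by the new columns.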

# 

# 4. Balance the data

In [None]:
# Replace missing values in df_fin with zero.
df_fin = df_fin.fillna(0)  

# Select necessary columns. 
nec_cols = [ 'Status_divorced', 'Status_married',
            'Status_single', 'Status_unknown', 
            'Edu_high-school', 'Edu_masters', 
            'Edu_other', 'Edu_pre-school', 
            'Edu_uni', 'House_no', 'House_unknown',
            'House_yes', 'Loan_no', 'Loan_unknown', 
            'Loan_yes', 'DOW_fri', 'DOW_mon']

X = df_fin[nec_cols]
y = df_fin['Target']

# Create a SMOTE object to oversample the minority class, as the target variable is not balanced.
os = SMOTE(random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)

# Store the feature column names.
columns = X_train.columns
# Oversample the training data only; the test set is left untouched.
os_data_X, os_data_y = os.fit_resample(X_train, y_train)

# Create two DataFrames: one for X and one for y.
os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=['Target'])

# Print/check the DataFrame.
print("Length of oversampled data is ",len(os_data_X))
os_data_y

In [None]:
# Check whether the target classes are now balanced.
os_data_y['Target'].value_counts()  
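The core idea behind SMOTE is to create synthetic minority-class rows by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two invented points. It is a simplified illustration, not imblearn's implementation, and the helper name `synthesize` is hypothetical.

```python
import random


def synthesize(sample, neighbour, rng):
    """Return a point on the line segment between two minority-class samples."""
    gap = rng.random()  # Random fraction in [0, 1).
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(0)

# Two invented minority-class feature vectors.
minority_a = [1.0, 2.0]
minority_b = [3.0, 6.0]

# The synthetic point lies somewhere between the two originals.
new_point = synthesize(minority_a, minority_b, rng)
print(new_point)
```

Repeating this step until the minority class matches the majority class in size is what balances the training data.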

# 

# 5. Build and fit the decision tree model

In [None]:
# Import the DecisionTreeClassifier class from sklearn. 
from sklearn.tree import DecisionTreeClassifier  

# Create a classification decision tree classifier object as dtc. 
dtc = DecisionTreeClassifier(criterion='gini',
                             max_depth=4,
                             random_state=1)

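# Train the decision tree classifier on the oversampled training data.
# Pass the target as a 1-D Series, which is the shape scikit-learn expects.
dtc = dtc.fit(os_data_X, os_data_y['Target'])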

# Predict the response for the test data set.
y_pred = dtc.predict(X_test)  
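The `criterion='gini'` setting means candidate splits are scored by Gini impurity: one minus the sum of squared class proportions in a node. The following sketch computes it by hand for an invented parent node and one invented split; the helper names `gini` and `weighted_gini` are hypothetical and are not part of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Invented labels: 0 = not churn, 1 = churn.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini(parent))                # 0.5
print(weighted_gini(left, right))  # 0.375
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent, as it is here.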

# 

# 6. Determine the accuracy of the model

In [None]:
# Import confusion_matrix from the scikit-learn metrics module.
from sklearn.metrics import confusion_matrix

# Use the print() function to display the confusion matrix results.
print(confusion_matrix(y_test, y_pred))

# Metrics for accuracy: (TP + TN)/(TP + FP + TN + FN).
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) 

# Metrics for precision: TP/(TP + FP).
print("Precision:",metrics.precision_score(y_test, y_pred)) 

# Metrics for recall: TP/(FN + TP).
print("Recall:",metrics.recall_score(y_test, y_pred)) 
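To show how the three formulas in the comments relate to the confusion matrix, here is a worked example with invented counts; they are not the results of this model. For binary labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, with observed classes in rows and predicted classes in columns.

```python
# Hypothetical confusion matrix, invented for illustration
# (not the output of the model in this notebook).
matrix = [[50, 10],
          [5, 35]]

# Unpack in scikit-learn's layout: rows are observed, columns are predicted.
(tn, fp), (fn, tp) = matrix

# Apply the formulas from the comments above.
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (fn + tp)

print("Accuracy:", accuracy)
print("Precision:", round(precision, 3))
print("Recall:", recall)
```

With these counts, accuracy is 0.85, precision is about 0.778, and recall is 0.875.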

> Extra code snippets for clarity

In [None]:
# Import Seaborn for visualisation.
import seaborn as sns

# Plot the confusion matrix computed from the test data as a heatmap.
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g')

In [None]:
# Create a DataFrame to display the confusion matrix with labelled rows and columns.
pd.DataFrame(confusion_matrix(y_test, y_pred),
             index=['observed_notchurn', 'observed_churn'],
             columns=['predicted_notchurn', 'predicted_churn'])

In [None]:
# Import the necessary package.
from sklearn.metrics import classification_report  

# Print the precision, recall, F1-score, and support for each class.
print(classification_report(y_test, y_pred))  

# 

# 7. Visualise the model

In [None]:
# Import matplotlib to create a visualisation and the tree package from sklearn.
import matplotlib.pyplot as plt 
from sklearn import tree

# Plot the decision tree to create the visualisation.
fig, ax = plt.subplots(figsize=(20, 20))
tree.plot_tree(dtc, fontsize=12)

# Display the plot with plt.show().
plt.show()  

# 

# 8. Conclusion

What does the decision tree tell us? What can we report back to Telcom National?
> Accuracy and recall indicate a reasonable model fit. Although the decision tree's accuracy of 73% is lower than the 82% achieved by the binary logistic regression (BLR) model, it still classifies more customers correctly than incorrectly.

# 