This was an assignment completed during my coursework at the University of Illinois. The churn modeling dataset, provided by the instructor, contains 10,000 observations of bank customer information. I demonstrate the use of logistic regression and decision trees, along with downsampling techniques, to predict customer churn.
The purpose of this analysis is to use logistic regression and decision tree models to predict customer churn at a bank given customer attributes. There are 10,000 observations of 14 variables in the raw data. The RowNumber, CustomerId, and Surname variables were removed since they are not relevant to the analysis at hand (a loading sketch follows the variable list below). We are left with the following variables:
- CreditScore
- Geography
- Gender
- Age
- Tenure
- Balance
- NumOfProducts
- HasCrCard
- IsActiveMember
- EstimatedSalary
- Exited (target)
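A minimal loading sketch of this step; the file name "Churn_Modelling.csv" and the object name churn are assumptions:

```r
library(dplyr)

# Load the raw data (file name assumed)
churn <- read.csv("Churn_Modelling.csv")

# Drop the identifier columns, which carry no predictive information
churn <- select(churn, -RowNumber, -CustomerId, -Surname)

str(churn)  # 10,000 obs. of 11 variables
```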
First, the categorical variables associated with our target variable (Exited) will be identified using mosaic plots and chi-squared tests. A mosaic plot in R is very useful for visualizing the data from a contingency table or two-way frequency table:
- A red block means the observed cell frequency is lower than would be expected if the variables were independent
- A blue block means the observed cell frequency is higher than would be expected if the variables were independent
- A white block means there is little difference between the observed and expected cell frequencies

Variables that were not related to our target variable were removed.
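A sketch of the association check for one categorical predictor, using HasCrCard as the example; shade = TRUE colors the cells by Pearson residuals, which produces the red/blue/white blocks described above:

```r
# Contingency table of one categorical predictor against the target
tbl <- table(churn$HasCrCard, churn$Exited)

# Mosaic plot with residual shading (blue = above expected, red = below expected)
mosaicplot(tbl, shade = TRUE, main = "HasCrCard vs. Exited")

# Chi-squared test of independence; p > 0.05 suggests no association
chisq.test(tbl)
```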
The results show that the only variable not related to Exited is HasCrCard, since its p-value was higher than our alpha of 0.05. The mosaic plot for HasCrCard is also completely white, indicating no relationship between Exited and HasCrCard, so the variable was removed.
To compare the numeric variables with Exited, boxplots and t-tests were used.
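A sketch for one numeric predictor, using Age as the example; the same comparison was repeated for each numeric variable:

```r
# Boxplot of the numeric variable split by the target
boxplot(Age ~ Exited, data = churn, main = "Age by Exited",
        xlab = "Exited", ylab = "Age")

# Welch two-sample t-test comparing group means across Exited
t.test(Age ~ Exited, data = churn)
```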
From the t-tests we found that CreditScore, Age, Balance, NumOfProducts, and IsActiveMember are all related to Exited. Tenure and EstimatedSalary had p-values higher than our alpha of 0.05, so they are not related and were removed in the next step.
The data was first modeled with logistic regression using R's built-in glm() function (from the stats package). We split the data into 80% training and 20% test sets.
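A minimal sketch of the split and fit; the seed, object names, and 0.5 cutoff are assumptions:

```r
set.seed(42)  # hypothetical seed for reproducibility

# 80/20 train/test split
n <- nrow(churn)
train_idx <- sample(n, size = round(0.8 * n))
train <- churn[train_idx, ]
test  <- churn[-train_idx, ]

# Logistic regression: glm() with a binomial family
fit <- glm(Exited ~ ., data = train, family = binomial)

# Predicted probabilities on the test set, thresholded at 0.5
pred_prob  <- predict(fit, newdata = test, type = "response")
pred_label <- ifelse(pred_prob > 0.5, "Exit", "Stay")
actual     <- ifelse(test$Exited == 1, "Exit", "Stay")
table(Predicted = pred_label, Actual = actual)
```

Here are the results of the initial model as a cross table. y-axis = predicted label, x-axis = actual label: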
| Predicted \ Actual | Exit | Stay |
|---|---|---|
| Exit | 87 | 68 |
| Stay | 316 | 1529 |
- total error = 0.192
- false positive rate = 0.44
- false negative rate = 0.17
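For reference, these rates are row proportions of the predicted classes: the false positive rate is the share of predicted exits that actually stayed, and the false negative rate is the share of predicted stays that actually exited. A sketch computing them from the table above:

```r
cm <- matrix(c(87, 68,
               316, 1529),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("Exit", "Stay"),
                             Actual    = c("Exit", "Stay")))

total_error <- (cm["Exit", "Stay"] + cm["Stay", "Exit"]) / sum(cm)  # 0.192
fp_rate     <- cm["Exit", "Stay"] / sum(cm["Exit", ])               # ~0.44
fn_rate     <- cm["Stay", "Exit"] / sum(cm["Stay", ])               # ~0.17
```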
Then, we downsampled the training data using the sample_n() function from the "dplyr" package and retrained the model on the new downsampled data.
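A sketch of the downsampling step, assuming the train object from above; drawing the majority class down to the size of the minority class is an assumption about the exact ratio used:

```r
library(dplyr)
set.seed(42)  # hypothetical seed

# Keep every churner and randomly sample an equal number of non-churners
n_exit <- sum(train$Exited == 1)
train_down <- bind_rows(
  filter(train, Exited == 1),
  sample_n(filter(train, Exited == 0), n_exit)
)

# Retrain logistic regression on the balanced training set
fit_down <- glm(Exited ~ ., data = train_down, family = binomial)
```

Evaluating the retrained model on the same test set gives the following results. y-axis = predicted label, x-axis = actual label: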
| Predicted \ Actual | Exit | Stay |
|---|---|---|
| Exit | 270 | 469 |
| Stay | 133 | 1128 |
- total error = 0.2955
- false positive rate = 0.62
- false negative rate = 0.11
In this case, we want to reduce the number of false negatives, i.e., cases where we incorrectly predict that a customer will stay with the bank, since those errors have the greater impact on the company. The second model is better on this criterion because its false negative rate is lower; however, its total error is greater than the first model's.
Second, we will use a C5.0 decision tree model on the original (non-downsampled) data to predict Exited, to see whether it produces a better or worse result than the logistic regression model.
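A minimal sketch, assuming the C50 package and the train/test objects from earlier:

```r
library(C50)

# C5.0 requires a factor target
train$Exited <- factor(train$Exited)

tree_fit  <- C5.0(Exited ~ ., data = train)
tree_pred <- predict(tree_fit, newdata = test)

table(Actual = test$Exited, Predicted = tree_pred)
```

The resulting cross table (y-axis = actual label, x-axis = predicted label):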
| Actual \ Predicted | 0 | 1 | Row total |
|---|---|---|---|
| 0 | 1533 | 64 | 1597 |
| 1 | 198 | 205 | 403 |
| Column total | 1731 | 269 | 2000 |
- total error = 0.131
- false positive rate = 0.04
- false negative rate = 0.49
Then we ran the C5.0 decision tree model again on the downsampled data.
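A short sketch, reusing the hypothetical train_down and test objects from the earlier steps:

```r
# Same C5.0 fit, now on the downsampled training data
train_down$Exited <- factor(train_down$Exited)
tree_down <- C5.0(Exited ~ ., data = train_down)
tree_pred_down <- predict(tree_down, newdata = test)

table(Actual = test$Exited, Predicted = tree_pred_down)
```

The cross table (y-axis = actual label, x-axis = predicted label):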
| Actual \ Predicted | 0 | 1 | Row total |
|---|---|---|---|
| 0 | 1275 | 322 | 1597 |
| 1 | 100 | 303 | 403 |
| Column total | 1375 | 625 | 2000 |
- total error = 0.214
- false positive rate = 0.20
- false negative rate = 0.25
Based on the total error rate, the C5.0 decision tree model using the non-downsampled data produced the lowest total error and is therefore considered the best model. It is also important to consider the false negative rate, which reflects customers whom we predicted to stay with the bank but who actually left. These customers are more costly to the company than false positives, which have a smaller impact. The downsampled logistic regression model produced the lowest false negative rate but also the highest overall total error. Considering these factors, I would still consider the non-downsampled C5.0 model the best, since the logistic regression model's total error (0.2955) is more than double that of the C5.0 model (0.131).