# Telecoms Churn Data Analysis and Predictive Modelling


A telecommunications company requires a predictive model to identify which customers are likely to leave their plan. The results will inform the Marketing and Customer Retention teams so that resources can be directed to these customers.

In [None]:
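#stringsAsFactors = TRUE keeps the text columns as factors (the default changed in R 4.0)
churn_raw <- read.csv("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv", header = TRUE, stringsAsFactors = TRUE)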

str(churn_raw)

## Checking for missing values

In [None]:
sum(is.na(churn_raw))

## Where are the 11 missing values?

In [None]:
sapply(churn_raw, function(x) sum(is.na(x)))

All of the NAs are in the TotalCharges column. It might be possible to compute the missing totals, since we have data for monthly charges and tenure in months.
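
As a rough sketch of that idea, missing totals could be approximated as the monthly charge multiplied by tenure. The data frame `toy` below is made up for illustration and only borrows its column names from the data set; it is an estimate, not how the company bills.

```r
# Hypothetical toy data using the same column names as churn_raw
toy <- data.frame(
  tenure         = c(0, 12, 24),
  MonthlyCharges = c(52.55, 70.00, 20.25),
  TotalCharges   = c(NA, 845.50, NA)
)

# Approximate missing totals as MonthlyCharges * tenure
# (ignores price changes over time, so it is only an estimate)
missing_total <- is.na(toy$TotalCharges)
toy$TotalCharges[missing_total] <-
  toy$MonthlyCharges[missing_total] * toy$tenure[missing_total]

toy
```

For a customer with zero tenure this estimate is simply 0.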


In [None]:
churn_raw[is.na(churn_raw$TotalCharges),]

These customers all show tenure of zero months so they haven't made their first payment yet. Are there any other zero tenure customers in the data set?

In [None]:
library(tidyverse)

churn_raw %>%
    filter(tenure == 0) %>%
    summarize("Zero Tenure" = n())

These eleven are the only customers with zero tenure, so they can safely be removed.


In [None]:
churnnoNAs <- churn_raw[complete.cases(churn_raw),]
dim(churnnoNAs)

## Data Cleaning

Customer ID isn't useful to our analysis. Neither is Total Charges, since it is highly correlated with Monthly Charges.

In [None]:
churn_neat <- churnnoNAs %>%
                select(-customerID, -TotalCharges) %>%
                rename(Gender = gender, Tenure = tenure)

table(churn_neat$SeniorCitizen)

In [None]:
churn_neat$SeniorCitizen <- as.factor(ifelse(churn_neat$SeniorCitizen == 1, "Yes", "No"))

table(churn_neat$SeniorCitizen)

In [None]:
str(churn_neat)

The variables OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies all require an internet connection, and the variable MultipleLines requires a phone service. I will therefore replace "No internet service" and "No phone service" with "No".

In [None]:
#Columns 9 to 14 are the six internet add-on services
factorrenames <- names(churn_neat[9:14])

data <- churn_neat %>%
    mutate(across(all_of(factorrenames),
                  ~ recode_factor(.x, `No internet service` = "No"))) %>%
    mutate(MultipleLines = recode_factor(MultipleLines, `No phone service` = "No"))

str(data)


## Data Exploration

In [None]:
churnrate <- table(data$Churn) / nrow(data)

churnrate

Over the entire data set, 26.5% of customers churned. 

I will create a trainControl object so that all of the models use 3 repeats of 10-fold cross-validation on the training data (70% of the data set). I will then use the remaining 30% of the data to test model accuracy.

I will use both the area under the ROC curve (AUC) and accuracy as metrics for assessing model performance.
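
To make the two metrics concrete, here is a minimal base R sketch on made-up predictions. The labels, probabilities and the helper `auc_rank` are assumptions introduced for illustration; caret computes both metrics itself during training.

```r
# Hypothetical true labels and predicted churn probabilities
truth    <- factor(c("No", "No", "Yes", "No", "Yes", "Yes"), levels = c("No", "Yes"))
prob_yes <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.90)

# Accuracy: share of correct labels at a 0.5 probability cutoff
pred     <- factor(ifelse(prob_yes > 0.5, "Yes", "No"), levels = c("No", "Yes"))
accuracy <- mean(pred == truth)

# AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen churner is scored above a randomly chosen non-churner
auc_rank <- function(score, is_positive) {
  r     <- rank(score)  # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(prob_yes, truth == "Yes")

c(accuracy = accuracy, auc = auc)
```

Accuracy depends on the chosen cutoff, whereas AUC summarises the ranking across all cutoffs, which is why the two can disagree about which model is best.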

In [None]:
set.seed(1)

#Shuffling data
rowindices <- sample(nrow(data))

data_shuffled <- data[rowindices,]

#Identifying row to split on for 70/30 training/test split
split <- round(nrow(data_shuffled) * 0.7)
split

In [None]:
train <- data_shuffled[1:split,]
test <- data_shuffled[(split+1):nrow(data_shuffled),]

dim(train)
dim(test)
library(caret)

#Using 3 repeats of 10-fold cross validation to fit each model to the training data
control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = FALSE
)

## Logistic Regression Model

In [None]:
glm_model <- train(Churn ~ ., data = train,
                metric = "ROC",
                method = "glm",
                trControl = control,
                preProcess = c("center","scale")
                )

glm_model

### Predictive Capability

In [None]:
glm_pred <- predict(glm_model, newdata = test)


glmcm <- confusionMatrix(glm_pred, test[["Churn"]])
glmaccuracy <- glmcm$overall[c(1,3,4)]
glmcm

## Generalised Linear Model - Ridge and Lasso Regression

In [None]:
glmnet_model <- train(Churn ~ ., data = train,
  metric = "ROC",
  method = "glmnet",
  trControl = control,
  preProcess = c("center","scale")
)

plot(glmnet_model)

In [None]:
glmnet_model$bestTune$alpha

Alpha = 0.55 maximises AUC
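
For context, `glmnet` fits a penalised logistic regression, and alpha controls the blend between the ridge and lasso penalties. In the notation of the glmnet documentation (these symbols are introduced here and are not used elsewhere in this notebook), the fitted coefficients solve

$$
\min_{\beta_0,\,\beta}\; \frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\ \beta_0 + x_i^{\top}\beta\right) \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
$$

where $\ell$ is the negative log-likelihood contribution of observation $i$ and $\lambda \ge 0$ sets the overall penalty strength. Setting $\alpha = 0$ gives ridge regression and $\alpha = 1$ gives the lasso, so a value near 0.5 weights the two penalties roughly equally.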

### Predictive Capability

In [None]:
glmnet_pred <- predict(glmnet_model, newdata = test)

glmnetcm <- confusionMatrix(glmnet_pred, test[["Churn"]])
glmnetaccuracy <- glmnetcm$overall[c(1,3,4)]
glmnetcm

## Random Forest

In [None]:
rf_model <- train(Churn ~ ., data=train,
  metric = "ROC",
  method = "ranger",
  trControl = control
)

plot(rf_model)

mtry = 2 yields the largest AUC value

### Predictive Capability

In [None]:
rf_pred <- predict(rf_model, newdata = test)
rfcm <- confusionMatrix(rf_pred, test[["Churn"]])
rfaccuracy <- rfcm$overall[c(1,3,4)]
rfcm

## K-Nearest Neighbours

In [None]:
knn_model <- train(Churn ~ ., data = train,
                   metric = "ROC",
                   method = "knn", trControl = control,
                   preProcess = c("center","scale"), tuneLength = 50)

In [None]:
knn_model
plot(knn_model, type="s", main = "AUC for KNN", xlab = "k")

K = 75 maximised the AUC

### Predictive Capability

In [None]:
knn_pred <- predict(knn_model, newdata = test)
knncm <- confusionMatrix(knn_pred, test[["Churn"]])
knnaccuracy <- knncm$overall[c(1,3,4)]
knncm

## Support Vector Classifier

In [None]:
#Trying a range of values for the Cost parameter
grid <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25))

#tuneGrid fixes the candidate values, so tuneLength is not needed
svm_linear_model <- train(Churn ~ ., data = train, method = "svmLinear",
                 metric = "ROC",
                 trControl = control,
                 preProcess = c("center", "scale"),
                 tuneGrid = grid)

svm_linear_model

In [None]:
plot(svm_linear_model)

Cost of 0.01 produced the model with the greatest AUC

### Predictive Capability

In [None]:
svm_linear_pred <- predict(svm_linear_model, newdata = test)
svmcm <- confusionMatrix(svm_linear_pred, test[["Churn"]])
svmcm

svm_linearaccuracy <- svmcm$overall[c(1,3,4)]

# Model Comparison

In [None]:
#Collecting the cross-validation results from all five models
model_list <- list(Logistic = glm_model, GLMnet = glmnet_model, RandomForest = rf_model, kNN = knn_model, SVM = svm_linear_model)

model_resamples <- resamples(model_list)

summary(model_resamples)

In [None]:
dotplot(model_resamples, metric="ROC", main = "Area Under Curve with 95% CI")

GLMnet and logistic regression show the greatest cross-validated AUC.

In [None]:
#Model names must follow the same order as the rows bound below
models <- c("Logistic", "GLMnet", "Random Forest", "KNN", "SVM")

accuracysummary <- bind_rows(Logistic = glmaccuracy, GLMnet = glmnetaccuracy, RandomForest = rfaccuracy, kNN = knnaccuracy, SVM = svm_linearaccuracy)

library(tibble)

accuracysummary2 <- add_column(accuracysummary, "Model" = models, .before = "Accuracy")

accuracysummary2

In [None]:
#Accuracy is plotted as a proportion, so the axis runs from 0.7 to 0.85
ggplot(accuracysummary2, aes(x = Model, y = Accuracy)) + geom_col() +
        geom_errorbar(width = 0.2, aes(ymin = AccuracyLower, ymax = AccuracyUpper), color = "black") +
        coord_cartesian(ylim = c(0.7, 0.85)) +
        labs(y = "Accuracy", x = "Model", title = "Model Prediction Accuracy with 95% CI") +
        theme_minimal()

In this case all models show significantly greater predictive accuracy than the null model, which predicts 'No' for every customer and has an accuracy of 72.6%. The Logistic and GLMnet methods are almost identical in terms of results.
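
caret's `confusionMatrix` reports this comparison as the no-information rate (`AccuracyNull`) together with a one-sided binomial test p-value (`AccuracyPValue`). A minimal sketch of the same kind of test in base R follows; the counts are hypothetical and are not the results obtained above.

```r
# Hypothetical test-set counts, for illustration only
n_test    <- 2000   # number of test customers
n_correct <- 1600   # correct predictions (80% accuracy)
null_rate <- 0.726  # accuracy of always predicting "No"

# One-sided exact binomial test: is accuracy greater than the null rate?
bt <- binom.test(n_correct, n_test, p = null_rate, alternative = "greater")
bt$p.value
```

A small p-value indicates that the model's accuracy is unlikely to have arisen from a classifier that is no better than always predicting the majority class.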

Consequently, I would recommend that the logistic model be used, since it is much more interpretable than the GLMnet model, which blends lasso and ridge regression.

## What does the profile of a customer who is likely to churn look like?

In [None]:
summary(glm_model)

Tenure, ContractOneYear, ContractTwoYear and PaymentMethod are the most significant variables. This suggests that customers who have been with the company for a long time and those on one- or two-year contracts are the most likely to stay.
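
Coefficient signs and sizes are often easier to read as odds ratios. The sketch below fits a logistic regression to simulated data and exponentiates the coefficients; the variable names, sample size and true coefficients are assumptions made for illustration, not estimates from the churn data.

```r
set.seed(42)

# Simulated customers: longer tenure lowers the churn probability
n      <- 1000
tenure <- sample(0:72, n, replace = TRUE)
churn  <- rbinom(n, size = 1, prob = plogis(1 - 0.05 * tenure))

fit <- glm(churn ~ tenure, family = binomial)

# exp(coefficient) is the multiplicative change in the odds of churning
# for a one-unit (here one-month) increase in the predictor
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio below 1 means the predictor is associated with lower odds of churning. Note that the models in this notebook centre and scale the predictors, so their coefficients describe a change of one standard deviation rather than one unit.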

In [None]:
levels(data$Contract)

Customers on month-to-month contracts are the most likely to churn, so incentivising these customers to take longer-term contracts seems likely to reduce the churn rate.

In [None]:
levels(data$PaymentMethod)

Customers who pay by electronic check tend to churn much more than those using the other payment methods, so incentivising a switch to another payment method is likely to reduce the churn rate.
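
A claim like this can be checked directly by tabulating the churn rate within each payment method. The sketch below uses a tiny made-up data frame with placeholder method names "A" and "B"; none of its values come from the churn data set.

```r
# Hypothetical toy data, not taken from the churn data set
toy <- data.frame(
  PaymentMethod = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Churn         = c("Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No")
)

# Row-wise proportions: share of "No" and "Yes" within each payment method
churn_by_method <- prop.table(table(toy$PaymentMethod, toy$Churn), margin = 1)
churn_by_method
```

With the notebook's data, the equivalent call would be `prop.table(table(data$PaymentMethod, data$Churn), margin = 1)`.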