# ***Customer Churn Prediction with H2O on R***

**Intuition:** We will use H2O to solve a binary classification problem: predicting whether a customer will leave the bank (churn) or stay.

**Dataset Source:** https://www.kaggle.com/shrutimechlearn/churn-modelling

### **Importing the libraries**

In [2]:
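options(warn = -1)  # suppress warnings
library(caTools)    # sample.split() for the train/test split
library(h2o)        # deep learning model
library(caret)      # confusionMatrix() and related metrics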

### **Importing the dataset**

In [3]:
dataset = read.csv('../input/churn-modelling/Churn_Modelling.csv')
head(dataset, 5)

|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<int>` | `<int>` | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` |
| 1 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


### **Dropping the columns that aren't required**

In [4]:
# Keep columns 4 to 14 (CreditScore through Exited); RowNumber, CustomerId and Surname are dropped
dataset = dataset[4:14]
head(dataset, 5)

|  | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<int>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` |
| 1 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |
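Selecting columns by position works as long as the column order never changes. As a minimal sketch of a name-based alternative, the snippet below drops identifier columns from a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from the bank dataset):

```r
# Toy data frame with the same kind of identifier columns (values are made up)
toy <- data.frame(RowNumber   = 1:3,
                  CustomerId  = c(101L, 102L, 103L),
                  Surname     = c("A", "B", "C"),
                  CreditScore = c(619L, 608L, 502L),
                  Exited      = c(1L, 0L, 1L))

# Drop the identifier columns by name rather than by position
id_cols <- c("RowNumber", "CustomerId", "Surname")
toy_features <- toy[, !(names(toy) %in% id_cols), drop = FALSE]

print(names(toy_features))
```

Dropping by name keeps the step correct even if columns are later reordered or added to the CSV.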


### **Encoding the Categorical Variables as Factors**

In [6]:
# Map each category to an integer code: France = 1, Spain = 2, Germany = 3
dataset$Geography = as.numeric(factor(dataset$Geography,
                                      levels = c('France', 'Spain',
                                                 'Germany'),
                                      labels = c(1, 2, 3)))
# Female = 1, Male = 2
dataset$Gender = as.numeric(factor(dataset$Gender,
                                   levels = c('Female', 'Male'),
                                   labels = c(1, 2)))
head(dataset, 5)

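|  | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` |
| 1 | 619 | 1 | 1 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 608 | 2 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 502 | 1 | 1 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 699 | 1 | 1 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 850 | 2 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |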
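Integer codes like these imply an ordering between countries that may not be meaningful to a model. As a hedged illustration on a made-up vector (`geo` is an assumption, not the dataset column), the sketch below shows what the integer encoding produces and how base R's `model.matrix()` could build one indicator column per category instead:

```r
# Made-up category values for illustration
geo <- c("France", "Spain", "France", "Germany")
geo_f <- factor(geo, levels = c("France", "Spain", "Germany"))

# Integer encoding, as in the cell above: France = 1, Spain = 2, Germany = 3
geo_code <- as.numeric(factor(geo,
                              levels = c("France", "Spain", "Germany"),
                              labels = c(1, 2, 3)))
print(geo_code)

# One-hot alternative: one 0/1 column per level, no intercept column
one_hot <- model.matrix(~ geo_f - 1)
print(one_hot)
```

Whether one-hot encoding helps here is an empirical question; the notebook's results were obtained with the integer codes.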


### **Splitting the dataset into Training Set & Test Set**

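In [7]:
set.seed(420)  # make the random split reproducible
# sample.split() returns a logical vector: TRUE for the 80% of rows assigned to training
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
cat("Number of rows in Training Set = ", nrow(training_set), "& Number of rows in Test Set =", nrow(test_set))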
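`sample.split()` splits on the label vector so that both sets keep roughly the same class ratio. The base R sketch below illustrates that idea on made-up labels; `stratified_split` is a hypothetical helper written for this example and is not claimed to match caTools' internal implementation:

```r
set.seed(420)

# Made-up labels with 20% positives (not the bank dataset)
y <- c(rep(0, 80), rep(1, 20))

# Hypothetical helper: sample a fixed fraction of the rows within each class
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

split_toy <- stratified_split(y, 0.8)

# Rows: class label; columns: FALSE = test, TRUE = training
print(table(y, split_toy))
```

Because sampling happens per class, the share of positives is the same in the training and test portions of the toy data.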

### **Feature Scaling**

In [8]:
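# Standardize the ten predictors (every column except Exited, which is column 11).
# Each set is centered and scaled with its own column means and standard deviations.
training_set[-11] = scale(training_set[-11])
test_set[-11] = scale(test_set[-11])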
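A common alternative is to scale the test set with the means and standard deviations learned from the training set, so that no test-set statistics influence preprocessing. The sketch below shows this on two small made-up data frames (`train` and `test` are illustrative assumptions); it relies on the `scaled:center` and `scaled:scale` attributes that base R's `scale()` attaches to its result:

```r
# Made-up numeric data for illustration
train <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40))
test  <- data.frame(x1 = c(2, 6),       x2 = c(15, 25))

# Fit the scaling on the training data only
train_scaled <- scale(train)
center <- attr(train_scaled, "scaled:center")  # training column means
spread <- attr(train_scaled, "scaled:scale")   # training column standard deviations

# Apply the same shift and scale to the test data
test_scaled <- scale(test, center = center, scale = spread)

print(train_scaled)
print(test_scaled)
```

Applying this to the notebook would change the model inputs, so the reported metrics would not necessarily stay the same.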

### **Initializing H2O**

In [9]:
# Start a local H2O cluster using all available cores and up to 8 GB of memory
h2o.init(nthreads = -1, max_mem_size = "8g")


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpiCzJoO/h2o_UnknownUser_started_from_r.out
    /tmp/RtmpiCzJoO/h2o_UnknownUser_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 408 milliseconds 
    H2O cluster timezone:       Etc/UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.26.0.10 
    H2O cluster version age:    1 month and 11 days  
    H2O cluster name:           H2O_started_from_R_root_oop571 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   7.11 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoos

### **Training the Classification Model**

In [10]:
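# Exited is still a numeric 0/1 column here, so H2O fits a regression;
# its output is turned into class labels by thresholding in the next step.
classifier = h2o.deeplearning(y = 'Exited',
                              training_frame = as.h2o(training_set),
                              activation = 'Rectifier',          # ReLU activation
                              hidden = c(6, 6),                  # two hidden layers of 6 units each
                              epochs = 500,                      # passes over the training data
                              train_samples_per_iteration = -2)  # -2 lets H2O auto-tune this value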



### **Predicting the Test Set results**

In [11]:
# Predicted scores for the test set (the Exited column is dropped first)
prob_pred = h2o.predict(classifier, newdata = as.h2o(test_set[-11]))
# Threshold at 0.5 to get 0/1 labels, then bring them back into R as a vector
y_pred = (prob_pred > 0.5)
y_pred = as.vector(y_pred)
reference = test_set[, 11]
y_pred_f = factor(y_pred)
reference_f = factor(reference)
# Pair each prediction with the actual label of the same row
pred_vs_actl = data.frame(Prediction = y_pred_f, Actual = reference_f)



In [12]:
head(pred_vs_actl, 10)

|  | Prediction | Actual |
|---|---|---|
|  | `<fct>` | `<fct>` |
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |
| 10 | 0 | 0 |
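A note on pairing predictions with labels: base R's `merge()` returns the Cartesian product of two data frames when they share no column names, whereas `data.frame()` lines the vectors up row by row. The sketch below shows the difference on made-up factors (`pred` and `actual` are illustrative assumptions):

```r
# Made-up predictions and labels for three observations
pred   <- factor(c(0, 1, 0))
actual <- factor(c(1, 1, 0))

# No shared column name: merge() pairs every prediction with every label
crossed <- merge(data.frame(pred = pred), data.frame(actual = actual))

# Row-wise pairing: one row per observation
paired <- data.frame(Prediction = pred, Actual = actual)

print(nrow(crossed))
print(nrow(paired))
print(paired)
```

For a side-by-side comparison table, the row-wise form is the one that keeps each prediction next to its own label.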


### **Performance of the Model**

In [13]:
cm <- confusionMatrix(y_pred_f, reference_f)
cm

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1520  216
         1   73  191
                                          
               Accuracy : 0.8555          
                 95% CI : (0.8393, 0.8706)
    No Information Rate : 0.7965          
    P-Value [Acc > NIR] : 5.509e-12       
                                          
                  Kappa : 0.4872          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9542          
            Specificity : 0.4693          
         Pos Pred Value : 0.8756          
         Neg Pred Value : 0.7235          
             Prevalence : 0.7965          
         Detection Rate : 0.7600          
   Detection Prevalence : 0.8680          
      Balanced Accuracy : 0.7117          
                                          
       'Positive' Class : 0               
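The headline statistics can be reproduced by hand from the four cell counts, which makes it easier to see what each one measures. The sketch below uses the counts printed in the matrix above, with class 0 (customers who stayed) as the positive class, as caret reports; the variable names are illustrative:

```r
# Cell counts from the confusion matrix, with class 0 as the positive class
tp <- 1520  # predicted 0, actually 0
fn <- 73    # predicted 1, actually 0
fp <- 216   # predicted 0, actually 1
tn <- 191   # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual 0s predicted as 0
specificity <- tn / (tn + fp)   # share of actual 1s predicted as 1
balanced    <- (sensitivity + specificity) / 2

cat(round(c(accuracy, sensitivity, specificity, balanced), 4), "\n")
```

The specificity of about 0.47 is the share of actual churners that the model identifies, which is far lower than the overall accuracy suggests.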
                        

In [14]:
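# caret uses the first factor level ("0", customers who stayed) as the positive class by default,
# so these metrics describe the non-churn class.
prec <- posPredValue(y_pred_f, reference_f)
rec <- recall(y_pred_f, reference_f)
f1_score <- (2 * prec * rec) / (prec + rec)
cat("Precision = ", prec, ", Recall =", rec, ", F1 Score =", f1_score)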

Precision =  0.875576 , Recall = 0.9541745 , F1 Score = 0.9131871
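These figures treat class 0 (customers who stayed) as the positive class. Since churners are usually the class of interest, the same formulas can be applied to class 1 using the counts from the confusion matrix above. This is a minimal sketch with illustrative variable names, not part of the original notebook run:

```r
# Counts from the confusion matrix, now with class 1 (churned) as the positive class
tp_churn <- 191  # predicted 1, actually 1
fp_churn <- 73   # predicted 1, actually 0
fn_churn <- 216  # predicted 0, actually 1

precision_churn <- tp_churn / (tp_churn + fp_churn)
recall_churn    <- tp_churn / (tp_churn + fn_churn)
f1_churn <- (2 * precision_churn * recall_churn) / (precision_churn + recall_churn)

cat("Precision =", round(precision_churn, 4),
    ", Recall =", round(recall_churn, 4),
    ", F1 Score =", round(f1_churn, 4), "\n")
```

For the churn class this gives a precision of about 0.72, a recall of about 0.47 and an F1 score of about 0.57, so the model misses roughly half of the customers who actually leave.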