# XGBoost

### Data preprocessing

In [1]:
# Importing the dataset
dataset = read.csv('Churn_Modelling.csv')

In [2]:
head(dataset, 5)

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [3]:
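# Keep columns 4 to 14 (CreditScore ... Exited); RowNumber, CustomerId and Surname carry no signal
dataset = dataset[4:14]
# Drop the third remaining column (Gender)
dataset = dataset[-c(3)]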

In [4]:
head(dataset, 5)

CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
619,France,42,2,0.0,1,1,1,101348.88,1
608,Spain,41,1,83807.86,1,0,1,112542.58,0
502,France,42,8,159660.8,3,1,0,113931.57,1
699,France,39,1,0.0,2,0,0,93826.63,0
850,Spain,43,2,125510.82,1,1,1,79084.1,0


In [5]:
# Encoding the categorical variable Geography as numeric codes
# (France = 1, Spain = 2, Germany = 3), since xgboost needs a numeric matrix
dataset$Geography = as.numeric(factor(dataset$Geography,
                                      levels = c('France', 'Spain', 'Germany'),
                                      labels = c(1, 2, 3)))

In [6]:
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
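
`sample.split` splits on the label so that both sets keep roughly the same ratio of churned to retained customers. The base-R sketch below shows that idea on hypothetical labels. It is not caTools' implementation, and the helper name `stratified_split` is made up for illustration.

```R
# Stratified split in base R: sample the training rows within each class separately
stratified_split = function(y, ratio = 0.8) {
    in_train = logical(length(y))
    for (cls in unique(y)) {
        idx = which(y == cls)
        n_train = round(length(idx) * ratio)
        # sample.int avoids the sample() surprise when idx has length 1
        in_train[idx[sample.int(length(idx), n_train)]] = TRUE
    }
    in_train
}

set.seed(1234)
y = c(rep(0, 80), rep(1, 20))   # hypothetical labels with 20% positives
split = stratified_split(y)
table(y, split)                 # each class is split 80/20
```

Because the sampling happens per class, the share of positives is the same in both parts, which matters when the classes are imbalanced.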

### Fitting XGBoost to the Training set

In [7]:
# install.packages('xgboost')
library(xgboost)

In [18]:
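# Column 10 is the label (Exited), so it is dropped from the feature matrix.
# No objective is specified, so xgboost falls back to its default regression
# objective, which is why the log below reports train-rmse.
classifier = xgboost(data = as.matrix(training_set[-10]), 
                     label = training_set$Exited,
                     nrounds = 10)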

[1]	train-rmse:0.417976 
[2]	train-rmse:0.369643 
[3]	train-rmse:0.341009 
[4]	train-rmse:0.325538 
[5]	train-rmse:0.316370 
[6]	train-rmse:0.309533 
[7]	train-rmse:0.306012 
[8]	train-rmse:0.302703 
[9]	train-rmse:0.300868 
[10]	train-rmse:0.298456 
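
The log above reports train-rmse because the 0/1 label is being fitted as a regression target. As a hedged alternative, not what this notebook ran, the sketch below passes `objective = 'binary:logistic'` so that predictions are probabilities. It uses synthetic data (the features `x1`, `x2` and the label `y` are hypothetical) and the `xgb.DMatrix` / `xgb.train` interface. Argument handling differs somewhat between xgboost releases, so treat the exact call as an assumption to check against your installed version.

```R
library(xgboost)

set.seed(1234)
# Synthetic stand-in for the churn data; x1, x2 and y are hypothetical
n = 500
x = matrix(rnorm(n * 2), ncol = 2, dimnames = list(NULL, c('x1', 'x2')))
y = as.numeric(x[, 1] + 0.5 * x[, 2] + rnorm(n) > 0)

dtrain = xgb.DMatrix(data = x, label = y)
model = xgb.train(params = list(objective = 'binary:logistic',
                                eval_metric = 'logloss'),
                  data = dtrain,
                  nrounds = 10)

# With binary:logistic the predictions are probabilities in [0, 1]
prob = predict(model, dtrain)
pred = as.integer(prob > 0.5)
mean(pred == y)   # training accuracy
```

The 0.5 threshold then has a direct interpretation as a probability cut-off, rather than a cut-off on a regression output.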


***

It is reasonable to treat 0.30 as the point where train-rmse levels off over these 10 rounds. Note, however, that train-rmse (root mean square error on the training data) keeps decreasing as we increase nrounds, and a lower training error alone does not mean better performance on unseen data. I tried it myself, but the output is too long to display here. Try it by running the code below, which uses cross-validation to assess model performance:

```R
set.seed(1234)
cv = xgb.cv(data = as.matrix(training_set[-10]), 
             label = training_set$Exited, 
             nfold = 5,
             nrounds = 500)
```

Also, if you don't want to display **train-rmse**, pass **verbose = 0** as a parameter.
***
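
To make use of a long `xgb.cv` run without reading the whole printout, one option is to inspect the evaluation log that the returned object carries. The sketch below is a minimal illustration on synthetic data (`x1`, `x2` and `y` are hypothetical). It assumes the log is available as `cv$evaluation_log` with a `test_rmse_mean` column, which is how recent xgboost releases name it for the rmse metric.

```R
library(xgboost)

set.seed(1234)
# Synthetic stand-in for the churn data; x1, x2 and y are hypothetical
n = 500
x = matrix(rnorm(n * 2), ncol = 2, dimnames = list(NULL, c('x1', 'x2')))
y = as.numeric(x[, 1] + rnorm(n) > 0)

dtrain = xgb.DMatrix(data = x, label = y)
cv = xgb.cv(params = list(objective = 'reg:squarederror', eval_metric = 'rmse'),
            data = dtrain,
            nfold = 5,
            nrounds = 50,
            verbose = 0)

# One row per boosting round, with mean and sd of train and test rmse
log = as.data.frame(cv$evaluation_log)
best_round = which.min(log$test_rmse_mean)
log[best_round, ]
```

The train rmse usually keeps falling with more rounds, while the held-out rmse flattens or rises again, so the round with the lowest `test_rmse_mean` is a more useful guide for choosing nrounds.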

### Applying k-Fold Cross Validation

In [9]:
library(caret)

Loading required package: lattice
Loading required package: ggplot2


In [10]:
folds = createFolds(training_set$Exited, k = 10)
cv = lapply(folds, function(x){
    training_fold = training_set[-x, ]
    test_fold = training_set[x, ]
    # Fit on the training fold only, so the held-out fold stays unseen
    classifier = xgboost(data = as.matrix(training_fold[-10]), 
                         label = training_fold$Exited,
                         nrounds = 10,
                         verbose = 0)
    y_fold_pred = predict(classifier, newdata = as.matrix(test_fold[-10]))
    y_fold_pred = (y_fold_pred > 0.5)
    cm = table(test_fold[, 10], y_fold_pred)
    accuracy = (cm[1, 1] + cm[2, 2])/(sum(cm))
    return(accuracy)
})

In [11]:
mean(as.numeric(cv)) # accuracy

In [12]:
sd(as.numeric(cv)) # Standard deviation
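
The per-fold accuracy in the cross-validation loop indexes `cm[1, 1]` and `cm[2, 2]`, which fails if a fold's predictions all fall into one class, because the table then has a single column. A minimal base-R sketch of a more defensive version follows. The helper name `fold_accuracy` and the example values are hypothetical.

```R
# Accuracy from a confusion matrix whose rows and columns always cover both classes
fold_accuracy = function(actual, predicted_prob, threshold = 0.5) {
    predicted = as.integer(predicted_prob > threshold)
    cm = table(factor(actual, levels = c(0, 1)),
               factor(predicted, levels = c(0, 1)))
    sum(diag(cm)) / sum(cm)
}

# Hypothetical fold: four observations, three classified correctly
acc = fold_accuracy(actual = c(0, 0, 1, 1),
                    predicted_prob = c(0.1, 0.2, 0.3, 0.9))
acc   # 0.75

# A fold where every prediction is class 0 still works
acc_one_class = fold_accuracy(actual = c(0, 1), predicted_prob = c(0.1, 0.2))
acc_one_class   # 0.5
```

Fixing the factor levels keeps the table 2 x 2 regardless of which classes appear, so the diagonal always holds the correct predictions.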