# Term Deposit Prediction with Naive Bayes, Decision Trees and Random Forest

#### About the dataset:

The dataset relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to the bank's term deposit product. Which raises the question:

#### What is a term deposit?

A term deposit is a deposit offered by a bank or other financial institution at a fixed rate (often better than that of a regular deposit account), with the money returned at a specified maturity date. For more information on term deposits, see this Investopedia article: https://www.investopedia.com/terms/t/termdeposit.asp

<img src='https://i.imgur.com/qCaNTiu.png' style='width:1000px;height:500px'/>

#### Approach:

#### In order to optimize marketing campaigns with the help of the dataset, I will take the following steps:

- Import libraries and data: perform an initial analysis, taking a first look at the data and
    analysing the importance of the "duration" feature

- Exploratory analysis: look at rows, columns and structure, search for missing values, run univariate analysis
    and check the class proportions to identify a class imbalance problem

- Deal with the class imbalance problem: use an oversampling technique to address it

- Feature engineering: removing outliers, and converting and creating new features based
    on insights from the exploratory analysis

- Modelling: fit naive Bayes, decision tree and random forest models, and analyse results
    with confusion matrices and ROC curves
    
- Conclusion

### Importing required libraries and datasets

In [None]:
library(tidyverse) # Manipulation
library(gridExtra) # Visualization
library(GGally) # Visualization
library(caret) # Modeling
library(naivebayes) # Modeling
library(rpart) # Modeling
library(rpart.plot) # Visualization
library(randomForest)
library(ROCR) # Visualization
library(e1071) # Modeling
library(repr) # Visualization
library(devtools) # Loading external functions from github

In [None]:
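# Loading the dataset; stringsAsFactors = TRUE keeps text columns (including y) as factors,
# which randomForest() and confusionMatrix() expect (R >= 4.0 no longer does this by default)
bank <- read.csv("../input/bank-full.csv", header = TRUE, sep = ';', stringsAsFactors = TRUE)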

Let's take a first look at our dataset by viewing its first rows

In [None]:
head(bank) # taking a first look

**Important note about the `duration` feature:** this attribute highly affects the output
target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed.

Let's see the impact of this feature by fitting a quick random forest

In [None]:
# Fitting a random forest to analyse duration importance
rf <- randomForest(y ~.,bank,ntree=100,importance=T)

In [None]:
# Getting variable importance from the model
importance <- as.data.frame(rf$importance)
importance$features <- row.names(importance)
importance$MeanDecreaseGini <- round(importance$MeanDecreaseGini,2)

# Plotting a bar chart of feature importance
options(repr.plot.width = 24, repr.plot.height = 10)
ggplot(importance, aes(x = features, y = MeanDecreaseGini)) +
  geom_bar(fill = "#009999", stat = "identity") +
  geom_text(aes(label=MeanDecreaseGini),size=6, vjust = -0.3) +
  ggtitle('Features Importance')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)
        ,axis.title = element_text(size = rel(2.2), angle = 1)
       ,plot.title = element_text(size = rel(2.5))
        ,axis.text =  element_text(size = rel(2))
        ,axis.ticks = element_line(size = 1))

As we can see, the feature `duration` has by far the greatest importance for predicting our target, more than double that of any other feature. Let's remove it to build a realistic model

In [None]:
# Dropping the duration feature (the column is named `duration`)
bank$duration <- NULL

## Exploratory Analysis

First of all, let's take a quick look at our dataset before any plots

In [None]:
head(bank) # seeing the first rows (head() shows 6 by default)

`duration` was removed and everything seems to be okay. Let's verify dimensions and structure, and look for missing values
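A small defensive check can help at this point. In R, assigning `NULL` to a column name that does not exist is silently a no-op, so a typo in the name leaves the column in place. The sketch below uses a made-up data frame called `toy` (not the notebook's `bank`) to show the behaviour and one way to assert that a column is really gone.

```r
# Toy data frame standing in for `bank` (values are made up)
toy <- data.frame(age = c(30, 45), duration = c(120, 0), y = c("yes", "no"))

# Assigning NULL to a name that is not a column changes nothing and raises no error
toy$durations <- NULL
"duration" %in% names(toy)   # the column is still there

# Dropping by the exact column name works
toy$duration <- NULL

# Fail loudly if the column survived
stopifnot(!"duration" %in% names(toy))
names(toy)
```

The same `stopifnot()` line could be run against the real data frame right after dropping a feature.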

In [None]:
paste('Rows: ',dim(bank)[1])
paste('Columns: ',dim(bank)[2])

In [None]:
str(bank) # verifying data structure

In [None]:
sum(is.na(bank)) # verifying missing values

Our class (y) indicates whether the client subscribed to the term deposit.
We don't have missing values, so let's first look at our numerical
features against our class with boxplots

In [None]:
# Creating a boxplot for each numerical variable vs the target
bp_age <- ggplot(bank, aes(x = y, y = age)) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "Age") +
  scale_x_discrete(name = "Target") +
  ggtitle("Age") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

bp_balance <- ggplot(bank, aes(x = y, y = balance )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "Balance") +
  scale_x_discrete(name = "Target") +
  ggtitle("Balance") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

bp_day <- ggplot(bank, aes(x = y, y = day )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "Day") +
  scale_x_discrete(name = "Target") +
  ggtitle("Day") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

bp_campaign <- ggplot(bank, aes(x = y, y = campaign )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "Campaign") +
  scale_x_discrete(name = "Target") +
  ggtitle("Campaign") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

bp_pdays <- ggplot(bank, aes(x = y, y = pdays )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "pdays") +
  scale_x_discrete(name = "Target") +
  ggtitle("Pdays") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

bp_previous <- ggplot(bank, aes(x = y, y = previous )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "Previous") +
  scale_x_discrete(name = "Target") +
  ggtitle("Previous") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

#### Boxplot of each numerical feature vs our class

These boxplots show that our numerical features have outliers. Let's deal with them later, in feature engineering
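To make "outlier" concrete: a boxplot conventionally flags points lying more than 1.5 times the interquartile range beyond the box. The sketch below applies that convention with base R's `boxplot.stats()` to a made-up vector (`balance` here is a toy vector, not the dataset column). ggplot2 computes its quartiles slightly differently from the hinges used by `boxplot.stats()`, so borderline points may not match exactly.

```r
# Toy numeric column (made-up values); 5000 is deliberately extreme
balance <- c(120, 340, 0, 560, 410, 275, 5000, 390)

# boxplot.stats() reports points more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(balance)$out
outliers

# Share of observations flagged as outliers
outlier_share <- mean(balance %in% outliers)
outlier_share
```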

In [None]:
# Plotting all six boxplots in a single figure (2 rows x 3 columns)
options(repr.plot.width = 24, repr.plot.height = 12)
grid.arrange(bp_age, bp_balance, bp_day
             ,bp_campaign, bp_pdays, bp_previous
             , nrow = 2, ncol = 3)

In [None]:
# Counting the frequency of yes and no for each category of each discrete variable
job <- bank %>% count(job, y)

marital <- bank %>% count(marital, y)

education <- bank %>% count(education, y)

default <- bank %>% count(default, y)

housing <- bank %>% count(housing, y)

loan <- bank %>% count(loan, y)

contact <- bank %>% count(contact, y)

month <- bank %>% count(month, y)

poutcome <- bank %>% count(poutcome, y)

In [None]:
# Creating a bar chart of the frequencies above for each variable and storing it
bp_job <- ggplot(job, aes(x = job, y = n )) +
  geom_bar(aes(fill = job), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Job") +
  ggtitle("Job") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_marital <- ggplot(marital, aes(x = marital, y = n )) +
  geom_bar(aes(fill = marital), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "Marital") +
  ggtitle("Marital") +
  facet_wrap(~y) + 
  theme_gray() +
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_education <- ggplot(education, aes(x = education, y = n )) +
  geom_bar(aes(fill = education), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "education") +
  ggtitle("Education") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_default <- ggplot(default, aes(x = default, y = n )) +
  geom_bar(aes(fill = default), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "default") +
  ggtitle("Default") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_housing <- ggplot(housing, aes(x = housing, y = n )) +
  geom_bar(aes(fill = housing), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "housing") +
  ggtitle("Housing") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_loan <- ggplot(loan, aes(x = loan, y = n )) +
  geom_bar(aes(fill = loan), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "loan") +
  ggtitle("Loan") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_contact <- ggplot(contact, aes(x = contact, y = n )) +
  geom_bar(aes(fill = contact), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "contact") +
  ggtitle("Contact") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_month <- ggplot(month, aes(x = month, y = n )) +
  geom_bar(aes(fill = month), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "month") +
  ggtitle("month") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

bp_poutcome <- ggplot(poutcome, aes(x = poutcome, y = n )) +
  geom_bar(aes(fill = poutcome), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "poutcome") +
  ggtitle("Poutcome") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

#### Analysis of each categorical feature vs our class (y) with bar charts

In [None]:
# Plotting the graphics together
options(repr.plot.width = 30, repr.plot.height = 14)
grid.arrange(bp_marital, bp_education, bp_default, bp_housing
             , nrow = 2, ncol = 2)

Clearly, people who are married, have secondary education and have no housing loan (`housing = 'no'`) tend to subscribe to a term deposit. Let's use this to create new features in feature engineering

In [None]:
options(repr.plot.width = 30, repr.plot.height = 14)
grid.arrange(bp_loan, bp_contact, bp_month, bp_poutcome
             , nrow = 2, ncol = 2)

People with no loan and with unknown poutcome tend to subscribe to a term deposit more often; the months of May, August and July also look better for running the campaign

As we can see above, some categories have more power to predict the class yes. Later, in feature engineering, let's deal with that and create new features
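The bars above show raw counts, which are dominated by how common each category is. As a complementary view, one could compare the share of "yes" within each category. The sketch below does this with base R's `table()` and `prop.table()` on a made-up data frame (`toy` and its values are illustrative only, not results from the dataset).

```r
# Made-up data: "married" is the larger group but has the lower yes-rate
toy <- data.frame(
  marital = c("married", "married", "married", "married", "single", "single"),
  y       = c("no", "no", "no", "yes", "no", "yes")
)

# Cross-tabulate category against the class
counts <- table(toy$marital, toy$y)

# Row-wise proportions: each row sums to 1
rates <- prop.table(counts, margin = 1)

# Share of "yes" within each category
round(rates[, "yes"], 2)
```

In this toy example both groups contain one "yes", yet the rate is 0.25 for married and 0.5 for single, which is the kind of difference raw counts can hide.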

#### Analysing correlations between numerical features, we can see that there is no need to treat high correlations

In [None]:
# Plotting correlation between numerical variables
options(repr.plot.width = 20, repr.plot.height = 20)
bank %>%
  keep(is.numeric) %>%
  ggcorr(name = 'correlations'
         ,label = T
         ,size = 10
         ,label_alpha = T
         ,label_color = 'black'
         ,label_round = 2
         ,label_size = 16)+
theme(legend.text = element_text(size = 25, colour = "black")
       ,legend.title = element_text(size = 25,face = "bold"))

#### Plotting class proportion
We clearly have a class imbalance problem here. Let's solve it using an oversampling technique

In [None]:
# counting yes and no frequency
proportion <- bank %>% count(y)

# Plotting a bar chart of the class proportion
options(repr.plot.width = 14, repr.plot.height = 10)
ggplot(proportion, aes(x = y, y = n)) +
  geom_bar(fill = c('blue','red'), stat = "identity") +
  geom_text(aes(label=n),size = 7, vjust = -0.3) +
  ggtitle('Class proportion')+
theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

## Dealing with the class imbalance problem

Let's use oversampling to deal with the class imbalance problem. ***But what is oversampling?***

***Oversampling*** is a widely used technique for addressing class imbalance. It increases the number of instances in the minority class by randomly replicating them, so that the minority class has a higher representation in the sample. This helps keep our model from learning to predict only the majority class.
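A minimal sketch of random oversampling in base R, on made-up data (`toy`, `test_idx` and the other names are illustrative, not part of the notebook). As a stated assumption that differs from the flow used in this notebook, the sketch holds out the test rows first and replicates minority rows only inside the training partition, so that no copy of a test row can end up in the training data.

```r
set.seed(42)

# Toy imbalanced data (made up): 8 "no" rows and 2 "yes" rows
toy <- data.frame(x = 1:10, y = factor(c(rep("no", 8), rep("yes", 2))))

# Hold out a test set first (here simply rows 4, 5 and 10)
test_idx <- c(4, 5, 10)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Size of the majority class and positions of the minority rows in the training set
n_major  <- max(table(train$y))
minority <- which(train$y == "yes")

# Draw minority rows with replacement until both classes have the same size
extra <- minority[sample.int(length(minority), n_major - length(minority), replace = TRUE)]
train_bal <- train[c(seq_len(nrow(train)), extra), ]

table(train_bal$y)   # both classes now have the same count
```

`caret::upSample()` performs the same kind of replication in one call; the manual version is shown only to make the mechanism visible.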

<img src='https://i.imgur.com/yROWkoG.jpg' style='width:386px;height:248px'/>

First of all, let's create a new dataset with our classes balanced by oversampling

In [None]:
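set.seed(42)
# Creating a new dataset with oversampling. The predictors are passed without y,
# because upSample() appends the target as a new `Class` column; leaving y among
# the predictors would let the models read the answer directly
bank_oversample <- upSample(x = bank[, names(bank) != "y"], y = bank$y)

dim(bank_oversample) # seeing the new dataset dimensions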

Let's see the new class proportion

In [None]:
# Counting the frequency of each class
proportion2 <- bank_oversample %>% count(Class)

# Plotting a bar chart of the class proportion
options(repr.plot.width = 14, repr.plot.height = 10)
ggplot(proportion2, aes(x = Class, y = n)) +
  geom_bar(fill = c('blue','red'), stat = "identity") +
  geom_text(aes(label=n),size=7, vjust = -0.3) +
  ggtitle('Class proportion')+
theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

Perfect! Our classes are now balanced and we can create our first model. First, let's create train and test data without oversampling to use for validation

In [None]:
set.seed(42) # setting a seed to reproduce this model 
inTrain <- createDataPartition(bank$y, p = 0.7, list = FALSE) # Partitioning the dataset 70% train 30% test

train_noover <- bank[inTrain, ] # Creating the train dataset

test_noover <- bank[-inTrain, ] # Creating the test dataset

cat('no_oversampling train dataset dimensions: ',dim(train_noover),'\n')
cat('no_oversampling test dataset dimensions: ',dim(test_noover),'\n')

Now let's create train and test data with oversampling to train our model

In [None]:
set.seed(42) # setting a seed to reproduce this model 
inTrain2 <- createDataPartition(bank_oversample$Class, p = 0.7, list = FALSE) # Partitioning the dataset 70% train 30% test

train_over <- bank_oversample[inTrain2, ] # Creating the train dataset

test_over <- bank_oversample[-inTrain2, ] # Creating the test dataset

cat('Oversampling train dataset dimensions: ',dim(train_over),'\n')
cat('Oversampling test dataset dimensions: ',dim(test_over),'\n')

### Creating the first model using naive bayes for classification

Here I'll use this first model to compare performance before and after feature engineering

In [None]:
# Training the model
set.seed(42)
nb1 = naive_bayes(Class ~ . , laplace = 1, usekernel = F, data = train_over)

# Predicting on train_over dataset
nb_train_pred1 <- predict(nb1, train_over, type = "class")

Loading a custom function to plot the confusion matrix

In [None]:
# Importing the draw_confusion_matrix.R function, used to plot confusion matrix results, found at:
# https://github.com/wellingtsilvdev/codes-with-real-utilities/commit/1e7cc00ce21b2edd29922de868189bf9779a5b57

source_url('https://github.com/wellingtsilvdev/codes-with-real-utilities/blob/master/draw_confusion_matrix.R?raw=TRUE') # loading the function from GitHub

In [None]:
# Plotting confusion matrix
confusion_train <- confusionMatrix(nb_train_pred1, train_over$Class, positive = 'yes')
draw_confusion_matrix(confusion_train)

68% accuracy, with sensitivity and specificity well balanced. Let's test our model on the test dataset

In [None]:
# Predicting in test_noover dataset
nb_test_pred1 <- predict(nb1, test_noover, type = "class")

confusion_test <- confusionMatrix(nb_test_pred1, test_noover$y, positive = 'yes')
draw_confusion_matrix(confusion_test) # plotting confusionmatrix

68% accuracy is not good enough, but the model is stable, with 66% sensitivity and 68% specificity.
Let's try to improve this with feature engineering
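For readers less familiar with these metrics, the sketch below derives accuracy, sensitivity and specificity by hand from a confusion table, with "yes" as the positive class. The labels and predictions are made up for illustration (`truth` and `pred` are hypothetical vectors, not model output from this notebook).

```r
# Made-up true labels and predictions, with "yes" as the positive class
truth <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"), levels = c("no", "yes"))
pred  <- factor(c("yes", "yes", "no",  "no", "no", "no", "yes", "yes"), levels = c("no", "yes"))

# Rows are predictions, columns are the true classes
cm <- table(Predicted = pred, Actual = truth)

tp <- cm["yes", "yes"]  # true positives
tn <- cm["no", "no"]    # true negatives
fp <- cm["yes", "no"]   # false positives
fn <- cm["no", "yes"]   # false negatives

accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual "yes" that were found
specificity <- tn / (tn + fp)        # share of actual "no" that were found

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

`caret::confusionMatrix()` reports these same quantities; with imbalanced classes, sensitivity and specificity are usually more informative than accuracy alone.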

## Feature Engineering

###### Dealing with numerical features

Treating campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

As we can see in our boxplot, the campaign feature has
more predictive power with 4 or fewer contacts. Let's split campaign into
priority (<=4) and not_priority (>4)

In [None]:
bank$campaign_cat <- as.factor(ifelse(bank$campaign<=4,'priority','not_priority'))
bank$campaign <- NULL # Removing the old feature

Treating age and classifying into young, mature and old

Here I define clients aged 30 or less as young, 50 or more as old, and those in between as mature
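As a side note, the same three bands can be written in a single call with base R's `cut()`. This is a minimal sketch on made-up ages, assuming integer ages as in this dataset; `age` and `age_cat` below are standalone toy vectors, not the notebook's columns.

```r
# Toy ages chosen to sit on and around the band boundaries
age <- c(18, 30, 31, 49, 50, 72)

# Intervals are right-closed by default: (-Inf, 30], (30, 49], (49, Inf]
# With integer ages this gives <= 30 young, 31 to 49 mature, >= 50 old
age_cat <- cut(age,
               breaks = c(-Inf, 30, 49, Inf),
               labels = c("young", "mature", "old"))

data.frame(age, age_cat)
```

The cell below builds the same categories with successive `ifelse()` calls.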

In [None]:
bank$age_cat <-  ifelse(bank$age>=50,'old',NA)
bank$age_cat <-  ifelse(bank$age<=30,'young',bank$age_cat)
bank$age_cat <-  ifelse(bank$age>30 & bank$age<50,'mature',bank$age_cat)
bank$age_cat <- as.factor(bank$age_cat)

A `previous` value of 0 means that the client was never contacted before. Let's separate out
the clients who were never contacted before

In [None]:
# Classifying previous into contacted before and non_contacted before
bank$cont_before <- as.factor(ifelse(bank$previous==0,'no','yes'))

Classifying client balances by wealth level: balances below 0 are labelled negative, 0 to 10,000 lvl1, above 10,000 up to 40,000 lvl2, and above 40,000 lvl3

In [None]:
# Balances below 0 are 'negative'; the remaining levels split the non-negative balances
bank$balance_lvl <- ifelse(bank$balance<0,'negative',NA)
bank$balance_lvl <- ifelse(bank$balance<=10000 & bank$balance>=0,'lvl1',bank$balance_lvl)
bank$balance_lvl <- ifelse(bank$balance>10000 & bank$balance<=40000,'lvl2',bank$balance_lvl)
bank$balance_lvl <- ifelse(bank$balance>40000,'lvl3',bank$balance_lvl)
bank$balance_lvl <- as.factor(bank$balance_lvl)

Converting the day feature into start, middle and end of month

In [None]:
bank$moth_stage <- ifelse(bank$day<=7,'start_m',NA)
bank$moth_stage <- ifelse(bank$day>=22,'final_m',bank$moth_stage)
bank$moth_stage <- ifelse(bank$day<22 & bank$day>7,'middle_m',bank$moth_stage)
bank$moth_stage <- as.factor(bank$moth_stage)

Everything seems okay. Let's check our dataset and look for possible missing values

In [None]:
head(bank) # seeing the first rows (head() shows 6 by default)

In [None]:
sum(is.na(bank)) # verifying na's

#### Let's visualize our new features with bar graphics

In [None]:
# Visualizing new categorical features
campaign_cat <- bank %>% count(campaign_cat, y)
age_cat <- bank %>% count(age_cat, y)
cont_before <- bank %>% count(cont_before, y)
balance_lvl <- bank %>% count(balance_lvl, y)
moth_stage <- bank %>% count(moth_stage, y)

# Creating a bar chart of the frequencies above for each variable and storing it
options(repr.plot.width = 10, repr.plot.height = 7)
campaign_cat <- ggplot(campaign_cat, aes(x = campaign_cat, y = n )) +
  geom_bar(aes(fill = campaign_cat), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "campaign_cat") +
  ggtitle("campaign_cat") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

age_cat <- ggplot(age_cat, aes(x = age_cat, y = n )) +
  geom_bar(aes(fill = age_cat), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "age_cat") +
  ggtitle("age_cat") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

cont_before <- ggplot(cont_before, aes(x = cont_before, y = n )) +
  geom_bar(aes(fill = cont_before), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "cont_before") +
  ggtitle("cont_before") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

balance_lvl <- ggplot(balance_lvl, aes(x = balance_lvl, y = n )) +
  geom_bar(aes(fill = balance_lvl), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "balance_lvl") +
  ggtitle("balance_lvl") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

moth_stage <- ggplot(moth_stage, aes(x = moth_stage, y = n )) +
  geom_bar(aes(fill = moth_stage), stat = "identity", color = "white") +
  scale_y_continuous(name = "Frequency") +
  scale_x_discrete(name = "moth_stage") +
  ggtitle("moth_stage") +
  facet_wrap(~y) + 
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1)
        ,legend.text = element_text(size = 16, colour = "black")
       ,legend.title = element_text(size = 16,face = "bold"))

In [None]:
# Plotting the graphics together
options(repr.plot.width = 34, repr.plot.height = 14)
grid.arrange(campaign_cat, age_cat, balance_lvl, moth_stage,cont_before
             , nrow = 2, ncol = 3)

Here we can see that clients of mature age, with balance lvl1, and who were never contacted before are more interesting targets for our campaign. Let's create some new features with this information

#### Creating new features based on categorical features

In [None]:
# Creating a feature to represent people married and with no house
bank$no_house_married <- as.factor(ifelse(bank$marital=='married' & bank$housing=='no',1,0))

# Creating a feature to represent people with secondary education and no loan
bank$no_loan_secedu <- as.factor(ifelse(bank$education=='secondary' & bank$loan=='no',1,0))

# Creating a feature to represent people with no default and no house
bank$no_credit_no_house <- as.factor(ifelse(bank$default=='no' & bank$housing=='no',1,0))

# Creating a feature to represent people who are contacted by cellular and were never contacted before
bank$cell_cont_before <- as.factor(ifelse(bank$contact=='cellular' & bank$cont_before=='no',1,0))

# Creating a feature to represent people that are in mature age and have a wealth lvl1
bank$mature_lvl1 <- as.factor(ifelse(bank$age_cat=='mature' & bank$balance_lvl=='lvl1',1,0))

#### Creating new features based on numerical features

In [None]:
# Creating a feature with relation between age and balance
bank$age_balan <- bank$balance/bank$age

# Creating a feature with relation between previous and pdays
bank$previous_pdays <- bank$previous/bank$pdays

Let's take a look into our new numerical features

In [None]:
# visualizing new numerical features
bp_age_balan <- ggplot(bank, aes(x = y, y = age_balan )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "age_balan") +
  scale_x_discrete(name = "Target") +
  ggtitle("age_balan") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

bp_previous_pdays <- ggplot(bank, aes(x = y, y = previous_pdays )) +
  geom_boxplot(fill = "#228B22", colour = "#1F3552", alpha = 0.6) +
  scale_y_continuous(name = "previous_pdays") +
  scale_x_discrete(name = "Target") +
  ggtitle("previous_pdays") +
  theme_gray() + 
  theme(axis.text.x = element_text(angle = 1, hjust = 1)
        ,axis.title = element_text(size = rel(2), angle = 1)
       ,plot.title = element_text(size = rel(2.2))
        ,axis.text =  element_text(size = rel(1.8))
        ,axis.ticks = element_line(size = 1))

# Plotting both boxplots in a single figure
options(repr.plot.width = 16, repr.plot.height = 10)
grid.arrange(bp_previous_pdays, bp_age_balan
             , nrow = 1, ncol = 2)

###### As we can see in the boxplots, our new features contain outliers. Let's deal with them by converting these features into categories

age_balan has more predictive power between 0 and 100, so let's separate this region from the outliers

In [None]:
bank$cat_age_balan <- ifelse(bank$age_balan>=100,'out',NA) # 100 or more is classified as out (outlier)
bank$cat_age_balan <- ifelse(bank$age_balan<0,'out',bank$cat_age_balan) # less than 0 is also classified as out
bank$cat_age_balan <- ifelse(bank$age_balan>=0 & bank$age_balan<100,'in',bank$cat_age_balan)
bank$cat_age_balan <- as.factor(bank$cat_age_balan) # converting into factors

In previous_pdays almost all data are outliers, so let's separate them

In [None]:
# let's classify previous_pdays values different from 0 as outliers
bank$cat_previous_pdays <- as.factor(ifelse(bank$previous_pdays==0,'inn','outt'))

Now I remove the features that were used to create new features, or that were treated (outliers removed) and
saved into another feature

In [None]:
# removing numerical variables
bank$age <- NULL
bank$balance <- NULL
bank$previous <- NULL
bank$pdays <- NULL
bank$day <- NULL
bank$age_balan <- NULL
bank$previous_pdays <- NULL

## Modelling

### Creating the second model using naive bayes for classification

Here let's create the samples with and without oversampling again, now with the new features

In [None]:
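# Using oversampling; y is left out of the predictors because upSample() appends it as `Class`
set.seed(42)
bank_oversample2 <- upSample(x = bank[, names(bank) != "y"], y = bank$y)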

# Creating train and test data without oversampling to do validation
set.seed(42)
inTrain <- createDataPartition(bank$y, p = 0.7, list = FALSE)

train_noover <- bank[inTrain, ]

test_noover <- bank[-inTrain, ]

# Creating train and test data with oversampling to training
set.seed(42)
inTrain2 <- createDataPartition(bank_oversample2$Class, p = 0.7, list = FALSE)

train_over <- bank_oversample2[inTrain2, ]

test_over <- bank_oversample2[-inTrain2, ]

In [None]:
# Training the model2
set.seed(42)
nb2 = naive_bayes(Class ~ . , laplace = 1, usekernel = F, data = train_over)

# Predicting on train_over dataset
nb_train_pred2 <- predict(nb2, train_over, type = "class")

confusion_train <- confusionMatrix(nb_train_pred2, train_over$Class, positive = 'yes')
options(repr.plot.width = 16, repr.plot.height = 12) # adjusting the plot size
draw_confusion_matrix(confusion_train)

Clearly no improvement in accuracy. Let's predict on the test dataset

In [None]:
# Predicting on the test dataset without oversampling
nb_test_pred2 <- predict(nb2, test_noover, type = "class")

confusion_test <- confusionMatrix(nb_test_pred2, test_noover$y, positive = 'yes')
options(repr.plot.width = 16, repr.plot.height = 12) # adjusting the plot size
draw_confusion_matrix(confusion_test) # plotting confusion matrix

As we can see, we had almost no improvement, but our model is stable. Maybe naive Bayes is not the best model
for this problem, so let's test other algorithms later

#### Evaluating the model with ROC curve

In [None]:
# Plotting the ROC curve
library(ROCR) # provides prediction() and performance()
probabilits <- predict(nb2, test_noover, type = 'prob')

nb2_probs <- prediction(probabilits[,2], test_noover$y)
plot(performance(nb2_probs, "tpr", "fpr"), col = "red", main = "ROC Curve - Naive Bayes")
abline(0,1, lty = 8, col = "grey")

In [None]:
#AUC
auc <- performance(nb2_probs, "auc")
value_auc <- slot(auc, "y.values")[[1]]
value_auc

**73%** AUC. That's interesting, but let's try to improve on it by testing other models
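To make the AUC figure more concrete: it can be read as the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber. The sketch below computes it from ranks (the Mann-Whitney formulation) in base R. The `scores` and `labels` vectors and the `auc_rank()` helper are made up for illustration and are not part of the notebook or of ROCR.

```r
# Hypothetical predicted probabilities of "yes" and the true labels
scores <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.20, 0.90, 0.55)
labels <- factor(c("no", "no", "yes", "yes", "no", "no", "yes", "yes"))

# Illustrative helper: rank-based AUC
auc_rank <- function(scores, labels, positive = "yes") {
  r     <- rank(scores)                 # average ranks, so ties are handled
  n_pos <- sum(labels == positive)
  n_neg <- sum(labels != positive)
  # Mann-Whitney U statistic of the positive class, scaled to [0, 1]
  (sum(r[labels == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(scores, labels)  # 0.875: 14 of the 16 yes/no pairs are ordered correctly
```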

### Creating the Third model using Decision Trees for classification

In [None]:
# Training the model
tree1 <- rpart(Class ~ ., data = train_over, method = 'class')

# Plotting the decision tree
rpart.plot(tree1, box.palette = 'RdBu',
           shadow.col = 'gray',
           nn = TRUE, main = 'Decision Tree')
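One way to limit overfitting in a tree like this is cost-complexity pruning: `rpart` stores a cross-validated error for each candidate tree size, and the tree can be cut back to the size where that error is lowest. The sketch below shows the mechanics on the `kyphosis` dataset that ships with `rpart`, used here only as a stand-in. Whether pruning would help on the bank data is not something this example shows.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the bank data here
set.seed(42)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cptable has one row per candidate tree size; "xerror" is the cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Cut the tree back to the size with the lowest cross-validated error
pruned <- prune(fit, cp = best_cp)
printcp(pruned)
```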

Now let's predict on the train and test datasets to compare performance and check for overfitting

In [None]:
# Predicting on train_over dataset
tree_train_pred1 <- predict(tree1, train_over, type = "class")

confusion_train <- confusionMatrix(tree_train_pred1, train_over$Class, positive = 'yes')
options(repr.plot.width = 16, repr.plot.height = 12) # adjusting the plot size
draw_confusion_matrix(confusion_train) # plotting confusion matrix

In [None]:
# Predicting on the test_noover dataset
tree_test_pred1 <- predict(tree1, test_noover, type = "class")

confusion_test <- confusionMatrix(tree_test_pred1, test_noover$y, positive = 'yes')
options(repr.plot.width = 16, repr.plot.height = 12) # adjusting the plot size
draw_confusion_matrix(confusion_test) # plotting confusion matrix

As we can see, the decision tree improves accuracy, but sensitivity is too low, only 56%: the model was not able to generalize to the non-oversampled dataset. Let's take a look at the ROC curve

In [None]:
# Plotting ROC curve for decision tree model
probabilits <- predict(tree1, test_noover, type = 'prob')

tree_probs <- prediction(probabilits[,2], test_noover$y)
plot(performance(tree_probs, "tpr", "fpr"), col = "red", main = "ROC Curve - Decision Tree")
abline(0,1, lty = 8, col = "grey")

In [None]:
# AUC
auc_trees <- performance(tree_probs, "auc")
value_auc_trees <- slot(auc_trees, "y.values")[[1]]
value_auc_trees

72%: not bad, but a sensitivity of 56% is not enough.
Let's test another model and try to push accuracy and AUC above 80%
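Sensitivity also depends on the cutoff used to turn probabilities into classes. With default settings, `predict(..., type = "class")` picks the most probable class, which for two classes is effectively a 0.5 cutoff. The sketch below uses made-up probabilities to show how lowering the cutoff trades specificity for sensitivity. The vectors and the `rates_at()` helper are illustrative assumptions, not output from the notebook's models.

```r
# Hypothetical predicted probabilities of "yes" and the true labels
prob_yes <- c(0.92, 0.61, 0.48, 0.44, 0.37, 0.30, 0.22, 0.15, 0.08, 0.05)
truth    <- c("yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no")

# Illustrative helper: sensitivity and specificity at a given cutoff
rates_at <- function(threshold) {
  pred <- ifelse(prob_yes >= threshold, "yes", "no")
  c(threshold   = threshold,
    sensitivity = sum(pred == "yes" & truth == "yes") / sum(truth == "yes"),
    specificity = sum(pred == "no"  & truth == "no")  / sum(truth == "no"))
}

# Compare the default cutoff with a lower one
round(t(sapply(c(0.5, 0.35), rates_at)), 2)
```

In this toy case the 0.5 cutoff finds 2 of the 4 subscribers with no false alarms, while 0.35 finds all 4 at the cost of one false alarm.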

### Creating the Fourth model using random forest for classification 

In [None]:
# Training the model
set.seed(42)
forest1 <- randomForest(Class ~ ., data = train_over, ntree = 100, importance = TRUE)

# Predicting on train_over dataset
forest_train_pred1 <- predict(forest1, train_over, type = "class")

options(repr.plot.width = 14, repr.plot.height = 10)
plot(forest1, main = 'Random Forest Error Decreasing') # Plotting the forest error by number of trees

#### Looking at variable importance

In [None]:
# Getting variable importance from the model
importance <- as.data.frame(forest1$importance)
importance$features <- row.names(importance)
importance$MeanDecreaseGini <- round(importance$MeanDecreaseGini,2)

# Plotting a bar chart of feature importance
options(repr.plot.width = 24, repr.plot.height = 10)
ggplot(importance, aes(x = features, y = MeanDecreaseGini)) +
  geom_bar(fill = "#009999", stat = "identity") +
  geom_text(aes(label=MeanDecreaseGini),size=6, vjust = -0.3) +
  ggtitle('Features Importance')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)
        ,axis.title = element_text(size = rel(2.2), angle = 1)
       ,plot.title = element_text(size = rel(2.5))
        ,axis.text =  element_text(size = rel(2))
        ,axis.ticks = element_line(size = 1))

The feature month has the greatest predictive power, almost double that of the other features.
With this information we can redirect the marketing campaign based on the best months. Let's evaluate the predictions on the train dataset

In [None]:
# Saving and plotting confusion matrix
options(repr.plot.width = 16, repr.plot.height = 12) # adjusting the plot size
confusion_forest_train <- confusionMatrix(forest_train_pred1, train_over$Class, positive = 'yes')
draw_confusion_matrix(confusion_forest_train)

#### Eureka! Finally 80%+ accuracy and 76%+ sensitivity, a great improvement. Let's test the model on the validation dataset

In [None]:
# Predicting on the test_noover dataset
forest_test_pred1 <- predict(forest1, test_noover, type = "class")

confusion_forest_test <- confusionMatrix(forest_test_pred1, test_noover$y, positive = 'yes')
draw_confusion_matrix(confusion_forest_test)

**83% accuracy, 74% sensitivity and 84% specificity**: our model is stable and was able to generalize.
Let's see the ROC curve

In [None]:
# Plotting ROC curve for random forest model
prob_forest <- predict(forest1, test_noover, type = 'prob')

forest_probs <- prediction(prob_forest[,2], test_noover$y)
plot(performance(forest_probs, "tpr", "fpr"), col = "red", main = "ROC Curve - Random Forest")
abline(0,1, lty = 8, col = "grey")

A beautiful ROC curve! Let's take a look at the Area Under the Curve (AUC)

In [None]:
#AUC from random forests
auc_forest <- performance(forest_probs, "auc")
value_auc_forest <- slot(auc_forest, "y.values")[[1]]
value_auc_forest

86% AUC, a great model! Random forests are definitely more interesting for this application because they offer more accuracy and more stability

### Conclusion

Our fourth model, with random forests, was much better than naive Bayes and decision trees, and also much better than
our first model: an AUC of 86% and accuracy of 83%, with good levels of sensitivity and specificity. It is a good model to apply in the real world.

Now we can use this model to direct the phone calls to the clients who are most likely to
subscribe to a term deposit. With this we can reduce costs and improve the bank's profits.
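As a rough sketch of how such a call list could be built, the clients can be sorted by their predicted probability of subscribing and only the top share kept. The `clients` table, its values, and the 25% cutoff below are made up for illustration. In the notebook the probabilities would presumably come from `predict(forest1, newdata, type = "prob")`.

```r
# Hypothetical scored clients; prob_yes stands in for the model's predicted
# probability of subscribing
clients <- data.frame(
  client_id = 1:8,
  prob_yes  = c(0.12, 0.81, 0.45, 0.67, 0.05, 0.93, 0.30, 0.58)
)

# Sort from most to least likely subscriber and keep the top 25% for the call list
ranked    <- clients[order(-clients$prob_yes), ]
n_calls   <- ceiling(0.25 * nrow(ranked))
call_list <- head(ranked, n_calls)
call_list  # clients 6 and 2, the two highest scores
```

The share of clients to call is a business choice: it depends on the cost of a call and the value of a subscription, which the dataset does not provide.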