## INTRODUCTION
Targeting the right audience for a marketing campaign can save a company thousands of dollars when the targeting decisions are driven by analytics, which in turn leads to a higher conversion rate.
The Apriori algorithm is used here to extract rules that can support targeted marketing, based on the banking data and customer demographic information.

### About this Dataset
#### Bank Marketing

__Abstract:__ The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

__Data Set Information:__ The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

__Attribute Information:__

__Bank client data:__

__Age__ (numeric)

__Job :__ type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')

__Marital :__ marital status (categorical: 'divorced', 'married', 'single', 'unknown' ; note: 'divorced' means divorced or widowed)

__Education__ (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')

__Default:__ has credit in default? (categorical: 'no', 'yes', 'unknown')

__Housing:__ has housing loan? (categorical: 'no', 'yes', 'unknown')

__Loan:__ has personal loan? (categorical: 'no', 'yes', 'unknown')

__Related to the last contact of the current campaign:__

__Contact:__ contact communication type (categorical: 'cellular','telephone')

__Month:__ last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

__Day_of_week:__ last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

__Duration:__ last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

__Other attributes:__

__Campaign:__ number of contacts performed during this campaign and for this client (numeric, includes last contact)

__Pdays:__ number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

__Previous:__ number of contacts performed before this campaign and for this client (numeric)

__Poutcome:__ outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

__Social and economic context attributes:__

__Emp.var.rate:__ employment variation rate - quarterly indicator (numeric)

__Cons.price.idx:__ consumer price index - monthly indicator (numeric)

__Cons.conf.idx:__ consumer confidence index - monthly indicator (numeric)

__Euribor3m:__ euribor 3 month rate - daily indicator (numeric)

__Nr.employed:__ number of employees - quarterly indicator (numeric)

__Output variable (desired target):__

__y:__ has the client subscribed to a term deposit? (binary: 'yes', 'no')

### Analysis Approach
* Exploratory Data Analysis
* Association Rule Mining Using the Apriori Algorithm
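
Association rules are usually ranked by three measures: support, confidence and lift. The sketch below computes them by hand in base R for a single rule on a made-up table. The data frame `toy` and its two columns are hypothetical and are not taken from the bank dataset.

```r
# Toy data (hypothetical): each row is a client, each column a yes/no item.
toy <- data.frame(
  prior_success = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  subscribed    = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Rule under study: {prior_success} => {subscribed}
n_total <- nrow(toy)
n_lhs   <- sum(toy$prior_success)
n_rhs   <- sum(toy$subscribed)
n_both  <- sum(toy$prior_success & toy$subscribed)

support    <- n_both / n_total               # share of all rows matching both sides
confidence <- n_both / n_lhs                 # share of LHS rows that also match the RHS
lift       <- confidence / (n_rhs / n_total) # confidence relative to the RHS base rate

print(c(support = support, confidence = confidence, lift = lift))
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its overall frequency in the data; a lift below 1 means the opposite.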

### Source:

Dataset from: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#





In [None]:
# List the files in the input directory
list.files(path = "../input")

In [None]:
## Load Libraries
library(ggplot2)
library(tidyverse)
library(tm)
library(cluster)
library(mclust)
library(arules)
library(arulesViz)

In [None]:
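# Load the full dataset. stringsAsFactors = TRUE keeps text columns as factors
# (no longer the default since R 4.0), which summary() and apriori() rely on later
full_data <- read.csv('../input/bank-additional-full.csv', sep=';', stringsAsFactors = TRUE)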


In [None]:
# Structure of the full dataset
str(full_data)

In [None]:
head(full_data,10)

In [None]:
# bank-additional-names.txt is a free-text description of the dataset,
# so read it line by line rather than as a delimited table
bank_names <- readLines('../input/bank-additional-names.txt')
str(bank_names)

In [None]:
head(bank_names, 30)

### Data Quality Check and EDA

In [None]:
# Check the summary statistics of the dataset and look for columns with missing values
summary(full_data)

There are no missing (NA) values in the dataset, although several columns use an 'unknown' level. In the summary output, the less frequent levels of a factor are grouped under "(Other)": 6156 rows for job and 1749 rows for education. Marital status is 'unknown' for 80 rows. For the target, 36548 clients did not subscribe and 4640 did.
The detailed summary statistics are shown above.
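
Since `NA` and the literal level 'unknown' are two different things, it can help to count them separately. The following is a minimal sketch on a made-up miniature of the data; the data frame `toy` and its values are hypothetical.

```r
# Hypothetical miniature of the bank data, used only for illustration.
toy <- data.frame(
  job     = c('admin.', 'unknown', 'services', 'technician'),
  marital = c('married', 'single', 'unknown', 'unknown'),
  age     = c(34, 51, 28, 45),
  stringsAsFactors = TRUE
)

# True missing values (NA) per column
na_counts <- colSums(is.na(toy))

# Rows labelled 'unknown' per column; numeric columns simply yield 0
unknown_counts <- sapply(toy, function(col) sum(as.character(col) == 'unknown'))

print(na_counts)
print(unknown_counts)
```

The same two lines could be run on `full_data` to see how many 'unknown' entries each column carries.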

In [None]:
# Age distribution
ggplot(full_data, aes(age)) + geom_histogram() + theme_classic() + ggtitle('Age Distribution')

In [None]:
# is there a relationship between job and education?
(table(full_data$job, full_data$education))

In [None]:
# Plot how often each job and education combination occurs (point size = count)
ggplot(full_data, aes(job,education)) + geom_count(color='red') + theme_classic() + ggtitle('job and education frequency') + theme(axis.text.x = element_text(angle = 90, hjust = 1))

In [None]:
# Is there a trend between type of job and response?
ggplot(full_data, aes(job,y)) + geom_count(color='red') + theme_classic() + ggtitle('job and response frequency') + theme(axis.text.x = element_text(angle = 90, hjust = 1))

In [None]:
## check the distribution of pdays
table(full_data$pdays)


In [None]:
## 999 marks clients that were not previously contacted
## count those records
nrow(subset(full_data,pdays==999))

In [None]:
# campaign
hist(full_data$campaign, xlab='campaign') 

The campaign distribution is highly skewed to the right: most clients were contacted only a few times, with a long tail of clients contacted many times.
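
The direction of the skew can also be checked numerically. The sketch below uses a made-up vector of contact counts (`campaign_toy` is hypothetical) and a small helper, `skewness`, written here for illustration rather than taken from a package.

```r
# Hypothetical contact counts: many small values and a few large ones
campaign_toy <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 6, 12, 25)

# Moment-based skewness (population form, dividing by n):
# mean cubed deviation over the 1.5 power of the mean squared deviation.
# Positive values indicate a long right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

print(mean(campaign_toy))     # pulled upward by the large values
print(median(campaign_toy))
print(skewness(campaign_toy))
```

A mean well above the median together with a positive skewness value points to a right-skewed distribution.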

In [None]:
table(full_data$y)


### Data Transformation
Preparing data for association rule mining
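
Apriori works on categorical items, so numeric columns have to be discretized first. One detail worth checking with `cut()` is how the smallest value is treated: by default the intervals are open on the left, so a value equal to the lowest break becomes `NA`. A minimal sketch on a made-up vector (`x` is hypothetical):

```r
# Hypothetical numeric column
x <- c(1, 1, 2, 5, 9, 10)
bins <- 3
breaks <- seq(min(x), max(x), length.out = bins + 1)   # 1, 4, 7, 10

# Default: intervals are (a, b], so values equal to min(x) fall outside
default_bins <- cut(x, breaks = breaks)

# include.lowest = TRUE closes the first interval: [1, 4]
closed_bins <- cut(x, breaks = breaks, include.lowest = TRUE)

print(table(default_bins, useNA = 'ifany'))
print(table(closed_bins, useNA = 'ifany'))
```

With the default call the two rows equal to the minimum are lost to `NA`; with `include.lowest = TRUE` every row lands in one of the three bins.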

In [None]:
#print the column names
colnames(full_data)

In [None]:
# make a copy of the dataframe to keep the original
Tester <- full_data
str(Tester)

In [None]:
# Discretize age into four labelled groups: (12,30], (30,50], (50,70] and (70,Inf]
Tester$age <- cut(Tester$age, breaks=c(12,30,50,70,Inf), labels=c('young','middle_age','senior_age','old'))
table(Tester$age)

In [None]:
# Prefix the yes/no/unknown levels with the column name so the items stay readable in the rules
Tester$default <- dplyr::recode(Tester$default, no='default=no',yes='default=yes',unknown='default=unknown')
Tester$housing <- dplyr::recode(Tester$housing, no='housing=no',yes='housing=yes',unknown='housing=unknown')
Tester$loan <- dplyr::recode(Tester$loan, no='loan=no',yes='loan=yes',unknown='loan=unknown')
head(Tester,10)

In [None]:
# duration
boxplot(Tester$duration)

In [None]:
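# Discretize duration into 3 equal-width bins
max_duration <- max(Tester$duration)
min_duration <- min(Tester$duration)
bins <- 3

# length.out = bins + 1 yields exactly bins + 1 break points;
# include.lowest = TRUE keeps rows equal to the minimum in the first bin instead of NA
Tester$duration <- cut(Tester$duration,
                       breaks = seq(min_duration, max_duration, length.out = bins + 1),
                       include.lowest = TRUE)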


table(Tester$duration)


In [None]:
# campaigns
boxplot(Tester$campaign)

In [None]:
# Discretize campaign into 3 equal-width bins
min_campaign <- min(Tester$campaign)
max_campaign <- max(Tester$campaign)
bins <- 3

# include.lowest = TRUE keeps rows with campaign equal to the minimum in the first bin instead of NA
Tester$campaign <- cut(Tester$campaign,
                       breaks = seq(min_campaign, max_campaign, length.out = bins + 1),
                       include.lowest = TRUE)
table(Tester$campaign)

In [None]:
test <- Tester

#test$pdays <- cut(test$pdays, breaks= seq(0,999,sd(test$pdays)))
# 999 means no prior contact; -Inf as the lowest break keeps rows with pdays == 0 out of NA
test$pdays <- cut(test$pdays, breaks= c(-Inf,200,Inf), labels= c('priorContact','NoPriorContact'))
table(test$pdays)


In [None]:
head(test,10)

In [None]:
# Check the structure of test before copying it back to Tester
str(test)

In [None]:
# previous
test$previous <- as.factor(test$previous)
table(test$previous)

In [None]:
# Cut emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m and nr.employed
# into 3 equal-width bins. include.lowest = TRUE keeps rows equal to the
# column minimum in the first bin instead of turning them into NA.
bins <- 3
econ_cols <- c('emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed')
for (col in econ_cols) {
  breaks <- seq(min(test[[col]]), max(test[[col]]), length.out = bins + 1)
  test[[col]] <- cut(test[[col]], breaks = breaks, include.lowest = TRUE)
}


# target variable : y
test$y <- dplyr::recode(test$y,yes='target=yes', no='target=no')

str(test)


In [None]:
Tester <- test
head(Tester,10)

The dataset is now ready for modeling.
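
Before mining rules it can be worth confirming that every column is a factor (or logical) and that the discretization did not introduce `NA` values. The helper `check_ready` and the data frame `toy` below are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: report columns that still need attention
check_ready <- function(df) {
  is_item_col <- sapply(df, function(col) is.factor(col) || is.logical(col))
  list(
    not_factor = names(df)[!is_item_col],
    has_na     = names(df)[colSums(is.na(df)) > 0]
  )
}

toy <- data.frame(
  age      = cut(c(25, 40, 65), breaks = c(12, 30, 50, Inf),
                 labels = c('young', 'middle_age', 'senior_age')),
  campaign = cut(c(1, 2, 3), breaks = c(1, 2, 3)),  # 1 becomes NA: lower bound is open
  duration = c(120, 300, 45)                        # still numeric
)

print(check_ready(toy))
```

Running `check_ready(Tester)` would list any column that is still numeric or that lost rows during binning.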

### Model

#### Rules for target equal to yes

In [None]:
# Rules targeting yes
rules <- apriori(Tester,parameter = list(supp=0.001, conf=0.8, minlen=3),
                appearance =list(default='lhs', rhs='y=target=yes'),
                control=list(verbose=FALSE))
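
The thresholds passed to `apriori` can be translated into row counts to get a feel for how permissive they are. The sketch below assumes the row count implied by the target counts reported in the data quality check (36548 'no' plus 4640 'yes'); the variable names are chosen here for illustration.

```r
# Row count implied by the reported target counts
n_rows   <- 36548 + 4640
min_supp <- 0.001   # supp in the apriori call
min_len  <- 3       # minlen counts items on both sides of the rule

# Smallest number of rows a rule must cover to reach the support threshold
min_rows <- ceiling(min_supp * n_rows)

# With exactly one item on the right-hand side, the left-hand side
# needs at least minlen - 1 items
min_lhs_items <- min_len - 1

print(n_rows)
print(min_rows)
print(min_lhs_items)
```

So a rule only has to hold for a few dozen clients to pass the support filter, which is why the confidence and lift columns deserve more attention than raw rule counts.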

In [None]:
summary(rules)

In [None]:
#first five rules
inspect(rules[1:5])

In [None]:
# sort rules by confidence
rules <- sort(rules, by='confidence', decreasing = TRUE)
inspect(rules[1:5])

In [None]:
#viz
subrules <- head(rules,5)
plot(subrules, method='graph', interactive=FALSE)

In [None]:
# sort rules by support

rules <- sort(rules, by='support', decreasing=TRUE)
inspect(rules[1:5])

In [None]:
#viz
subrules <- head(rules,5)
plot(subrules, method='graph', interactive=FALSE)

In [None]:
# sort rules by lift
rules <- sort(rules, by='lift', decreasing=TRUE)
inspect(rules[1:5])

In [None]:
#viz
subrules <- head(rules,5)
plot(subrules, method='graph', interactive=FALSE)

#### Rules for target equal to no

In [None]:
rules <- apriori(Tester, parameter=list(supp=0.001, conf=0.8, minlen=3),
                appearance=list(default='lhs', rhs='y=target=no'),
                control=list(verbose=FALSE))
summary(rules)
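
Because 'no' is the majority class, the same confidence threshold means something different for the two targets. The arithmetic below uses the target counts reported in the data quality check (36548 'no', 4640 'yes') and is only a rough sketch of how to read lift at the confidence cut-off of 0.8.

```r
n_no  <- 36548
n_yes <- 4640
n_all <- n_no + n_yes

base_rate_yes <- n_yes / n_all
base_rate_no  <- n_no / n_all
min_conf      <- 0.8

# Lift of a rule sitting exactly at the confidence threshold
lift_at_min_conf_yes <- min_conf / base_rate_yes
lift_at_min_conf_no  <- min_conf / base_rate_no

print(round(c(base_rate_yes = base_rate_yes, base_rate_no = base_rate_no), 3))
print(round(c(yes = lift_at_min_conf_yes, no = lift_at_min_conf_no), 3))
```

A 'no' rule with confidence 0.8 has a lift below 1, meaning it predicts 'no' less often than the overall rate, so sorting the 'no' rules by lift is more informative than relying on the confidence filter alone.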

In [None]:
# sort rules by confidence
rules <- sort(rules, by='confidence', decreasing=TRUE)
inspect(rules[1:5])

In [None]:
#viz
subrules <- head(rules,5)
plot(subrules, method='graph', interactive=FALSE)

In [None]:
#sort rules by support
rules <- sort(rules, by='support', decreasing=TRUE)
inspect(rules[1:5])


In [None]:
#viz
subrules <- head(rules,5)
plot(subrules, method='graph', interactive=FALSE)

In [None]:
# sort by lift
rules <- sort(rules, by='lift', decreasing=TRUE)
inspect(rules[1:5])

In [None]:
#viz
subrules <- head(rules,5)
plot(subrules, method='graph', interactive=FALSE)

### Conclusion

Below are two interesting rules, among others.
1. Customers whose previous campaign outcome was a success, with cons.price.idx between 93.1 and 93.6 and nr.employed between 4960 and 5050, are most likely to subscribe to a term deposit. Clients in this category can be targeted in marketing campaigns.
2. Customers with no personal loan whose previous campaign outcome was a success, with emp.var.rate between -1.8 and -0.2 and cons.conf.idx between -42.8 and -34.9, are most likely to subscribe to a term deposit. Clients in this category can be targeted in marketing campaigns.
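
To act on a rule, its left-hand side can be applied as a filter on the discretized client table. The following is a minimal sketch in base R; the data frame `clients` and the list `rule_lhs` are hypothetical and only mimic the shape of the prepared data.

```r
# Hypothetical discretized client table
clients <- data.frame(
  id       = 1:5,
  poutcome = c('success', 'failure', 'success', 'nonexistent', 'success'),
  loan     = c('loan=no', 'loan=no', 'loan=yes', 'loan=no', 'loan=no')
)

# Left-hand side of a rule, written as column = required value
rule_lhs <- list(poutcome = 'success', loan = 'loan=no')

# Keep the rows that satisfy every condition of the left-hand side
matches <- Reduce(`&`, Map(function(col, value) clients[[col]] == value,
                           names(rule_lhs), rule_lhs))
targets <- clients[matches, ]

print(targets$id)
```

The resulting `targets` table is the list of clients a campaign built on that rule would contact.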