# Exploratory Data Analysis with R - Bank Marketing

The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Source of the data: https://archive.ics.uci.edu/ml/datasets/bank+marketing

**Citation Request:**
This dataset is publicly available for research. The details are described in S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014

**Attribute Information:**

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

## Task

1. Explore the data to find how different features (age, job, education, and others) affect the desired outcome (the client subscribed to a term deposit). For this analysis, I will use a marketing KPI called *Conversion Rate*.  Conversion rate is the percentage of clients who take the desired action.
2. Give recommendations for the Bank's marketing strategy and future marketing campaigns.
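
As a minimal sketch of the KPI, assuming the outcome is already coded as 1 (subscribed) and 0 (not subscribed), the conversion rate is the number of conversions divided by the number of clients, times 100. The helper name `conversion_rate` and the toy vector are illustrative and not part of the dataset.

```r
# Hypothetical helper: percentage of clients who converted.
conversion_rate <- function(y) {
  # y is assumed to be a 0/1 vector (1 = subscribed)
  sum(y) / length(y) * 100
}

# Toy outcomes for 8 clients, 2 of whom subscribed
toy_y <- c(0, 1, 0, 0, 0, 1, 0, 0)
toy_rate <- conversion_rate(toy_y)
print(toy_rate)  # 25
```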

## Loading the data and R packages

In [None]:
library(dplyr)
library(ggplot2)
data <- read.csv("../input/bank-marketing-campaigns-dataset/bank-additional-full.csv", 
                 header = TRUE, sep = ";")

head(data)

The column "y" has binary values "yes" and "no" (subscribed to a term deposit). I'm going to encode it into 1s and 0s. After that, I can easily calculate the conversion rate.

In [None]:
data <- data %>%
  mutate(y=ifelse(y=="no", 0, 1))
data$y <- as.integer(data$y)

#total number of conversions
sum(data$y)

#total number of clients in the data
nrow(data)

#conversion rate
sum(data$y)/nrow(data)*100.0

Now that I have the overall conversion rate for this data set, **11.26%**, let's look at how the conversion rate varies across the different features.

## Conversion Rate by Age

In [None]:
#group clients into 6 age groups (<=30, 31-40, 41-50, 51-60, 61-70, 70+)
#open-ended breaks keep clients aged 20 or younger out of the NA group
conversionsAgeGroup <- data %>%
  group_by(AgeGroup=cut(age, breaks=c(-Inf, 30, 40, 50, 60, 70, Inf),
                        labels=c("<=30", "31-40", "41-50", "51-60", "61-70", "70+"))) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100)

#visualizing conversions by age group
ggplot(data=conversionsAgeGroup, aes(x=AgeGroup, y=ConversionRate)) +
  geom_bar(width=0.5, stat="identity", fill="darkgreen") + 
  labs(title="Conversion Rates by Age Group")

As the plot shows, clients aged 60 and over responded better to the bank marketing campaign than the other age groups.
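
One detail worth checking when binning ages with `cut()`: intervals are right-closed by default, and values outside the outermost breaks come back as `NA`. The sketch below uses made-up ages to compare finite breaks with open-ended breaks; the labels are illustrative choices.

```r
# Made-up ages, including values at or below 20 and above 70
toy_age <- c(18, 20, 25, 30, 31, 70, 75)

# Finite breaks: ages <= 20 and > 70 fall outside every interval and become NA
finite_bins <- cut(toy_age, breaks = seq(20, 70, by = 10))
print(is.na(finite_bins))

# Open-ended breaks: every age gets a bin
open_bins <- cut(toy_age,
                 breaks = c(-Inf, 30, 40, 50, 60, 70, Inf),
                 labels = c("<=30", "31-40", "41-50", "51-60", "61-70", "70+"))
print(table(open_bins))
```

With finite breaks, an `NA` group mixes the youngest and the oldest clients, so relabelling it as "70+" would overstate what it contains.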

## Conversions by age group and marital status

In [None]:
# group the data by age group and marital status
conversionsAgeMarital <- data %>%
  group_by(AgeGroup=cut(age, breaks=c(-Inf, 30, 40, 50, 60, 70, Inf),
                        labels=c("<=30", "31-40", "41-50", "51-60", "61-70", "70+")),
           Marital=marital) %>%
  summarize(Count=n(), NumConversions=sum(y)) %>%
  # summarize() drops the Marital grouping, so sum(Count) is the age-group total
  mutate(TotalCount=sum(Count)) %>%
  mutate(ConversionRate=NumConversions/TotalCount*100)

#visualizing conversions by age group and marital status
ggplot(conversionsAgeMarital, aes(x=AgeGroup, y=ConversionRate, fill=Marital)) +
  geom_bar(width=0.5, stat = "identity") +
  labs(title="Conversion Rates by Age Group and Marital Status")

In the age groups from 30 to 70+, married clients account for the largest share of conversions (likely because they are the majority in these age groups). Clients with the "single" marital status contribute the most conversions in the youngest age group (up to 30).
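
The rate plotted above divides each marital group's conversions by the total number of clients in the whole age group, so each bar segment is that group's contribution to the age group's conversion rate, not the conversion rate within the marital group. A sketch with made-up counts shows how the two measures can differ; the column names `ShareOfAgeGroup` and `WithinGroupRate` are illustrative.

```r
# Made-up counts for one age group (not taken from the dataset)
toy <- data.frame(
  Marital        = c("married", "single"),
  Count          = c(800, 200),
  NumConversions = c(80, 30)
)

# Contribution to the age group's overall rate
# (denominator: all clients in the age group)
toy$ShareOfAgeGroup <- toy$NumConversions / sum(toy$Count) * 100

# Conversion rate within each marital group
# (denominator: clients in that marital group only)
toy$WithinGroupRate <- toy$NumConversions / toy$Count * 100

print(toy)
```

In this toy case married clients contribute more conversions (8 versus 3 percentage points) only because there are more of them, while single clients convert at a higher rate (15% versus 10%).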

## Conversions by job

In [None]:
#group the data
conversionsJob <- data %>%
  group_by(Job=job) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#order the jobs DESC for the bar chart
conversionsJob$Job <- factor(conversionsJob$Job, 
                                   levels = conversionsJob$Job[order(-conversionsJob$ConversionRate)])

# visualizing conversions by job
ggplot(conversionsJob, aes(x=Job, y=ConversionRate)) +
  geom_bar(width=0.5, stat = "identity", fill="darkgreen") +
  labs(title="Conversion Rates by Job") +
  theme(axis.text.x = element_text(angle = 90))

Students and retired people have a higher conversion rate than other "job" groups. The blue-collar group has the lowest conversion rate. 

## Conversions by education

In [None]:
#group the data
conversionsEdu <- data %>%
  group_by(Education=education) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#order DESC for the bar chart
conversionsEdu$Education <- factor(conversionsEdu$Education, 
                                   levels = conversionsEdu$Education[order(-conversionsEdu$ConversionRate)])
#visualizing conversions by education
ggplot(conversionsEdu, aes(x=Education, y=ConversionRate)) +
  geom_bar(width=0.5, stat = "identity", fill="darkgreen") +
  labs(title="Conversion Rates by Education") +
  theme(axis.text.x = element_text(angle = 90))

The highest conversion rate is in the "illiterate" group, but because there are only 18 illiterate clients, I am not going to recommend focusing on it. "University degree" has a higher than average conversion rate, so I would suggest focusing on this group. I would also recommend limiting marketing efforts on the "basic.6y" and "basic.9y" groups.
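
To see why a rate based on 18 clients is unreliable, one option is to attach a confidence interval to it. The sketch below uses `binom.test()` from base R; the conversion counts are illustrative assumptions, not figures read from the dataset.

```r
# Illustrative counts (assumptions): same observed rate, very different sample sizes
small <- binom.test(x = 4,   n = 18)    # 4 of 18 converted
large <- binom.test(x = 400, n = 1800)  # 400 of 1800 converted

# 95% exact (Clopper-Pearson) confidence intervals, in percent
small_ci <- as.numeric(small$conf.int) * 100
large_ci <- as.numeric(large$conf.int) * 100
print(round(small_ci, 1))
print(round(large_ci, 1))

# Width of each interval in percentage points
print(diff(small_ci))
print(diff(large_ci))
```

The interval for the small group is much wider, which is the reason to be cautious about ranking it first.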

## Conversions by credit in default

In [None]:
#group the data
conversionsDefaultCredit <- data %>%
  group_by(HasCredit=default) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#visualizing the data
ggplot(conversionsDefaultCredit, aes(x=HasCredit, y=ConversionRate, fill=HasCredit)) +
  geom_bar(width=0.5, stat = "identity") +
  labs(title="Conversion Rates by Default Credit")

So a client with no credit in default is more likely to subscribe to a term deposit.

## Conversions by having a housing loan and a personal loan

In [None]:
#group the data - housing loan
conversionsHousing <- data %>%
  group_by(HousingLoan=housing) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#visualizing the data
ggplot(conversionsHousing, aes(x=HousingLoan, y=ConversionRate, fill=HousingLoan)) +
  geom_bar(width=0.5, stat = "identity") +
  labs(title="Conversion Rates by Housing Loan")

#group the data - personal loan
conversionsLoan <- data %>%
  group_by(Loan=loan) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#visualizing the data
ggplot(conversionsLoan, aes(x=Loan, y=ConversionRate, fill=Loan)) +
  geom_bar(width=0.5, stat = "identity") +
  labs(title="Conversion Rates by Personal Loan")

Clients who have a housing loan or don't have a personal loan convert slightly better. 
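
The same group-and-summarise steps recur for every feature, so they could be wrapped in one function. The following is a sketch in base R (so it runs without extra packages); `conversion_by` and its arguments are hypothetical names, and it assumes the target column is already coded as 0/1.

```r
# Hypothetical helper: conversion rate (%) per level of one feature.
# Assumes `target` is a 0/1 column, as `y` is after recoding.
conversion_by <- function(df, feature, target = "y") {
  total <- tapply(df[[target]], df[[feature]], length)
  converted <- tapply(df[[target]], df[[feature]], sum)
  out <- data.frame(
    Group = names(total),
    TotalCount = as.vector(total),
    NumberConversions = as.vector(converted)
  )
  out$ConversionRate <- out$NumberConversions / out$TotalCount * 100
  # Highest conversion rate first
  out[order(-out$ConversionRate), ]
}

# Toy data (made up) to show the call
toy <- data.frame(
  housing = c("yes", "yes", "no", "no", "no", "unknown"),
  y       = c(1, 0, 0, 0, 1, 0)
)
toy_result <- conversion_by(toy, "housing")
print(toy_result)
```

Called on the notebook's data frame, for example as `conversion_by(data, "housing")`, it should give the same columns as the tables built above.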

## Conversions by contact type

In [None]:
conversionsContact <- data %>%
  group_by(Contact=contact) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

head(conversionsContact)

Contacting clients by cellular phone converts better than contacting them by landline telephone.

## Conversions by the last contact month of a year

In [None]:
# group the data by months
conversionsMonth <- data %>%
  group_by(Month=month) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#reorder DESC
conversionsMonth$Month <- factor(conversionsMonth$Month, 
                                   levels = conversionsMonth$Month[order(-conversionsMonth$ConversionRate)])
#visualizing the data
ggplot(conversionsMonth, aes(x=Month, y=ConversionRate)) +
  geom_bar(width=0.5, stat = "identity", fill="darkgreen") +
  labs(title="Conversion Rates by Last Contact Month") +
  theme(axis.text.x = element_text(angle = 90))

Clients whose last contact was in March, December, September, or October convert much better than others.

## Conversions by the last contact day of a week

In [None]:
#group the data by days of a week
conversionsDayOfWeek <- data %>%
  group_by(Day_Of_Week=day_of_week) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

#order the days chronologically (Mon-Fri) for the bar chart
conversionsDayOfWeek$Day_Of_Week <- factor(conversionsDayOfWeek$Day_Of_Week, 
                                           levels = c("mon", "tue", "wed", "thu", "fri"))
#visualizing the data
ggplot(conversionsDayOfWeek, aes(x=Day_Of_Week, y=ConversionRate)) +
  geom_bar(width=0.5, stat = "identity", fill="darkgreen") +
  labs(title="Conversion Rates by Last Contact Day of Week") +
  theme(axis.text.x = element_text(angle = 90))


The conversion rate is higher when clients were last contacted on Thursday, Tuesday, or Wednesday.

## Correlation between subscribing to a term deposit and call duration

In [None]:
data_duration <- data %>%
  group_by(Subscribed=y) %>%
  summarise(Average_Duration=mean(duration))
head(data_duration)

The average duration of a successful call is more than twice that of an unsuccessful one.
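
The comparison above is between group means. If a single correlation coefficient is wanted, Pearson's correlation between a numeric variable and a 0/1 outcome (the point-biserial correlation) can be computed with `cor()`. A sketch on made-up call durations:

```r
# Made-up durations in seconds and 0/1 outcomes (not from the dataset)
toy_duration <- c(60, 120, 90, 300, 45, 600, 480, 150)
toy_y        <- c(0,  0,   0,  1,   0,  1,   1,   0)

# Mean duration per outcome, mirroring the grouped summary
mean_by_outcome <- tapply(toy_duration, toy_y, mean)
print(mean_by_outcome)

# Point-biserial correlation = Pearson correlation with a binary variable
r <- cor(toy_duration, toy_y)
print(r)
```

As the attribute notes point out, duration is only known once the call has ended, so this relationship is useful as a benchmark rather than as a predictor.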

## Conversions by the number of contacts performed during the campaign

In [None]:
conversionsCamp <- data %>%
  group_by(Campaign=campaign) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

head(conversionsCamp)

If you look at the full table (not just the head), you will notice that the conversion rate is 0 once the number of contacts performed during this campaign for a client exceeds 18. So there is little point in calling a client more than 18 times during one campaign.
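
Because `head()` shows only the six rows with the highest rate, sorting by the number of contacts and keeping the group sizes in view makes the tail easier to judge: a 0% rate based on a handful of clients is weak evidence. A sketch on a made-up table in the shape of `conversionsCamp` (all values and the `SmallSample` threshold are illustrative assumptions):

```r
# Made-up summary table in the shape of conversionsCamp (values are illustrative)
toy_camp <- data.frame(
  Campaign          = c(3, 1, 20, 2, 19),
  TotalCount        = c(500, 1700, 4, 1000, 6),
  NumberConversions = c(50, 220, 0, 115, 0)
)
toy_camp$ConversionRate <- toy_camp$NumberConversions / toy_camp$TotalCount * 100

# Sort by number of contacts instead of by rate
toy_camp <- toy_camp[order(toy_camp$Campaign), ]

# Flag rows whose rate rests on few clients (threshold of 30 is an arbitrary assumption)
toy_camp$SmallSample <- toy_camp$TotalCount < 30
print(toy_camp)
```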

## Conversions by the outcome of the previous campaign

In [None]:
#group the data by the previous outcome
conversionsPOutcome <- data %>%
  group_by(Previous_Outcome=poutcome) %>%
  summarize(TotalCount=n(), NumberConversions=sum(y)) %>%
  mutate(ConversionRate=NumberConversions/TotalCount*100) %>%
  arrange(desc(ConversionRate))

# visualizing the data
ggplot(conversionsPOutcome, aes(x=Previous_Outcome, y=ConversionRate)) +
  geom_bar(width=0.5, stat = "identity", fill="darkgreen") +
  labs(title="Conversion Rates by Outcome of the Previous Campaign")

Obviously, if the previous campaign outcome was successful (the bank probably earned some loyalty), this campaign converted better as well. 

## Summarizing recommendations for the bank

During the Bank Marketing Campaigns Dataset analysis, I found some interesting insights that can be used for improving a similar marketing campaign, launching new campaigns, and addressing the Bank's marketing strategy.

![](https://images.unsplash.com/photo-1560472355-536de3962603?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1500&q=80)
  
**Target audience**

Based on the performance of different groups, I found that young people (up to age 30) and students, as well as retired people (60+), are more likely to become clients. So I would suggest focusing on these two groups and creating different financial programs and advertising messages for each.

Also, I would recommend creating a marketing campaign for clients with no credit in default (explaining how it works and what the benefits are).

**Recommendations for the Sales Department (Call Center)**

* Always contact clients by cellphone when possible
* Perform most calls (campaigns) during these months: March, December, September, and October
* Plan most calls to clients on Thursday, Tuesday, Wednesday
* Longer phone conversations are associated with more conversions, so try to keep the conversation going as long as you can
* 18 is probably the max number of calls to a single client during a campaign

**Loyalty Program**

I would highly recommend developing a loyalty program for existing clients, with bonuses and unique offers. The data shows that clients for whom the previous campaign was successful are much more likely to subscribe again.