# Estimating the Difference Between the Means of Successful Outcomes and Failures

Group members: Eun Ji Hwang, Daniel Dai, Camilla Ren, Rachel Yang

## Table of Contents

### Index: 
1. [Introduction](#Introduction)
2. [Preliminary Results](#Preliminary-Results)
3. [Methods](#Methods)
4. [Reference](#Reference)


## <img src="https://th.bing.com/th/id/R.3849382c629e1a734ac29d79a0656ea2?rik=A6LFA%2bGN%2b07q9Q&riu=http%3a%2f%2fwww.pngmart.com%2ffiles%2f1%2fPhone.png&ehk=DewsIHJL6yqq69NhIyBSqo6A%2fEnade%2boWXdZCaNB03g%3d&risl=&pid=ImgRaw&r=0" width=30> Introduction

### Background:
Marketing campaigns are a common strategy that companies use to grow their business. Telemarketing, in which customers are contacted remotely by phone, lets marketers use the available customer information and metrics to optimize customer lifetime value, helping the company build longer and closer customer relationships in line with business demand.

<img src="https://th.bing.com/th/id/R.8c8656b19cf1e229b4d8154de1b2f8b6?rik=BWfEliq15MzvBQ&riu=http%3a%2f%2fwww.pollenmidwest.org%2fwp-content%2fuploads%2f2017%2f04%2fbanking2.gif&ehk=CRhvAOpg%2bzexGRo8tdAQsWMixVUj8xnxdDQxQx%2bU%2fmM%3d&risl=&pid=ImgRaw&r=0" width=500>


### Our Dataset:
The selected data refer to telemarketing campaigns of a Portuguese banking institution, where the campaigns were based on phone calls. Each record contains information about a bank customer (e.g., age), information about the last contact of the current campaign (e.g., contact duration), other attributes (e.g., three-month Euribor), and the contact outcome ("success" or "failure"). 


### Our Question:
A previous study (Moro et al., 2014) has shown that the duration of previous calls is one of the most relevant attributes in determining the contact outcome. The duration of a previous contact refers to the length of a call that had to be rescheduled before the client gave a final answer. Call durations (in seconds) are recorded in the `duration` column of the chosen dataset.

Thus, in this analysis, we are interested in the difference between the mean call duration of successful outcomes and that of failures. Because the sample distributions of call duration are right-skewed for both successful and failed outcomes, the IQR is also calculated as a robust indicator of spread.

## <img src="https://th.bing.com/th/id/R.3849382c629e1a734ac29d79a0656ea2?rik=A6LFA%2bGN%2b07q9Q&riu=http%3a%2f%2fwww.pngmart.com%2ffiles%2f1%2fPhone.png&ehk=DewsIHJL6yqq69NhIyBSqo6A%2fEnade%2boWXdZCaNB03g%3d&risl=&pid=ImgRaw&r=0" width=30> Preliminary Results

In [1]:
# Load package
library(tidyverse)
library(GGally)
library(infer)
library(gridExtra)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: ‘gridExtra’

The following object is masked from ‘package:dplyr’:

    combine


In [2]:
# Read in the bank data (skipping the header row) and rename the columns
bank_data <- read_delim("bank_additional_full.csv", skip = 1, delim = ";", col_names = FALSE) %>%
                rename(age = X1, job = X2, marital = X3, education = X4, default = X5, housing = X6, loan = X7, contact = X8,
                       month = X9, day_of_week = X10, duration = X11, campaign = X12, pdays = X13, previous = X14, poutcome = X15,
                       emp_var_rate = X16, cons_price_idx = X17, cons_conf_idx = X18, euribor_3m = X19, nr_employed = X20,
                       y = X21)

head(bank_data)

Rows: 41188 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (11): X2, X3, X4, X5, X6, X7, X8, X9, X10, X15, X21
dbl (10): X1, X11, X12, X13, X14, X16, X17, X18, X19, X20

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| age | job | marital | education | default | housing | loan | contact | month | day_of_week | ⋯ | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor_3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ⋯ | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ⋯ | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 37 | services | married | high.school | no | yes | no | telephone | may | mon | ⋯ | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ⋯ | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 56 | services | married | high.school | no | no | yes | telephone | may | mon | ⋯ | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 45 | services | married | basic.9y | unknown | no | no | telephone | may | mon | ⋯ | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |


In [None]:
# Create a ggpairs scatterplot of all the columns we are interested in including in our model
options(repr.plot.width = 20, repr.plot.height = 15)
bank_data_compare <- bank_data%>%
            select(age, duration, campaign, pdays, previous, emp_var_rate, cons_price_idx,cons_conf_idx,euribor_3m,nr_employed,y)%>%
            ggpairs(aes(color = y))
bank_data_compare 

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.




In the ggpairs figure, we did not identify strong relationships between any two of the explanatory variables, so none of them was excluded on the grounds of being redundant with another.
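A visual check like this can be backed up with a correlation matrix. The following is a minimal sketch on synthetic data, not the bank data: the `toy` data frame, the derived `duration_minutes` column, and the 0.8 cutoff are assumptions chosen only for illustration. It is written in base R so that it runs without extra packages.

```r
# Synthetic stand-in for a few numeric columns (hypothetical values, not the bank data)
set.seed(1)
n <- 200
toy <- data.frame(
  age = rnorm(n, mean = 40, sd = 10),
  duration = rexp(n, rate = 1 / 250),
  campaign = rpois(n, lambda = 2) + 1
)
# A column derived from another one, so one pair is strongly correlated by construction
toy$duration_minutes <- toy$duration / 60

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a hypothetical cutoff of 0.8
cutoff <- 0.8
high <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r = cor_matrix[high]
)
print(flagged)
```

Applied to the numeric columns of `bank_data`, an empty `flagged` table would support the statement above, while any listed pair would deserve a closer look.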

In [None]:
options(repr.plot.width = 20, repr.plot.height = 8)

# Create box plots for all numeric variables
# The euribor_3m values are too big to visualize on the raw scale, so we take their log10
bank_data_mutated <- mutate(bank_data, euribor_3m = log10(as.numeric(euribor_3m)))

age_diag_plot <- ggplot(bank_data,aes(y=age,x=y)) +
    geom_boxplot()+
    labs(x="", y="Age") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure1 Age vs Yes or No")+
    theme(plot.title = element_text(color = "red1"))

duration_diag_plot <- ggplot(bank_data,aes(y=duration,x=y)) +
    geom_boxplot()+
    labs(x="", y="duration") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure2 Duration vs Yes or No")+
    theme(plot.title = element_text(color = "sandybrown"))

campaign_diag_plot <- ggplot(bank_data,aes(y=campaign,x=y)) +
    geom_boxplot()+
    labs(x="", y="Number of contacts performed") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure3 Campaign vs Yes or No")+
    theme(plot.title = element_text(color = "lightgoldenrod3"))

combined_diag_plot <- grid.arrange(age_diag_plot,duration_diag_plot,campaign_diag_plot, ncol = 3)

pdays_diag_plot <- ggplot(bank_data,aes(y=pdays,x=y)) +
    geom_boxplot()+
    labs(x="", y="Number of days that passed by") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure4 Pdays vs Yes or No")+
    theme(plot.title = element_text(color = "springgreen4"))+
    ylim(997,1001)

emp_var_rate_diag_plot <- ggplot(bank_data,aes(y=emp_var_rate,x=y)) +
    geom_boxplot()+
    labs(x="", y="Employment Variation Rate") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure5 Employment Variation Rate vs Yes or No")+
    theme(plot.title = element_text(color = "aquamarine4"))

cons_price_idx_diag_plot <- ggplot(bank_data,aes(y=cons_price_idx,x=y)) +
    geom_boxplot()+
    labs(x="", y="Consumer Price Index") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure6 Consumer Price Index vs Yes or No")+
    theme(plot.title = element_text(color = "lightskyblue"))


combined_diag_plot_2 <- grid.arrange(pdays_diag_plot,
                                     emp_var_rate_diag_plot,
                                     cons_price_idx_diag_plot,
                                     ncol = 3)

cons_conf_idx_diag_plot <- ggplot(bank_data,aes(y=cons_conf_idx,x=y)) +
    geom_boxplot()+
    labs(x="", y="Consumer confidence index") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure7 Consumer confidence index vs Yes or No")+
    theme(plot.title = element_text(color = "mediumpurple2"))

euribor_3m_diag_plot <- ggplot(bank_data_mutated,aes(y=euribor_3m,x=y)) +
    geom_boxplot()+
    labs(x="", y="Euribor 3 month rate") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure8 Euribor 3 month rate vs Yes or No")+
    theme(plot.title = element_text(color = "pink2"))

nr_employed_diag_plot <- ggplot(bank_data,aes(y=nr_employed,x=y)) +
    geom_boxplot()+
    labs(x="", y="Number of employees") +
    theme(text=element_text(size=16)) +                    
    labs(title = "Figure9 Number of employees vs Yes or No")+
    theme(plot.title = element_text(color = "violetred"))+
    ylim(4500,6000)

combined_diag_plot_3 <- grid.arrange(cons_conf_idx_diag_plot,
                                     euribor_3m_diag_plot,
                                     nr_employed_diag_plot,
                                     ncol = 3)

In the box plots, the distribution of duration differs noticeably between the two values of y, which suggests that duration is related to the response variable.

In [None]:
# Wrangle the data set: keep only the outcome (y) and duration columns.
bank_data_sample <- bank_data %>%
                select(y, duration)

bank_data_summary <- bank_data_sample %>%
                    mutate(y = as.factor(y))%>%
                    summary()
bank_data_summary

In [None]:
# Compute the number of rows
number_rows <- bank_data_sample %>%
            nrow()


# Compute the mean and standard deviation of duration
sample_estimates<- bank_data_sample %>%
                group_by(y) %>%
                summarize(mean = mean(duration),sd = sd(duration))
sample_estimates
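Because call durations are skewed, the median and IQR are useful companions to the mean and standard deviation. The sketch below shows one way to compute all four per outcome. It uses synthetic, exponentially distributed durations rather than the bank data, and the names `toy` and `toy_summary` are hypothetical; it is written in base R so that it runs on its own.

```r
# Synthetic right-skewed durations in seconds (hypothetical values, not the bank data)
set.seed(2)
toy <- data.frame(
  y = rep(c("no", "yes"), times = c(300, 100)),
  duration = c(rexp(300, rate = 1 / 200), rexp(100, rate = 1 / 500))
)

# Mean, standard deviation, median and IQR of duration for each outcome
toy_summary <- aggregate(
  duration ~ y,
  data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
)

# aggregate() stores the statistics in a matrix column; flatten it into ordinary columns
toy_summary <- do.call(data.frame, toy_summary)
print(toy_summary)
```

In the notebook's own style, the same idea would amount to adding `iqr = IQR(duration)` and `median = median(duration)` inside the existing `summarize()` call.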

In [None]:
# Plot the distribution of duration of the two categories
options(repr.plot.width = 15, repr.plot.height = 8)
sample_mean_distribution <- bank_data_sample %>% 
                    ggplot(aes(x = duration, fill = y)) +
                    geom_histogram(bins = 30, alpha = 0.4)+
                    ggtitle("Distribution of Duration")+
                    theme(text = element_text(size = 20))+
                    geom_vline(data = sample_estimates, aes(xintercept = mean, color = y))

sample_mean_distribution 

## <img src="https://th.bing.com/th/id/R.3849382c629e1a734ac29d79a0656ea2?rik=A6LFA%2bGN%2b07q9Q&riu=http%3a%2f%2fwww.pngmart.com%2ffiles%2f1%2fPhone.png&ehk=DewsIHJL6yqq69NhIyBSqo6A%2fEnade%2boWXdZCaNB03g%3d&risl=&pid=ImgRaw&r=0" width=30> Methods

A strength of this report is that the dataset is large (41,188 records) and records both the call duration and the outcome of every contact, which is what our question requires. We plan to use simulation-based resampling to approximate the distribution of the difference in means.


Hypothesis testing: 
We want to compare the mean call duration of successful and failed outcomes. This will give insight into whether call duration should be considered in employee training. 
To achieve this, we test the null hypothesis H0 against the alternative hypothesis H1. We start by defining the two means:

m0: mean call duration of successful sessions

m1: mean call duration of failed sessions

Then we define the null and alternative hypotheses as follows:

H0:  m0 - m1 = 0

H1: m0 - m1 > 0

We plan to use a significance level of 0.05 for the one-sided hypothesis test, which corresponds to a 90% two-sided confidence interval. 

We expect to reject the null hypothesis, as call duration likely has an effect on the success rate. 

Therefore, we will carry out our analysis as follows:
1. Compute the observed difference in mean call duration between the two groups.
2. Simulate permuted samples under the null hypothesis and compute the difference in means within each one to obtain a null distribution.
3. Use a right-tailed test to calculate the p-value and compare it against a significance level of 0.05.
4. Finally, visualize the result and determine whether we reject the null hypothesis.
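
The steps above can be sketched in base R as follows. This is an illustration on synthetic durations, not the notebook's analysis: the data, the helper `diff_in_means()`, and the variable names are hypothetical. For step 2 it shuffles the outcome labels, which is the idea behind `generate(type = "permute")` in the infer pipeline used in this notebook.

```r
# Synthetic call durations in seconds (hypothetical values, not the bank data)
set.seed(3)
y <- rep(c("no", "yes"), times = c(300, 100))
duration <- c(rexp(300, rate = 1 / 200), rexp(100, rate = 1 / 500))

# Difference in mean duration, ordered as "yes" minus "no"
diff_in_means <- function(values, labels) {
  mean(values[labels == "yes"]) - mean(values[labels == "no"])
}

# Step 1: observed difference
obs_diff <- diff_in_means(duration, y)

# Step 2: shuffle the outcome labels to simulate the null hypothesis of no difference
reps <- 500
null_diffs <- replicate(reps, diff_in_means(duration, sample(y)))

# Step 3: right-tailed p-value; adding 1 to both counts avoids reporting exactly 0
p_value <- (sum(null_diffs >= obs_diff) + 1) / (reps + 1)

# Step 4: decision at the 0.05 significance level
reject_null <- p_value < 0.05
print(c(obs_diff = obs_diff, p_value = p_value))
```

The `+ 1` correction is a design choice: with a finite number of shuffles, the smallest p-value that can honestly be reported is 1 / (reps + 1), not 0.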

### Test Statistic

In [None]:
# Compute difference in sample means between call duration of a successful outcome and a failure
sample_means_diff <- sample_estimates%>%
                    select(y, mean)%>%
                    pivot_wider(names_from = y, values_from = mean)%>%
                    transmute(diff = yes - no)%>%
                    pull(diff)
sample_means_diff

### Null distribution

In [None]:
# Generate the null distribution by permuting the outcome labels
set.seed(1000)

null_model_bank <- bank_data_sample %>%
                specify(formula = duration ~ y) %>% 
                hypothesize(null = "independence") %>% 
                generate(reps = 500, type = "permute") %>% 
                calculate(stat="diff in means", order = c("yes", "no"))

head(null_model_bank)

### P-value and confidence intervals

In [None]:
# Compute the right-tailed p-value
p_value_duration <- null_model_bank %>%
                    get_p_value(obs_stat = sample_means_diff, direction = "right")
p_value_duration

# Visualize the null distribution of the difference in mean duration with the p-value region shaded
options(repr.plot.width = 15, repr.plot.height = 8)
bank_result_plot <- null_model_bank %>%
                    visualize() + 
                    shade_p_value(obs_stat = sample_means_diff, direction = "right")+
                    theme(text = element_text(size = 20))+
                    xlab("Difference in mean of duration")
bank_result_plot

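The reported p-value is 0, which means that none of the 500 permuted differences was at least as large as the observed difference; with 500 replicates we can only conclude that the p-value is below 1/500 = 0.002. Because this is below the 0.05 significance level, we reject the null hypothesis: the mean call duration of successful outcomes is significantly greater than that of failures.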

In [None]:
# Compute the central 90% interval of the null (permutation) distribution
duration_ci_0.9 <- null_model_bank %>%
                    get_confidence_interval(level = 0.9)
duration_ci_0.9 

In [None]:
# Visualize the null distribution of the difference in mean duration with its central 90% interval shaded
options(repr.plot.width = 15, repr.plot.height = 8)
shade_ci_duration <- null_model_bank %>%
                    visualize(bins = 30)+
                    shade_confidence_interval(endpoints = duration_ci_0.9 )+
                    labs(x = "Difference in mean of duration")
shade_ci_duration

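- The interval [-6.798716, 7.366154] covers the middle 90% of the null distribution, that is, the differences in mean duration we would expect if the outcome were unrelated to call duration. Because it is computed from permuted data, it is centered near 0 and is not a confidence interval for the population difference in mean duration between successful and failed sessions. The observed difference lies above its upper bound, which agrees with the p-value above.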
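
A confidence interval for the population difference in means is usually built from a bootstrap distribution, which is centered on the observed difference, rather than from a null distribution. The sketch below illustrates the percentile method on synthetic durations. The data and variable names are hypothetical, and it resamples within each outcome group, which is one possible design and differs from resampling whole rows.

```r
# Synthetic call durations in seconds (hypothetical values, not the bank data)
set.seed(4)
dur_yes <- rexp(100, rate = 1 / 500)
dur_no <- rexp(300, rate = 1 / 200)

# Resample each group with replacement and recompute the difference in means
reps <- 1000
boot_diffs <- replicate(reps, {
  mean(sample(dur_yes, replace = TRUE)) - mean(sample(dur_no, replace = TRUE))
})

# Percentile interval: the 5th and 95th percentiles give a 90% confidence interval
ci_90 <- quantile(boot_diffs, probs = c(0.05, 0.95))
print(ci_90)
```

With infer, the analogous change would be to drop the `hypothesize()` step and call `generate()` with `type = "bootstrap"` before `calculate()` and `get_confidence_interval()`.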

## <img src="https://th.bing.com/th/id/R.3849382c629e1a734ac29d79a0656ea2?rik=A6LFA%2bGN%2b07q9Q&riu=http%3a%2f%2fwww.pngmart.com%2ffiles%2f1%2fPhone.png&ehk=DewsIHJL6yqq69NhIyBSqo6A%2fEnade%2boWXdZCaNB03g%3d&risl=&pid=ImgRaw&r=0" width=30> Reference

1. Dua, D., and C. Graff. “Bank Marketing Data Set.” UCI Machine Learning Repository, 2019, https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.   
2. Moro, Sérgio, et al. “A Data-Driven Approach to Predict the Success of Bank Telemarketing.” Decision Support Systems, North-Holland, 13 Mar. 2014, https://www.sciencedirect.com/science/article/pii/S016792361400061X. 