## Predicting the most influential variables on whether or not a client will subscribe to a term deposit

## Introduction:
The data set we are working on is the bank marketing data set. It comes from a Portuguese banking institution and relates to direct marketing campaigns conducted through phone calls.

This dataset contains many variables, as shown when we load the file below. Some of the more cryptic variables are defined here:

- `default` - Has credit in default?
- `balance` - Account balance (EUR)
- `housing` - Has housing loan?
- `contact` - Contact communication type
- `month` / `day` - Month and day of the month of the last contact
- `duration` - Last contact duration, in seconds
- `campaign` - Number of contacts performed during this campaign and for this client
- `pdays` - Number of days that passed by after the client was last contacted from a previous campaign
- `previous` - Number of contacts performed before this campaign and for this client 
- `poutcome` - Outcome of the previous marketing campaign
- `y` - Has the client subscribed to a term deposit?

Using this dataset we will attempt to answer the question "Which variables have the greatest influence on predicting whether or not a client will subscribe to a term deposit?" A term deposit is a cash investment held at a financial institution: the money is invested at an agreed rate of interest for a fixed amount of time, or term.

## Method:

TODO: Quick summary of method

Load the libraries required to perform the data analysis. We also make sure to set the seed so that the results are repeatable and not affected by randomness.

In [3]:
set.seed(1337)

library(tidyverse)
library(repr)
library(rvest)
library(stringr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘rvest’ was built under R version 4.0.2”
Loading required package: xml2


Attaching package: ‘rvest’


The following object is masked from ‘pack

Download, extract, and parse our dataset from the web.

In [4]:
# set seed
set.seed(1337)

# download and extract dataset
dir.create("data/")
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip", destfile = "data/bank.zip")
unzip("data/bank.zip", files = "bank-full.csv", exdir = "data/", overwrite = TRUE)

# load dataset
bank_full <- read_delim("data/bank-full.csv", delim = ';')
slice(bank_full, 1:10)

“'data' already exists”
Parsed with column specification:
cols(
  age = col_double(),
  job = col_character(),
  marital = col_character(),
  education = col_character(),
  default = col_character(),
  balance = col_double(),
  housing = col_character(),
  loan = col_character(),
  contact = col_character(),
  day = col_double(),
  month = col_character(),
  duration = col_double(),
  campaign = col_double(),
  pdays = col_double(),
  previous = col_double(),
  poutcome = col_character(),
  y = col_character()
)



age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no


We decided to drop the columns that we thought would have no relationship to whether or not a client subscribed to a term deposit.

In [5]:
bank_trimmed <- bank_full %>%
    select(-c(default, contact, duration))
slice(bank_trimmed, 1:10)

age,job,marital,education,balance,housing,loan,day,month,campaign,pdays,previous,poutcome,y
<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
58,management,married,tertiary,2143,yes,no,5,may,1,-1,0,unknown,no
44,technician,single,secondary,29,yes,no,5,may,1,-1,0,unknown,no
33,entrepreneur,married,secondary,2,yes,yes,5,may,1,-1,0,unknown,no
47,blue-collar,married,unknown,1506,yes,no,5,may,1,-1,0,unknown,no
33,unknown,single,unknown,1,no,no,5,may,1,-1,0,unknown,no
35,management,married,tertiary,231,yes,no,5,may,1,-1,0,unknown,no
28,management,single,tertiary,447,yes,yes,5,may,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,2,yes,no,5,may,1,-1,0,unknown,no
58,retired,married,primary,121,yes,no,5,may,1,-1,0,unknown,no
43,technician,single,secondary,593,yes,no,5,may,1,-1,0,unknown,no


To summarize the data in the table we use the `summary` function. It reports potentially interesting statistics about the data frame; for example, the mean age is 40.94 years.

In [6]:
summary(bank_trimmed)

      age            job              marital           education        
 Min.   :18.00   Length:45211       Length:45211       Length:45211      
 1st Qu.:33.00   Class :character   Class :character   Class :character  
 Median :39.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :40.94                                                           
 3rd Qu.:48.00                                                           
 Max.   :95.00                                                           
    balance         housing              loan                day       
 Min.   : -8019   Length:45211       Length:45211       Min.   : 1.00  
 1st Qu.:    72   Class :character   Class :character   1st Qu.: 8.00  
 Median :   448   Mode  :character   Mode  :character   Median :16.00  
 Mean   :  1362                                         Mean   :15.81  
 3rd Qu.:  1428                                         3rd Qu.:21.00  
 Max.   :102127                                   

Next we convert all of our variables into numbers (integer codes for the categorical variables). The step that would turn `unknown` values into `NA` is left commented out, so `unknown` is kept as its own category.

In [10]:
bank_fixed = bank_trimmed

# set NAs
# bank_fixed = mutate(bank_fixed, across(where(is.character), ~na_if(., "unknown")))
# bank_fixed = mutate(bank_fixed, pdays = na_if(pdays, -1))

# parse into numbers: every character column is replaced by the integer code
# of its factor level (levels are sorted alphabetically, so e.g. may becomes 9)
bank_fixed = mutate(bank_fixed, across(where(is.character), ~ as.numeric(as.factor(.x))))

# set y as factor (1 = "no", 2 = "yes")
bank_fixed = mutate(bank_fixed, y = as_factor(y))

# preview
slice(bank_fixed, 1:5)

age,job,marital,education,balance,housing,loan,day,month,campaign,pdays,previous,poutcome,y
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
58,5,2,3,2143,2,1,5,9,1,-1,0,4,1
44,10,3,2,29,2,1,5,9,1,-1,0,4,1
33,3,2,2,2,2,2,5,9,1,-1,0,4,1
47,2,2,4,1506,2,1,5,9,1,-1,0,4,1
33,12,3,4,1,1,1,5,9,1,-1,0,4,1
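
The integer codes produced by `as.numeric(as.factor(...))` follow the alphabetical order of the category names, which is why `may` shows up as `9` in the preview. The sketch below uses a few made-up values (not rows from the dataset) to show this, and contrasts it with a calendar ordering built from R's `month.abb` constant. The calendar variant is only an illustration of an alternative, not something the notebook does.

```r
# made-up example values, not rows from the dataset
month <- c("may", "jun", "apr", "may")

# default behaviour: factor levels are sorted alphabetically
alphabetical <- as.factor(month)
levels(alphabetical)                      # "apr" "jun" "may"
codes <- as.numeric(alphabetical)         # may -> 3, jun -> 2, apr -> 1

# alternative: give the levels explicitly in calendar order
calendar <- factor(month, levels = tolower(month.abb))
calendar_codes <- as.numeric(calendar)    # may -> 5, jun -> 6, apr -> 4

print(codes)
print(calendar_codes)
```

Because K-NN works with distances between these numbers, the choice of ordering can change which observations count as neighbours.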


We split the dataset into training (75%) and testing (25%) sets, stratified on `y`, so that we can evaluate our models on data they were not trained on.

In [11]:
bank_split <- initial_split(bank_fixed, prop = 0.75, strata = y)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
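
Stratifying on `y` is meant to keep the share of each class roughly the same in the training and testing sets. The following base R sketch illustrates the idea on made-up labels with a similar imbalance. It samples 75% of the rows within each class by hand and is not a description of how `initial_split` is implemented.

```r
set.seed(1)

# made-up labels with an imbalance similar to y (mostly "no")
labels <- factor(c(rep("no", 880), rep("yes", 120)))

# sample 75% of the row indices separately within each class
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
  }),
  use.names = FALSE
)

train_labels <- labels[train_idx]
test_labels <- labels[-train_idx]

# both sets keep the same class proportions
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```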

To view our training data we use the built-in `glimpse` function.

In [12]:
glimpse(bank_train)

Rows: 33,909
Columns: 14
$ age       <dbl> 58, 44, 33, 47, 28, 43, 41, 29, 58, 57, 51, 45, 57, 33, 28,…
$ job       <dbl> 5, 10, 3, 2, 5, 10, 1, 1, 10, 8, 6, 1, 2, 8, 2, 2, 6, 5, 3,…
$ marital   <dbl> 2, 3, 2, 2, 3, 3, 1, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3,…
$ education <dbl> 3, 2, 2, 4, 3, 2, 2, 2, 4, 2, 1, 4, 1, 2, 2, 1, 1, 3, 2, 2,…
$ balance   <dbl> 2143, 29, 2, 1506, 447, 593, 270, 390, 71, 162, 229, 13, 52…
$ housing   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ loan      <dbl> 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2,…
$ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ month     <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,…
$ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

Following this we install the resources necessary to create a correlation table.

In [14]:
install.packages("corrplot")
source("http://www.sthda.com/upload/rquery_cormat.r")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



Finally we can create a correlation table for our dataset. This will tell us which variables are most strongly related, which we can use to identify what has the greatest impact on our outcome (`y`).

In [15]:
# print correlation table
# cor() only accepts numeric columns, so convert the factor y back to a number first
rquery.cormat(mutate(bank_fixed, y = as.numeric(y)))

corrplot 0.84 loaded


We can see from the correlation table that most variables seem to have little influence on our `y` value. `housing` and `pdays` seem to have the strongest correlation, a negative correlation of around `-0.4`.

Because this result was unclear we also decided to test each variable's influence on the outcome using K-NN classification.

**TODO**: Find where to put this and rephrase

K-NN is sensitive to the scale of the predictors, and so we should perform some preprocessing to standardize them. An additional consideration we need to take when doing this is that we should create the standardization preprocessor using only the training data. This ensures that our test data does not influence any aspect of our model training. Once we have created the standardization preprocessor, we can then apply it separately to both the training and test data sets.
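
As a minimal sketch of that idea in base R, the example below standardizes a made-up predictor (the numbers are illustrative and not taken from the dataset). The mean and standard deviation are computed from the training values only and then reused, unchanged, for the test values.

```r
# made-up numbers standing in for a single predictor such as balance
train_x <- c(10, 20, 30, 40)
test_x <- c(25, 50)

# learn the centre and spread from the training data only
train_mean <- mean(train_x)
train_sd <- sd(train_x)

# apply the same transformation to both sets
train_scaled <- (train_x - train_mean) / train_sd
test_scaled <- (test_x - train_mean) / train_sd

print(train_scaled)
print(test_scaled)
```

The test values are not guaranteed to end up with mean 0 and standard deviation 1, and that is expected: they are measured on the scale learned from the training set.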

#### Creating KNN models:

In the next code cell we create the K-NN specification needed for our case.

In [63]:
# create knn model
knn_spec = nearest_neighbor(weight_func = "rectangular") %>%
       set_engine("kknn") %>%
       set_mode("classification")

knn_spec

K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  weight_func = rectangular

Computational engine: kknn 


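In the next cell we build a recipe that uses every remaining column as a predictor and standardizes them, fit the K-NN workflow on the training set, predict on the test set, and compute the accuracy and kappa of those predictions.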

In [64]:
bank_recipe_all <- recipe(y ~ ., data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())
bank_recipe_all

knn_fit_all <- workflow() %>%
    add_recipe(bank_recipe_all) %>%
    add_model(knn_spec) %>%
    fit(data = bank_train)
knn_fit_all

bank_test_predictions_all <- predict(knn_fit_all, bank_test) %>%
  bind_cols(bank_test)

all_pred = bank_test_predictions_all %>%
  metrics(truth = y, estimate = .pred_class)
all_pred

ERROR: Error in eval(lhs, parent, parent): object 'bank_test_predictions_marital' not found


## Create a new recipe and specification for every individual variable

##### This specification can be used for every variable:

In [16]:
knn_specification <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

#### Accuracy of "marital" variable as a predictor:

In [70]:
bank_recipe_marital <- recipe(y ~ marital, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_marital <- workflow() %>%
    add_recipe(bank_recipe_marital) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [71]:
bank_test_predictions_marital <- predict(knn_fit_marital, bank_test) %>%
  bind_cols(bank_test)

marital_pred = bank_test_predictions_marital %>%
  metrics(truth = y, estimate = .pred_class)
marital_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.8830296
kap,binary,0.0


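##### From the above cell we see that when we use the "marital" variable as our predictor, we yield an 88.3% accuracy. The kappa of 0, however, suggests the model does no better than always predicting the majority class.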
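
An accuracy of about 88.3% together with a kappa of 0 is the pattern a classifier produces when it always predicts the majority class. The base R sketch below reproduces that pattern on made-up labels (883 "no" and 117 "yes", chosen only to mimic the imbalance; the notebook itself codes the classes as 1 and 2). It computes Cohen's kappa by hand from the confusion table.

```r
# made-up test set with a similar class imbalance
truth <- factor(c(rep("no", 883), rep("yes", 117)), levels = c("no", "yes"))

# a classifier that always predicts the majority class
pred <- factor(rep("no", 1000), levels = c("no", "yes"))

confusion <- table(truth, pred)
n <- sum(confusion)

# observed agreement
accuracy <- sum(diag(confusion)) / n

# agreement expected by chance, from the row and column totals
expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(accuracy)
print(cohen_kappa)
```

Here the observed and chance agreement are equal, so kappa is 0 even though the accuracy looks high.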

#### Accuracy of "education" variable as a predictor: not working due to NA values in the data set

In [76]:
bank_recipe_education <- recipe(y ~ education, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_education <- workflow() %>%
    add_recipe(bank_recipe_education) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [78]:
#bank_test_predictions_education <- predict(knn_fit_education, bank_test) %>%
#  bind_cols(bank_test)

#bank_test_predictions_education %>%
#  metrics(truth = y, estimate = .pred_class)

#### Accuracy of "default" variable as a predictor:

In [72]:
bank_recipe_default <- recipe(y ~ default, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_default <- workflow() %>%
    add_recipe(bank_recipe_default) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [73]:
bank_test_predictions_default <- predict(knn_fit_default, bank_test) %>%
  bind_cols(bank_test)

default_pred = bank_test_predictions_default %>%
  metrics(truth = y, estimate = .pred_class)

default_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.8830296
kap,binary,0.0


#### Accuracy of "age" variable as a predictor:

In [74]:
bank_recipe_age <- recipe(y ~ age, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_age <- workflow() %>%
    add_recipe(bank_recipe_age) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [75]:
bank_test_predictions_age <- predict(knn_fit_age, bank_test) %>%
  bind_cols(bank_test)

age_pred = bank_test_predictions_age %>%
  metrics(truth = y, estimate = .pred_class)

age_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.88241019
kap,binary,0.03550441


#### Accuracy of "balance" variable as a predictor:

In [79]:
bank_recipe_balance <- recipe(y ~ balance, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_balance <- workflow() %>%
    add_recipe(bank_recipe_balance) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [80]:
bank_test_predictions_balance <- predict(knn_fit_balance, bank_test) %>%
  bind_cols(bank_test)

balance_pred = bank_test_predictions_balance %>%
  metrics(truth = y, estimate = .pred_class)

balance_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.87126172
kap,binary,0.04781843


#### Accuracy of "housing" variable as a predictor:

In [81]:
bank_recipe_housing <- recipe(y ~ housing, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_housing <- workflow() %>%
    add_recipe(bank_recipe_housing) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [82]:
bank_test_predictions_housing <- predict(knn_fit_housing, bank_test) %>%
  bind_cols(bank_test)

housing_pred = bank_test_predictions_housing %>%
  metrics(truth = y, estimate = .pred_class)

housing_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.8830296
kap,binary,0.0


#### Accuracy of "loan" variable as a predictor:

In [83]:
bank_recipe_loan <- recipe(y ~ loan, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_loan <- workflow() %>%
    add_recipe(bank_recipe_loan) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [84]:
bank_test_predictions_loan <- predict(knn_fit_loan, bank_test) %>%
  bind_cols(bank_test)

loan_pred = bank_test_predictions_loan %>%
  metrics(truth = y, estimate = .pred_class)

loan_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.8830296
kap,binary,0.0


#### Accuracy of "day" variable as a predictor:

In [85]:
bank_recipe_day <- recipe(y ~ day, data = bank_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit_day <- workflow() %>%
    add_recipe(bank_recipe_day) %>%
    add_model(knn_specification) %>%
    fit(data = bank_train)

In [86]:
bank_test_predictions_day <- predict(knn_fit_day, bank_test) %>%
  bind_cols(bank_test)

day_pred = bank_test_predictions_day %>%
  metrics(truth = y, estimate = .pred_class)

day_pred

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.8830296
kap,binary,0.0
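
The per-variable cells above all follow the same pattern, so they could be folded into one function. The sketch below is one possible way to do that. The helper name `single_predictor_metrics` and its arguments are hypothetical and not part of the original notebook. It assumes the tidymodels packages and `kknn` are installed, and that the training and testing data both contain the outcome column `y`.

```r
# Hypothetical helper, not part of the original notebook: fit a K-NN workflow
# on a single predictor and return its test-set metrics.
single_predictor_metrics <- function(predictor, train, test, spec) {
  # build the formula y ~ <predictor> from the column name
  model_formula <- stats::reformulate(predictor, response = "y")

  # same preprocessing as the cells above: scale, then center
  rec <- recipes::recipe(model_formula, data = train)
  rec <- recipes::step_scale(rec, recipes::all_predictors())
  rec <- recipes::step_center(rec, recipes::all_predictors())

  # combine recipe and model, then fit on the training set
  wf <- workflows::workflow()
  wf <- workflows::add_recipe(wf, rec)
  wf <- workflows::add_model(wf, spec)
  wf_fit <- generics::fit(wf, data = train)

  # predict on the test set and compute accuracy and kappa
  preds <- dplyr::bind_cols(stats::predict(wf_fit, test), test)
  result <- yardstick::metrics(preds, truth = y, estimate = .pred_class)

  # label the rows so results for several predictors can be stacked
  dplyr::mutate(result, predictor = predictor)
}

# Possible usage with the objects defined in this notebook:
# predictors <- c("marital", "age", "balance", "housing", "loan", "day")
# all_metrics <- purrr::map_dfr(predictors, single_predictor_metrics,
#                               train = bank_train, test = bank_test,
#                               spec = knn_specification)
```

Stacking the results in one table would make it easier to compare accuracy and kappa across predictors side by side.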
