### Group 12 Project Proposal: 
**Bank Marketing Classification**

**Introduction**

*Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal:*

Phone-based direct marketing campaigns run by a Portuguese banking institution recorded information about each client contacted (e.g., age, job). The campaigns produced 45211 client records.

*Clearly state the question you will try to answer with your project:*

Based on the predictors, will a given person subscribe to a term deposit or not?

*Identify and describe the dataset that will be used to answer the question:*

We will use `bank-additional.csv`, which contains 4119 rows, rather than the complete dataset of 41188 rows. The smaller file should run more smoothly on the server and reduce overplotting in our visualizations.



**Preliminary Exploratory Data Analysis**

In [None]:
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”


In [None]:
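# Semicolon-delimited with "." decimals: read_csv2() would misparse values such as 93.994.
data <- read_delim("https://drive.google.com/u/0/uc?id=1r8z-f85larbfItrABtwmpBeNK_TSDNO1&export=download", delim = ";")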
head(data, 10)

In [None]:
# Copy the label column to `subscribed` as a factor, as classification models expect.
data <- mutate(data, subscribed = as_factor(y))
data <- select(data, age, duration, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed, subscribed)
#head(data, 10)

In [None]:
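set.seed(2020) # arbitrary seed so the random split is reproducible
bank_split <- initial_split(data, prop = 0.50, strata = subscribed)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
head(bank_train, 15) # with prop = 0.50, about half of the 4119 rows; 8 columns
# bank_test holds the remaining rows, also 8 columns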

In [None]:
group_by(bank_train, subscribed) %>% summarize(count=n())

In [None]:
# Standardize both predictors so they are on a comparable scale.
bank_recipe <- recipe(subscribed ~ duration + age, data = bank_train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors()) %>%
    prep()

# Apply the prepared recipe to the training data and plot the standardized predictors.
scaled_bank <- bake(bank_recipe, bank_train)
ggplot(scaled_bank, aes(x = age, y = duration, color = subscribed)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age (standardized)", y = "Duration of call (standardized)", color = "Subscribed") +
    ggtitle("Age vs Duration of call") +
    theme(text = element_text(size = 18), plot.title = element_text(hjust = 0.5))

**Methods**

*Explain how you will conduct either your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?*

We will be using `bank-additional.csv` and **keeping** the following columns. We believe that these variables will influence the prediction the most and will have the greatest effect on the classifier's accuracy:

* `age`: the age of the client
* `job`: the type of job held (e.g., admin, blue-collar)
* `marital`: marital status
* `education`: level of education
* `default`: does the client have credit in default?
* `housing`: does the client have a housing loan?
* `loan`: does the client have a personal loan?
* `month`: month the client was contacted during
* `campaign`: number of times this client was contacted during this campaign
* `poutcome`: outcome of the previous marketing campaign
* `y`: did the client subscribe to a term deposit? (categorical label: yes or no)

The following columns will be **omitted** when tidying our data. We believe that these variables are unnecessary in that they will have little to no effect on the prediction:

* `contact`: was the client contacted through a cellphone or telephone? 
* `day_of_week`: day of the week the client was last contacted on
* `duration`: duration of the last phone call, in seconds 
* `pdays`: number of days since the client was last contacted in a previous campaign
* `previous`: number of times this client was contacted before this campaign
* `emp.var.rate`: employment variation rate
* `cons.price.idx`: consumer price index
* `cons.conf.idx`: consumer confidence index
* `euribor3m`: euribor 3 month rate
* `nr.employed`: number of employees
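
The column choices above can be sketched in code. This is a minimal illustration rather than our final tidying step: `toy_bank` is a hypothetical two-row stand-in whose values are invented so the example runs on its own, `keep_cols` and `bank_kept` are illustrative names, and the column names assume the file's headers are `marital` and `campaign`.

```r
library(dplyr)

# Hypothetical stand-in for bank-additional.csv; all values are invented.
toy_bank <- tibble(
  age = c(30L, 45L),
  job = c("admin.", "blue-collar"),
  marital = c("married", "single"),
  education = c("university.degree", "basic.9y"),
  default = c("no", "unknown"),
  housing = c("yes", "no"),
  loan = c("no", "no"),
  contact = c("cellular", "telephone"),
  month = c("may", "jun"),
  day_of_week = c("mon", "fri"),
  duration = c(120, 300),
  campaign = c(1L, 3L),
  pdays = c(999L, 999L),
  previous = c(0L, 1L),
  poutcome = c("nonexistent", "failure"),
  emp.var.rate = c(1.1, -1.8),
  cons.price.idx = c(93.994, 92.893),
  cons.conf.idx = c(-36.4, -46.2),
  euribor3m = c(4.857, 1.313),
  nr.employed = c(5191, 5099.1),
  y = c("no", "yes")
)

# Columns listed under "keeping" above (assumed header spellings).
keep_cols <- c("age", "job", "marital", "education", "default", "housing",
               "loan", "month", "campaign", "poutcome", "y")

# Keep only the chosen columns and store the label as a factor.
bank_kept <- toy_bank %>%
  select(all_of(keep_cols)) %>%
  mutate(y = factor(y, levels = c("no", "yes")))
```

Using `all_of()` makes the selection fail loudly if any listed column is missing from the file, which helps catch misspelled column names early.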

*Describe at least one way that you will visualize the results:*

We will create a scatterplot of two predictors with the points coloured by class (yes or no), so that we can see where a new observation's prediction falls relative to the two groups.
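
The scatterplot described above could look roughly like the following sketch. Everything here is an assumption made for illustration: `toy_train` is simulated data rather than the bank data, `x1` and `x2` stand in for whichever two standardized predictors we end up plotting, and `new_client` is a hypothetical observation to be classified.

```r
library(ggplot2)

# Simulated, already-standardized training points (not the bank data).
set.seed(1)
toy_train <- data.frame(
  x1 = rnorm(40),
  x2 = rnorm(40),
  subscribed = factor(rep(c("no", "yes"), each = 20))
)

# Hypothetical new client whose class we want to predict.
new_client <- data.frame(x1 = 0.5, x2 = -0.25)

# Colour the training points by class and mark the new client with a cross.
results_plot <- ggplot(toy_train, aes(x = x1, y = x2)) +
  geom_point(aes(color = subscribed), alpha = 0.6) +
  geom_point(data = new_client, shape = 4, size = 4, stroke = 1.5) +
  labs(x = "Predictor 1 (standardized)",
       y = "Predictor 2 (standardized)",
       color = "Subscribed")
```

Printing `results_plot` draws the figure; the cross shows where the new client sits relative to the "yes" and "no" groups.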


**Expected Outcomes and Significance:**

*What do you expect to find?*

We expect age, job, education, and marital status to have the greatest influence on the accuracy of the model.

*What impact could such findings have?*

Our findings could be useful for advertising: a bank could target a particular demographic if it knew that group was the most likely to subscribe to a term deposit.

*What future questions could this lead to?*

Why do these variables influence whether or not someone subscribes to a term deposit in the first place? 
