# Bank Direct Marketing data

The dataset comes from Kaggle: https://www.kaggle.com/ruthgn/bank-marketing-data-set

## Column Summaries

bank client data:
* 1 - age (numeric)
* 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
* 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
* 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
* 5 - default: has credit in default? (categorical: 'no','yes','unknown')
* 6 - housing: has housing loan? (categorical: 'no','yes','unknown')
* 7 - loan: has personal loan? (categorical: 'no','yes','unknown')

related to the last contact of the current campaign:
* 8 - contact: contact communication type (categorical: 'cellular','telephone')
* 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')
* 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

other attributes:
* 11 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* 12 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
* 13 - previous: number of contacts performed before this campaign and for this client (numeric)
* 14 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
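
The `999` value in `pdays` is a placeholder rather than a real day count, so treating it as a number can distort averages or scaled inputs. Below is a minimal sketch of one way to handle it, on a toy frame with made-up values. The column names `previously_contacted` and `pdays_clean` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real CSV.
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag whether the client was contacted in a previous campaign.
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Keep real day counts and turn the 999 sentinel into NaN
# so it no longer behaves like a genuine number of days.
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999)

print(toy)
```

Whether to keep the indicator, the cleaned column, or both depends on the model; this is only one possible treatment.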

social and economic context attributes:
* 15 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
* 16 - cons.price.idx: consumer price index - monthly indicator (numeric)
* 17 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
* 18 - euribor3m: euribor 3 month rate - daily indicator (numeric)
* 19 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

* 20 - y - has the client subscribed to a term deposit? (binary: 'yes','no')

## Load DataFrame

In [3]:
import pandas as pd

In [5]:
df = pd.read_csv("bank-direct-marketing-campaigns.csv")
df.head(20)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
5,45,services,married,basic.9y,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
6,59,admin.,married,professional.course,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
7,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
8,24,technician,single,professional.course,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
9,25,services,single,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [10]:
for c in df.columns:
    print(c, len(df[c].unique()), df[c].dtype)

age 78 int64
job 12 object
marital 4 object
education 8 object
default 3 object
housing 3 object
loan 3 object
contact 2 object
month 10 object
day_of_week 5 object
campaign 42 int64
pdays 27 int64
previous 8 int64
poutcome 3 object
emp.var.rate 10 float64
cons.price.idx 26 float64
cons.conf.idx 26 float64
euribor3m 316 float64
nr.employed 11 float64
y 2 object
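
The per-column loop above can also be written with built-in pandas methods. The sketch below uses a made-up three-row frame (the values are illustrative, not read from the CSV). One difference to keep in mind: `nunique()` ignores missing values by default, whereas `len(df[c].unique())` counts `NaN` as a value. A summary like this is also a quick way to spot gaps, such as `month` showing only 10 distinct values in the output above even though the column summary lists codes from 'jan' to 'dec'.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "euribor3m": [4.857, 4.857, 4.857],
})

# One row per column: number of distinct values and the column dtype.
summary = pd.DataFrame({"n_unique": toy.nunique(), "dtype": toy.dtypes})
print(summary)
```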


## Preliminary Investigation

### Does the day of the week correlate with success?

In [14]:
df["y"].value_counts()

no     36548
yes     4640
Name: y, dtype: int64
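
The two counts above give the overall success rate, which is a useful baseline for the per-day rates: 4640 of 41188 calls, or roughly 11.3%. A small sketch of the arithmetic, with the counts copied by hand from the output above rather than read from the CSV (the names `counts` and `overall_rate` are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"no": 36548, "yes": 4640}, name="y")

# Share of calls that ended in a subscription.
overall_rate = counts["yes"] / counts.sum()
print(round(100 * overall_rate, 1))
```

On the real frame, `df["y"].value_counts(normalize=True)` should give the same proportions directly.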

In [25]:
temp_dict = {}
# Count calls and successful outcomes ("yes") for each day of the week.
for d, g in df.groupby("day_of_week"):
    temp_dict[d] = {}
    temp_dict[d]["calls"] = len(g)
    temp_dict[d]["success"] = (g["y"] == "yes").sum()
    temp_dict[d]["success_rate"] = temp_dict[d]["success"] / temp_dict[d]["calls"]

# Print the success rate (in percent) in weekday order.
for day in ["mon", "tue", "wed", "thu", "fri"]:
    print(day, round(100*temp_dict[day]["success_rate"], 1))

mon 9.9
tue 11.8
wed 11.7
thu 12.1
fri 10.8
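
The per-day rates above differ by only a couple of percentage points, and nothing here tests whether those gaps are statistically meaningful. For reference, the same table can be built without a manual dictionary. The sketch below uses pandas named aggregation on a made-up six-row frame; the rows and the `by_day` name are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; the notebook itself uses the full CSV.
toy = pd.DataFrame({
    "day_of_week": ["mon", "mon", "tue", "tue", "tue", "wed"],
    "y": ["no", "yes", "no", "no", "yes", "no"],
})

by_day = (
    toy.assign(success=toy["y"].eq("yes"))       # boolean success flag
       .groupby("day_of_week")["success"]
       .agg(calls="size", success="sum", success_rate="mean")
       .reindex(["mon", "tue", "wed"])           # weekday order, not alphabetical
)
print(by_day)
```

The `reindex` step matters on the real data because `groupby` sorts the day labels alphabetically ('fri' first) rather than in calendar order.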


## Preparing dataset for Neural Network

### One Hot Encoding

#### Testing on a smaller dataframe

In [36]:
df_mini = df[["age", "job", "marital", "education"]]
df_mini.head()

Unnamed: 0,age,job,marital,education
0,56,housemaid,married,basic.4y
1,57,services,married,high.school
2,37,services,married,high.school
3,40,admin.,married,basic.6y
4,56,services,married,high.school


In [37]:
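def one_hot_encode_column(df_input, column_name):
    # dtype=int keeps the 0/1 columns shown below (pandas 2.0+ defaults to bool).
    one_hot = pd.get_dummies(df_input[column_name], prefix=column_name, dtype=int)
    # Swap the original categorical column for its one-hot columns.
    df_output = df_input.drop(column_name, axis=1).join(one_hot)
    return df_output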

In [39]:
for c in ["job", "marital", "education"]:
    df_mini = one_hot_encode_column(df_mini, c)

In [42]:
df_mini.head()

Unnamed: 0,age,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,...,marital_single,marital_unknown,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown
0,56,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,57,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2,37,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,40,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,56,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
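
As an alternative to encoding one column at a time, `pd.get_dummies` accepts a `columns` argument and can encode several categorical columns in a single call. The sketch below runs on a made-up three-row frame (values are illustrative). Passing `dtype=int` is an assumption made here to get 0/1 columns, since recent pandas versions produce booleans by default.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "marital": ["married", "married", "single"],
})

# Encode both categorical columns at once; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["job", "marital"], dtype=int)
print(encoded.columns.tolist())
```

Note that only categories present in the data get a column, so a category missing from one split of the data would produce a different set of columns there.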


#### Can we estimate someone's age from the other factors in this smaller dataset?