# ML-Zoomcamp Capstone Project - Bank Marketing

The source data and its description can be found [here](https://archive.ics.uci.edu/ml/datasets/Pedal+Me+Bicycle+Deliveries).

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

### 1. Importing libraries and loading the data

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('./data/bank-full.csv', sep=';')
df.head()

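| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |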


### 2. Data preprocessing

#### 2.1 Duplicates

Taking a look at duplicates:

In [3]:
df[df.duplicated(keep='last')]

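| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|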


Dropping duplicate rows, if any, and comparing the row count before and after:

In [4]:
before = df.shape[0]
df.drop_duplicates(inplace=True)
after = df.shape[0]

before, after

(45211, 45211)
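
As a minimal illustration of the two calls above, the sketch below uses a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset). `duplicated(keep='last')` flags every repeated row except its last occurrence, whereas `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    'age': [58, 44, 58],
    'job': ['management', 'technician', 'management'],
})

# keep='last' marks the earlier copy (row 0) as the duplicate
flagged = toy[toy.duplicated(keep='last')]

# drop_duplicates() defaults to keep='first', so rows 0 and 1 survive
deduped = toy.drop_duplicates()

print(flagged)
print(len(toy), len(deduped))
```

An empty result from `duplicated`, as in the cell above, means `drop_duplicates` has nothing to remove, which is why the row count stays the same.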

#### 2.2 Dealing with column names and value formats

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 7.2+ MB


In [8]:
df.contact.value_counts()

cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64

From the data description, we see a few things:

- The features `day` and `month` are not necessary: we already have a variable called `pdays`, which stands for the number of days that passed since the client was last contacted (`-1` means the client was not previously contacted).

- The variable `contact` (telephone or cellular) is irrelevant: we already know we are analyzing a telephone marketing campaign. Additionally, `unknown` contacts are about 4.5 times as frequent as `telephone` contacts (13020 vs. 2906).

- Column `y` is the output variable: whether the client made a term deposit or not. We are going to rename it to `success`.
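
The proportions behind the `contact` remark can be checked with a few lines. This is only a sketch: the counts are copied by hand from the `value_counts()` output above, and the variable names are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the df.contact.value_counts() output above
contact_counts = pd.Series({'cellular': 29285, 'unknown': 13020, 'telephone': 2906})

# Share of each contact type among all rows
shares = contact_counts / contact_counts.sum()

# How many 'unknown' contacts there are per 'telephone' contact
unknown_to_telephone = contact_counts['unknown'] / contact_counts['telephone']

print(shares.round(3))
print(round(unknown_to_telephone, 2))
```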

Dropping the `day`, `month` and `contact` columns:

In [9]:
df.drop(['day', 'month', 'contact'], axis=1, inplace=True)
df.head()

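| | age | job | marital | education | default | balance | housing | loan | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | 198 | 1 | -1 | 0 | unknown | no |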


In [10]:
df.rename(columns={'y': 'success'}, inplace=True)
df.head()


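| | age | job | marital | education | default | balance | housing | loan | duration | campaign | pdays | previous | poutcome | success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | 198 | 1 | -1 | 0 | unknown | no |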


We need to convert the yes/no `success` values into 1/0 values:

In [13]:
df.success = (df.success == 'yes').astype('int')
df.success

0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: success, Length: 45211, dtype: int64
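
The conversion above compares each value with `'yes'` and casts the resulting booleans to integers. The sketch below reproduces that on a made-up series (`toy_y` and `messy_y` are hypothetical) and contrasts it with an explicit `map`, which is one possible alternative when unexpected labels should be surfaced rather than silently turned into 0.

```python
import pandas as pd

# Hypothetical labels, not taken from the dataset
toy_y = pd.Series(['no', 'yes', 'no', 'yes'])

# Same approach as in the notebook: boolean comparison, then cast to int
encoded = (toy_y == 'yes').astype(int)

# Alternative: an explicit mapping gives the same result on clean labels
mapped = toy_y.map({'yes': 1, 'no': 0})

# With an unexpected label, the comparison yields 0 while map yields NaN
messy_y = pd.Series(['no', 'yes', 'Yes'])
messy_encoded = (messy_y == 'yes').astype(int)
messy_mapped = messy_y.map({'yes': 1, 'no': 0})

print(encoded.tolist(), mapped.tolist())
print(messy_encoded.tolist(), messy_mapped.isna().sum())
```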

### 3. Exploratory data analysis and feature importance

Splitting the data (train/val/test = 60%/20%/20%)

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=7)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=7)

len(df) == len(df_train) + len(df_val) + len(df_test)

True

In [18]:
len(df_train), len(df_val), len(df_test)

(27126, 9042, 9043)
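
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2, which gives the 60%/20%/20% proportions. The sketch below checks the arithmetic on a plain index array standing in for the 45211 rows (the `rows` array is an assumption used in place of `df`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 45211 rows of df; only the sizes matter here
rows = np.arange(45211)

# First split: 80% full train, 20% test
full_train, test = train_test_split(rows, test_size=0.2, random_state=7)

# Second split: 25% of the 80% becomes validation, i.e. 20% of the total
train, val = train_test_split(full_train, test_size=0.25, random_state=7)

sizes = (len(train), len(val), len(test))
print(sizes)
print([round(s / len(rows), 3) for s in sizes])
```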

In [16]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.success.values
y_val = df_val.success.values
y_test = df_test.success.values

In [17]:
del df_train['success']
del df_val['success']
del df_test['success']

Looking at the target value:

In [19]:
df_full_train.success.value_counts(normalize=True)

0    0.881857
1    0.118143
Name: success, dtype: float64

According to these numbers, about 12% of the clients contacted made a term deposit as a result of the telemarketing campaign.

We can call this the __success rate__.

In [20]:
global_success_rate = df_full_train.success.mean()
round(global_success_rate, 3)

0.118

Looking at numerical and categorical values:

In [34]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27126 entries, 0 to 27125
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        27126 non-null  int64 
 1   job        27126 non-null  object
 2   marital    27126 non-null  object
 3   education  27126 non-null  object
 4   default    27126 non-null  object
 5   balance    27126 non-null  int64 
 6   housing    27126 non-null  object
 7   loan       27126 non-null  object
 8   duration   27126 non-null  int64 
 9   campaign   27126 non-null  int64 
 10  pdays      27126 non-null  int64 
 11  previous   27126 non-null  int64 
 12  poutcome   27126 non-null  object
dtypes: int64(6), object(7)
memory usage: 2.7+ MB


In [35]:
numerical = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
categorical = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'poutcome']

A closer look at categorical values:

In [36]:
df_full_train[categorical].nunique()

job          12
marital       3
education     4
default       2
housing       2
loan          2
poutcome      4
dtype: int64

Looking at feature importance for categorical values:

In [25]:
from sklearn.metrics import mutual_info_score

In [26]:
def mutual_info_success_score(series):
    return mutual_info_score(series, df_full_train.success)

In [37]:
mutual_info = df_full_train[categorical].apply(mutual_info_success_score)
mutual_info.sort_values(ascending=False)

poutcome     0.029603
housing      0.010089
job          0.008228
education    0.002813
loan         0.002582
marital      0.001976
default      0.000332
dtype: float64

As we see, `[poutcome, housing, job]` have the highest mutual information with the campaign success, while `[default]` has by far the lowest.
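
To get a feel for the scale of these scores, the sketch below computes `mutual_info_score` on made-up labels (all three lists are hypothetical). A feature that fully determines a balanced binary target reaches ln 2 ≈ 0.693 nats, while a feature that is independent of the target scores 0.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical balanced binary target
target = [0, 0, 0, 0, 1, 1, 1, 1]

# Feature that determines the target exactly
informative = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']

# Feature whose values are evenly spread across both classes
uninformative = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(mi_informative, math.log(2))
print(mi_uninformative)
```

Because the real target is imbalanced (about 12% positives), its entropy caps the attainable score well below ln 2, so small absolute values are expected and the ranking is the more useful reading.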

Looking at feature importance for numerical values:

In [38]:
df_full_train[numerical].corrwith(df_full_train.success).abs().sort_values(ascending=False)

duration    0.390809
pdays       0.101776
previous    0.091025
campaign    0.074874
balance     0.051821
age         0.025072
dtype: float64

There is a moderate correlation (about 0.39) between the `duration` variable and success. At the same time, `age` shows the weakest linear association with success.
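
Note that `.abs()` discards the sign of each correlation, so the ranking shows strength but not direction. The sketch below, on a made-up frame (`toy` and its column names are assumptions), keeps both views side by side.

```python
import pandas as pd

# Hypothetical data: one feature rises with the target, one falls, one is unrelated
toy = pd.DataFrame({
    'x_up': [1, 2, 3, 4],
    'x_down': [4, 3, 2, 1],
    'x_flat': [1, 2, 2, 1],
    'target': [0, 0, 1, 1],
})

features = ['x_up', 'x_down', 'x_flat']

# Signed Pearson correlation of each feature with the target
corr = toy[features].corrwith(toy['target'])

# Ranking by absolute value, as in the cell above
ranked = corr.abs().sort_values(ascending=False)

print(corr)
print(ranked)
```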

### 4. Training models

We are going to be using three different classification models:

- Logistic Regression
- Random Forest Classifier
- XGBoost

In [40]:
from sklearn.feature_extraction import DictVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.metrics import roc_auc_score

Getting the train and val matrices:

In [39]:
train_dict = df_train[numerical + categorical].to_dict(orient='records')
val_dict = df_val[numerical + categorical].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

Training the Logistic Regression model, using the `penalty` parameter to tune it:

In [45]:
penalties = ['l1', 'l2']
scores = []

for penalty in penalties:
    log_reg = LogisticRegression(solver='saga', penalty=penalty, max_iter=10000, random_state=7)
    log_reg.fit(X_train, y_train)

    y_pred = log_reg.predict_proba(X_val)[:, 1]
    score = roc_auc_score(y_val, y_pred)
    scores.append((penalty, score))

df_log_reg = pd.DataFrame(scores, columns=['penalty', 'auc'])

In [46]:
df_log_reg

| | penalty | auc |
|---|---|---|
| 0 | l1 | 0.676419 |
| 1 | l2 | 0.676421 |


As we see, there is no significant difference between using one penalty or the other.
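
The scikit-learn documentation notes that fast convergence of the `saga` solver is only guaranteed when features are on approximately the same scale, and the features here range from 0/1 indicators to balances in the thousands. The sketch below shows one possible way to add scaling in front of the same model. It runs on synthetic data from `make_classification`, not on the bank dataset, so the scores it prints say nothing about this project; the pipeline and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the real feature matrix
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.88, 0.12], random_state=7,
)
X_syn[:, 0] *= 1000  # mimic a large-scale column such as balance

X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=7, stratify=y_syn
)

aucs = {}
for penalty in ['l1', 'l2']:
    # Standardize features, then fit the same saga-based model as above
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver='saga', penalty=penalty,
                           max_iter=10000, random_state=7),
    )
    model.fit(X_tr, y_tr)
    aucs[penalty] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

print(aucs)
```

Putting the scaler inside the pipeline means it is fitted on the training part only, so the validation data does not influence the scaling parameters.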

Training a Random Forest model, using max_depth and 