### Group: Zeru Zhou's Group
### Name: Zeru Zhou
### Email: zeruzhou9@gmail.com
### Country: United States
### University: University of Southern California
### Specialization: Data Science

### Github Repo:

### Problem Description 
### In this project, I build binary classification machine learning models to predict whether bank clients will renew their term deposit, so that the bank can design strategies to retain them.

### Dataset Information
### The dataset contains three parts: client information, campaign information, and social/economic context.
### There are 41,188 instances and 21 columns in total (20 input features plus the target).
### The target response is whether the bank client will renew the term deposit.

## Library importing

In [1]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import RidgeClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from imblearn.pipeline import Pipeline
#from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score, RocCurveDisplay
from sklearn.model_selection import train_test_split
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFECV

## Data uploading

In [2]:
my_df = pd.read_csv('../data/bank-additional/bank-additional-full.csv', sep=';')

In [3]:
my_df.shape

(41188, 21)

In [4]:
my_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
my_df['y'].value_counts()

no     36548
yes     4640
Name: y, dtype: int64

In [6]:
my_df.isnull().any()

age               False
job               False
marital           False
education         False
default           False
housing           False
loan              False
contact           False
month             False
day_of_week       False
duration          False
campaign          False
pdays             False
previous          False
poutcome          False
emp.var.rate      False
cons.price.idx    False
cons.conf.idx     False
euribor3m         False
nr.employed       False
y                 False
dtype: bool

In [7]:
# Collect the columns that contain no missing values
null_flags = my_df.isnull().any()
sub = [col for col in null_flags.index if not null_flags[col]]
sub

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed',
 'y']

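### As we can see, this is an imbalanced dataset: 36,548 clients answered "no" and only 4,640 answered "yes". `isnull()` reports no NaN values in this file, but missing entries in the categorical columns are coded as "unknown" (for example in `default`). We need to clean the dataset before moving on to analysis and model building.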
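The degree of imbalance can be quantified from the class counts printed above. The following is a minimal sketch that copies those counts into a small Series instead of reloading the file; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the my_df['y'].value_counts() output above.
counts = pd.Series({"no": 36548, "yes": 4640}, name="y")

# Share of each class and the majority-to-minority ratio.
shares = counts / counts.sum()
ratio = counts["no"] / counts["yes"]

print(shares.round(4))
print(f"majority/minority ratio: {ratio:.2f}")
```

On the real frame, `my_df['y'].value_counts(normalize=True)` gives the same shares directly.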

## Data Cleaning and imputation

### Drop duplicated rows

In [8]:
my_df.duplicated(subset= sub).value_counts()

False    41176
True        12
dtype: int64

In [9]:
my_df = my_df.drop_duplicates()
my_df.shape

(41176, 21)

### Impute missing values

### First, find categorical variables

In [78]:
my_df.dtypes

age                 int64
job                object
marital            object
education          object
default            object
balance           float64
housing            object
loan               object
contact            object
day               float64
month              object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
y                  object
day_of_week        object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
dtype: object

In [80]:
# Map each column name to its dtype (label-based lookup avoids deprecated positional indexing)
my_dict = {col: my_df.dtypes[col] for col in my_df.columns}
type_df = pd.DataFrame(my_dict, index=['dtype'])
type_df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
dtype,int64,object,object,object,object,float64,object,object,object,float64,object,int64,int64,int64,int64,object,object,object,float64,float64,float64,float64,float64


In [86]:
type_df = type_df.T
type_df

Unnamed: 0,dtype
age,int64
job,object
marital,object
education,object
default,object
balance,float64
housing,object
loan,object
contact,object
day,float64


In [90]:
col = type_df.loc[type_df['dtype'] == 'object']
categorical = list(col.index)
categorical.remove('y')
categorical

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'poutcome',
 'day_of_week']
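The same list of categorical columns can be obtained without building an intermediate dtype table. This sketch uses a small toy frame (its rows are illustrative, not taken from the data) and selects every non-numeric column except the target before one-hot encoding.

```python
import pandas as pd

# Toy frame standing in for my_df; column names mirror a few real ones.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "marital": ["married", "married", "married"],
    "y": ["no", "no", "yes"],
})

# Select non-numeric columns, then exclude the target.
categorical_cols = toy.select_dtypes(exclude="number").columns.drop("y").tolist()

# One-hot encode only those columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(categorical_cols)
print(encoded.columns.tolist())
```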

In [91]:
df = pd.get_dummies(my_df, columns=categorical)

In [93]:
df.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,marital_unknown,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_primary,education_professional.course,education_secondary,education_tertiary,education_university.degree,education_unknown,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_nonexistent,poutcome_other,poutcome_success,poutcome_unknown,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
0,58,2143.0,5.0,261,1,-1,0,no,,,,,,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
1,44,29.0,5.0,151,1,-1,0,no,,,,,,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
2,33,2.0,5.0,76,1,-1,0,no,,,,,,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
3,47,1506.0,5.0,92,1,-1,0,no,,,,,,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
4,33,1.0,5.0,198,1,-1,0,no,,,,,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0


In [94]:
x,y = df.drop(columns=['y']), df['y']

### First, we can try a simple imputer that fills each missing value with the column mean

In [95]:
x

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,marital_unknown,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_primary,education_professional.course,education_secondary,education_tertiary,education_university.degree,education_unknown,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,contact_cellular,contact_telephone,contact_unknown,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_nonexistent,poutcome_other,poutcome_success,poutcome_unknown,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed
0,58,2143.0,5.0,261,1,-1,0,,,,,,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
1,44,29.0,5.0,151,1,-1,0,,,,,,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
2,33,2.0,5.0,76,1,-1,0,,,,,,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
3,47,1506.0,5.0,92,1,-1,0,,,,,,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
4,33,1.0,5.0,198,1,-1,0,,,,,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,,,334,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
41184,46,,,383,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
41185,56,,,189,2,999,0,-1.1,94.767,-50.8,1.028,4963.6,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
41186,44,,,442,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0


In [104]:
x = x.drop(columns=['duration'])

In [105]:
x_simple = SimpleImputer(strategy='mean').fit_transform(x)
pd.DataFrame(x_simple).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
0,58.0,2143.0,5.0,1.0,-1.0,0.0,0.081922,93.57572,-40.502863,3.621293,5167.03487,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,44.0,29.0,5.0,1.0,-1.0,0.0,0.081922,93.57572,-40.502863,3.621293,5167.03487,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,33.0,2.0,5.0,1.0,-1.0,0.0,0.081922,93.57572,-40.502863,3.621293,5167.03487,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,47.0,1506.0,5.0,1.0,-1.0,0.0,0.081922,93.57572,-40.502863,3.621293,5167.03487,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,33.0,1.0,5.0,1.0,-1.0,0.0,0.081922,93.57572,-40.502863,3.621293,5167.03487,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### Since this is a high-dimensional dataset with 86,387 instances, a simple imputer may introduce large errors. So we try an iterative imputer, which models each feature with missing values as a function of the other features in round-robin fashion

In [134]:
x_iter = IterativeImputer(n_nearest_features=50).fit_transform(x)
x_iter_df = pd.DataFrame(x_iter)

In [108]:
x_iter_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
0,58.0,2143.0,5.0,1.0,-1.0,0.0,-0.001759,93.460615,-37.686446,3.6906,5166.507093,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,44.0,29.0,5.0,1.0,-1.0,0.0,0.013415,93.471578,-38.045953,3.66559,5165.584233,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,33.0,2.0,5.0,1.0,-1.0,0.0,-0.012448,93.45125,-38.446562,3.637786,5166.385384,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,47.0,1506.0,5.0,1.0,-1.0,0.0,0.008979,93.502296,-38.150717,3.642176,5163.373385,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,33.0,1.0,5.0,1.0,-1.0,0.0,-0.024682,93.495518,-37.526338,3.675858,5162.429888,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
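To illustrate why the two imputers can disagree, here is a minimal sketch on made-up data in which the second column is roughly twice the first. Mean imputation ignores that relationship, while iterative imputation can exploit it. The array and `random_state` are assumptions for the demonstration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data where the second column is roughly twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, np.nan],  # missing value to fill
])

# Mean imputation fills with the average of the observed values in the column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Iterative imputation regresses the incomplete column on the other column.
iter_filled = IterativeImputer(random_state=0).fit_transform(X)

print("mean imputation:     ", mean_filled[4, 1])
print("iterative imputation:", iter_filled[4, 1])
```

The mean-filled value sits in the middle of the observed range, whereas the iterative estimate follows the trend between the two columns.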


### After filling all the NAs, the next step is to eliminate some outliers

In [109]:
x_iter_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
count,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0,86387.0
mean,40.501314,1068.949277,14.197018,2.670437,479.792504,0.386181,0.039593,93.55541,-39.408163,3.613381,5164.835299,0.180467,0.219767,0.034068,0.026624,0.143332,0.046095,0.034727,0.094007,0.020987,0.165951,0.026821,0.007154,0.113651,0.603505,0.281917,0.000926,0.048341,0.02652,0.069976,0.110109,0.000208,0.079306,0.060657,0.268582,0.15397,0.140808,0.041522,0.891025,0.099506,0.009469,0.447938,0.01146,0.540602,0.832359,0.01146,0.156181,0.641532,0.207751,0.150717,0.064396,0.143806,0.004584,0.030664,0.016241,0.162802,0.123387,0.011842,0.318717,0.093417,0.016843,0.013301,0.105953,0.411532,0.0213,0.033385,0.427831,0.090592,0.098533,0.09976,0.093602,0.094158
std,10.534612,2267.8907,6.66517,2.947981,483.824356,1.713173,1.11072,0.471894,4.076148,1.228713,54.158662,0.384578,0.414091,0.181404,0.160984,0.350413,0.209692,0.18309,0.29184,0.143341,0.372039,0.161561,0.084278,0.317389,0.489172,0.449936,0.030417,0.214486,0.160677,0.255108,0.313028,0.014433,0.270217,0.238702,0.443225,0.360922,0.347826,0.199496,0.311609,0.299342,0.096848,0.497285,0.106437,0.498352,0.373549,0.106437,0.363029,0.479553,0.4057,0.357775,0.245459,0.350895,0.06755,0.172408,0.126401,0.369187,0.328882,0.108176,0.465982,0.291017,0.128683,0.114559,0.30778,0.492114,0.144382,0.17964,0.494767,0.28703,0.298036,0.299682,0.291276,0.29205
min,17.0,-8019.0,1.0,1.0,-1.0,0.0,-3.4,92.201,-50.8,-15.186267,1782.306391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,32.0,163.0,10.445861,1.0,-1.0,0.0,-0.22937,93.329312,-42.0,3.18187,5159.11953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,39.0,584.0,13.73086,2.0,246.0,0.0,0.001502,93.482906,-39.8,3.7858,5171.094469,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,48.0,1211.58629,17.166562,3.0,999.0,0.0,1.1,93.918,-36.4,4.857,5191.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,98.0,102127.0,36.74044,63.0,999.0,275.0,10.858251,133.895958,-25.313785,5.045,5228.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [126]:
y

0         no
1         no
2         no
3         no
4         no
        ... 
41183    yes
41184     no
41185     no
41186    yes
41187     no
Name: y, Length: 86387, dtype: object

In [135]:
x_iter_df['y'] = y.values

In [17]:
my_list = []
for key, value in my_df.dtypes.items():
    if value == 'object':
        my_list.append(key)
my_list.remove('y')
my_list

['job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'poutcome']

In [18]:
my_df = pd.get_dummies(my_df, columns=my_list)
my_df

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
1,57,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41181,37,281,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,...,0,0,1,0,0,0,0,0,1,0
41182,29,112,1,9,1,-1.1,94.767,-50.8,1.028,4963.6,...,0,0,1,0,0,0,0,0,0,1
41184,46,383,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,...,0,0,1,0,0,0,0,0,1,0
41185,56,189,2,999,0,-1.1,94.767,-50.8,1.028,4963.6,...,0,0,1,0,0,0,0,0,1,0


In [19]:
for i in my_df.columns:
    if i == 'y':
        continue
    q_low = my_df[i].quantile(0.01)
    q_high = my_df[i].quantile(0.99)
    my_df = my_df.loc[(my_df[i] <= q_high) & (my_df[i] >= q_low)]

In [20]:
my_df.shape

(36103, 64)

In [30]:
my_df = my_df.drop(columns=['duration'])
my_df.shape

(36103, 63)

### Above, for each feature, I dropped only the extreme values below the 1st percentile or above the 99th percentile, keeping the central 98%. It is also feasible to use Q1, Q3, and 1.5 × IQR to detect outliers, but I did not want to lose that much data.
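The trade-off between the two rules can be checked on a single column. The sketch below uses a synthetic right-skewed feature (an assumption, not the real data) and compares how many rows each rule keeps. Note that the loop above also runs over the one-hot columns and recomputes the quantiles after each filter, whereas this sketch applies each rule once to one numeric column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic skewed feature standing in for a column such as `campaign`.
s = pd.Series(rng.exponential(scale=2.0, size=10_000), name="feature")

# Rule used above: keep values between the 1st and 99th percentiles.
q_low, q_high = s.quantile([0.01, 0.99])
keep_pct = s.between(q_low, q_high)

# Alternative: Tukey's fences at 1.5 * IQR beyond the quartiles.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
keep_iqr = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(f"percentile rule keeps {keep_pct.mean():.1%} of rows")
print(f"1.5*IQR rule keeps    {keep_iqr.mean():.1%} of rows")
```

On skewed data like this, the IQR rule tends to discard more of the long tail than the percentile rule does.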

## Model Building

### For each model, we build a pipeline for feature selection and hyperparameter tuning with cross-validation. SMOTE is also included because the dataset is imbalanced.
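As a rough sketch of this pattern, the example below tunes a pipeline with cross-validation on synthetic imbalanced data. It uses scikit-learn only, so SMOTE is left out; with imbalanced-learn, a `SMOTE()` step would sit in an `imblearn.pipeline.Pipeline` before the model. The data, the grid, and the `f1` scoring choice are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic imbalanced data (about 10% positives) standing in for the bank data.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
# and the test set is transformed with training statistics only.
pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
param_grid = {"model__C": np.logspace(-2, 2, num=5)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1").fit(X_train, y_train)
test_f1 = search.score(X_test, y_test)

print(search.best_params_)
print(f"test F1: {test_f1:.3f}")
```

Keeping every preprocessing step inside the pipeline means `search.predict` can be called on raw test features without a separate scaling step.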

### Split the dataset into train/test

In [31]:
x,y = my_df.drop(columns=['y']), my_df['y']

In [32]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((32492, 62), (3611, 62), (32492,), (3611,))

### Logistic Regression

### For logistic regression, we can perform feature selection with L1 penalization, a shrinkage method that drives the coefficients of irrelevant variables to 0. Other methods are possible, but this is the most convenient one. We do feature selection together with hyperparameter tuning.
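The shrinkage effect can be seen by counting non-zero coefficients as `C` (the inverse regularization strength) changes. This is a minimal sketch on synthetic data; it uses the `liblinear` solver for speed on a small problem, whereas the cells below use `saga`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# 20 features, only 4 informative; the rest are noise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
X = MinMaxScaler().fit_transform(X)

nonzero_counts = {}
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    # Count the features whose coefficient was not shrunk to exactly zero.
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C:<5} non-zero coefficients: {nonzero_counts[C]} of {X.shape[1]}")
```

Smaller `C` means a stronger penalty and therefore fewer surviving features, which is why tuning `C` doubles as feature selection.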

In [33]:
log = LogisticRegression(penalty='l1', solver='saga', max_iter=300)

In [34]:
scale = MinMaxScaler()

In [35]:
imbalance = SMOTE()

In [36]:
param = {'model__C': np.logspace(-3, 3, num=50)}

In [37]:
x_train_scaled = scale.fit_transform(x_train)

In [38]:
pipe = Pipeline(steps=[('smote', imbalance), ('model', log)])

In [39]:
clf_log = GridSearchCV(pipe, param, n_jobs=-1).fit(x_train_scaled, y_train)



In [40]:
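# Reuse the scaler fitted on the training data; refitting on the test set would scale it inconsistently
x_test_scaled = scale.transform(x_test)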

In [41]:
y_pred = clf_log.predict(x_test_scaled)

In [42]:
acc = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.7864857380227084


In [43]:
pre = precision_score(y_test, y_pred,  pos_label='no')
print(f'The precision score is {pre}')

The precision score is 0.9589578872234118


In [44]:
rec = recall_score(y_test, y_pred,  pos_label='no')
print(f'The recall score is {rec}')

The recall score is 0.8037690696978762


In [45]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[2687  656]
 [ 115  153]]
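The precision and recall above are computed with `pos_label='no'`, so they describe the majority class. The scores for the minority class "yes" can be read off the same confusion matrix. The sketch below assumes scikit-learn's default label ordering (sorted labels, so row and column 0 are "no").

```python
import numpy as np

# Confusion matrix printed above; rows are true ['no', 'yes'],
# columns are predicted ['no', 'yes'].
cm = np.array([[2687, 656],
               [115, 153]])

tn, fp, fn, tp = cm.ravel()  # treating 'yes' as the positive class

precision_yes = tp / (tp + fp)
recall_yes = tp / (tp + fn)
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)
accuracy = (tp + tn) / cm.sum()

print(f"accuracy:        {accuracy:.4f}")
print(f"precision (yes): {precision_yes:.4f}")
print(f"recall (yes):    {recall_yes:.4f}")
print(f"F1 (yes):        {f1_yes:.4f}")
```

The same numbers come from `precision_score(y_test, y_pred, pos_label='yes')` and `recall_score(y_test, y_pred, pos_label='yes')`.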


### Try RFE instead of the L1 penalty

In [46]:
log = LogisticRegression(max_iter=5000, penalty='none')

In [47]:
feature, target = imbalance.fit_resample(x_train, y_train)

In [48]:
feature_scaled = scale.fit_transform(feature)

In [49]:
clf_log_rfe = RFECV(log, cv=5, n_jobs=-1).fit(feature_scaled, target)

In [50]:
y_pred = clf_log_rfe.predict(x_test_scaled)

In [51]:
acc = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.9277208529493215


In [52]:
pre = precision_score(y_test, y_pred,  pos_label='no')
print(f'The precision score is {pre}')

The precision score is 0.9331084879145587


In [53]:
rec = recall_score(y_test, y_pred,  pos_label='no')
print(f'The recall score is {rec}')

The recall score is 0.9931199521387974


In [54]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[3320   23]
 [ 238   30]]


### Without class imbalance treatment

In [55]:
log = LogisticRegression(max_iter=5000)

In [56]:
param = {'C': np.logspace(-3, 3, num=50)}

In [57]:
clf_log_1 = GridSearchCV(log, param, n_jobs=-1).fit(x_train_scaled, y_train)



In [58]:
y_pred = clf_log_1.predict(x_test_scaled)

In [59]:
acc = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.9304901689282747


In [60]:
pre = precision_score(y_test, y_pred,  pos_label='no')
print(f'The precision score is {pre}')

The precision score is 0.9342696629213483


In [61]:
rec = recall_score(y_test, y_pred,  pos_label='no')
print(f'The recall score is {rec}')

The recall score is 0.9949147472330242


In [62]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[3326   17]
 [ 234   34]]


### K-Nearest Neighbors

In [63]:
knn = KNeighborsClassifier(weights='uniform')

In [64]:
params = {'n_neighbors': list(range(1, 21))}

In [65]:
clf_knn = GridSearchCV(knn, params, n_jobs=-1).fit(feature_scaled, target)

In [66]:
y_pred = clf_knn.predict(x_test_scaled)

In [67]:
acc = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.910828025477707


In [68]:
pre = precision_score(y_test, y_pred,  pos_label='no')
print(f'The precision score is {pre}')

The precision score is 0.934176487496407


In [69]:
rec = recall_score(y_test, y_pred,  pos_label='no')
print(f'The recall score is {rec}')

The recall score is 0.9721806760394855


In [70]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[3250   93]
 [ 229   39]]


### Ensemble Tree

In [71]:
rdf = RandomForestClassifier()

In [72]:
params = {'ccp_alpha': np.logspace(-3, 3, num=20)}

In [73]:
clf_rdf = GridSearchCV(rdf, params, n_jobs=-1).fit(feature_scaled, target)

In [74]:
y_pred = clf_rdf.predict(x_test_scaled)

In [75]:
acc = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.8759346441428967


In [76]:
pre = precision_score(y_test, y_pred,  pos_label='no')
print(f'The precision score is {pre}')

The precision score is 0.9570571518787496


In [77]:
rec = recall_score(y_test, y_pred,  pos_label='no')
print(f'The recall score is {rec}')

The recall score is 0.9066706551002094


In [78]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[3031  312]
 [ 136  132]]


### Boosting Tree

In [79]:
xgb = XGBClassifier(eta = 0.01, objective = 'binary:logistic')

In [80]:
params = {'reg_alpha': np.logspace(-3, 3, 10)}

In [81]:
def change(num):
    if num == 'no':
        return 0
    else:
        return 1

In [82]:
target_num = [change(i) for i in target]
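A vectorised alternative to the `change()` helper is an explicit mapping. The sketch below uses a short stand-in Series rather than the real target.

```python
import pandas as pd

# Small stand-in for the string target used above.
target = pd.Series(["no", "yes", "no", "no", "yes"])

# Explicit mapping: 'no' -> 0, 'yes' -> 1.
# Unlike change(), an unexpected label becomes NaN instead of silently mapping to 1.
target_num = target.map({"no": 0, "yes": 1})

print(target_num.tolist())
```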

In [83]:
clf_xgb = GridSearchCV(xgb, params, n_jobs=-1).fit(feature_scaled, target_num)

In [84]:
y_pred = clf_xgb.predict(x_test_scaled)

In [85]:
y_test_num = [change(i) for i in y_test]

In [86]:
acc = accuracy_score(y_test_num, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.8726114649681529


In [87]:
pre = precision_score(y_test_num, y_pred)
print(f'The precision score is {pre}')

The precision score is 0.2818181818181818


In [88]:
rec = recall_score(y_test_num, y_pred)
print(f'The recall score is {rec}')

The recall score is 0.4626865671641791


In [89]:
cm = confusion_matrix(y_test_num, y_pred)
print(cm)

[[3027  316]
 [ 144  124]]


### SVM

In [92]:
svc = SVC()

In [98]:
params = {'C': np.logspace(-3, 2, 10)}

In [99]:
clf_svm = GridSearchCV(svc, params, n_jobs=-1).fit(feature_scaled, target)



In [100]:
y_pred = clf_svm.predict(x_test_scaled)

In [101]:
acc = accuracy_score(y_test, y_pred)
print(f'The accuracy score is {acc}')

The accuracy score is 0.9293824425366934


In [102]:
pre = precision_score(y_test, y_pred,  pos_label='no')
print(f'The precision score is {pre}')

The precision score is 0.9356659142212189


In [103]:
rec = recall_score(y_test, y_pred,  pos_label='no')
print(f'The recall score is {rec}')

The recall score is 0.9919234220759796


In [104]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[3316   27]
 [ 228   40]]


### Based on the results above, I tried several ML models with feature selection and hyperparameter tuning. Logistic regression without resampling had the highest test accuracy (0.930), narrowly ahead of SVM (0.929) and ahead of k-nearest neighbors, random forest, and XGBoost.
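To put these accuracies in context, the sketch below recomputes them from the confusion matrices printed above and places them next to two reference numbers: the recall on the "yes" class and the accuracy of always predicting "no" on this test set. The model labels in the dictionary are shorthand introduced here.

```python
# Test-set confusion matrices printed above, as [[TN, FP], [FN, TP]]
# with 'yes' treated as the positive class.
matrices = {
    "logistic L1 + SMOTE": [[2687, 656], [115, 153]],
    "logistic RFECV + SMOTE": [[3320, 23], [238, 30]],
    "logistic, no resampling": [[3326, 17], [234, 34]],
    "k-nearest neighbors": [[3250, 93], [229, 39]],
    "random forest": [[3031, 312], [136, 132]],
    "XGBoost": [[3027, 316], [144, 124]],
    "SVM": [[3316, 27], [228, 40]],
}

def summarize(cm):
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_yes": tp / (tp + fn),
        "baseline": (tn + fp) / total,  # accuracy of always predicting 'no'
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, m in summary.items():
    print(f"{name:<26} acc={m['accuracy']:.4f} "
          f"recall_yes={m['recall_yes']:.4f} baseline={m['baseline']:.4f}")
```

Reading accuracy together with minority-class recall shows how much of each score comes from the majority class alone.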