# Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model

You work as a data scientist at a bank. The bank would like to implement a model that predicts the likelihood of a customer purchasing a term deposit. The bank provides you with a dataset, which is the same as the one in Chapter 3, Binary Classification. You have previously learned how to train a logistic regression model for binary classification. You have also heard about non-parametric modeling techniques and would like to try out a decision tree and a random forest to see how well they perform against the logistic regression models you have been training.

In this activity, you will train a logistic regression model and compute its classification report on a validation set. You will then train a decision tree classifier, compute its classification report, and compare the two models. Finally, you will train a random forest classifier, generate its classification report, and compare it with the logistic regression model to determine which model you should put into production.

In [72]:
import pandas as pd 
import numpy as np 

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [73]:
df = pd.read_csv('https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter06/Dataset/bank-additional-full.csv', sep=';')
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [74]:
df.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
age,56,57,37,40,56,45,59,41,24,25
job,housemaid,services,services,admin.,services,services,admin.,blue-collar,technician,services
marital,married,married,married,married,married,married,married,married,single,single
education,basic.4y,high.school,high.school,basic.6y,high.school,basic.9y,professional.course,unknown,professional.course,high.school
default,no,unknown,no,no,no,unknown,no,unknown,no,no
housing,no,no,yes,no,no,no,no,no,yes,yes
loan,no,no,no,no,yes,no,no,no,no,no
contact,telephone,telephone,telephone,telephone,telephone,telephone,telephone,telephone,telephone,telephone
month,may,may,may,may,may,may,may,may,may,may
day_of_week,mon,mon,mon,mon,mon,mon,mon,mon,mon,mon


In [75]:
df.campaign.value_counts()

1     17642
2     10570
3      5341
4      2651
5      1599
6       979
7       629
8       400
9       283
10      225
11      177
12      125
13       92
14       69
17       58
15       51
16       51
18       33
20       30
19       26
21       24
22       17
23       16
24       15
27       11
29       10
25        8
26        8
28        8
30        7
31        7
35        5
33        4
32        4
34        3
40        2
42        2
43        2
37        1
39        1
41        1
56        1
Name: campaign, dtype: int64

In [76]:
df.pdays.value_counts()

999    39673
3        439
6        412
4        118
9         64
2         61
7         60
12        58
10        52
5         46
13        36
11        28
1         26
15        24
14        20
8         18
0         15
16        11
17         8
18         7
19         3
22         3
21         2
26         1
20         1
25         1
27         1
Name: pdays, dtype: int64

In [77]:
df.previous.value_counts()

0    35563
1     4561
2      754
3      216
4       70
5       18
6        5
7        1
Name: previous, dtype: int64

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [79]:
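# Isolate the categorical columns by dropping the numeric features and the target
cat_df = df.drop(['age', 'duration', 'campaign', 'previous', 'pdays', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'], axis=1)

# One-hot encode the categorical columns; numeric columns and 'y' are left unchanged
_df = pd.get_dummies(df, columns=cat_df.columns)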
_df.columns

Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'mon

In [80]:
# Features are every column except the target; 'y' holds the yes/no label
X = _df.drop('y', axis=1)
y = _df['y']

In [81]:
# Hold out 30% of the rows for evaluation; the remaining 70% is used for training
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, random_state=0)

In [82]:
# Split the held-out rows evenly into a validation set and a test set
X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval, test_size=0.5, random_state=0)
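
As a quick sanity check on the two splits above, the following sketch works out the expected partition sizes for the 41,188 rows reported by `df.info()`. It assumes that `train_test_split` rounds a fractional test size up to the next whole row and gives the remaining rows to the other partition. The helper `split_sizes` is hypothetical and written only for this illustration.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_first, n_second) for a fractional split.

    Assumption: the second partition is rounded up to a whole row and
    the first partition receives whatever rows remain.
    """
    n_second = math.ceil(n_rows * test_fraction)
    return n_rows - n_second, n_second

n_train, n_eval = split_sizes(41188, 0.3)  # first split: train vs. held-out rows
n_val, n_test = split_sizes(n_eval, 0.5)   # second split: validation vs. test

print(n_train, n_val, n_test)
```

Under that assumption the validation set has 6,178 rows, which agrees with the `support` total printed in the classification reports later in this notebook.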

In [83]:
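from sklearn.linear_model import LogisticRegression

# Fit a logistic regression with default settings on the training set
# (on unscaled features the solver may emit a ConvergenceWarning)
model1 = LogisticRegression()
model1.fit(X_train, y_train)

# Predict on the validation set and summarize per-class precision, recall, and F1
y_pred1 = model1.predict(X_val)

report1 = classification_report(y_val, y_pred1)
print(report1)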

              precision    recall  f1-score   support

          no       0.93      0.98      0.95      5465
         yes       0.69      0.41      0.51       713

    accuracy                           0.91      6178
   macro avg       0.81      0.69      0.73      6178
weighted avg       0.90      0.91      0.90      6178
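
The `f1-score` column is the harmonic mean of precision and recall. As a minimal sketch, the rounded `yes`-class values from the report above reproduce the reported score. `f1_from` is a hypothetical helper, and because its inputs are already rounded to two decimals the result is only approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded 'yes'-class values copied from the logistic regression report
f1_yes = f1_from(0.69, 0.41)
print(round(f1_yes, 2))
```

The low recall (0.41) pulls the F1 score well below the precision, which is the behavior a harmonic mean is meant to capture.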



In [84]:
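from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree limited to depth 6 to curb overfitting
# (no random_state is set, so results can vary slightly between runs)
model2 = DecisionTreeClassifier(max_depth=6)
model2.fit(X_train, y_train)

# Evaluate on the same validation set used for the logistic regression
y_pred2 = model2.predict(X_val)

report2 = classification_report(y_val, y_pred2)
print(report2)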

              precision    recall  f1-score   support

          no       0.94      0.97      0.95      5465
         yes       0.68      0.54      0.60       713

    accuracy                           0.92      6178
   macro avg       0.81      0.75      0.78      6178
weighted avg       0.91      0.92      0.91      6178



In [85]:
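from sklearn.ensemble import RandomForestClassifier

# Fit a random forest of 1,000 trees
# (no random_state is set, so results can vary slightly between runs)
model3 = RandomForestClassifier(n_estimators=1000)
model3.fit(X_train, y_train)

# Evaluate on the same validation set as the other two models
y_pred3 = model3.predict(X_val)

report3 = classification_report(y_val, y_pred3)
print(report3)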

              precision    recall  f1-score   support

          no       0.93      0.97      0.95      5465
         yes       0.68      0.45      0.54       713

    accuracy                           0.91      6178
   macro avg       0.80      0.71      0.75      6178
weighted avg       0.90      0.91      0.90      6178
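
One way to compare the three reports is to focus on the minority `yes` class, since those are the customers the bank wants to find. The sketch below ranks the models by their `yes`-class F1 score using the values printed above. Choosing F1 as the selection criterion is an assumption made for illustration, and the dictionary keys are labels chosen here rather than names from the notebook.

```python
# 'yes'-class metrics copied from the three validation reports
yes_metrics = {
    "logistic_regression": {"precision": 0.69, "recall": 0.41, "f1": 0.51},
    "decision_tree":       {"precision": 0.68, "recall": 0.54, "f1": 0.60},
    "random_forest":       {"precision": 0.68, "recall": 0.45, "f1": 0.54},
}

# Rank models from highest to lowest F1 on the 'yes' class
ranking = sorted(yes_metrics, key=lambda name: yes_metrics[name]["f1"], reverse=True)
best = ranking[0]

for name in ranking:
    m = yes_metrics[name]
    print(f"{name:<20} precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} f1={m['f1']:.2f}")
print("best by yes-class F1:", best)
```

On these numbers the decision tree ranks first, mainly because of its higher recall (0.54 versus 0.41 and 0.45). A different criterion, such as precision alone, would change the ordering.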



In [87]:
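# Accuracy of each model on the validation set
print(model1.score(X_val, y_val))  # logistic regression
print(model2.score(X_val, y_val))  # decision tree
print(model3.score(X_val, y_val))  # random forest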

0.9108125606992554
0.9180964713499514
0.9117837487860149
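
The three accuracy scores are close together, and accuracy alone can flatter a model when the classes are imbalanced. As a rough sketch, the class supports from the reports above (5,465 `no` and 713 `yes`) give the accuracy of a hypothetical baseline that always predicts `no`. The variable names are illustrative.

```python
# Validation-set class counts, taken from the 'support' column of the reports
support = {"no": 5465, "yes": 713}

# Accuracy of a baseline that always predicts the majority class
baseline_accuracy = max(support.values()) / sum(support.values())

# Validation accuracies printed by model.score(...) above
scores = {
    "logistic_regression": 0.9108125606992554,
    "decision_tree": 0.9180964713499514,
    "random_forest": 0.9117837487860149,
}

print(f"baseline: {baseline_accuracy:.4f}")
for name, acc in scores.items():
    print(f"{name}: {acc:.4f} (+{acc - baseline_accuracy:.4f} over baseline)")
```

Each model improves on this baseline by only about three percentage points, which is why the per-class precision and recall in the classification reports are more informative here than accuracy.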
