## Dataset Title: Bank Marketing

**Relevant Information:**

The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
<br>Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)
<br>Number of Attributes: 16 + output attribute.

**Input variables:**

**bank client data:**

1 - age (numeric) <br>2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services") <br>3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed) <br>4 - education (categorical: "unknown","secondary","primary","tertiary")<br> 5 - default: has credit in default? (binary: "yes","no")<br> 6 - balance: average yearly balance, in euros (numeric)<br> 7 - housing: has housing loan? (binary: "yes","no")<br> 8 - loan: has personal loan? (binary: "yes","no")

**related to the last contact of the current campaign:**

9 - contact: contact communication type (categorical: "unknown","telephone","cellular") <br>10 - day: last contact day of the month (numeric)<br> 11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")<br> 12 - duration: last contact duration, in seconds (numeric)

**other attributes:**

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<br> 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)<br> 15 - previous: number of contacts performed before this campaign and for this client (numeric) <br>16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

**Output variable (desired target):** <br>17 - y - has the client subscribed to a term deposit? (binary: "yes","no")
<br>Missing Attribute Values: None

In [37]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [73]:
# Bank marketing data (semicolon-separated)
df = pd.read_csv('bank/bank-full.csv',sep=';')
df.head()

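| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |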


In [77]:
print(df.shape)

(45211, 17)


In [83]:
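X = df.drop(columns='y')  # features: every column except the target
X = pd.get_dummies(X)     # one-hot encode the categorical columns
Y = df['y']               # target: "yes" / "no"
print(X.shape)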

(45211, 51)
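
To see why the column count grows from 16 to 51, it can help to run `pd.get_dummies` on a tiny frame. The sketch below uses a made-up three-row frame (not rows from the dataset) with one numeric and two categorical columns. Numeric columns pass through unchanged, and each categorical column is expanded into one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical toy frame, only loosely modeled on the bank data columns.
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "yes", "no"],
})

encoded = pd.get_dummies(toy)

# Numeric columns are kept as-is; each categorical column becomes one
# indicator column per distinct value, named "<column>_<value>".
print(list(encoded.columns))
print(encoded.shape)
```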

## Decision Tree

In [91]:
from sklearn import tree 
from sklearn.model_selection import cross_val_score

decision_tree1 = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=3,
    random_state=1
)
decision_tree1.fit(X,Y)

score=cross_val_score(decision_tree1,X,Y,cv=10)
print(score)
print('score mean :',score.mean(), 'score std:',score.std())

[0.88301636 0.88301636 0.88299049 0.88299049 0.88299049 0.88299049
 0.88299049 0.88299049 0.88299049 0.87610619]
score mean : 0.8823072345380625 score std: 0.0020670384762338794


I got the best cross-validation score with the combination of `max_features=1` and `max_depth=3`.
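
A parameter combination like this could also be found with a systematic search rather than by hand. The following is a minimal sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification` as a stand-in for `X` and `Y`, and the grid values are illustrative assumptions, not the values actually tried in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=1
)

# Illustrative grid; the real search range is not stated in the notebook.
param_grid = {
    "max_depth": [1, 2, 3, 5],
    "max_features": [1, 3, 5],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `Y`, and `cv=10` would match the cross-validation used above.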

## Random Forest 

In [97]:
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier(
        max_depth=1,
        max_features=3
)

score_rfc = cross_val_score(rfc,X,Y,cv=10)
print(score_rfc)
print('score mean :',score_rfc.mean(), 'score std:',score_rfc.std())

[0.88301636 0.88301636 0.88299049 0.88299049 0.88299049 0.88299049
 0.88299049 0.88299049 0.88299049 0.88318584]
score mean : 0.8830151991398324 score std: 5.778880215313187e-05


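I got almost the same mean score (0.8830 vs. 0.8823) but a much smaller standard deviation (about 5.8e-05 vs. 2.1e-03) when using the Random Forest model, here with `max_depth=1` and `max_features=3`.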
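
Since both models land near 0.883 on almost every fold, it may be worth comparing that figure with the accuracy of always predicting the most frequent class. The sketch below shows the calculation on a small made-up label series; with the real data, `Y` would take the place of `labels`.

```python
import pandas as pd

# Made-up labels for illustration; replace with Y from the notebook.
labels = pd.Series(["no"] * 8 + ["yes"] * 2)

# Share of each class; the largest share is the accuracy of a model
# that always predicts the majority class.
class_share = labels.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print("majority-class baseline:", baseline_accuracy)
```

If the cross-validated accuracy is close to this baseline, the model may be predicting the majority class for nearly every client, in which case accuracy alone says little about how well subscribers are identified.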

## Measuring run time

In [99]:
import time

In [100]:
start = time.time()

decision_tree1 = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=3,
    random_state=1)
decision_tree1.fit(X,Y)

cross_val_score(decision_tree1,X,Y,cv=10)

end = time.time()
print('run time of decision tree is ', end-start)

run time of decision tree is  0.8770930767059326


In [101]:
start=time.time()

rfc=ensemble.RandomForestClassifier(
    max_features=1,
    max_depth=3)

cross_val_score(rfc,X,Y,cv=10)

end=time.time()
print('run time of random forest is ', end-start)

run time of random forest is  1.4551470279693604


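- The Random Forest model gives more stable results than the Decision Tree: the mean cross-validation score is nearly identical, but its standard deviation is much smaller.
- However, the Random Forest run took about 1.7 times as long as the Decision Tree run (1.46 s vs. 0.88 s), even though the Decision Tree cell also includes an extra `fit` call.
- Overall, Random Forest performs more consistently here, at the cost of a higher computational load.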
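
The timings above come from a single run each, so they can be noisy. A small helper along the following lines could time any callable several times and report the best run. `time_call` is a hypothetical name introduced here, and the workload in the usage line is a placeholder for a call such as `cross_val_score(rfc, X, Y, cv=10)`.

```python
import time

def time_call(func, repeats=3):
    """Run func() `repeats` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Placeholder workload standing in for a cross-validation call.
elapsed, total = time_call(lambda: sum(range(100_000)))
print("best run time:", elapsed)
```

Timing only the `cross_val_score` call for both models, without the separate `fit`, would make the two measurements directly comparable.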