Data Description:

The data comes from the direct marketing efforts of a European banking institution. The marketing campaign involves phoning a customer, often several times, to secure a subscription to a product, in this case a term deposit. Term deposits are usually short-term deposits with maturities ranging from one month to a few years. When buying a term deposit, the customer must understand that they can withdraw their funds only after the term ends. All information that might identify a customer has been removed due to privacy concerns.

Attributes:

age : age of customer (numeric)

job : type of job (categorical)

marital : marital status (categorical)

education : education level (categorical)

default: has credit in default? (binary)

balance: average yearly balance, in euros (numeric)

housing: has a housing loan? (binary)

loan: has personal loan? (binary)

contact: contact communication type (categorical)

day: last contact day of the month (numeric)

month: last contact month of year (categorical)

duration: last contact duration, in seconds (numeric)

campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

Output (desired target):

y - has the client subscribed to a term deposit? (binary)


In [1]:
#first, let's look at the data and try to understand it.
import pandas as pd
data = pd.read_csv('term-deposit-marketing-2020.csv')
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,no


In [2]:
#check how many samples we have, to understand how much data we will be dealing with
number_of_samples = len(data)
number_of_samples

40000

In [3]:
#the output variable is y and it is binary: 'yes' or 'no'.
#let's have a look at how many 'yes' and 'no' answers there are in the given data.
data.groupby('y').size()

y
no     37104
yes     2896
dtype: int64
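
The class counts above already fix a reference point for every accuracy figure that follows. As a minimal sketch (the counts are copied from the output above; the variable names are only illustrative), this computes the accuracy of a trivial model that always answers with the majority class:

```python
# Class counts copied from the groupby output above.
counts = {"no": 37104, "yes": 2896}
total = sum(counts.values())

# Share of each class in the full data set.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(majority_label, round(baseline_accuracy, 4))  # no 0.9276
```

Any classifier evaluated on this data is worth comparing against that figure, although the exact share in a particular test split may differ slightly.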

In [4]:
#the distribution is heavily imbalanced: 'no' answers are predominant.
#now, we have information about the data and we can start preprocessing of the data.

#for numerical features
numeric = ['age', 'balance', 'day', 'duration', 'campaign']
#let's have a look at whether we have null data for numerical features.
data[numeric].isnull().sum()

age         0
balance     0
day         0
duration    0
campaign    0
dtype: int64

In [5]:
#the same process for categorical features:

categoric = ['job', 'marital', 'education', 'default', 'housing', 
           'loan', 'contact', 'month']
#let's have a look at whether we have null data for categorical features.
data[categoric].isnull().sum()

job          0
marital      0
education    0
default      0
housing      0
loan         0
contact      0
month        0
dtype: int64

In [6]:
#okay, that's good. we don't have to deal with null data.
#now, we can handle the categorical data by using one-hot encoding.
new_categoric = pd.get_dummies(data[categoric])
new_categoric.head()

Unnamed: 0,job_admin,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
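
To make the encoding step concrete, here is a small sketch on a toy frame. The rows are made up for illustration; only the column names mirror the data set. It also shows `drop_first=True`, an optional variant not used in this notebook, which drops one redundant indicator per feature:

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the data set.
toy = pd.DataFrame({
    "marital": ["married", "single", "married"],
    "housing": ["yes", "no", "yes"],
})

# Each distinct value becomes its own indicator column named <column>_<value>.
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))

# Optional variant: drop the first category of each feature to avoid redundancy.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(list(reduced.columns))
```

For binary features such as `housing`, the two indicator columns carry the same information, which is why the reduced form keeps only one of them.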


In [7]:
#now, prepare our new data set
data['output'] = (data.y == 'yes').astype('int')
data = pd.concat([data, new_categoric], axis = 1)
all_new_categoric = list(new_categoric.columns)
input_ = numeric + all_new_categoric
data_new = data[input_ + ['output']]


In [8]:
data_new.head()

Unnamed: 0,age,balance,day,duration,campaign,job_admin,job_blue-collar,job_entrepreneur,job_housemaid,job_management,...,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,output
0,58,2143,5,261,1,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
1,44,29,5,151,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,33,2,5,76,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,47,1506,5,92,1,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,33,1,5,198,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [9]:
#separate the input features from the target column
output_data = data_new['output']
features_data = data_new.drop(columns='output')

#prepare training and test splits.
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(features_data, 
                                                    output_data, 
                                                    test_size=0.33, 
                                                    random_state=42)


In [11]:
from sklearn.preprocessing import StandardScaler
import numpy as np
scaler = StandardScaler()
scaler.fit(train_X)

StandardScaler()

In [13]:
train_X_tf = scaler.transform(train_X)

In [14]:
#for the test data, reuse the scaler that was fitted on the training data.
#refitting on the test split would scale the two splits with different statistics.
test_X_tf = scaler.transform(test_X)

In [61]:
#for all data
scaler.fit(features_data)
features_data_tf = scaler.transform(features_data)
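
One way to keep scaling and model fitting consistent is to wrap both in a scikit-learn pipeline, so the scaler is always fitted on the training split only. The sketch below uses synthetic stand-in data rather than the bank data, and the choice of logistic regression is just an example; in this notebook, `features_data` and `output_data` would take the place of `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * [1.0, 50.0, 1000.0, 5.0]
y = (X[:, 0] + X[:, 1] / 50.0 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training split only and
# reuses those statistics whenever it transforms new data.
model = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
model.fit(X_train, y_train)

scaler_step = model.named_steps["standardscaler"]
print(np.allclose(scaler_step.mean_, X_train.mean(axis=0)))
print(model.score(X_test, y_test))
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refitted inside each fold.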

### Models for prediction and their accuracies

In [54]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

**Logistic Regression model for prediction**

In [20]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(train_X, train_y)

LogisticRegression(solver='liblinear')

In [21]:
predictions_log_reg = log_reg.predict(test_X)
accuracy_score(test_y, predictions_log_reg)

0.9345454545454546

**K-Nearest Neighbors model for prediction**

In [16]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_X, train_y)

KNeighborsClassifier(n_neighbors=3)

In [18]:
predictions_knn = knn.predict(test_X)
accuracy_score(test_y, predictions_knn)

0.9215909090909091

**SVM model for prediction** 

In [25]:
from sklearn.svm import SVC
svm = SVC(gamma='auto')
svm.fit(train_X, train_y)

SVC(gamma='auto')

In [27]:
predictions_svm = svm.predict(test_X)
accuracy_score(test_y, predictions_svm)

0.9268939393939394

**K-means Clustering model for prediction** 

In [30]:
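from sklearn.cluster import KMeans
#note: KMeans is unsupervised, so train_y is ignored by fit().
#the cluster ids (0, 1, 2) are arbitrary and are not the same thing as the 0/1 labels,
#so the accuracy below is not directly comparable to the classifiers.
kmeans = KMeans(n_clusters = 3, random_state = 0)
kmeans.fit(train_X, train_y)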

KMeans(n_clusters=3, random_state=0)

In [31]:
predictions_kmeans = kmeans.predict(test_X)
accuracy_score(test_y, predictions_kmeans)

0.8418939393939394

**Decision Tree model for prediction** 

In [34]:
from sklearn import tree
dec_tree = tree.DecisionTreeClassifier()
dec_tree.fit(train_X, train_y)

DecisionTreeClassifier()

In [35]:
predictions_dec_tree = dec_tree.predict(test_X)
accuracy_score(test_y, predictions_dec_tree)

0.9181060606060606

**Random Forest model for prediction** 

In [36]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth = 2, random_state = 0)
rfc.fit(train_X, train_y)

RandomForestClassifier(max_depth=2, random_state=0)

In [37]:
predictions_rfc = rfc.predict(test_X)
accuracy_score(test_y, predictions_rfc)

0.926969696969697

**Naive Bayes model for prediction** 

In [38]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(train_X, train_y)

GaussianNB()

In [39]:
predictions_nb = nb.predict(test_X)
accuracy_score(test_y, predictions_nb)

0.8980303030303031

**Stochastic Gradient Descent model for prediction** 

In [46]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(loss = "hinge", penalty = "l2", max_iter = 10000)
sgd.fit(train_X, train_y)

SGDClassifier(max_iter=10000)

In [47]:
predictions_sgd = sgd.predict(test_X)
accuracy_score(test_y, predictions_sgd)

0.9243939393939394

**Multi-layer Perceptron model for prediction**

In [71]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
mlp.fit(train_X, train_y)

MLPClassifier(alpha=1e-05, hidden_layer_sizes=(5, 2), random_state=1)

In [72]:
predictions_mlp = mlp.predict(test_X)
accuracy_score(test_y, predictions_mlp)

0.9326515151515151

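**When we use train and test splits and then calculate accuracy**, the best model to predict y (whether the client has subscribed to a term deposit) is logistic regression, with about 93.5% accuracy, and every model we tried reaches at least 84% accuracy. Since roughly 92.8% of the answers in the data are 'no', these figures should be read against that imbalance.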
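
As a sketch of why accuracy alone can mislead on imbalanced labels, the example below scores a dummy model that always predicts the most frequent class. The labels are synthetic, with roughly the same share of positives as this data set, and the placeholder feature matrix is only there because the estimator API requires one:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with about 7% positives, similar to the imbalance seen here.
rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.07).astype(int)
X_placeholder = np.zeros((len(y_true), 1))  # ignored by the dummy model

# A model that always predicts the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_pred = dummy.predict(X_placeholder)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy)  # high, even though nothing was learned
print(recall)    # 0.0: no subscriber is ever identified
```

Reporting recall or precision for the 'yes' class next to accuracy would show how many actual subscribers each model finds.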


**Bonus**

In [87]:
#to understand feature importance, we can use coefficients in logistic regression
feature_importances = pd.DataFrame(log_reg.coef_[0], index = input_,
                                  columns = ['importance']).sort_values('importance',
                                                                       ascending = False)

In [88]:
feature_importances


Unnamed: 0,importance
duration,1.264669
month_jun,0.4412
contact_cellular,0.433282
day,0.281604
month_mar,0.222626
month_may,0.204215
month_oct,0.185222
month_apr,0.167052
housing_no,0.162364
month_feb,0.144046


Therefore, duration has the largest positive coefficient in the logistic regression model, which makes it the most important feature here and the one we should focus on.
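
Sorting by the signed coefficient only surfaces features that push towards 'yes'; a strongly negative coefficient would end up at the bottom of the table. A possible refinement, sketched here on synthetic data with hypothetical feature names, is to standardize the inputs so the coefficients are comparable and then rank them by absolute value:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names; "f_neg" has a strong negative effect.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f_pos", "f_neg", "f_noise"])
X["f_neg"] *= 100.0  # put one feature on a very different scale
signal = 1.0 * X["f_pos"] - 2.0 * (X["f_neg"] / 100.0)
y = (signal + rng.normal(scale=0.5, size=len(X)) > 0).astype(int)

# Standardize so that coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(solver="liblinear").fit(X_scaled, y)

# Rank by magnitude while keeping the sign visible.
coef = pd.Series(clf.coef_[0], index=X.columns)
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

In this notebook the equivalent would be to take `log_reg.coef_[0]` from a model fitted on the scaled training data and sort it the same way.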