<a href="https://colab.research.google.com/github/ankurrokad/Artificial-Neural-Network/blob/Ankur/ANN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Problem





**Our goal is to predict whether a customer will subscribe to a bank term deposit when a customer service representative calls them.**

**Dataset : https://archive.ics.uci.edu/ml/datasets/Bank+Marketing**

### Team

* [Ankur Rokad](https://github.com/ankurrokad)
* [Sahista Patel](https://github.com/Sahista-Patel)
* [Murali Krishna](https://github.com/muralikrishnarar)
* [Gursanjam Kaur](https://github.com/sv2021)




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
import seaborn as sos
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from time import time

In [None]:
#ANN Model
from keras.models import Sequential
from keras.layers import Dense, Dropout

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/ankurrokad/Artificial-Neural-Network/main/bank-additional-full.csv', sep=';')
df.columns = ['age', 'job', 'marital', 'education', 'credit', 'housing', 'loan','contact', 'month', 'day_of_week',
              'duration', 'campaign', 'pdays','previous', 'poutcome', 'emp.var.rate', 'cons.price.idx','cons.conf.idx',
              'euribor3m', 'nr.employed', 'subscribed']
df.head()


Unnamed: 0,age,job,marital,education,credit,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribed
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# Dataset Description

### User Details:

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default (renamed `credit` in this notebook): has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

### Related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
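
The note on `duration` suggests leaving it out when the aim is a realistic pre-call model. A minimal sketch of that step on a tiny made-up frame (`toy` and `realistic` are illustrative names, and the values are not rows from the dataset):

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for the example.
toy = pd.DataFrame({
    "age": [56, 37],
    "duration": [261, 226],
    "campaign": [1, 1],
})

# Drop the column that is only known after the call has ended.
realistic = toy.drop(columns=["duration"])
print(list(realistic.columns))
```

This notebook keeps `duration`, so its scores are best read as a benchmark rather than as a deployable pre-call prediction.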

### Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

### Social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

### Output variable (desired target):
21 - y (renamed `subscribed` in this notebook): has the client subscribed to a term deposit? (binary: 'yes','no')

In [None]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
credit            0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
subscribed        0
dtype: int64

There are no null values, so no imputation is needed.
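
Note that `isnull()` only counts real NaN values. Several categorical columns in this dataset use the string 'unknown' instead (see the dataset description), and those are not counted as missing. A small sketch of how such values could be counted, on a made-up frame (`toy` and `unknown_counts` are illustrative names):

```python
import pandas as pd

# Made-up rows that mimic the 'unknown' placeholder used in the dataset.
toy = pd.DataFrame({
    "job": ["housemaid", "unknown", "services"],
    "credit": ["no", "unknown", "unknown"],
    "age": [56, 57, 37],
})

# NaN-based check: reports zero missing values in every column.
print(toy.isnull().sum())

# Placeholder-based check: count cells equal to the string 'unknown'.
unknown_counts = toy.eq("unknown").sum()
print(unknown_counts)
```

In this notebook 'unknown' is simply treated as one more category, which is one reasonable option among several.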

# Pre processing

## Label Encoding

{'job': {'housemaid': 1, 'unemployed': 0, 'entrepreneur': 4, 'blue-collar': 1, 'services': 3, 'admin.': 2, 'technician': 2, 'retired': 1, 'management': 4, 'self-employed': 3, 'unknown': 1, 'student': 0.5}}
{'education': {'basic.4y': 1, 'basic.6y': 1, 'basic.9y': 1, 'high.school': 1, 'professional.course': 2, 'university.degree': 2, 'illiterate': 0.9, 'unknown': 0.9}}
{'poutcome': {'nonexistent': 0, 'failure': 0, 'success': 1}}
{'y': {'no': 0, 'yes': 1}}

In [None]:
df.select_dtypes('object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   job          41188 non-null  object
 1   marital      41188 non-null  object
 2   education    41188 non-null  object
 3   credit       41188 non-null  object
 4   housing      41188 non-null  object
 5   loan         41188 non-null  object
 6   contact      41188 non-null  object
 7   month        41188 non-null  object
 8   day_of_week  41188 non-null  object
 9   poutcome     41188 non-null  object
 10  subscribed   41188 non-null  object
dtypes: object(11)
memory usage: 3.5+ MB



In [None]:
dict_job = {
    "job":{
        "housemaid":1,
        "unemployed":0,
        "entrepreneur":4,
        "blue-collar":1,
        "services":3,
        "admin.":2,
        "technician":2,
        "retired":1,
        "management":4,
        "self-employed":3,
        "unknown":1,
        "student":0.5
    }}
dict_education = {
    "education":{
        "basic.4y":1,
        "basic.6y":1,
        "basic.9y":1,
        "high.school":1,
        "professional.course":2,
        "university.degree":2,
        "illiterate":0.9,
        "unknown":0.9
    }}
dict_poutcome = {
    "poutcome":{
        "nonexistent":0,
        "failure":0,
        "success":1
    }}
dict_y = {
    "subscribed":{
        "no":0,
        "yes":1
    }}

In [None]:
for i in [dict_job,dict_education,dict_poutcome,dict_y]:
    df.replace(i,inplace=True)
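
A nested dictionary passed to `DataFrame.replace` is read as `{column: {old_value: new_value}}`, so each mapping only touches its own column. A minimal sketch on a made-up frame (`toy`, `mapping` and `encoded` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "poutcome": ["nonexistent", "failure", "success"],
    "job": ["student", "admin.", "retired"],
})

# Only the 'poutcome' column is affected by this mapping.
mapping = {"poutcome": {"nonexistent": 0, "failure": 0, "success": 1}}
encoded = toy.replace(mapping)

print(encoded)
```

Values that are missing from a mapping are left unchanged rather than raising an error, so it can be worth checking the column afterwards for leftover strings.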

In [None]:
df.head()

Unnamed: 0,age,job,marital,education,credit,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribed
0,56,1.0,married,1.0,no,no,no,telephone,may,mon,261,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,3.0,married,1.0,unknown,no,no,telephone,may,mon,149,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,3.0,married,1.0,no,yes,no,telephone,may,mon,226,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
3,40,2.0,married,1.0,no,no,no,telephone,may,mon,151,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
4,56,3.0,married,1.0,no,no,yes,telephone,may,mon,307,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0


In [None]:
# Encode the remaining non-ordinal categorical variables
lc_X1 = LabelEncoder()
lst = ['marital','credit','housing','loan','contact','month','day_of_week']
for i in lst:
    df[i] = lc_X1.fit_transform(df[i])

In [None]:
df.head()
# df.groupby(df['marital']).mean()

Unnamed: 0,age,job,marital,education,credit,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribed
0,56,1.0,1,1.0,0,0,0,1,6,1,261,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,3.0,1,1.0,1,0,0,1,6,1,149,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,3.0,1,1.0,0,2,0,1,6,1,226,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
3,40,2.0,1,1.0,0,0,0,1,6,1,151,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
4,56,3.0,1,1.0,0,0,2,1,6,1,307,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0


In [None]:
#one hot encoding
df_1 = pd.get_dummies(df,
                      columns=['marital','credit','housing','loan','contact','month','day_of_week'],
                      drop_first=True)
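
With `drop_first=True`, a feature with k levels becomes k-1 indicator columns, because the dropped level is implied when all the others are 0. A small sketch on made-up values (`toy` and `encoded` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular"],
    "day_of_week": ["mon", "tue", "wed"],
})

# The first level of each feature (in sorted order) is dropped.
encoded = pd.get_dummies(toy, columns=["contact", "day_of_week"], drop_first=True)
print(encoded.columns.tolist())
```

`get_dummies` also accepts string columns directly, as here, so the integer codes produced by a prior label-encoding pass only affect the names of the resulting columns.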

In [None]:
print(df_1.shape)
df_1.head()

(41188, 37)


Unnamed: 0,age,job,education,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribed,marital_1,marital_2,marital_3,credit_1,credit_2,housing_1,housing_2,loan_1,loan_2,contact_1,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,day_of_week_1,day_of_week_2,day_of_week_3,day_of_week_4
0,56,1.0,1.0,261,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0
1,57,3.0,1.0,149,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0
2,37,3.0,1.0,226,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0
3,40,2.0,1.0,151,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0
4,56,3.0,1.0,307,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0


## Splitting the data and Scaling

In [None]:
# Separate the features (X) from the target (y)
X = df_1.drop(columns='subscribed').values
y = df_1['subscribed'].values
# Split into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=101)

In [None]:
sc = StandardScaler()
# Fit the scaler on the training data only, then reuse its statistics on the test data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
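
A scaler is normally fitted on the training split only, and the mean and standard deviation it learned are then reused on the test split, so that no information from the test data leaks into preprocessing. A minimal sketch on synthetic numbers (the array names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)
print(test_scaled.ravel())
```

The test value 2.0 maps to 0 because it equals the training mean, which would not be the case if the scaler were refitted on the test array.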

# Printing Results

In [None]:
# Helper that prints the confusion-matrix counts plus accuracy, precision, recall and F1 score.
# Parameters: model = display name (string), labels = true y values, pred = model.predict(X)
def print_scores(model, labels, pred):
    accuracy = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall = round(recall_score(labels, pred), 3)
    f1 = round(f1_score(labels, pred), 3)

    # Confusion matrix: rows are true labels, columns are predictions,
    # so for 0/1 labels the flattened order is TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(labels, pred).ravel()

    print(f"TP: {tp} ")
    print(f"TN: {tn} ")
    print(f"FP: {fp} ")
    print(f"FN: {fn} ")

    print(f"{model}:: Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, f1_score: {f1}")
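
For reference, scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, so with binary 0/1 labels the flattened matrix reads TN, FP, FN, TP. A small check on hand-made labels (`y_true` and `y_hat` are made up for the example):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

# Flatten the 2x2 matrix row by row: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)
```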

In [None]:
# print the cross-validation results
def print_cv_result(results):
    print(f"Best Params : {results.best_params_}\n")
    
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print(f'{round(mean, 3)} (+/-{round(std * 2, 3)}) for {params}')
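
`print_cv_result` expects a fitted `GridSearchCV` object. The sketch below uses synthetic data from `make_classification` (not the bank dataset) and a smaller grid to show the attributes it reads. With the default `refit=True`, `best_estimator_` has already been refitted on the full training data, so it can predict directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1]}, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # one entry per parameter combination

# The refitted best model is ready to use without another call to fit().
demo_pred = search.best_estimator_.predict(X_demo)
```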

# SVM

In [None]:
# Standard

# start = time()
# svc = SVC(kernel = "rbf", C=1)
# svc.fit(X_train,y_train)
# y_pred_svm = svc.predict(X_test)
# end = time()
# print(f"SVM Time:{round((end - start), 5) * 1000}")
# print_scores("SVM", y_test, y_pred_svm)

In [None]:
# K-Fold Cross Validation

# start = time()
# svc = SVC()
# parameters = {
#     'kernel' : ['linear', 'rbf'],
#     'C' : [ 0.1, 1, 10]
# }

# cv = GridSearchCV(svc, parameters, cv=5)

# cv.fit(X_train, y_train)

# print_cv_result(cv)

# best_params = cv.best_params_
# print(best_params)
# svc = SVC(kernel=best_params['kernel'], C=best_params['C'])
# # A newly created SVC must be fitted before it can predict
# svc.fit(X_train, y_train)

# y_pred_svm = svc.predict(X_test)
# print_scores("SVM", y_test, y_pred_svm)

# end = time()
# print(f"SVM Time:{round((end - start), 5) * 1000}")


Best Params : {'C': 1, 'kernel': 'rbf'}

0.906 (+/-0.005) for {'C': 0.1, 'kernel': 'linear'}
0.903 (+/-0.005) for {'C': 0.1, 'kernel': 'rbf'}
0.906 (+/-0.006) for {'C': 1, 'kernel': 'linear'}
0.91 (+/-0.006) for {'C': 1, 'kernel': 'rbf'}
0.906 (+/-0.006) for {'C': 10, 'kernel': 'linear'}
0.906 (+/-0.006) for {'C': 10, 'kernel': 'rbf'}
{'C': 1, 'kernel': 'rbf'}

# Logistic Regression

In [None]:
lgr = LogisticRegression()
lgr.fit(X_train, y_train)
y_pred_lgr = lgr.predict(X_test)
print_scores("LGR", y_test, y_pred_lgr)

# ANN

In [None]:
classifier = Sequential()

##### Configuring the layers
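* units: number of nodes in the layer
* activation: activation function applied by the layer (ReLU in the hidden layers, sigmoid in the output layer)
* kernel_initializer: how the initial weights are set; `'uniform'` draws small random values close to 0
* input_dim: number of independent variables (features) in our dataset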

![img](https://miro.medium.com/max/4200/1*GTLzJ0sUmwDPb9uVffnZ6g.png)
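
A fully connected layer with n inputs and m units has n*m weights plus m biases. Assuming the 36-input, 13-13-1 layout used in this notebook, the number of trainable parameters can be worked out by hand (`dense_params` is a helper introduced here for illustration, not a Keras function):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [36, 13, 13, 1]  # input_dim, hidden layer 1, hidden layer 2, output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```

Dropout layers add no parameters, so the total should match what `classifier.summary()` reports once the model is built.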

In [None]:
classifier.add(Dense(units=13,
                     activation='relu',
                     kernel_initializer='uniform',
                     input_dim=36))
classifier.add(Dropout(rate=0.1))

In [None]:
classifier.add(Dense(units=13,
                     activation='relu',
                     kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.1))

In [None]:
classifier.add(Dense(units=1,
                     activation='sigmoid',
                     kernel_initializer='uniform'))

In [None]:
#compile ANN
classifier.compile(optimizer='adam',loss='binary_crossentropy',
                  metrics=['accuracy'])

#####  Training and testing

* batch_size: number of samples processed before each weight update
* epochs: number of full passes over the training data
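
Each epoch performs ceil(n_train / batch_size) weight updates. Assuming the 41188 rows and the 80/20 split used in this notebook, the amount of work implied by `batch_size=10` and `epochs=100` can be estimated as follows (the variable names are illustrative):

```python
import math

n_rows = 41188
n_test = math.ceil(0.2 * n_rows)  # train_test_split rounds the test share up
n_train = n_rows - n_test

batch_size, epochs = 10, 100
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, steps_per_epoch, steps_per_epoch * epochs)
```

A larger batch size would reduce the number of updates per epoch and usually the training time, at the cost of fewer weight updates per pass over the data.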

In [None]:
#fitting ANN

print(X_train.shape, y_train.shape)
start = time()
classifier.fit(x=X_train, y=y_train, batch_size=10,epochs=100)

In [None]:
y_pred_ann = classifier.predict(X_test)
end = time()
# Elapsed time (training + prediction) in milliseconds
print(f"ANN Time: {round((end - start) * 1000, 2)} ms")

# predict() returns the probability of 'yes' for each row.
# Thresholding at 0.5 gives True when 'yes' is more likely than 'no', otherwise False.
y_pred_ann = (y_pred_ann > 0.5).ravel()

print_scores("ANN", y_test, y_pred_ann)
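
The sigmoid output layer yields one probability per row, in an array of shape (n, 1). Comparing it with 0.5 and flattening gives a 1-D vector of class labels. A small NumPy sketch with made-up probabilities (`probs` and `labels` are illustrative names):

```python
import numpy as np

# Made-up probabilities with the same (n, 1) shape a single-unit sigmoid output has.
probs = np.array([[0.08], [0.51], [0.97], [0.50]])

# Strictly greater than 0.5 counts as 'yes' (1); exactly 0.5 stays 'no' (0).
labels = (probs > 0.5).astype(int).ravel()
print(labels)
```

The 0.5 cut-off is a convention; a different threshold would trade precision against recall.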

# References

* Activation Function : https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
* Keras : https://medium.com/datadriveninvestor/building-neural-network-using-keras-for-classification-3a3656c726c1
* Confusion Matrix : https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9