### <p style="color:blue;">Table of contents</p>
[1. Introduction](#1)<br>
[2. Packages loading](#2)<br>
[3. Dataset loading](#3)<br>
[4. Correlation](#4)<br>
[5. Feature engineering](#5)<br>
[6. Outlier handling](#6)<br>
[7. Dataset balancing](#7)<br>
[8. Categorical features encoding](#8)<br>
[9. Missing value imputation](#9)<br>
[10. Build, train, and test model](#10)<br>
[11. Feature importance](#11)<br>

### ***1. <a id="1"></a> Introduction*** <br>
<p style="font-family:verdana; font-size:140%;">In this notebook, I try to improve prediction accuracy on the Bank Marketing dataset, which has been used in many Kaggle notebooks. As far as I know, previous notebooks report accuracy scores between 80% and 97%. I reach a prediction accuracy of 98% through outlier handling, missing-data imputation, and dataset balancing. Below is the description of the dataset features provided by the authors:</p>
<b>Input variables:</b><br>
<b>bank client data</b><br>
* 1 - age (numeric)<br>
* 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')<br>
* 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)<br>
* 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')<br>
* 5 - default: has credit in default? (categorical: 'no','yes','unknown')<br>
* 6 - housing: has housing loan? (categorical: 'no','yes','unknown')<br>
* 7 - loan: has personal loan? (categorical: 'no','yes','unknown')<br>
<b>related with the last contact of the current campaign:</b><br>
* 8 - contact: contact communication type (categorical: 'cellular','telephone')<br>
* 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')<br>
* 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')<br>
* 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.<br>
<b>other attributes:</b><br>
* 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<br>
* 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)<br>
* 14 - previous: number of contacts performed before this campaign and for this client (numeric)<br>
* 15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')<br>
<b>social and economic context attributes </b><br>
* 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)<br>
* 17 - cons.price.idx: consumer price index - monthly indicator (numeric)<br>
* 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)<br>
* 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)<br>
* 20 - nr.employed: number of employees - quarterly indicator (numeric)<br>

<b>Output variable (desired target):</b><br>
* 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')<br>
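
Two notes in the description above have direct consequences for preprocessing: `pdays` uses 999 as a sentinel rather than a real day count, and `duration` is only known after the call. The sketch below shows one possible way to act on both, using a made-up three-row frame (`toy`, `previously_contacted`, and `realistic` are illustrative names, not part of the dataset). The rest of this notebook keeps `duration`, so this is an alternative rather than a description of what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic three of the columns described above (values are made up).
toy = pd.DataFrame({
    "duration": [120, 0, 431],
    "pdays": [999, 3, 999],
    "y": ["no", "no", "yes"],
})

# 999 means "never contacted before", not 999 days:
# keep that information as a flag and blank out the sentinel.
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)
toy["pdays"] = toy["pdays"].replace(999, np.nan)

# For a realistic model, drop duration because it is unknown before the call.
realistic = toy.drop(columns=["duration"])
print(realistic)
```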

### ***2. <a id="2"></a> Packages loading***

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing

# data visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# data encoding, imputation, model building, and testing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import random


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### ***3. <a id="3"></a> Dataset loading*** 

In [None]:
df_db=pd.read_csv("/kaggle/input/bank-marketing/bank-additional-full.csv",sep=";")
df_db.head()


In [None]:
df_db.describe()

### ***4. <a id="4"></a> Correlation*** 

In [None]:
df_db2=df_db.copy()
for i in list(df_db.columns):
    if df_db[i].dtype == 'object':
        df_db2[i]=pd.factorize(df_db[i])[0]

# Compute the correlation matrix on the factorized copy; df_db still contains string columns
corr_matrix = df_db2.corr()

plt.figure(figsize=(10, 5), dpi=200)
plt.title('Correlation between attributes')
sns.heatmap(corr_matrix, lw=1, linecolor='white', cmap='YlOrRd')
plt.xticks(rotation=57)
plt.yticks(rotation=0)
plt.show()
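
To make the encoding step in the cell above easier to follow, here is a minimal sketch on a made-up four-row frame (`toy` and `encoded` are illustrative names). `pd.factorize` assigns integer codes in order of first appearance, so for nominal features with more than two levels the resulting correlations are only a rough indication.

```python
import pandas as pd

# Made-up rows with one numeric and two string columns.
toy = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "marital": ["single", "married", "married", "single"],
    "y": ["no", "yes", "yes", "no"],
})

# Replace every non-numeric column by integer codes (order of first appearance).
encoded = toy.copy()
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        codes, uniques = pd.factorize(toy[col])
        encoded[col] = codes

# Pearson correlation is now defined for every pair of columns.
corr = encoded.corr()
print(corr.round(2))
```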

### ***5. <a id="5"></a>  Feature engineering*** 
Based on domain knowledge, I drop the features that are of little interest and merge features that carry similar information.

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Drop the features that are not used in this notebook
df_db = df_db.drop(["month", "day_of_week", "contact"], axis=1)

# Encode education on an ordinal numeric scale; 'unknown' becomes a missing value
df_db["education"] = df_db["education"].replace(
    ['basic.4y', 'high.school', 'basic.6y', 'basic.9y', 'professional.course', 'university.degree', 'illiterate', 'unknown'],
    [4, 12, 6, 9, 14, 17, 0, np.nan])

# Map both loan indicators to 0/1 and 'unknown' to missing (no chained assignment)
yes_no = {'no': 0, 'yes': 1, 'unknown': np.nan}
df_db["housing"] = df_db["housing"].map(yes_no)
df_db["loan"] = df_db["loan"].map(yes_no)

# Merge both indicators into one loan count, then drop the redundant column
# (the result of drop() must be assigned back, otherwise 'housing' is kept)
df_db["loan"] = df_db["loan"] + df_db["housing"]
df_db = df_db.drop("housing", axis=1)
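
The effect of merging `housing` into `loan` is easiest to see on a few rows. The sketch below uses a made-up frame (`toy` and `flag` are illustrative names) and shows that the merged column counts loans (0, 1, or 2) and that a single `'unknown'` makes the sum missing, which is why the imputation step later in the notebook is needed.

```python
import numpy as np
import pandas as pd

# Made-up rows covering the value combinations of interest.
toy = pd.DataFrame({
    "housing": ["no", "yes", "unknown", "yes"],
    "loan": ["no", "no", "yes", "yes"],
})

# Map 'no'/'yes' to 0/1 and 'unknown' to a missing value.
flag = {"no": 0, "yes": 1, "unknown": np.nan}
housing = toy["housing"].map(flag)
loan = toy["loan"].map(flag)

# The sum counts loans per client; NaN propagates when either part is unknown.
toy["loan"] = loan + housing
toy = toy.drop(columns="housing")
print(toy)
```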


### ***6. <a id="6"></a>  Outlier handling*** 

In [None]:
plt.figure(figsize=(14,6))
df_db.boxplot()
plt.show()

In [None]:
#Zoom on some features
fig, axs = plt.subplots(nrows=2, ncols=3,figsize=(15,10))
sns.boxplot(y=df_db["age"],ax=axs[0][0])
sns.boxplot(y=df_db["duration"],ax=axs[0][1])
sns.boxplot(y=df_db["previous"],ax=axs[0][2])
sns.boxplot(y=df_db["campaign"],ax=axs[1][0])
sns.boxplot(y=df_db["pdays"],ax=axs[1][1])
sns.boxplot(y=df_db["cons.conf.idx"],ax=axs[1][2])

In [None]:
Q1 = df_db['duration'].quantile(.25)
Q3 = df_db['duration'].quantile(.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df_db = df_db[df_db['duration'] >= lower] 
df_db = df_db[df_db['duration'] <=upper] 

In [None]:
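# For 'age' and 'previous' the fences are built from the 20th and 80th percentiles
# instead of the quartiles, so they are at least as wide as the standard IQR fences.
P20 = df_db['age'].quantile(.20)
P80 = df_db['age'].quantile(.80)
spread = P80 - P20
lower = P20 - 1.5 * spread
upper = P80 + 1.5 * spread
df_db = df_db[df_db['age'] >= lower]
df_db = df_db[df_db['age'] <= upper]

# If both percentiles coincide (for example when most clients have the same value),
# the spread is 0 and only rows equal to that value are kept.
P20 = df_db['previous'].quantile(.20)
P80 = df_db['previous'].quantile(.80)
spread = P80 - P20
lower = P20 - 1.5 * spread
upper = P80 + 1.5 * spread
df_db = df_db[df_db['previous'] >= lower]
df_db = df_db[df_db['previous'] <= upper]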
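
The three filters above repeat the same pattern, so they could be factored into a small helper. The following is a minimal sketch, not the notebook's own code: `iqr_filter` and its parameters (`low_q`, `high_q`, `k`) are hypothetical names, and the data is made up.

```python
import pandas as pd

def iqr_filter(df, column, low_q=0.25, high_q=0.75, k=1.5):
    """Keep rows whose value lies within k * spread of the two quantiles."""
    q_low = df[column].quantile(low_q)
    q_high = df[column].quantile(high_q)
    spread = q_high - q_low
    lower = q_low - k * spread
    upper = q_high + k * spread
    # between() is inclusive on both ends, matching the >= and <= filters above.
    return df[df[column].between(lower, upper)]

# Made-up call durations with one extreme value.
toy = pd.DataFrame({"duration": [100, 120, 150, 180, 200, 220, 5000]})
filtered = iqr_filter(toy, "duration")
print(filtered)
```

Passing `low_q=0.20, high_q=0.80` would reproduce the wider fences used for `age` and `previous`.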

### ***7. <a id="7"></a> Dataset balancing*** 

In [None]:
values=df_db["y"].value_counts().tolist()
values=[i * 100/sum(values) for i in values]
labels=["No","Yes"]
scale=values[0]/values[1]
explode = (0.01, 0.1)
fig1, ax1 = plt.subplots()
ax1.pie(values, explode=explode, labels=labels, autopct='%1.1f%%',shadow=False, startangle=90)
ax1.axis('equal')
plt.title('Class distribution before balancing')
plt.show()

In [None]:
df_db["y"]=df_db["y"].replace("no",0)
df_db["y"]=df_db["y"].replace("yes",1)
df_classe_majority = df_db[df_db.y==0]
df_classe_minority = df_db[df_db.y==1]
# Upsample of minority class
from sklearn.utils import resample
df_classe_minority_upsampled = resample(df_classe_minority, 
                                           replace = True,     
                                           n_samples =df_classe_majority.shape[0],   
                                           random_state = 150) 
df_db = pd.concat([df_classe_majority, df_classe_minority_upsampled])
Y = df_db["y"]
X = df_db.drop(['y'], axis=1)

In [None]:
values=df_db["y"].value_counts().tolist()
values=[i * 100/sum(values) for i in values]
labels=["No","Yes"]
scale=values[0]/values[1]
explode = (0.01, 0.1)
fig1, ax1 = plt.subplots()
ax1.pie(values, explode=explode, labels=labels, autopct='%1.1f%%',shadow=False, startangle=90)
ax1.axis('equal')
plt.title('Class distribution after upsampling')
plt.show()
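
Because upsampling draws rows with replacement, doing it before the train/test split can place copies of the same client in both sets, which can make test accuracy look higher than it would be on unseen clients. The sketch below shows an alternative ordering on made-up data: split first, then upsample only the training part. All names (`toy`, `train_balanced`, and so on) are illustrative, and this is not the procedure used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 16 negatives and 4 positives.
toy = pd.DataFrame({
    "x": range(20),
    "y": [0] * 16 + [1] * 4,
})

# Split first, keeping the class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["y"], random_state=42
)

# Upsample the minority class inside the training part only.
majority = train[train["y"] == 0]
minority = train[train["y"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=150
)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["y"].value_counts())
```

The test part keeps its original class ratio, so the reported metrics reflect the imbalance the model would face in practice.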

### ***8. <a id="8"></a> Categorical features encoding*** 

In [None]:
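# Note: 'campaign' is numeric in the source data; it is label-encoded here together with the string columns
categoricals = ['job', 'marital', 'default', 'poutcome', 'campaign']
labelencoder = LabelEncoder()
for c in categoricals:
    X[c] = labelencoder.fit_transform(X[c])
X.head()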
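
To see which integer each category receives, the fitted encoder can be inspected through its `classes_` attribute. A minimal sketch on made-up values (`toy`, `encoder`, and `mapping` are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up marital statuses.
toy = pd.Series(["married", "single", "divorced", "married"], name="marital")

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

The codes follow the sorted order of the labels, so they impose an arbitrary ordering on nominal features. Tree-based models such as the one used below can still split on them.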

### ***9. <a id="9"></a>  Missing value imputation*** 

In [None]:
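from sklearn.impute import KNNImputer
# Fill each missing value with the mean of that feature over the 9 nearest rows
imputer = KNNImputer(n_neighbors=9)
# Keep the original index so that X stays aligned with Y
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns, index=X.index)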
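
The behaviour of `KNNImputer` is easier to see on a tiny example. The sketch below uses made-up values and `n_neighbors=1` (the notebook uses 9) so that the result can be checked by hand: the row with a missing `education` copies the value of the row closest in `age`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up rows; the second one has no education value.
toy = pd.DataFrame({
    "age": [30.0, 31.0, 60.0, 61.0],
    "education": [12.0, np.nan, 4.0, 6.0],
})

# With one neighbor, the missing value is copied from the closest row.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled)
```

Distances are computed on the raw feature values, so features with large ranges dominate the choice of neighbors unless the data is scaled first.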

### ***10. <a id="10"></a> Build, train, and test model*** 

In [None]:
seed = 42
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

In [None]:
from sklearn import tree

dtc = tree.DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_dtc_pred = dtc.predict(X_test)

accuracy_dtc = accuracy_score(y_test, y_dtc_pred)
print("Accuracy: {0:.4f}".format(accuracy_dtc))
print()


In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import classification_report

# 7-fold cross-validation on the training split, using the decision tree defined above
k_fold = KFold(n_splits=7, shuffle=False, random_state=None)
print("Mean accuracy", cross_val_score(dtc, X_train, y_train, cv=k_fold, n_jobs=1, scoring='accuracy').mean())
print(classification_report(y_test, y_dtc_pred))
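
A common variant of the cross-validation above is to use stratified, shuffled folds with a fixed seed, so that every fold keeps the class ratio and the result is reproducible. The sketch below runs on synthetic data from `make_classification`; `X_demo`, `y_demo`, `cv`, and `model` are illustrative names and the scores say nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only.
X_demo, y_demo = make_classification(n_samples=210, n_features=6, random_state=0)

# Seven shuffled folds that each preserve the class ratio.
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)
model = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold.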

### ***11. <a id="11"></a>  Feature importance*** 

In [None]:
features=X.columns
importances =dtc.feature_importances_
indices = np.argsort(importances)

plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='y', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
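
The same ranking can also be read as a table, which is convenient when exact values matter. A minimal sketch on synthetic data (`names`, `model`, and `ranked` are illustrative names, and the feature names `f0` to `f4` are made up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with five features, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(5)]

model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Impurity-based importances of a fitted tree are normalized to sum to 1.
ranked = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(ranked)
```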