## Visual Data analysis and ML modeling on Banking Data ##

In [2]:
#Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import scipy
%matplotlib inline

In [3]:
# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', 22)

In [4]:
# Creating the DataFrame from the .csv file
df=pd.read_csv('/kaggle/input/banking-dataset/bank-additional-full.csv', sep=';')
df.head()

In [5]:
df.shape

In [6]:
df.columns




Input features (column names):

1.  `age` - client age in years (numeric)
2.  `job` - type of job (categorical: `admin.`, `blue-collar`, `entrepreneur`, `housemaid`, `management`, `retired`, `self-employed`, `services`, `student`, `technician`, `unemployed`, `unknown`)
3.  `marital` - marital status (categorical: `divorced`, `married`, `single`, `unknown`)
4.  `education` - client education (categorical: `basic.4y`, `basic.6y`, `basic.9y`, `high.school`, `illiterate`, `professional.course`, `university.degree`, `unknown`)
5.  `default` - has credit in default? (categorical: `no`, `yes`, `unknown`)
6.  `housing` - has housing loan? (categorical: `no`, `yes`, `unknown`)
7.  `loan` - has personal loan? (categorical: `no`, `yes`, `unknown`)
8.  `contact` - contact communication type (categorical: `cellular`, `telephone`)
9.  `month` - last contact month of the year (categorical: `jan`, `feb`, `mar`, ..., `nov`, `dec`)
10. `day_of_week` - last contact day of the week (categorical: `mon`, `tue`, `wed`, `thu`, `fri`)
11. `duration` - last contact duration, in seconds (numeric).
12. `campaign` - number of contacts performed for this client during this campaign (numeric, includes last contact)
13. `pdays` - number of days that have passed after the client was last contacted from the previous campaign (numeric; 999 means the client has not been previously contacted)
14. `previous` - number of contacts performed for this client before this campaign (numeric)
15. `poutcome` - outcome of the previous marketing campaign (categorical: `failure`, `nonexistent`, `success`)
16. `emp.var.rate` - employment variation rate, quarterly indicator (numeric)
17. `cons.price.idx` - consumer price index, monthly indicator (numeric)
18. `cons.conf.idx` - consumer confidence index, monthly indicator (numeric)
19. `euribor3m` - euribor 3 month rate, daily indicator (numeric)
20. `nr.employed` - number of employees, quarterly indicator (numeric)

Output feature (desired target):


21. `y` - has the client subscribed to a term deposit? (binary: `yes`, `no`)



In [7]:
df.dtypes

In [8]:
# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')


#### Mapping the target feature ####

The target feature `y` records whether a marketing phone call ended with the client subscribing to a term deposit.
We encode the positive outcome (`yes`) as `1` and the negative outcome (`no`) as `0`.

In [9]:
target={'no':0, 'yes':1}

In [10]:
df['y']=df['y'].map(target)
df.head()

All `no` and `yes` observations are now encoded as 0 and 1, respectively.

## Using different Python libraries for visual data analysis ##

#### 1. Matplotlib ####

For each feature we can build a separate histogram with the `.hist` method.

In [11]:
df['age'].hist(figsize=(10,6), bins=10, legend=True, grid=True)

The histogram shows that most of the clients are between the ages of 25 and 50, which corresponds to the actively working part of the population.

#### Plotting the average client age by marital status ####

In [12]:
df[['age', 'marital']].groupby('marital').mean().plot(kind='bar', rot=50, figsize=(10,6), grid=True)

#### Seaborn ####

Next, we get acquainted with the first "complex" plot type, the pair plot (scatter plot matrix). This visualization lets us see the relationships between several features in a single picture.

In [13]:
import seaborn as sns

In [14]:
sns.pairplot(df[['age', 'duration', 'campaign']])

This visualization reveals an inverse relationship between `campaign` and `duration`: the more often a client is contacted during the campaign, the shorter the contacts tend to be.

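Seaborn can also plot the distribution of a single feature. For example, let us look at the distribution of client age with a histogram and a kernel density estimate.

In [15]:
plt.rcParams["figure.figsize"] = (8, 6)
# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(df['age'], bins=10, color='r', kde=True, stat='density')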

To look more closely at the relationship between two numerical features, there is also `jointplot`, a hybrid of a scatter plot and histograms (the margins show the distribution of each feature). Let us look at the relationship between client age and the last contact duration.

In [16]:
sns.jointplot(x='age', y='duration', data=df, kind='reg')

Another useful seaborn plot type is the **box plot** ("box and whisker plot"). Let's compare client age across the five most common job types.

In [17]:
top_5_job=df['job'].value_counts().sort_values(ascending=False).head(5).index.values
top_5_job

In [18]:
sns.boxplot(x='age', y='job', data=df[df['job'].isin(top_5_job)], orient='h', linewidth=2)

The plot shows that, among the five most common job types, clients in management tend to be the oldest, while the admin. and technician categories have the largest number of outliers.

One more plot type is the **heat map**. A heat map shows how a numerical value is distributed across two categorical features. Here we visualize the number of clients who subscribed (the sum of `y`) by marital status and job type.

In [19]:
#Creating pivot table with feature and target variable
job_marital_y = df.pivot_table(index="job", columns="marital", values="y", aggfunc="sum")
job_marital_y


In [20]:
#Visualizing the heat map through pivot table
sns.heatmap(job_marital_y, annot=True, fmt="d", linewidths=0.9, linecolor='y');

Next, we visualize the number of subscribed clients by contact type and job type.

In [21]:
job_contact_y = df.pivot_table(index="job", columns="contact", values="y", aggfunc="sum")
job_contact_y

In [22]:
#Visualizing the heat map through pivot table
sns.heatmap(job_contact_y, annot=True, fmt="d", linewidths=0.9, linecolor='y', cmap="YlGnBu")

#### Plotly ####


So far we have looked at visualizations based on the **Matplotlib** and **Seaborn** libraries. However, these are not the only options for building charts in Python. We will also get acquainted with **plotly**, an open-source library for building interactive graphics in Python.

The beauty of interactive graphs is that we can see the exact numerical value on mouse hover, hide uninteresting series in the visualization, zoom in on a certain area of the graphic, etc.

To begin with, we build a line plot showing the total number of clients and the number of attracted clients by **age**.

In [23]:
#Importing plotly libraries

import plotly
import plotly.graph_objs as gp
from plotly.offline import iplot, plot, init_notebook_mode, download_plotlyjs
init_notebook_mode(connected=True)

In [24]:
#Aggregating the 'age' feature and target 'y'
age_sum=df.groupby('age')[['y']].sum()
age_sum

In [25]:
age_count=(df.groupby('age')[['y']].count())
age_count

In [26]:
#Creating a dataframe by joining two aggregate object 'age_sum' and 'age_count'

age_df=(age_sum.join(age_count, rsuffix='_count'))

#Renaming the columns
age_df.columns=["Attracted", "Total Number"]
age_df

In **Plotly**, we create a `Figure` object, which consists of the data (a list of traces) and the design/style, described by a `Layout` object. In simple cases, we can call `iplot` on just the list of traces.

In [27]:
trace_attracted=gp.Scatter(x=age_df.index, y=age_df["Attracted"], name="Attracted", fillcolor='violet') 
trace_tot_num= gp.Scatter(x=age_df.index, y=age_df["Total Number"], name= "Total Number", fillcolor='yellow')

data=[trace_attracted, trace_tot_num]
#data

In [28]:
#Creating layout, figure and visualize the data
layout={'title': 'Statistics by client age'}
fig=gp.Figure(data=data, layout=layout)

iplot(fig, show_link=False)

Let us also look at the distribution of clients by month, showing both the number of attracted clients and the total number of clients. To do this, we build a **bar chart**.

In [29]:
# Creating aggregate object on month data
month_index=["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]

month_sum=df.groupby('month')[['y']].sum()
month_sum

In [30]:
month_count=df.groupby('month')[['y']].count()
month_count

In [31]:
#Creating month dataframe by joining these aggregated object

month_df=(month_sum.join(month_count, rsuffix='_count')).reindex(month_index)

month_df.columns=['Attracted', 'Total Number']
month_df
          

In [32]:
trace_mo_att=gp.Bar(x=month_df.index, y=month_df['Attracted'], name='Attracted')
trace_mo_tot=gp.Bar(x=month_df.index, y=month_df['Total Number'], name='Total Number')
data_month=[trace_mo_att, trace_mo_tot]

data_month

In [33]:
layout={'title': "Statistics by month"}
        
fig=gp.Figure(data=data_month, layout=layout)

iplot(fig, show_link=True)

**Plotly** can also build box plots. Let us consider how client age differs by marital status.

In [34]:
# Creating box plot with age and marital status category

data_status=[]

for status in df['marital'].unique():
    data_status.append(gp.Box(y=df[df.marital == status].age, name=status))
    
iplot(data_status, show_link=False)

The plot clearly shows the distribution of clients by age and the presence of outliers for every marital status category except `unknown`. Moreover, the plot is interactive: hovering the mouse pointer over its elements reveals additional statistical characteristics of the series.

## Multi-collinearity analysis ##

When a dataset has a large number of independent variables (features), some of them may be highly correlated with each other. A high correlation between independent variables is called **multi-collinearity**. Multi-collinearity can destabilize the model, so it is necessary to detect it and take corrective action.

In [35]:
# Creating feature matrix with numerical features
X_ind=df[["age", "duration", "campaign","pdays","previous","emp.var.rate","cons.price.idx","cons.conf.idx","euribor3m","nr.employed"]]
X_ind.head()

In [36]:
X_features=X_ind.columns
X_features

### Variance Inflation Factor ###

The variance inflation factor (VIF) is a measure used to detect multi-collinearity. For a given feature it is defined as

 $$ VIF = \frac{1}{1 - R^2} $$

where $R^2$ is the R-squared value obtained by regressing that feature on all the other features.
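
As a minimal sketch of what the formula computes, the snippet below regresses each column on the remaining ones with ordinary least squares (including an intercept) and applies $1 / (1 - R^2)$. The helper name `vif_from_r2` and the synthetic data are illustrative assumptions, not part of the notebook. Note that `variance_inflation_factor` from statsmodels does not add an intercept on its own, so its numbers can differ from this sketch unless a constant column is included in the matrix.

```python
import numpy as np

def vif_from_r2(X):
    """Return one VIF per column of X, computed as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Add an intercept column so that R^2 is the usual centered R^2
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (hypothetical): c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two nearly identical columns should receive large VIF values, while the independent column should stay close to 1.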

In [37]:
#Importing VIF library

from statsmodels.stats.outliers_influence import variance_inflation_factor
import scipy
from scipy import stats

In [38]:
def get_vif(X_ind):
    X_matrix=X_ind.values
    vif=[variance_inflation_factor(X_matrix, i) for i in range (X_matrix.shape[1])]
    vif_factors=pd.DataFrame()
    vif_factors['column']=X_ind.columns
    vif_factors['VIF']=vif
    
    return vif_factors

In [39]:
vif_factors=get_vif(X_ind[X_features])
vif_factors

In [40]:

selected_vif_feat=vif_factors[vif_factors.VIF > 4].column
selected_vif_feat

In [41]:
plt.figure(figsize=(15,5))
sns.heatmap(X_ind[selected_vif_feat].corr(), annot=True)
plt.title('Heatmap depicting correlation between features')

The heat map above shows that `emp.var.rate`, `euribor3m` and `nr.employed` are highly correlated. We can keep one feature from each group of correlated features and eliminate the others. Here we remove the `emp.var.rate` feature from the feature matrix.

In [42]:

removed_columns=['emp.var.rate']

# Keep the original column order (a set difference would not preserve it)
X_new_feature=[col for col in X_features if col not in removed_columns]
X_new_feature

In [43]:
# Creating new feature matrix removing one column
X=X_ind[X_new_feature]
X

The target is binary, so this is a classification problem. We will use several classification techniques, such as Logistic Regression, Decision Tree, Random Forest, KNN and SVM, to predict whether a client subscribes to a term deposit.

### Logistic Regression ###

Logistic regression is a statistical model in which the response variable takes a discrete value and the explanatory variables can be either continuous or discrete. If the outcome variable takes only two values, the model is called a binary logistic regression model. Here we are working with binary logistic regression, where the outcome is either '0' or '1'.

The classifiers used here require numerical input, so we use the feature matrix of numerical features created above.

In [45]:
# Feature matrix

X.head()

In [46]:
X.shape

In [47]:
#Creating response vector

Y=df["y"]
Y

In [48]:
# Importing scikit-learn libraries

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix
from sklearn import metrics
from sklearn import preprocessing

In [49]:
# Standardizing the feature matrix with StandardScaler

X=preprocessing.StandardScaler().fit_transform(X)
X
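
`StandardScaler` rescales each feature to zero mean and unit variance, $z = (x - \mu) / \sigma$. In the cell above the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation. A minimal sketch of the alternative, learning the statistics on the training part only, is shown below for a single feature. The values and the helper names `fit_scaler` and `transform` are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

# Hypothetical ages, already split into train and test parts
train_age = [25, 35, 45, 55]
test_age = [30, 60]

mu, sigma = fit_scaler(train_age)              # statistics come from train only
train_scaled = transform(train_age, mu, sigma)
test_scaled = transform(test_age, mu, sigma)   # test reuses the train statistics

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

With scikit-learn the same idea is to call `fit` (or `fit_transform`) on the training features and only `transform` on the test features.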

In [50]:
# Splitting the dataset into train and test sets

x_train, x_test, y_train, y_test=train_test_split(X,Y, test_size=0.30, random_state= 50)

In [51]:
print("Train feature Shape: ", x_train.shape)
print("Test feature Shape: ", x_test.shape)
print("Train target Shape: ",y_train.shape)
print("Test target Shape: ", y_test.shape)

In [52]:
# Creating and fitting the logistic regression classifier
logistic_model=LogisticRegression()
logistic_model.fit(x_train, y_train)

In [53]:
# Predicted class labels (0/1) for the test set; predict() returns labels, not probabilities
y_prob=logistic_model.predict(x_test)
y_prob

In [54]:
# Model Score
print("Model Accuracy: ",logistic_model.score(x_test, y_test))

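From the above accuracy score we can say our model is more than **90%** accurate. Because accuracy can be misleading when the classes are imbalanced, it should be read together with the confusion matrix and the classification report below.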

In [55]:
print("Mean Squared Error:",mean_squared_error(y_test, y_prob))

In [56]:
from math import sqrt
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_prob)))

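Because the labels and the predictions are both 0 or 1, the squared error of each sample is either 0 or 1, so the MSE equals the misclassification rate (1 - accuracy). An **RMSE** of about 0.30 therefore corresponds to an error rate of about 0.09, which agrees with the accuracy above.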
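
A quick check on made-up labels (not taken from the dataset) that, for hard 0/1 predictions, the MSE equals the share of misclassified samples:

```python
from math import sqrt

# Hypothetical true labels and hard predictions (10 samples, 1 mistake)
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
rmse = sqrt(mse)

print(mse, accuracy, rmse)
```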

In [57]:
print("R-Squared value: ", r2_score(y_test, y_prob))

In [58]:
print("R-Squared value for train data:  ", r2_score( y_train, logistic_model.predict(x_train)))

In [59]:
# Creating Confusion matrix
print("Confusion Matrix: \n", metrics.confusion_matrix(y_test, y_prob))

In [60]:
# Creating classification report on Logistic Regression model
from sklearn.metrics import classification_report
print("Classification Report: \n", metrics.classification_report( y_test, y_prob))
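
The classification report is derived from the four cells of the confusion matrix. The sketch below recomputes precision and recall for the positive class from hypothetical counts (not the notebook's actual output) and compares the accuracy with a baseline that always predicts `0`, which is a useful sanity check when few clients subscribe. It assumes the scikit-learn layout `[[TN, FP], [FN, TP]]`; the function name `summarize` is illustrative.

```python
def summarize(confusion):
    """Compute basic metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),             # share of predicted positives that are correct
        "recall": tp / (tp + fn),                # share of actual positives that are found
        "baseline_accuracy": (tn + fp) / total,  # accuracy of always predicting class 0
    }

# Hypothetical counts for illustration only
report = summarize([[850, 50], [60, 40]])
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case the accuracy is slightly below the always-`0` baseline even though it looks high in isolation, which is why precision and recall for class `1` deserve attention.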

### K-Nearest Neighbor ###

The **K-Nearest Neighbor** (KNN) algorithm is a non-parametric, lazy learning algorithm used for regression and classification problems. Unlike parametric models, which estimate a fixed number of parameters and make strong assumptions about the data, KNN keeps the training data and makes predictions directly from it.
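
A KNN classifier assigns a new point the majority label among its k nearest training points. The following plain-Python sketch shows that voting rule on toy data; the function name `knn_predict` and the points are hypothetical, and scikit-learn's `KNeighborsClassifier` uses more efficient neighbor searches.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data (hypothetical): class 0 near the origin, class 1 near (5, 5)
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(points, labels, (0.5, 0.5))
pred_b = knn_predict(points, labels, (5.5, 5.5))
print(pred_a, pred_b)
```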

In [61]:
# Importing KNN library
from sklearn.neighbors import KNeighborsClassifier

In [62]:
# Building KNN model and fit with train data
knn_model=KNeighborsClassifier(n_neighbors=5)
knn_model.fit(x_train, y_train)

In [63]:
y_pred_knn=knn_model.predict(x_test)
y_pred_knn

In [64]:
print('Model Accuracy: ', knn_model.score(x_test, y_test))

In [65]:
print('KNN Accuracy Score: ', metrics.accuracy_score(y_test, y_pred_knn))

The KNN model accuracy is about 90%.

**Best k-value for maximum model accuracy**

In [66]:
# range(1, 50) tries every k from 1 to 49; the tuple (1, 50) would test only k=1 and k=50
k_range=range(1, 50)
score=[]

for k in k_range:
    knn_model=KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(x_train, y_train)
    knn_pred=knn_model.predict(x_test)
    score.append(metrics.accuracy_score(y_test, knn_pred))


In [67]:
plt.plot(k_range, score, color='g')
plt.xlabel('K-Value')
plt.ylabel('Accuracy Score')

plt.tight_layout()
plt.show()

This plot shows that model accuracy increases with the K value and reaches a maximum of about 90.5%.

### Support Vector Machine ###

The support vector machine (SVM) is another machine learning algorithm, widely used for classification and regression problems in supervised learning. Its objective is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly separates the classes of data points.

In [69]:
# Importing SVM library

from sklearn.svm import SVC

In [70]:
svm_model=SVC(kernel='rbf', gamma=0.2, random_state=10)
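
The `rbf` kernel used above measures the similarity of two points as $\exp(-\gamma \lVert x - z \rVert^2)$, so `gamma` controls how quickly the similarity decays with distance. A minimal sketch of that formula follows, with made-up points and the function name `rbf_kernel_value` chosen for illustration:

```python
from math import exp

def rbf_kernel_value(x, z, gamma=0.2):
    """Return exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * squared_distance)

same = rbf_kernel_value([1.0, 2.0], [1.0, 2.0])   # identical points
near = rbf_kernel_value([1.0, 2.0], [1.5, 2.0])   # close points
far = rbf_kernel_value([1.0, 2.0], [4.0, 6.0])    # distant points
print(same, near, far)
```

Identical points get a similarity of 1, and the value shrinks towards 0 as the points move apart; a larger `gamma` makes that decay faster.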

In [71]:
svm_model.fit(x_train, y_train)

In [73]:
svm_y_pred=svm_model.predict(x_test)
svm_y_pred

In [76]:
print("Model Accuracy: ", svm_model.score(x_test, y_test))

In [78]:
print('Accuracy Score: ', metrics.accuracy_score(y_test, svm_y_pred))

In [79]:
# Finding jaccard score and f1_score

from sklearn.metrics import jaccard_score, f1_score

In [80]:
print('Jaccard Score: ', metrics.jaccard_score(y_test, svm_y_pred))

In [82]:
print('F1 Score: ', metrics.f1_score(y_test, svm_y_pred))
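
Both scores are computed for the positive class and ignore true negatives: Jaccard = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN), which implies Jaccard = F1 / (2 - F1). The sketch below checks this relation on hypothetical counts (the function name `jaccard_and_f1` is illustrative):

```python
def jaccard_and_f1(tp, fp, fn):
    """Return (Jaccard, F1) for the positive class from raw counts."""
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return jaccard, f1

# Hypothetical counts for illustration only
jaccard, f1 = jaccard_and_f1(tp=40, fp=50, fn=60)
print(jaccard, f1, f1 / (2 - f1))
```

Because one score is a monotonic function of the other, they rank models identically; the Jaccard score is never larger than F1.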

In [85]:
print('Classification report: \n', metrics.classification_report(y_test, svm_y_pred))

### Decision Tree ###

A decision tree is a collection of divide-and-conquer problem-solving strategies that use a tree-like structure to predict the value of an outcome variable. The tree starts with a root node containing the complete data and then uses intelligent strategies to split the nodes into multiple branches.

In [86]:
# Importing library

from sklearn.tree import DecisionTreeClassifier

In [87]:
DT_model=DecisionTreeClassifier(criterion='entropy', max_depth=6, splitter='random', min_impurity_decrease=0.0)
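
With `criterion='entropy'` a split is scored by information gain, the drop in label entropy from the parent node to the size-weighted average of its children. A minimal sketch on made-up labels (the function names are illustrative, not scikit-learn API):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical node with 4 positives and 4 negatives
parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_gain = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
useless_gain = information_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0])
print(perfect_gain, useless_gain)
```

A split that separates the classes completely recovers the full entropy of the parent, while a split that leaves both children as mixed as the parent gains nothing.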

In [88]:
DT_model.fit(x_train, y_train)

In [89]:
y_pred_dt=DT_model.predict(x_test)
y_pred_dt

In [90]:
print('Model Accuracy: ', metrics.accuracy_score(y_test, y_pred_dt))

In [91]:
print('Classification report: \n', metrics.classification_report(y_test, y_pred_dt))