# Question: Bank Marketing Analysis
Attached is a txt file containing some real data that relates to a marketing campaign run by
a bank. The aim of the marketing campaign was to get customers to subscribe to a bank
term deposit product. Whether they did this or not is variable ‘y’ in the data set.
The bank in question is considering how to optimise this campaign in future.
What would your recommendations to the marketing manager be?


# Variable description

The variables are as follows:

Input variables:

1. age (numeric)
2. job: type of job (categorical: 'admin.', 'unknown', 'unemployed', 'management', 'housemaid', 'entrepreneur', 'student', 'blue-collar', 'self-employed', 'retired', 'technician', 'services')
3. marital: marital status (categorical: 'married', 'divorced', 'single'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'unknown', 'secondary', 'primary', 'tertiary')
5. default: has credit in default? (binary: 'yes', 'no')
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: 'yes', 'no')
8. loan: has personal loan? (binary: 'yes', 'no')

### Related with the last contact of the current campaign

9. contact: contact communication type (categorical: 'unknown', 'telephone', 'cellular')
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
12. duration: last contact duration, in seconds (numeric)

### Other attributes

13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: 'unknown', 'other', 'failure', 'success')

Output variable (desired target):

17. y: has the client subscribed a term deposit? (binary: 'yes', 'no')

1.  Data cleansing and missing value treatment
2.  Exploratory data analysis and visual plotting
3.  The scope of feature engineering
4.  Model building and evaluation: please choose a varied set of models and show the prediction accuracy that each one achieves on the data

Importing and displaying the data

In [1]:
import time
from datetime import datetime as dt
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import plotly.graph_objs as go

import gc
import warnings
from sklearn.preprocessing import label_binarize, StandardScaler
from tqdm import tqdm

import category_encoders as ce
import sklearn.linear_model as lm
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, confusion_matrix, precision_score, auc, \
    recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
warnings.filterwarnings("ignore")

In [None]:
# sep=r"\s+" splits on any whitespace; it replaces the deprecated delim_whitespace=True
data = pd.read_csv("data.txt", sep=r"\s+")
print(data.shape)
data.head(45)

In [None]:
data.dtypes

In [None]:
#check for missing values
data.isnull().sum()

# Exploratory Data Analysis
# Categorical features

In [None]:

deposit=data['y'].value_counts()
deposit.plot(kind='bar')
plt.title('Ratio of acceptance and rejection')

The data set is highly imbalanced: far more clients fall into the 'no' category than into the 'yes' category.
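To put a number on the imbalance, the class shares can be computed directly. The sketch below uses a small toy target rather than the bank data, and `class_balance` is a hypothetical helper name. It also reports the accuracy of a model that always predicts the majority class, which is a useful baseline when reading accuracy figures later on.

```python
import pandas as pd

def class_balance(target):
    """Return the share of each class and the accuracy of always predicting the majority class."""
    shares = target.value_counts(normalize=True)
    return shares, shares.max()

# Toy target (not the bank data): nine 'no' answers for every 'yes'
toy_y = pd.Series(['no'] * 9 + ['yes'])
shares, baseline_accuracy = class_balance(toy_y)
print(shares)
print(f"Always predicting the majority class scores {baseline_accuracy:.0%} accuracy")
```

On the real data the same call would be `class_balance(data['y'])`.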

In [None]:
ageyes=data['age'][data.y=='yes']
ageno=data['age'][data.y=='no']
plt.figure(facecolor='w')
ageyes.plot(kind='hist',color='green')
plt.title('Age Distribution of Accepted Customers')
plt.figure(facecolor='w')
ageno.plot(kind='hist',color='gray')
plt.title('Age Distribution of Rejected Customers')
plt.show()

In [None]:
# function that shows the distribution of a categorical column, split by the target

def plot_bar(data ,column):

    temp_1 = pd.DataFrame()
    temp_1['No'] = data[data['y'] == 'no'][column].value_counts()
    temp_1['Yes'] = data[data['y'] == 'yes'][column].value_counts()
    plt.figure(facecolor='w')
    temp_1.plot(kind='bar')
    plt.xlabel(f'{column}')
    plt.ylabel('Number of clients')
    plt.title('Distribution of {} and deposit'.format(column))

    plt.show()

In [None]:
for column in ['job', 'marital', 'education', 'contact', 'loan', 'housing']:
    plot_bar(data, column)

In [None]:
# Convert target variable into numeric
# data.y = data.y.map({'no':0, 'yes':1}).astype('uint8')
data_new = pd.get_dummies(data, columns=['job','marital',
                                         'education','default',
                                         'housing','loan',
                                         'contact','month',
                                         'poutcome'])

TODO:
ADD day of week

In [None]:
# 2014	Saturday, 10 May 2014	May	+ 9 months	1 time in year 2014
# Saturday, 17 May 2014	May	+ 9 months	1 time in year 2014
# so let's set the year to 2014
data.day.unique()
# pd.to_datetime(str(f'{data.month} {str(data.day)}, 2014'), format="%b %d, %Y")
# data['dates'] = pd.to_datetime(str('2014-'+str(data.month)+'-'+str(data.day)), format="%Y-%b-%d" )
# data['dates']
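One possible way to address the day-of-week TODO is sketched below. It assumes the year 2014, as in the comments above, and the helper name `add_day_of_week` and the month lookup table are illustrative, not part of the original notebook. Mapping month abbreviations to numbers avoids any dependence on locale-specific date parsing, and impossible dates (for example 30 February) become missing values instead of raising an error.

```python
import pandas as pd

MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
MONTH_TO_NUM = {name: number for number, name in enumerate(MONTHS, start=1)}

def add_day_of_week(df, year=2014):
    """Return a copy of df with a 'day_of_week' column built from 'month' and 'day'.

    The year is an assumption: the data set itself does not record it.
    """
    parts = pd.DataFrame({'year': year,
                          'month': df['month'].map(MONTH_TO_NUM),
                          'day': df['day']})
    dates = pd.to_datetime(parts, errors='coerce')  # invalid dates become NaT
    out = df.copy()
    out['day_of_week'] = dates.dt.day_name()
    return out

# Toy rows (not the bank data)
toy = pd.DataFrame({'month': ['may', 'may', 'feb'], 'day': [10, 17, 30]})
toy_with_dow = add_day_of_week(toy)
print(toy_with_dow)
```

Because the true year is unknown, the resulting weekday is only as reliable as the 2014 assumption.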

In [None]:
data1 = data[data['y'] == 'yes']
data2 = data[data['y'] == 'no']

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(12,10))

# b1 = ax[0, 0].bar(data1['day_of_week'].unique(),height = data1['day_of_week'].value_counts(),color='#000000')
# b2 = ax[0, 0].bar(data2['day_of_week'].unique(),height = data2['day_of_week'].value_counts(),bottom = data1['day_of_week'].value_counts(),color = '#DC4405')
# ax[0, 0].title.set_text('Day of week')
# #ax[0, 0].legend((b1[0], b2[0]), ('Yes', 'No'))
# Take labels and heights from the same value_counts() so each bar carries the right label
month_yes = data1['month'].value_counts()
month_no = data2['month'].value_counts().reindex(month_yes.index, fill_value=0)
b1 = ax[0, 1].bar(month_yes.index, height=month_yes, color='#000000')
b2 = ax[0, 1].bar(month_yes.index, height=month_no, bottom=month_yes, color='#DC4405')
ax[0, 1].title.set_text('Month')
job_yes = data1['job'].value_counts()
job_no = data2['job'].value_counts().reindex(job_yes.index, fill_value=0)
ax[1, 0].bar(job_yes.index, height=job_yes, color='#000000')
ax[1, 0].bar(job_yes.index, height=job_no, bottom=job_yes, color='#DC4405')
ax[1, 0].title.set_text('Type of Job')
ax[1, 0].tick_params(axis='x',rotation=90)
edu_yes = data1['education'].value_counts()
edu_no = data2['education'].value_counts().reindex(edu_yes.index, fill_value=0)
ax[1, 1].bar(edu_yes.index, height=edu_yes, color='#000000')
ax[1, 1].bar(edu_yes.index, height=edu_no, bottom=edu_yes, color='#DC4405')
ax[1, 1].title.set_text('Education')
ax[1, 1].tick_params(axis='x',rotation=90)
#ax[0, 1].xticks(rotation=90)
plt.figlegend((b1[0], b2[0]), ('Yes', 'No'),loc="right",title = "Term deposit")
plt.show()

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(15,10))

def stacked_bar(axis, column, title):
    # Use one category order for labels and both classes so the stacked bars line up
    yes = data1[column].value_counts()
    no = data2[column].value_counts().reindex(yes.index, fill_value=0)
    bar_yes = axis.bar(yes.index, height=yes, color='#000000')
    bar_no = axis.bar(yes.index, height=no, bottom=yes, color='#DC4405')
    axis.title.set_text(title)
    return bar_yes, bar_no

b1, b2 = stacked_bar(ax[0, 0], 'marital', 'Marital Status')
stacked_bar(ax[0, 1], 'housing', 'Has housing loan')
stacked_bar(ax[0, 2], 'loan', 'Has personal loan')
stacked_bar(ax[1, 0], 'contact', 'Type of Contact')
stacked_bar(ax[1, 1], 'default', 'Has credit in default')
stacked_bar(ax[1, 2], 'poutcome', 'Outcome of the previous marketing campaign')
plt.figlegend((b1[0], b2[0]), ('Yes', 'No'),loc="right",title = "Term deposit")
plt.show()

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(12,10))

ax[0, 0].hist(data2['age'],color = '#DC4405',alpha=0.7,bins=20, edgecolor='white')
ax[0, 0].hist(data1['age'],color='#000000',alpha=0.5,bins=20, edgecolor='white')
ax[0, 0].title.set_text('Age')
ax[0, 1].hist(data2['duration'],color = '#DC4405',alpha=0.7, edgecolor='white')
ax[0, 1].hist(data1['duration'],color='#000000',alpha=0.5, edgecolor='white')
ax[0, 1].title.set_text('Contact duration')
ax[1, 0].hist(data2['campaign'],color = '#DC4405',alpha=0.7, edgecolor='white')
ax[1, 0].hist(data1['campaign'],color='#000000',alpha=0.5, edgecolor='white')
ax[1, 0].title.set_text('Number of contacts performed')
# pdays == -1 marks clients who were not previously contacted, so leave them out
ax[1, 1].hist(data2[data2['pdays'] != -1]['pdays'],color = '#DC4405',alpha=0.7, edgecolor='white')
ax[1, 1].hist(data1[data1['pdays'] != -1]['pdays'],color='#000000',alpha=0.5, edgecolor='white')
ax[1, 1].title.set_text('Previous contact days')
plt.figlegend((b1[0], b2[0]), ('Yes', 'No'),loc="right",title = "Term deposit")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15,10))

ax.hist(data2['previous'],color = '#DC4405',alpha=0.7, edgecolor='white')
ax.hist(data1['previous'],color='#000000',alpha=0.5, edgecolor='white')
ax.title.set_text('Number of contacts performed previously')





# Explore numerical features (EDA)

In [None]:
# Convert target variable into numeric
data.y = data.y.map({'no':0, 'yes':1}).astype('uint8')
data_new['y'] = data_new['y'].map({'no': 0, 'yes': 1}).astype('uint8')
print(data_new.shape)
data_new.dtypes

In [None]:
# Build correlation matrix
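# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
corr = data.corr(numeric_only=True)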
corr.style.background_gradient(cmap='PuBu')

In [None]:
# Replace categorical values with binary flags
data.contact = data.contact.map({'cellular': 1, 'telephone': 0, 'unknown': 0}).astype('uint8')
data.loan = data.loan.map({'yes': 1, 'unknown': 0, 'no' : 0}).astype('uint8')
data.housing = data.housing.map({'yes': 1, 'unknown': 0, 'no' : 0}).astype('uint8')
# note: here 1 means the client has NO credit in default
data.default = data.default.map({'no': 1, 'unknown': 0, 'yes': 0}).astype('uint8')
data.pdays = data.pdays.replace(-1, 0) # -1 marks clients who were not previously contacted
data.previous = data.previous.apply(lambda x: 1 if x > 0 else 0).astype('uint8') # binary: contacted before or not

In [None]:
# binary flag: was the previous marketing campaign a success?
data.poutcome.unique()

In [None]:
data.poutcome = data.poutcome.map({'unknown':0, 'failure':0, 'other':0, 'success':1}).astype('uint8')

In [None]:
data.age = np.log(data.age)

# use smaller dtypes to save memory

data.campaign = data.campaign.astype('uint8')
# pdays can exceed 255, which would overflow uint8
data.pdays = data.pdays.astype('uint16')

In [None]:
# function for one-hot encoding
def encode(data, col):
    return pd.concat([data, pd.get_dummies(col, prefix=col.name)], axis=1)

In [None]:
# One-hot encoding of 2 variables
data = encode(data, data.job)
data = encode(data, data.month)

In [None]:
# Drop transformed features
data.drop(['job', 'month'], axis=1, inplace=True)

In [None]:
data.drop_duplicates(inplace=True)


In [None]:
'''Convert call duration into 5 categories'''
def duration(data):
    data.loc[data['duration'] <= 102, 'duration'] = 1
    data.loc[(data['duration'] > 102) & (data['duration'] <= 180)  , 'duration'] = 2
    data.loc[(data['duration'] > 180) & (data['duration'] <= 319)  , 'duration'] = 3
    data.loc[(data['duration'] > 319) & (data['duration'] <= 645), 'duration'] = 4
    data.loc[data['duration']  > 645, 'duration'] = 5
    return data
duration(data)
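The same five bins can also be expressed in one step with `pd.cut`, which avoids overwriting the column repeatedly. This is a sketch on toy values using the thresholds from the cell above; the helper name `bin_duration` is hypothetical. Each interval is closed on the right, matching the `<=` comparisons in the original function.

```python
import numpy as np
import pandas as pd

def bin_duration(seconds):
    """Map call duration in seconds to categories 1..5 using right-closed intervals."""
    edges = [-np.inf, 102, 180, 319, 645, np.inf]
    return pd.cut(seconds, bins=edges, labels=[1, 2, 3, 4, 5]).astype('uint8')

# Toy durations chosen to sit on and just above every threshold
toy_duration = pd.Series([50, 102, 103, 180, 181, 319, 320, 645, 646])
toy_bins = bin_duration(toy_duration)
print(toy_bins.tolist())
```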

# Treating Imbalanced Data

In [None]:
# save target variable before transformation
y = data.y
y.value_counts()

This imbalance has to be treated so that the model is not biased toward the majority class.
Imbalance is generally treated in one of three ways, each described in the sections that follow.

In [None]:
predictors = data.drop(['pdays'],axis=1)
X = predictors.drop(['y'],axis=1)
X = pd.get_dummies(X)

## Random Undersampling
In this method, the majority category (here 'no') is randomly
sampled down to the size of the minority 'yes' category.
The remaining majority-category data is discarded.
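To make the mechanics concrete, the following sketch reproduces the idea with plain pandas on toy data. It is not the imblearn implementation, and `random_undersample` is a hypothetical helper: every class is sampled without replacement down to the size of the smallest class.

```python
import pandas as pd

def random_undersample(X, y, random_state=0):
    """Keep n rows per class, where n is the size of the smallest class."""
    n_minority = y.value_counts().min()
    kept = y.groupby(y).sample(n=n_minority, random_state=random_state).index
    return X.loc[kept], y.loc[kept]

# Toy data: 8 majority rows (0) and 2 minority rows (1)
toy_X = pd.DataFrame({'balance': range(10)})
toy_y = pd.Series([0] * 8 + [1] * 2)
X_small, y_small = random_undersample(toy_X, toy_y)
print(y_small.value_counts())
```

All minority rows survive, while six of the eight majority rows are thrown away, which is the information loss this method trades for balance.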

In [None]:
rus = RandomUnderSampler(random_state=0)
X_Usampled, y_Usampled = rus.fit_resample(X, y)
pd.Series(y_Usampled).value_counts()

In [None]:
''' Target encoding for two categorical features '''
# Create target encoder object
target_encode = ce.target_encoder.TargetEncoder(cols=['marital', 'education']).fit(data, y)
numeric_dataset = target_encode.transform(data)
# drop target variable
numeric_dataset.drop('y', axis=1, inplace=True)
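As a rough illustration of what target encoding does, the sketch below replaces each category with the mean of the target within that category. It is a simplification on toy data: the `TargetEncoder` from category_encoders additionally smooths these means toward the overall mean, which this hypothetical `mean_target_encode` helper leaves out.

```python
import pandas as pd

def mean_target_encode(categories, target):
    """Replace each category by the mean target value observed for it (no smoothing)."""
    means = target.groupby(categories).mean()
    return categories.map(means)

# Toy data (not the bank data)
toy = pd.DataFrame({'marital': ['single', 'single', 'married', 'married', 'married'],
                    'y': [1, 0, 0, 0, 1]})
toy['marital_encoded'] = mean_target_encode(toy['marital'], toy['y'])
print(toy)
```

Because the encoding is computed from the target, fitting it on rows that later end up in the test set can leak label information into the features.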

## Random Oversampling
In this method, the minority 'yes' category is randomly sampled with replacement
to match the size of the majority 'no' category. Minority entries will be repeated
many times.

In [None]:
ros = RandomOverSampler(random_state=0)
X_Osampled, y_Osampled = ros.fit_resample(X, y)
pd.Series(y_Osampled).value_counts()

## SMOTE - Synthetic Minority Oversampling Technique
This is an oversampling technique that, instead of randomly repeating minority 'yes' entries,
creates new entries synthetically by interpolating between neighbouring minority entries, so every new point stays inside the region spanned by the minority class.
After resampling, the minority category again matches the majority category in size.
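The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the imblearn implementation; `smote_like_samples` and its parameters are hypothetical names. Each synthetic point is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))                 # pick a minority point
        distances = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]     # k nearest, skipping the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                              # position along the segment, in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority class: the four corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(toy_minority, n_new=5)
print(new_points)
```

Since every new point is a convex combination of two existing minority points, none of them can fall outside the square in this toy case.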

In [None]:
sm = SMOTE(random_state=0)
X_SMOTE, y_SMOTE = sm.fit_resample(X, y)
pd.Series(y_SMOTE).value_counts()

In [None]:
'''Check numerical data set'''
print(numeric_dataset.shape, y.shape)
print(f'After transformation the numeric data set has shape {numeric_dataset.shape} (rows, features).'
      f' Target variable shape is {y.shape} as expected')

In [None]:
''' Split data into train and test sets '''
# set global random state
random_state = 11
# split data
X_train, X_test, y_train, y_test = train_test_split(numeric_dataset, y, test_size=0.2, random_state=random_state)
# free memory that is no longer referenced
gc.collect()


In [None]:
print('check the shape of the split train and test sets \n', X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
'''Build pipelines of classifiers'''
# set all CPU
n_jobs = -1

# LogisticRegression
pipe_lr = Pipeline([('lr', LogisticRegression(
    random_state=random_state,
    n_jobs=n_jobs,
    max_iter=500))])

# RandomForestClassifier
pipe_rf = Pipeline([('rf', RandomForestClassifier(random_state=random_state,
                                                  oob_score=True,
                                                  n_jobs=n_jobs))])

# KNeighborsClassifier
pipe_knn = Pipeline([('knn', KNeighborsClassifier(n_jobs=n_jobs))])

# DecisionTreeClassifier
# max_features='sqrt' replaces 'auto', which newer scikit-learn versions no longer accept
pipe_dt = Pipeline([('dt', DecisionTreeClassifier(random_state=random_state,
                                                  max_features='sqrt'))])
# BaggingClassifier
# note: SGDClassifier is the base classifier inside BaggingClassifier
# (the argument is called `estimator` from scikit-learn 1.2 on; older versions use `base_estimator`)
pipe_bag = Pipeline([('bag',BaggingClassifier(estimator=SGDClassifier(random_state=random_state,
                                                                      n_jobs=n_jobs,
                                                                      max_iter=1500),
                                              random_state=random_state,
                                              oob_score=True,
                                              n_jobs=n_jobs))])

# SGDClassifier
pipe_sgd = Pipeline([('sgd', SGDClassifier(random_state=random_state,
                                           n_jobs=n_jobs,
                                           max_iter=1500))])


In [None]:
'''Set parameters for Grid Search '''
# set the cross-validation folds
cv = StratifiedKFold(shuffle=True,
                     n_splits=5,
                     random_state=random_state)
# set for LogisticRegression
grid_params_lr = [{
                'lr__penalty': ['l2'],
                'lr__C': [0.3, 0.6, 0.7],
                'lr__solver': ['sag']
                }]
# set for RandomForestClassifier
grid_params_rf = [{
                'rf__criterion': ['entropy'],
                'rf__min_samples_leaf': [80, 100],
                'rf__max_depth': [25, 27],
                'rf__min_samples_split': [3, 5],
                'rf__n_estimators' : [60, 70]
                }]
# set for KNeighborsClassifier
grid_params_knn = [{'knn__n_neighbors': [16,17,18]}]

# set for DecisionTreeClassifier
grid_params_dt = [{
                'dt__max_depth': [8, 10],
                'dt__min_samples_leaf': [1, 3, 5, 7]
                  }]
# set for BaggingClassifier
grid_params_bag = [{'bag__n_estimators': [10, 15, 20]}]

# set for SGDClassifier
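grid_params_sgd = [{
                    # 'log_loss' is the current name of the former 'log' loss;
                    # 'modified_huber' is used instead of 'huber' because the evaluation
                    # loop calls predict_proba, which only these two losses provide
                    'sgd__loss': ['log_loss', 'modified_huber'],
                    'sgd__learning_rate': ['adaptive'],
                    'sgd__eta0': [0.001, 0.01, 0.1],
                    'sgd__penalty': ['l1', 'l2', 'elasticnet'],
                    'sgd__alpha':[0.1, 1, 5, 10]
                    }]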


In [None]:
'''Grid search objects'''
# for LogisticRegression
gs_lr = GridSearchCV(pipe_lr,
                     param_grid=grid_params_lr,
                     scoring='accuracy', cv=cv)
# for RandomForestClassifier
gs_rf = GridSearchCV(pipe_rf,
                     param_grid=grid_params_rf,
                     scoring='accuracy', cv=cv)
# for KNeighborsClassifier
gs_knn = GridSearchCV(pipe_knn,
                      param_grid=grid_params_knn,
                     scoring='accuracy', cv=cv)
# for DecisionTreeClassifier
gs_dt = GridSearchCV(pipe_dt,
                     param_grid=grid_params_dt,
                     scoring='accuracy', cv=cv)
# for BaggingClassifier
gs_bag = GridSearchCV(pipe_bag,
                      param_grid=grid_params_bag,
                     scoring='accuracy', cv=cv)
# for SGDClassifier
gs_sgd = GridSearchCV(pipe_sgd,
                      param_grid=grid_params_sgd,
                     scoring='accuracy', cv=cv)



In [None]:
# models that we iterate over
look_for = [gs_lr, gs_rf, gs_knn, gs_dt, gs_bag, gs_sgd]
# dict for later use
model_dict = {
    0:'Logistic_reg',
    1:'RandomForest',
    2:'Knn',
    3:'DecisionTree',
    4:'Bagging with SGDClassifier',
    5:'SGD Class'}

In [None]:
def do_results_obtain(look_for, X_train, y_train,X_test, y_test):

    ''' Function to iterate over models and obtain results'''

    result_acc = {}
    result_auc = {}
    models = []

    for index, model in enumerate(look_for):
            start = time.time()
            print()
            print('+++++++ Start New Model ++++++++++++++++++++++')
            print('Estimator is {}'.format(model_dict[index]))
            model.fit(X_train, y_train)
            print('---------------------------------------------')
            print('best params {}'.format(model.best_params_))
            print('best score is {}'.format(model.best_score_))
            auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
            print('---------------------------------------------')
            print('ROC_AUC is {} and accuracy rate is {}'.format(auc, model.score(X_test, y_test)))
            end = time.time()
            print('It lasted for {} sec'.format(round(end - start, 3)))
            print('++++++++ End Model +++++++++++++++++++++++++++')
            print()
            print()
            models.append(model.best_estimator_)
            result_acc[index] = model.best_score_
            result_auc[index] = auc
    return (result_acc, result_auc, models)

result_acc, result_auc, models = do_results_obtain(look_for, X_train, y_train,X_test, y_test)

In [None]:
plt.plot(model_dict.values(), result_acc.values(), c='r')
plt.plot(model_dict.values(), result_auc.values(), c='b')
plt.xlabel('Models')
plt.xticks(rotation=45)
plt.ylabel('Accuracy and ROC_AUC')
plt.title('Result of Grid Search')
plt.legend(['Accuracy', 'ROC_AUC'])
plt.show();



In [None]:
""" Model performance during Grid Search """
pd.DataFrame(list(zip(model_dict.values(), result_acc.values(), result_auc.values())),
                  columns=['Model', 'Accuracy_rate','Roc_auc_rate'])


In [None]:
def graph(model, X_train, y_train):
    obb = []
    est = list(range(5, 200, 5))
    for i in tqdm(est):
        random_forest = model(n_estimators=i, criterion='entropy', random_state=11, oob_score=True, n_jobs=-1, \
                           max_depth=25, min_samples_leaf=80, min_samples_split=3,)
        random_forest.fit(X_train, y_train)
        obb.append(random_forest.oob_score_)
    print('max oob {} and number of estimators {}'.format(max(obb), est[np.argmax(obb)]))
    plt.plot(est, obb)
    plt.title('model')
    plt.xlabel('number of estimators')
    plt.ylabel('oob score')
    plt.show()

graph(RandomForestClassifier, X_train, y_train)



In [None]:
''' Build graph for ROC_AUC '''

fpr, tpr, threshold = roc_curve(y_test, models[1].predict_proba(X_test)[:,1])

trace0 = go.Scatter(
    x=fpr,
    y=tpr,
    text=threshold,
    fill='tozeroy',
    name='ROC Curve')

trace1 = go.Scatter(
    x=[0,1],
    y=[0,1],
    line={'color': 'red', 'width': 1, 'dash': 'dash'},
    name='Baseline')

# use a separate name so the DataFrame `data` is not overwritten
traces = [trace0, trace1]

layout = go.Layout(
    title='ROC Curve',
    xaxis={'title': 'False Positive Rate'},
    yaxis={'title': 'True Positive Rate'})

fig = go.Figure(data=traces, layout=layout)
fig.show();


In [None]:
''' Build bar plot of feature importance of the best model '''

def build_feature_importance(model, X_train, y_train):

    # Fit a forest with fixed hyperparameters; `model` is the estimator class passed in
    forest = model(criterion='entropy', random_state=11, oob_score=True, n_jobs=-1,
                   max_depth=25, min_samples_leaf=80, min_samples_split=3, n_estimators=70)
    forest.fit(X_train, y_train)
    importance = pd.DataFrame({'feature': X_train.columns, 'importance': forest.feature_importances_})
    importance = importance.sort_values(by='importance', ascending=False)
    plt.figure(figsize=[6,6])
    # show the ten most important features
    sns.barplot(x='feature', y='importance', data=importance[:10], palette="Blues_d")
    plt.title('Feature importance of Random Forest after Grid Search')
    plt.xticks(rotation=45)
    plt.show()

build_feature_importance(RandomForestClassifier, X_train, y_train)


In [None]:
sm = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=7)
X_SMOTE, y_SMOTE = sm.fit_resample(X_train, y_train)
model = tree.fit(X_SMOTE,y_SMOTE)
y_pred = model.predict(X_test)

print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
forest = RandomForestClassifier(n_estimators= 1000,criterion="gini", max_depth=5,min_samples_split = 0.4,min_samples_leaf=1, class_weight="balanced")
model = forest.fit(X_train,y_train)
y_pred = model.predict(X_test)
pd.Series(y_pred).value_counts()
print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
forest = RandomForestClassifier(n_estimators= 1000,criterion="gini", max_depth=5,min_samples_split = 0.4,min_samples_leaf=1, class_weight="balanced")
X_SMOTE, y_SMOTE = sm.fit_resample(X_train, y_train)
model = forest.fit(X_SMOTE,y_SMOTE)
y_pred = model.predict(X_test)
pd.Series(y_pred).value_counts()
print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = lm.LogisticRegression(random_state=0, solver='lbfgs',multi_class='auto',max_iter=1000).fit(X_train,y_train)
y_pred = model.predict_proba(X_test)
y_pred = y_pred[:,1]

fpr_imb, tpr_imb, _ = roc_curve(y_test, y_pred)
roc_auc_imb = auc(fpr_imb, tpr_imb)
y_pred = model.predict(X_test)

print("Imbalanced -")
print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))
# Undersampled
# reuse the train/test split from above so the models are scored on the same test set
rus = RandomUnderSampler(random_state=0)
X_Usampled, y_Usampled = rus.fit_resample(X_train, y_train)
model = lm.LogisticRegression(random_state=0, solver='lbfgs',multi_class='auto',max_iter=5000).fit(X_Usampled,y_Usampled)
y_pred = model.predict_proba(X_test)
y_pred = y_pred[:,1]

fpr_us, tpr_us, _ = roc_curve(y_test, y_pred)
roc_auc_us = auc(fpr_us, tpr_us)
y_pred = model.predict(X_test)

print("Random undersampled -")
print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))
# Oversampled
# reuse the train/test split from above so the models are scored on the same test set
ros = RandomOverSampler(random_state=0)
X_Osampled, y_Osampled = ros.fit_resample(X_train, y_train)
model = lm.LogisticRegression(random_state=0, solver='lbfgs',multi_class='auto',max_iter=5000).fit(X_Osampled, y_Osampled)
y_pred = model.predict_proba(X_test)
y_pred = y_pred[:,1]

fpr_os, tpr_os, _ = roc_curve(y_test, y_pred)
roc_auc_os = auc(fpr_os, tpr_os)
y_pred = model.predict(X_test)

print("Random oversampled -")
print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))
# SMOTE
# reuse the train/test split from above so the models are scored on the same test set
sm = SMOTE(random_state=0)
X_SMOTE, y_SMOTE = sm.fit_resample(X_train, y_train)
model = lm.LogisticRegression(random_state=0, solver='lbfgs',multi_class='auto',max_iter=5000).fit(X_SMOTE,y_SMOTE)
y_pred = model.predict_proba(X_test)
y_pred = y_pred[:,1]

fpr_smote, tpr_smote, _ = roc_curve(y_test, y_pred)
roc_auc_smote = auc(fpr_smote, tpr_smote)
y_pred = model.predict(X_test)

print("SMOTE -")
print("Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))

In [None]:
plt.figure()
lw = 2
plt.plot(fpr_imb, tpr_imb,
         label='Imbalanced data ROC curve (area = {0:0.4f})'
               ''.format(roc_auc_imb),
         color='deeppink', linestyle=':', linewidth=2)

plt.plot(fpr_us, tpr_us,
         label='Undersampled data ROC curve (area = {0:0.4f})'
               ''.format(roc_auc_us),
         color='blue', linestyle='--', linewidth=2)

plt.plot(fpr_os, tpr_os,
         label='Random Oversampled data ROC curve (area = {0:0.4f})'
               ''.format(roc_auc_os),
         color='darkred', linestyle='--', linewidth=2)

plt.plot(fpr_smote, tpr_smote,
         label='SMOTE data ROC curve (area = {0:0.4f})'
               ''.format(roc_auc_smote),
         color='darkgreen', linestyle='--', linewidth=2)

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.00])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves: logistic regression with different resampling')
plt.legend(loc="lower right")
plt.show()


In [None]:
# Split first and oversample only the training part, so no synthetic rows leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
sm = SMOTE(random_state=0)
X_SMOTE, y_SMOTE = sm.fit_resample(X_train, y_train)

svm = SVC(kernel='linear')
model = svm.fit(X_SMOTE, y_SMOTE)
y_pred = model.predict(X_test)

print("Linear kernel- ","Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))
# use continuous decision scores rather than hard labels for the ROC curve
fpr_linear, tpr_linear, _ = roc_curve(y_test, model.decision_function(X_test))
roc_auc_linear = auc(fpr_linear, tpr_linear)

svm = SVC(kernel='rbf')
model = svm.fit(X_SMOTE, y_SMOTE)
y_pred = model.predict(X_test)

print("Gaussian kernel- ","Precision: ",round(precision_score(y_test,y_pred),2),"Recall: ",round(recall_score(y_test,y_pred),2))
fpr_rbf, tpr_rbf, _ = roc_curve(y_test, model.decision_function(X_test))
roc_auc_rbf = auc(fpr_rbf, tpr_rbf)
