# Logistic Regression

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y, which appears as the `deposit` column in this dataset).

Link for the dataset: https://www.kaggle.com/datasets/rouseguy/bankbalanced

## Cleaning and modifying data

In [176]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# pip install scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [177]:
# loading the dataset into the Jupyter notebook
df = pd.read_csv("bank.csv")

In [178]:
# let's read data
df.head()

# I checked at this link https://archive.ics.uci.edu/dataset/222/bank+marketing
# what some of the columns mean (their definitions and values)

# default --> has credit in default? (binary: "yes","no")
# duration --> last contact duration, in seconds (numeric)
# campaign --> number of contacts performed during this campaign
#              and for this client (numeric, includes last contact)
# pdays --> number of days that passed after the client was last contacted
#           from a previous campaign (numeric, -1 means client was not previously contacted)
# previous --> number of contacts performed before this campaign and for this client (numeric)
# poutcome --> outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes


## My clean data plan:
1. Check all NaN values --> delete if there are any
2. Check duplicates --> delete if there are any

3. Check the job column --> use clustering or other methods to convert it to numeric
4. marital column --> OneHotEncoder
5. education column --> OneHotEncoder
6. default column --> LabelEncoder
7. housing column --> LabelEncoder
8. loan column --> LabelEncoder
9. contact column --> OneHotEncoder
10. month column --> use clustering or other methods to convert it to numeric
11. poutcome column --> OneHotEncoder
12. deposit column --> LabelEncoder

13. Remove outliers

In [179]:
# 1. Check all NaN values --> delete if there are any
df.isna().sum()

# surprisingly, zero :)

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
deposit      0
dtype: int64

In [180]:
# 2. Check duplicates --> delete if there are any
# let's check whether we have duplicates
# or whether one more surprise awaits us :)
df.duplicated().sum()

# no duplicates either: life is full of surprises :)

0
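
Both checks came back empty here, so nothing had to be deleted. For a dataset where they do not, the "delete" half of steps 1 and 2 could look roughly like the sketch below. The frame `toy` and its values are made up for illustration and are not part of the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one repeated row
toy = pd.DataFrame({
    "age": [30, 30, 45, np.nan],
    "job": ["admin.", "admin.", "services", "student"],
})

n_missing = int(toy.isna().sum().sum())    # total number of missing cells
n_duplicates = int(toy.duplicated().sum()) # rows that repeat an earlier row
print(n_missing, n_duplicates)

# Drop rows containing any NaN, then drop exact duplicate rows
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Dropping rows is only one option; filling missing values instead may be preferable when many rows are affected.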

In [181]:
# 3. Check the job column --> convert it to numeric (LabelEncoder is used here)

# type of job (categorical: 'admin.','blue-collar','entrepreneur',
# 'housemaid','management','retired','self-employed','services','student',
# 'technician','unemployed','unknown')

# the 'job' column is overwritten with the encoded values

# LabelEncoder (from scikit-learn) converts each category
# into an integer code
from sklearn.preprocessing import LabelEncoder

# initialize LabelEncoder
label_encoder = LabelEncoder()

# fit LabelEncoder and transform 'job' column
df['job'] = label_encoder.fit_transform(df['job'])

# Print the mapping of encoded values to original categories
# I will use these values for tester row and for GUI
print("Encoded values:")
for category, encoded_value in zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)):
    print(f"{category}: {encoded_value}")

Encoded values:
admin.: 0
blue-collar: 1
entrepreneur: 2
housemaid: 3
management: 4
retired: 5
self-employed: 6
services: 7
student: 8
technician: 9
unemployed: 10
unknown: 11
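
Since these codes are meant to be reused for a tester row and a GUI, it may help to keep them as a dictionary instead of only printing them. The sketch below assumes a separate encoder per column (`job_encoder` is a hypothetical name), because refitting one shared `LabelEncoder` on the next column replaces its `classes_`. The sample labels are illustrative, not the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of job labels, not the full 'job' column
jobs = ["technician", "admin.", "services", "admin.", "unknown"]

job_encoder = LabelEncoder()
codes = job_encoder.fit_transform(jobs)

# classes_ is sorted, and each label's code is its position in classes_
job_mapping = {label: code for code, label in enumerate(job_encoder.classes_)}
print(job_mapping)

# Encode a new value (for example, one typed into a form) and decode it back
new_code = job_encoder.transform(["services"])[0]
decoded = job_encoder.inverse_transform([new_code])[0]
print(new_code, decoded)
```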


In [182]:
# Modifying the columns that the plan marked for OneHotEncoder
# 4. marital column --> planned OneHotEncoder, but LabelEncoder is applied here

# fit LabelEncoder and transform 'marital' column
df['marital'] = label_encoder.fit_transform(df['marital'])

# Print the mapping of encoded values to original categories
# I will use these values for tester row and for GUI
print("Encoded values:")
for category, encoded_value in zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)):
    print(f"{category}: {encoded_value}")

Encoded values:
divorced: 0
married: 1
single: 2


In [183]:
# 5. education column --> planned OneHotEncoder, but LabelEncoder is applied here
df['education'] = label_encoder.fit_transform(df['education'])

# Print the mapping of encoded values to original categories
# I will use these values for tester row and for GUI
print("Encoded values:")
for category, encoded_value in zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)):
    print(f"{category}: {encoded_value}")

Encoded values:
primary: 0
secondary: 1
tertiary: 2
unknown: 3


In [184]:
# 9. contact column --> planned OneHotEncoder, but LabelEncoder is applied here
df['contact'] = label_encoder.fit_transform(df['contact'])

# Print the mapping of encoded values to original categories
# I will use these values for tester row and for GUI
print("Encoded values:")
for category, encoded_value in zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)):
    print(f"{category}: {encoded_value}")

Encoded values:
cellular: 0
telephone: 1
unknown: 2


In [185]:
# 11. poutcome column --> planned OneHotEncoder, but LabelEncoder is applied here
df['poutcome'] = label_encoder.fit_transform(df['poutcome'])

# Print the mapping of encoded values to original categories
# I will use these values for tester row and for GUI
print("Encoded values:")
for category, encoded_value in zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)):
    print(f"{category}: {encoded_value}")

Encoded values:
failure: 0
other: 1
success: 2
unknown: 3
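
The plan lists OneHotEncoder for marital, education, contact and poutcome, while the cells above apply label encoding. Label encoding keeps one column per feature, but it also gives the categories a numeric order (for example divorced < married < single) that a linear model such as logistic regression will treat as meaningful. For comparison, here is a minimal sketch of one-hot encoding with pandas' `get_dummies`, used as a stand-in for scikit-learn's `OneHotEncoder`. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "age": [59, 41, 55, 54],
})

# One 0/1 column per category replaces the original 'marital' column
encoded = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(encoded)
```

The trade-off is the one the notebook mentions for the month column: one-hot encoding creates additional columns, one per category.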


In [186]:
# verify the transformation
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,0,1,1,no,2343,yes,no,2,5,may,1042,1,-1,0,3,yes
1,56,0,1,1,no,45,no,no,2,5,may,1467,1,-1,0,3,yes
2,41,9,1,1,no,1270,yes,no,2,5,may,1389,1,-1,0,3,yes
3,55,7,1,1,no,2476,yes,no,2,5,may,579,1,-1,0,3,yes
4,54,0,1,2,no,184,no,no,2,5,may,673,2,-1,0,3,yes


In [187]:
# 10. month column --> convert it to numeric with an explicit mapping
# Define a dictionary mapping each month to its numeric representation
month_to_number = {
    'jan': 1,
    'feb': 2,
    'mar': 3,
    'apr': 4,
    'may': 5,
    'jun': 6,
    'jul': 7,
    'aug': 8,
    'sep': 9,
    'oct': 10,
    'nov': 11,
    'dec': 12
}

# Map the 'month' column to its corresponding numeric representation
df['month'] = df['month'].map(month_to_number)

# I used this method in order to avoid creating additional columns
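
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes NaN without raising an error. A small sketch of a sanity check, using a made-up series `months` that deliberately contains one invalid label:

```python
import pandas as pd

month_to_number = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12,
}

# Hypothetical sample; "Sept" is deliberately not a key in the mapping
months = pd.Series(["may", "jun", "Sept", "dec"])
mapped = months.map(month_to_number)

# Values that were not found in the dictionary show up as NaN
unmapped = months[mapped.isna()]
print(unmapped.tolist())
```

Running the same `isna()` check on the real column after mapping would confirm that every month label was recognised.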

In [188]:
# checking whether everything worked in the correct way
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,0,1,1,no,2343,yes,no,2,5,5,1042,1,-1,0,3,yes
1,56,0,1,1,no,45,no,no,2,5,5,1467,1,-1,0,3,yes
2,41,9,1,1,no,1270,yes,no,2,5,5,1389,1,-1,0,3,yes
3,55,7,1,1,no,2476,yes,no,2,5,5,579,1,-1,0,3,yes
4,54,0,1,2,no,184,no,no,2,5,5,673,2,-1,0,3,yes


In [189]:
# All columns that need the same method can be done in one step
# 6. default column --> LabelEncoder
# 7. housing column --> LabelEncoder
# 8. loan column --> LabelEncoder
# 12. deposit column --> LabelEncoder

# this converts the yes/no values of each column to 1/0
# pd.factorize works too, one column at a time, but it numbers labels
# by order of appearance rather than alphabetically
from sklearn.preprocessing import LabelEncoder  # already imported above
variables = ['default', 'housing', 'loan', 'deposit']
encoder = LabelEncoder()
df[variables] = df[variables].apply(encoder.fit_transform)

In [190]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,0,1,1,0,2343,1,0,2,5,5,1042,1,-1,0,3,1
1,56,0,1,1,0,45,0,0,2,5,5,1467,1,-1,0,3,1
2,41,9,1,1,0,1270,1,0,2,5,5,1389,1,-1,0,3,1
3,55,7,1,1,0,2476,1,0,2,5,5,579,1,-1,0,3,1
4,54,0,1,2,0,184,0,0,2,5,5,673,2,-1,0,3,1
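
`LabelEncoder` sorts labels alphabetically, which is why "no" ends up as 0 and "yes" as 1. An explicit dictionary gives the same result and does not depend on which labels happen to be present in a column. The sketch below compares the two on a made-up frame `toy`; the column names match the notebook, but the values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two yes/no columns
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "no", "yes"],
})
binary_cols = ["housing", "loan"]

# Same approach as the cell above: refit the encoder on each column
via_encoder = toy[binary_cols].apply(LabelEncoder().fit_transform)

# Explicit alternative: the mapping is written out and cannot shift
yes_no = {"no": 0, "yes": 1}
via_map = toy[binary_cols].apply(lambda col: col.map(yes_no))

same = bool((via_encoder.to_numpy() == via_map.to_numpy()).all())
print(same)
```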


job
admin.: 0
blue-collar: 1
entrepreneur: 2
housemaid: 3
management: 4
retired: 5
self-employed: 6
services: 7
student: 8
technician: 9
unemployed: 10
unknown: 11


marital
divorced: 0
married: 1
single: 2

education
primary: 0
secondary: 1
tertiary: 2
unknown: 3

contact
cellular: 0
telephone: 1
unknown: 2

poutcome
failure: 0
other: 1
success: 2
unknown: 3

month
jan: 1
feb: 2
mar: 3
apr: 4
may: 5
jun: 6
jul: 7
aug: 8
sep: 9
oct: 10
nov: 11
dec: 12

default
yes: 1
no: 0

housing
yes: 1
no: 0

loan
yes: 1
no: 0

deposit
yes: 1
no: 0
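
The mappings above can be turned into a tester row, as mentioned in the encoding cells. The sketch below copies the codes from this summary into dictionaries and encodes one client. The client's values and the names `raw`, `encoders` and `tester_row` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Mappings copied from the summary above
job_map = {"admin.": 0, "blue-collar": 1, "entrepreneur": 2, "housemaid": 3,
           "management": 4, "retired": 5, "self-employed": 6, "services": 7,
           "student": 8, "technician": 9, "unemployed": 10, "unknown": 11}
marital_map = {"divorced": 0, "married": 1, "single": 2}
education_map = {"primary": 0, "secondary": 1, "tertiary": 2, "unknown": 3}
contact_map = {"cellular": 0, "telephone": 1, "unknown": 2}
poutcome_map = {"failure": 0, "other": 1, "success": 2, "unknown": 3}
month_map = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
             "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
yes_no = {"no": 0, "yes": 1}

encoders = {"job": job_map, "marital": marital_map, "education": education_map,
            "default": yes_no, "housing": yes_no, "loan": yes_no,
            "contact": contact_map, "month": month_map, "poutcome": poutcome_map}

# Hypothetical client; every value here is invented for illustration
raw = {"age": 35, "job": "technician", "marital": "single",
       "education": "tertiary", "default": "no", "balance": 1200,
       "housing": "yes", "loan": "no", "contact": "cellular", "day": 12,
       "month": "aug", "duration": 300, "campaign": 2, "pdays": -1,
       "previous": 0, "poutcome": "unknown"}

# Encode categorical fields, pass numeric fields through unchanged
encoded = {col: encoders[col][val] if col in encoders else val
           for col, val in raw.items()}
tester_row = pd.DataFrame([encoded])
print(tester_row)
```

The keys in `raw` follow the column order of the data frame (without `deposit`), so the row lines up with the feature columns.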