### Data Dictionary

1) age (numeric).
2) job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services").
3) marital: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed).
4) education (categorical: "unknown","secondary","primary","tertiary").
5) default: has credit in default? (binary: "yes","no").
6) balance: average yearly balance, in euros (numeric).
7) housing: has housing loan? (binary: "yes","no").
8) loan: has personal loan? (binary: "yes","no").

Related to the last contact of the current campaign:

9) contact: contact communication type (categorical: "unknown","telephone","cellular").
10) day: last contact day of the month (numeric).
11) month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec").
12) duration: last contact duration, in seconds (numeric).

Other attributes: 

13) campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact).
14) pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted).
15) previous: number of contacts performed before this campaign and for this client (numeric).
16) poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success").

Output variable (desired target):

17) y: has the client subscribed to a term deposit? (binary: "yes","no").

Missing attribute values: none.
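
The `pdays` entry above uses -1 as a sentinel rather than a real number of days, so treating the column as purely numeric mixes two meanings. A minimal sketch of one way to separate them, assuming pandas; the column names `previously_contacted` and `pdays_clean` are hypothetical and do not appear in the notebook:

```python
import pandas as pd

# Tiny illustrative frame, not the bank dataset.
sample = pd.DataFrame({"pdays": [-1, 151, -1, 92]})

# Flag clients that were contacted in a previous campaign.
sample["previously_contacted"] = sample["pdays"] != -1

# Keep the day count only where it is meaningful; the sentinel becomes NaN.
sample["pdays_clean"] = sample["pdays"].where(sample["previously_contacted"])

print(sample)
```

Whether this split helps the model in this notebook was not tested; it only makes the sentinel explicit.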

### Import Libraries 

In [1]:
import pandas as pd 
import sys
sys.path.append('/home/emred/Desktop/Data Science Projects/Implementation Logistic Regression/src')
from data_preprocessing import DataPreprocessing
from logistic_regression import LogisticRegression
from IPython.display import display

### Load Data

In [2]:
df = pd.read_excel("/home/emred/Desktop/Data Science Projects/Implementation Logistic Regression/input/bank-full.xlsx")

In [3]:
df.head()

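| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |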


### Basic Data Checks

In [4]:
display(DataPreprocessing.check_missing(df=df))

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
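
The source of `DataPreprocessing.check_missing` is not shown in this notebook. A minimal sketch of an equivalent check, assuming it simply counts nulls per column, could look like the following. The helper names `count_missing` and `count_unknown` are hypothetical. The second helper is included because the data dictionary uses the label "unknown" for several categorical features, and a null count does not pick those up.

```python
import pandas as pd

def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of null values in each column."""
    return df.isnull().sum()

def count_unknown(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return how often the literal label "unknown" occurs in the given columns."""
    return pd.Series({col: int((df[col] == "unknown").sum()) for col in columns})

# Tiny illustrative frame, not the bank dataset.
sample = pd.DataFrame({
    "age": [58, None, 44],
    "job": ["management", "unknown", None],
})

missing = count_missing(sample)
unknown = count_unknown(sample, ["job"])
print(missing)
print(unknown)
```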

In [5]:
display(DataPreprocessing.df_info(df=df))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


None

In [6]:
display(DataPreprocessing.describe(df=df))

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 45211.0 | 40.93621 | 10.618762 | 18.0 | 33.0 | 39.0 | 48.0 | 95.0 |
| balance | 45211.0 | 1362.272058 | 3044.765829 | -8019.0 | 72.0 | 448.0 | 1428.0 | 102127.0 |
| day | 45211.0 | 15.806419 | 8.322476 | 1.0 | 8.0 | 16.0 | 21.0 | 31.0 |
| duration | 45211.0 | 258.16308 | 257.527812 | 0.0 | 103.0 | 180.0 | 319.0 | 4918.0 |
| campaign | 45211.0 | 2.763841 | 3.098021 | 1.0 | 1.0 | 2.0 | 3.0 | 63.0 |
| pdays | 45211.0 | 40.197828 | 100.128746 | -1.0 | -1.0 | -1.0 | -1.0 | 871.0 |
| previous | 45211.0 | 0.580323 | 2.303441 | 0.0 | 0.0 | 0.0 | 0.0 | 275.0 |


In [7]:
display(DataPreprocessing.control_discrete_object_feat(df=df))

there is no object type

 job object :

 blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64

 marital object :

 married     27214
single      12790
divorced     5207
Name: marital, dtype: int64

 education object :

 secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int64

 default object :

 no     44396
yes      815
Name: default, dtype: int64
there is no object type

 housing object :

 yes    25130
no     20081
Name: housing, dtype: int64

 loan object :

 no     37967
yes     7244
Name: loan, dtype: int64

 contact object :

 cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64
there is no object type

 month object :

 may    13766
jul     6895
aug     6247
jun     

None

### Label Encoding & Drop Categorical Columns

In [8]:
df_encoded = DataPreprocessing.encoding_object_feat(df=df)
display(df_encoded.head(3))

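| | age | job | marital | education | default | balance | housing | loan | contact | day | ... | job_encoded | marital_encoded | education_encoded | default_encoded | housing_encoded | loan_encoded | contact_encoded | month_encoded | poutcome_encoded | y_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | ... | 4 | 1 | 2 | 0 | 1 | 0 | 2 | 8 | 3 | 0 |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | ... | 9 | 2 | 1 | 0 | 1 | 0 | 2 | 8 | 3 | 0 |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | ... | 2 | 1 | 1 | 0 | 1 | 1 | 2 | 8 | 3 | 0 |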
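
The codes in the output above are consistent with numbering each column's labels in alphabetical order (for example "divorced" = 0, "married" = 1, "single" = 2). The source of `encoding_object_feat` is not shown, so the following is only a sketch of one way to get the same kind of result with pandas category codes; the helper name `label_encode` is hypothetical.

```python
import pandas as pd

def label_encode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Add a `<col>_encoded` integer column for each listed column.

    Categories are inferred from the data and sorted, so string labels
    are numbered in alphabetical order starting at 0.
    """
    out = df.copy()
    for col in columns:
        out[f"{col}_encoded"] = out[col].astype("category").cat.codes
    return out

# Tiny illustrative frame, not the bank dataset.
sample = pd.DataFrame({"marital": ["married", "single", "married", "divorced"]})
encoded = label_encode(sample, ["marital"])
print(encoded)
```

Label encoding assigns an arbitrary order to nominal features such as `job`; one-hot encoding is a common alternative when that order could mislead a linear model.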


In [9]:
df_encoded2 = DataPreprocessing.drop_categorical_cols(df=df_encoded)

### Multicollinearity Elimination

In [10]:
df_after_multicollinerity = DataPreprocessing.multicollinearity_elimination(df=df_encoded2)
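# Note: df_encoded still contains the original categorical columns, so this "before" count includes them.
print(f"Number of features before multicollinerity: {len(df_encoded.columns)}")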
print(f"Number of features after multicollinerity: {len(df_after_multicollinerity.columns)}")

Number of features before multicollinerity: 27
Number of features after multicollinerity: 16
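
How `multicollinearity_elimination` decides which columns to remove is not shown in this notebook. One common approach, sketched here purely as an assumption, is to drop one feature from every pair whose absolute Pearson correlation exceeds a threshold. The function name `drop_correlated` and the 0.9 threshold are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of each pair with |correlation| above threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Synthetic data: "a_copy" is a linear function of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
sample = pd.DataFrame({"a": a, "a_copy": a * 2.0 + 1.0, "b": rng.normal(size=200)})

reduced = drop_correlated(sample)
print(list(reduced.columns))
```

In a sketch like this the target column would usually be set aside before filtering, so that it cannot be dropped or influence which features are kept.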


### Train Test Split 

In [11]:
X = df_after_multicollinerity.drop('y_encoded', axis=1)
y = df_after_multicollinerity.y_encoded

In [12]:
X_train, X_test, y_train, y_test = LogisticRegression.train_test_split(X,y)
print("\n X_train shape: ", X_train.shape, "\n X_test shape: ", X_test.shape, "\n y_train shape: ", y_train.shape, "\n y_test shape: ", y_test.shape)


 X_train shape:  (36168, 15) 
 X_test shape:  (9043, 15) 
 y_train shape:  (36168,) 
 y_test shape:  (9043,)
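
The shapes above correspond to roughly an 80/20 split of the 45211 rows. The custom `LogisticRegression.train_test_split` is not shown, so the following is a minimal NumPy sketch of a shuffled hold-out split under that assumption; the function name `split_train_test`, the `test_size` default, and the `seed` parameter are hypothetical.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    # Round the test set size up.
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2
X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Rounding the test size up gives 9043 test rows for 45211 samples, which agrees with the shapes printed above, although the notebook's own rounding rule is not known.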


### Build Logistic Regression

In [13]:
clf = LogisticRegression()
clf.log_fit(X=X_train, y=y_train)
y_pred = clf.predict(X=X_test)
accuracy = clf.accuracy(y_test=y_test, y_pred=y_pred)

print("Accuracy of logistic regression is ", accuracy)

  return 1/(1+np.exp(-x))


Accuracy of logistic regression is  0.8178701758266063
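
The `return 1/(1+np.exp(-x))` line echoed in the output looks like the tail of a NumPy runtime warning raised inside the sigmoid, which typically happens when `np.exp` overflows on large inputs. The internals of the custom `LogisticRegression` class are not shown, so the following is only a sketch of batch gradient descent for logistic regression with a clipped sigmoid. The class name `SimpleLogisticRegression`, its default values, and the clipping bound of 500 are assumptions.

```python
import numpy as np

def sigmoid(z):
    # Clip the input so np.exp stays within the float64 range.
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

class SimpleLogisticRegression:
    """Hypothetical stand-in for the notebook's class, for illustration only."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for _ in range(self.n_iters):
            # Predicted probabilities for the current parameters.
            p = sigmoid(X @ self.weights + self.bias)
            error = p - y
            # Gradient of the mean log loss with respect to weights and bias.
            self.weights -= self.lr * (X.T @ error) / n_samples
            self.bias -= self.lr * error.mean()
        return self

    def predict(self, X):
        p = sigmoid(np.asarray(X, dtype=float) @ self.weights + self.bias)
        return (p >= 0.5).astype(int)

# Toy, well-separated data (not the bank dataset).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

model = SimpleLogisticRegression(lr=0.1, n_iters=500).fit(X_demo, y_demo)
acc = (model.predict(X_demo) == y_demo).mean()
print(f"training accuracy: {acc:.2f}")
```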


In [18]:
# You can get a better result by changing the logistic regression parameters (learning rate and number of iterations), as shown below.
clf2 = LogisticRegression(lr=0.001, n_iters=1000)
clf2.log_fit(X=X_train, y=y_train)
y_pred2 = clf2.predict(X=X_test)
accuracy2 = clf2.accuracy(y_test=y_test, y_pred=y_pred2)

print("Accuracy of logistic regression is ", accuracy2)


  return 1/(1+np.exp(-x))


Accuracy of logistic regression is  0.8269379630653544
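
The summary statistics earlier in the notebook show features on very different scales (for example `balance` ranges from -8019 to 102127, while `campaign` ranges from 1 to 63). Gradient descent on unscaled inputs is sensitive to the learning rate and can push very large values into the sigmoid. Standardizing the features is a common remedy; the sketch below assumes NumPy, and the helper name `standardize` is hypothetical. Whether it improves accuracy on this dataset was not tested here.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale features to zero mean and unit variance using training statistics only."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # Leave constant columns unscaled instead of dividing by zero.
    return (X_train - mean) / std, (X_test - mean) / std

# Illustrative rows only; the values are borrowed from the summary statistics
# (age and balance columns) and are not actual records.
X_tr_demo = np.array([[18.0, -8019.0], [39.0, 448.0], [95.0, 102127.0]])
X_te_demo = np.array([[40.0, 1362.0]])

X_tr_scaled, X_te_scaled = standardize(X_tr_demo, X_te_demo)
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

The test set is scaled with the training mean and standard deviation so that no information from the held-out rows leaks into preprocessing.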
