## Building A Logistic Regression in Python, Step by Step

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
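
To make "P(Y=1) as a function of X" concrete, here is a minimal sketch of the logistic (sigmoid) function applied to a linear combination of features. The names `intercept`, `weights` and `x`, and their values, are made up for illustration; they are not fitted to the banking data used below.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and a single observation (illustrative values only)
intercept = -1.0
weights = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

# The log odds are linear in the features; the sigmoid turns them into P(Y=1)
log_odds = intercept + weights @ x
p_y1 = sigmoid(log_odds)
print(f"log odds = {log_odds:.2f}, P(Y=1) = {p_y1:.3f}")
```

A log odds of 0 corresponds to a probability of exactly 0.5, so positive log odds favour class 1 and negative log odds favour class 0.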

#### Logistic Regression Assumptions
    • Binary logistic regression requires the dependent variable to be binary.
    • For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
    • Only the meaningful variables should be included.
    • The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
    • The independent variables are linearly related to the log odds.
    • Logistic regression requires quite large sample sizes.
Keeping the above assumptions in mind, let’s look at our dataset.
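
Of the assumptions listed above, multicollinearity is fairly easy to screen for. The sketch below flags pairs of numeric columns whose absolute correlation exceeds a threshold. It runs on a small synthetic frame (`toy`), and the 0.9 cut-off is an arbitrary choice for illustration, not a rule taken from this tutorial.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a scaled copy of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.9  # illustrative cut-off (assumption)
corr = toy.corr().abs()
cols = list(corr.columns)

# Visit each pair of columns once and keep the highly correlated ones
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

Pairwise correlation only catches two-variable relationships; it is a first screen rather than a complete multicollinearity test.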

#### Data
The dataset comes from the UCI Machine Learning repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y).
The dataset can be downloaded from here: https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv

In [8]:
# main imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# plot styling
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)

In [9]:
# read the main data and explore the head rows
main_data = pd.read_csv('https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv', header=0)
main_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,44,blue-collar,married,basic.4y,unknown,yes,no,cellular,aug,thu,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,0
1,53,technician,married,unknown,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.021,5195.8,0
2,28,management,single,university.degree,no,yes,no,cellular,jun,thu,...,3,6,2,success,-1.7,94.055,-39.8,0.729,4991.6,1
3,39,services,married,high.school,no,no,no,cellular,apr,fri,...,2,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099.1,0
4,55,retired,married,basic.4y,no,yes,no,cellular,aug,fri,...,1,3,1,success,-2.9,92.201,-31.4,0.869,5076.2,1


In [13]:
# look at the last rows of the data
main_data.tail()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
41183,59,retired,married,high.school,unknown,no,yes,telephone,jun,thu,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.866,5228.1,0
41184,31,housemaid,married,basic.4y,unknown,no,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,0
41185,42,admin.,single,university.degree,unknown,yes,yes,telephone,may,wed,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
41186,48,technician,married,professional.course,no,no,yes,telephone,oct,tue,...,2,999,0,nonexistent,-3.4,92.431,-26.9,0.742,5017.5,0
41187,25,student,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.859,5191.0,0


In [17]:
# explore the shape of the provided dataset
main_data.shape

# 41188 rows and 21 columns

(41188, 21)

In [19]:
# store all column names in a variable
data_cols = main_data.columns
data_cols

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')

In [10]:
# perform a check to see if there is any missing value
main_data.isnull().values.any()

False
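
Note that `isnull()` only detects real missing values (NaN). Several categorical columns in this dataset use the string "unknown" instead, as the `head()` output shows, and those are not counted. The sketch below, on a small made-up frame (`toy`, with values chosen for illustration), counts both kinds per column.

```python
import numpy as np
import pandas as pd

# Toy frame: one real NaN plus several "unknown" placeholders (made-up values)
toy = pd.DataFrame({
    "age": [44, 53, np.nan, 39],
    "education": ["basic.4y", "unknown", "university.degree", "high.school"],
    "default": ["unknown", "no", "no", "unknown"],
})

# Real missing values per column
nan_per_column = toy.isnull().sum()

# "unknown" placeholders per column; these are invisible to isnull()
unknown_per_column = toy.eq("unknown").sum()

print(nan_per_column)
print(unknown_per_column)
```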

In [12]:
# explore the types of the values in all columns
main_data.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp_var_rate      float64
cons_price_idx    float64
cons_conf_idx     float64
euribor3m         float64
nr_employed       float64
y                   int64
dtype: object
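
The dtypes split the columns into numeric and categorical groups, which is useful later when the categorical ones have to be encoded. A minimal sketch of pulling out the two groups with `select_dtypes`, shown on a small made-up frame (`toy`) rather than the full dataset:

```python
import pandas as pd

# Toy frame mixing numeric and text columns (made-up rows)
toy = pd.DataFrame({
    "age": [44, 53, 28],
    "job": ["blue-collar", "technician", "management"],
    "euribor3m": [4.963, 4.021, 0.729],
    "y": [0, 0, 1],
})

# Numeric columns (ints and floats) versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

The target `y` is stored as an integer, so it lands in the numeric group and has to be separated from the features by name.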

#### Input variables

    1. age (numeric)
    2. job: type of job (categorical: “admin.”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
    3. marital: marital status (categorical: “divorced”, “married”, “single”, “unknown”)
    4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
    5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)
    6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
    7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
    8. contact: contact communication type (categorical: “cellular”, “telephone”)
    9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
    10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
    11. duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the target (e.g., if duration=0 then y=0, i.e. “no”). The duration is not known before a call is made, and once the call has ended y is known as well. This input should therefore be included only for benchmarking purposes and dropped if the goal is a realistic predictive model.
    12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    13. pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
    14. previous: number of contacts performed before this campaign and for this client (numeric)
    15. poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
    16. emp_var_rate: employment variation rate (numeric)
    17. cons_price_idx: consumer price index (numeric)
    18. cons_conf_idx: consumer confidence index (numeric)
    19. euribor3m: euribor 3 month rate (numeric)
    20. nr_employed: number of employees (numeric)
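
Two of the notes above translate directly into preprocessing steps: `duration` should be left out of a realistic model, and `pdays` uses 999 as a sentinel rather than a real day count. The sketch below shows one possible way to handle both on a small made-up frame (`toy`). The indicator column `previously_contacted` is a hypothetical name introduced here, not part of the dataset.

```python
import pandas as pd

# Toy frame with a few of the columns described above (made-up rows)
toy = pd.DataFrame({
    "duration": [210, 138, 339],
    "pdays": [999, 999, 6],
    "previous": [0, 0, 2],
    "y": [0, 0, 1],
})

# Separate the target, and drop duration because it is unknown before the call
target = toy["y"]
features = toy.drop(columns=["duration", "y"])

# 999 is a sentinel for "never contacted"; expose that as an explicit 0/1 flag
# (previously_contacted is a hypothetical helper column)
features["previously_contacted"] = (features["pdays"] != 999).astype(int)

print(features)
```

Whether to also replace the 999 values in `pdays` itself is a modelling choice that this sketch leaves open.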

#### Predict variable (desired target):

y: has the client subscribed to a term deposit? (binary: “1” means “Yes”, “0” means “No”)
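
With a binary target it is worth checking how the two classes are distributed before modelling. A minimal sketch using `value_counts` on a made-up target series; the values are for illustration only and do not reflect the real proportions in the banking data:

```python
import pandas as pd

# Made-up binary target (illustrative values, not the real data)
y = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="y")

# Absolute counts and relative shares of each class
counts = y.value_counts()
shares = y.value_counts(normalize=True)

print(counts)
print(shares)
```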


In [20]:
# The education column has many categories; we reduce them to make modelling easier.
# The education column has the following categories:

main_data['education'].unique()

array(['basic.4y', 'unknown', 'university.degree', 'high.school',
       'basic.9y', 'professional.course', 'basic.6y', 'illiterate'],
      dtype=object)

In [22]:
# group “basic.4y”, “basic.6y” and “basic.9y” together and call them “Basic”

main_data['education']=np.where(main_data['education'] =='basic.9y', 'Basic', main_data['education'])
main_data['education']=np.where(main_data['education'] =='basic.6y', 'Basic', main_data['education'])
main_data['education']=np.where(main_data['education'] =='basic.4y', 'Basic', main_data['education'])


In [23]:
#after grouping:
main_data['education'].unique()

array(['Basic', 'unknown', 'university.degree', 'high.school',
       'professional.course', 'illiterate'], dtype=object)
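
The three `np.where` calls above can also be written as a single step. The sketch below shows one alternative using `Series.replace`, applied to a standalone series holding the original category labels rather than to `main_data` itself:

```python
import pandas as pd

# The original education labels, in the order unique() reported them
education = pd.Series([
    "basic.4y", "unknown", "university.degree", "high.school",
    "basic.9y", "professional.course", "basic.6y", "illiterate",
])

# Map all three basic levels to the single label "Basic" in one call
basic_levels = ["basic.4y", "basic.6y", "basic.9y"]
grouped = education.replace(basic_levels, "Basic")

print(grouped.unique())
```

On the real data the same call would be applied to `main_data['education']` and assigned back to that column.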