*Day 5*
# **Logistic Regression**
---

Logistic Regression is a classification algorithm. It predicts a binary outcome (1 / 0, Yes / No, True / False) from a set of independent variables. Binary or categorical outcomes are encoded as dummy variables (for example, 1 for Yes and 0 for No).

### Sigmoid Curve
* The sigmoid curve maps any real-valued input to a value between 0 and 1, so its output can be read as the probability that the event occurs.
* It is an S-shaped curve.

Sigmoid curve image:

![Sigmoid Curve](https://miro.medium.com/max/700/1*JHWL_71qml0kP_Imyx4zBg.png)
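
To make the curve concrete, here is a minimal sketch of the sigmoid function using only the standard library. The helper name `sigmoid` and the sample inputs are chosen for illustration and are not part of the original notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 gives exactly 0.5.
for z in (-6, -2, 0, 2, 6):
    print(f"z = {z:>2}: sigmoid(z) = {sigmoid(z):.4f}")
```

The printed values trace out the S shape: they rise slowly at both ends and fastest around z = 0.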

### Example: Loan repayment comparison by gender

Features (x) : Gender (Male/Female)

Target (y) : Loan Repaid (Yes/No)

***Logistic Regression Equation:***
* The logistic regression equation can be derived from the linear regression equation.
* The linear regression equation has the form: y = b0 + b1*x
* Passing the linear predictor through the sigmoid function gives the logistic regression equation: p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
* Here, p is the probability that the event occurs.
* Solving for the linear predictor gives: ln(p/(1-p)) = b0 + b1*x
* Here, p/(1-p) is the odds of the event, and ln(p/(1-p)) is called the logit function (the log-odds).
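
The two forms of the equation are inverses of each other, which can be checked numerically. The sketch below assumes made-up coefficients `b0` and `b1` and a made-up input `x`; the helper names `logistic` and `logit` are also just for illustration.

```python
import math

def logistic(z: float) -> float:
    """p = e^z / (1 + e^z)"""
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p: float) -> float:
    """ln(p / (1 - p)), the log-odds of probability p"""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and input, chosen only for illustration
b0, b1 = -1.0, 0.5
x = 3.0

linear_predictor = b0 + b1 * x      # b0 + b1*x
p = logistic(linear_predictor)      # probability of the event
recovered = logit(p)                # should equal b0 + b1*x again

print(f"b0 + b1*x = {linear_predictor}")
print(f"p         = {p:.4f}")
print(f"logit(p)  = {recovered:.4f}")
```

Applying `logit` to the probability returns the linear predictor, which is why the coefficients are interpreted on the log-odds scale.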


In [2]:
# logreg.predict_proba returns the predicted probability of each class
# logreg.predict returns the predicted class label
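
A small sketch of the difference between the two methods, fitted on synthetic data that is made up for illustration (it is not the bank loan data). The names `X_demo`, `y_demo`, and `demo_model` are assumptions introduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data, generated only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# predict_proba: one row per sample, one column per class (order follows classes_)
proba = demo_model.predict_proba(X_demo[:5])
# predict: the class label with the highest predicted probability
labels = demo_model.predict(X_demo[:5])

print("classes:", demo_model.classes_)
print("probabilities:\n", proba)
print("labels:", labels)
```

Each row of `proba` sums to 1, and `predict` returns the class whose column holds the larger probability.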

### Odds Ratio (OR)

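- The odds of an event are the ratio of the probability of success to the probability of failure: odds = p / (1 - p).
- The odds ratio is the ratio of the odds in one group to the odds in another group.

Example:
Odds Ratio = Odds(Male) / Odds(Female) = 3, meaning the odds of defaulting on a loan are 3 times higher for men than for women.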
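
An odds ratio of 3 does not mean the probability is 3 times higher. The sketch below uses made-up default probabilities for the two groups (0.75 and 0.50, assumptions chosen so that the odds ratio comes out to 3) to show the difference.

```python
def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical default probabilities, chosen only for illustration
p_male = 0.75
p_female = 0.50

odds_male = odds(p_male)        # 0.75 / 0.25
odds_female = odds(p_female)    # 0.50 / 0.50
odds_ratio = odds_male / odds_female
probability_ratio = p_male / p_female

print(f"Odds (male):       {odds_male}")
print(f"Odds (female):     {odds_female}")
print(f"Odds ratio:        {odds_ratio}")
print(f"Probability ratio: {probability_ratio}")
```

Here the odds are 3 times higher for the first group, while the probability is only 1.5 times higher.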

### Binary Logistic Regression

- In binary logistic regression, the target variable can have only two possible outcomes.
- Example: Spam or Not, Pass or Fail, Default or Not Default, etc.


Example: Loan repayment

## Try Logistic Regression 

In [3]:
# Import libraries
# Standard EDA libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats as ss

In [4]:
# logreg.coef_ holds the fitted coefficients (available after calling fit)
# ML libraries

#ML Models
from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.svm import SVC
#ML TrainTest Split
from sklearn.model_selection import train_test_split
#ML Report
from sklearn.metrics import  accuracy_score,confusion_matrix,classification_report,roc_auc_score,roc_curve

In [5]:
#Read data from csv file
bank = pd.read_csv('/Users/Dwika/My Projects/DATASETS/bankloan.csv')

In [6]:
bank.head()

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1
1,27,1,10,6,31,17.3,1.362202,4.000798,0
2,40,1,15,14,55,5.5,0.856075,2.168925,0
3,41,1,15,14,120,2.9,2.65872,0.82128,0
4,24,2,2,0,28,17.3,1.787436,3.056564,1


In [7]:
#Determine Variables
y = bank['default']
x = bank[['age', 'creddebt', 'othdebt' ]]

In [8]:
y.head()

0    1
1    0
2    0
3    0
4    1
Name: default, dtype: int64

In [9]:
x.head()

Unnamed: 0,age,creddebt,othdebt
0,41,11.359392,5.008608
1,27,1.362202,4.000798
2,40,0.856075,2.168925
3,41,2.65872,0.82128
4,24,1.787436,3.056564


In [10]:
#Logistic Regression

logreg = LogisticRegression()
logreg.fit(x,y)
print(f'LogReg Coefficients (B1, B2, B3): {logreg.coef_}')
print(f'LogReg Intercept (B0): {logreg.intercept_}')

LogReg Coefficients (B1, B2, B3): [[-0.07965935  0.33845357  0.0337601 ]]
LogReg Intercept (B0): [1.00619047]
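
The model above is fit on every row, so it has no held-out data to be checked against. Since `train_test_split` and the metric functions are already imported, here is a minimal sketch of a hold-out evaluation. It runs on synthetic data made up for illustration, not on `bankloan.csv`, and the names `X_syn`, `y_syn`, and `clf` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Synthetic data: 3 features, binary target drawn from a known logistic model
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(500, 3))
true_logit = 0.5 - 1.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1]
y_syn = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# Hold out 20% of the rows for testing, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # class labels
y_score = clf.predict_proba(X_test)[:, 1]    # probability of class 1

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(f"Accuracy: {acc:.3f}")
print(f"ROC AUC:  {auc:.3f}")
print("Confusion matrix:\n", cm)
```

The same pattern would apply to `x` and `y` from the bank data: split first, fit on the training part, and report metrics on the test part only.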


In [11]:
# Logistic regression equation in odds form

# Odds = exp(B0 + B1*x1 + B2*x2 + B3*x3 + ... + Bn*xn)



Suppose we look only at Age and compare two values:
c = 35
d = 20

We want to compare the odds that a person defaults on a loan at these two ages.

In [12]:
# Odds ratio for age 35 vs. age 20: exp(B_age * (c - d)), with c - d = 15
OR_age = np.exp(15*logreg.coef_[0][0])



In [13]:
# Odds of default at a given age (the other features are left at 0)
def ODD_age(cd):
    return np.exp(logreg.intercept_ + (cd*logreg.coef_[0][0]))[0]

ODD_age(35), ODD_age(20)

(0.1683202531134927, 0.5559946709099294)

In [18]:
# OR_age is a scalar, so it must not be indexed
print(f'OR: {OR_age:.2f}')
print(f'{ODD_age(20)/ODD_age(35):.2f} times the odds of default')

OR: 0.30
3.30 times the odds of default

OR < 1, c > d:
- The odds of default decrease as age increases.
- Observations with Age = 20 have about 3.30 times the odds of default of observations with Age = 35 (1/OR).
- Here OR = exp(-0.0797 * (35 - 20)) ≈ 0.303.

If OR < 1 and c > d:
- The odds of the success event decrease as Xi increases.
- Observations with Xi = d have 1/OR times the odds of the success event compared with observations with Xi = c, where OR = exp(Bi(c-d)).
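
The rule OR = exp(Bi(c-d)) can be checked with a few lines of arithmetic. This sketch hard-codes the age coefficient printed by the fitted model earlier (-0.07965935); the helper name `odds_ratio` is hypothetical.

```python
import math

def odds_ratio(beta: float, c: float, d: float) -> float:
    """Odds ratio between Xi = c and Xi = d for a coefficient beta."""
    return math.exp(beta * (c - d))

beta_age = -0.07965935    # age coefficient copied from the logreg.coef_ output

or_age = odds_ratio(beta_age, c=35, d=20)
inverse_or = 1.0 / or_age

print(f"OR (age 35 vs. 20):   {or_age:.2f}")      # about 0.30
print(f"1/OR (age 20 vs. 35): {inverse_or:.2f}")  # about 3.30
```

Because the coefficient is negative, OR is below 1 and its inverse gives the multiplier for the lower value of Xi.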

### Predict Loan Repayment by creddebt

In [None]:
logreg.intercept_ + (logreg.coef_[0][1])

array([6.08299401])

In [None]:
# Odds ratio for creddebt = 1.5 vs. creddebt = 1.0 (c - d = 0.5)

OR_creddebt = np.exp(0.5*logreg.coef_[0][1])
OR_creddebt

1.1843887098691461

If OR > 1 and c > d:
- The odds of the success event increase as Xi increases.
- Observations with Xi = c have OR times the odds of the success event compared with observations with Xi = d, where OR = exp(Bi(c-d)).


In [None]:
x.describe()

Unnamed: 0,age,creddebt,othdebt
count,700.0,700.0,700.0
mean,34.86,1.553553,3.058209
std,7.997342,2.117197,3.287555
min,20.0,0.011696,0.045584
25%,29.0,0.369059,1.044178
50%,34.0,0.854869,1.987567
75%,40.0,1.901955,3.923065
max,56.0,20.56131,27.0336


In [None]:
#Range of Age

x['age'].min(), x['age'].max()

(20, 56)

In [None]:
x[x['age'].isin([20,35])]

Unnamed: 0,age,creddebt,othdebt
46,35,0.43155,1.14345
49,35,0.205128,2.902872
56,35,0.3978,1.3022
83,35,1.84382,1.34618
164,35,0.591,2.409
213,35,0.581418,1.416582
233,35,0.103488,0.820512
248,35,1.2136,6.1864
249,35,4.874716,8.159284
252,35,1.418445,0.576555


In [None]:
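# Probability of default at age 20 and at age 35 (other features left at 0)

# P(Y=1 | age=20) = 1 / (1 + exp(-(1.006 - 0.0797*20)))
# P(Y=1 | age=35) = 1 / (1 + exp(-(1.006 - 0.0797*35)))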

In [None]:
# Logistic regression equation
# P(Y=1) = 1 / (1 + exp(-(B0 + B1x1 + B2x2 + B3x3 + ... + Bnxn)))
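
As a worked instance of the equation, the sketch below plugs in the intercept and age coefficient printed by the fitted model earlier. It keeps the same simplification as `ODD_age`: `creddebt` and `othdebt` are left at 0, so the numbers illustrate the formula and are not realistic predictions. The helper name `prob_default` is hypothetical.

```python
import math

b0 = 1.00619047        # logreg.intercept_ copied from the earlier output
b_age = -0.07965935    # age coefficient copied from the earlier output

def prob_default(age: float) -> float:
    """P(Y=1) for a given age, with the other features left at 0 (a simplification)."""
    z = b0 + b_age * age
    return 1.0 / (1.0 + math.exp(-z))

p20 = prob_default(20)
p35 = prob_default(35)

# Converting each probability back to odds should reproduce ODD_age(20) and ODD_age(35)
odds20 = p20 / (1.0 - p20)
odds35 = p35 / (1.0 - p35)

print(f"P(default | age=20) = {p20:.4f}, odds = {odds20:.4f}")
print(f"P(default | age=35) = {p35:.4f}, odds = {odds35:.4f}")
```

The odds ratio between the two ages is about 3.30, while the ratio of the two probabilities is smaller, which is why odds and probabilities should not be used interchangeably.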


In [15]:
import matplotlib.pyplot as plt