# Logistic Regression

Logistic regression is a classification algorithm: it predicts a category rather than a numeric value. For example, we can use logistic regression to predict **whether a transaction is fraudulent or not** (we use this as the running example in the derivations below).

It's interesting to note, though, that linear regression can also be applied when $Y$ has just 2 possible values, say 1 (fraud) and 0 (not fraud). Using linear regression, we find the coefficients $\hat{\beta}$, and $\hat{y}$ is then read as the probability of fraud given the data. In other words: 

$$\hat{y} = P(\text{fraud}|X)$$

But this way our predictions can fall outside the [0,1] range, which makes them hard to interpret as probabilities. However, a similar approach is used in LDA. 
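The out-of-range problem can be seen on a tiny example. The sketch below fits a one-feature linear regression in closed form; the data points are made up purely for illustration and are not from the fraud dataset.

```python
# Toy data (hypothetical): one feature and a binary label.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least squares for a single feature.
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean


def linear_prediction(value):
    """Fitted value of the linear model, read as a 'probability'."""
    return intercept + slope * value


print(linear_prediction(0.0))  # below 0
print(linear_prediction(8.0))  # above 1
```

Both printed values lie outside [0,1], so they cannot be read as probabilities.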

Logistic regression estimates the probability that Y belongs to some category (in our example - to 1 or 0): 

$$p(\text{fraud}=1|X)$$

Then we make a prediction based on a threshold, typically 0.5: if $p(\text{fraud}=1|X) > 0.5$, predict fraud, otherwise predict not fraud. If we want to be extra cautious, we can use a lower threshold, for example 0.3. 
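A minimal sketch of the thresholding step. The function name `classify` and the probability value 0.4 are hypothetical, chosen only to show how a lower threshold changes the decision.

```python
def classify(probability, threshold=0.5):
    """Return 1 (fraud) if the probability exceeds the threshold, else 0."""
    return 1 if probability > threshold else 0


p_fraud = 0.4  # hypothetical model output for one transaction

print(classify(p_fraud))                 # default threshold 0.5
print(classify(p_fraud, threshold=0.3))  # more cautious threshold
```

With the default threshold this transaction is labeled not fraud; with the threshold at 0.3 it is flagged as fraud.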

Probability is always in the range [0,1], so we need a function that returns values in that interval. For logistic regression it's the logistic function (aka sigmoid), which looks as follows: 

$$p(X) = \frac{e^{Xb}}{1+e^{Xb}}$$
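As a sketch, the logistic function can be written in a few lines. Here `z` stands for the linear combination $Xb$ of a single observation; splitting on the sign of `z` is an implementation choice (not part of the formula) that avoids overflow in `math.exp` for large inputs.

```python
import math


def sigmoid(z):
    """Logistic function e^z / (1 + e^z), computed without overflow."""
    if z >= 0:
        # Equivalent form 1 / (1 + e^(-z)); e^(-z) cannot overflow here.
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so e is in (0, 1)
    return e / (1.0 + e)


for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

Whatever the value of `z`, the result stays inside [0,1], and `z = 0` maps to 0.5.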

The model is fit with the **method of maximum likelihood** (covered below). First, writing $p$ for $p(X)$ and multiplying both sides by $1+e^{Xb}$, we can rewrite the logistic function above as follows: 

$$e^{Xb} = p + p e^{Xb}$$ 

$$e^{Xb} - p e^{Xb} = p$$

$$e^{Xb} (1-p) = p$$

$$ \frac{p}{1-p} = e^{Xb}$$

Here $\frac{p}{1-p}$ is the odds of fraud. 

We can take the logarithm of both sides: 

$$ \log\left(\frac{p}{1-p}\right) = Xb$$

The logarithm of the odds, $\log\left(\frac{p}{1-p}\right)$, is called the **LOGIT** (aka log-odds). The logit is linear in $X$. 
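The relationship between probability, odds, and logit can be checked numerically. In this sketch the helper names (`odds`, `logit`, `inverse_logit`) and the probability 0.8 are illustrative choices.

```python
import math


def odds(p):
    """Odds p / (1 - p) for a probability p in (0, 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds of p."""
    return math.log(odds(p))


def inverse_logit(z):
    """Logistic function: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


p = 0.8
print(odds(p))                  # about 4: fraud is 4 times as likely as not
print(logit(p))                 # the log-odds
print(inverse_logit(logit(p)))  # recovers p up to rounding
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0.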

### Interpretation of the coefficients

In linear regression, a one unit increase in $x_i$ means an increase of $\beta_i$ in the predicted value. In logistic regression, a one unit increase in $x_i$ means an increase of $\beta_i$ in the log-odds, or equivalently the odds are multiplied by $e^{\beta_i}$. 
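A short derivation of the effect on the odds, using $\frac{p}{1-p} = e^{Xb}$ from above. The expanded notation is an assumption for illustration: $b_0$ is the intercept, $b_i$ is the coefficient of $x_i$ (written $\beta_i$ in the paragraph above), and all other variables are held fixed.

$$\frac{\text{odds}(x_i + 1)}{\text{odds}(x_i)} = \frac{e^{b_0 + \dots + b_i (x_i + 1) + \dots}}{e^{b_0 + \dots + b_i x_i + \dots}} = e^{b_i}$$

The ratio does not depend on the starting value of $x_i$: every one unit step multiplies the odds by the same factor $e^{b_i}$.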

### Maximum Likelihood 

Idea - we want to find the coefficients $\hat{b}$ such that the predicted probability is as close as possible to the actual value (0 or 1): close to 1 for fraudulent transactions and close to 0 for non-fraudulent transactions.  

The likelihood function looks as follows:  

$$l(b)= \prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} (1- p(x_{i'}))$$

The goal is to maximize this function with respect to $b$ **OR**, equivalently, to minimize its negative, $-l(b) = -\prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} (1- p(x_{i'}))$, with respect to $b$. 
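In practice the product is usually replaced by its logarithm, which turns it into a sum and has the same maximizer because the logarithm is increasing. The sketch below evaluates the negative log-likelihood for a one-feature model; the data and both coefficient pairs are hypothetical and only show that a better-fitting pair gives a lower value.

```python
import math


def predicted_probability(b0, b1, x):
    """Logistic model p(x) with intercept b0 and slope b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def negative_log_likelihood(b0, b1, xs, ys):
    """Minus the log of l(b): log p(x_i) for y=1, log(1 - p(x_i)) for y=0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -total


# Hypothetical toy data: larger x tends to go with y = 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]

nll_null = negative_log_likelihood(0.0, 0.0, xs, ys)     # every p(x) = 0.5
nll_better = negative_log_likelihood(-3.5, 2.0, xs, ys)  # hand-picked guess

print(nll_null, nll_better)
```

An optimizer (such as the one inside `sm.Logit(...).fit()`) searches for the coefficients that make this quantity as small as possible.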

### Null-hypothesis in logistic regression 

In logistic regression the null hypothesis is that a coefficient is equal to 0, i.e. the corresponding variable doesn't influence the outcome. A small p-value tells us that we can reject the null hypothesis and conclude that the corresponding $b_i$ is not zero. 
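As a rough sketch of where such a p-value comes from: the `z` column in the summary of the example below is the coefficient divided by its standard error, and `P>|z|` is the two-sided tail probability of a standard normal. The code assumes exactly that reading and uses the rounded `weekday` values from that summary, so the results agree with the table only approximately.

```python
import math


def wald_test(coef, std_err):
    """Return (z statistic, two-sided p-value) under a standard normal."""
    z = coef / std_err
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# coef and std err of the `weekday` row in the summary below
z, p_value = wald_test(2.5465, 0.904)
print(round(z, 3), round(p_value, 3))
```

The p-value is well below 0.05, so for `weekday` the null hypothesis of a zero coefficient is rejected at that level.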

### Example: 

In [13]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('data/fraud_dataset.csv')
df['intercept'] = 1
df['weekday'] = (df.day=='weekday').astype(int)
lm = sm.Logit(df.fraud, df[['intercept','weekday','duration']])
result = lm.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.002411
         Iterations 16


| | | | |
|---|---|---|---|
| Dep. Variable: | fraud | No. Observations: | 8793.0 |
| Model: | Logit | Df Residuals: | 8790.0 |
| Method: | MLE | Df Model: | 2.0 |
| Date: | Thu, 20 Dec 2018 | Pseudo R-squ.: | 0.9633 |
| Time: | 21:55:39 | Log-Likelihood: | -21.2 |
| converged: | True | LL-Null: | -578.1 |
| | | LLR p-value: | 1.39e-242 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 9.8709 | 1.944 | 5.078 | 0.000 | 6.061 | 13.681 |
| weekday | 2.5465 | 0.904 | 2.816 | 0.005 | 0.774 | 4.319 |
| duration | -1.4637 | 0.290 | -5.039 | 0.000 | -2.033 | -0.894 |


### Interpretation: 

**For quantitative variables:**  
For every one unit increase in $x_1$, we expect a multiplicative change in the odds of a 1 by a factor of $e^{b_1}$, holding all other variables constant. 

**For categorical variables:**  
When in category $x_1$, we expect a multiplicative change in the odds of a 1 by a factor of $e^{b_1}$ compared to the baseline.

**Exponentiated coefficient < 1** (i.e. $b_1 < 0$):  
It's often convenient to calculate the reciprocal, $\frac{1}{e^{b_1}}$, and reverse the statement: the odds decrease by a factor of the obtained value.
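Applied to the fitted model above, the interpretation can be sketched as follows. The coefficient values are copied from the `coef` column of the summary; the dictionary and variable names are illustrative.

```python
import math

# Coefficients from the summary table above.
coefficients = {"weekday": 2.5465, "duration": -1.4637}

# Exponentiate to get the multiplicative change in the odds of fraud.
odds_multipliers = {name: math.exp(b) for name, b in coefficients.items()}

for name, multiplier in odds_multipliers.items():
    if multiplier >= 1:
        print(f"{name}: odds multiplied by {multiplier:.2f}")
    else:
        # Report the reciprocal so the statement reads as a decrease.
        print(f"{name}: odds divided by {1 / multiplier:.2f}")
```

So on weekdays the odds of fraud are roughly 12.8 times the odds on the baseline category, and each additional unit of `duration` divides the odds by about 4.3, holding the other variable constant.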


## Model Evaluation
**Accuracy:** $\frac{\text{number of correct labels}}{\text{number of rows}}$  
BUT accuracy is misleading when the classes are very unequal in size: a model that always predicts the majority class still scores high.
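A small illustration of the problem, with made-up numbers: 1000 transactions of which only 10 are fraud, and a "model" that labels everything as not fraud.

```python
# Hypothetical imbalanced labels: 990 legitimate (0) and 10 fraud (1).
y_true = [0] * 990 + [1] * 10

# A trivial classifier that never predicts fraud.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)       # 0.99
print(frauds_caught)  # 0
```

The accuracy is 99% even though not a single fraudulent transaction is detected.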

### Precision and Recall 

**Recall**: $$\frac{\text{True Positive}}{\text{True Positive + False Negative}}$$

Out of all actual positives, what fraction were recognized (recalled) as positive? 

**Precision**: $$\frac{\text{True Positive}}{\text{True Positive + False Positive}}$$ 

Out of all predicted positives, what fraction are actually positive? 
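The two definitions can be sketched by counting directly from label lists. The labels below are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than part of the definitions.

```python
def precision_recall(y_true, y_pred):
    """Count TP, FP, FN for the positive class (1) and return both metrics."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: a metric with an empty denominator is 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here 2 of the 3 predicted positives are correct (precision 2/3) and 2 of the 4 actual positives are found (recall 1/2).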

### Example

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix

In [30]:
df = pd.read_csv('data/admissions.csv')
df_new = df.join(pd.get_dummies(df.prestige))
X = df_new[['gre','gpa',2,3,4]]
y = df_new.admit
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = 0)
log_mod = LogisticRegression(solver='lbfgs',max_iter=150)
log_mod.fit(X_train, y_train)
y_pred = log_mod.predict(X_test)

In [31]:
confusion_matrix(y_test, y_pred)

array([[56,  0],
       [20,  4]])

For binary classification, the rows and columns of the confusion matrix are ordered 0, 1. The rows are the actual values and the columns are the predicted values.
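Reading the matrix above with that layout, the metrics from this section can be computed by hand. This is a sketch that hard-codes the printed matrix rather than calling `accuracy_score`, `precision_score`, or `recall_score`.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
matrix = [[56, 0],
          [20, 4]]

(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tn + fp + fn + tp)  # 60 / 80
precision = tp / (tp + fp)                  # 4 / 4
recall = tp / (tp + fn)                     # 4 / 24

print(accuracy, precision, recall)
```

Every predicted admission is correct (precision 1.0), but the model finds only 4 of the 24 actual admissions (recall about 0.17), which the accuracy of 0.75 alone would hide.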

For further study and note-taking: <a href="https://community.alteryx.com/t5/Data-Science-Blog/ROC-Curves-in-Python-and-R/ba-p/138430">ROC</a>