## Logistic Regression

### Modeling the Log Likelihood

In classification problems, the target variable is categorical, so it cannot be fitted directly with a linear model. To build a suitable predictive model, we first define how the computer represents the categories of the target, then use a different approach so that the model returns the probability of a category given the values of the predictors.

How can a computer understand a binary target?

The most direct way to represent a binary target is to assign it numerical values:

1 represents Yes, True, class 1, etc., and  
0 represents No, False, class 2, etc.  

A computer can read this 0/1 encoding directly; it is usually called a "Boolean" (or binary) encoding.

How can we ask the computer to calculate the probability of an outcome given the values of the predictors?

In ML, we usually use an "**Activation Function**" to map any input value into a fixed output range. One activation function used in logistic regression is the **Sigmoid** function.

$$y = \frac{1}{1+e^{-z}}$$

You may also see the sigmoid function written in the following form, obtained by multiplying the right-hand side by $\frac{e^z}{e^z}$:

$$y = \frac{e^z}{e^z+1}$$

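where $y$ is the predicted value and $z$ is the input to the sigmoid function. The inverse of the sigmoid is called the "**logit function**"; it maps $y$ back to $z$, which in logistic regression is a linear combination of the predictors:

$$z = \text{logit}(y) = \beta_0 + \beta_1(x_1) + ... + \beta_n(x_n)$$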

Why the Sigmoid function?

The sigmoid function traces an S-shaped curve, and its output always falls strictly between 0 and 1. The predicted value can therefore be interpreted as the probability of the target class given the predictor values.
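To make the 0-to-1 behavior concrete, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample inputs are illustrative choices, not part of any library; the two branches use the two algebraically equivalent forms of the function shown earlier, which helps avoid overflow for inputs of large magnitude.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        # Form 1 / (1 + e^{-z}): safe when z is non-negative.
        return 1.0 / (1.0 + math.exp(-z))
    # Form e^{z} / (e^{z} + 1): safe when z is negative.
    ez = math.exp(z)
    return ez / (ez + 1.0)

# Sample inputs chosen only for illustration.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z = {z:+.1f} -> y = {sigmoid(z):.4f}")
```

Large negative inputs give values near 0, large positive inputs give values near 1, and $z = 0$ gives exactly 0.5.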

So far, everything makes sense, but how do we model the sigmoid function, that is, the probability? Before modeling the probability, we transform the sigmoid function in a few small steps, solving for $z$:

$$y = \frac{1}{1+e^{-z}}$$

$$1 + e^{-z} = \frac{1}{y}$$

$$e^{-z} = \frac{1}{y} - 1$$

$$\ln(e^{-z}) = \ln\left(\frac{1}{y} - \frac{y}{y}\right)$$

$$-z = \ln\left(\frac{1 - y}{y}\right)$$

$$z = -\ln\left(\frac{1 - y}{y}\right)$$

$$z = \ln\left(\frac{y}{1-y}\right)$$

The final equation is the inverse of the sigmoid function (the logit). It shows that the input $z$ equals the log of the odds, where the odds are $\frac{y}{1-y}$, $y$ is the probability of success (yes, true, or class 1 in general), and $(1 - y)$ is the probability of failure (no, false, or class 2 in general). This quantity is commonly called the log-odds.
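As a quick numerical check of this derivation, the sketch below applies the sigmoid and then its inverse and confirms that the original input comes back. The helper names `sigmoid` and `logit` and the test values are illustrative assumptions.

```python
import math

def sigmoid(z: float) -> float:
    """y = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y: float) -> float:
    """z = ln(y / (1 - y)), defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))

# Round trip: z -> y -> z should return the starting value.
for z in (-2.0, 0.0, 1.5):
    y = sigmoid(z)
    print(f"z = {z:+.2f}, y = {y:.4f}, logit(y) = {logit(y):+.2f}")

# A probability of 0.8 corresponds to odds of 4 to 1, so its log-odds is ln(4).
print(logit(0.8), math.log(4))
```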

Putting everything together, we now have the equation

$$\ln\left(\frac{y}{1-y}\right) = z$$

$$\ln\left(\frac{y}{1-y}\right) = \beta_0 + \beta_1(x_1) + ... + \beta_n(x_n)$$

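Since the input $z$, a linear combination of the predictors, equals the **log-odds**, we can carry over some of the properties we have learned about linear regression to the logistic regression model. For example, each coefficient $\beta_j$ is the change in the log-odds for a one-unit increase in $x_j$, holding the other predictors fixed.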
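One consequence of the linear form is that a coefficient can be read the same way as in linear regression, only on the log-odds scale. The sketch below uses made-up coefficients (`beta_0`, `beta_1`, `beta_2` are hypothetical values, not fitted to any data) to show that raising $x_1$ by one unit adds $\beta_1$ to the log-odds, which multiplies the odds by $e^{\beta_1}$.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
beta_0, beta_1, beta_2 = -1.5, 0.8, -0.4

def log_odds(x1: float, x2: float) -> float:
    """Linear combination of the predictors: z = b0 + b1*x1 + b2*x2."""
    return beta_0 + beta_1 * x1 + beta_2 * x2

# Two observations that differ only by one unit of x1.
z_a = log_odds(2.0, 1.0)
z_b = log_odds(3.0, 1.0)

print("change in log-odds:", z_b - z_a)                # equals beta_1
print("ratio of the odds: ", math.exp(z_b) / math.exp(z_a))
print("exp(beta_1):       ", math.exp(beta_1))
```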

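More technically speaking, the coefficients are estimated by **maximizing the likelihood** of the observed targets given the predictors, typically by using the derivative (gradient) of the log-likelihood with respect to the coefficients. The following resource walks through the maximum likelihood process.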

Resource: https://web.stanford.edu/class/archive/cs/cs109/cs109.1178/lectureHandouts/220-logistic-regression.pdf
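To give a feel for what maximizing the likelihood means, here is a minimal sketch with one predictor and plain gradient ascent on the log-likelihood. The toy data, the learning rate, and the number of steps are all assumptions made for illustration; this is not how any particular library implements the fit, and real solvers use more refined optimizers.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum over observations of y*ln(p) + (1 - y)*ln(1 - p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Gradient ascent: move the coefficients uphill on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals (y - p) drive the gradient of the log-likelihood.
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        grad_b0 = sum(residuals) / n
        grad_b1 = sum(r * x for r, x in zip(residuals, xs)) / n
        b0 += learning_rate * grad_b0
        b1 += learning_rate * grad_b1
    return b0, b1

# Toy data, made up for this example (deliberately not perfectly separable).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(xs, ys)
print("log-likelihood at (0, 0):", log_likelihood(0.0, 0.0, xs, ys))
print("log-likelihood after fit:", log_likelihood(b0, b1, xs, ys))
```

The fitted coefficients give a higher log-likelihood than the starting point of all zeros, which is the sense in which the fit explains the observed targets better.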

#### Prediction of the Model

Keep in mind that the linear part of a fitted logistic regression model produces the log-odds, $\hat{z}$, so we always need to transform it back to the probability measure, $\hat{y}$, for prediction. The transformation is straightforward, and scikit-learn can handle it for you: `predict_proba` returns probabilities, while `decision_function` returns the log-odds.

$$\hat{y} = \frac{1}{1+e^{-\hat{z}}}$$

The threshold for the classification decision is not fixed; it depends on domain knowledge and on the classification metrics you care about. For example:

$\hat{y}$ > 50%, classify as category A, otherwise B  
$\hat{y}$ > 75%, classify as category A, otherwise B  
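The effect of the threshold can be sketched in a few lines. The helper `classify`, the category labels "A" and "B", and the sample log-odds values below are hypothetical and serve only to show how the same predictions are labeled differently under the two thresholds above.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def classify(z_hat: float, threshold: float = 0.5) -> str:
    """Convert a predicted log-odds into a probability, then apply the threshold."""
    y_hat = sigmoid(z_hat)
    return "A" if y_hat > threshold else "B"

# Hypothetical predicted log-odds for four observations.
z_hats = [-2.0, 0.3, 1.0, 2.5]

for threshold in (0.50, 0.75):
    labels = [classify(z, threshold) for z in z_hats]
    print(f"threshold = {threshold:.2f}: {labels}")
```

Raising the threshold from 50% to 75% moves borderline observations from category A to category B, trading fewer false positives for more false negatives.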