# Logistic Regression
Author: Geraldine Klarenberg

Based on the Google Machine Learning Crash Course

## Probabilities

Sometimes a problem requires a probability as output (as opposed to an estimate of the value of something). Probabilities are always represented as values between 0 and 1: 0 means a 0% chance of something happening, and 1 means a 100% chance of it happening.
You can use probabilities in this way, or apply them in calculations. For instance, the probability of a dog barking in the middle of the night is
```text
p(dog | night)
```
("the probability of a dog barking given that it is night")

If the probability is given as 0.05, we can calculate that an owner will be woken up by a barking dog approximately 18 times in a year:
```text
p(dog | night) * number of nights = number of times woken up
0.05 * 365 ~= 18
```
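The same arithmetic can be written as a short Python sketch. The variable names are chosen here for illustration only; the numbers are the ones from the example above.

```python
# Probability that the dog barks on any given night (value from the text).
p_bark_given_night = 0.05
nights_per_year = 365

# Expected number of nights per year on which the owner is woken up.
expected_wakeups = p_bark_given_night * nights_per_year

print(round(expected_wakeups))  # 18.25 rounds to 18
```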
Now, this approach brings some constraints to our calculations. For instance, we have to make sure all results are *always* between zero and one; negative values are not allowed.

Also, these types of calculations are typically applied to situations where the outcome (the label) is categorical, e.g. whether an email is "spam" or "not spam". Drawing a straight line through such data, as we learned in an earlier lesson, would not yield good results. In the plot below, 1 could represent "spam" and 0 "not spam", and the data points are the number of "spam" and "not spam" emails people have in their inboxes.

![An image showing a linear regression line through categorical data](../nb-images/binary-graph.png)

## Logistic regression to the rescue 

Instead of linear regression, we are better off using a so-called logistic regression model. This better represents the data, and always ensures values are between 0 and 1. A logistic regression model produces a sigmoid as a result. 

![Image showing a sigmoid plot](../nb-images/SigmoidFunction.png)

The function to get a plot like this is:

$$ y = \frac{1}{1 + e^{-z}} $$

In ML terms, if *z* is the output of the linear layer of a model that you're training, then *y* is a probability: a value between 0 and 1. So you use

$$y' = \frac{1}{1 + e^{-(z)}}$$

where
* *y'* is the output of the logistic regression model (the probability)
* *z* is the linear function that we discussed in previous lessons: $b + w_1x_1 + w_2x_2 + w_3x_3 + \ldots + w_nx_n$

Remember: the *w* values are the model's learned weights, *b* is the bias and *x* are the feature values. So we are basically still dealing with the same function as before!

**It's that easy!**

Using these ML labels, you can interpret the sigmoid plot as:

![Image showing the sigmoid with y and z labels](../nb-images/LogisticRegressionOutput.svg)
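As a minimal sketch, the sigmoid and the linear function *z* can be written in plain Python. The function names `sigmoid` and `predict_probability` are chosen here for illustration and are not part of the original lesson; the two-branch form of the sigmoid is an added assumption that avoids a floating-point overflow for very negative *z*.

```python
import math


def sigmoid(z):
    """Return 1 / (1 + e^(-z)), a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so that
    # math.exp never receives a large positive argument (avoids overflow).
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(bias, weights, features):
    """Compute z = b + w_1*x_1 + ... + w_n*x_n and pass it through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Large negative z gives values near 0, large positive z gives values near 1.
for z in (-5, -1, 0, 1, 5):
    print(z, round(sigmoid(z), 3))
```

Note that `sigmoid(0)` is exactly 0.5: a log-odds of zero means both labels are equally likely.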

### A few (math) notes 

**NB** In math terms this type of transformation is also referred to as using a "link function". To find more information online about this topic, search for "logit link function". Also note that a *logistic* function is different from a *logarithmic* function! Don't confuse the two.

**NB2** Sometimes *z* is referred to as the *log-odds*, because if you take the inverse of the sigmoid, i.e. you solve for *z*, you get

$$z = \ln\left(\frac{y}{1-y}\right)$$

In regular words: the natural log of the odds. And odds are defined as the probability of label 1 (e.g. "dog barks", *y*) divided by the probability of label 0 (e.g. "dog does not bark", *1-y*).

So let's say a dog barks 6 out of 10 times. The probability of the dog barking is 0.6, the probability of the dog not barking is 0.4, and the odds that the dog will bark are 0.6 / 0.4 = 1.5. **Yes**, odds can be larger than 1, but probabilities cannot!
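The relationship between probability, odds and log-odds can be checked with a few lines of Python. This is only an illustrative sketch using the barking-dog numbers; the variable names are made up for this example.

```python
import math

p_bark = 0.6                    # probability of label 1 ("dog barks")
odds = p_bark / (1 - p_bark)    # 0.6 / 0.4, approximately 1.5
log_odds = math.log(odds)       # natural log of the odds, i.e. z

# Passing the log-odds through the sigmoid recovers the probability.
recovered_p = 1 / (1 + math.exp(-log_odds))

print(round(odds, 3), round(recovered_p, 3))
```

This shows that the sigmoid and the log-odds are each other's inverse: going from probability to log-odds and back returns the original 0.6.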

## An example 

We have a logistic regression model with three features that learned the following bias and weights:
* *b* = 1
* *w1* = 2
* *w2* = -1
* *w3* = 5

We then have a situation with the following feature values:
* *x1* = 0
* *x2* = 10
* *x3* = 2

The *log-odds* (*z*) is then:
```text
(1) + (2)(0) + (-1)(10) + (5)(2) = 1
```
And the regression prediction will be:
$$y' = \frac{1}{1 + e^{-(1)}} \approx 0.731$$

Looking at the plot, this is where our result sits:

![Image of the sigmoid with the result (0.731) highlighted](../nb-images/LogisticRegressionOutput0.731.svg)
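The worked example can be reproduced in Python as a quick check. This is a sketch with illustrative variable names; the bias, weights and feature values are the ones listed above.

```python
import math

bias = 1
weights = [2, -1, 5]     # w1, w2, w3
features = [0, 10, 2]    # x1, x2, x3

# Log-odds: z = b + w1*x1 + w2*x2 + w3*x3
z = bias + sum(w * x for w, x in zip(weights, features))

# Prediction: pass z through the sigmoid.
y_pred = 1 / (1 + math.exp(-z))

print(z, round(y_pred, 3))  # 1 0.731
```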

## Loss 
Remember, the loss function for linear regression is squared loss. For logistic regression it is **Log Loss**:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

with
* $(x,y)\in D$ is the dataset (D) with many labeled (x-y paired) examples
* *y* is the label in the labeled example. Every value of *y* **must** be 0 or 1.
* *y'* is the predicted value given the set of features in x. The predicted value is between 0 and 1.

**NB** The Log Loss equation is the negative log of the likelihood function (assuming a Bernoulli distribution of *y*), so minimizing the loss function gives you a Maximum Likelihood Estimate (MLE).
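To make the Log Loss formula concrete, here is a minimal Python sketch that sums the loss over a handful of labeled examples. The function name `log_loss`, the example labels and predictions, and the small clipping constant `eps` (used to avoid taking the log of exactly 0) are all assumptions made for this illustration.

```python
import math


def log_loss(labels, predictions, eps=1e-15):
    """Sum of -y*log(y') - (1 - y)*log(1 - y') over all examples."""
    total = 0.0
    for y, y_pred in zip(labels, predictions):
        # Assumption: clip predictions away from 0 and 1 so log() is defined.
        y_pred = min(max(y_pred, eps), 1 - eps)
        total += -y * math.log(y_pred) - (1 - y) * math.log(1 - y_pred)
    return total


labels = [1, 0, 1]                   # every label must be 0 or 1
good_predictions = [0.9, 0.2, 0.6]   # mostly agree with the labels
bad_predictions = [0.1, 0.8, 0.4]    # mostly disagree with the labels

loss_good = log_loss(labels, good_predictions)
loss_bad = log_loss(labels, bad_predictions)

print(round(loss_good, 3), round(loss_bad, 3))
```

Predictions that confidently disagree with the labels are penalized much more heavily than predictions that are roughly right, which is why `loss_bad` comes out larger than `loss_good`.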

## Regularization
Because the sigmoid only approaches 0 and 1 asymptotically, regularization is very important in logistic regression. Without it, the model would keep trying to drive the loss to 0 on all examples in high dimensions, and never get there. This would drive the weights to +infinity or -infinity, so the model would completely overfit. So somehow, we need to specify when to stop (or dampen complexity). We do this with
* $L_{2}$ regularization; or
* Early stopping - limiting the training steps or learning rate
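The effect of both options can be illustrated with a small gradient-descent sketch. Everything here is an assumption for illustration: the tiny one-feature dataset (chosen so the two classes are perfectly separable), the `train` function, the learning rate and the `l2_strength` value. The $L_2$ penalty is taken to be `(l2_strength / 2) * w**2`, which adds `l2_strength * w` to the gradient of the weight.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def train(xs, ys, steps, learning_rate=0.1, l2_strength=0.0):
    """Fit y' = sigmoid(w*x + b) with plain gradient descent on Log Loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = l2_strength * w   # gradient of the L2 penalty term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y   # derivative of Log Loss w.r.t. z
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Perfectly separable toy data: negative x has label 0, positive x has label 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_short, _ = train(xs, ys, steps=200)                   # stopped early
w_long, _ = train(xs, ys, steps=2000)                   # trained much longer
w_reg_short, _ = train(xs, ys, steps=200, l2_strength=0.1)
w_reg_long, _ = train(xs, ys, steps=2000, l2_strength=0.1)

print("no regularization:", w_short, w_long)
print("with L2:          ", w_reg_short, w_reg_long)
```

Without regularization the weight keeps growing the longer training runs, so limiting the number of steps (early stopping) caps it. With the $L_2$ penalty the weight settles at a finite value regardless of how long training continues.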