# DSCI 6003 2.3 Lecture: Logistic Regression


### What you should know by the end of this lecture:

1. What a logistic/sigmoid function is
2. What logistic regression is
3. How to estimate logistic regression coefficients
4. How to interpret logistic regression coefficients
5. The distinction between generative and discriminative models


## Binary Classification

In a binary classification problem, we try to predict a binary outcome and we assign labels of 0 and 1 to our data.

Here are some example questions:

* Does a patient have cancer?
* Will a team win the next game?
* Will the customer buy my product?
* Will I get the loan?


## Logistic Regression Motivation
Let's start by looking at an example using some NFL data. The x-axis is the number of touchdowns scored by a team over a season, and the y-axis is whether the team lost or won the game, indicated by a value of 0 or 1 respectively.

![NFL data](images/nfl.png)

So, how do we predict whether we have a win or a loss if we are given a score? Note that we are going to be predicting values between 0 and 1. Close to 0 means we're sure it's in class 0, close to 1 means we're sure it's in class 1, and closer to 0.5 means we don't know.

If we use linear regression for the NFL example above, we will certainly do better than randomly guessing, but it doesn't accurately represent the data:

![NFL linear regression](images/linefit.png)
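
One concrete symptom of the poor fit is that a straight line can produce predictions below 0 or above 1, which cannot be read as probabilities. The following is a minimal sketch on made-up toy data (not the NFL data shown in the plots), using `np.polyfit` for the least-squares line:

```python
import numpy as np

# Hypothetical toy data (NOT the NFL data set): low scores lose, high scores win.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line: y = slope * x + intercept.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)
line_predictions = slope * x + intercept

# The smallest prediction is below 0 and the largest is above 1.
print(line_predictions.min(), line_predictions.max())
```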

A line is clearly not the best way to model this data, so we need to find a better curve.


## The Logistic Function

First, we will just pull a function out of the data science bag of tricks and show that it works reasonably well.

Second, we will see where that function comes from and how it relates to binary outcomes and odds. Before that, let's pin down what the function needs to look like.

This function needs to be close to 0 for the losing scores and close to 1 for the winning scores. To make sense, it should stay near 0 for all sufficiently low scores, stay near 1 for all sufficiently high scores, and increase smoothly from 0 to 1 in the intermediate range.

It will need to look something like this:

![logistic](images/standardLogisticFunction.png)


A function that has the above shape is:

$$ f(t) = \frac{e^t}{1 + e^t} = \frac{1}{1 + e^{-t}} $$

This is the logistic function, also known as the sigmoid function. Note that as t approaches infinity, the value of the logistic function approaches 1 and as t approaches negative infinity, the value of the logistic function approaches 0.
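
As a quick illustration, here is a minimal NumPy sketch of this function. The name `logistic` is our own choice for this example, not something defined elsewhere in the lecture.

```python
import numpy as np

def logistic(t):
    """Logistic (sigmoid) function: 1 / (1 + e^{-t})."""
    t = np.asarray(t, dtype=float)
    return 1.0 / (1.0 + np.exp(-t))

print(logistic(0.0))             # exactly 0.5 at t = 0
print(logistic([-10.0, 10.0]))   # close to 0 and close to 1

# The two algebraic forms of f(t) agree.
t = np.linspace(-5.0, 5.0, 11)
print(np.allclose(logistic(t), np.exp(t) / (1.0 + np.exp(t))))
```

For very large negative `t`, `np.exp(-t)` can overflow and raise a warning; `scipy.special.expit` is a library implementation of the same function that handles this more carefully.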

We will use $ t = \beta_0 + \beta_1x $, which means we'll be dealing with a familiar looking linear function.

This gives us:

$$ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $$

* p(x) is our hypothesis and it represents the probability of a score of x leading to a win. 

\begin{align*}
& P(Y = 1 | X) = p(X) \\
& P(Y = 0 | X) = 1 - p(X)\\
\end{align*}

* $\beta_0$ and $\beta_1$ are parameters that we will optimize to best fit our data.

Also note that this is just for the one variable case, but we can similarly plug in the linear equation for any size feature matrix.
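
For the multi-feature case, the same idea can be sketched with a matrix product. The helper name `predict_proba`, the coefficient values, and the data below are all hypothetical and chosen only for illustration.

```python
import numpy as np

def predict_proba(X, beta):
    """P(Y = 1 | X) for each row of X; beta[0] is the intercept beta_0."""
    X = np.asarray(X, dtype=float)
    # Prepend a column of ones so that beta[0] acts as the intercept.
    X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])
    return 1.0 / (1.0 + np.exp(-(X_with_intercept @ beta)))

beta = np.array([-3.0, 1.0, 0.5])            # hypothetical coefficients
X = np.array([[1.0, 2.0],                    # linear part: -3 + 1 + 1 = -1
              [4.0, 0.0],                    # linear part: -3 + 4 + 0 =  1
              [2.0, 2.0]])                   # linear part: -3 + 2 + 1 =  0
probabilities = predict_proba(X, beta)
print(probabilities)                         # the last entry is exactly 0.5
```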

This is not necessarily the "best possible" function, but there are two main reasons we use the logistic/sigmoid function:

1. It has the correct shape
2. It makes the math easy


## Parameter Estimation

We choose $\bf{\beta}$ by maximizing the likelihood of the data.

**What is the likelihood of seeing a particular data point $(x_i, y_i)$?**

$$L(\beta) = P(y_i | x_i) = p(x_i)^{y_i} \times (1 - p(x_i))^{1 - y_i}$$

**The likelihood of all data points** (This is the probability, under our model, of observing all the labels in our training set. Assuming the data points are independent, it is calculated by taking the product of the individual likelihoods for each data point.)

$$L(\beta) = \prod_{i = 1}^{n} p(x_i)^{y_i} \times (1 - p(x_i))^{1 - y_i}$$

For computational reasons, we will be maximizing the log of this function instead. Maximizing the **log-likelihood** is equivalent, but will make computations easier. One issue with calculating small probabilities with a computer is *numerical underflow*. Once the values get sufficiently small, they will be rounded to 0 and we will lose all information.
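
To see the underflow problem concretely, here is a small sketch with made-up likelihood values (2000 data points, each with an individual likelihood of 0.5):

```python
import numpy as np

# Hypothetical per-point likelihoods: 2000 data points, each equal to 0.5.
likelihoods = np.full(2000, 0.5)

# The product is 0.5**2000, far below the smallest positive float64,
# so it is rounded to exactly 0.0.
product = np.prod(likelihoods)

# The sum of logs is 2000 * log(0.5), an ordinary negative number.
log_sum = np.sum(np.log(likelihoods))

print(product, log_sum)
```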

Using properties of log, we can simplify the formula: 

\begin{align*}
l(\beta) & = log(L(\beta)) \\
         & = log \left( \prod_{i = 1}^{n} p(x_i)^{y_i} \times (1 - p(x_i))^{1 - y_i} \right) \\
         & = \sum_{i = 1}^{n} log(p(x_i)^{y_i} \times (1 - p(x_i))^{1 - y_i}) \\
         & = \sum_{i = 1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \\
\end{align*}
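
As a sanity check on the simplification, the sketch below compares the log of the product form with the final sum form. The probabilities and labels are made-up numbers, not fitted values.

```python
import numpy as np

# Hypothetical predicted probabilities p(x_i) and observed labels y_i.
p = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

# Product form of the likelihood: 0.9 * 0.8 * 0.7 * 0.6.
likelihood = np.prod(p**y * (1 - p)**(1 - y))

# Sum form of the log-likelihood.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(np.log(likelihood), log_likelihood)   # the two values agree
```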


### Gradient Descent with Logistic Regression

Just like we did in linear regression, we could use gradient descent to find the optimal coefficients. Recall that the gradient descent algorithm requires us to choose a cost function and calculate the gradient of the cost function.

#### Cost function

Maximizing the likelihood or the log-likelihood is equivalent to minimizing the negative log-likelihood of the data, i.e., the total cost.

$$Total \  cost = - \left( \sum_{i = 1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \right) $$

We define the logistic regression cost function as the average cost,

$$ J(\beta) = - \frac{1}{n} \left( \sum_{i = 1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \right) $$
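
A minimal sketch of $J(\beta)$ in NumPy follows. It assumes the feature matrix `X` already has a leading column of ones for the intercept; the function name `cost` and the toy data are our own illustrative choices.

```python
import numpy as np

def cost(beta, X, y):
    """Average negative log-likelihood J(beta).

    Assumes X has a leading column of ones, so beta[0] is the intercept.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: one feature plus the intercept column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With all coefficients at zero every p(x_i) is 0.5, so J = log(2).
print(cost(np.zeros(2), X, y))

# Coefficients that push p(x_i) toward the observed labels give a lower cost.
print(cost(np.array([-3.0, 2.0]), X, y))
```

This sketch does not guard against `p` being rounded to exactly 0 or 1, in which case `np.log` would return `-inf`.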

#### Gradient of the cost function
Recall that to use gradient descent, we need to know the gradient of the cost function. This is where we see the advantages of the logistic function. First, let's just calculate the derivative of the logistic function:

\begin{align*}
f'(t) & = \frac{d}{dt} \frac{1}{1 + e^{-t}} \\
      & = \frac{1}{(1 + e^{-t})^2} (e^{-t}) \\
      & = \frac{1}{(1 + e^{-t})} \cdot \frac{e^{-t}}{(1 + e^{-t})} \\
      & = \frac{1}{(1 + e^{-t})} \cdot \left( 1 - \frac{1}{(1 + e^{-t})} \right) \\
      & = f(t)(1 - f(t)) \\
\end{align*}
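
The identity $f'(t) = f(t)(1 - f(t))$ can be checked numerically. The sketch below compares it against a central finite difference; the step size `h` and the grid of `t` values are arbitrary choices.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-5.0, 5.0, 11)
h = 1e-6   # arbitrary small step for the finite difference

numerical = (logistic(t + h) - logistic(t - h)) / (2 * h)   # central difference
analytic = logistic(t) * (1 - logistic(t))                  # f(t)(1 - f(t))

# The largest discrepancy is tiny compared to the derivative values.
print(np.max(np.abs(numerical - analytic)))
```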

Isn't it nice how that works out?

Now we'll take the partial derivative of the cost function for each coefficient:

\begin{align*}
\frac{\partial}{\partial \beta_j} J(\beta) 
    & = -\frac{1}{n} \cdot \frac{\partial}{\partial \beta_j} \left( \sum_{i=1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \right) \\
    & = -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \beta_j} (y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i))) \\
    & = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \cdot \frac{1}{p(x_i)} \cdot \frac{\partial}{\partial \beta_j} p(x_i) + (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \cdot \frac{\partial}{\partial \beta_j} (1 - p(x_i)) \right) \\
    & = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \cdot \frac{1}{p(x_i)} - (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \right) \frac{\partial}{\partial \beta_j} p(x_i) \\
    & = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \cdot \frac{1}{p(x_i)} - (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \right) p(x_i) (1 - p(x_i)) \frac{\partial}{\partial \beta_j} (\beta_0 + \beta_1 x_i) \\
    & = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i (1 - p(x_i)) - (1 - y_i) p(x_i) \right) x_{ij} \\
    & = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i - p(x_i) \right) x_{ij} \\
\end{align*}

Where $x_{ij} = 1$ if $j = 0$ and $x_{ij} = x_i$ if $j = 1$ in the single variable case.

This is what we'll use to update the coefficients in each iteration of gradient descent.
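
Putting the pieces together, here is a minimal sketch of gradient descent with the update $\beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$. The learning rate `alpha`, the fixed iteration count, and the toy data are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def gradient(beta, X, y):
    """Gradient of J(beta): -(1/n) * sum_i (y_i - p(x_i)) * x_ij."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -(X.T @ (y - p)) / len(y)

def gradient_descent(X, y, alpha=0.1, n_iterations=10000):
    """Fixed-step gradient descent; alpha and n_iterations are illustrative."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        beta = beta - alpha * gradient(beta, X, y)
    return beta

# Hypothetical toy data in which the two classes overlap.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])   # column of ones for beta_0

beta = gradient_descent(X, y)
print(beta)                   # beta[1] is positive: larger x favors class 1
print(gradient(beta, X, y))   # close to zero near the optimum
```

The toy classes overlap on purpose: if the data were perfectly separable, the likelihood could always be increased by making the coefficients larger, and the iterations would not settle on finite values.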

## Model Interpretation

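We define the odds of $Y = 1$ as the probability of $Y = 1$ divided by the probability of $Y = 0$. Under the logistic model, this ratio simplifies to an exponential of the linear function: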
$$odds = \frac{P(Y = 1)}{1 - P(Y = 1)} = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}$$

We interpret the $\beta$'s in multiplicative terms with respect to the odds.

E.g., the interpretation of $\beta_1$ is: holding all the other variables/features fixed, for every increase of 1 in $X_1$, the odds of $Y = 1$ are multiplied by a factor of $e^{\beta_1}$ (an increase if $\beta_1 > 0$, a decrease if $\beta_1 < 0$).
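
A short numeric check of this interpretation, using made-up coefficients for a single-feature model:

```python
import numpy as np

def odds(x1, beta0, beta1):
    """Odds of Y = 1, computed from the model probability p / (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x1)))
    return p / (1 - p)

beta0, beta1 = -2.0, 0.7   # hypothetical fitted coefficients

# Increasing x1 by 1 (here from 3 to 4) multiplies the odds by e^{beta1}.
ratio = odds(4.0, beta0, beta1) / odds(3.0, beta0, beta1)
print(ratio, np.exp(beta1))   # both are e^{0.7}, roughly 2.01
```

With these numbers, each additional unit of $X_1$ roughly doubles the odds of $Y = 1$.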


## Regularization in Logistic Regression

Logistic regression is immensely powerful in many cases, but it tends to over-fit when there are many features. Weights tend to become skewed (often very large) at the unregularized maximum-likelihood optimum, partially due to the effect of the logistic function on the gradient. Just as in linear regression, this effect can be mitigated using regularization, which penalizes the model for large coefficients.

Recall that the overall cost of the data is 
$$Total \  cost = - \left( \sum_{i = 1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \right) $$

**Lasso Regularization** ($L_{1}$): Effective at eliminating redundant predictors, because it can shrink coefficients all the way to zero. This sometimes leads to a better production (!) model.
$$Total \  cost = - \left( \sum_{i = 1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \right) + \lambda \sum_{j = 1}^{p} |\beta_j|$$

**Ridge Regularization** ($L_{2}$): Standard shrinkage when you want to keep all the predictors and believe that collinearity is present.

$$Total \  cost = - \left( \sum_{i = 1}^{n} y_i log(p(x_i)) + (1 - y_i) log(1 - p(x_i)) \right) + \lambda \sum_{j = 1}^{p} \beta_j^2$$

Note:
* n is the number of data points
* p is the number of features
* $\beta_0$ (the intercept term) is not penalized
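
The two penalized costs can be sketched in one function. The function name, the `penalty` argument, and the toy values are hypothetical; `X` is assumed to have a leading column of ones so that `beta[0]` is the unpenalized intercept.

```python
import numpy as np

def regularized_total_cost(beta, X, y, lam, penalty="l2"):
    """Total cost plus an L1 or L2 penalty; beta[0] (intercept) is not penalized."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    total_cost = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    if penalty == "l1":
        return total_cost + lam * np.sum(np.abs(beta[1:]))   # lasso
    if penalty == "l2":
        return total_cost + lam * np.sum(beta[1:] ** 2)      # ridge
    raise ValueError("penalty must be 'l1' or 'l2'")

# Hypothetical toy data and coefficients.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.5, 0.0],
              [1.0, 3.0, 2.0],
              [1.0, 2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = np.array([0.3, 1.0, -2.0])
lam = 0.5

unpenalized = regularized_total_cost(beta, X, y, lam=0.0)
lasso = regularized_total_cost(beta, X, y, lam, penalty="l1")   # adds 0.5 * (1 + 2)
ridge = regularized_total_cost(beta, X, y, lam, penalty="l2")   # adds 0.5 * (1 + 4)
print(lasso - unpenalized, ridge - unpenalized)
```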


## Generative vs. Discriminative Models

### Generative Model

* Attempts to model the joint probability, P(X, Y)
* Examples: Probabilistic Graphical Models, Naive Bayes(!)
* Provides full probabilistic model of all variables
* Allows simulating values of any variable in the model

### Discriminative Model

* Attempts to model the conditional probability, P(Y | X), of the target Y given the predictor variables (features) X
* Example: Regression
* Provides a model for the target variable(s) conditional on the observed variables
* Allows sampling of the target variables conditional on the observed quantities
* Inherently supervised

In modern applications, the two types of models are seen as complementary, or as different views of the same procedure.