# Logistic Regression: A Probabilistic, Discriminative Model

Logistic regression is a discriminative model for binary classification. It models each label $y_i$ as a Bernoulli random variable with mean $\mu_i=\sigma(w^Tx_i)$, where $\sigma(a) = 1/(1+e^{-a})$ is the sigmoid (logistic) function:

$$
\begin{align}
    p(y_i|x_i,w) &= Ber(y_i|\sigma(w^Tx_i)) \\
             &= \sigma(w^Tx_i)^{I(y_i=1)} (1 - \sigma(w^Tx_i))^{I(y_i=0)}
\end{align}
$$
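
As a concrete illustration of the likelihood above, here is a minimal NumPy sketch. The function names and the toy values of `X`, `y`, and `w` are hypothetical and chosen only for demonstration.

```python
import numpy as np

def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_likelihood(y, X, w):
    """Return p(y_i | x_i, w) for every row x_i of X, with y_i in {0, 1}."""
    mu = sigmoid(X @ w)                      # mu_i = sigma(w^T x_i)
    return mu ** y * (1.0 - mu) ** (1 - y)   # mu_i if y_i = 1, else 1 - mu_i

# Hypothetical toy data: three examples, two features (the first is a bias term).
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.0]])
y = np.array([1, 0, 1])
w = np.array([0.0, 1.0])

probs = bernoulli_likelihood(y, X, w)
print(probs)
```

Each entry of `probs` is the probability the model assigns to the label that was actually observed for that example.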


As with other methods, we can first consider the MLE $\hat{w}$. To begin, we write out the negative log-likelihood of the data:
$$
\begin{align}
    NLL(w) &= -\log p(y|x, w) \\
           &= -\log \prod_{i=1}^N p(y_i|x_i,w) \\
           &= - \sum_{i=1}^N \log p(y_i|x_i,w) \\
           &= - \sum_{i=1}^N \left[ y_i \log \mu_i + (1-y_i) \log(1 - \mu_i) \right] \\
           &= - \sum_{i=1}^N \left[ y_i \log \left( \frac {e^{w^Tx_i}} {1 + e^{w^Tx_i}} \right) + (1-y_i) \log \left( 1 - \frac {e^{w^Tx_i}} {1 + e^{w^Tx_i}} \right) \right]
\end{align}
$$

You may recognize the last expression as the standard "cross-entropy" loss. Either way, you can remember it as the negative log-likelihood of $N$ independent Bernoulli random variables $\{y_i\}$, where $y_i \sim Ber(\sigma(w^Tx_i))$.
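
The negative log-likelihood can be evaluated directly from its definition, or in an algebraically equivalent form that avoids taking the log of numbers close to zero. The sketch below shows both; the function names and toy data are hypothetical, and the stable form uses the identity $-[y_i \log \mu_i + (1-y_i)\log(1-\mu_i)] = \log(1+e^{a_i}) - y_i a_i$ with $a_i = w^Tx_i$.

```python
import numpy as np

def nll_naive(w, X, y):
    """NLL computed term by term from mu_i = sigma(w^T x_i)."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def nll_stable(w, X, y):
    """Same quantity written as sum_i [log(1 + exp(a_i)) - y_i * a_i]."""
    a = X @ w
    return np.sum(np.logaddexp(0.0, a) - y * a)

# Hypothetical toy data.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.2, -0.7])

print(nll_naive(w, X, y), nll_stable(w, X, y))
```

At $w = 0$ every $\mu_i$ equals $0.5$, so the NLL reduces to $N \log 2$, which is a handy sanity check.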

Now we just need to find the $w$ that minimizes this negative log-likelihood. Unlike in the linear regression case, there is no closed-form solution: we cannot write $\hat{w}$ analytically in terms of the other variables. What we can do is compute the gradient (and Hessian) of $NLL(w)$ and _iterate_ towards the solution. Since $NLL(w)$ is convex, stepping in the direction opposite the gradient, with a suitably chosen step size, will eventually reach the global minimum and so yield the optimal $\hat{w}$. Be careful not to assume that an iterative algorithm implies a non-convex problem. The problem is convex; it simply requires iterations rather than a single equation solve.

The gradient and Hessian of $NLL(w)$ are shown below:
$$
\begin{align}
    g &= \frac d {dw} NLL(w) \\
      &= \sum_{i=1}^N (\mu_i-y_i)x_i \\
      &= X^T(\mu - y)
\end{align}
$$

$$
\begin{align}
    H &= \frac d {dw} g \\
      &= \sum_{i=1}^N (\nabla_w \mu_i)x^T_i \\
      &= \sum_{i=1}^N \mu_i(1 - \mu_i)x_i x^T_i \\
      &= X^TSX,\ S=\mathrm{diag}(\mu_i(1-\mu_i))
\end{align}
$$
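
To make the iteration concrete, here is a minimal sketch that uses the gradient $X^T(\mu - y)$ and Hessian $X^TSX$ in Newton's method, $w \leftarrow w - H^{-1}g$. Newton's method is one of several possible choices (plain gradient descent would use only $g$). The synthetic data, the seed, and `w_true` are assumptions made for illustration, and the sketch has no step-size control, so it may fail on linearly separable data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient(w, X, y):
    """g = X^T (mu - y)."""
    return X.T @ (sigmoid(X @ w) - y)

def hessian(w, X):
    """H = X^T S X with S = diag(mu_i * (1 - mu_i))."""
    mu = sigmoid(X @ w)
    s = mu * (1.0 - mu)
    return X.T @ (s[:, None] * X)   # scales row i of X by s_i, avoiding an N x N matrix

def fit_newton(X, y, n_iter=25, tol=1e-10):
    """Iterate w <- w - H^{-1} g, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = gradient(w, X, y)
        if np.linalg.norm(g) < tol:
            break
        w = w - np.linalg.solve(hessian(w, X), g)
    return w

# Hypothetical synthetic data: a bias column plus two Gaussian features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 1.0, 2.0])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_newton(X, y)
print(w_hat, np.linalg.norm(gradient(w_hat, X, y)))
```

Because the Hessian is positive definite whenever $X$ has full column rank and no $\mu_i$ is exactly 0 or 1, a vanishing gradient at `w_hat` indicates the global minimum rather than a saddle point or local optimum.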

### Multi-class Logistic Regression
In the binary case, the likelihood of the response variable is Bernoulli. To generalize to $C$ classes, we give each class $c$ its own weight vector $w_c$ (collected in $W$) and model the response as Multinoulli (categorical), with class probabilities given by the softmax function:
$$
\begin{align}
    p(y=c|x,W) &= \frac {e^{w^T_cx}} {\sum_{c'=1}^C e^{w^T_{c'}x}}
\end{align}
$$
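
The class probabilities above can be computed as in the following sketch. Storing the vectors $w_c$ as the rows of `W` is a layout assumption, and the numbers are hypothetical. Subtracting the largest score before exponentiating leaves the result unchanged (the factor cancels in the ratio) but prevents overflow.

```python
import numpy as np

def softmax_probs(W, x):
    """Return the vector of p(y = c | x, W) for c = 1..C; row c of W is w_c."""
    a = W @ x              # scores a_c = w_c^T x
    a = a - np.max(a)      # shift for numerical stability; the ratio is unchanged
    e = np.exp(a)
    return e / np.sum(e)

# Hypothetical example with C = 3 classes and 2 features.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 1.0])

p = softmax_probs(W, x)
print(p, p.sum())
```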

## Aside: Gaussian Approximation
Sometimes it is convenient to approximate a distribution with a member of a simpler family that is easier to compute with than the target distribution itself. This will be the case next, when we look at Bayesian logistic regression. One such approximation is known as the "Gaussian approximation" or "Laplace approximation".

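Suppose we wish to approximate the posterior distribution of an arbitrary parameter vector $\theta \in \mathbb{R}^D$ (the exponent $D$ is the dimension of $\theta$, while the $D$ in $p(\theta|D)$ denotes the data). Write the posterior in terms of an energy function $E(\theta) = -\log p(\theta, D)$ and a normalization constant $Z = p(D)$: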
$$
\begin{align}
    p(\theta|D) &= \frac 1 Z e^{-E(\theta)}
\end{align}
$$

We want to approximate this posterior with a Gaussian. The logarithm of a Gaussian density is quadratic (second-order) in its argument, so we can obtain the approximation by expanding $E(\theta)$ in a Taylor series and keeping terms up to second order. Let $g$ and $H$ denote the gradient and Hessian of $E$ evaluated at the expansion point. If we expand around a mode $\theta^*$ of the posterior (a minimum of $E$), the gradient term vanishes:

$$
\begin{align}
    E(\theta) &\approx E(\theta^*) + (\theta-\theta^*)^Tg + \frac 1 2 (\theta-\theta^*)^T H (\theta-\theta^*) \\
              &= E(\theta^*) + \frac 1 2 (\theta-\theta^*)^T H (\theta-\theta^*) \\
              \\
   \Rightarrow \hat{p}(\theta|D) &= \frac 1 Z e^{-E(\theta^*)} e^{-\frac 1 2(\theta-\theta^*)^T H (\theta-\theta^*)} \\
                                 &= N(\theta|\theta^*, H^{-1}) \\
                               Z &= p(D) \approx \int e^{-E(\theta^*)} e^{-\frac 1 2(\theta-\theta^*)^T H (\theta-\theta^*)} d\theta = e^{-E(\theta^*)}(2\pi)^{D/2} |H|^{- \frac 1 2}
\end{align}
$$
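
A small one-dimensional example can help check the formulas. The sketch below assumes a hypothetical unnormalized posterior $e^{-E(\theta)} = \theta^a(1-\theta)^b$ on $(0,1)$, chosen because its exact normalizer is the Beta function $B(a+1, b+1)$, so the Laplace estimate of $Z$ can be compared with the truth. The values of `a` and `b` are arbitrary.

```python
import math

# Hypothetical target: e^{-E(theta)} = theta^a * (1 - theta)^b.
a, b = 30.0, 20.0

def energy(theta):
    """E(theta) = -log of the unnormalized density."""
    return -(a * math.log(theta) + b * math.log(1.0 - theta))

# Mode: setting dE/dtheta = 0 gives theta* = a / (a + b).
theta_star = a / (a + b)

# Hessian of E at the mode (a scalar here, since D = 1).
H = a / theta_star**2 + b / (1.0 - theta_star) ** 2

# Laplace approximation: Gaussian N(theta | theta*, H^{-1}) and
# log Z ~= -E(theta*) + (D/2) log(2 pi) - (1/2) log|H| with D = 1.
approx_var = 1.0 / H
log_Z_laplace = -energy(theta_star) + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(H)

# Exact normalizer: log B(a + 1, b + 1).
log_Z_exact = math.lgamma(a + 1) + math.lgamma(b + 1) - math.lgamma(a + b + 2)

print(theta_star, approx_var, log_Z_laplace, log_Z_exact)
```

The two values of $\log Z$ agree closely here because the target is sharply peaked and nearly symmetric around its mode; for skewed or multimodal targets the approximation can be much worse.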

## Bayesian Logistic Regression
Just like in linear regression, we can consider the full posterior _distribution_ over $w$, as opposed to just a point estimate $\hat{w}$. However, unlike in linear regression, we cannot compute the posterior exactly. This is because there is no mathematically convenient conjugate prior for the logistic regression likelihood.
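
One way to proceed is to apply the Gaussian approximation from the aside above to the posterior over $w$. The sketch below assumes a zero-mean isotropic Gaussian prior $N(w|0, v_0I)$, which is a choice made here for illustration and not something fixed by the text. Under that assumption the energy is $E(w) = NLL(w) + \frac{1}{2v_0}w^Tw$ up to a constant, with gradient $X^T(\mu - y) + w/v_0$ and Hessian $X^TSX + I/v_0$. The function name, data, and `prior_var` value are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(X, y, prior_var=10.0, n_iter=50):
    """Return (w_map, cov) for the approximation N(w | w_map, H^{-1}).

    Assumes the prior N(w | 0, prior_var * I); finds the mode with Newton steps.
    """
    D = X.shape[1]
    w = np.zeros(D)
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y) + w / prior_var                       # gradient of E(w)
        H = X.T @ ((mu * (1 - mu))[:, None] * X) + np.eye(D) / prior_var
        w = w - np.linalg.solve(H, g)
    # Hessian at the mode; its inverse is the approximate posterior covariance.
    mu = sigmoid(X @ w)
    H = X.T @ ((mu * (1 - mu))[:, None] * X) + np.eye(D) / prior_var
    return w, np.linalg.inv(H)

# Hypothetical synthetic data.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (rng.uniform(size=100) < sigmoid(X @ np.array([0.5, -1.0, 1.5]))).astype(float)

w_map, cov = laplace_posterior(X, y)
print(w_map)
print(np.sqrt(np.diag(cov)))   # approximate posterior standard deviations
```

The diagonal of `cov` gives a rough measure of uncertainty in each weight, which a point estimate alone does not provide.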