### Logistic Regression

In the previous notebook, I said that we want to find a better model than Naive Bayes.

**(Run this cell to define useful Latex macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\\]

### Equations From Last Time

Here is our Naive Bayes model equation:

\\[
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\frac{
    \prob{\text{S} = 1}
}{
    \prob{\text{S} = 0}
}
\frac{
    \prob{W_1 = w_1 \condbar \text{S} = 1}
}{
    \prob{W_1 = w_1 \condbar \text{S} = 0}
}
\cdots
\frac{
    \prob{W_M = w_M \condbar \text{S} = 1}
}{
    \prob{W_M = w_M \condbar \text{S} = 0}
}
\\]

And we saw we could rewrite this as:

\\[
\begin{align}
\frac{
    \prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\phi_i^{\prime 1 - w_i}
\end{align}
\\]

Finally, we saw that we could then write:

\\[
\begin{align}
\prob{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \phi_0
    \prod_{i = 1}^M
    \phi_i^{w_i}
    \phi_i^{\prime 1 - w_i}
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\\
\prob{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    1
}{
    1
    +
    \left(
        \phi_0
        \prod_{i = 1}^M
        \phi_i^{w_i}
        \phi_i^{\prime 1 - w_i}
    \right)
}
\end{align}
\\]

### Maximizing Likelihood

From the last notebook, we saw that we want to maximize:

\\[
\begin{align}
\pprob{\phi}{\mathcal{D}}
&=
\prod_{i = 1}^N
\pprob{\phi}{\text{S} = s_i \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\\
&=
\prod_{i = 1}^N
\left(
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{s_i}
\left(
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{1 - s_i}
\\
&=
\prod_{i = 1}^N
\left(
    \frac{
        \phi_0
        \prod_{j = 1}^M
        \phi_j^{w_{i, j}}
        \phi_j^{\prime 1 - w_{i, j}}
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
            \phi_j^{\prime 1 - w_{i, j}}
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \left(
            \phi_0
            \prod_{j = 1}^M
            \phi_j^{w_{i, j}}
            \phi_j^{\prime 1 - w_{i, j}}
        \right)
    }
\right)^{1 - s_i}
\end{align}
\\]

I've written the $\phi$ as a subscript of $\Pr$ to emphasize that the "probability" is what our model thinks, and it depends on our choice of $\phi$. Our job is going to be to pick the $\phi$ that maximizes this probability of the dataset.

### Products Are Hard To Optimize

Right now, we could turn Gradient *Ascent* loose on the likelihood function above and try to calculate the $\phi$ that do the best job. In theory that should work, but Gradient Ascent will do a bad job.

Here is why. Gradient Ascent/Descent is all about making many *small* changes to the parameters, each time trying to improve just a little bit. It's like trying to tune in to a radio station with a dial: you want to slowly turn the knob to find the right frequency. You don't want to rapidly turn the knob because you'll go way past the station you're trying to tune.

So let's try to choose a small "step size." Does $\epsilon = 0.01$ sound pretty small?

Well, let's go back and think about the odds for a single training example. That is:

\\[
\begin{align}
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\phi_0
\prod_{i = 1}^M
\phi_i^{w_i}
\phi_i^{\prime 1 - w_i}
\end{align}
\\]

Say you propose to change $\phi_0$ by $\epsilon$. Is this a small change? Well, it depends on what $\phi_0$ is! If $\phi_0$ is very large, then a change of $\epsilon$ won't cause much of a percentage change in the odds. On the other hand, if $\phi_0$ is small, then a change of $\epsilon$ will cause a *huge* change in the odds. For instance, if $\phi_0 = \epsilon$, then adding $\epsilon$ to $\phi_0$ will *double* the odds! The effect is even greater when $\phi_0$ is smaller!

In general, a change of $\epsilon$ to a $\phi_i$ that appears in the product causes a relative change in the odds equal to $\frac{\epsilon}{\phi_i}$. This can be very large whenever $\phi_i$ is small.
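To make the $\frac{\epsilon}{\phi_i}$ effect concrete, here is a minimal Python sketch. The parameter values and the two-word example are made up for illustration; only the odds formula comes from the equations above.

```python
def odds(phi_0, phis, phi_primes, words):
    """Odds = phi_0 * prod_i phi_i^{w_i} * phi'_i^{1 - w_i}."""
    result = phi_0
    for phi, phi_prime, w in zip(phis, phi_primes, words):
        result *= phi ** w * phi_prime ** (1 - w)
    return result


def relative_change_in_odds(phi_0, epsilon, phis, phi_primes, words):
    """Relative change in the odds after adding epsilon to phi_0."""
    before = odds(phi_0, phis, phi_primes, words)
    after = odds(phi_0 + epsilon, phis, phi_primes, words)
    return (after - before) / before


# Hypothetical parameter values, chosen only for illustration.
EPSILON = 0.01
PHIS = [2.0, 0.5]
PHI_PRIMES = [0.9, 1.1]
WORDS = [1, 0]

for phi_0 in [10.0, 1.0, 0.01, 0.001]:
    change = relative_change_in_odds(phi_0, EPSILON, PHIS, PHI_PRIMES, WORDS)
    print(f"phi_0 = {phi_0}: odds change by {change:.1%}")
```

With the same $\epsilon$, the step is negligible when $\phi_0 = 10$ and doubles the odds when $\phi_0 = \epsilon$.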

Basically, there is no such thing as a "small" step, because the size of a step is relative to the magnitude of the $\phi_i$ being changed.

The problem isn't specifically that a small step can cause a big percentage change in a parameter $\phi_i$. That was also the case with linear regression. The problem is that it causes a big change in the overall result: the product, which is the odds.


### Sums Are Easier To Optimize

This doesn't really happen with sums. Consider a weighted sum:

\\[
y = w_0 + \sum w_i x_i
\\]

If we change $w_i$ by $\epsilon$, how much will $y$ change? It will change by $\epsilon x_i$. The percentage change is therefore $\frac{\epsilon x_i}{w_0 + \sum w_i x_i}$.

In order for a change of $\epsilon$ to have a huge percentage change, it would have to be that $w_0 + \sum w_i x_i$ is very small. That is much less likely than just the one $w_i$ value being small.
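The same kind of check can be run for a weighted sum. This is a small sketch with hypothetical numbers: one weight is tiny, but the sum as a whole is not, so a step of $\epsilon$ barely moves $y$.

```python
def weighted_sum(w_0, weights, xs):
    """y = w_0 + sum_i w_i * x_i."""
    return w_0 + sum(w * x for w, x in zip(weights, xs))


def relative_change_in_sum(w_0, weights, xs, index, epsilon):
    """Relative change in y after adding epsilon to weights[index]."""
    before = weighted_sum(w_0, weights, xs)
    bumped = list(weights)
    bumped[index] += epsilon
    after = weighted_sum(w_0, bumped, xs)
    return (after - before) / before


# Hypothetical values: the first weight is tiny, but the sum is not.
W_0 = 1.0
WEIGHTS = [0.001, 2.0]
XS = [1.0, 1.5]
EPSILON = 0.01

change = relative_change_in_sum(W_0, WEIGHTS, XS, index=0, epsilon=EPSILON)
print(f"y changes by {change:.2%} even though the bumped weight grew elevenfold")
```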


### Turn A Product Into A Sum

Since sums are easy to optimize using Gradient Ascent/Descent, and products are hard, the natural choice is to turn a product into a sum! The way to do this is to use the logarithm function.

Let's start with the initial odds function. Let's turn that into a *log odds* function. Since the odds was a product, the log odds will be a sum.

\\[
\begin{align}
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\phi_0
\prod_{i = 1}^{M}
\phi_i^{w_i}
\phi_i^{\prime 1 - w_i}
\\
\log
\frac{
    \pprob{\phi}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\phi}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
&=
\log\left(
    \phi_0
    \prod_{i = 1}^{M}
    \phi_i^{w_i}
    \phi_i^{\prime 1 - w_i}
\right)
\\
&=
\log\phi_0
+
\sum_{i = 1}^M
\left(
    w_i \log\phi_i
    +
    (1 - w_i) \log\phi_i^\prime
\right)
\\
&=
\left(
    \log\phi_0 + \sum_{i = 1}^M \log\phi_i^\prime
\right)
+
\sum_{i = 1}^M
    w_i (\log\phi_i - \log\phi_i^\prime)
\end{align}
\\]


Now, I could try to pick the best $\phi$ values using this equation for the log odds. However, that would be kind of annoying, because the equation is more complex than it needs to be.

Instead, I will *reparameterize* the equation. Instead of trying to find the best $\phi$ values, let me try to find the best $\theta$ values for the equation below:


\\[
\log
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\theta_0
+ \sum_{i=1}^M w_i \theta_i
\\]

This really doesn't change anything. Instead of trying to find the best model in terms of $\phi$, I will look for it in terms of $\theta$. It's like the difference between measuring a distance in meters or feet: it doesn't really matter.
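Matching terms with the last line of the log odds derivation gives $\theta_0 = \log\phi_0 + \sum_{i=1}^M \log\phi_i^\prime$ and $\theta_i = \log\phi_i - \log\phi_i^\prime$. The sketch below checks numerically that both parameterizations give the same log odds for every word vector. The $\phi$ values are hypothetical, picked only for the check.

```python
import math
from itertools import product


def log_odds_phi(phi_0, phis, phi_primes, words):
    """Log odds written in terms of phi."""
    total = math.log(phi_0)
    for phi, phi_prime, w in zip(phis, phi_primes, words):
        total += w * math.log(phi) + (1 - w) * math.log(phi_prime)
    return total


def phi_to_theta(phi_0, phis, phi_primes):
    """Convert phi parameters into the equivalent theta parameters."""
    theta_0 = math.log(phi_0) + sum(math.log(pp) for pp in phi_primes)
    thetas = [math.log(p) - math.log(pp) for p, pp in zip(phis, phi_primes)]
    return theta_0, thetas


def log_odds_theta(theta_0, thetas, words):
    """Log odds written in terms of theta."""
    return theta_0 + sum(w * t for w, t in zip(words, thetas))


# Hypothetical phi values, for illustration only.
PHI_0 = 0.3
PHIS = [2.0, 0.5, 1.5]
PHI_PRIMES = [0.9, 1.1, 0.7]

THETA_0, THETAS = phi_to_theta(PHI_0, PHIS, PHI_PRIMES)

# Largest disagreement between the two forms over all 2^3 word vectors.
max_gap = max(
    abs(log_odds_phi(PHI_0, PHIS, PHI_PRIMES, words)
        - log_odds_theta(THETA_0, THETAS, words))
    for words in product([0, 1], repeat=len(PHIS))
)
print(f"largest difference between the two forms: {max_gap:.2e}")
```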


### The Reparameterized Problem

We now have an equation for the log odds in terms of $\theta$. It is easy to get an equation for the odds by just exponentiating both sides:

\\[
\frac{
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
}{
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
}
=
\exp\left(
    \theta_0
    + \sum_{i=1}^M w_i \theta_i
\right)
\\]


Let's reflect on what we have accomplished through reparameterization. Now, if we propose to increase $\theta_i$ by $\epsilon$, this will scale the odds by a factor of $e^\epsilon$, regardless of what $\theta_i$ is. That means a change of $\epsilon$ always has a consistent effect. We can now choose $\epsilon$ to be small enough that the change in odds will always be very minor each time.
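Here is a quick numerical check of that claim, as a sketch with made-up $\theta$ values. One detail worth noting: $\theta_i$ for $i \geq 1$ only enters the sum when $w_i = 1$, so the $e^\epsilon$ scaling applies to words that are present (and always to $\theta_0$).

```python
import math


def odds(theta_0, thetas, words):
    """Odds = exp(theta_0 + sum_i w_i * theta_i)."""
    return math.exp(theta_0 + sum(w * t for w, t in zip(words, thetas)))


def odds_ratio_after_step(theta_0, thetas, words, index, epsilon):
    """Factor by which the odds scale after adding epsilon to thetas[index]."""
    before = odds(theta_0, thetas, words)
    bumped = list(thetas)
    bumped[index] += epsilon
    return odds(theta_0, bumped, words) / before


# Hypothetical values, for illustration only.
EPSILON = 0.01
WORDS = [1, 0]

for theta_1 in [-8.0, 0.0, 5.0]:
    ratio = odds_ratio_after_step(0.5, [theta_1, 1.0], WORDS, 0, EPSILON)
    print(f"theta_1 = {theta_1}: odds scale by {ratio:.6f}")

print(f"e^epsilon = {math.exp(EPSILON):.6f}")
```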

And of course, we know how to calculate the probabilities from the odds:

\\[
\begin{align}
\pprob{\theta}{\text{S} = 1 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    \exp\left(
        \theta_0
        + \sum_{i=1}^M w_i \theta_i
    \right)
}{
    1
    +
    \exp\left(
        \theta_0
        + \sum_{i=1}^M w_i \theta_i
    \right)
}
\\
\pprob{\theta}{\text{S} = 0 \condbar W_1 = w_1, \ldots, W_M = w_M}
&=
\frac{
    1
}{
    1
    +
    \exp\left(
        \theta_0
        + \sum_{i=1}^M w_i \theta_i
    \right)
}
\end{align}
\\]


The function $\sigma(z) = \frac{e^z}{1 + e^z}$ is called the *logistic function*. It is the function that converts a log odds into a probability, which is exactly what we've done here.

That's where the name *logistic regression* comes from!
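A minimal sketch of the logistic function in Python follows. The two-branch form is an implementation choice of mine, not something from the derivation: it avoids calling `math.exp` on a large positive number, which would overflow. The function name `prob_s_is_1` is likewise just a label for this example.

```python
import math


def logistic(z):
    """sigma(z) = e^z / (1 + e^z), computed without overflow."""
    if z >= 0:
        # Equivalent form 1 / (1 + e^{-z}); e^{-z} is at most 1 here.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so e^z is below 1
    return ez / (1.0 + ez)


def prob_s_is_1(theta_0, thetas, words):
    """Model probability that S = 1 given the word indicators."""
    log_odds = theta_0 + sum(w * t for w, t in zip(words, thetas))
    return logistic(log_odds)


# Log odds of zero means even odds, so the probability is one half.
print(logistic(0.0))
# Hypothetical parameters and word vector, for illustration only.
print(prob_s_is_1(-1.0, [2.0, -0.5], [1, 1]))
```

Since $\sigma(-z) = 1 - \sigma(z)$, the probability that $\text{S} = 0$ is `1 - prob_s_is_1(...)`, matching the second equation above.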

We can plug this into our likelihood equation above!

\\[
\begin{align}
\pprob{\theta}{\mathcal{D}}
&=
\prod_{i = 1}^N
\left(
    \pprob{\theta}{\text{S} = 1 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{s_i}
\left(
    \pprob{\theta}{\text{S} = 0 \condbar W_1 = w_{i, 1}, \ldots, W_M = w_{i, M}}
\right)^{1 - s_i}
\\
&=
\prod_{i = 1}^N
\left(
    \frac{
        \exp\left(
            \theta_0
            + \sum_{j=1}^M w_{i, j} \theta_j
        \right)
    }{
        1
        +
        \exp\left(
            \theta_0
            + \sum_{j=1}^M w_{i, j} \theta_j
        \right)
    }
\right)^{s_i}
\left(
    \frac{
        1
    }{
        1
        +
        \exp\left(
            \theta_0
            + \sum_{j=1}^M w_{i, j} \theta_j
        \right)
    }
\right)^{1 - s_i}
\end{align}
\\]


Okay! We're back to something we can optimize! But now, we will optimize it in terms of $\theta$. This is easier, for the reasons stated above.

We haven't changed anything. We've just reparameterized our problem in a way such that Gradient Ascent can work better!
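As a sketch, the likelihood in terms of $\theta$ can be computed directly from the formula. The four-example dataset below is invented for illustration, with each example stored as a pair `(words, s)`.

```python
import math


def logistic(z):
    """sigma(z) = e^z / (1 + e^z), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def likelihood(theta_0, thetas, dataset):
    """Product over examples of Pr[S=1|w]^s * Pr[S=0|w]^(1-s)."""
    total = 1.0
    for words, s in dataset:
        z = theta_0 + sum(w * t for w, t in zip(words, thetas))
        p1 = logistic(z)
        total *= p1 ** s * (1.0 - p1) ** (1 - s)
    return total


# Hypothetical dataset: (word indicators, label s).
DATASET = [
    ([1, 0], 1),
    ([0, 1], 0),
    ([1, 1], 1),
    ([0, 0], 0),
]

baseline = likelihood(0.0, [0.0, 0.0], DATASET)
better = likelihood(0.0, [2.0, -2.0], DATASET)
print(f"all-zero theta: {baseline:.4f}")
print(f"theta = (0, 2, -2): {better:.4f}")
```

With every $\theta$ set to zero the model assigns probability $0.5$ to each example, so the likelihood is $0.5^N$. Parameters that lean toward the observed labels score higher.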


### Optimizing Log Likelihood Instead Of Likelihood

Remember that Gradient Ascent/Descent doesn't just try to change parameters by a fixed "step size" of $\epsilon$. That's not exactly how it works.

Instead, it tries to change a parameter $\theta_i$ by $\lambda \fpartial{\pprob{\theta}{\mathcal{D}}}{\theta_i}$, where $\lambda$ is the learning rate. That is, it wants to change parameters *more* if they have a bigger impact on $\pprob{\theta}{\mathcal{D}}$, and *less* if they have a smaller impact on $\pprob{\theta}{\mathcal{D}}$.

The idea is this: it's more efficient to try to focus on changes that are giving you the biggest bang for your buck.
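The update rule can be sketched in a few lines. Two assumptions here are mine: the partial derivatives are estimated with central finite differences rather than derived by hand, and the objective `toy_objective` is a stand-in function used only to exercise the update.

```python
def numerical_gradient(f, theta, h=1e-6):
    """Estimate each partial derivative of f at theta by central differences."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def gradient_ascent_step(f, theta, learning_rate):
    """Move each theta_i by learning_rate times its partial derivative."""
    grad = numerical_gradient(f, theta)
    return [t + learning_rate * g for t, g in zip(theta, grad)]


def toy_objective(theta):
    """Hypothetical objective with its maximum at theta = (1, -2)."""
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2


start = [0.0, 0.0]
after_one_step = gradient_ascent_step(toy_objective, start, learning_rate=0.1)
print(after_one_step)
print(toy_objective(start), toy_objective(after_one_step))
```

The second parameter has the larger partial derivative at the starting point, so it moves further in a single step.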

This only really works if a change $\epsilon$ in the optimized function has a consistent value. For instance, if we want to change $\theta_1$ twice as much as $\theta_2$ because $\fpartial{f_\theta}{\theta_1} = 2\fpartial{f_\theta}{\theta_2}$, that only makes sense if a change in $f_\theta$ of two units is twice as good as a change of one unit.


This isn't true with probabilities. Think in terms of odds. If a probability $p = 0.01$ then adding $\epsilon = 0.01$ more than *doubles* the odds, because:

\\[
\frac{0.01}{0.99} \Rightarrow \frac{0.02}{0.98}
\\]

That is *very* different from the effect if we change $p = 0.50$ by an $\epsilon = 0.01$:

\\[
\frac{0.50}{0.50} \Rightarrow \frac{0.51}{0.49}
\\]

In this case, the odds change by only 4%.

So it is clear that the same change $\epsilon$ in the probability can cause very different changes in the odds.

In fact, even a change of $\epsilon$ in the odds can mean very different things. Consider if the odds double or halve: those are symmetric changes, right?

However, doubling the odds is a 100% change in the odds, whereas halving the odds is a 50% change.

So optimizing the odds would suffer from a problem similar to optimizing the probability. The size of a change is not consistent.
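The arithmetic in this section is easy to reproduce. The helper names below are mine, introduced only for this check.

```python
def odds_from_probability(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)


def relative_change(before, after):
    """Change from before to after, as a fraction of before."""
    return (after - before) / before


EPSILON = 0.01

# The same step in probability, starting from two different places.
for p in [0.01, 0.50]:
    before = odds_from_probability(p)
    after = odds_from_probability(p + EPSILON)
    print(f"p = {p:.2f}: odds change by {relative_change(before, after):.1%}")

# Doubling and halving the odds look asymmetric as percentage changes.
doubling = relative_change(1.0, 2.0)
halving = relative_change(1.0, 0.5)
print(f"doubling: {doubling:+.0%}, halving: {halving:+.0%}")
```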

### Fixing By Optimizing Log Likelihood

**TODO**...