**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\\]

Our error is the negative log likelihood of the data.

\\[
\begin{align}
E(\theta)
&=
    -\sum_{i = 1}^N
        y^i \log \pprob{\theta}{Y = 1 \condbar X = x^i}
        + (1 - y^i) \log \pprob{\theta}{Y = 0 \condbar X = x^i}
\\
E(\theta)
&=
    -\sum_{i = 1}^N
        y^i \log \frac{1}{1 + e^{-z_\theta(x^i)}}
        + (1 - y^i) \log \left(1 - \frac{1}{1 + e^{-z_\theta(x^i)}} \right)
\\
E(\theta)
&=
    -\sum_{i = 1}^N
        y^i \log \frac{1}{1 + e^{-z_\theta(x^i)}}
        + (1 - y^i) \log \frac{\left(1 + e^{-z_\theta(x^i)} \right) - 1}{1 + e^{-z_\theta(x^i)}}
\\
E(\theta)
&=
    -\sum_{i = 1}^N
        y^i \log \frac{1}{1 + e^{-z_\theta(x^i)}}
        + (1 - y^i) \log \frac{e^{-z_\theta(x^i)}}{1 + e^{-z_\theta(x^i)}}
\end{align}
\\]
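As a concrete reading of the error function, here is a minimal Python sketch that evaluates the sum directly. It assumes the log odds are linear in the features, with `theta[0]` as an intercept. The helper names (`log_odds`, `sigmoid`, `neg_log_likelihood`) and the toy data are illustrative only.

```python
import math

def log_odds(theta, x):
    # Assumed linear form: theta[0] is the intercept, theta[j] pairs with x_j.
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def sigmoid(v):
    # Converts log odds to Pr[Y = 1].
    return 1.0 / (1.0 + math.exp(-v))

def neg_log_likelihood(theta, xs, ys):
    # E(theta) = -sum_i [ y^i log Pr[Y=1 | x^i] + (1 - y^i) log Pr[Y=0 | x^i] ]
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = sigmoid(log_odds(theta, x))
        total += y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return -total

# Toy data: three examples with two binary features each.
xs = [[1, 0], [0, 1], [1, 1]]
ys = [1, 0, 1]

# With all-zero parameters every example gets probability 1/2,
# so the error is 3 * log(2).
print(neg_log_likelihood([0.0, 0.0, 0.0], xs, ys))
```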

To do gradient descent, we need the partial derivative of the error function with respect to each parameter $\theta_k$:

\\[
\begin{align}
\fpartial{
    E
}{
    \theta_k
}(\theta)
&=
-\fpartial{}{\theta_k}
    \sum_{i = 1}^N
        y^i \log \frac{1}{1 + e^{-z_\theta(x^i)}}
        + (1 - y^i) \log \frac{e^{-z_\theta(x^i)}}{1 + e^{-z_\theta(x^i)}}
\\
&=
    -\sum_{i = 1}^N
        y^i
        \fpartial{}{\theta_k}
        \left( \log \frac{1}{1 + e^{-z_\theta(x^i)}} \right)
        +
        (1 - y^i)
        \fpartial{}{\theta_k}
        \left( \log \frac{e^{-z_\theta(x^i)}}{1 + e^{-z_\theta(x^i)}} \right)
\end{align}
\\]

Bring on the calculus! Let's solve the first derivative first!

\\[
\fpartial{}{\theta_k}
\left( \log \frac{1}{1 + e^{-z_\theta(x^i)}} \right)
=
\fpartial{}{\theta_k}
\left( \log 1 - \log \left( 1 + e^{-z_\theta(x^i)} \right) \right)
=
-\fpartial{}{\theta_k}
\log \left( 1 + e^{-z_\theta(x^i)} \right)
\\]

(Here I use the rule $\log \frac{a}{b} = \log a - \log b$, and that $\log 1 = 0$.)

The first thing to know is that $\fpartial{}{x} \log x = \frac{1}{x}$. So we may start applying the derivative operation:

\\[
-
\fpartial{}{\theta_k}
\log \left( 1 + e^{-z_\theta(x^i)} \right)
=
-
\frac{1}{
    1 + e^{-z_\theta(x^i)}
}
\fpartial{}{\theta_k}
\left( 1 + e^{-z_\theta(x^i)} \right)
=
-
\frac{1}{
    1 + e^{-z_\theta(x^i)}
}
e^{-z_\theta(x^i)}
(-1)
\fpartial{}{\theta_k}
z_\theta(x^i)
=
\frac{
    e^{-z_\theta(x^i)}
}{
    1 + e^{-z_\theta(x^i)}
}
\fpartial{}{\theta_k}
z_\theta(x^i)
\\]

Here I have applied the chain rule repeatedly, and I also used the rule that $\fpartial{}{x} e^x = e^x$.

So the last question is: what is $\fpartial{}{\theta_k} z_\theta(x^i)$? This is a little weird because I've written $z$ as if it is *parameterized* by $\theta$, and as a *function of* $x^i$. But we can still consider the derivative:

\\[
\fpartial{}{\theta_k}
    z_\theta(x^i)
=
\fpartial{}{\theta_k}
\left(
    \theta_0
    +
    \sum_{j = 1}^M
        \theta_j x_j^i
\right)
=
x_k^i
\\]

For binary-valued features (what we're working with here), $x_k^i = 0$ means that feature $k$ isn't present in example $x^i$, so changing $\theta_k$ won't change $z_\theta(x^i)$, and the derivative is zero. (For the intercept $\theta_0$ the derivative is $1$, which fits the same formula under the convention $x_0^i = 1$.)

In any case, we can now write:

\\[
\fpartial{}{\theta_k}
\left( \log \frac{1}{1 + e^{-z_\theta(x^i)}} \right)
=
x_k^i
\frac{
    e^{-z_\theta(x^i)}
}{
    1 + e^{-z_\theta(x^i)}
}
\\]
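One way to sanity-check this result is to compare it against a central finite difference. The sketch below again assumes the linear form of $z_\theta$ with `theta[0]` as the intercept. The function names, the parameter values, and the step size `h` are arbitrary choices for illustration.

```python
import math

def log_odds(theta, x):
    # Assumed linear form: theta[0] is the intercept, theta[j] pairs with x_j.
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def log_p1(theta, x):
    # log(1 / (1 + e^{-z})), the quantity being differentiated.
    return -math.log(1.0 + math.exp(-log_odds(theta, x)))

def analytic_partial(theta, x, k):
    # x_k * e^{-z} / (1 + e^{-z}); k is 1-based, so x_k is x[k - 1].
    e = math.exp(-log_odds(theta, x))
    return x[k - 1] * e / (1.0 + e)

def numeric_partial(f, theta, x, k, h=1e-6):
    # Central finite difference in theta_k.
    up, down = list(theta), list(theta)
    up[k] += h
    down[k] -= h
    return (f(up, x) - f(down, x)) / (2.0 * h)

theta = [0.3, -1.2, 0.8]   # illustrative parameter values
x = [1, 0]                 # feature 1 present, feature 2 absent
for k in (1, 2):
    print(k, analytic_partial(theta, x, k), numeric_partial(log_p1, theta, x, k))
```

The two columns should agree to several decimal places, and the row for the absent feature should be zero in both.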

We also need to calculate

\\[
\fpartial{}{\theta_k}
\left( \log \frac{e^{-z_\theta(x^i)}}{1 + e^{-z_\theta(x^i)}} \right)
=
\fpartial{}{\theta_k}
\left( \log e^{-z_\theta(x^i)} - \log \left(1 + e^{-z_\theta(x^i)} \right) \right)
=
\fpartial{}{\theta_k}
\left( -z_\theta(x^i) - \log \left(1 + e^{-z_\theta(x^i)} \right) \right)
\\]

We already calculated the derivative of the log on the right, and we know:

\\[
\fpartial{}{\theta_k}
\left( -z_\theta(x^i) \right)
=
-x_k^i
\\]

because we already found that $\fpartial{}{\theta_k} z_\theta(x^i) = x_k^i$.

Therefore, we have:

\\[
\begin{align}
\fpartial{}{\theta_k}
\left( \log \frac{e^{-z_\theta(x^i)}}{1 + e^{-z_\theta(x^i)}} \right)
&=
-x_k^i
+
x_k^i
\frac{
    e^{-z_\theta(x^i)}
}{
    1 + e^{-z_\theta(x^i)}
}
\\
&=
x_k^i
\left(
    -1
    +
    \frac{
        e^{-z_\theta(x^i)}
    }{
        1 + e^{-z_\theta(x^i)}
    }
\right)
\\
&=
x_k^i
\left(
    \frac{
        -\left(
            1 + e^{-z_\theta(x^i)}
        \right)
        +
        e^{-z_\theta(x^i)}
    }{
        1 + e^{-z_\theta(x^i)}
    }
\right)
\\
&=
x_k^i
\left(
    \frac{
        -1
    }{
        1 + e^{-z_\theta(x^i)}
    }
\right)
\\
&=
-x_k^i
\left(
    \frac{
        1
    }{
        1 + e^{-z_\theta(x^i)}
    }
\right)
\end{align}
\\]
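The same kind of numerical check works for this second derivative. This is a sketch under the same assumptions (linear log odds, `theta[0]` as intercept), with illustrative names and values.

```python
import math

def log_odds(theta, x):
    # Assumed linear form: theta[0] is the intercept, theta[j] pairs with x_j.
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def log_p0(theta, x):
    # log(e^{-z} / (1 + e^{-z})) = -z - log(1 + e^{-z})
    z = log_odds(theta, x)
    return -z - math.log(1.0 + math.exp(-z))

def analytic_partial_p0(theta, x, k):
    # -x_k / (1 + e^{-z}); k is 1-based, so x_k is x[k - 1].
    return -x[k - 1] / (1.0 + math.exp(-log_odds(theta, x)))

def numeric_partial(f, theta, x, k, h=1e-6):
    # Central finite difference in theta_k.
    up, down = list(theta), list(theta)
    up[k] += h
    down[k] -= h
    return (f(up, x) - f(down, x)) / (2.0 * h)

theta = [-0.5, 0.9, 0.4]   # illustrative parameter values
x = [1, 1]
for k in (1, 2):
    print(k, analytic_partial_p0(theta, x, k), numeric_partial(log_p0, theta, x, k))
```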

Therefore, finally we have:

\\[
\begin{align}
\fpartial{
    E
}{
    \theta_k
}(\theta)
&=
    -\sum_{i = 1}^N
        y^i
        \fpartial{}{\theta_k}
        \left( \log \frac{1}{1 + e^{-z_\theta(x^i)}} \right)
        +
        (1 - y^i)
        \fpartial{}{\theta_k}
        \left( \log \frac{e^{-z_\theta(x^i)}}{1 + e^{-z_\theta(x^i)}} \right)
\\
&=
    -\sum_{i = 1}^N
    y^i
    x_k^i
    \frac{
        e^{-z_\theta(x^i)}
    }{
        1 + e^{-z_\theta(x^i)}
    }
    +
    (1 - y^i)
    \left(
        -x_k^i
        \left(
            \frac{
                1
            }{
                1 + e^{-z_\theta(x^i)}
            }
        \right)
    \right)
\end{align}
\\]

This equation simplifies substantially if we define $\sigma$ to be the logistic function (the formula for converting log odds to a probability):

\\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\\]

In that case, we have:

\\[
\begin{align}
\fpartial{
    E
}{
    \theta_k
}(\theta)
&=
    -\sum_{i = 1}^N
    y^i
    x_k^i
    \frac{
        e^{-z_\theta(x^i)}
    }{
        1 + e^{-z_\theta(x^i)}
    }
    +
    (1 - y^i)
    \left(
        -x_k^i
        \left(
            \frac{
                1
            }{
                1 + e^{-z_\theta(x^i)}
            }
        \right)
    \right)
\\
&=
    -\sum_{i = 1}^N
    y^i
    x_k^i
    \frac{
        1
    }{
        e^{z_\theta(x^i)} + 1
    }
    +
    (1 - y^i)
    \left(
        -x_k^i
        \sigma(z_\theta(x^i))
    \right)
\\
&=
    -\sum_{i = 1}^N
    y^i
    x_k^i
    \sigma(-z_\theta(x^i))
    +
    (1 - y^i)
    \left(
        -x_k^i
        \sigma(z_\theta(x^i))
    \right)
\end{align}
\\]
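The middle step relies on the identity $\frac{e^{-z}}{1 + e^{-z}} = \frac{1}{e^{z} + 1} = \sigma(-z)$, which also equals $1 - \sigma(z)$. A quick numerical spot check, with arbitrarily chosen values of `z`:

```python
import math

def sigmoid(z):
    # Logistic function: converts log odds to a probability.
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, -0.5, 0.0, 2.0):
    fraction = math.exp(-z) / (1.0 + math.exp(-z))   # form from the derivation
    flipped = 1.0 / (math.exp(z) + 1.0)              # multiply top and bottom by e^z
    # All four printed values after z should match.
    print(z, fraction, flipped, sigmoid(-z), 1.0 - sigmoid(z))
```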

I will make a last point. Recall that:

\\[
z_\theta(x^i)
=
\log \frac{
    \pprob{\theta}{Y = 1 \condbar X = x^i}
}{
    \pprob{\theta}{Y = 0 \condbar X = x^i}
}
\\]

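And of course $\sigma$ undoes the log odds: $\sigma(z_\theta(x^i)) = \pprob{\theta}{Y = 1 \condbar X = x^i}$ and $\sigma(-z_\theta(x^i)) = \pprob{\theta}{Y = 0 \condbar X = x^i}$. Therefore: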

\\[
\begin{align}
\fpartial{
    E
}{
    \theta_k
}(\theta)
&=
-\sum_{i = 1}^N
y^i
x_k^i
\sigma(-z_\theta(x^i))
+
(1 - y^i)
\left(
    -x_k^i
    \sigma(z_\theta(x^i))
\right)
\\
&=
-\sum_{i = 1}^N
    y^i
    x_k^i
    \pprob{\theta}{Y = 0 \condbar X = x^i}
    -
    (1 - y^i)
    \left( x_k^i \right)
    \pprob{\theta}{Y = 1 \condbar X = x^i}
\end{align}
\\]
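To see the final formula in action, here is a sketch that computes the gradient term by term and compares it with a finite-difference estimate of $E(\theta)$. It assumes linear log odds and treats the intercept as a feature $x_0^i = 1$. All function names and the toy data are illustrative. Collecting terms in the formula gives the compact form $\sum_i \left( \sigma(z_\theta(x^i)) - y^i \right) x_k^i$, which the test below the print loop could equally use.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def log_odds(theta, x):
    # Assumed linear form: theta[0] is the intercept, theta[j] pairs with x_j.
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def neg_log_likelihood(theta, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = sigmoid(log_odds(theta, x))
        total += y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return -total

def gradient(theta, xs, ys):
    # dE/dtheta_k = -sum_i [ y x_k Pr[Y=0 | x] - (1 - y) x_k Pr[Y=1 | x] ]
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        p1 = sigmoid(log_odds(theta, x))
        p0 = 1.0 - p1
        features = [1.0] + list(x)  # assumption: x_0 = 1 for the intercept
        for k, xk in enumerate(features):
            grad[k] += -(y * xk * p0 - (1 - y) * xk * p1)
    return grad

def numeric_gradient(theta, xs, ys, h=1e-6):
    # Central finite difference of E in each coordinate of theta.
    result = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        diff = neg_log_likelihood(up, xs, ys) - neg_log_likelihood(down, xs, ys)
        result.append(diff / (2.0 * h))
    return result

xs = [[1, 0], [0, 1], [1, 1], [0, 0]]
ys = [1, 0, 1, 0]
theta = [0.1, -0.4, 0.7]
for k, (a, n) in enumerate(zip(gradient(theta, xs, ys), numeric_gradient(theta, xs, ys))):
    print(k, a, n)
```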

Here is how I read the equation. Consider an example $(x^i, y^i)$. If $x_k^i = 0$, then changing the coefficient has no effect. So forget it.

Next, assume that $y^i = 1$. In that case, the more your model currently (wrongly) thinks this is a negative example, the greater the impact of increasing $\theta_k$.

On the other hand, if $y^i = 0$, the opposite applies. The more your model currently (wrongly) thinks this is a positive example, the greater the impact of increasing $\theta_k$.

Take note of the signs. When we are dealing with a positive example, increasing $\theta_k$ will increase our belief this is a positive example. That should *lower* the error, which is why there is a leading negative sign outside the sum.

On the other hand, when we have a *negative* example, then increasing $\theta_k$ is increasing our belief in the wrong direction! That's why this effect has the *opposite* sign. The negative signs inside and outside the sum cancel. That indicates that increasing $\theta_k$ will *increase* our error on this example!
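The sign argument can be checked on a single example. The sketch below evaluates one example's term of the final formula and the resulting gradient descent update $\theta_k \leftarrow \theta_k - \eta \, \fpartial{E}{\theta_k}$. The step size `learning_rate` and the function name `example_term` are hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_term(y, x_k, z):
    # One example's contribution to dE/dtheta_k:
    # -(y * x_k * Pr[Y=0] - (1 - y) * x_k * Pr[Y=1])
    p1 = sigmoid(z)
    p0 = 1.0 - p1
    return -(y * x_k * p0 - (1 - y) * x_k * p1)

learning_rate = 0.1  # hypothetical step size
for y in (1, 0):
    g = example_term(y, x_k=1, z=0.0)
    step = -learning_rate * g  # gradient descent moves against the gradient
    print(f"y={y}: gradient term {g:+.2f}, descent step on theta_k {step:+.3f}")
```

For the positive example the term is negative, so descent raises $\theta_k$. For the negative example the term is positive, so descent lowers it. With $x_k = 0$ the term vanishes either way.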