# Cross Entropy Derivatives

**(Run this cell to define useful LaTeX macros)**
\\[
\newcommand{\card}[1]{\left\lvert#1\right\rvert}
\newcommand{\condbar}[0]{\,\big|\,}
\newcommand{\eprob}[1]{\widehat{\text{Pr}}\left[#1\right]}
\newcommand{\norm}[1]{\left\lvert\left\lvert#1\right\rvert\right\rvert}
\newcommand{\prob}[1]{\text{Pr}\left[#1\right]}
\newcommand{\pprob}[2]{\text{Pr}_{#1}\left[#2\right]}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\\]

To do gradient descent, we need to calculate the partial derivatives of the cross entropy error function.

For the rest of this notebook, I will write $y^i$ for the label of the $i$th training example. In the spam problem, this is spam or not spam, encoded as a zero or a one.

I will write $x^i$ for the vector of features of example $i$: $x_j^i = 0$ if the $j$th feature is absent, and $x_j^i = 1$ if it is present.

I am avoiding symbols that are specific to the spam classification problem, because logistic regression is a technique that can be applied to any similar problem.

From our prior notebook, we have:

\\[
\begin{align}
E(\theta)
&=
-
\sum_{i = 1}^N
y^i
\left(
    \log
    \sigma\left(
        \theta_0
        +
        \sum_{j = 1}^M \theta_j x_j^i
    \right)
\right)
+
(1 - y^i)
\left(
    \log
    \sigma\left(
        -\theta_0
        -
        \sum_{j = 1}^M \theta_j x_j^i
    \right)
\right)
\end{align}
\\]

As a notational convenience, I will often write

\\[
z_\theta(x^i) = \theta_0 + \sum_j \theta_j x_j^i
\\]

Thus:

\\[
\begin{align}
E(\theta)
&=
-
\sum_{i = 1}^N
y^i
\left(
    \log
    \sigma\left(z_\theta(x^i)\right)
\right)
+
(1 - y^i)
\left(
    \log
    \sigma\left(-z_\theta(x^i)\right)
\right)
\end{align}
\\]
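
To make the formula concrete, here is a minimal sketch that evaluates $E(\theta)$ directly from the expression above. The function names (`sigmoid`, `z_theta`, `cross_entropy_error`) and the toy data are assumptions made for illustration; they do not come from the earlier notebook. The list `theta` stores $\theta_0$ first, followed by $\theta_1, \ldots, \theta_M$.

```python
import math


def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def z_theta(theta, x):
    """z_theta(x) = theta_0 + sum_j theta_j * x_j, with theta[0] as the intercept."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def cross_entropy_error(theta, xs, ys):
    """E(theta) = -sum_i [ y^i log sigma(z) + (1 - y^i) log sigma(-z) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = z_theta(theta, x)
        total += y * math.log(sigmoid(z)) + (1 - y) * math.log(sigmoid(-z))
    return -total


# Hypothetical toy data: three examples with two binary features each.
xs = [[1, 0], [0, 1], [1, 1]]
ys = [1, 0, 1]

# With all parameters at zero, every prediction is 0.5,
# so each example contributes log(2) to the error.
theta_zero = [0.0, 0.0, 0.0]
print(cross_entropy_error(theta_zero, xs, ys))
```

With every parameter at zero the model is maximally unsure, which makes the all-zero vector a convenient sanity check: the error should equal $N \log 2$.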

To do gradient descent, we need to find the partial derivative of the error function with respect to each parameter $\theta_k$:

\\[
\begin{align}
\fpartial{
    E
}{
    \theta_k
}(\theta)
&=
-\fpartial{}{\theta_k}
    \sum_{i = 1}^N
        y^i
        \left(
            \log
            \sigma\left(z_\theta(x^i)\right)
        \right)
        +
        (1 - y^i)
        \left(
            \log
            \sigma\left(-z_\theta(x^i)\right)
        \right)
\\
&=
-
\sum_{i = 1}^N
    y^i
    \fpartial{}{\theta_k}
    \left(
        \log
        \sigma\left(z_\theta(x^i)\right)
    \right)
    +
    (1 - y^i)
    \fpartial{}{\theta_k}
    \left(
        \log
        \sigma\left(-z_\theta(x^i)\right)
    \right)
\end{align}
\\]

Bring on the calculus! Let's solve the first derivative first!

\\[
\fpartial{}{\theta_k}
\left(
    \log
    \sigma\left(z_\theta(x^i)\right)
\right)
=
\frac{
    1
}{
    \sigma\left(z_\theta(x^i)\right)
}
\fpartial{}{\theta_k}
\sigma\left(z_\theta(x^i)\right)
\\]


To solve the next part, we'll need to use the chain rule again. Let's start by computing the derivative of $\sigma$ with respect to its argument $z$:

\\[
\begin{align}
\fpartial{}{z}
\sigma(z)
&=
\fpartial{}{z}
\frac{
    e^z
}{
    1 + e^z
}
\\
&=
\fpartial{}{z}
\frac{
    1
}{
    1 + e^{-z}
}
\\
&=
\fpartial{}{z}
\left(
    1 + e^{-z}
\right)^{-1}
\\
&=
(-1)
\left(
    1 + e^{-z}
\right)^{-2}
e^{-z}
(-1)
\\
&=
\frac{
    1
}{
    1 + e^{-z}
}
\frac{
    e^{-z}
}{
    1 + e^{-z}
}
\\
&=
\frac{
    e^z
}{
    1 + e^z
}
\frac{
    1
}{
    1 + e^z
}
\\
&=
\sigma(z)
(1 - \sigma(z))
\end{align}
\\]
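
A quick numerical check can catch sign or exponent slips in a derivation like this one. The sketch below compares the closed form $\sigma(z)(1 - \sigma(z))$ against a central finite difference. The helper names and the step size `h` are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed form from the derivation: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_difference(f, z, h=1e-6):
    """Numerical estimate of f'(z) using (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


# Largest disagreement between the two over a handful of test points.
max_gap = max(
    abs(sigmoid_derivative(z) - central_difference(sigmoid, z))
    for z in (-3.0, -0.5, 0.0, 0.5, 3.0)
)
print(max_gap)
```

The gap should be tiny but not exactly zero, since the finite difference is only an approximation.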


Now we can proceed!

\\[
\begin{align}
\fpartial{}{\theta_k}
\left(
    \log
    \sigma\left(z_\theta(x^i)\right)
\right)
&=
\frac{
    1
}{
    \sigma\left(z_\theta(x^i)\right)
}
\fpartial{}{\theta_k}
\sigma\left(z_\theta(x^i)\right)
\\
&=
\frac{
    1
}{
    \sigma\left(z_\theta(x^i)\right)
}
\left(
    \sigma\left(z_\theta(x^i)\right)
    \left(
        1 - \sigma\left( z_\theta(x^i) \right)
    \right)
\right)
\fpartial{}{\theta_k}
z_\theta(x^i)
\\
&=
\left(
    1 - \sigma\left(z_\theta(x^i)\right)
\right)
\fpartial{}{\theta_k}
z_\theta(x^i)
\end{align}
\\]


So the last question is: what is $\fpartial{}{\theta_k} z_\theta(x^i)$? This is a little weird because I've written $z$ as if it is *parameterized* by $\theta$, and as a *function of* $x^i$. But we can still consider the derivative:

\\[
\fpartial{}{\theta_k}
    z_\theta(x^i)
=
\fpartial{}{\theta_k}
\left(
    \theta_0
    +
    \sum_{j = 1}^M
        \theta_j x_j^i
\right)
=
x_k^i
\\]

For binary valued features (what we're working with here), when $x_k^i = 0$, the feature isn't present in example $x^i$, so changing $\theta_k$ won't change $z_\theta(x^i)$. Thus it makes sense that this partial derivative is zero. (This formula is for $k \geq 1$. For the intercept $\theta_0$, the derivative is simply $1$, as if every example had a constant feature $x_0^i = 1$.)


Thus, finally we have:

\\[
\begin{align}
\fpartial{}{\theta_k}
\left(
    \log
    \sigma\left(z_\theta(x^i)\right)
\right)
&=
\left(
    1 - \sigma\left(z_\theta(x^i)\right)
\right)
x_k^i
\end{align}
\\]

To save you much boredom, it is also true that:

\\[
\begin{align}
\fpartial{}{\theta_k}
\left(
    \log
    \sigma\left( -z_\theta(x^i) \right)
\right)
&=
-
\sigma\left(z_\theta(x^i)\right)
x_k^i
\end{align}
\\]
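
For readers who want the skipped steps, here is a sketch of that derivation. It reuses the chain rule and the derivative of $\sigma$ from above, plus one identity not stated earlier, $1 - \sigma(-z) = \sigma(z)$, which follows directly from the definition of $\sigma$:

\\[
\begin{align}
\fpartial{}{\theta_k}
\left(
    \log
    \sigma\left( -z_\theta(x^i) \right)
\right)
&=
\frac{
    1
}{
    \sigma\left( -z_\theta(x^i) \right)
}
\sigma\left( -z_\theta(x^i) \right)
\left(
    1 - \sigma\left( -z_\theta(x^i) \right)
\right)
\fpartial{}{\theta_k}
\left(
    -z_\theta(x^i)
\right)
\\
&=
-
\left(
    1 - \sigma\left( -z_\theta(x^i) \right)
\right)
x_k^i
\\
&=
-
\sigma\left(z_\theta(x^i)\right)
x_k^i
\end{align}
\\]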

Combining this, we have:
    
\\[
\begin{align}
\fpartial{}{\theta_k}
E(\theta)
&=
-
\sum_{i = 1}^N
    y^i
    \left(
        1 - \sigma\left(z_\theta(x^i)\right)
    \right)
    x_k^i
+
    (1 - y^i)
    \left(
        -\sigma\left( z_\theta(x^i) \right)
    \right)
    x_k^i
\\
&=
\sum_{i = 1}^N
    -
    y^i
    x_k^i
    \left(
        1 - \sigma\left(z_\theta(x^i)\right)
    \right)
+
    (1 - y^i)
    x_k^i
    \sigma\left(z_\theta(x^i)\right)
\end{align}
\\]
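
The combined formula can be checked numerically. The sketch below implements the gradient term by term as written above and compares it with a finite-difference estimate of $E(\theta)$. All function names and the toy data are assumptions for illustration. One further assumption: the intercept $\theta_0$ is handled by treating every example as having a constant feature $x_0^i = 1$.

```python
import math


def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def z_theta(theta, x):
    """z_theta(x) = theta_0 + sum_j theta_j * x_j, with theta[0] as the intercept."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def error(theta, xs, ys):
    """Cross entropy error E(theta) summed over all examples."""
    return -sum(
        y * math.log(sigmoid(z_theta(theta, x)))
        + (1 - y) * math.log(sigmoid(-z_theta(theta, x)))
        for x, y in zip(xs, ys)
    )


def gradient(theta, xs, ys):
    """dE/dtheta_k = sum_i [ -y x_k (1 - sigma(z)) + (1 - y) x_k sigma(z) ]."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        s = sigmoid(z_theta(theta, x))
        padded = [1.0] + list(x)  # assumed constant feature x_0 = 1 for theta_0
        for k, xk in enumerate(padded):
            grad[k] += -y * xk * (1.0 - s) + (1 - y) * xk * s
    return grad


def numeric_gradient(theta, xs, ys, h=1e-6):
    """Central finite-difference estimate of each partial derivative of E."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += h
        down[k] -= h
        grad.append((error(up, xs, ys) - error(down, xs, ys)) / (2.0 * h))
    return grad


# Hypothetical toy data: three examples with two binary features each.
xs = [[1, 0], [0, 1], [1, 1]]
ys = [1, 0, 1]
theta = [0.1, -0.2, 0.3]

for analytic, numeric in zip(gradient(theta, xs, ys), numeric_gradient(theta, xs, ys)):
    print(analytic, numeric)
```

Each printed pair should agree to several decimal places. A mismatch in one component usually points to a sign error in the corresponding term.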

Now, holy Moses! By the definition of $\sigma$ and $z_\theta$, we know that:

\\[
\begin{align}
\pprob{\theta}{Y = 0 \condbar X = x^i}
&=
\left(
    1 - \sigma\left(z_\theta(x^i)\right)
\right)
\\
\pprob{\theta}{Y = 1 \condbar X = x^i}
&=
\sigma\left(z_\theta(x^i)\right)
\end{align}
\\]


So I can substitute back in:

\\[
\begin{align}
\fpartial{}{\theta_k}
E(\theta)
&=
\sum_{i = 1}^N
    -
    y^i
    x_k^i
    \left(
        1 - \sigma\left(z_\theta(x^i)\right)
    \right)
+
    (1 - y^i)
    x_k^i
    \sigma\left(z_\theta(x^i)\right)
\\
&=
\sum_{i = 1}^N
    -
    y^i
    x_k^i
    \pprob{\theta}{Y = 0 \condbar X = x^i}
+
    (1 - y^i)
    x_k^i
    \pprob{\theta}{Y = 1 \condbar X = x^i}
\end{align}
\\]
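
A possible simplification, offered as a side check rather than a step the derivation depends on: expanding the two terms for a single example shows that the $y^i \sigma$ pieces cancel, leaving the difference between the predicted probability and the label, scaled by the feature.

\\[
\begin{align}
-
y^i
x_k^i
\left(
    1 - \sigma\left(z_\theta(x^i)\right)
\right)
+
(1 - y^i)
x_k^i
\sigma\left(z_\theta(x^i)\right)
&=
x_k^i
\left(
    -y^i
    + y^i \sigma\left(z_\theta(x^i)\right)
    + \sigma\left(z_\theta(x^i)\right)
    - y^i \sigma\left(z_\theta(x^i)\right)
\right)
\\
&=
\left(
    \sigma\left(z_\theta(x^i)\right) - y^i
\right)
x_k^i
\end{align}
\\]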

Here is how I read the equation. Consider an example $(x^i, y^i)$. If $x_k^i = 0$, then changing the coefficient $\theta_k$ has no effect. So forget it.

Next, assume that $y^i = 1$. In that case, the more your model currently (wrongly) thinks this is a negative example, the greater the impact of increasing $\theta_k$.

On the other hand, if $y^i = 0$, the opposite applies. The more your model currently (wrongly) thinks this is a positive example, the greater the impact of increasing $\theta_k$.

Take note of the signs. When we are dealing with a positive example, increasing $\theta_k$ will increase our belief that this is a positive example. That should *lower* the error, which is why the $y^i$ term carries a negative sign: it is the leading negative from outside the sum, distributed onto that term.

On the other hand, when we have a *negative* example, increasing $\theta_k$ increases our belief in the wrong direction! That's why this effect has the *opposite* sign: the negative inside the $(1 - y^i)$ term and the negative outside the sum cancel. That indicates that increasing $\theta_k$ will *increase* our error on this example!
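
The reading above can be illustrated with a few hand-picked cases. This is only a sketch: the `contribution` helper and the chosen values of $z$ are hypothetical, selected to show a confidently wrong model, a confidently right one, and an absent feature.

```python
import math


def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def contribution(y, x_k, z):
    """One example's term in dE/dtheta_k, given its label y, feature x_k, and score z."""
    p1 = sigmoid(z)  # model's Pr[Y = 1 | X = x]
    p0 = 1.0 - p1    # model's Pr[Y = 0 | X = x]
    return -y * x_k * p0 + (1 - y) * x_k * p1


cases = {
    "positive example, model wrongly thinks negative": contribution(1, 1, -3.0),
    "positive example, model rightly thinks positive": contribution(1, 1, 3.0),
    "negative example, model wrongly thinks positive": contribution(0, 1, 3.0),
    "feature absent": contribution(1, 0, -3.0),
}

for description, value in cases.items():
    print(f"{description}: {value:+.4f}")
```

The first case should print a large negative value (increasing $\theta_k$ lowers the error a lot), the second a small negative value, the third a large positive value, and the last one zero.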