# Deep Learning Notes

by zwl

In [1]:
import numpy as np

## Essential Formulae

Definitions first:

$$ 
\begin{aligned}
\sigma(x) =& \frac{1}{1 + e^{-x}} \\
\zeta(x) =& \log(1 + e^x) \\
\end{aligned}
$$

## Properties of Sigmoid & Softplus functions
Derivations for Page 67 formulae. 

### 3.33. 

$$
\begin{aligned}
\sigma(x) =& \frac{e^x}{e^x} \times \frac{1}{1 + e^{-x}} \\
=& \frac{e^x}{e^x(1 + e^{-x})} \\
=& \frac{e^x}{e^x + 1}
\end{aligned}
$$
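The two equivalent forms are useful numerically: each one avoids overflow on a different half of the real line. Below is a minimal sketch, assuming NumPy; `stable_sigmoid` is a hypothetical helper name, not something from the original notes.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that picks the overflow-free form for each sign of x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so 1 / (1 + exp(-x)) cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, exp(x) < 1, so exp(x) / (exp(x) + 1) cannot overflow.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (ex + 1.0)
    return out

xs = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(xs))
```

The function always returns a 1-D array, even for scalar input, which keeps the boolean masking simple.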

### 3.34. 

Let $y = e^{-x}$, $z = 1 + y$, then we have $\sigma(x) = z^{-1} $.

$$
\begin{aligned}
\frac{d}{dx}\sigma(x) =& \frac{d\sigma}{dz} \times \frac{dz}{dy} \times \frac{dy}{dx} \\
=& -z^{-2} \times 1 \times (-e^{-x}) \\
=& -(1 + e^{-x})^{-2} \times (-e^{-x}) \\
=& \frac{e^{-x}}{(1 + e^{-x})^{2}} \\
=& \frac{1}{1 + e^{-x}} \times \frac{e^{-x}}{1 + e^{-x}} \\
=& \sigma(x) (1-\sigma(x))
\end{aligned}
$$
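The closed form $\sigma(x)(1-\sigma(x))$ can be sanity-checked against a central finite difference. This is only a rough numerical check, assuming NumPy; the step size `h` and the grid `xs` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Closed form derived above: sigma'(x) = sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-5.0, 5.0, 11)
h = 1e-5
# Central difference approximation of the derivative.
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)
max_err = np.max(np.abs(numeric - sigmoid_grad(xs)))
print(max_err)
```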

### 3.35.

$$
\begin{aligned}
1 - \sigma(x) =& 1 - \frac{1}{1 + e^{-x}}\\
=& \frac{1 + e^{-x} - 1}{1 + e^{-x}} \\
=& \frac{e^{-x}}{1 + e^{-x}} \\
=& \frac{1}{e^x + 1} \\
=& \sigma(-x)
\end{aligned}
$$

### 3.36.

$$
\begin{aligned}
\log(\sigma(x)) &= \log\big(\frac{1}{1 + e^{-x}}\big) \\
&= \log(1) - \log(1 + e^{-x}) \\
&= -\log(1 + e^{-x}) \\
&= -\zeta(-x)
\end{aligned}
$$
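This identity gives a way to evaluate $\log\sigma(x)$ without first forming $\sigma(x)$, which matters when $x$ is very negative. A minimal sketch, assuming NumPy's `np.logaddexp` for the softplus; the function names are illustrative.

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + e^x) = log(e^0 + e^x), which logaddexp evaluates without overflow.
    return np.logaddexp(0.0, x)

def log_sigmoid(x):
    # Identity 3.36: log(sigma(x)) = -zeta(-x).
    return -softplus(-x)

x = -800.0
with np.errstate(over="ignore", divide="ignore"):
    # Direct evaluation: exp(800) overflows to inf, so sigma rounds to 0 and the log is -inf.
    naive = np.log(1.0 / (1.0 + np.exp(-x)))

print(naive, log_sigmoid(x))
```

The direct form loses all information for this input, while the softplus form returns a finite value close to $x$.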

### 3.37.

Let $u = 1 + e^x$, hence $\zeta(x) = \log(u)$, then

$$
\begin{aligned}
\frac{d}{dx}\zeta(x) &= \frac{d\zeta}{du} \times \frac{du}{dx} \\
&= \frac{1}{u} \times e^x \\
&= \frac{e^x}{1 + e^x} \\
&= \frac{1}{e^{-x} + 1} \\
&= \sigma(x)
\end{aligned}
$$

### 3.38.

**Logit function**, for all $x \in (0, 1)$:

$$ \sigma^{-1}(x) = \log\bigg(\frac{x}{1-x}\bigg) $$

Here the superscript $-1$ denotes the **inverse** function, not the reciprocal: given $\sigma(x)$, recover $x$.

$$
\begin{aligned}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
1 + e^{-x} &= \frac{1}{\sigma(x)} \\
e^{-x} &= \frac{1}{\sigma(x)} - 1 \\
-x &= \log\bigg(\frac{1-\sigma(x)}{\sigma(x)}\bigg) \\
x &= -\log\bigg(\frac{1-\sigma(x)}{\sigma(x)}\bigg) \\
x &= \log\bigg(\frac{\sigma(x)}{1-\sigma(x)}\bigg)
\end{aligned}
$$

In [8]:
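def sigmoid(x):
    # Logistic sigmoid: maps any real x into (0, 1).
    return 1. / (1 + np.exp(-x))

def logit(x):
    # Inverse of sigmoid: log-odds of a probability x in (0, 1).
    return np.log(x / (1.-x))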

logit(.6)

0.40546510810816422

In [12]:
sigmoid(logit(.6))

0.59999999999999998

### 3.39.

Inverse of $\zeta(x)$, let $u=\zeta(x)$:

$$
\begin{aligned}
u &= \log(1 + e^x) \\
e^u &= 1 + e^x \\
e^x &= e^u - 1 \\
e^u - 1 &> 0 \text{, so take the log of both sides} \\
x &= \log\big( e^u -1 \big)
\end{aligned}
$$
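A round trip through softplus and its inverse is a quick way to check the formula. This is a sketch assuming NumPy; `softplus_inverse` is a hypothetical name, and it is only defined for $u > 0$.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def softplus_inverse(u):
    # log(e^u - 1); np.expm1 keeps precision when u is close to 0.
    return np.log(np.expm1(u))

x = np.array([-3.0, -0.5, 0.0, 2.0, 10.0])
roundtrip = softplus_inverse(softplus(x))
print(roundtrip)
```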

### 3.40.

$$
\begin{aligned}
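\int^{x}_{-\infty} \sigma(y)\,dy &= \int^{x}_{-\infty} \frac{1}{1 + e^{-y}}\,dy \\
&= \Big[ \log \big( 1 + e^{-y} \big) + y \Big]^{x}_{-\infty} \\
&= \Big[ \log \big( \frac{e^y + 1}{e^y} \big) + y \Big]^{x}_{-\infty} \\
&= \Big[ \log(e^y + 1) - \log(e^y) + y \Big]^{x}_{-\infty} \\
&= \Big[ \log(e^y + 1) \Big]^{x}_{-\infty} \\
&= \log(e^x + 1) - \log(1) \\
&= \zeta(x)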
\end{aligned}
$$

The key step is finding the antiderivative. It can be verified with `SymPy`, or by differentiating it back by hand:

$$
\begin{aligned}
\frac{d}{dx}\big(\log(1 + e^{-x}) + x\big) &= \frac{1}{1+e^{-x}} \times (-e^{-x}) + 1 \\
&= \frac{1 + e^{-x} - e^{-x}}{1 + e^{-x}} \\
&= \sigma(x)
\end{aligned}
$$

In [14]:
import sympy as spy

In [16]:
y = spy.Symbol('y')

spy.integrate(1 / (1 + spy.exp(-y)))

y + log(1 + exp(-y))

In [17]:
spy.diff(1 / (1 + spy.exp(-y)))

exp(-y)/(1 + exp(-y))**2

In [18]:
spy.diff(y + spy.log(1 + spy.exp(-y)))

1 - exp(-y)/(1 + exp(-y))
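Besides the symbolic check, the definite integral can be approximated numerically and compared with $\zeta(x)$. The sketch below assumes NumPy and truncates the lower limit at $-40$, where the sigmoid is already negligible; the grid size and the test point `x = 1.5` are arbitrary.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def softplus(x):
    return np.logaddexp(0.0, x)

x = 1.5
# Truncate the lower limit at -40: the tail below it contributes a negligible amount.
ys = np.linspace(-40.0, x, 200001)
vals = sigmoid(ys)
dy = ys[1] - ys[0]
# Composite trapezoidal rule written out by hand.
integral = dy * (vals.sum() - 0.5 * (vals[0] + vals[-1]))
print(integral, softplus(x))
```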

### 3.41.

$$
\begin{aligned}
\zeta(x)-\zeta(-x) &= \log(1 + e^x) - \log(1 + e^{-x}) \\
&= \log\bigg(\frac{1 + e^x}{1 + e^{-x}}\bigg) \\
&= \log\bigg(\frac{e^{-x}(1 + e^x)}{e^{-x}(1 + e^{-x})}\bigg) \\
&= \log\bigg(\frac{e^{-x} + 1}{e^{-x}(1 + e^{-x})}\bigg) \\
&= \log\bigg(\frac{1}{e^{-x}}\bigg) \\
&= \log(e^x) \\
&= x
\end{aligned}
$$
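Rearranged as $\zeta(x) = x + \zeta(-x)$, this identity suggests a softplus that never exponentiates a large positive number. A minimal sketch, assuming NumPy; `softplus_stable` is an illustrative name.

```python
import numpy as np

def softplus_stable(x):
    x = np.asarray(x, dtype=float)
    # For x > 0 this is x + zeta(-x); for x <= 0 it is zeta(x) itself.
    # Either way exp() only ever sees a non-positive argument.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

xs = np.array([-50.0, -1.0, 0.0, 1.0, 50.0, 1000.0])
# Identity 3.41: zeta(x) - zeta(-x) should reproduce x.
diff = softplus_stable(xs) - softplus_stable(-xs)
print(diff)
```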

## Derivatives

### Softmax Function, p78

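$$ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} $$

Computing this directly overflows when some $x_i$ is very large, and underflows (every exponential rounds to zero, giving a zero denominator) when all $x_i$ are very negative. The fix is to evaluate $\text{softmax}(z)$ instead, where $z = x - \max_i x_i$. Subtracting the same scalar from every component leaves the result unchanged.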
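The shift by the maximum can be sketched as follows, assuming NumPy and a 1-D input vector; handling batches along an axis is left out to keep the example short.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    # Shift so the largest entry is 0: every exponent is <= 0, so exp() cannot overflow,
    # and the denominator contains at least one term equal to 1.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
p = softmax(x)
print(p, p.sum())
```

Because softmax is invariant to adding a constant to every input, `softmax([1000, 1001, 1002])` matches `softmax([0, 1, 2])`.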

### Poor Conditioning

Given the function $f(x) = A^{-1}x$, when $A \in \mathbb{R}^{n\times n}$ has an **eigenvalue decomposition**, its **condition number** is:

$$ \max_{i,j} \bigg \lvert \frac{\lambda_i}{\lambda_j} \bigg \rvert $$

That is, the ratio of the magnitudes of the largest and smallest eigenvalues.

Poor conditioning makes choosing a good optimization step size difficult.
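The eigenvalue ratio above can be computed directly. This is a sketch assuming NumPy; `eig_condition_number` is a hypothetical helper and the matrix `A` is a made-up example. If any eigenvalue is zero the ratio is infinite.

```python
import numpy as np

def eig_condition_number(A):
    # Magnitudes of the eigenvalues (they may be complex for non-symmetric A).
    mags = np.abs(np.linalg.eigvals(A))
    return mags.max() / mags.min()

# Symmetric example with eigenvalues 2 and 0.001.
A = np.array([[2.0, 0.0],
              [0.0, 0.001]])
kappa = eig_condition_number(A)
print(kappa)
```

For a symmetric matrix such as this one, the value agrees with the 2-norm condition number returned by `np.linalg.cond(A)`.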

## Hessian Matrix & Min/Max/Saddle Points

At a **critical point** where $\nabla_x f(x) = 0$:

* **Local minimum** if the Hessian matrix is **positive definite** (i.e. all of its eigenvalues are positive).
* **Local maximum** if the Hessian matrix is **negative definite** (i.e. all of its eigenvalues are negative).
* **Saddle point** if the Hessian matrix has at least one positive and at least one negative eigenvalue.
* **Inconclusive** if all non-zero eigenvalues have the same sign but **at least one eigenvalue is zero**.
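The second-derivative test can be sketched as a small classifier on the Hessian's eigenvalues. This assumes NumPy and a symmetric Hessian; the function name, the tolerance `tol`, and the example Hessians are illustrative. The saddle-point branch (eigenvalues of both signs) is the standard remaining case of the test.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    # eigvalsh assumes H is symmetric and returns real eigenvalues.
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    # Same sign otherwise, but at least one eigenvalue is (numerically) zero.
    return "inconclusive"

# Hessians at the origin of x^2 + y^2, x^2 - y^2 and x^2 + y^4 respectively.
print(classify_critical_point(np.diag([2.0, 2.0])))
print(classify_critical_point(np.diag([2.0, -2.0])))
print(classify_critical_point(np.diag([2.0, 0.0])))
```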

