## Activation functions

### **sigmoid:** $\sigma(z) = \frac{1}{1+e^{-z}}$

- Good for the output layer in binary classification, since its output lies in $(0, 1)$, but rarely the best choice anywhere else

### **tanh:** $\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$

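- A shifted and rescaled $\sigma$ ($\tanh(z) = 2\sigma(2z) - 1$) whose output lies in $(-1, 1)$ and is centered on zero, which helps center the data passed to the next layer. Works better than $\sigma$ for hidden layers in nearly all cases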

If $z$ is very large in magnitude (strongly positive or strongly negative), the derivative of both $\sigma$ and $\tanh$ approaches zero, which slows gradient-based learning. This is the [_vanishing gradient problem_](https://en.wikipedia.org/wiki/Vanishing_gradient_problem).
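A minimal NumPy sketch of the shrinking derivative, assuming the standard definitions above. The helper names (`sigmoid_grad`, `tanh_grad`) and the sample points are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of sigmoid, written in terms of its output a
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_grad(z):
    # derivative of tanh, written in terms of its output a
    a = np.tanh(z)
    return 1.0 - a ** 2

# Sample points of increasing magnitude (illustrative values)
z = np.array([0.0, 2.0, 5.0, 10.0])

# Both gradients peak at z = 0 (0.25 for sigmoid, 1.0 for tanh)
# and decay toward zero as |z| grows.
print("sigmoid'(z):", sigmoid_grad(z))
print("tanh'(z):   ", tanh_grad(z))
```

Both functions are symmetric in how the gradient decays, so the same shrinkage appears for strongly negative $z$.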

### **reLU:** $relu(z) = \max(0, z)$

- Generally the default choice for hidden layers

### **leaky reLU:** $\text{leaky relu}(z) = \max(0.01z, z)$

- Gives very similar results to `relu`; whether to use a small nonzero slope for negative inputs, and which value, is model-dependent
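A short NumPy sketch of both rectifiers. The leaky variant scales negative inputs by a small slope; the parameter name `slope` and its default of 0.01 are assumptions for illustration.

```python
import numpy as np

def relu(z):
    # element-wise max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # element-wise max(slope * z, z); valid for 0 < slope < 1
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # negative inputs are clipped to 0
print(leaky_relu(z))  # negative inputs are scaled by the slope instead
```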

## Derivatives of activation functions

$\sigma'(z) = a(1-a)$, where $a = \sigma(z)$

$\tanh'(z) = 1-a^{2}$, where $a = \tanh(z)$

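$relu'(z) = \begin{cases}
  g, & \text{if } z < 0, \\
  1, & \text{if } z > 0, \\
  \text{undefined}, & \text{if } z = 0
\end{cases}$

where `g` is the slope for negative inputs: zero for relu, 0.01 (or another small constant) for leaky relu. Although this is not rigorous, the undefined case at $z = 0$ can be ignored in practice; implementations simply assign it either `g` or 1.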
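The derivative formulas can be sanity-checked against a central finite difference. This is a sketch under stated assumptions: the function names are hypothetical, and `drelu` arbitrarily assigns the slope `g` at $z = 0$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid_from_a(a):
    # a = sigmoid(z), already computed in the forward pass
    return a * (1.0 - a)

def dtanh_from_a(a):
    # a = tanh(z), already computed in the forward pass
    return 1.0 - a ** 2

def drelu(z, g=0.0):
    # g = 0 gives relu, g = 0.01 gives leaky relu;
    # z == 0 falls into the g branch by convention (an assumption here)
    return np.where(z > 0, 1.0, g)

def numeric_grad(f, z, eps=1e-6):
    # central finite difference, applied element-wise
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.array([-1.5, 0.5, 2.0])  # deliberately avoids z == 0

checks = {
    "sigmoid": np.allclose(dsigmoid_from_a(sigmoid(z)), numeric_grad(sigmoid, z)),
    "tanh": np.allclose(dtanh_from_a(np.tanh(z)), numeric_grad(np.tanh, z)),
    "relu": np.allclose(drelu(z), numeric_grad(lambda v: np.maximum(0.0, v), z)),
    "leaky relu": np.allclose(
        drelu(z, g=0.01), numeric_grad(lambda v: np.maximum(0.01 * v, v), z)
    ),
}
print(checks)
```

Writing the $\sigma$ and $\tanh$ derivatives in terms of the activation `a` means backpropagation can reuse the value cached from the forward pass rather than recomputing the exponentials.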

## Random initialization
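- Initializing all weights to zero, as in `logreg`, will not work: every hidden unit then computes the same function and receives the same gradient update, so the units never become different from one another. Initialize the weights to small random values instead (biases can stay at zero)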
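A rough sketch of the symmetry problem on a tiny one-hidden-layer network. Everything here is an illustrative assumption rather than the notes' own setup: the layer sizes, the random data, the omitted biases, the `tanh` hidden layer with a sigmoid output trained on cross-entropy loss (so that `dZ2 = A2 - Y`), and the 0.01 scale for the random weights.

```python
import numpy as np

def hidden_weight_grad(W1, W2, X, Y):
    # Forward and backward pass for a 1-hidden-layer net (biases omitted).
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)                      # hidden activations, (n_h, m)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))     # sigmoid output, (1, m)
    dZ2 = A2 - Y                              # assumes cross-entropy loss
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)      # uses tanh'(z) = 1 - a^2
    return dZ1 @ X.T / m                      # gradient w.r.t. W1, (n_h, n_x)

def rows_identical(G):
    # True when every hidden unit (row) gets the same gradient
    return bool(np.allclose(G, G[0]))

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                         # hypothetical sizes
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m))

# Zero initialization: every hidden unit gets the same (here, all-zero) gradient
dW1_zero = hidden_weight_grad(np.zeros((n_h, n_x)), np.zeros((1, n_h)), X, Y)

# Small random initialization breaks the symmetry
W1 = rng.standard_normal((n_h, n_x)) * 0.01
W2 = rng.standard_normal((1, n_h)) * 0.01
dW1_rand = hidden_weight_grad(W1, W2, X, Y)

print("zero init, identical rows:  ", rows_identical(dW1_zero))
print("random init, identical rows:", rows_identical(dW1_rand))
```

Keeping the random weights small keeps $z$ near zero at the start of training, where the $\sigma$ and $\tanh$ gradients are largest.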