## Why Activation Functions?

In the Perceptron, the step function is replaced with the logistic (sigmoid) function,
σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains
only flat segments, so there is no gradient to work with (Gradient Descent cannot
move on a flat surface), while the logistic function has a well-defined nonzero derivative
everywhere, allowing Gradient Descent to make some progress at every step.
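
To make the "flat surface" point concrete, here is a minimal sketch comparing the two functions. The helper names (`step`, `sigmoid_grad`, `numeric_grad`) are made up for illustration, and the step function is assumed to switch at z = 0.

```python
import math

def step(z):
    """Heaviside step function (assumed threshold at z = 0)."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, 0.5, 3.0):
    # The step function's slope is 0 away from the jump; sigmoid's is positive.
    print(z, numeric_grad(step, z), sigmoid_grad(z))
```

Away from the jump, the estimated slope of the step function is exactly 0, so a gradient-based update would not move. The sigmoid's slope is small for large |z| but never exactly 0.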

But why are activation functions needed at all? Well, if you chain several linear
transformations, all you get is a linear transformation. For
example, if f(x) = 2x + 3 and g(x) = 5x – 1, then chaining these two linear functions
gives you another linear function: f(g(x)) = 2(5x – 1) + 3 = 10x + 1. So if you don’t
have some nonlinearity between layers, then even a deep stack of layers is equivalent
to a single layer, and you can’t solve very complex problems with that. Conversely, a
large enough DNN with nonlinear activations can theoretically approximate any continuous
function.
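
The collapse of chained linear functions can be checked directly. This is a small sketch using the same f and g as above; `compose_linear` and `f_relu_g` are hypothetical helper names, and ReLU is used as one example of a nonlinearity placed between the two "layers".

```python
def f(x):
    return 2 * x + 3

def g(x):
    return 5 * x - 1

def compose_linear(a1, b1, a2, b2):
    """Return (a, b) such that a*x + b == a1*(a2*x + b2) + b1."""
    return a1 * a2, a1 * b2 + b1

# f after g collapses into the single linear function 10x + 1.
a, b = compose_linear(2, 3, 5, -1)
for x in (-3, 0, 4):
    assert f(g(x)) == a * x + b

def f_relu_g(x):
    """Same chain, but with a ReLU between g and f."""
    return f(max(0, g(x)))

# A linear function always satisfies h(0) == (h(-1) + h(1)) / 2.
# With the ReLU in between, that no longer holds.
print(f_relu_g(0), (f_relu_g(-1) + f_relu_g(1)) / 2)  # 3 7.0
```

Since the midpoint test fails, no single pair (a, b) can reproduce `f_relu_g`, which is the sense in which the nonlinearity adds expressive power.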


* ReLU & Variants:
    * ReLU: max(0, z)
    * ReLU is faster to compute than sigmoid, whose output and gradient require comparatively expensive exponentials
    * Gradients die in the negative region (i.e. when the weighted sum of inputs is negative): the gradient there is 0, so SGD no longer updates those weights

* Softplus: log(1 + exp(z))
    * Smooth version of ReLU (differentiable everywhere, including at z = 0)
    * Close to 0 when z is strongly negative
    * Close to z when z is strongly positive
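
A minimal sketch of ReLU next to softplus, using only the formulas from the bullets above. The function names are illustrative, and taking the ReLU gradient at z = 0 to be 0 is an assumption (conventions differ).

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Assumed convention: gradient 0 at z == 0.
    return 1.0 if z > 0 else 0.0

def softplus(z):
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_grad(z):
    # The derivative of softplus is the logistic function (stable form).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, relu(z), softplus(z), relu_grad(z), softplus_grad(z))
```

For negative z the ReLU gradient is exactly 0 (the "dying" case), while the softplus gradient stays positive, although it becomes very small.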

* Leaky ReLU:
    * max($\alpha$ z, z)
    * $\alpha$ sets the slope for z < 0, i.e. how much the function "leaks"; typically around 0.01
    * The nonzero slope for z < 0 keeps the gradient alive, so neurons do not die out

* RReLU:
    * Randomized Leaky ReLU
    * $\alpha$ is picked randomly from a given range during training and fixed to its average value during testing

* PReLU:
    * Parametric ReLU 
    * $\alpha$ is learned during training
    * Better with large datasets; risks overfitting on small datasets
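
The three leaky variants differ only in where $\alpha$ comes from. This is a rough sketch, not a framework implementation: the function names are hypothetical, and the sampling range 1/8 to 1/3 for RReLU is an illustrative assumption.

```python
import random

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z), assuming 0 <= alpha <= 1."""
    return max(alpha * z, z)

def rrelu(z, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Randomized leaky ReLU: random slope in training, mean slope at test time."""
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return max(alpha * z, z)

def prelu_grad_alpha(z):
    """Derivative of max(alpha * z, z) with respect to alpha.

    It equals z for negative inputs and 0 otherwise; this is the signal
    that lets PReLU learn alpha by gradient descent.
    """
    return z if z < 0 else 0.0

print(leaky_relu(-2.0))             # small negative output instead of 0
print(rrelu(-4.0, training=False))  # deterministic at test time
print(prelu_grad_alpha(-4.0), prelu_grad_alpha(4.0))
```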

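* ELU:
    * Exponential Linear Unit: exponential curve for negative z, approaching -$\alpha$ as z decreases
    * $\alpha$(exp(z) - 1) if z < 0
    * z if z ≥ 0
    * Slower to compute than ReLU and its variants (because of the exponential)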
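
A minimal sketch of ELU and its gradient from the piecewise definition above, assuming $\alpha$ = 1 as the default (the names `elu` and `elu_grad` are illustrative).

```python
import math

def elu(z, alpha=1.0):
    """z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """1 for z >= 0, alpha * exp(z) for z < 0 (never exactly 0)."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, elu(z), elu_grad(z))
```

The negative branch saturates toward -$\alpha$ rather than 0, and its gradient stays positive, which is what distinguishes it from plain ReLU.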

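* SELU:
    * Scaled ELU
    * Self-normalizing: if all hidden layers use SELU, the output of each layer tends to keep mean 0 and standard deviation 1 during training, which avoids the vanishing/exploding gradients problem
    * Constraints:
        * Inputs must be standardized (mean 0, standard deviation 1)
        * Weights must be initialized with LeCun normal initialization
        * Works only with a sequential architecture (self-normalization is not guaranteed with e.g. recurrent networks or skip connections)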
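
The self-normalizing property can be illustrated with a quick Monte Carlo check: if a unit's pre-activation is standard normal, the SELU output should again have mean close to 0 and standard deviation close to 1. This is only a sketch of that fixed point, not a full network. The constants are the commonly published SELU values, and `selu_output_moments` is a hypothetical helper.

```python
import math
import random

# Commonly published SELU constants (taken here as given).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: SCALE * (z if z > 0 else ALPHA * (exp(z) - 1))."""
    return SCALE * (z if z > 0 else ALPHA * math.expm1(z))

def selu_output_moments(n_samples=100_000, seed=0):
    """Estimate the mean and std of selu(z) for z ~ N(0, 1)."""
    rng = random.Random(seed)
    outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    mean = sum(outputs) / n_samples
    var = sum((y - mean) ** 2 for y in outputs) / n_samples
    return mean, math.sqrt(var)

mean, std = selu_output_moments()
print(f"mean={mean:.3f}, std={std:.3f}")  # both should land near 0 and 1
```

The check assumes standardized inputs, which is why the constraints above (standardized inputs, LeCun normal initialization) matter: they are what keep the pre-activations close to this regime.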

* Summary:
    * In general: SELU > ELU > leaky ReLU > ReLU > tanh > sigmoid
    * If self-normalization is not possible, ELU may perform better than SELU
    * If runtime latency matters, prefer leaky ReLU over SELU


    