In [2]:
import math 
import numpy as np

# Declaring "e" for future use
e = math.e

**Activation Functions**

> **Mathematical functions** used within a neural network that determine whether, and how strongly, a neuron fires.

*Formal Definition*
> Non-linear transformations applied to the weighted sum of a neuron's inputs before the result is passed on to the next layer of neurons
$$
Y = \text{Activation}\!\left(\sum_{i=1}^{n} W_i X_i \right)
$$
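
To make the formula concrete, here is a minimal sketch of a single neuron. The helper name `neuron_output` and the weight and input values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def neuron_output(weights, inputs, activation):
    """Apply `activation` to the weighted sum of `inputs` (hypothetical helper)."""
    z = float(np.dot(weights, inputs))  # sum_i W_i * X_i
    return activation(z)

# Illustrative values only
W = np.array([0.5, -1.0, 2.0])
X = np.array([1.0, 2.0, 0.5])

# Weighted sum: 0.5*1.0 + (-1.0)*2.0 + 2.0*0.5 = -0.5
y_identity = neuron_output(W, X, lambda z: z)
y_rectified = neuron_output(W, X, lambda z: max(0.0, z))
```

Swapping the `activation` argument is all that changes between the functions discussed in the rest of this notebook.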

**Properties of Activation Functions**
- Non-Linear
    - Non-Linearity enables the model to understand complex patterns from the data
- Differentiable
    - Activation functions should be differentiable in order to facilitate the *back-propagation* process

## Linear Function 
$$
f(x) = x 
$$ 

- It can be used in the output layer for regression
- It can only represent linear relationships: stacking layers with a linear activation still yields a single linear map
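
The limitation can be checked numerically. The sketch below, with randomly generated and purely illustrative weight matrices `W1` and `W2`, shows that two layers with a linear activation give the same output as one layer whose weights are the product `W2 @ W1`.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustrative data only
W1 = rng.normal(size=(4, 3))     # first layer: 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))     # second layer: 4 units -> 2 outputs
x = rng.normal(size=3)

# Two layers, each followed by the linear activation f(x) = x
two_layers = W2 @ (W1 @ x)

# One layer with the combined weight matrix
one_layer = (W2 @ W1) @ x

same = np.allclose(two_layers, one_layer)
```

Because `same` is true for any choice of weights, depth adds no expressive power without a non-linear activation.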

In [3]:
def linear(x): 
    return x 

## Sigmoid
$$
f(x) = \frac{1} {1 + e^{-x}}
$$

The output value is mapped to the range (0, 1).
However, this function suffers from the *Vanishing Gradient* problem.

> Vanishing Gradient is the phenomenon where the gradients of a NN shrink toward zero as they are propagated back through the layers. The resulting weight updates become negligible, so training resources are spent while the weights barely change.
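
A small sketch of why this happens with Sigmoid, using the standard derivative $f'(x) = f(x)(1 - f(x))$. The names `sigmoid_sketch` and `sigmoid_grad` are hypothetical, and the 10-layer figure ignores the weight factors that also enter the chain rule.

```python
import math

def sigmoid_sketch(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: f(x) * (1 - f(x))
    s = sigmoid_sketch(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0, where it equals 0.25
peak = sigmoid_grad(0.0)

# Back-propagating through 10 sigmoid layers multiplies 10 such factors,
# so even in the best case the signal is scaled by 0.25 ** 10
best_case_10_layers = peak ** 10

# Far from zero the derivative is already tiny (saturation)
far_from_zero = sigmoid_grad(10.0)
```

Since each factor is at most 0.25, the product shrinks geometrically with depth.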

In [4]:
def sigmoid(x): 
    return 1/(1 + (e**-x))

## TanH
$$
f(x) = \frac{e^x - e^{-x}} {e^x + e^{-x}}
$$ 

The Output value is mapped between -1 and 1. 

**How is this different from Sigmoid?** 
- Here, the outputs are centered around 0, which tends to improve training because weight adjustments are not all pushed in the same direction
- Negative inputs map to negative outputs instead of being squeezed toward 0

However, both Sigmoid and TanH suffer from the **Vanishing Gradient** and **Saturation Effect**
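
The saturation effect can be seen from the derivative of TanH, $f'(x) = 1 - \tanh^2(x)$. The sketch below uses `math.tanh` from the standard library; the helper name `tanh_grad` is an assumption made for illustration.

```python
import math

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

# Largest gradient is at x = 0
grad_at_zero = tanh_grad(0.0)

# For large |x| the output saturates near +/-1 and the gradient collapses
grad_saturated_pos = tanh_grad(5.0)
grad_saturated_neg = tanh_grad(-5.0)
```

The peak gradient of 1 is larger than Sigmoid's 0.25, but once the input moves away from 0 the gradient still falls toward zero.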

In [5]:
def tanH(x): 
    return (e**x - e**(-x))/ (e**x + e**(-x))

## ReLU: **Rectified Linear Unit**

$$
f(x) = \max(0, x)
$$

> The definition is the function itself: negative inputs become 0, positive inputs pass through unchanged.

*Advantages*: 
- Simple Calculation
- No vanishing gradient for positive inputs, where the gradient is constant at 1
- Better results for newer model architectures
- More economical: less computation per neuron lowers the cost of training

*Disadvantages*: 
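- Neurons can "die": a neuron whose input stays negative outputs 0, receives zero gradient, and stops learning (this can affect 20-50% of neurons)
- Highly dependent on a well-chosen learning rate
- The output is unbounded above, so activations can theoretically grow very large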
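
The "dying neuron" issue follows directly from the gradient. Below is a vectorized sketch using NumPy; `relu_vec` and `relu_grad` are hypothetical names, and treating the gradient at exactly 0 as 0 is a common convention rather than something stated in these notes.

```python
import numpy as np

def relu_vec(x):
    # Element-wise max(0, x)
    return np.maximum(0.0, np.asarray(x, dtype=float))

def relu_grad(x):
    # 1 where the input is positive, 0 elsewhere (convention at x == 0)
    return (np.asarray(x, dtype=float) > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
outputs = relu_vec(pre_activations)
grads = relu_grad(pre_activations)
```

Every negative pre-activation yields a gradient of 0, so a neuron whose inputs are always negative receives no weight updates.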

In [6]:
def relu(x): 
    return max(0, x)

## Leaky ReLU 

$$
f(x) = \max(ax, x), \quad 0 < a < 1
$$ 

Here, even if the neuron receives negative values, its output does not become zero: it is scaled by the small slope $a$, so the neuron can still generate a small gradient
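
As a sketch of that small gradient, the vectorized version below returns the slope for each input. The default `a=0.01` is a commonly used value assumed here for illustration, and the names `leaky_relu_vec` and `leaky_relu_grad` are hypothetical.

```python
import numpy as np

def leaky_relu_vec(x, a=0.01):
    x = np.asarray(x, dtype=float)
    # x for positive inputs, a * x otherwise
    return np.where(x > 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    x = np.asarray(x, dtype=float)
    # Slope is 1 for positive inputs and a (not 0) otherwise
    return np.where(x > 0, 1.0, a)

values = leaky_relu_vec([-2.0, 3.0])
slopes = leaky_relu_grad([-2.0, 3.0])
```

The negative input keeps a non-zero slope of `a`, so its weights can still be updated.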

In [7]:
def leaky_relu(a, x): 
    return max(a*x, x)

## Softmax

$$
\sigma_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}
$$
> Takes a vector and converts its values into probabilities that sum to 1, with larger values receiving larger probabilities

In [8]:
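def softMax(x: np.ndarray):
    # Shift by the max so np.exp cannot overflow; the result is unchanged
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)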
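
Two properties of the Softmax formula can be checked with a short sketch: the outputs sum to 1, and adding the same constant to every input leaves the result unchanged. The second property is why implementations commonly subtract the maximum before exponentiating, which avoids overflow for large inputs. The name `softmax_sketch` and the input values are illustrative assumptions.

```python
import numpy as np

def softmax_sketch(z):
    z = np.asarray(z, dtype=float)
    # Subtracting the max does not change the result but keeps exp() finite
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

probs = softmax_sketch([1.0, 2.0, 3.0])

# Same inputs shifted by 1000: a naive exp(1002.0) would overflow
probs_shifted = softmax_sketch([1001.0, 1002.0, 1003.0])

total = probs.sum()
shift_invariant = np.allclose(probs, probs_shifted)
```

The largest input receives the largest probability, and the ordering of the inputs is preserved in the outputs.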