# Activation Functions
Activation functions are mathematical functions used in neural networks to introduce non-linearity into the model. In the following image, the function $g$ would be the activation function:

<div style="align=center">
    <img src="media/activation_intro.png" width=800>
</div>

We need activation functions because a neural network built only from linear transformations (i.e. $g$ is the identity function, $g(x) = x$) produces an output that is a linear function of its input, regardless of the number of layers or neurons in the model. This limits the model's ability to learn complex patterns in the data. For example, in a binary classification problem, a model without non-linearity would be limited to fitting a linear decision boundary between the two classes.

## Need for Activation Functions
Activation functions are crucial in neural networks to introduce non-linearity, enabling them to learn from complex data patterns. If every neuron in a neural network were to use a linear activation function, the network would function like linear regression. Regardless of the network's depth, it could only fit linear relationships in data, limiting its utility. 

Let's simplify this concept with a one-hidden-unit network example. If a linear function is used everywhere, the output becomes a linear function of the input, equivalent to using a simple linear regression model.

This limitation arises because the composition of linear functions is itself a linear function. A multilayer neural network that uses linear activation functions in its hidden layers is therefore equivalent to linear regression or logistic regression, depending on the output layer function. The network cannot learn complex features, and the extra layers bring no benefit, so linear activation functions should not be used in hidden layers. The Rectified Linear Unit (ReLU) is a commonly recommended alternative for hidden layers. Non-linear activation functions let neural networks tackle a wider range of problems, including binary classification, regression, and multi-category classification.
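
The collapse described above can be checked numerically. The following is a minimal sketch for the one-hidden-unit case, using scalar weights; the function names and parameter values are hypothetical and chosen only for illustration.

```python
def linear_network(x, w1, b1, w2, b2):
    """One hidden unit and one output unit, both with identity activation."""
    hidden = w1 * x + b1      # hidden unit, g(z) = z
    return w2 * hidden + b2   # output unit, g(z) = z


def collapsed_model(x, w1, b1, w2, b2):
    """A single linear model with the combined weight and bias."""
    w = w2 * w1
    b = w2 * b1 + b2
    return w * x + b


# Hypothetical parameter values, used only for the demonstration.
params = dict(w1=2.0, b1=-1.0, w2=0.5, b2=3.0)

for x in (-2.0, 0.0, 1.5):
    two_layer = linear_network(x, **params)
    one_layer = collapsed_model(x, **params)
    # Both models agree on every input, so the hidden layer adds nothing.
    assert abs(two_layer - one_layer) < 1e-12
```

The same argument carries over to vectors and matrices: the product of two weight matrices is just another weight matrix.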

## Linear
A straight-line function whose activation is proportional to its input (the weighted sum computed by the neuron).

<div style="align=center">
    <img src="media/linear_act.png" width=600>
</div>

**Pros:**
- It gives a range of activations, so it is not a binary activation.
- We can connect several neurons together and, if more than one fires, take the max (or softmax) and decide based on that.

**Cons:**
- The derivative of this function is a constant, so the gradient has no relationship with the input $x$.
- Because the gradient is constant, descent always proceeds on that same constant gradient.
- If there is an error in prediction, the changes made by backpropagation are constant and do not depend on the change in the input, $\Delta x$.

## ELU
The **Exponential Linear Unit**, widely known as ELU, is a function that tends to drive the cost toward zero faster and produce more accurate results. Unlike other activation functions, ELU has an extra constant α, which should be a positive number.

ELU is very similar to ReLU except for negative inputs. Both are the identity function for non-negative inputs. For negative inputs, ELU curves smoothly and saturates toward -α, whereas ReLU is cut off sharply at zero.

<div style="align=center">
    <img src="media/elu_act.png" width=800>
</div>

**Pros:**
- For negative inputs, ELU saturates smoothly toward -α, whereas ReLU has a sharp corner at zero.
- ELU is a strong alternative to ReLU.
- Unlike ReLU, ELU can produce negative outputs.

**Cons:**
- For $x > 0$, the output is unbounded above (range $(0, \infty)$), so it can blow up the activation.
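
The section above describes ELU without giving its formula. A minimal scalar sketch, assuming the commonly used definition $\text{ELU}(x) = x$ for $x > 0$ and $\alpha (e^x - 1)$ otherwise, with $\alpha = 1$ as an assumed default:

```python
import math


def elu(x, alpha=1.0):
    """Scalar ELU: identity for positive x, alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return x
    # expm1(x) computes exp(x) - 1 accurately for x close to zero.
    return alpha * math.expm1(x)


# Positive inputs pass through unchanged.
print(elu(2.0))
# Very negative inputs saturate toward -alpha.
print(elu(-50.0, alpha=1.5))
```

In practice a vectorised version from a deep learning library would be used; the scalar form is only meant to make the two branches visible.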

## ReLU
ReLU (Rectified Linear Unit) is the most commonly used activation function in deep neural networks. It maps any negative input to zero and any positive input to its own value. Mathematically, ReLU is defined as $F(x) = \max(0, x)$. Despite its name and appearance, it's not linear, and it provides the same benefit as Sigmoid (i.e. the ability to learn nonlinear functions), often with better performance.

<div style="align=center">
    <img src="media/relu_act.png" width=800>
</div>

**Pros:**
- It mitigates the vanishing gradient problem, because its gradient is 1 for all positive inputs.
- ReLU is less computationally expensive than `tanh` and `sigmoid` because it involves simpler mathematical operations.

**Cons:**
- It is mostly suited to the hidden layers of a neural network model.
- Some units can be fragile during training and can die: a weight update can push a neuron into a state where it never activates on any data point again. In other words, ReLU can result in dead neurons.
- Put another way, for activations in the region $x<0$, the gradient is 0, so the weights are not adjusted during descent. Neurons that enter this state stop responding to variations in the error or input (because the gradient is 0, nothing changes). This is called the dying ReLU problem.
- The range of ReLU is $[0,\infty)$, which means it can blow up the activation.
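
The dying ReLU problem can be illustrated with a single neuron. This is a minimal sketch: the weight, bias, and inputs are hypothetical, and taking the gradient at exactly $x = 0$ to be 0 is an assumed convention.

```python
def relu(x):
    """Scalar ReLU: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron whose pre-activation w * x + b is negative
# for every input in this small data set.
inputs = [0.5, 1.0, 2.0]
w, b = -1.0, -0.1

outputs = [relu(w * x + b) for x in inputs]
# Gradient of the output with respect to w is relu'(w * x + b) * x.
weight_grads = [relu_grad(w * x + b) * x for x in inputs]

print(outputs)       # every output is zero
print(weight_grads)  # every gradient is zero, so w never gets updated
```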

## LeakyReLU
LeakyReLU is a variant of ReLU. Instead of being $0$ when $z<0$, a leaky ReLU has a small, non-zero, constant gradient $\alpha$ there (normally $\alpha=0.01$), so its output is $\alpha z$ for negative $z$.

<div style="align=center">
    <img src="media/leaky_relu_act.png" width=800>
</div>

**Pros:**
- Leaky ReLUs are one attempt to fix the "dying ReLU" problem by having a small negative slope (of 0.01, or so).

**Cons:**
- Because it is piecewise linear with a fixed negative slope, its gains over ReLU are not consistent, and it can lag behind Sigmoid and Tanh in some use cases.
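
A minimal scalar sketch of leaky ReLU and its gradient, using the $\alpha = 0.01$ default mentioned above (the function names are illustrative):

```python
def leaky_relu(x, alpha=0.01):
    """Scalar leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of leaky ReLU; the negative side keeps a small slope."""
    return 1.0 if x > 0 else alpha


# Unlike plain ReLU, the gradient on the negative side is not zero,
# so a neuron stuck there can still receive weight updates.
print(leaky_relu(-2.0), leaky_relu_grad(-2.0))
print(leaky_relu(3.0), leaky_relu_grad(3.0))
```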

## Sigmoid
Sigmoid takes a real value as input and outputs a value between 0 and 1. It's easy to work with and has several nice properties for an activation function: it's non-linear, continuously differentiable, monotonic, and has a fixed output range.

<div style="align=center">
    <img src="media/sigmoid_act.png" width=800>
</div>

**Pros:**
- It is nonlinear in nature. Combinations of this function are also nonlinear!
- It gives an analog activation, unlike a step function.
- It has a smooth gradient too.
- It's good for a classifier.
- The output of the activation function is always in the range $(0, 1)$, compared to $(-\infty, \infty)$ for the linear function. The activations are bounded, so they won't blow up.

**Cons:**
- Towards either end of the sigmoid function, the $Y$ values respond very little to changes in $X$.
- This gives rise to the problem of "vanishing gradients".
- Its output isn't zero-centered: since $0 < \text{output} < 1$, gradient updates can zig-zag in different directions, which makes optimization harder.
- Sigmoids saturate and kill gradients.
- The network refuses to learn further or becomes drastically slow (depending on the use case, and until the gradient computation hits floating-point limits).
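
The saturation described above is easy to see numerically. The sketch below assumes the standard definition $\sigma(x) = 1 / (1 + e^{-x})$, whose derivative is $\sigma(x)(1 - \sigma(x))$; the two-branch form is one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Scalar sigmoid, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at 0.25 for x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

Since the gradient never exceeds 0.25, multiplying many such factors through a deep stack of sigmoid layers shrinks the backpropagated signal.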

## Tanh
Tanh squashes a real-valued number to the range $(-1, 1)$. It's non-linear, but unlike Sigmoid, its output is zero-centered. For this reason, the tanh non-linearity is usually preferred to the sigmoid non-linearity in hidden layers.

<div style="align=center">
    <img src="media/tanh_act.png" width=800>
</div>

**Pros:**
- The gradient is stronger for tanh than sigmoid (derivatives are steeper).

**Cons:**
- Tanh also has the vanishing gradient problem.
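
Both points above can be checked with a short sketch. It relies on two standard identities: the derivative of tanh is $1 - \tanh^2(x)$, and $\tanh(x) = 2\sigma(2x) - 1$, i.e. tanh is a rescaled, zero-centered sigmoid. The helper names are illustrative.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x):
    """Plain scalar sigmoid, adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


# At x = 0 the tanh gradient is 1, four times the sigmoid's peak of 0.25.
print(tanh_grad(0.0))
# For large |x| the gradient still vanishes, as it does for the sigmoid.
print(tanh_grad(5.0))
# tanh is a shifted and rescaled sigmoid.
print(math.tanh(0.7), 2.0 * sigmoid(1.4) - 1.0)
```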

## Softplus
The Softplus activation function is a smooth, continuous approximation of the ReLU activation function. It maps any input value to a value in $(0, \infty)$. The Softplus function is defined as:

$$f(x) = \log (1 + \exp(x))$$

<div style="align=center">
    <img src="media/softplus_act.png" width=500>
</div>

**Pros:**
- Its output is always positive, in the range $(0, \infty)$, which can be useful in some cases.
- It is smooth and differentiable everywhere, although it costs more to compute than ReLU because it needs an exponential and a logarithm.

**Cons:**
- It is not zero-centered, which can cause convergence problems depending on the neural network architecture.
- It can be sensitive to the initial values of the weights in the network, which can affect the training process.
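
Evaluating $\log(1 + \exp(x))$ directly overflows for large $x$. The sketch below uses the equivalent form $\max(x, 0) + \log(1 + e^{-|x|})$, which is one common way to keep the computation stable; the function name is illustrative.

```python
import math


def softplus(x):
    """Scalar softplus, log(1 + exp(x)), in a numerically stable form.

    Uses the identity log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)),
    so exp is only ever called on a non-positive argument.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Close to ReLU for large |x|, but smooth around zero.
for x in (-5.0, 0.0, 5.0, 1000.0):
    print(x, softplus(x))
```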

## GELU (Gaussian Error Linear Unit)
The GELU (Gaussian Error Linear Unit) activation function is a smooth, non-linear function that is used in deep learning models. It is defined as:

$$\text{GELU}(x) = \frac{1}{2} x \left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

where $\text{erf}$ is the error function.

<div style="align=center">
    <img src="media/gelu_act.png" width=600>
</div>

**Pros:**
- It has been shown to perform well in deep neural networks, especially in natural language processing (NLP) tasks.
- It can be easily implemented in neural network architectures, although it is more expensive to evaluate than ReLU.

**Cons:**
- It may not perform as well in image recognition tasks compared to ReLU and its variants.
- It may not be as stable as other activation functions, especially when using large learning rates.
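
A minimal scalar sketch of GELU using the error function from Python's standard library. The second function is the widely used tanh-based approximation, included here as an assumption about what an implementation might use rather than something this section specifies.

```python
import math


def gelu(x):
    """Exact scalar GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh-based approximation of GELU (avoids evaluating erf)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# The two versions stay close to each other over a typical input range.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x), gelu_tanh(x))
```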




## Choosing the Right Activation Function
Selecting the appropriate activation function depends on the nature of the problem and the type of output you're predicting.

For the **output layer:**

- **Sigmoid** is best for binary classification tasks due to its probability interpretation.
- **Linear activation** is suitable for tasks like predicting stock prices, which can take on any real number (positive or negative).
- **ReLU** is ideal for predicting values that are always non-negative, such as house prices.
- **TanH** suits outputs that are bounded in $(-1, 1)$. For multi-class classification tasks, softmax is the usual choice rather than tanh.

For **hidden layers**, **ReLU** is the most commonly used due to its ability to speed up computation and learning by enabling sparse activation.
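
The guidance above can be sketched as a tiny network with a ReLU hidden layer and an output activation picked by task. The task names, the `predict` helper, and all weights are hypothetical and serve only to show where each activation sits.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


# Hypothetical mapping from task type to output-layer activation.
OUTPUT_ACTIVATIONS = {
    "binary_classification": sigmoid,     # output read as a probability
    "regression": identity,               # any real number
    "non_negative_regression": relu,      # values that are never negative
}


def predict(x, w_hidden, b_hidden, w_out, b_out, task):
    """Forward pass: scalar input, ReLU hidden layer, task-specific output."""
    hidden = [relu(w * x + b) for w, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return OUTPUT_ACTIVATIONS[task](z)


# Hypothetical weights for a network with two hidden units.
weights = dict(w_hidden=[1.0, -1.0], b_hidden=[0.0, 0.0],
               w_out=[1.0, 1.0], b_out=-1.0)

for task in OUTPUT_ACTIVATIONS:
    print(task, predict(0.5, task=task, **weights))
```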