Activation functions are mathematical functions that determine the output of a neural network node from its weighted input. They introduce non-linearity into the network, enabling it to learn complex patterns in the data. Activation functions are typically applied to the output of each neuron in the hidden and output layers; the input layer only passes the data through.
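
To see why the non-linearity matters, the following minimal NumPy sketch stacks two linear layers without an activation and checks that they collapse into a single linear map. The layer sizes, the random weights, and the omission of biases are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first layer weights (hypothetical sizes)
W2 = rng.normal(size=(2, 4))  # second layer weights
x = rng.normal(size=3)        # one input vector

# Two linear layers with no activation in between...
stacked = W2 @ (W1 @ x)
# ...are equivalent to one linear layer with weights W2 @ W1.
collapsed = (W2 @ W1) @ x
linear_layers_collapse = np.allclose(stacked, collapsed)

# Placing a non-linearity (here ReLU) between the layers breaks this
# equivalence in general, which is what lets depth add expressive power.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
```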

There are several types of activation functions commonly used in neural networks:

### 1. Sigmoid Function:

1. Equation:
$\sigma(x)=\frac{1}{1+e^{-x}}$

2. Range: (0,1)
3. Properties:
    * Smooth, continuous function
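    * Output values are squashed between 0 and 1, so they can be interpreted as probabilities
    * Suffers from the vanishing gradient problem: for inputs of large magnitude (strongly positive or strongly negative) the function saturates and its gradient approaches zero, which can slow down learning in deep networks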
4. Usage:
    * Commonly used in the output layer for binary classification problems; historically also used in hidden layers
    
![Softmax.png](attachment:Softmax.png)
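
As a rough illustration of the saturation described above, this sketch evaluates the sigmoid and its derivative $\sigma(x)(1-\sigma(x))$ at a few points. The helper names `sigmoid` and `sigmoid_grad` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
values = sigmoid(x)      # stays strictly between 0 and 1
grads = sigmoid_grad(x)  # peaks at 0.25 for x = 0, nearly 0 at +/-10
```

The gradient never exceeds 0.25, so multiplying many such factors during backpropagation shrinks the signal quickly.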

### 2. Hyperbolic Tangent (Tanh) Function:

1. Equation: $\tanh (x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$
2. Range: (−1,1)
3. Properties:
    * Similar to the sigmoid function, but output values are centered around 0
    * Also saturates for inputs of large magnitude, so it suffers from the vanishing gradient problem as well
    * Generally preferred over sigmoid for hidden layers in deep networks because the output is zero-centered, which can speed up convergence
4. Usage:
    * Commonly used in hidden layers of neural networks
    
    
![Tanh%20.png](attachment:Tanh%20.png)
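
The relationship to the sigmoid can be checked numerically: $\tanh(x) = 2\sigma(2x) - 1$. The sketch below verifies this identity and the zero-centered output on a symmetric set of sample inputs (the inputs and the helper name `tanh_grad` are assumptions for illustration).

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3.0, 3.0, 7)  # symmetric inputs: -3, -2, ..., 3
y = np.tanh(x)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
sigmoid_2x = 1.0 / (1.0 + np.exp(-2.0 * x))
matches_sigmoid_form = np.allclose(y, 2.0 * sigmoid_2x - 1.0)

# For symmetric inputs the outputs average to zero (zero-centered).
mean_output = y.mean()
```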

### 3. Rectified Linear Unit (ReLU):

1. Equation: $\operatorname{ReLU}(x)=\max(0,x)$
2. Range: [0,∞)
3. Properties:
    * Simple and computationally efficient
    * Introduces sparsity by setting negative values to zero
    * Mitigates the vanishing gradient problem for positive inputs, where the gradient is exactly 1
    * However, neurons can "die": if a neuron outputs zero for all inputs, its gradient is also zero, so its weights stop updating
4. Usage:
    * Widely used in hidden layers of deep neural networks due to its simplicity and effectiveness
    
    
![RELU.png](attachment:RELU.png)
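
A minimal sketch of ReLU and its gradient, assuming the common convention that the gradient at exactly $x=0$ is taken as 0. The function names and sample values are illustrative.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, else 0 (convention: 0 at x == 0)."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activations = relu(x)     # negative inputs become exactly zero (sparsity)
gradients = relu_grad(x)  # zero gradient wherever the unit is inactive
```

A unit whose pre-activation is negative for every training example receives a zero gradient everywhere, which is the "dying ReLU" situation described above.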

### 4. Leaky ReLU:

1. Equation: $\operatorname{LeakyReLU}(x)=\max(\alpha x, x)$ for $0<\alpha<1$ (typically $\alpha=0.01$), or equivalently $\operatorname{LeakyReLU}(x)= \begin{cases}\alpha x & \text { if } x<0 \\ x & \text { if } x \geq 0\end{cases}$
2. Range: (−∞,∞)
3. Properties:
    * Similar to ReLU but with a small slope for negative inputs (α is a hyperparameter typically set to a small value like 0.01)
    * Prevents dying ReLU problem by allowing a small gradient for negative inputs
4. Usage:
    * Used when ReLU leads to dead neurons
    
    
![Leaky%20Relu.png](attachment:Leaky%20Relu.png)
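
The following sketch implements Leaky ReLU with a fixed slope and shows that the gradient for negative inputs is small but non-zero. The default `alpha=0.01` follows the value mentioned above; the function names are chosen here for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for x >= 0, alpha otherwise (never exactly zero)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5])
activations = leaky_relu(x)
gradients = leaky_relu_grad(x)
```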

### 5. Parametric ReLU (PReLU):

1. Equation: PReLU(x) = $\begin{cases}\alpha x & \text { if } x<0 \\ x & \text { if } x \geq 0\end{cases}$
2. Range: (−∞,∞)
3. Properties:
    * Similar to Leaky ReLU but with a learnable parameter α
    * Allows the network to learn the optimal slope for negative inputs
4. Usage:
    * Used when a learnable slope for negative inputs is desired
    
 
![Parameterized%20ReLU.png](attachment:Parameterized%20ReLU.png)
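
Because α is learnable, it receives its own gradient: $\partial\,\operatorname{PReLU}/\partial\alpha = x$ for $x<0$ and $0$ otherwise. The sketch below performs one hand-written gradient step on α. The starting value, the learning rate, and the upstream gradient of ones are arbitrary assumptions for illustration, not values prescribed by the text.

```python
import numpy as np

def prelu(x, alpha):
    """PReLU: x for x >= 0, alpha * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)

def prelu_grad_alpha(x):
    """d PReLU / d alpha: x for negative inputs, 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 0.0, x)

x = np.array([-2.0, -1.0, 3.0])
alpha = 0.25               # arbitrary starting slope (assumption)
upstream = np.ones_like(x) # pretend dLoss/dOutput = 1 for every element

# Chain rule: dLoss/dalpha sums the contributions of all elements.
grad_alpha = np.sum(upstream * prelu_grad_alpha(x))

learning_rate = 0.1        # hypothetical step size
alpha_new = alpha - learning_rate * grad_alpha
```

In a real framework the optimizer performs this update; the point is only that α is trained like any other weight.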

### 6. Exponential Linear Unit (ELU):

1. Equation: ELU $(x)= \begin{cases}\alpha\left(e^x-1\right) & \text { if } x<0 \\ x & \text { if } x \geq 0\end{cases}$
2. Range: (−α,∞)
3. Properties:
    * Identical to ReLU for positive inputs; for negative inputs it follows a smooth exponential curve that saturates toward −α
    * Has a non-zero gradient for negative inputs, avoiding dying ReLU problem
    * Generally slower to compute than ReLU due to the exponential operation
4. Usage:
    * Used as an alternative to ReLU when a smooth activation function is desired   
    
    
![Exponential%20Relu.png](attachment:Exponential%20Relu.png)
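
A small sketch of ELU under the definition above, assuming $\alpha=1$ as the default. It uses `np.expm1` for $e^x-1$ and clamps the exponent input, since `np.where` evaluates both branches and a large positive input would otherwise overflow.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x >= 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # Clamp to <= 0 so the exponential branch never sees large positive values.
    negative_branch = alpha * np.expm1(np.minimum(x, 0.0))
    return np.where(x >= 0, x, negative_branch)

x = np.array([-5.0, -1.0, 0.0, 2.0])
y = elu(x, alpha=1.0)  # negative outputs stay above -alpha
```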

### 7. Softmax Function

1. Formula: $\operatorname{Softmax}\left(x_i\right)=\frac{e^{x_i}}{\sum_{j=1}^N e^{x_j}}$
2. Range: (0, 1) and the sum of all outputs is 1.
3. Properties:
    * Used in the output layer for multi-class classification problems.
    * Converts raw scores (logits) into probabilities.
4. Advantages:
    * Ensures that the output probabilities sum up to 1, making it suitable for classification tasks.
    
    
![Softmax%20.png](attachment:Softmax%20.png)
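
The softmax formula can be sketched in NumPy as follows. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an addition to the formula above, and the example logits are made up.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Softmax over `axis`, shifted by the max logit for numerical stability."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)             # non-negative, sums to 1
predicted_class = int(np.argmax(probs))
```

Without the shift, very large logits (for example 1000) would overflow in `np.exp`; with it, two equal logits still give 0.5 each.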

### 8. Swish Function

Swish $(x)=x \cdot \operatorname{sigmoid}(\beta x)$

where β is a hyperparameter (it can also be made learnable) that controls how sharply the function transitions around zero. When β=1, the Swish function simplifies to $x \cdot \operatorname{sigmoid}(x)$, which is also known as SiLU.


![SWISH%20and%20RELU.png](attachment:SWISH%20and%20RELU.png)
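
To get a feel for the role of β, the sketch below evaluates Swish at two extremes: β=0 gives the linear function $x/2$, and a large β makes Swish nearly indistinguishable from ReLU. The function name, sample inputs, and the choice of β=50 are assumptions for illustration.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x), applied element-wise."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, -0.5, 0.0, 1.5])

swish_beta_0 = swish(x, beta=0.0)       # sigmoid(0) = 0.5, so output is x / 2
swish_beta_1 = swish(x, beta=1.0)       # the beta = 1 case from the text
swish_beta_large = swish(x, beta=50.0)  # close to ReLU on these inputs
relu_reference = np.maximum(0.0, x)
```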