<a href="https://colab.research.google.com/github/abubakarkhanlakhwera/Deepl-Learing/blob/main/Activation_Function/Activation_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Activation Function

An **activation function** introduces non-linearity into a neural network. Without it, any stack of layers collapses into a single linear transformation, so the network could not learn complex relationships in data. Each neuron applies the function (for example **Sigmoid**, **ReLU**, or **Tanh**) to its weighted input sum, and the result determines how strongly the neuron is activated.
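
To make the role of non-linearity concrete, here is a minimal NumPy sketch. The matrices `W1`, `W2` and the input `x` are made-up illustrative values, not taken from any model. It shows that two linear layers with no activation between them behave exactly like one linear layer, and that inserting ReLU breaks that equivalence.

```python
import numpy as np

# Hypothetical weights and input, chosen by hand for illustration.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

# Two stacked linear layers with no activation in between...
two_linear_layers = W2 @ (W1 @ x)

# ...are equivalent to one linear layer with weights W2 @ W1.
single_layer = (W2 @ W1) @ x

# Applying ReLU between the layers changes the result,
# so the network is no longer a single linear map.
with_relu = W2 @ np.maximum(0.0, W1 @ x)

print(two_linear_layers, single_layer, with_relu)  # [0.] [0.] [1.]
```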


> https://en.wikipedia.org/wiki/Activation_function


### Sigmoid Activation Function

The **Sigmoid function** is defined as:  
\[
f(x) = \frac{1}{1 + e^{-x}}
\]

It maps any real input to the open interval **(0, 1)**, which makes it a common choice for the output layer in binary classification problems.
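
The formula above can be sketched in NumPy as follows. This is one possible implementation, not the only one: it splits the input by sign so that `np.exp` is never called on a large positive number, which is an added precaution against overflow rather than part of the definition.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid f(x) = 1 / (1 + exp(-x)) for array-like input."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1, so the direct formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

probs = sigmoid([-2.0, 0.0, 2.0])
print(probs)  # approximately [0.119, 0.5, 0.881]
```

A typical use in binary classification is to threshold the output, for example predicting the positive class when the value exceeds 0.5.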

#### ✅ **Advantages:**
- **Smooth Gradient:** Provides a smooth gradient, preventing abrupt jumps in output.  
- **Probabilistic Interpretation:** Output can be interpreted as probabilities.  
- **Works Well for Binary Classification:** Suitable for problems requiring outputs between 0 and 1.  

#### ❌ **Disadvantages:**
- **Vanishing Gradient Problem:** For inputs with large magnitude (strongly positive or strongly negative), the function saturates and its gradient approaches zero, slowing down training.  
- **Not Zero-Centered:** Outputs are always positive, so the gradients of all weights feeding a neuron share the same sign, which can lead to inefficient, zig-zagging weight updates.  
- **Computationally Expensive:** Involves exponential calculations, which can be computationally intensive.  
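
The vanishing gradient point can be checked numerically. The sketch below relies on the standard identity f'(x) = f(x)(1 - f(x)) for the Sigmoid derivative; the sample inputs in `xs` are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, 0.0, 10.0])
grads = sigmoid_grad(xs)

# The gradient peaks at 0.25 for x = 0 and is roughly 4.5e-05 at x = -10 and x = 10.
print(grads)

# Every output is positive, which is what "not zero-centered" refers to.
print(sigmoid(xs).min() > 0)  # True
```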


### Comparison of Activation Functions

| **Activation Function** | **Mathematical Formula** | **Range** | **Gradient Type** | **Advantages** | **Disadvantages** | **Best Used In** |
|--------------------------|--------------------------|-----------|-------------------|---------------|-------------------|-----------------|
| **Sigmoid** | \( f(x) = \frac{1}{1 + e^{-x}} \) | (0, 1) | **Vanishing Gradient** for large-magnitude inputs | Smooth gradient, probabilistic output | Vanishing gradient, not zero-centered | Output layer for binary classification |
| **Tanh** | \( f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \) | (-1, 1) | **Vanishing Gradient** for extreme values | Zero-centered output, stronger gradients | Vanishing gradient | Hidden layers in neural networks |
| **ReLU (Rectified Linear Unit)** | \( f(x) = \max(0, x) \) | [0, ∞) | **Non-vanishing Gradient** for positive inputs | Computationally efficient, avoids vanishing gradient | Dying ReLU problem (neurons stop activating) | Hidden layers in deep neural networks |
| **Leaky ReLU** | \( f(x) = \max(\alpha x, x) \), with \( 0 < \alpha < 1 \) | (-∞, ∞) | **Non-vanishing Gradient** for negative and positive inputs | Allows small gradients for negative values | Adds a hyperparameter \( \alpha \); gains over ReLU are not consistent | Addressing dying ReLU problem |
| **Softmax** | \( f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \) | (0, 1), outputs sum to 1 | Can saturate when one input dominates; well-behaved when paired with cross-entropy loss | Outputs probabilities, good for multi-class classification | Computationally expensive | Output layer in multi-class classification |
| **ELU (Exponential Linear Unit)** | \( f(x) = x \) if \( x > 0 \), \( f(x) = \alpha(e^x - 1) \) if \( x \leq 0 \), with \( \alpha > 0 \) | (-α, ∞) | **Non-vanishing Gradient** for positive inputs; decays toward zero for large negative inputs | Smooth around zero, non-zero gradient for negative inputs | Higher computational cost | Deep learning models |
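
The formulas in the table can be sketched in NumPy as shown below. The default values `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are assumptions chosen because they are common defaults, not values given in this document. Subtracting the maximum inside `softmax` is an added numerical-stability step that does not change the result.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha=0.01 is an assumed default slope for negative inputs.
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # alpha=1.0 is an assumed default; output is bounded below by -alpha.
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp from overflowing on the branch np.where discards.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def softmax(x):
    x = np.asarray(x, dtype=float)
    # Shifting by the max leaves the result unchanged but avoids overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.02  0.    3.  ]
print(softmax(x).sum())  # 1.0 up to floating-point rounding
```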

### Key Takeaways:
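- **Vanishing Gradient:** Sigmoid and Tanh saturate for large-magnitude inputs, so gradients shrink as they propagate back through deep networks.
- **Non-vanishing Gradient:** ReLU, Leaky ReLU, and ELU keep a gradient of 1 for positive inputs, so they propagate gradients better in deep networks. Softmax is an output-layer function rather than a hidden-layer activation.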
- **Usage:** Choose the activation function based on layer position and problem type.
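
As a rough illustration of the first two takeaways, the sketch below multiplies the activation derivative across several layers. It is a deliberate simplification: it ignores the weight matrices and assumes every layer sees the same pre-activation value. Both `depth` and `pre_activation` are hypothetical values picked for the example.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

depth = 10            # hypothetical number of layers
pre_activation = 2.0  # hypothetical input to each activation

# Product of per-layer derivatives: a crude proxy for how much
# gradient signal survives backpropagation through `depth` layers.
factors = {
    "sigmoid": float(sigmoid_grad(pre_activation) ** depth),
    "tanh": float(tanh_grad(pre_activation) ** depth),
    "relu": float(relu_grad(pre_activation) ** depth),
}

for name, factor in factors.items():
    print(f"{name}: {factor:.3e}")
```

Under these assumptions the Sigmoid and Tanh factors shrink by many orders of magnitude, while the ReLU factor stays at 1 for a positive input.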
