# **Why We Use Activation Functions in Neural Networks**

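An activation function is applied to a neuron's pre-activation (the weighted sum of its inputs plus a bias) and determines the neuron's output, that is, whether and how strongly the neuron is activated. Activation functions serve several purposes: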

1. **Non-linearity**: Activation functions introduce non-linearities into the model. Without them, a network of any depth would collapse to a single linear (affine) transformation, because the composition of linear functions is itself linear. Non-linearity allows the network to learn complex patterns.

2. **Enabling Learning of Complex Patterns**: By introducing non-linear transformations, activation functions enable neural networks to approximate complex mappings between inputs and outputs, which is crucial for tasks such as image recognition, natural language processing, and more.

3. **Controlling Output Range**: Many activation functions squash the output to a bounded range, e.g., (0, 1) for sigmoid and (-1, 1) for tanh. ReLU is bounded only from below, with range [0, ∞). Bounded outputs can help keep activations numerically stable during training.

4. **Gradients for Backpropagation**: Activation functions also affect the gradients used in backpropagation. ReLU mitigates the vanishing gradient problem because its gradient is exactly 1 for positive inputs, whereas the sigmoid's derivative is at most 0.25 and approaches zero for large positive or negative inputs, so gradients shrink as they are multiplied through many layers.

5. **Sparsity**: Some activation functions, like ReLU, introduce sparsity in the network by outputting exactly zero for negative inputs, so only a subset of neurons is active for any given input. This can make the network more efficient and can help with the interpretability of the model.
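
The non-linearity point above can be checked numerically. The following is a minimal sketch in plain Python, assuming two small hypothetical weight matrices `W1` and `W2` (biases are omitted for brevity). It shows that two stacked linear layers give the same result as one layer with weights `W2 · W1`, and that placing a ReLU between them breaks this equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Apply ReLU elementwise to a vector."""
    return [max(0.0, vi) for vi in v]


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two linear layers with no activation in between...
stacked = matvec(W2, matvec(W1, x))
# ...match a single linear layer whose weight matrix is W2 * W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the result no longer matches.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked, collapsed, with_relu)
```

Here the first layer produces a negative component, which ReLU zeroes out; that is why the non-linear network's output differs from the collapsed linear one.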

Commonly used activation functions include:

- **Sigmoid**: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
- **Tanh**: \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
- **ReLU (Rectified Linear Unit)**: \( \text{ReLU}(x) = \max(0, x) \)
- **Leaky ReLU**: \( \text{Leaky ReLU}(x) = \max(\alpha x, x) \) where \( \alpha \) is a small positive constant with \( 0 < \alpha < 1 \) (e.g., 0.01)
- **Softmax**: Typically used in the output layer for classification tasks to convert logits into probabilities.

Choosing the right activation function depends on the specific task and the architecture of the neural network. The functions listed above are described in more detail below.
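
The elementwise formulas listed above, together with their derivatives, can be sketched in plain Python as follows. This is an illustrative scalar version, not a library implementation: the function names are hypothetical, `alpha=0.01` is an assumed default, and the sigmoid branches on the sign of `x` only to avoid overflow in `math.exp`.

```python
import math


def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):  # alpha is an assumed small positive slope
    return x if x >= 0 else alpha * x


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    # The derivative is undefined at 0; returning 0 there is a convention.
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), math.tanh(x), relu(x), leaky_relu(x))
    print("  grads:", sigmoid_grad(x), tanh_grad(x), relu_grad(x), leaky_relu_grad(x))
```

Printing the gradients at `x = 10.0` makes the vanishing gradient issue visible: the sigmoid and tanh derivatives are close to zero there, while the ReLU and Leaky ReLU derivatives remain 1.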

1. **Sigmoid Activation Function**:
   - **Formula**: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
   - **Range**: (0, 1)
   - **Properties**: The sigmoid function maps any real input to a value strictly between 0 and 1. It is often used in the output layer of binary classifiers, where the output can be interpreted as a probability.
   - **Graph**: The sigmoid curve is S-shaped and asymptotically approaches 0 and 1.

2. **Tanh (Hyperbolic Tangent) Activation Function**:
   - **Formula**: \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
   - **Range**: (-1, 1)
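   - **Properties**: Tanh is similar to the sigmoid function but maps input values to a range between -1 and 1. Its outputs are zero-centered, so the activations passed to the next layer are not all of the same sign, which tends to give more balanced gradient updates than sigmoid.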
   - **Graph**: The tanh curve is also S-shaped but symmetric around the origin.

3. **ReLU (Rectified Linear Unit) Activation Function**:
   - **Formula**: \( \text{ReLU}(x) = \max(0, x) \)
   - **Range**: [0, ∞)
   - **Properties**: ReLU outputs the input directly if it is positive; otherwise, it outputs zero. This function is widely used because it helps to mitigate the vanishing gradient problem and introduces sparsity in the network.
   - **Graph**: The ReLU function is linear for positive values and flat for negative values.

4. **Leaky ReLU Activation Function**:
   - **Formula**: \( \text{Leaky ReLU}(x) = \begin{cases} 
      x & \text{if } x \ge 0 \\
      \alpha x & \text{if } x < 0 
      \end{cases} \)
   - **Range**: (-∞, ∞)
   - **Properties**: Similar to ReLU, but it allows a small gradient when the input is negative. This small slope (controlled by \(\alpha\)) keeps the gradient flowing and helps prevent the dying ReLU problem, in which a neuron outputs zero for every input and stops learning because its gradient is zero.
   - **Graph**: The Leaky ReLU function is linear for positive values and has a small slope for negative values.

5. **Softmax Activation Function**:
   - **Formula**: \( \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \)
   - **Range**: (0, 1) for each output, and the sum of all outputs is 1
   - **Properties**: The softmax function is used in the output layer of a network for multi-class classification problems. It converts logits into probabilities, with the output vector summing to 1.
   - **Graph**: The softmax function doesn't have a single graph because it operates on a vector. Each element of the output is a fraction of the exponentiated input relative to the sum of all exponentiated inputs.
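
The softmax formula above can be sketched in plain Python as follows. One assumption beyond the formula as written: the sketch subtracts the maximum logit before exponentiating. This is a common numerical-stability trick that leaves the result mathematically unchanged, since the shift cancels between numerator and denominator, but it keeps `math.exp` from overflowing on large logits. The function name and the example inputs are illustrative.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Shifting by the max does not change the ratios but avoids overflow.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Large logits would overflow a naive exp(), but work with the shift.
large = softmax([1000.0, 1001.0, 1002.0])
print(large)
```

Because softmax depends only on the differences between logits, `softmax([1000.0, 1001.0, 1002.0])` gives the same probabilities as `softmax([0.0, 1.0, 2.0])`.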