### Q1. What is an activation function in the context of artificial neural networks?
An activation function in artificial neural networks is a mathematical function applied to each neuron's pre-activation, that is, the weighted sum of its inputs plus a bias. It determines how strongly the neuron responds to that value. Its primary purpose is to introduce non-linearity into the model, enabling the network to learn and represent patterns that a purely linear model cannot.
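
As a minimal sketch of the computation described above, the snippet below forms the weighted sum plus bias and then applies an activation to it. The helper name `neuron_output` and the example weights are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Hypothetical helper: activation applied to (weights . inputs + bias)."""
    # Pre-activation: weighted sum of the inputs plus the bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function turns the pre-activation into the neuron's output.
    return activation(z)

# Example values chosen so the pre-activation is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0.
out = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
print(out)  # sigmoid(0) = 0.5
```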

### Q2. What are some common types of activation functions used in neural networks?
Some common types of activation functions include:
- **Sigmoid (Logistic)**: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
- **Hyperbolic Tangent (tanh)**: \( \text{tanh}(x) = \frac{2}{1 + e^{-2x}} - 1 \)
- **Rectified Linear Unit (ReLU)**: \( \text{ReLU}(x) = \max(0, x) \)
- **Leaky ReLU**: \( \text{Leaky ReLU}(x) = \max(\alpha x, x) \) where \( \alpha \) is a small positive constant with \( 0 < \alpha < 1 \) (e.g., 0.01)
- **Softmax**: \( \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \)
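
The formulas above translate almost directly into code. The following is a plain-Python sketch for scalar inputs (and a list of scores for softmax); the function names are assumptions chosen for illustration, and the naive `math.exp` calls do not guard against overflow for inputs of very large magnitude.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Same form as the formula in the list; math.tanh(x) gives the same value.
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # The max form is valid for 0 < alpha < 1.
    return max(alpha * x, x)

def softmax(xs):
    # Operates on a whole list of scores, unlike the element-wise functions above.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```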

### Q3. How do activation functions affect the training process and performance of a neural network?
Activation functions affect the training process and performance of a neural network in several ways:
- **Non-linearity**: They introduce non-linearity, allowing the network to learn complex patterns and functions.
- **Gradient Flow**: They affect how gradients are backpropagated through the network, impacting the learning process. Some functions can lead to issues like vanishing or exploding gradients.
- **Convergence**: The choice of activation function can influence the speed and stability of the training process.
- **Expressiveness**: Different activation functions can enhance the model’s ability to capture diverse features in the data.
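
To illustrate the non-linearity point, here is a small sketch using scalar "layers" with made-up weights. Two linear layers stacked without an activation collapse into a single linear layer, while inserting a ReLU between them produces a function that no single linear layer can match.

```python
def linear(w, b):
    """Return a scalar linear layer x -> w * x + b (illustrative helper)."""
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)

def stacked_linear(x):
    # No activation between the layers.
    return layer2(layer1(x))

# The same mapping as one layer: w = (-3)(2) = -6, b = (-3)(-1) + 0.5 = 3.5.
collapsed = linear(-6.0, 3.5)

def stacked_relu(x):
    # ReLU between the layers breaks the collapse.
    return layer2(relu(layer1(x)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, stacked_linear(x), collapsed(x), stacked_relu(x))
```

In this example `stacked_relu` is flat for inputs where the first layer's output is negative and sloped elsewhere, which is the kind of bend a linear model cannot represent.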

### Q4. How does the sigmoid activation function work? What are its advantages and disadvantages?
**Sigmoid Activation Function**:
- **Function**: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
- **Advantages**:
  - Smooth and differentiable everywhere, so the gradient changes gradually with the input rather than abruptly.
  - Output values lie strictly between 0 and 1, useful for probabilistic interpretations.
- **Disadvantages**:
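  - **Vanishing Gradient**: The derivative \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \) is at most 0.25 and approaches 0 for large \( |x| \), so gradients shrink as they are backpropagated through many layers, making it difficult for the network to learn.
  - **Slow Convergence**: Saturated units (large \( |x| \)) receive near-zero gradients, so their weights update slowly.
  - **Outputs Not Zero-Centered**: Outputs are always positive, which can make gradient updates less efficient.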
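
The vanishing gradient issue can be seen from the sigmoid derivative, \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \), which peaks at 0.25. The sketch below evaluates it at a few sample points; the final bound assumes ten stacked sigmoid layers and looks only at the activation derivatives, ignoring the weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid expressed through its own output.
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at 0 and shrinks quickly as |x| grows (saturation).
grads = {x: sigmoid_grad(x) for x in (-10.0, -2.0, 0.0, 2.0, 10.0)}
for x, g in grads.items():
    print(f"x = {x:6.1f}  sigmoid'(x) = {g:.6f}")

# Upper bound on the product of ten sigmoid derivatives: 0.25 ** 10 = 2 ** -20.
upper_bound = 0.25 ** 10
print(upper_bound)
```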

### Q5. What is the rectified linear unit (ReLU) activation function? How does it differ from the sigmoid function?
**ReLU Activation Function**:
- **Function**: \( \text{ReLU}(x) = \max(0, x) \)
- **Differences from Sigmoid**:
  - **Non-linearity**: ReLU is non-linear like sigmoid, but it is piecewise linear and passes positive inputs through unchanged instead of squashing them into a bounded range.
  - **Range**: ReLU outputs lie in \( [0, \infty) \), whereas sigmoid outputs lie in \( (0, 1) \).
  - **Gradient Issues**: ReLU can suffer from the "dying ReLU" problem (neurons that output 0 for every input and therefore receive zero gradient), but its gradient is exactly 1 for positive inputs, so it does not have the vanishing gradient problem as severely as sigmoid.
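
A short sketch of the ReLU gradient and the "dying ReLU" case follows. The weight, bias, and inputs are hypothetical values picked so that the pre-activation is negative for every input; the gradient at exactly 0 is set to 0 by convention.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (0 at x == 0 by convention).
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron z = w * x + b with a large negative bias.
w, b = 0.5, -10.0
inputs = [-2.0, 0.0, 1.0, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # every output is 0.0: the neuron is "dead" on these inputs
print(grads)    # every gradient is 0.0, so the weights receive no update
print(relu_grad(5.0), relu_grad(500.0))  # gradient stays 1.0 for positive inputs
```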

### Q6. What are the benefits of using the ReLU activation function over the sigmoid function?
- **Efficient Computation**: ReLU is computationally efficient, requiring only a thresholding at zero.
- **Mitigates Vanishing Gradient**: ReLU mitigates the vanishing gradient problem, allowing for deeper networks to be trained.
- **Sparsity**: Outputs exactly zero for negative inputs, so only a portion of neurons are active for a given input, promoting sparse representations and efficient computation.
- **Faster Convergence**: Typically leads to faster convergence in training compared to sigmoid.
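
The sparsity point can be checked with a rough experiment. The sketch below assumes, purely for illustration, that pre-activations are drawn from a zero-mean Gaussian; under that assumption roughly half of the ReLU outputs are exactly zero, while sigmoid outputs are small but never exactly zero.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)  # fixed seed so the run is repeatable
# Assumed distribution of pre-activations: standard normal (illustrative only).
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

n = len(pre_activations)
relu_zero_fraction = sum(relu(z) == 0.0 for z in pre_activations) / n
sigmoid_zero_fraction = sum(sigmoid(z) == 0.0 for z in pre_activations) / n

print(relu_zero_fraction)     # close to one half under this assumption
print(sigmoid_zero_fraction)  # no sigmoid output is exactly zero here
```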

### Q7. Explain the concept of "leaky ReLU" and how it addresses the vanishing gradient problem.
**Leaky ReLU**:
- **Function**: \( \text{Leaky ReLU}(x) = \max(\alpha x, x) \) where \( \alpha \) is a small constant (e.g., 0.01).
- **Purpose**: Allows a small, non-zero gradient when the input is negative, addressing the "dying ReLU" problem.
- **Benefits**:
  - Ensures that neurons keep a small gradient of \( \alpha \) even when inactive, which helps avoid dead neurons.
  - Keeps gradients flowing through negative pre-activations, so the weights feeding an inactive neuron still receive updates and the neuron can recover.
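
As a brief sketch, the function and its gradient can be written as below, using the example value \( \alpha = 0.01 \) from the definition. The name `leaky_relu_grad` is an illustrative assumption, and the gradient at exactly 0 is set to \( \alpha \) by convention.

```python
def leaky_relu(x, alpha=0.01):
    # Valid for 0 < alpha < 1: returns x when x > 0 and alpha * x otherwise.
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    # Slope is 1 on the positive side and alpha (not 0) on the negative side.
    return 1.0 if x > 0 else alpha

for x in (-5.0, -0.5, 0.5, 5.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the negative-side gradient is \( \alpha \) rather than 0, a neuron whose pre-activation is negative on every input still passes a small gradient back to its weights.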

### Q8. What is the purpose of the softmax activation function? When is it commonly used?
**Softmax Activation Function**:
- **Function**: \( \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \)
- **Purpose**: Converts logits (raw prediction scores) into probabilities that sum to 1.
- **Usage**: Commonly used in the output layer of multi-class classification networks, where the task is to assign an input to exactly one of several mutually exclusive classes.
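
A direct implementation of the formula can overflow for large logits, so a common variant subtracts the maximum logit before exponentiating. This leaves the result unchanged because the factor \( e^{-\max_j x_j} \) cancels between numerator and denominator. The sketch below uses that variant with example logits chosen for illustration.

```python
import math

def softmax(logits):
    # Shift by the maximum so the largest exponent is exp(0) = 1 (avoids overflow).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # probabilities, largest for the largest logit

# Large logits would overflow math.exp without the shift; here they are fine.
big = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
print(big)

# The predicted class is the index of the largest probability.
predicted = max(range(len(probs)), key=lambda i: probs[i])
```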

### Q9. What is the hyperbolic tangent (tanh) activation function? How does it compare to the sigmoid function?
**tanh Activation Function**:
- **Function**: \( \text{tanh}(x) = \frac{2}{1 + e^{-2x}} - 1 \)
- **Comparison to Sigmoid**:
  - **Output Range**: tanh outputs range between -1 and 1, whereas sigmoid outputs range between 0 and 1.
  - **Zero-Centered**: tanh outputs are zero-centered, which can lead to more efficient training compared to sigmoid’s non-zero-centered outputs.
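  - **Gradient Issues**: tanh also saturates for large \( |x| \) and therefore suffers from the vanishing gradient problem, but its derivative peaks at 1 (versus 0.25 for sigmoid) and its outputs are zero-centered, so it generally performs better in practice than sigmoid.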
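
The formula above is the sigmoid rescaled and shifted: \( \tanh(x) = 2\sigma(2x) - 1 \). The following sketch checks that identity numerically against Python's `math.tanh` and evaluates the derivative \( 1 - \tanh^2(x) \) at a few sample points; the helper names are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_formula(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, which equals 1 at x = 0.
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    via_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
    print(x, tanh_from_formula(x), via_sigmoid, math.tanh(x), tanh_grad(x))
```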