# Activation Function

Q1. What is an activation function in the context of artificial neural networks?

Ans. An activation function is a crucial component of a perceptron (artificial neuron) within a neural network. It determines the neuron's output from its input, which is typically the weighted sum of the neuron's inputs plus a bias.

The primary purpose of an activation function is to introduce non-linearity into the model. This allows neural networks to learn complex patterns and, given enough hidden units, to approximate any continuous function on a bounded domain, a property known as universal approximation.
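
To see why the non-linearity matters, the following minimal sketch stacks two one-dimensional affine layers. The weights and the helper names (`linear`, `stacked_linear`, `stacked_relu`) are illustrative assumptions, not part of any library. Without an activation, the stack collapses into a single affine layer; with ReLU in between, it does not.

```python
def linear(w, b):
    """Return a 1-D affine layer x -> w * x + b (illustrative helper)."""
    return lambda x: w * x + b

def relu(x):
    return max(x, 0.0)

layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)

def stacked_linear(x):
    # Two affine layers with no activation in between.
    return layer2(layer1(x))

# The same mapping as one layer: w = -3 * 2 = -6, b = -3 * (-1) + 0.5 = 3.5
collapsed = linear(-6.0, 3.5)

def stacked_relu(x):
    # Same two layers, but with ReLU applied between them.
    return layer2(relu(layer1(x)))

for x in (-2.0, 0.0, 1.0):
    print(x, stacked_linear(x), collapsed(x), stacked_relu(x))
```

Here `stacked_relu` is flat for small inputs and then slopes downward, a shape that no single affine layer can produce.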


Q2. What are some common types of activation functions used in neural networks?

Ans. Here are some common types of activation functions:
- **Sigmoid**: The sigmoid function produces an output in the range (0, 1). It was commonly used in the past for binary classification problems.
- **Hyperbolic Tangent (tanh)**: The tanh function produces an output in the range (-1, 1). It is similar to the sigmoid but its output is zero-centered, which generally makes training easier.
- **Rectified Linear Unit (ReLU)**: ReLU is the most widely used activation function. It returns zero if the input is less than zero and the input itself if it's positive, making it computationally efficient.
- **Leaky ReLU**: Leaky ReLU is a modification of ReLU where it allows a small gradient when the input is less than zero, preventing the "dying ReLU" problem.
- **Parametric ReLU (PReLU)**: Similar to Leaky ReLU, but with a learnable parameter for controlling the slope of the negative side.
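- **Exponential Linear Unit (ELU)**: ELU is a variant of ReLU that replaces the zero output for negative inputs with a smooth exponential curve, $\alpha(e^{x} - 1)$. Its gradient is therefore non-zero for negative inputs, which mitigates the "dying ReLU" problem.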
- **Scaled Exponential Linear Unit (SELU)**: SELU is a self-normalizing activation function, designed to maintain a fixed mean and variance throughout the network.
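
As a rough illustration of the last two entries, ELU and SELU can be sketched in plain Python as below. The function names and the default `alpha=1.0` for ELU are assumptions made for this example. The two SELU constants are the commonly published values for the self-normalizing setup.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for positive x, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Commonly published SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """SELU: a scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * elu(x, SELU_ALPHA)

for x in (-5.0, -1.0, 0.0, 1.0):
    print(x, elu(x), selu(x))
```

For large negative inputs ELU saturates towards `-alpha`, and SELU towards `-SELU_SCALE * SELU_ALPHA`, instead of being clipped to zero as in ReLU.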

Q3. How do activation functions affect the training process and performance of a neural network?

Ans. Activation functions have a significant impact on the training process and performance of a neural network:

1. **Non-linearity and Model Capacity**: Activation functions introduce non-linearity into the neural network. This non-linearity is crucial for the network's ability to learn and represent complex, non-linear relationships in data.

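2. **Gradient Flow and Vanishing/Exploding Gradients**: During the training process, gradients are used to update the network's weights through backpropagation. The shape of the activation function affects how gradients flow backward through the network. Saturating functions such as sigmoid and tanh have derivatives close to zero for inputs of large magnitude, so the gradient can shrink towards zero as it is multiplied through many layers.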
  

3. **Avoiding Dead Neurons**: Dead neurons are neurons that output zero for every input and therefore stop learning. This can occur with ReLU when a neuron's pre-activation (its weighted sum plus bias) is negative for all training inputs, because the gradient is then zero and the neuron's weights are never updated. Activation functions like Leaky ReLU or Parametric ReLU can prevent this issue by allowing a small gradient for negative inputs, ensuring that neurons continue to learn.

4. **Convergence Speed**: The choice of activation function can affect how quickly a neural network converges during training. Saturating functions such as sigmoid and tanh have very small gradients when the input is far from zero, which may lead to slower convergence. In contrast, ReLU and its variants often lead to faster convergence because their gradient stays at 1 for all positive inputs.

5. **Efficiency and Computational Resources**: The computational cost of the activation function is an essential consideration, especially when training large neural networks. ReLU needs only a comparison, whereas sigmoid and tanh require evaluating exponentials.
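
The gradient-flow point can be made concrete with a deliberately simplified sketch. It assumes every layer sees the same pre-activation and ignores the weights, so it only multiplies the local derivatives of the activation. The helper name `chained_gradient` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth

# Best case for sigmoid (x = 0) still shrinks by a factor of 4 per layer.
print(chained_gradient(sigmoid_grad, 0.0, 10))
# ReLU passes the gradient through unchanged for positive inputs...
print(chained_gradient(relu_grad, 1.0, 10))
# ...and blocks it completely for negative ones (the dead-neuron case).
print(chained_gradient(relu_grad, -1.0, 10))
```

In a real network the weight matrices also scale the gradient, so this is only an intuition for why saturating activations make deep networks harder to train.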


Q4. How does the sigmoid activation function work? What are its advantages and disadvantages?

Ans. The sigmoid function produces an output in the range (0, 1). It was commonly used in the past for binary classification problems.

Formula:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$

$$\text{where } \sigma(x) \in (0,1) \text{ and } x \in (-\infty,+\infty)$$

Advantages:

- Smooth gradient, preventing “jumps” in output values.
- Output values bound between 0 and 1, normalizing the output of each neuron.
- Clear predictions, i.e., outputs very close to 1 or 0 for inputs of large magnitude.

Disadvantages:
- Prone to vanishing gradients, because the derivative is at most 0.25 and approaches zero for inputs of large magnitude.
- Function output is not zero-centered.
- Exponentiation is relatively expensive to compute.
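
A minimal sketch of the sigmoid and its derivative follows. The branch on the sign of `x` is an implementation choice assumed here to avoid overflow in `math.exp` for large negative inputs. It is not part of the definition.

```python
import math

def sigmoid(x):
    """Sigmoid computed without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The printed derivatives show the saturation described above: the gradient is largest at `x = 0` and becomes very small at `x = ±10`.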


Q5. What is the rectified linear unit (ReLU) activation function? How does it differ from the sigmoid function?

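Ans. ReLU is the most widely used activation function. It returns zero if the input is less than zero and the input itself if it's positive, making it computationally efficient. Unlike sigmoid, whose output is bounded in (0, 1) and saturates at both ends, ReLU is unbounded above and does not saturate for positive inputs.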

Formula:

$$\mathrm{ReLU}(x) = \max(x, 0)$$

$$\text{where } \mathrm{ReLU}(x) \in [0, +\infty) \text{ and } x \in (-\infty, +\infty)$$
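
The formula translates almost directly into code. In the sketch below, the derivative at exactly `x = 0` is set to 0, which is a common convention rather than a mathematical fact, since ReLU is not differentiable there.

```python
def relu(x):
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)

def relu_grad(x):
    # Undefined at x = 0; returning 0 there is an assumed convention.
    return 1.0 if x > 0 else 0.0

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(v) for v in inputs]     # negative inputs are clipped to 0
grads = [relu_grad(v) for v in inputs]  # gradient is 1 only for positive inputs

print(outputs)
print(grads)
```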



Q6. What are the benefits of using the ReLU activation function over the sigmoid function?

Ans. Advantages of ReLU over sigmoid:

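1. ReLU is much cheaper to compute than sigmoid because it needs only a comparison, whereas sigmoid requires an exponential.
2. When the input is positive, the gradient is exactly 1, so there is no vanishing gradient problem in that region. Sigmoid, by contrast, saturates for inputs of large magnitude.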

Q7. Explain the concept of "leaky ReLU" and how it addresses the vanishing gradient problem.

Ans.
Leaky ReLU is a variation of the standard ReLU (Rectified Linear Unit) activation function. Standard ReLU sets all negative values to zero, so its gradient is also zero for negative inputs and the affected neurons stop updating. This is a localized form of the vanishing gradient problem, often called the "dying ReLU" problem. Leaky ReLU instead allows a small, non-zero gradient for negative inputs, so some gradient always flows back through the neuron. The formula for Leaky ReLU is as follows:
 
$$ 
leaky\_relu(x, \alpha) = \left\{\begin{matrix} 
x & x\geq 0 \\ 
\alpha x & x \lt 0 
\end{matrix}\right.
$$

$$\text{where } x \in (-\infty, +\infty)$$

$\alpha$ is a small positive constant, typically in the range of 0.01 to 0.3. The value of $\alpha$ determines how much the gradient "leaks" for negative inputs. A common choice is $\alpha = 0.01$.
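
A minimal sketch of the piecewise definition above, using the common default $\alpha = 0.01$. The function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for x >= 0 and alpha (not 0) for x < 0."""
    return 1.0 if x >= 0 else alpha

for x in (-5.0, -1.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient for negative inputs is `alpha` rather than zero, a neuron whose pre-activation is negative still receives a small weight update.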





Q8. What is the purpose of the softmax activation function? When is it commonly used?

Ans. The softmax activation function is primarily used in the context of multi-class classification tasks in neural networks. Its main purpose is to transform the raw output scores, often referred to as logits, from the final layer of a neural network into a probability distribution over multiple classes. The softmax function accomplishes this by exponentiating and normalizing the scores, ensuring that the output values sum to 1.

Formula for softmax:

$$S(x_j)=\frac{e^{x_j}}{\sum\limits_{k=1}^{K} e^{x_k}}, \quad \text{where } j = 1, 2, \cdots, K$$

Softmax is commonly used in the following scenarios:

1. **Multi-Class Classification**: When you have a classification problem with more than two classes and each input belongs to one specific class, such as image classification into multiple categories, natural language processing tasks like text categorization, or speech recognition.

2. **Output Layer Activation**: The softmax function is often used as the activation function in the output layer of a neural network designed for multi-class classification tasks.

3. **Probabilistic Interpretation**: If you need to obtain probability scores for each class, softmax is the appropriate choice because it ensures that class probabilities sum to 1.
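
The softmax formula can be sketched in a few lines. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick assumed here. It does not change the result, because the common factor cancels in the ratio. The example logits are made up.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Large logits would overflow math.exp without the shift.
print(softmax([1000.0, 1000.0]))
```

The largest logit always receives the largest probability, and adding the same constant to every logit leaves the output unchanged.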


Q9. What is the hyperbolic tangent (tanh) activation function? How does it compare to the sigmoid function?

Ans. The tanh function produces an output in the range (-1, 1).

Formula:

$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

$$\text{where } \tanh(x) \in (-1,1) \text{ and } x \in (-\infty,+\infty)$$

Differences with sigmoid:

- It is zero-centered, but sigmoid is not.
- It returns values in (-1, 1), whereas sigmoid returns values in (0, 1).
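
The two functions are closely related: tanh is a rescaled and shifted sigmoid, $\tanh(x) = 2\sigma(2x) - 1$. The sketch below checks this identity numerically against `math.tanh` and compares the derivatives at the origin. The helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x))

# At the origin tanh has slope 1, sigmoid only 0.25.
print(tanh_grad(0.0), sigmoid_grad(0.0))
```

The steeper slope around zero, together with the zero-centered output, is one reason tanh is often preferred over sigmoid in hidden layers.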