# Activation Functions in Deep Learning  

---

## 1. Introduction  
Activation functions introduce **non-linearity** into neural networks.  
They determine how strongly a neuron responds to its input. Without them, a stack of layers collapses into a single linear map, so the model could not learn anything beyond simple linear relationships.

---

## 2. Key Terms  

### Linear Transformation  
Before activation, a neuron computes a weighted sum:  

$$
z = w \cdot x + b
$$  

Where:  
- $x$: input vector  
- $w$: weight vector (one weight per input)  
- $b$: bias (a scalar)  

### Activation Function  
A mathematical function applied to $z$ to produce the neuron output:  

$$
a = f(z)
$$  

- Adds non-linearity.  
- Helps the network approximate complex functions.  
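
The two steps above (weighted sum, then activation) can be sketched in plain Python. The helper name `neuron_output` and the example numbers are illustrative assumptions, not taken from any library.

```python
import math

def sigmoid(z):
    """Example activation f(z); any other activation could be passed instead."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b, f):
    """Compute a = f(w . x + b) for a single neuron (hypothetical helper)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # linear transformation
    return f(z)                                        # activation

# Illustrative values: z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so a = sigmoid(0) = 0.5
a = neuron_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0, f=sigmoid)
print(a)  # 0.5
```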

### Vanishing Gradient  
- Gradients shrink as they are multiplied layer by layer during backpropagation.  
- Early layers then update very slowly (common with saturating functions such as sigmoid and tanh).  
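
A rough way to see the effect: the sigmoid derivative never exceeds 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. The sketch below assumes this simplified picture and ignores the weights entirely; `max_sigmoid_gradient_factor` is a hypothetical helper name.

```python
def max_sigmoid_gradient_factor(n_layers):
    """Upper bound on the product of n sigmoid derivatives (each is <= 0.25)."""
    return 0.25 ** n_layers

for n in (1, 5, 10):
    print(n, max_sigmoid_gradient_factor(n))
```

With 10 layers the bound is already below $10^{-6}$.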

### Exploding Gradient  
- Gradients grow very large as they are multiplied layer by layer during backpropagation.  
- Can cause unstable training (very large weight updates, diverging loss).  
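
As a toy illustration, assume every layer scales the gradient by the same constant factor greater than 1. The factor 1.5 and the depth of 50 layers are made-up numbers, and `gradient_scale` is a hypothetical helper.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Gradient magnitude after n layers that each scale it by a constant factor (toy model)."""
    return per_layer_factor ** n_layers

# A modest per-layer factor compounds quickly with depth.
print(gradient_scale(1.5, 50))  # larger than 1e8
```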

---

## 3. Types of Activation Functions  

### 3.1 Sigmoid Function  

$$
f(x) = \frac{1}{1 + e^{-x}}
$$  

- Range: $(0, 1)$  
- Smooth, S-shaped curve.  
- Often used in binary classification (output layer).  

**Pros:**  
- Maps output to probability-like values.  

**Cons:**  
- Saturates for large $|x|$, where the gradient approaches 0 (vanishing gradient problem).  
- Not zero-centered.  
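
A minimal sketch of the sigmoid and its derivative $f'(x) = f(x)(1 - f(x))$ in plain Python. Branching on the sign of `x` is one common way to keep `math.exp` from overflowing; it is a choice made for this sketch, not something the formula requires.

```python
import math

def sigmoid(x):
    """Sigmoid computed so that math.exp only ever sees a non-positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative f'(x) = f(x) * (1 - f(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_grad(0.0))  # 0.5 0.25
print(sigmoid_grad(10.0))               # tiny: the function has saturated
```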

---

### 3.2 Hyperbolic Tangent (tanh)  

$$
f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$  

- Range: $(-1, 1)$  
- Similar to sigmoid but centered at 0.  

**Pros:**  
- Zero-centered output, which usually makes it a better choice than sigmoid for hidden layers.  

**Cons:**  
- Still suffers from vanishing gradients.  
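
The sketch below checks two facts about tanh numerically: it is a rescaled sigmoid, $\tanh(x) = 2\sigma(2x) - 1$, and its derivative $1 - \tanh^2(x)$ peaks at 1 but still tends to 0 for large $|x|$. The helper names are illustrative.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

def tanh_via_sigmoid(x):
    """tanh written as a rescaled sigmoid: 2 * sigmoid(2x) - 1 (fine for moderate x)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

print(abs(tanh_via_sigmoid(0.7) - math.tanh(0.7)) < 1e-12)  # True
print(tanh_grad(0.0))   # 1.0
print(tanh_grad(10.0))  # close to 0: saturated
```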

---

### 3.3 ReLU (Rectified Linear Unit)  

$$
f(x) = \max(0, x)
$$  

- Range: $[0, \infty)$  
- Very widely used in hidden layers.  

**Pros:**  
- Computationally efficient (a single comparison, no exponentials).  
- Gradient is exactly 1 for $x > 0$, which reduces the vanishing gradient issue.  

**Cons:**  
- "Dying ReLU" problem: a neuron whose input stays negative outputs 0 with gradient 0, so it stops learning.  
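
A small sketch of ReLU, its gradient, and a "dead" neuron. The weight, bias, and inputs are made-up values chosen so that the pre-activation is negative for every input.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0; this sketch uses 0 at x == 0 by convention.
    return 1.0 if x > 0 else 0.0

# Hypothetical "dead" neuron: the bias is so negative that z < 0 for every input below.
w, b = 1.0, -10.0
inputs = [-2.0, 0.0, 1.5, 3.0]
outputs = [relu(w * x + b) for x in inputs]
grads = [relu_grad(w * x + b) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0, 0.0]
```

Because every gradient is 0, gradient descent has no signal to move `w` or `b`, so the neuron stays at 0.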

---

### 3.4 Leaky ReLU  

$$
f(x) = 
\begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$  

Where $\alpha$ is a small constant (e.g., 0.01).  

**Pros:**  
- Mitigates the dying ReLU problem by keeping a small non-zero slope for negative inputs.  

**Cons:**  
- Adds a hyperparameter ($\alpha$) that ReLU does not have.  
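
A minimal sketch of Leaky ReLU and its gradient, using the example value $\alpha = 0.01$ from above as the default.

```python
def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope is 1 for positive inputs and alpha otherwise, so it is never exactly 0."""
    return 1.0 if x > 0 else alpha

print(leaky_relu(2.0), leaky_relu_grad(2.0))      # 2.0 1.0
print(leaky_relu(-12.0), leaky_relu_grad(-12.0))  # -0.12 0.01
```

Unlike plain ReLU, a neuron with a negative pre-activation still receives a small gradient and can recover.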

---

### 3.5 Parametric ReLU (PReLU)  

$$
f(x) = 
\begin{cases} 
x & \text{if } x > 0 \\
a x & \text{if } x \leq 0
\end{cases}
$$  

Where $a$ is a learnable parameter.  

**Pros:**  
- Model learns the negative slope.  

**Cons:**  
- Adds extra parameters to train compared with Leaky ReLU.  
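
To make "learnable" concrete, the sketch below takes one gradient-descent step on $a$. It relies on $\partial f / \partial a = x$ for $x \leq 0$ and $0$ otherwise. The squared-error loss, the learning rate, and the helper names are assumptions made for this example only.

```python
def prelu(x, a):
    return x if x > 0 else a * x

def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to a: x for x <= 0, else 0."""
    return 0.0 if x > 0 else x

def update_slope(a, x, target, lr=0.1):
    """One gradient-descent step on a for the assumed loss 0.5 * (prelu(x, a) - target)**2."""
    error = prelu(x, a) - target
    return a - lr * error * prelu_grad_a(x)

# With x = -2.0 and target = -1.0 the ideal slope is 0.5; starting at 0.25, a moves toward it.
a_new = update_slope(a=0.25, x=-2.0, target=-1.0)
print(a_new)
```

For positive inputs the gradient with respect to $a$ is 0, so only negative inputs change the slope.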

---

### 3.6 Softmax Function  

$$
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}
$$  

Where $z_i$ is the score for class $i$ among $K$ classes.  

- Converts raw scores into probabilities.  
- Used in the **output layer for multi-class classification**.  

**Pros:**  
- Produces probability distribution (sum = 1).  

**Cons:**  
- Computationally heavier due to exponential calculations.  
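
A minimal sketch of softmax in plain Python. Subtracting the largest score before exponentiating is a common trick that leaves the result unchanged mathematically while keeping `math.exp` from overflowing; it is an implementation choice for this sketch rather than part of the definition.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                            # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 up to floating-point rounding
```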

---

### 3.7 Swish (Google Research)  

$$
f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
$$  

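- A smooth, non-monotonic function (it dips slightly below 0 for negative inputs before rising back toward 0).  
- Proposed by researchers at Google Brain; reported to match or outperform ReLU on several deep models.  

**Pros:**  
- Non-zero gradient for most negative inputs, which reduces the dying neuron problem.  
- Can perform better than ReLU in deeper models.  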

**Cons:**  
- More computationally expensive than ReLU.  
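
The non-monotonic shape is easy to check numerically. The sketch below evaluates Swish on a grid and locates its minimum; the grid range and step size are arbitrary choices, and the sign branch only guards against overflow in `math.exp`.

```python
import math

def swish(x):
    """x * sigmoid(x), written so math.exp never receives a large positive argument."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

# Scan x in [-3, 0] with step 0.001 and find the lowest point of the curve.
grid = [-3.0 + i * 0.001 for i in range(3001)]
x_min = min(grid, key=swish)
print(x_min, swish(x_min))
print(swish(-5.0) > swish(-1.0))  # True: the curve rises again toward 0 as x decreases
```

The minimum is about $-0.28$ near $x \approx -1.28$, so the output is bounded below but unbounded above.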

---

## 4. Rule of Thumb for Activation Functions  

- Hidden layers: ReLU (or Leaky ReLU if dying ReLU is observed).  
- Output layer (binary classification): Sigmoid.  
- Output layer (multi-class classification): Softmax.  
- For experimental deep models: Swish or advanced variants may give better results.  

---

## 5. Summary Table  

| Function   | Formula | Range | Pros | Cons |
|------------|---------|-------|------|------|
| Sigmoid    | $\frac{1}{1+e^{-x}}$ | (0,1) | Probabilistic output | Vanishing gradients |
| tanh       | $\tanh(x)$ | (-1,1) | Zero-centered | Vanishing gradients |
| ReLU       | $\max(0,x)$ | [0,∞) | Efficient, widely used | Dying neurons |
| Leaky ReLU | $\max(\alpha x, x)$ | (-∞,∞) | Mitigates dying ReLU | Extra hyperparameter $\alpha$ |
| PReLU      | $\max(0,x) + a\min(0,x)$, learnable $a$ | (-∞,∞) | Learns slope automatically | More params |
| Softmax    | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | (0,1), sum=1 | Multi-class probs | Expensive computation |
| Swish      | $x \cdot \sigma(x)$ | ≈[-0.28,∞) | Often strong results in deep models | Slower than ReLU |

---
