# **Activation Functions and Optimization Algorithms**

## **Activation Functions**

An activation function is applied to a layer's weighted inputs to produce that layer's output. Because activation functions are non-linear, stacked layers can learn patterns that a purely linear model cannot represent.

### **1. Sigmoid**
- **Equation**:  
  $$f(x) = \frac{1}{1 + e^{-x}}$$
- **Range**: (0, 1)  
- **Usage**: Output layers in binary classification.  
- **Pros**:
  - Outputs interpretable as probabilities.
- **Cons**:
  - Saturates for large $|x|$, where the gradient approaches zero (vanishing gradient problem).
  - Not zero-centered.
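
As a minimal sketch of the vanishing gradient issue, the snippet below evaluates the sigmoid and its derivative $f'(x) = f(x)(1 - f(x))$ in plain Python. The names `sigmoid` and `sigmoid_grad` are illustrative choices, not part of any library.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, branched so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 10.0):
    print(f"x={x:5.1f}  f(x)={sigmoid(x):.6f}  f'(x)={sigmoid_grad(x):.6f}")
```

The derivative peaks at 0.25 (at $x = 0$) and shrinks quickly as $|x|$ grows, which is why gradients fade when several saturated sigmoid layers are chained.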

---

### **2. Tanh (Hyperbolic Tangent)**
- **Equation**:  
  $$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
- **Range**: (-1, 1)  
- **Usage**: Hidden layers for zero-centered outputs.  
- **Pros**: Zero-centered.  
- **Cons**: Vanishing gradients for large inputs.
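
A short sketch, using only the standard library, that checks the two properties listed above: tanh is an odd function (hence zero-centered), and its derivative $1 - \tanh^2(x)$ collapses toward zero for large inputs. `tanh_grad` is a hypothetical helper name.

```python
import math

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):+.6f}  grad={tanh_grad(x):.6f}")
```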

---

### **3. ReLU (Rectified Linear Unit)**
- **Equation**:  
  $$f(x) = \max(0, x)$$
- **Range**: [0, ∞)  
- **Usage**: Hidden layers in most modern architectures.  
- **Pros**:
  - Computationally efficient.
  - Mitigates vanishing gradient issues.
- **Cons**: The gradient is zero for $x < 0$, so a neuron whose inputs stay negative stops updating (the "dying ReLU" problem).
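
The sketch below implements ReLU and its gradient in plain Python to show where dead neurons come from: the gradient is exactly zero for every negative input. The function names are illustrative, and the value chosen for the gradient at $x = 0$ is an assumed convention.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through and clamps the rest to zero."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU. It is undefined at x = 0; returning 0 there is an assumed convention."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.5):
    print(f"x={x:4.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")
```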

---

### **4. Leaky ReLU**
- **Equation**:  
  $$f(x) = \begin{cases} 
      x & \text{if } x > 0, \\
      \alpha x & \text{if } x \leq 0
   \end{cases}$$
  where $\alpha$ is a small positive constant (e.g., 0.01).  
- **Range**: (-∞, ∞)  
- **Usage**: Hidden layers, as a drop-in alternative to ReLU when dead neurons are a concern.  
- **Pros**: Allows small gradients for negative inputs.
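
A minimal sketch of the piecewise definition above, with $\alpha = 0.01$ as the default. The function names are illustrative only.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for x > 0, slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient: 1 for x > 0, alpha otherwise, so it never becomes exactly zero."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.5):
    print(f"x={x:4.1f}  f(x)={leaky_relu(x):.3f}  grad={leaky_relu_grad(x):.2f}")
```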

---

### **5. Softmax**
- **Equation**:  
  $$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^N e^{x_j}}$$
- **Range**: (0, 1), where all outputs sum to 1.  
- **Usage**: Multi-class classification outputs.  
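
The following is a sketch of softmax in plain Python. It subtracts the largest logit before exponentiating, a common numerical-stability trick that is not part of the formula above but leaves the result mathematically unchanged. The name `softmax` and the sample logits are illustrative.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Softmax over a list of logits, shifted by max(logits) to avoid overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

Without the shift, an input such as `[1000.0, 1000.0]` would overflow in `math.exp`; with it, each output is simply 0.5.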

---

### **6. Swish**
- **Equation**:  
  $$f(x) = x \cdot \text{sigmoid}(x)$$
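- **Range**: approximately [-0.278, ∞); bounded below, unbounded above.  
- **Usage**: Hidden layers in deep networks, where its smooth, non-monotonic shape can give better-behaved gradients than ReLU.  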
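
A small sketch of Swish built on a numerically stable sigmoid, using only the standard library. The helper names are illustrative. The loop scans a coarse grid to show how the function behaves on both sides of zero: it dips slightly below zero for moderately negative inputs and approaches the identity for large positive ones.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, branched so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x: float) -> float:
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  swish={swish(x):+.6f}")

# Smallest value on a grid from -10 to 10 with step 0.01.
print(min(swish(i / 100) for i in range(-1000, 1001)))
```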

---
## **Optimization Algorithms**

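Optimization algorithms minimize the loss function by updating weights and biases in the neural network. In the update rules below, $\theta$ denotes the parameters, $\eta$ the learning rate, $\nabla L(\theta)$ the gradient of the loss, and $\epsilon$ a small constant added for numerical stability.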

### **1. Gradient Descent**
- **Update Rule**:  
  $$\theta = \theta - \eta \cdot \nabla L(\theta)$$
- **Variants**:
  - Batch Gradient Descent (entire dataset per update).
  - Stochastic Gradient Descent (SGD, one sample per update).
  - Mini-Batch Gradient Descent (small batches of data).
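
To make the update rule concrete, here is a minimal sketch of plain gradient descent on a one-dimensional toy loss $L(\theta) = (\theta - 3)^2$. The loss, the learning rate, and the function names are all assumptions chosen for illustration.

```python
def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta: float, lr: float = 0.1, steps: int = 100) -> float:
    """Repeatedly applies theta = theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

print(gradient_descent(0.0))
```

With a full dataset, `grad` would average over every sample (batch), a single sample (SGD), or a small subset (mini-batch); the update line itself is the same in all three variants.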

---

### **2. Momentum**
- **Update Rule**:
  $$v_t = \beta v_{t-1} - \eta \nabla L(\theta)$$  
  $$\theta = \theta + v_t$$
- **Benefit**: Speeds up convergence and reduces oscillations.
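
A sketch of the two-line momentum update on the same kind of toy problem, $L(\theta) = (\theta - 3)^2$. The loss and the hyperparameter values ($\beta = 0.9$, $\eta = 0.1$) are assumptions for illustration.

```python
def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def momentum_descent(theta: float, lr: float = 0.1, beta: float = 0.9,
                     steps: int = 300) -> float:
    """v = beta * v - lr * grad(theta); theta = theta + v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(theta)
        theta = theta + v
    return theta

print(momentum_descent(0.0))
```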

---

### **3. RMSprop**
- **Update Rule**:
  $$G_t = \beta G_{t-1} + (1 - \beta) \nabla L(\theta)^2$$  
  $$\theta = \theta - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla L(\theta)$$
- **Benefit**: Works well for non-stationary objectives, because the exponentially decaying average $G_t$ tracks the recent gradient scale rather than the whole history.
- **Note**: A non-stationary objective is one where the data distribution or loss landscape changes during training. This is common in deep learning, since each layer's inputs shift as the other layers adjust their weights.
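
The sketch below applies the RMSprop rule to a toy loss $L(\theta) = (\theta - 3)^2$. The loss and hyperparameters ($\beta = 0.9$, $\eta = 0.01$, $\epsilon = 10^{-8}$) are assumptions for illustration; $\epsilon$ is placed inside the square root to match the formula above.

```python
import math

def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def rmsprop(theta: float, lr: float = 0.01, beta: float = 0.9,
            eps: float = 1e-8, steps: int = 2000) -> float:
    """Scales each step by a decaying average of squared gradients."""
    avg_sq_grad = 0.0
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = beta * avg_sq_grad + (1.0 - beta) * g * g
        theta = theta - lr / math.sqrt(avg_sq_grad + eps) * g
    return theta

print(rmsprop(0.0))
```

Because the gradient is divided by its own recent magnitude, the step size is governed mostly by `lr` rather than by how steep the loss happens to be.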

---

### **4. Adam (Adaptive Moment Estimation)**
- **Update Rule**:
  $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(\theta)$$  
  $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(\theta))^2$$  
  $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$  
  $$\theta = \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$$
- **Benefits**:
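  - Per-parameter adaptive learning rates.
  - Combines momentum (first moment) with RMSprop-style scaling (second moment), which often converges quickly with little tuning.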
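
A minimal sketch of Adam on a toy loss $L(\theta) = (\theta - 3)^2$, assuming the standard bias-corrected estimates $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$. The loss, the learning rate of 0.1, and the function names are assumptions for illustration; the $\beta$ and $\epsilon$ defaults are the commonly used ones.

```python
import math

def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def adam(theta: float, lr: float = 0.1, beta1: float = 0.9, beta2: float = 0.999,
         eps: float = 1e-8, steps: int = 2000) -> float:
    m = 0.0  # first moment: decaying average of gradients
    v = 0.0  # second moment: decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

print(adam(0.0))
```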

---

### **5. Nesterov Accelerated Gradient (NAG)**
- **Update Rule**:
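  $$v_t = \beta v_{t-1} - \eta \nabla L(\theta + \beta v_{t-1})$$  
  $$\theta = \theta + v_t$$
- **Benefit**: Evaluates the gradient at the look-ahead point $\theta + \beta v_{t-1}$, which lets it correct the momentum step before it overshoots.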
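
A sketch of NAG on a toy loss $L(\theta) = (\theta - 3)^2$. The only difference from plain momentum is that the gradient is evaluated at the look-ahead point $\theta + \beta v$, after which the parameters move by the new velocity. The loss and hyperparameters are assumptions for illustration.

```python
def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def nesterov(theta: float, lr: float = 0.1, beta: float = 0.9,
             steps: int = 200) -> float:
    v = 0.0
    for _ in range(steps):
        lookahead = theta + beta * v          # where momentum alone would take us
        v = beta * v - lr * grad(lookahead)   # gradient measured at that point
        theta = theta + v
    return theta

print(nesterov(0.0))
```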

---

### **6. AdaGrad**
- **Update Rule**:
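  $$G_t = G_{t-1} + \nabla L(\theta)^2$$  
  $$\theta = \theta - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla L(\theta)$$
  where $G_t$ accumulates the squared gradients of each parameter over all past steps.
- **Benefit**: Suitable for sparse data, since rarely updated parameters keep a larger effective learning rate.
- **Drawback**: Because $G_t$ only grows, the effective learning rate shrinks monotonically and can become too small to make further progress.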
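
The sketch below runs AdaGrad on a toy loss $L(\theta) = (\theta - 3)^2$ and records the effective learning rate $\eta / \sqrt{G + \epsilon}$ at each step, where $G$ is the running sum of squared gradients. The loss, hyperparameters, and function names are assumptions for illustration.

```python
import math

def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def adagrad(theta: float, lr: float = 0.5, eps: float = 1e-8, steps: int = 200):
    """Returns the final theta and the effective learning rate used at each step."""
    sum_sq_grad = 0.0
    rates = []
    for _ in range(steps):
        g = grad(theta)
        sum_sq_grad += g * g                       # accumulates, never decays
        rate = lr / math.sqrt(sum_sq_grad + eps)
        rates.append(rate)
        theta = theta - rate * g
    return theta, rates

theta, rates = adagrad(0.0)
print(theta, rates[0], rates[-1])
```

Since the accumulator never decreases, the recorded rates can only shrink, which illustrates the decay drawback noted above.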

---

### **7. AdaDelta**
- **Update Rule**:
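  $$E[\nabla L(\theta)^2]_t = \rho E[\nabla L(\theta)^2]_{t-1} + (1 - \rho) \nabla L(\theta)^2$$  
  $$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[\nabla L(\theta)^2]_t + \epsilon}} \cdot \nabla L(\theta)$$  
  $$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho) \Delta\theta_t^2$$  
  $$\theta = \theta + \Delta\theta_t$$
- **Benefit**: Eliminates the need for a global learning rate; a running average of past squared updates takes the place of $\eta$.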
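
As a hedged sketch, the code below follows the usual AdaDelta formulation, in which the numerator is a running RMS of past parameter updates rather than a learning rate $\eta$. It keeps two decaying averages, one of squared gradients and one of squared updates. The toy loss $L(\theta) = (\theta - 3)^2$, the decay $\rho = 0.95$, and $\epsilon = 10^{-6}$ are assumptions for illustration.

```python
import math

def grad(theta: float) -> float:
    """Gradient of the assumed toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def adadelta(theta: float, rho: float = 0.95, eps: float = 1e-6,
             steps: int = 5000) -> float:
    avg_sq_grad = 0.0    # decaying average of squared gradients
    avg_sq_delta = 0.0   # decaying average of squared updates
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # The step uses the update average from the previous iteration.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
        theta = theta + delta
    return theta

print(adadelta(0.0))
```

Note that no learning rate appears anywhere in the function signature. The first steps are tiny because the update average starts at zero, so progress is slow at the beginning.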

---

### **8. AMSGrad**
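- **Description**: A variant of Adam that normalizes by the running maximum of the second-moment estimate, $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$, so the effective learning rate never increases between steps.  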
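
The sketch below isolates the one piece that distinguishes AMSGrad from Adam: alongside the decaying second-moment average `v`, it keeps a running maximum `v_hat` and would normalize by that instead. The helper `second_moments`, the gradient sequence, and the small `beta2 = 0.9` (chosen so the decay is visible in a few steps) are all assumptions for illustration.

```python
def second_moments(grads: list[float], beta2: float = 0.9):
    """Returns the history of Adam's v and of AMSGrad's running maximum v_hat."""
    v = 0.0
    v_hat = 0.0
    v_history, v_hat_history = [], []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v)  # AMSGrad: never let the normalizer shrink
        v_history.append(v)
        v_hat_history.append(v_hat)
    return v_history, v_hat_history

# One large gradient followed by small ones.
v_hist, v_hat_hist = second_moments([0.1, 10.0, 0.1, 0.1, 0.1])
for v, v_hat in zip(v_hist, v_hat_hist):
    print(f"v={v:.4f}  v_hat={v_hat:.4f}")
```

After the large gradient, `v` starts decaying while `v_hat` holds its peak, so the step size under AMSGrad does not grow back as quickly as it would under Adam.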

---

### **9. LAMB (Layer-wise Adaptive Moments)**
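- **Description**: Applies an Adam-style update and rescales it per layer by a trust ratio (the norm of the layer's weights divided by the norm of its update). Used for large-batch training in distributed environments.  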

---

### **10. Cosine Annealing with Warm Restarts (SGDR)**
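- **Learning Rate Schedule**:
  $$\eta_t = \eta_{\text{min}} + 0.5 (\eta_{\text{max}} - \eta_{\text{min}}) \left(1 + \cos\left(\frac{t}{T} \pi\right)\right)$$
  where $t$ is the number of epochs since the last restart and $T$ is the length of the current cycle; at the end of a cycle the rate resets to $\eta_{\text{max}}$.
- **Note**: This is a learning rate schedule rather than an optimizer; it is typically paired with SGD or another optimizer from this list.
- **Benefit**: The periodic jumps back to a high learning rate can help the optimizer escape poor local minima and explore more of the loss landscape.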
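
A minimal sketch of the schedule, assuming fixed-length cycles so that the epochs since the last restart are simply `epoch % cycle_len`. The function name `sgdr_lr` and the default values are hypothetical.

```python
import math

def sgdr_lr(epoch: int, cycle_len: int = 10,
            eta_min: float = 0.001, eta_max: float = 0.1) -> float:
    """Cosine-annealed learning rate with a warm restart every cycle_len epochs."""
    t = epoch % cycle_len  # epochs since the last restart (fixed-length cycles assumed)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / cycle_len))

for epoch in range(0, 21, 5):
    print(epoch, round(sgdr_lr(epoch), 4))
```

Within each cycle the rate falls along a half cosine from `eta_max` toward `eta_min`, then jumps back to `eta_max` when the next cycle starts.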

---

This covers the most commonly used **activation functions** and **optimization algorithms**.
