# Optimizer Theory and Mathematical Formulations

## 1. Stochastic Gradient Descent (SGD)

### Theory
SGD is the baseline optimization algorithm: at each step it estimates the gradient of the loss on a mini-batch and moves the parameters a small step in the opposite direction.

### Mathematical Formulation
```
θ(t+1) = θ(t) - η * ∇L(θ(t))
```
Where:
- θ(t): parameters at iteration t
- η: learning rate
- ∇L(θ(t)): gradient of loss function
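
The update rule above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical quadratic loss L(θ) = Σ(θᵢ - 3)² with a closed-form gradient; the names `sgd_step` and `grad` are illustrative, and the gradient here is exact rather than a noisy mini-batch estimate.

```python
def sgd_step(theta, grad, lr):
    """One SGD update: theta(t+1) = theta(t) - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]


def grad(theta):
    # Gradient of the illustrative loss L(theta) = sum((theta_i - 3)^2).
    return [2.0 * (p - 3.0) for p in theta]


theta = [0.0, 10.0]
for _ in range(100):
    theta = sgd_step(theta, grad(theta), lr=0.1)

print(theta)  # both coordinates approach the minimizer at 3.0
```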

### Characteristics
- **Pros**: Simple, memory efficient, can escape local minima due to noise
- **Cons**: Oscillatory convergence, sensitive to learning rate, slow on ill-conditioned problems

## 2. Adam (Adaptive Moment Estimation)

### Theory
Adam combines momentum with adaptive learning rates, maintaining running averages of both gradients and squared gradients.

### Mathematical Formulation
```
m(t) = β₁ * m(t-1) + (1-β₁) * ∇L(θ(t))     [First moment]
v(t) = β₂ * v(t-1) + (1-β₂) * (∇L(θ(t)))²  [Second moment]

m̂(t) = m(t) / (1 - β₁ᵗ)  [Bias correction]
v̂(t) = v(t) / (1 - β₂ᵗ)  [Bias correction]

θ(t+1) = θ(t) - η * m̂(t) / (√v̂(t) + ε)
```
Where (typical default values):
- β₁ = 0.9 (momentum decay rate)
- β₂ = 0.999 (squared gradient decay rate)
- ε = 1e-8 (small constant for numerical stability)
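
The five equations above map almost line for line onto code. The following is a rough sketch for a single scalar parameter, assuming a hypothetical loss (θ - 3)²; `adam_step` is an illustrative name, not a library function, and the learning rate 0.01 is chosen only to keep the demo short.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (theta - 3.0)  # gradient of the illustrative loss (theta - 3)^2
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.01)

print(theta)  # should end up close to the minimizer at 3.0
```

Because the update divides m̂ by √v̂, the very first step has magnitude close to η regardless of how large the gradient is.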

### Characteristics
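- **Pros**: Fast initial convergence, per-parameter adaptive learning rates, default hyperparameters often work well without tuning
- **Cons**: Can fail to converge to the optimum in some cases; stores two extra state values (m and v) per parameter, so optimizer memory is higher than for SGD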

## 3. RMSprop (Root Mean Square Propagation)

### Theory
RMSprop adapts learning rates using an exponential moving average of squared gradients, which avoids Adagrad's monotonically shrinking learning rate.

### Mathematical Formulation
```
v(t) = γ * v(t-1) + (1-γ) * (∇L(θ(t)))²
θ(t+1) = θ(t) - η * ∇L(θ(t)) / (√v(t) + ε)
```
Where (typical default values):
- γ = 0.9 (decay rate)
- ε = 1e-8 (small constant)
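
A small sketch of the two equations above, assuming a scalar parameter and a hypothetical constant gradient. It illustrates that once v(t) has settled, the step size is close to η whether the raw gradient is tiny or huge; `rmsprop_step` and `last_step_size` are illustrative names.

```python
import math


def rmsprop_step(theta, grad, v, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter."""
    v = gamma * v + (1 - gamma) * grad ** 2
    theta = theta - lr * grad / (math.sqrt(v) + eps)
    return theta, v


def last_step_size(g, steps=200, lr=0.001):
    """Size of the final update after feeding the same gradient g repeatedly."""
    theta, v = 0.0, 0.0
    prev = theta
    for _ in range(steps):
        prev = theta
        theta, v = rmsprop_step(theta, g, v, lr=lr)
    return abs(theta - prev)


small = last_step_size(0.01)   # tiny gradient
large = last_step_size(100.0)  # huge gradient
print(small, large)            # both are close to the learning rate
```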

### Characteristics
- **Pros**: Adaptive learning rates, works well on non-stationary objectives
- **Cons**: Still requires manual tuning of learning rate

## 4. Adagrad (Adaptive Gradient Algorithm)


### Theory
Adagrad adapts the learning rate of each parameter using the accumulated sum of its squared gradients, so frequently updated parameters receive smaller learning rates and rarely updated ones keep larger ones.

### Mathematical Formulation
```
G(t) = G(t-1) + (∇L(θ(t)))²  [Cumulative squared gradients]
θ(t+1) = θ(t) - η * ∇L(θ(t)) / (√G(t) + ε)
```
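
To see why the effective learning rate keeps shrinking, the sketch below applies the update with a hypothetical constant gradient of 1.0. Under that assumption G(t) = t, so the effective rate η / √G(t) falls as 1/√t; `adagrad_step` is an illustrative name.

```python
import math


def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """One Adagrad update for a scalar parameter."""
    G = G + grad ** 2  # cumulative squared gradients, never decreases
    theta = theta - lr * grad / (math.sqrt(G) + eps)
    return theta, G


theta, G = 0.0, 0.0
effective_lr = []
for t in range(1, 101):
    theta, G = adagrad_step(theta, 1.0, G)
    effective_lr.append(0.01 / (math.sqrt(G) + 1e-8))

print(effective_lr[0], effective_lr[-1])  # the rate after 100 steps is about a tenth of the first
```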

### Characteristics
- **Pros**: Less sensitive to the initial learning rate, good for sparse features
- **Cons**: G(t) only grows, so the effective learning rate shrinks toward zero and learning may stop prematurely



# Results Analysis and Optimizer Selection

## Expected Performance Patterns

### Convergence Speed
1. **Adam/Nadam**: Fastest initial convergence due to adaptive learning rates
2. **RMSprop**: Good convergence, especially for RNNs
3. **Adamax**: Stable convergence, may be slower than Adam
4. **SGD**: Slower but steady convergence
5. **Adagrad**: May slow down significantly in later epochs

### Final Accuracy
- **Adam/Nadam**: Usually achieve high accuracy quickly
- **SGD**: May achieve good final accuracy but needs more epochs
- **RMSprop**: Balanced performance
- **Adagrad**: May plateau early due to diminishing learning rates

## How to Choose the Right Optimizer

### For Different Scenarios:

1. **Computer Vision (CNNs)**:
   - **Recommended**: Adam, SGD with momentum
   - **Why**: Adam for quick prototyping, SGD for final tuning

2. **Natural Language Processing (RNNs/Transformers)**:
   - **Recommended**: Adam, RMSprop
   - **Why**: Handle non-stationary objectives well

3. **Large Datasets**:
   - **Recommended**: SGD, Adam
   - **Why**: SGD keeps no per-parameter state beyond optional momentum; Adam converges quickly at the cost of two extra state values per parameter

4. **Small Datasets**:
   - **Recommended**: Adam, RMSprop
   - **Why**: Adaptive learning rates help with limited data

### Selection Criteria:

1. **Training Time Constraints**: Adam/Nadam for quick results
2. **Memory Constraints**: SGD for minimal memory usage
3. **Hyperparameter Sensitivity**: Adam for robustness
4. **Final Performance**: SGD with proper tuning often achieves best results
5. **Sparse Features**: Adagrad for sparse gradients

## Hyperparameter Tuning Guidelines

### Learning Rate Selection:
- **SGD**: 0.01 - 0.1
- **Adam**: 0.001 - 0.01
- **RMSprop**: 0.001 - 0.01
- **Adagrad**: 0.01 - 0.1

### Batch Size Impact:
- **Larger batches**: More stable gradients, better for SGD
- **Smaller batches**: More noise, can help escape local minima
- **Memory constraints**: Balance between performance and available memory
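
The claim that larger batches give more stable gradients can be checked with a small simulation. This is a sketch under a strong assumption: per-example gradients are modeled as hypothetical Gaussian samples with mean 1.0 and unit variance, and `batch_gradient_std` is an illustrative helper.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-example gradients: true mean 1.0 plus unit-variance noise.
per_example = [random.gauss(1.0, 1.0) for _ in range(10_000)]


def batch_gradient_std(batch_size, trials=500):
    """Standard deviation of the mini-batch mean gradient across random batches."""
    means = [
        statistics.fmean(random.sample(per_example, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


std_small = batch_gradient_std(8)
std_large = batch_gradient_std(128)
print(std_small, std_large)  # the noise shrinks roughly as 1 / sqrt(batch_size)
```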

# Potential Viva Questions and Answers

## Technical Implementation Questions

### Q1: Why do we normalize the MNIST data by dividing by 255?
**Answer**: 
- Pixel values range from 0-255 (8-bit integers)
- Normalization to [0,1] helps with:
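  - Faster, more stable convergence, since inputs on a small, consistent scale keep early activations and gradients in a reasonable range
  - Numerical stability
  - Consistent scale across features
  - Keeping inputs in the range where activations such as sigmoid and tanh are not saturated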
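
As a tiny illustration of the scaling step, here is a sketch on a hypothetical 2x2 image stored as nested lists; `normalize_pixels` is an illustrative name.

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values (0-255) to floats in [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


image = [[0, 128], [64, 255]]  # hypothetical 2x2 grayscale image
scaled = normalize_pixels(image)
print(scaled)
```

With NumPy arrays the same operation is a single elementwise division by 255.0.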

### Q2: Why use different batch sizes for different optimizers?
**Answer**:
- **SGD**: Smaller batches (32) provide more frequent updates and noise
- **Mini-batch SGD**: Larger batches (64) for more stable gradient estimates
- **Adaptive optimizers (Adam, RMSprop)**: Standard batch size (32) works well
- Trade-off between computational efficiency and gradient accuracy

### Q3: Explain the choice of loss function and why it's suitable.
**Answer**:
- **Sparse Categorical Crossentropy**: Used because:
  - Multi-class classification (10 classes)
  - Labels are integers (0-9), not one-hot encoded
  - The model's softmax output layer provides the probability distribution the loss expects
  - Mathematically: L = -log(p_true_class)
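
The formula L = -log(p_true_class) can be sketched for a single example as follows. The logits are hypothetical, and the two functions are plain-Python illustrations rather than the Keras implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: -log of the probability assigned to the integer label."""
    return -math.log(probs[label])


probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for a 3-class problem
loss_correct = sparse_categorical_crossentropy(probs, 0)  # label matches the top class
loss_wrong = sparse_categorical_crossentropy(probs, 2)    # label is the least likely class
print(loss_correct, loss_wrong)  # the loss is lower when the true class gets more probability
```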

## Theoretical Questions

### Q4: What is the vanishing gradient problem and how do optimizers address it?
**Answer**:
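- **Problem**: Gradients shrink as they are backpropagated through many layers, so early layers learn very slowly
- **Solutions**:
  - **Adam/RMSprop**: Per-parameter scaling enlarges the steps for parameters with small gradients, which mitigates the symptom but does not remove the cause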
  - **Proper initialization**: Xavier/He initialization
  - **Better activations**: ReLU instead of sigmoid/tanh

### Q5: Explain the momentum concept in optimization.
**Answer**:
- **Concept**: Use weighted average of past gradients to smooth updates
- **Benefits**:
  - Accelerates convergence in consistent directions
  - Dampens oscillations in valleys
  - Helps escape local minima
- **Mathematical**: v(t) = γ*v(t-1) + η*∇L, θ(t+1) = θ(t) - v(t)
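
The damping and acceleration effects can be seen on a small example. The sketch below assumes a hypothetical ill-conditioned quadratic L = 0.5 * (100x² + y²) and compares γ = 0 (plain gradient descent) with γ = 0.9; all names are illustrative.

```python
def momentum_step(theta, grad, velocity, lr=0.01, gamma=0.9):
    """v(t) = gamma * v(t-1) + lr * grad; theta(t+1) = theta(t) - v(t)."""
    velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    theta = [p - v for p, v in zip(theta, velocity)]
    return theta, velocity


def grad(theta):
    # Gradient of the illustrative loss L = 0.5 * (100 * x^2 + y^2).
    x, y = theta
    return [100.0 * x, 1.0 * y]


def run(gamma, steps=200):
    theta, velocity = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, velocity = momentum_step(theta, grad(theta), velocity, gamma=gamma)
    return theta


plain = run(gamma=0.0)          # reduces to plain gradient descent
with_momentum = run(gamma=0.9)
print(plain[1], with_momentum[1])  # momentum makes far more progress along the flat y direction
```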

### Q6: Why might Adam fail to converge to the optimal solution?
**Answer**:
- **Exponential moving averages**: The second-moment average has short memory, so rare large gradients are forgotten quickly and the effective learning rate can grow again
- **Learning rate scheduling**: A constant learning rate may need decay for fine-tuning
- **Second moment estimation**: Can become too large, making updates too small
- **Solution**: Use learning rate scheduling, the AMSGrad variant, or switch to SGD for final epochs

## Practical Questions

### Q7: How would you modify this code for a regression problem?
**Answer**:
- **Output layer**: Single neuron, linear activation
- **Loss function**: Mean Squared Error (MSE) or Mean Absolute Error (MAE)
- **Metrics**: MSE, MAE, R²
- **Data preprocessing**: StandardScaler for targets
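
The three metrics listed above can be computed by hand as a sanity check. A minimal sketch, using hypothetical targets and predictions; `regression_metrics` is an illustrative helper, not a library function.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


mse, mae, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(mse, mae, r2)
```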

### Q8: What if training accuracy is high but validation accuracy is low?
**Answer**:
- **Problem**: Overfitting
- **Solutions**:
  - Add dropout layers
  - Reduce model complexity
  - Add L1/L2 regularization
  - Increase training data
  - Early stopping
  - Data augmentation

### Q9: How would you implement early stopping?
**Answer**:
```python
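from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 3 epochs and restore the best weights seen.
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# "..." stands for the training inputs and targets; validation_split holds out 20% for val_loss.
model.fit(..., validation_split=0.2, callbacks=[early_stop])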
```
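
The patience logic behind the callback can be sketched without any framework. This is a simplified illustration, not the Keras implementation: it ignores options such as a minimum improvement threshold, and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; best_epoch marks the weights to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# The final 0.5 is never reached because training stops before it.
stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.62, 0.61, 0.63, 0.5])
print(stop_epoch, best_epoch)
```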

## Advanced Questions

### Q10: Explain the bias correction in Adam optimizer.
**Answer**:
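- **Problem**: m and v are initialized at zero, so their early estimates are biased toward zero
- **Solution**: Divide by (1 - β^t) where t is iteration number
- **Effect**: Rescales the early estimates to the magnitude of the actual gradients; the correction factors approach 1 as t grows, so the effect fades after the first iterations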
- **Mathematical**: m̂ = m/(1-β₁^t), v̂ = v/(1-β₂^t)
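
A short numeric check makes the bias visible. The sketch assumes a hypothetical constant gradient g = 2.0, in which case m(t) = g * (1 - β₁^t), so the corrected estimate m̂ recovers g at every step.

```python
beta1 = 0.9
g = 2.0  # hypothetical constant gradient

m = 0.0
rows = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g  # raw first moment, starts near zero
    m_hat = m / (1 - beta1 ** t)     # bias-corrected first moment
    rows.append((t, m, m_hat))
    print(t, round(m, 4), round(m_hat, 4))
```

The raw m starts at about a tenth of the true gradient and grows slowly, while m̂ stays at the true value throughout.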

### Q11: When would you use learning rate scheduling and how?
**Answer**:
- **When**: Long training, fine-tuning, avoiding overshooting minima
- **Methods**:
  - Step decay: Reduce by factor every N epochs
  - Exponential decay: Continuous exponential reduction
  - Cosine annealing: Smooth reduction following cosine function
  - Adaptive: Based on validation loss plateau
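
The first three schedules can be written as small functions of the epoch number. These are sketches with hypothetical default constants (`drop`, `every`, `k`, `lr_min`), not the exact formulas of any particular library.

```python
import math


def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)


def exponential_decay(lr0, epoch, k=0.05):
    """Continuous exponential reduction with decay constant k."""
    return lr0 * math.exp(-k * epoch)


def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Smoothly anneal from lr0 to lr_min along half a cosine period."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + cosine)


for epoch in (0, 10, 25, 50):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, total_epochs=50))
```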

### Q12: How do you handle class imbalance in this dataset?
**Answer**:
- **Check distribution**: Plot class frequencies
- **Solutions**:
  - Weighted loss function: class_weight parameter
  - Oversampling: SMOTE, random oversampling
  - Undersampling: Random undersampling
  - Focal loss: Focus on hard examples
  - Stratified sampling: Preserve class proportions across train/validation splits, and if needed sample batches so every class is represented
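
For the weighted-loss option, one common heuristic weights each class by n_samples / (n_classes * class_count). The sketch below applies it to a hypothetical imbalanced label list; `balanced_class_weights` is an illustrative helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10  # hypothetical imbalanced label list
class_weight = balanced_class_weights(labels)
print(class_weight)  # the rare class receives the larger weight
```

A dictionary of this shape, mapping class index to weight, is what the `class_weight` parameter mentioned above expects.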