# RMSProp Optimizer from Scratch

This notebook implements the RMSProp (Root Mean Square Propagation) optimizer from scratch using only NumPy. The implementation exposes the learning rate, the decay factor (beta), a numerical-stability constant (eps), and optional weight decay (L2 regularization).


## Imports


In [None]:
import numpy as np

print("Libraries imported successfully!")


## RMSProp Optimizer Implementation


In [None]:
class RMSProp:
    """
    RMSProp (Root Mean Square Propagation) optimizer implemented from scratch using NumPy.
    
    RMSProp adapts the learning rate for each parameter using an exponential moving average of
    squared gradients. This makes the algorithm effective in non-stationary settings and speeds up
    convergence by scaling down the effective learning rate for parameters with large recent gradients.
    
    Parameters:
    -----------
    lr : float, optional
        Learning rate (step size) for weight updates (default: 0.001)
    beta : float, optional
        Exponential decay rate for the moving average of squared gradients (default: 0.9)
    eps : float, optional
        Small constant added to the denominator for numerical stability (default: 1e-8)
    weight_decay : float, optional
        L2 regularization coefficient (default: 0.0)
    
    Algorithm:
    ----------
    1. Update moving average of squared gradients: v_t = β * v_{t-1} + (1 - β) * g_t^2
    2. Update parameters: w_t = w_{t-1} - lr * g_t / (√v_t + ε)
    
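    Examples:
    ---------
    >>> weights = np.array([1.0, -2.0])    # float array, updated in place
    >>> gradients = np.array([0.5, -4.0])
    >>> optimizer = RMSProp(lr=0.01, beta=0.9, eps=1e-8, weight_decay=0.01)
    >>> optimizer.step(weights, gradients)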
    """
    
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8, weight_decay=0.0):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.weight_decay = weight_decay
        
        # Dictionary to store moving average of squared gradients for each parameter set
        self.v = {}
    
    def step(self, params, grads):
        """
        Perform a single RMSProp optimization step.
        
        Parameters:
        -----------
        params : numpy.ndarray
            Model parameters (weights) to be updated in place; use a floating-point
            array, since in-place assignment into an integer array would truncate the update
        grads : numpy.ndarray
            Gradients with respect to the parameters
        
        Algorithm:
        ----------
        1. Apply weight decay if enabled: g_t = g_t + weight_decay * w_{t-1}
        2. Update moving average of squared gradients: v_t = β * v_{t-1} + (1 - β) * g_t^2
        3. Update parameters: w_t = w_{t-1} - lr * g_t / (√v_t + ε)
        """
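        # Key the optimizer state by the array's identity. The caller must keep passing the
        # same array object on every call (step() updates it in place), because id() values
        # can be reused once an array has been garbage-collected.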
        param_id = id(params)
        
        # Initialize moving average if not exists
        if param_id not in self.v:
            self.v[param_id] = np.zeros_like(params)
        
        # Apply L2 regularization (weight decay) to gradients
        if self.weight_decay > 0:
            grads = grads + self.weight_decay * params
        
        # Update moving average of squared gradients
        # v_t = β * v_{t-1} + (1 - β) * g_t^2
        self.v[param_id] = self.beta * self.v[param_id] + (1 - self.beta) * (grads ** 2)
        
        # Update parameters
        # w_t = w_{t-1} - lr * g_t / (√v_t + ε)
        params[:] = params - self.lr * grads / (np.sqrt(self.v[param_id]) + self.eps)


## Numerical Example: Parameter Updates Over 5 Steps


In [None]:
# Simple numerical example: five RMSProp steps with a constant gradient
np.random.seed(42)
W = np.random.randn(2, 2)
grads = np.random.randn(2, 2)

print("Initial weights W:")
print(W)
print("\nInitial gradients:")
print(grads)
print("\n" + "="*60 + "\n")

# Create RMSProp optimizer
rmsprop = RMSProp(lr=0.01, beta=0.9, eps=1e-8)

print("RMSProp optimizer parameters:")
print(f"  Learning rate: {rmsprop.lr}")
print(f"  Beta (decay rate): {rmsprop.beta}")
print(f"  Epsilon: {rmsprop.eps}")
print("\n" + "="*60 + "\n")

# Perform 5 optimization steps
for i in range(5):
    rmsprop.step(W, grads)
    print(f"Step {i+1}, Updated Weights:\n{W}\n")


## Detailed Example: Showing Intermediate Values (v_t and Step Sizes)


In [None]:
# Detailed example showing intermediate values (v_t and adaptive step sizes)
np.random.seed(42)
W = np.random.randn(2, 2)
grads = np.random.randn(2, 2)

print("Initial weights W:")
print(W)
print("\nInitial gradients:")
print(grads)
print("\n" + "="*70 + "\n")

rmsprop = RMSProp(lr=0.01, beta=0.9, eps=1e-8)

param_id = id(W)

for i in range(5):
    print(f"Step {i+1}:")
    print(f"  Learning rate: {rmsprop.lr}")
    print(f"  Beta: {rmsprop.beta}, Eps: {rmsprop.eps}")
    
    # Store previous values for display
    v_prev = rmsprop.v.get(param_id, np.zeros_like(W)).copy()
    W_prev = W.copy()
    
    print(f"\n  Before step():")
    print(f"    v_{i}: {v_prev}")
    print(f"    W_{i}: {W_prev}")
    print(f"    Current gradients: {grads}")
    print(f"    Gradient magnitudes: {np.abs(grads)}")
    
    # Perform step
    rmsprop.step(W, grads)
    
    # Display intermediate values
    print(f"\n  After step() - Updated moving average:")
    print(f"    v_{i+1} = beta * v_{i} + (1 - beta) * g_{i+1}^2")
    print(f"    v_{i+1} = {rmsprop.beta} * {v_prev} + {1 - rmsprop.beta:.3g} * {grads**2}")
    print(f"    v_{i+1}: {rmsprop.v[param_id]}")
    
    # Calculate adaptive step sizes
    sqrt_v = np.sqrt(rmsprop.v[param_id])
    denominator = sqrt_v + rmsprop.eps
    adaptive_lr = rmsprop.lr / denominator
    step_size = adaptive_lr * grads
    
    print(f"\n  Adaptive step size calculation:")
    print(f"    sqrt(v_{i+1}): {sqrt_v}")
    print(f"    sqrt(v_{i+1}) + eps: {denominator}")
    print(f"    Adaptive learning rate per parameter: {rmsprop.lr} / {denominator}")
    print(f"    Adaptive learning rate: {adaptive_lr}")
    print(f"    Step size: lr * g / (sqrt(v) + eps) = {step_size}")
    
    print(f"\n  Parameter update:")
    print(f"    W_{i+1} = W_{i} - lr * g_{i+1} / (sqrt(v_{i+1}) + eps)")
    print(f"    Updated W_{i+1}: {W}")
    print(f"    Change in W: {W - W_prev}")
    print("-" * 70)


## Example: Demonstrating Adaptive Learning Rates (Lower Effective Rate for High Gradients)


In [None]:
# Demonstrate how RMSProp adapts step sizes based on gradient magnitudes
np.random.seed(42)

# Create a scenario with different gradient magnitudes
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Gradients: large in first column, small in second column
grads = np.array([[10.0, 0.1],
                  [8.0,  0.2]])

print("Demonstrating adaptive step sizes:")
print(f"Initial weights W:\n{W}")
print(f"\nGradients:\n{grads}")
print(f"Gradient magnitudes:\n{np.abs(grads)}")
print(f"\nNote: Large gradients in first column, small in second column")
print("="*70 + "\n")

rmsprop = RMSProp(lr=0.1, beta=0.9, eps=1e-8)

for i in range(5):
    v_before = rmsprop.v.get(id(W), np.zeros_like(W)).copy()
    
    # Store weights before update
    W_before = W.copy()
    
    # Perform step
    rmsprop.step(W, grads)
    
    # Calculate adaptive learning rates
    sqrt_v = np.sqrt(rmsprop.v[id(W)])
    adaptive_lr = rmsprop.lr / (sqrt_v + rmsprop.eps)
    step_size = adaptive_lr * grads
    
    print(f"Step {i+1}:")
    print(f"  v (moving avg of squared gradients):\n{rmsprop.v[id(W)]}")
    print(f"  sqrt(v):\n{sqrt_v}")
    print(f"  Adaptive learning rate per parameter:\n{adaptive_lr}")
    print(f"  Step size per parameter:\n{step_size}")
    print(f"  Updated W:\n{W}")
    print(f"  Change in W:\n{W - W_before}")
    print(f"\n  Key observation: the effective learning rate is smaller where gradients are larger,")
    print(f"  so with constant gradients the step magnitudes end up nearly equal across parameters.")
    print(f"  Column 1 (high gradients): avg step size = {np.mean(np.abs(step_size[:, 0])):.6f}")
    print(f"  Column 2 (low gradients): avg step size = {np.mean(np.abs(step_size[:, 1])):.6f}")
    print("-" * 70)


## Example: RMSProp with Weight Decay


In [None]:
# Example demonstrating RMSProp with weight decay
np.random.seed(42)

W = np.random.randn(2, 2)
grads = np.random.randn(2, 2)

print("RMSProp without weight decay:")
rmsprop_no_decay = RMSProp(lr=0.01, beta=0.9, weight_decay=0.0)
W_no_decay = W.copy()

for i in range(3):
    rmsprop_no_decay.step(W_no_decay, grads)
    print(f"Step {i+1}, W:\n{W_no_decay}\n")

print("="*70 + "\n")
print("RMSProp with weight decay (weight_decay=0.01):")
rmsprop_with_decay = RMSProp(lr=0.01, beta=0.9, weight_decay=0.01)
W_with_decay = W.copy()

for i in range(3):
    rmsprop_with_decay.step(W_with_decay, grads)
    print(f"Step {i+1}, W:\n{W_with_decay}\n")


## Comparison: Different Beta Values


In [None]:
# Compare RMSProp with different beta values
np.random.seed(42)

W_1 = np.random.randn(2, 2)
W_2 = W_1.copy()  # same starting weights, so beta is the only difference between the runs
grads = np.random.randn(2, 2)

print("RMSProp with beta=0.9 (default, slower decay):")
rmsprop1 = RMSProp(lr=0.01, beta=0.9)
for i in range(3):
    rmsprop1.step(W_1, grads)
    print(f"Step {i+1}: {W_1}")

print("\n" + "="*70 + "\n")
print("RMSProp with beta=0.5 (faster decay, more responsive to recent gradients):")
rmsprop2 = RMSProp(lr=0.01, beta=0.5)
for i in range(3):
    rmsprop2.step(W_2, grads)
    print(f"Step {i+1}: {W_2}")


## Step-by-Step Explanation

### RMSProp Algorithm Overview

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm that maintains a moving average of squared gradients. This allows it to adaptively adjust the learning rate for each parameter.

### Key Components:

1. **Moving Average of Squared Gradients**:
   - $v_t = \beta v_{t-1} + (1 - \beta) g_t^2$
   - Tracks the magnitude of gradients over time
   - $\beta$ controls how quickly old information decays (typically 0.9)

2. **Parameter Update**:
   - $\theta_t = \theta_{t-1} - \alpha \frac{g_t}{\sqrt{v_t} + \epsilon}$
   - Where $\alpha$ is the learning rate and $\epsilon$ is a small constant for numerical stability
   - Divides gradients by the square root of the moving average, effectively normalizing step sizes

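3. **Weight Decay (L2 Regularization)**: If enabled, adds the L2 penalty term to the gradient before the moving average and the parameter update are computed:
   - $g_t \leftarrow g_t + \lambda \theta_{t-1}$, where $\lambda$ is the `weight_decay` coefficient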
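
The three components can be traced by hand for a single step. The sketch below is a stateless restatement of the update, not the `RMSProp` class itself: the function name `rmsprop_update` and the example values are assumptions chosen for illustration, and the function returns new arrays instead of updating in place.

```python
import numpy as np

def rmsprop_update(theta, grad, v, lr=0.01, beta=0.9, eps=1e-8, weight_decay=0.0):
    """One RMSProp update (hypothetical helper); returns new (theta, v), inputs untouched."""
    # Component 3: fold the L2 penalty into the gradient, using the pre-update parameters
    g = grad + weight_decay * theta
    # Component 1: exponential moving average of squared gradients
    v_new = beta * v + (1.0 - beta) * g ** 2
    # Component 2: scale the gradient by 1 / (sqrt(v) + eps) before stepping
    theta_new = theta - lr * g / (np.sqrt(v_new) + eps)
    return theta_new, v_new

# Illustrative values: one small and one large gradient, no history yet (v = 0)
theta = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
v = np.zeros_like(theta)

theta_new, v_new = rmsprop_update(theta, grad, v)
print("v_1     =", v_new)      # (1 - beta) * grad**2
print("theta_1 =", theta_new)
print("change  =", theta_new - theta)
```

With `v` starting at zero, both parameters move by the same magnitude on this first step even though their gradients differ by a factor of eight, because each gradient is divided by a multiple of its own absolute value.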

### Advantages:

- **Adaptive Learning Rates**: Automatically adjusts step size per parameter
- **Handles Non-Stationary Objectives**: Works well when gradients vary significantly over time
- **Smoother Convergence**: Reduces oscillations by damping high-gradient directions
- **Lower Effective Learning Rate for High Gradients**: Parameters with consistently large gradients get a smaller effective learning rate $\alpha / (\sqrt{v_t} + \epsilon)$, which keeps their updates from overshooting
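
To make the non-stationarity point more concrete, the following sketch tracks $v_t$ for a scalar gradient whose scale changes abruptly. The helper `v_trace` and the toy gradient sequence are assumptions made for illustration only; the recurrence itself is the one given under Key Components.

```python
import numpy as np

def v_trace(grad_sequence, beta):
    """Return v_t after each step for a scalar gradient sequence (hypothetical helper)."""
    v = 0.0
    trace = []
    for g in grad_sequence:
        v = beta * v + (1.0 - beta) * g ** 2
        trace.append(v)
    return np.array(trace)

# Assumed toy scenario: the gradient scale drops from 10 to 0.1 after 20 steps
grad_sequence = np.concatenate([np.full(20, 10.0), np.full(20, 0.1)])

for beta in (0.9, 0.5):
    v = v_trace(grad_sequence, beta)
    # Five steps after the change, a smaller beta has forgotten more of the old scale
    print(f"beta={beta}: v after step 20 = {v[19]:.4f}, after step 25 = {v[24]:.4f}")
```

A smaller `beta` lets $v_t$ follow the new gradient scale within a few steps, at the cost of a noisier estimate; a larger `beta` averages over a longer history and reacts more slowly.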

### Key Insight:

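The algorithm divides the gradient by the square root of the moving average of squared gradients. This means:
- If a parameter has consistently large gradients, $v_t$ will be large, making its effective learning rate $\alpha / (\sqrt{v_t} + \epsilon)$ smaller
- If a parameter has consistently small gradients, $v_t$ will be small, making its effective learning rate larger
- In both cases the step magnitude $\alpha |g_t| / (\sqrt{v_t} + \epsilon)$ stays on the order of $\alpha$; because $v_t \ge (1 - \beta) g_t^2$, it can never exceed $\alpha / \sqrt{1 - \beta}$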
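
This can be checked numerically for the constant-gradient case used throughout this notebook. Starting from $v_0 = 0$ with a fixed gradient $g$, the recurrence gives $v_t = (1 - \beta^t) g^2$, so the step magnitude should be approximately $\alpha / \sqrt{1 - \beta^t}$ regardless of $|g|$. The sketch below is a minimal check under exactly those assumptions (constant gradients, zero initial state, illustrative values); it does not describe behaviour with changing gradients.

```python
import numpy as np

lr, beta, eps = 0.1, 0.9, 1e-8
grads = np.array([10.0, 0.1])   # one large and one small gradient, held constant
v = np.zeros_like(grads)

for t in range(1, 6):
    v = beta * v + (1.0 - beta) * grads ** 2
    effective_lr = lr / (np.sqrt(v) + eps)      # differs by a factor of ~100 between parameters
    step = effective_lr * grads                 # nearly identical for both parameters
    predicted = lr / np.sqrt(1.0 - beta ** t)   # closed form for constant gradients
    print(f"t={t}: effective_lr={effective_lr}, step={step}, predicted={predicted:.6f}")
```

The effective learning rates differ widely between the two parameters while the steps coincide, and the largest step occurs at `t=1`, where it equals roughly `lr / sqrt(1 - beta)`.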
