# Optimization Theory for Machine Learning
## From Gradient Descent to Advanced Convex Optimization

Welcome to the mathematical foundation of **how machines learn**! Every time an AI system improves its performance, optimization algorithms are working behind the scenes.

### What You'll Master
By the end of this notebook, you'll understand:
1. **Optimization fundamentals** - What it means to find the "best" solution
2. **Gradient descent variants** - The workhorses of machine learning
3. **Convex optimization** - When we can guarantee finding global optima
4. **Constrained optimization** - Real-world problems with restrictions
5. **Advanced algorithms** - Modern techniques for deep learning

### Why This Matters
- **Nearly every ML algorithm** uses optimization to learn from data
- **Neural networks** train using sophisticated optimization methods
- **Understanding optimization** helps you debug and improve models
- **Convexity** tells you when a local optimum is guaranteed to be the global one

Let's optimize our understanding! 🎯

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from scipy.optimize import minimize, minimize_scalar
from scipy.optimize import differential_evolution
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style for beautiful plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

print("🎯 Optimization toolkit loaded successfully!")
print("Ready to find the best solutions!")

## 1. Optimization Fundamentals: The Quest for the Best

### What is Optimization?
**Optimization** is the mathematical process of finding the **best solution** from a set of possible solutions.

**Mathematical Formulation**:
```
minimize   f(x)
subject to x ∈ D
```
Where:
- `f(x)` is the **objective function** (what we want to minimize)
- `x` is the **decision variable** (what we can control)
- `D` is the **feasible region** (allowed values of x)
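
To make the formulation concrete, here is a minimal sketch using `scipy.optimize.minimize_scalar`. The objective `f(x) = (x - 2)²` and the feasible region `D = [0, 1]` are illustrative assumptions chosen for this example, not something the rest of the notebook relies on.

```python
from scipy.optimize import minimize_scalar

def f(x):
    """Illustrative objective: a parabola whose unconstrained minimum is at x = 2."""
    return (x - 2.0) ** 2

# No feasible region: the solver is free to move anywhere on the real line.
unconstrained = minimize_scalar(f)

# Feasible region D = [0, 1]: the unconstrained minimum lies outside D,
# so the best feasible point sits on the boundary near x = 1.
constrained = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")

print(f"unconstrained minimizer: {unconstrained.x:.4f}")
print(f"minimizer within D:      {constrained.x:.4f}")
```

The same objective gives different answers depending on `D`, which is why the feasible region is part of the problem statement and not an afterthought.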

### Types of Optimization Problems

1. **Unconstrained**: No restrictions on x
2. **Constrained**: x must satisfy certain conditions
3. **Convex**: Every local minimum is also a global minimum (the holy grail!)
4. **Non-convex**: May have multiple local optima
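
One way to get a feel for the convex versus non-convex distinction is a numerical spot check of the midpoint inequality `f((a+b)/2) <= (f(a) + f(b))/2`. The helper name `looks_convex` and the sampling range are assumptions for this sketch. Passing the check does not prove convexity; a single failure does prove the function is not convex.

```python
import numpy as np

def looks_convex(f, dim=2, trials=10_000, seed=0):
    """Return False if any sampled pair violates the midpoint inequality."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(-3, 3, size=(trials, dim))
    b = rng.uniform(-3, 3, size=(trials, dim))
    midpoint_value = f((a + b) / 2)
    chord_value = (f(a) + f(b)) / 2
    # Small tolerance absorbs floating-point rounding.
    return bool(np.all(midpoint_value <= chord_value + 1e-12))

def bowl(p):
    """f(x, y) = x^2 + y^2, evaluated row-wise on an (n, 2) array."""
    return p[:, 0] ** 2 + p[:, 1] ** 2

def saddle(p):
    """f(x, y) = x^2 - y^2, evaluated row-wise on an (n, 2) array."""
    return p[:, 0] ** 2 - p[:, 1] ** 2

print("bowl passes midpoint check:  ", looks_convex(bowl))
print("saddle passes midpoint check:", looks_convex(saddle))
```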

### Real-World Analogy: Finding the Lowest Point
Imagine you're blindfolded on a hilly landscape and need to find the lowest valley:
- **Local minimum**: Lowest point in your immediate area
- **Global minimum**: The absolute lowest point in the entire landscape
- **Gradient**: The slope that tells you which direction is downhill
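
The analogy can be played out in one dimension. The sketch below uses an illustrative function `f(x) = x⁴ - 3x² + x` (an assumption for this example) that has two valleys, and walks downhill from two different starting points with a fixed step size.

```python
def f(x):
    """Illustrative 1D landscape with two valleys of different depth."""
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f: the slope that tells us which way is downhill."""
    return 4 * x**3 - 6 * x + 1

def descend(x, alpha=0.01, steps=2000):
    """Follow the negative slope for a fixed number of small steps."""
    for _ in range(steps):
        x -= alpha * df(x)
    return x

x_left = descend(-2.0)   # starts on the left slope
x_right = descend(2.0)   # starts on the right slope

print(f"start -2.0 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
print(f"start  2.0 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
```

Both runs stop where the slope is zero, but the left valley is deeper than the right one. The walker that started on the right has found a local minimum and has no way of knowing a better one exists.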

In [None]:
# Let's visualize different types of optimization landscapes

def create_optimization_landscapes():
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Create coordinate grids
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    
    # 1. Convex function (bowl shape) - EASY to optimize
    Z1 = X**2 + Y**2
    ax1 = axes[0, 0]
    contour1 = ax1.contour(X, Y, Z1, levels=15, alpha=0.6)
    ax1.clabel(contour1, inline=True, fontsize=8)
    ax1.scatter([0], [0], color='red', s=100, zorder=5, marker='*')
    ax1.set_title('Convex Function: f(x,y) = x² + y²\n✅ One Global Minimum')
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.grid(True, alpha=0.3)
    
    # 2. Non-convex with multiple local minima - HARD to optimize
    Z2 = (X**2 + Y**2) * np.exp(-(X**2 + Y**2)) + 0.1 * np.sin(5*X) * np.sin(5*Y)
    ax2 = axes[0, 1]
    contour2 = ax2.contour(X, Y, Z2, levels=20, alpha=0.6)
    ax2.clabel(contour2, inline=True, fontsize=8)
    ax2.set_title('Non-Convex Function\n⚠️ Multiple Local Minima')
    ax2.set_xlabel('x')
    ax2.set_ylabel('y')
    ax2.grid(True, alpha=0.3)
    
    # 3. Saddle point - TRICKY to optimize
    Z3 = X**2 - Y**2
    ax3 = axes[1, 0]
    contour3 = ax3.contour(X, Y, Z3, levels=15, alpha=0.6)
    ax3.clabel(contour3, inline=True, fontsize=8)
    ax3.scatter([0], [0], color='orange', s=100, zorder=5, marker='s')
    ax3.set_title('Saddle Point: f(x,y) = x² - y²\n🏔️ Not a Minimum or Maximum')
    ax3.set_xlabel('x')
    ax3.set_ylabel('y')
    ax3.grid(True, alpha=0.3)
    
    # 4. Rosenbrock function - CLASSIC optimization benchmark
    a, b = 1, 100
    Z4 = (a - X)**2 + b * (Y - X**2)**2
    Z4_log = np.log(Z4 + 1)  # Log scale for better visualization
    ax4 = axes[1, 1]
    contour4 = ax4.contour(X, Y, Z4_log, levels=20, alpha=0.6)
    ax4.clabel(contour4, inline=True, fontsize=8)
    ax4.scatter([1], [1], color='red', s=100, zorder=5, marker='*')
    ax4.set_title('Rosenbrock Function (log scale)\n🌹 Classic Optimization Challenge')
    ax4.set_xlabel('x')
    ax4.set_ylabel('y')
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print insights
    print("🎯 Optimization Landscape Types:")
    print("1. Convex: Gradient descent with a suitable step size converges to the global minimum")
    print("2. Non-convex: May get stuck in local minima")
    print("3. Saddle points: The gradient vanishes, so progress can stall")
    print("4. Rosenbrock: A narrow curved valley that tests algorithm robustness")

create_optimization_landscapes()

## 2. Gradient Descent: The Workhorse of Machine Learning

### The Big Idea
**Gradient descent** is like rolling a ball down a hill - it naturally finds the bottom by following the steepest path downward.

### The Algorithm
```
1. Start at random point x₀
2. Compute gradient ∇f(x)
3. Take step: x_{new} = x_{old} - α∇f(x)
4. Repeat until convergence
```
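
The four steps above can be sketched in a few lines of NumPy. The function name `gradient_descent`, the tolerance, and the quadratic test objective `f(x) = ||x - c||²` with `c = (3, -1)` are assumptions made for this illustration.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Repeat x <- x - alpha * grad(x) until the gradient is nearly zero."""
    x = np.asarray(x0, dtype=float)      # step 1: starting point
    for step in range(max_iter):
        g = grad(x)                      # step 2: compute the gradient
        if np.linalg.norm(g) < tol:      # step 4: stop once converged
            break
        x = x - alpha * g                # step 3: move downhill
    return x, step

# Illustrative objective f(x) = ||x - c||^2, whose gradient is 2 (x - c).
c = np.array([3.0, -1.0])

def grad_f(x):
    return 2 * (x - c)

x_min, steps = gradient_descent(grad_f, x0=[0.0, 0.0])
print(f"minimizer ~ {x_min}, found after {steps} steps")
```

Checking the gradient norm before taking the step means the loop never moves away from a point that already satisfies the stopping rule.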

### Key Components
- **Learning Rate (α)**: How large a step to take at each iteration
  - Too small → slow convergence
  - Too large → overshoots the minimum and may diverge
  - Just right → efficient convergence
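
A tiny experiment shows all three regimes. On `f(x) = x²` the gradient is `2x`, so each update multiplies `x` by `(1 - 2α)`. The three values of α below are illustrative picks, not recommended defaults.

```python
def run(alpha, x0=1.0, steps=50):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # gradient of x**2 is 2x
    return x

for alpha, label in [(0.01, "too small"), (0.4, "about right"), (1.1, "too large")]:
    final = run(alpha)
    print(f"alpha = {alpha:<4} ({label:>11}): |x| after 50 steps = {abs(final):.3e}")
```

With α = 0.01 the factor is 0.98, so 50 steps still leave `x` far from zero. With α = 0.4 the factor is 0.2 and `x` collapses quickly. With α = 1.1 the factor is -1.2, so `x` flips sign and grows on every step.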

### Variants of Gradient Descent

1. **Batch Gradient Descent**: Uses entire dataset
2. **Stochastic Gradient Descent (SGD)**: Uses one sample at a time
3. **Mini-batch Gradient Descent**: Uses small batches
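4. **Momentum**: Accumulates past gradients into a velocity term, which damps oscillation and can carry the iterate through flat regions and shallow local minima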
5. **Adam**: Adaptive learning rates (very popular!)
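
The first three variants differ only in how much data feeds each gradient estimate. The sketch below fits a small linear regression with one function whose `batch_size` argument selects the variant. The synthetic data, the name `minibatch_gd`, and the hyperparameters are all assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, lr=0.02, epochs=200, seed=0):
    """Minimize mean squared error of X @ w against y using shuffled batches."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            grad = 2 * X[batch].T @ residual / len(batch)
            w -= lr * grad
    return w

# Synthetic regression problem with known weights.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

w_batch = minibatch_gd(X, y, batch_size=200)  # batch: whole dataset per step
w_sgd = minibatch_gd(X, y, batch_size=1)      # SGD: one sample per step
w_mini = minibatch_gd(X, y, batch_size=32)    # mini-batch: small groups

print("batch     :", w_batch)
print("SGD       :", w_sgd)
print("mini-batch:", w_mini)
```

All three should land near `true_w` on this easy problem. What changes is the trade-off: batch steps are exact but each one touches every sample, while SGD steps are cheap but noisy, and mini-batches sit in between.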

In [None]:
# Implementation of various gradient descent algorithms

class OptimizationAlgorithms:
    """Collection of optimization algorithms for comparison"""
    
    @staticmethod
    def function_2d(x, y):
        """Test function: f(x,y) = x² + y² (simple convex)"""
        return x**2 + y**2
    
    @staticmethod
    def gradient_2d(x, y):
        """Gradient of test function"""
        return np.array([2*x, 2*y])
    
    @staticmethod
    def rosenbrock(x, y, a=1, b=100):
        """Rosenbrock function - classic optimization benchmark"""
        return (a - x)**2 + b * (y - x**2)**2
    
    @staticmethod
    def rosenbrock_gradient(x, y, a=1, b=100):
        """Gradient of Rosenbrock function"""
        dx = -2*(a - x) - 4*b*x*(y - x**2)
        dy = 2*b*(y - x**2)
        return np.array([dx, dy])
    
    def gradient_descent(self, start_x, start_y, learning_rate=0.01, max_iter=1000, func_type='simple'):
        """Basic gradient descent implementation"""
        path = [(start_x, start_y)]
        x, y = start_x, start_y
        
        for i in range(max_iter):
            if func_type == 'simple':
                grad = self.gradient_2d(x, y)
            else:  # rosenbrock
                grad = self.rosenbrock_gradient(x, y)
            
            # Update parameters
            x -= learning_rate * grad[0]
            y -= learning_rate * grad[1]
            path.append((x, y))
            
            # Check convergence
            if np.linalg.norm(grad) < 1e-6:
                break
                
        return np.array(path)
    
    def momentum_gradient_descent(self, start_x, start_y, learning_rate=0.01, momentum=0.9, max_iter=1000, func_type='simple'):
        """Gradient descent with momentum"""
        path = [(start_x, start_y)]
        x, y = start_x, start_y
        vx, vy = 0, 0  # Velocity terms
        
        for i in range(max_iter):
            if func_type == 'simple':
                grad = self.gradient_2d(x, y)
            else:  # rosenbrock
                grad = self.rosenbrock_gradient(x, y)
            
            # Update velocity (momentum)
            vx = momentum * vx - learning_rate * grad[0]
            vy = momentum * vy - learning_rate * grad[1]
            
            # Update parameters
            x += vx
            y += vy
            path.append((x, y))
            
            # Check convergence
            if np.linalg.norm(grad) < 1e-6:
                break
                
        return np.array(path)
    
    def adam_optimizer(self, start_x, start_y, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=1000, func_type='simple'):
        """Adam optimizer implementation"""
        path = [(start_x, start_y)]
        x, y = start_x, start_y
        m1, m2 = 0.0, 0.0  # First-moment (mean) estimates for x and y
        v1, v2 = 0.0, 0.0  # Second-moment (uncentered variance) estimates for x and y
        
        for t in range(1, max_iter + 1):
            if func_type == 'simple':
                grad = self.gradient_2d(x, y)
            else:  # rosenbrock
                grad = self.rosenbrock_gradient(x, y)
            
            # Update biased first moment estimate
            m1 = beta1 * m1 + (1 - beta1) * grad[0]
            m2 = beta1 * m2 + (1 - beta1) * grad[1]
            
            # Update biased second raw moment estimate
            v1 = beta2 * v1 + (1 - beta2) * grad[0]**2
            v2 = beta2 * v2 + (1 - beta2) * grad[1]**2
            
            # Compute bias-corrected first moment estimate
            m1_hat = m1 / (1 - beta1**t)
            m2_hat = m2 / (1 - beta1**t)
            
            # Compute bias-corrected second raw moment estimate
            v1_hat = v1 / (1 - beta2**t)
            v2_hat = v2 / (1 - beta2**t)
            
            # Update parameters
            x -= learning_rate * m1_hat / (np.sqrt(v1_hat) + epsilon)
            y -= learning_rate * m2_hat / (np.sqrt(v2_hat) + epsilon)
            path.append((x, y))
            
            # Check convergence
            if np.linalg.norm(grad) < 1e-6:
                break
                
        return np.array(path)

# Test the algorithms
optimizer = OptimizationAlgorithms()
print("🚀 Optimization algorithms ready for testing!")