# Real Analysis Fundamentals for Machine Learning
## Convergence Theory and Function Approximation

Welcome to the **theoretical foundation** behind machine learning! Real analysis supplies the rigor for two questions: under what conditions training procedures converge, and how well neural networks can approximate a target function.

### What You'll Master
By the end of this notebook, you'll understand:
1. **Convergence theory** - When and why algorithms converge to solutions
2. **Function approximation** - How neural networks approximate continuous functions on compact domains
3. **Continuity and limits** - The foundation of optimization theory
4. **Uniform convergence** - When approximations work globally
5. **Metric spaces** - The mathematical framework for learning
6. **Compactness** - Why some problems are easier than others

### Why This Matters
- **Convergence guarantees** tell us if our algorithms will work
- **Approximation theory** explains why neural networks are so powerful
- **Continuity** ensures small changes in input don't break everything
- **Metric spaces** provide the foundation for similarity and distance

### Real-World Impact
- **Universal approximation theorem**: A sufficiently wide neural network can approximate any continuous function on a compact set to arbitrary accuracy
- **Gradient descent convergence**: Conditions under which training succeeds
- **Generalization bounds**: How well models perform on new data

Let's dive into the mathematical foundations! 📐

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import optimize
from scipy.interpolate import interp1d, BSpline, splrep, splev
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

# Define some mathematical constants
GOLDEN_RATIO = (1 + np.sqrt(5)) / 2
EULER_CONSTANT = np.e  # Euler's number e ≈ 2.718 (not the Euler-Mascheroni constant γ)

print("📐 Real Analysis toolkit loaded!")
print("Ready to explore the foundations of mathematics!")

## 1. Sequences and Convergence: The Foundation

### What is Convergence?
A sequence `{aₙ}` **converges** to limit `L` if:
```
For every ε > 0, there exists N such that
for all n > N: |aₙ - L| < ε
```

**Translation**: No matter how close you want to get to L (ε), you can always find a point in the sequence after which all terms stay within that distance.
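
As a minimal sketch of the ε-N definition, the snippet below uses the sequence aₙ = 1/n (an illustrative choice, with limit L = 0) and computes an N that works for a given ε. The helper name `smallest_N` is hypothetical.

```python
import math

def smallest_N(eps):
    """Return N such that 1/n < eps for every n > N (sequence a_n = 1/n, limit 0)."""
    # 1/n < eps  <=>  n > 1/eps, so N = floor(1/eps) is enough.
    return math.floor(1 / eps)

for eps in [0.5, 0.1, 0.01]:
    N = smallest_N(eps)
    # Spot-check the definition on the next 1000 terms after N.
    assert all(abs(1 / n - 0) < eps for n in range(N + 1, N + 1001))
    print(f"eps={eps}: N={N}")
```

A smaller ε forces a larger N, which is exactly the "for every ε there exists N" ordering in the definition.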

### Types of Convergence

#### 1. Pointwise Convergence
Functions `fₙ(x)` converge to `f(x)` at each point x individually.

#### 2. Uniform Convergence
Functions `fₙ(x)` converge to `f(x)` **simultaneously** for all x in the domain.
```
For every ε > 0, there exists N such that
for all n > N and all x: |fₙ(x) - f(x)| < ε
```
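
To see what "all x at once" demands, here is a small sketch comparing two sequences on [0, 1]. For fₙ(x) = xⁿ it evaluates the function at the point xₙ = 0.5^(1/n), which lies in [0, 1) where the pointwise limit is 0. For fₙ(x) = x/n it takes the maximum over a grid. The function names are hypothetical.

```python
import numpy as np

def gap_power(n):
    """f_n(x) = x**n evaluated at x_n = 0.5**(1/n); the pointwise limit there is 0."""
    x_n = 0.5 ** (1.0 / n)
    return x_n ** n  # stays at 0.5 (up to rounding) for every n

def gap_scaled(n, grid_size=1001):
    """Maximum of |x/n - 0| over a grid on [0, 1]; the true supremum is 1/n."""
    x = np.linspace(0.0, 1.0, grid_size)
    return float(np.max(np.abs(x / n)))

for n in [1, 10, 100, 1000]:
    print(f"n={n}: x^n gap={gap_power(n):.4f}, x/n gap={gap_scaled(n):.4f}")
```

Because some point always sits at distance about 0.5 from the limit, no single N works for every x, so xⁿ does not converge uniformly. The gap for x/n shrinks like 1/n, so it does.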

#### 3. L² Convergence (Mean Square)
```
∫ |fₙ(x) - f(x)|² dx → 0 as n → ∞
```
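
A short numerical check of mean-square convergence, assuming the example fₙ(x) = xⁿ on [0, 1] with limit 0 almost everywhere. The exact value of the integral is 1/(2n + 1). The sketch estimates it with a hand-written trapezoid rule; `l2_distance_sq` is a hypothetical helper.

```python
import numpy as np

def l2_distance_sq(n, grid_size=100_001):
    """Trapezoid-rule estimate of the integral of (x**n - 0)**2 over [0, 1]."""
    x = np.linspace(0.0, 1.0, grid_size)
    y = x ** (2 * n)
    dx = x[1] - x[0]
    # Composite trapezoid rule: full weight inside, half weight at the two ends.
    return float(dx * (y.sum() - 0.5 * (y[0] + y[-1])))

for n in [1, 5, 25, 125]:
    print(f"n={n}: numeric={l2_distance_sq(n):.6f}, exact={1 / (2 * n + 1):.6f}")
```

The integral tends to 0 even though xⁿ does not converge uniformly on [0, 1], so L² convergence is a strictly weaker requirement than uniform convergence on a bounded interval.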

### Why This Matters in ML
- **Gradient descent**: Does the sequence of parameters converge?
- **Neural network training**: Do the weights converge to optimal values?
- **Function approximation**: How fast do approximations improve?
- **Generalization**: Does performance on the training set predict performance on the test set?

In [None]:
def demonstrate_convergence_types():
    """Visualize different types of convergence"""
    
    print("🎯 Types of Convergence Demonstration")
    print("=" * 45)
    
    # Domain
    x = np.linspace(0, 1, 1000)
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Pointwise convergence (but not uniform)
    print("\n1. Pointwise Convergence (Non-Uniform)")
    print("   fₙ(x) = x^n on [0,1]")
    print("   Limit: f(x) = 0 for x ∈ [0,1), f(1) = 1")
    
    ax1 = axes[0, 0]
    for n in [1, 2, 5, 10, 20, 50]:
        fn = x**n
        ax1.plot(x, fn, alpha=0.7, label=f'n={n}')
    
    # Limit function
    limit_pointwise = np.zeros_like(x)
    limit_pointwise[-1] = 1  # Jump at x=1
    ax1.plot(x, limit_pointwise, 'k--', linewidth=3, label='Limit')
    ax1.set_title('Pointwise Convergence\n$f_n(x) = x^n$')
    ax1.set_xlabel('x')
    ax1.set_ylabel('$f_n(x)$')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # 2. Uniform convergence
    print("\n2. Uniform Convergence")
    print("   fₙ(x) = x/n on [0,1]")
    print("   Limit: f(x) = 0 uniformly")
    
    ax2 = axes[0, 1]
    for n in [1, 2, 5, 10, 20, 50]:
        fn = x / n
        ax2.plot(x, fn, alpha=0.7, label=f'n={n}')
    
    ax2.axhline(y=0, color='k', linestyle='--', linewidth=3, label='Limit')
    ax2.set_title('Uniform Convergence\n$f_n(x) = x/n$')
    ax2.set_xlabel('x')
    ax2.set_ylabel('$f_n(x)$')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # 3. Convergence in L² norm
    print("\n3. L² Convergence")
    print("   fₙ(x) = sin(nx)/√n")
    print("   ||fₙ||₂ → 0; since |fₙ(x)| ≤ 1/√n, the convergence here is also uniform")
    
    ax3 = axes[0, 2]
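    l2_norms = []
    # np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever is available
    trapezoid = getattr(np, "trapezoid", None) or np.trapz
    for n in [1, 2, 5, 10, 20, 50]:
        fn = np.sin(n * x) / np.sqrt(n)
        l2_norm = np.sqrt(trapezoid(fn**2, x))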
        l2_norms.append(l2_norm)
        if n <= 10:
            ax3.plot(x, fn, alpha=0.7, label=f'n={n}')
    
    ax3.set_title('L² Convergence\n$f_n(x) = \\sin(nx)/\\sqrt{n}$')
    ax3.set_xlabel('x')
    ax3.set_ylabel('$f_n(x)$')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # 4. Gradient descent convergence
    print("\n4. Gradient Descent Convergence")
    print("   Minimizing f(x) = x² starting from x₀ = 5")
    
    def gradient_descent_demo(x0, learning_rate, num_steps):
        x = x0
        path = [x]
        for _ in range(num_steps):
            gradient = 2 * x  # df/dx for f(x) = x²
            x = x - learning_rate * gradient
            path.append(x)
        return np.array(path)
    
    ax4 = axes[1, 0]
    learning_rates = [0.1, 0.3, 0.5, 0.9, 1.1]
    for lr in learning_rates:
        path = gradient_descent_demo(5.0, lr, 20)
        iterations = range(len(path))
        convergent = lr < 1.0
        style = '-' if convergent else '--'
        ax4.plot(iterations, path, style, alpha=0.8, label=f'lr={lr}')
    
    ax4.axhline(y=0, color='k', linestyle=':', label='Optimum')
    ax4.set_title('Gradient Descent Convergence')
    ax4.set_xlabel('Iteration')
    ax4.set_ylabel('x value')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    ax4.set_yscale('symlog')
    
    # 5. Neural network approximation convergence
    print("\n5. Neural Network Function Approximation")
    print("   Approximating f(x) = sin(2πx) + 0.3*sin(6πx)")
    
    def target_function(x):
        return np.sin(2 * np.pi * x) + 0.3 * np.sin(6 * np.pi * x)
    
    x_train = np.linspace(0, 1, 100).reshape(-1, 1)
    y_train = target_function(x_train.flatten())
    x_test = np.linspace(0, 1, 1000).reshape(-1, 1)
    y_true = target_function(x_test.flatten())
    
    ax5 = axes[1, 1]
    ax5.plot(x_test, y_true, 'k-', linewidth=3, label='True function')
    
    hidden_sizes = [5, 10, 20, 50]
    for size in hidden_sizes:
        mlp = MLPRegressor(hidden_layer_sizes=(size,), max_iter=1000, random_state=42)
        mlp.fit(x_train, y_train)
        y_pred = mlp.predict(x_test)
        mse = mean_squared_error(y_true, y_pred)
        ax5.plot(x_test, y_pred, alpha=0.7, label=f'{size} neurons (MSE={mse:.3f})')
    
    ax5.set_title('Neural Network Approximation')
    ax5.set_xlabel('x')
    ax5.set_ylabel('f(x)')
    ax5.legend()
    ax5.grid(True, alpha=0.3)
    
    # 6. Convergence rates
    print("\n6. Convergence Rate Comparison")
    
    n_values = np.arange(1, 51)
    
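    # Different convergence rates
    # Note: in optimization, "linear convergence" means geometric decay (error ∝ ρ^n),
    # so 1/n and 1/n² are both sublinear (polynomial) rates.
    linear = 1 / n_values            # sublinear, O(1/n)
    quadratic = 1 / n_values**2      # sublinear, O(1/n²)
    exponential = np.exp(-n_values)  # geometric with ratio 1/e
    geometric = 0.9**n_values        # geometric with ratio 0.9
    
    ax6 = axes[1, 2]
    ax6.semilogy(n_values, linear, 'o-', label='Sublinear: 1/n', alpha=0.8)
    ax6.semilogy(n_values, quadratic, 's-', label='Sublinear: 1/n²', alpha=0.8)
    ax6.semilogy(n_values, exponential, '^-', label='Geometric: $e^{-n}$', alpha=0.8)
    ax6.semilogy(n_values, geometric, 'd-', label='Geometric: $0.9^n$', alpha=0.8)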
    
    ax6.set_title('Convergence Rates')
    ax6.set_xlabel('n (iteration/degree)')
    ax6.set_ylabel('Error (log scale)')
    ax6.legend()
    ax6.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Key Insights:")
    print("• Pointwise ≠ Uniform: Functions can converge at each point but not uniformly")
    print("• Learning rate matters: Too large → divergence, too small → slow convergence")
    print("• More neurons → more capacity to approximate (given enough data and successful training)")
    print("• Geometric (exponential) decay eventually beats any polynomial rate")

demonstrate_convergence_types()
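
The learning-rate threshold in the gradient descent panel can be derived by hand. For f(x) = x² the update is x ← x − lr·2x = (1 − 2·lr)·x, so after k steps x_k = (1 − 2·lr)^k · x₀, which shrinks exactly when |1 − 2·lr| < 1, that is, 0 < lr < 1. The sketch below compares the loop with this closed form; `gd_on_square` is a hypothetical helper.

```python
def gd_on_square(x0, lr, steps):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

x0, steps = 5.0, 20
for lr in [0.1, 0.3, 0.5, 0.9, 1.1]:
    factor = 1 - 2 * lr                 # per-step contraction factor
    closed_form = x0 * factor ** steps  # x_k = factor**k * x0
    print(f"lr={lr}: |factor|={abs(factor):.1f}, "
          f"loop={gd_on_square(x0, lr, steps):.3e}, closed form={closed_form:.3e}")
```

Note that lr = 0.5 gives factor 0 and lands on the minimum in one step, while lr = 0.9 still converges but oscillates in sign because the factor is negative.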

## 2. Continuity and Differentiability: Smoothness Matters

### Continuity
A function f is **continuous** at point a if:
```
lim(x→a) f(x) = f(a)
```

**Three conditions must hold**:
1. `f(a)` exists (function is defined at a)
2. `lim(x→a) f(x)` exists (limit exists)
3. The limit equals the function value
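
The three conditions can be probed numerically, though only as a rough check and not a proof. The sketch below evaluates a function just left of a point, at the point, and just right of it, for a step function and for ReLU at a = 0. The helper `one_sided_values` and the offset `h` are illustrative assumptions.

```python
def one_sided_values(f, a, h=1e-8):
    """Evaluate f slightly left of a, at a, and slightly right of a."""
    return f(a - h), f(a), f(a + h)

def step(x):
    return -1.0 if x < 0 else 1.0

def relu(x):
    return max(0.0, x)

for name, f in [("step", step), ("relu", relu)]:
    left, at, right = one_sided_values(f, 0.0)
    looks_continuous = abs(left - at) < 1e-6 and abs(right - at) < 1e-6
    print(f"{name}: left={left}, f(0)={at}, right={right}, looks continuous: {looks_continuous}")
```

For the step function the left-hand values stay near -1 while f(0) = 1, so the two-sided limit does not exist and condition 2 fails. ReLU passes all three conditions at 0.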

### Types of Continuity

#### Uniform Continuity
For every ε > 0, there exists δ > 0 such that for **all** x, y:
```
|x - y| < δ  ⟹  |f(x) - f(y)| < ε
```
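
The word **all** is what separates uniform continuity from plain continuity: one δ has to work everywhere. As an illustrative sketch, f(x) = x² on the whole real line fails this test, because for a fixed shift δ the change |f(x + δ) − f(x)| = 2xδ + δ² grows without bound as x grows. sin(x) is shown for contrast, since its change is never more than δ. The helper `gap` is hypothetical.

```python
import math

def gap(f, x, delta):
    """|f(x + delta) - f(x)| for a fixed shift delta."""
    return abs(f(x + delta) - f(x))

def square(x):
    return x * x

delta = 1e-3
for x in [1.0, 10.0, 100.0, 1000.0]:
    print(f"x={x:>6}: square gap={gap(square, x, delta):.6f}, "
          f"sin gap={gap(math.sin, x, delta):.6f}")
```

On a bounded closed interval such as [-2, 2], x² is uniformly continuous. The failure only appears on an unbounded domain.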

#### Lipschitz Continuity
There exists L > 0 such that for all x, y:
```
|f(x) - f(y)| ≤ L|x - y|
```
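
A rough way to estimate a Lipschitz constant is to take the largest finite-difference slope on a grid. This gives a lower bound on the true constant, not a certificate. The sketch below does this for three common activations on [-5, 5]; the helper `max_slope` and the grid settings are assumptions made for illustration.

```python
import numpy as np

def max_slope(f, lo=-5.0, hi=5.0, num=100_001):
    """Largest |f(x_{i+1}) - f(x_i)| / |x_{i+1} - x_i| over a uniform grid."""
    x = np.linspace(lo, hi, num)
    y = f(x)
    return float(np.max(np.abs(np.diff(y) / np.diff(x))))

estimates = {
    "tanh": max_slope(np.tanh),
    "sigmoid": max_slope(lambda x: 1.0 / (1.0 + np.exp(-x))),
    "relu": max_slope(lambda x: np.maximum(0.0, x)),
}
for name, value in estimates.items():
    print(f"{name}: estimated L ≈ {value:.4f}")
```

The estimates agree with the known constants: 1 for tanh and ReLU, and 1/4 for the sigmoid, whose derivative peaks at x = 0.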

### Why Continuity Matters in ML

1. **Optimization**: Continuous (and especially differentiable) objectives are amenable to gradient-based methods
2. **Generalization**: Small changes in input should give small changes in output
3. **Robustness**: Lipschitz-continuous models have bounded sensitivity to input noise
4. **Convergence**: Many convergence theorems require continuity

### Differentiability Hierarchy
```
Lipschitz ⟹ Uniformly Continuous ⟹ Continuous
Differentiable ⟹ Continuous
Differentiable with bounded derivative ⟹ Lipschitz
```
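
The implication Lipschitz ⟹ Uniformly Continuous cannot be reversed. A standard counterexample is f(x) = √x on [0, 1]. It satisfies |√x − √y| ≤ √|x − y|, so δ = ε² works everywhere, yet its slope near 0 is unbounded. The sketch below checks both facts numerically; the helper name and the random sampling are illustrative assumptions.

```python
import math
import random

def slope_from_zero(h):
    """Difference quotient of sqrt between 0 and h; equals 1/sqrt(h)."""
    return (math.sqrt(h) - math.sqrt(0.0)) / h

for h in [1e-2, 1e-4, 1e-6]:
    print(f"h={h:g}: slope={slope_from_zero(h):.1f}")

# Spot-check the bound |sqrt(x) - sqrt(y)| <= sqrt(|x - y|) on random pairs in [0, 1).
random.seed(0)
pairs = [(random.random(), random.random()) for _ in range(10_000)]
holds = all(
    abs(math.sqrt(x) - math.sqrt(y)) <= math.sqrt(abs(x - y)) + 1e-12  # small float tolerance
    for x, y in pairs
)
print("sqrt bound holds on all sampled pairs:", holds)
```

No single constant L can cap slopes that grow like 1/√h, so √x is uniformly continuous on [0, 1] without being Lipschitz there.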

In [None]:
def explore_continuity_and_differentiability():
    """Explore different types of continuity and their ML implications"""
    
    print("🔄 Continuity and Differentiability in ML")
    print("=" * 45)
    
    x = np.linspace(-2, 2, 1000)
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Discontinuous function
    print("\n1. Discontinuous Function (Step Function)")
    print("   Problem: Gradient doesn't exist at discontinuities")
    
    step_func = np.where(x < 0, -1, 1)
    axes[0, 0].plot(x, step_func, 'b-', linewidth=2)
    axes[0, 0].scatter([0], [1], color='red', s=100, zorder=5)
    axes[0, 0].scatter([0], [-1], color='red', s=100, zorder=5, facecolors='none', edgecolors='red')
    axes[0, 0].set_title('Discontinuous Function\n(Step Function)')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].text(0.5, 0, 'No gradient here!', fontsize=10, color='red')
    
    # 2. Continuous but not differentiable
    print("\n2. Continuous but Not Differentiable (ReLU)")
    print("   Used in: Neural networks (despite non-differentiability at 0)")
    
    relu = np.maximum(0, x)
    axes[0, 1].plot(x, relu, 'g-', linewidth=2)
    axes[0, 1].scatter([0], [0], color='red', s=100, zorder=5)
    axes[0, 1].set_title('ReLU Activation\n$f(x) = \\max(0, x)$')
    axes[0, 1].set_xlabel('x')
    axes[0, 1].set_ylabel('f(x)')
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].text(0.2, 0.2, 'Gradient undefined\nat x=0', fontsize=10, color='red')
    
    # 3. Smooth and differentiable
    print("\n3. Smooth and Differentiable (Sigmoid)")
    print("   Used in: Classical neural networks, logistic regression")
    
    sigmoid = 1 / (1 + np.exp(-x))
    axes[0, 2].plot(x, sigmoid, 'purple', linewidth=2)
    axes[0, 2].set_title('Sigmoid Activation\n$f(x) = \\frac{1}{1+e^{-x}}$')
    axes[0, 2].set_xlabel('x')
    axes[0, 2].set_ylabel('f(x)')
    axes[0, 2].grid(True, alpha=0.3)
    axes[0, 2].text(0, 0.2, 'Smooth everywhere', fontsize=10, color='green')
    
    # 4. Lipschitz continuous function
    print("\n4. Lipschitz Continuous Function")
    print("   Property: |f(x) - f(y)| ≤ L|x - y| for some L")
    
    # Example: f(x) = tanh(x) is 1-Lipschitz
    tanh_func = np.tanh(x)
    axes[1, 0].plot(x, tanh_func, 'orange', linewidth=2, label='tanh(x)')
    
    # Show Lipschitz property
    x1, x2 = 0.5, 1.0
    y1, y2 = np.tanh(x1), np.tanh(x2)
    axes[1, 0].plot([x1, x2], [y1, y2], 'ro-', markersize=8)
    axes[1, 0].plot([x1, x2], [y1, y1 + 1*(x2-x1)], 'r--', alpha=0.7, label='L=1 bound')
    axes[1, 0].set_title('Lipschitz Continuous\n$f(x) = \\tanh(x)$')
    axes[1, 0].set_xlabel('x')
    axes[1, 0].set_ylabel('f(x)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # 5. Impact on optimization
    print("\n5. Impact on Gradient Descent Convergence")
    print("   Smooth convex functions → convergence with a small enough step size")
    print("   Non-smooth functions → fixed-step methods may oscillate around the minimum")
    
    # Optimize smooth vs non-smooth functions
    def smooth_objective(x):
        return x**2 + 0.1 * x**4
    
    def nonsmooth_objective(x):
        return abs(x) + 0.1 * x**2
    
    x_opt = np.linspace(-2, 2, 1000)
    y_smooth = [smooth_objective(xi) for xi in x_opt]
    y_nonsmooth = [nonsmooth_objective(xi) for xi in x_opt]
    
    axes[1, 1].plot(x_opt, y_smooth, 'b-', linewidth=2, label='Smooth: $x^2 + 0.1x^4$')
    axes[1, 1].plot(x_opt, y_nonsmooth, 'r-', linewidth=2, label='Non-smooth: $|x| + 0.1x^2$')
    axes[1, 1].set_title('Optimization Landscapes')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    # 6. Regularization and smoothness
    print("\n6. Regularization Promotes Smoothness")
    print("   L2 regularization shrinks coefficients, giving less wiggly fits")
    
    # Generate noisy data
    np.random.seed(42)
    x_data = np.linspace(0, 1, 20)
    y_data = np.sin(2 * np.pi * x_data) + 0.2 * np.random.randn(20)
    
    x_smooth = np.linspace(0, 1, 200)
    
    # Polynomial fit without regularization (overfitting)
    coeffs_overfit = np.polyfit(x_data, y_data, 10)
    y_overfit = np.polyval(coeffs_overfit, x_smooth)
    
    # Polynomial fit with L2 regularization (Ridge regression)
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures
    
    poly_features = PolynomialFeatures(degree=10)
    X_poly = poly_features.fit_transform(x_data.reshape(-1, 1))
    X_smooth_poly = poly_features.transform(x_smooth.reshape(-1, 1))
    
    ridge = Ridge(alpha=0.1)
    ridge.fit(X_poly, y_data)
    y_regularized = ridge.predict(X_smooth_poly)
    
    axes[1, 2].scatter(x_data, y_data, color='black', s=50, zorder=5, label='Data')
    axes[1, 2].plot(x_smooth, y_overfit, 'r-', linewidth=2, alpha=0.7, label='Unregularized (wiggly)')
    axes[1, 2].plot(x_smooth, y_regularized, 'b-', linewidth=2, label='Ridge-regularized (smoother)')
    axes[1, 2].set_title('Regularization & Smoothness')
    axes[1, 2].set_xlabel('x')
    axes[1, 2].set_ylabel('y')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n🎯 Key Takeaways:")
    print("• Discontinuities break gradient-based optimization")
    print("• ReLU works despite non-differentiability (subgradients)")
    print("• Smooth functions are easier to optimize")
    print("• Lipschitz continuity guarantees bounded sensitivity")
    print("• Regularization promotes smoothness and generalization")

explore_continuity_and_differentiability()
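
The contrast between smooth and non-smooth objectives shows up even in one dimension. As a minimal sketch, the code below runs fixed-step descent on f(x) = |x|, using the subgradient sign(x), and on f(x) = x², both from x₀ = 1 with step 0.3. The step size, starting point, and helper names are illustrative choices.

```python
def sign(x):
    """Sign of x: a subgradient of |x| (0 is a valid choice at x = 0)."""
    return float((x > 0) - (x < 0))

def fixed_step_descent(grad, x0, lr, steps):
    """Iterate x <- x - lr * grad(x) and return the whole path."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x)
    return path

abs_path = fixed_step_descent(sign, 1.0, 0.3, 30)             # f(x) = |x|
sq_path = fixed_step_descent(lambda x: 2 * x, 1.0, 0.3, 30)   # f(x) = x**2

print("last |x| iterates:", [round(v, 3) for v in abs_path[-4:]])
print("last x² iterate:  ", sq_path[-1])
```

On |x| the subgradient has magnitude 1 no matter how close the iterate is to 0, so the iterates end up hopping across the minimum instead of settling. On x² the gradient shrinks with x and the iterates contract by a factor of 0.4 per step. Decreasing step sizes are the usual remedy in the non-smooth case.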