# Applied Measure Theory for Machine Learning
## Rigorous Probability Foundations and Advanced Integration

Welcome to the **mathematical foundation of modern probability**! Measure theory provides the rigorous framework that makes probability theory logically consistent and enables advanced techniques in machine learning and statistics.

### What You'll Master
By the end of this notebook, you'll understand:
1. **Measure spaces** - The foundation of rigorous probability
2. **Lebesgue integration** - Beyond Riemann integration
3. **Probability measures** - Formal definition of probability
4. **Random variables** - Measurable functions on probability spaces
5. **Convergence theorems** - When limits and integrals commute
6. **Radon-Nikodym theorem** - Density functions rigorously defined

### Why This Matters
- **Rigorous foundations** - Make probability theory mathematically sound
- **Advanced ML theory** - Understand PAC learning, uniform convergence
- **Functional analysis** - Foundation for kernel methods, reproducing kernel Hilbert spaces (RKHS)
- **Stochastic calculus** - Theoretical basis for continuous-time processes

### Real-World Applications
- **Option pricing**: Itô calculus and stochastic differential equations
- **Signal processing**: L² spaces and Fourier analysis
- **Deep learning theory**: Universal approximation theorems
- **Quantum computing**: Probability amplitudes in Hilbert spaces

Let's build the mathematical skyscraper of modern probability! 🏗️

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats, integrate
from scipy.stats import norm, uniform, expon, gamma
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("plasma")
np.random.seed(42)

print("🏗️ Measure Theory toolkit loaded!")
print("Ready to build rigorous probability foundations!")

## 1. Measure Spaces: The Foundation

### What is a Measure?
A **measure** is a function that assigns a non-negative number (possibly infinite) to subsets of a space, generalizing concepts like length, area, and volume.

### The Triple (Ω, ℱ, μ)
A **measure space** consists of:
1. **Ω**: The sample space (set of all possible outcomes)
2. **ℱ**: A σ-algebra (collection of measurable sets)
3. **μ**: A measure function μ: ℱ → [0, ∞]

### σ-Algebra Properties
A collection ℱ of subsets of Ω is a σ-algebra if:
1. **Ω ∈ ℱ** (contains the whole space)
2. **A ∈ ℱ ⟹ Aᶜ ∈ ℱ** (closed under complements)
3. **A₁, A₂, ... ∈ ℱ ⟹ ⋃ᵢ Aᵢ ∈ ℱ** (closed under countable unions)
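On a finite sample space the three axioms can be checked by brute force, because closure under countable unions reduces to closure under pairwise unions. The sketch below is only an illustration under that finiteness assumption; the helper names `power_set` and `is_sigma_algebra` are hypothetical and not part of any library.

```python
from itertools import chain, combinations

def power_set(omega):
    """Return every subset of omega as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(omega, family):
    """Check the three σ-algebra axioms on a finite sample space."""
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    # Axiom 1: the whole space belongs to the family
    if omega not in family:
        return False
    # Axiom 2: closed under complements
    if any(omega - a not in family for a in family):
        return False
    # Axiom 3: closed under unions (pairwise suffices when the family is finite)
    return all(a | b in family for a in family for b in family)

omega = {1, 2, 3, 4}
generated_by_12 = [set(), {1, 2}, {3, 4}, omega]  # smallest σ-algebra containing {1, 2}
not_closed = [set(), {1}, omega]                  # missing the complement {2, 3, 4}

print(is_sigma_algebra(omega, power_set(omega)))  # True
print(is_sigma_algebra(omega, generated_by_12))   # True
print(is_sigma_algebra(omega, not_closed))        # False
```

The power set is always a σ-algebra on a finite space; the interesting cases are the smaller families, which encode coarser information about the outcome.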

### Measure Properties
A function μ: ℱ → [0, ∞] is a measure if:
1. **μ(∅) = 0** (empty set has measure zero)
2. **Countable additivity**: For disjoint sets A₁, A₂, ...
   μ(⋃ᵢ Aᵢ) = Σᵢ μ(Aᵢ)

### Important Examples
- **Counting measure**: μ(A) = |A| (number of elements)
- **Lebesgue measure**: μ([a,b]) = b - a (length)
- **Dirac measure**: μ(A) = 1 if x₀ ∈ A, 0 otherwise
- **Probability measure**: μ(Ω) = 1
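The discrete examples above can be written as plain Python functions on finite sets, which makes the additivity axiom easy to test. This is a minimal sketch, assuming finite sets only; `counting_measure`, `dirac_measure`, `discrete_probability`, and `is_additive` are illustrative names, and the point masses in `weights` are made-up values.

```python
def counting_measure(a):
    """μ(A) = |A|."""
    return len(a)

def dirac_measure(x0):
    """Return the Dirac measure concentrated at x0."""
    def mu(a):
        return 1 if x0 in a else 0
    return mu

def discrete_probability(weights):
    """Probability measure on a finite set, built from point masses."""
    def mu(a):
        return sum(weights[w] for w in a)
    return mu

def is_additive(mu, disjoint_sets, tol=1e-12):
    """Check μ(⋃ Aᵢ) = Σ μ(Aᵢ) for a finite list of pairwise disjoint sets."""
    union = set().union(*disjoint_sets)
    return abs(mu(union) - sum(mu(a) for a in disjoint_sets)) < tol

pieces = [{1}, {2, 3}, {4}]                  # pairwise disjoint
weights = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # hypothetical point masses summing to 1
P = discrete_probability(weights)

for name, mu in [("counting", counting_measure),
                 ("Dirac at 3", dirac_measure(3)),
                 ("probability", P)]:
    print(name, "μ(∅) =", mu(set()), "additive:", is_additive(mu, pieces))
```

Lebesgue measure is deliberately left out: it cannot be represented by enumerating points, which is one reason the general theory is needed.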

### Why σ-Algebras?
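Restricting μ to a σ-algebra of measurable sets sidesteps the **Banach-Tarski paradox**: using the axiom of choice, a ball in ℝ³ can be cut into finitely many pieces and reassembled into two balls, each the same size as the original. Those pieces are non-measurable, so no volume is assigned to them and no contradiction arises.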

In [None]:
def demonstrate_measure_theory_foundations():
    """Explore the foundations of measure theory"""
    
    print("📐 Measure Theory: Building Rigorous Foundations")
    print("=" * 50)
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Riemann vs Lebesgue integration
    print("\n1. Riemann vs Lebesgue Integration")
    print("   Different approaches to measuring 'area under curve'")
    
    # Create a piecewise function with jump discontinuities
    def discontinuous_function(x):
        """A small sine oscillation with three narrow spikes (still Riemann integrable)"""
        # Not Thomae's function: just spikes at 0.3, 0.5 and 0.7 on top of a sine wave
        result = np.zeros_like(x)
        for i, val in enumerate(x):
            if abs(val - 0.5) < 0.01:  # Spike at 0.5
                result[i] = 1.0
            elif abs(val - 0.3) < 0.005:  # Smaller spike at 0.3
                result[i] = 0.5
            elif abs(val - 0.7) < 0.005:  # Smaller spike at 0.7
                result[i] = 0.5
            else:
                result[i] = 0.1 * np.sin(10 * np.pi * val)
        return result
    
    x = np.linspace(0, 1, 1000)
    y = discontinuous_function(x)
    
    # Riemann approach: vertical rectangles
    n_riemann = 20
    x_riemann = np.linspace(0, 1, n_riemann + 1)
    dx = x_riemann[1] - x_riemann[0]
    
    # Sample function at left endpoints
    y_riemann = discontinuous_function(x_riemann[:-1])
    
    axes[0, 0].plot(x, y, 'b-', linewidth=2, label='Function f(x)')
    axes[0, 0].bar(x_riemann[:-1], y_riemann, width=dx, alpha=0.3, 
                  color='blue', label='Riemann rectangles')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].set_title('Riemann Integration\n(Vertical Rectangles)')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    riemann_sum = np.sum(y_riemann * dx)
    print(f"   Riemann sum approximation: {riemann_sum:.4f}")
    
    # Lebesgue approach: horizontal rectangles
    # Group points by function value
    # Levels must span the whole range of f, including its negative values
    y_levels = np.linspace(y.min(), y.max() + 1e-9, 41)
    level_measures = []
    
    for i in range(len(y_levels) - 1):
        y_low, y_high = y_levels[i], y_levels[i + 1]
        # Find measure of set where y_low ≤ f(x) < y_high
        mask = (y >= y_low) & (y < y_high)
        measure = np.sum(mask) / len(x)  # Approximate measure
        level_measures.append(measure)
    
    # Plot Lebesgue perspective
    cumulative_measure = np.cumsum([0] + level_measures)
    
    for i, (y_low, y_high) in enumerate(zip(y_levels[:-1], y_levels[1:])):
        width = level_measures[i]
        if width > 0:
            axes[0, 1].barh(y_low, width, height=y_high - y_low, 
                          alpha=0.6, color=plt.cm.viridis(i/len(y_levels)))
    
    axes[0, 1].set_ylabel('Function Value')
    axes[0, 1].set_xlabel('Measure of Level Set')
    axes[0, 1].set_title('Lebesgue Integration\n(Horizontal Rectangles)')
    axes[0, 1].grid(True, alpha=0.3)
    
    lebesgue_sum = np.sum([y_levels[i] * level_measures[i] for i in range(len(level_measures))])
    print(f"   Lebesgue sum approximation: {lebesgue_sum:.4f}")
    
    # 2. Probability as a measure
    print("\n2. Probability as a Special Measure")
    print("   P(Ω) = 1, P(∅) = 0, countably additive")
    
    # Demonstrate probability measure properties
    # Simple example: coin flips
    outcomes = ['HH', 'HT', 'TH', 'TT']
    prob_fair = [0.25, 0.25, 0.25, 0.25]
    prob_biased = [0.49, 0.21, 0.21, 0.09]  # P(H) = 0.7
    
    x_pos = np.arange(len(outcomes))
    width = 0.35
    
    bars1 = axes[0, 2].bar(x_pos - width/2, prob_fair, width, 
                          label='Fair coin', alpha=0.8, color='skyblue')
    bars2 = axes[0, 2].bar(x_pos + width/2, prob_biased, width,
                          label='Biased coin', alpha=0.8, color='orange')
    
    axes[0, 2].set_xlabel('Outcome')
    axes[0, 2].set_ylabel('Probability')
    axes[0, 2].set_title('Probability Measures on Sample Space')
    axes[0, 2].set_xticks(x_pos)
    axes[0, 2].set_xticklabels(outcomes)
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)
    
    # Verify measure properties
    print(f"   Fair coin: P(Ω) = {sum(prob_fair)} (should be 1)")
    print(f"   Biased coin: P(Ω) = {sum(prob_biased)} (should be 1)")
    
    # Event examples
    event_at_least_one_H = ['HH', 'HT', 'TH']
    p_fair_event = sum(prob_fair[i] for i, outcome in enumerate(outcomes) if outcome in event_at_least_one_H)
    p_biased_event = sum(prob_biased[i] for i, outcome in enumerate(outcomes) if outcome in event_at_least_one_H)
    
    print(f"   P_fair(at least one H) = {p_fair_event:.2f}")
    print(f"   P_biased(at least one H) = {p_biased_event:.2f}")
    
    # 3. Random variables as measurable functions
    print("\n3. Random Variables: Measurable Functions")
    print("   X: (Ω, ℱ) → (ℝ, B(ℝ))")
    
    # Example: sum of two dice
    # Sample space: all pairs (i,j) where i,j ∈ {1,2,3,4,5,6}
    dice_outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
    dice_sums = [i + j for i, j in dice_outcomes]
    
    # Count frequencies
    sum_counts = np.bincount(dice_sums)[2:]  # Start from sum=2
    sum_values = np.arange(2, 13)
    sum_probs = sum_counts / 36
    
    # Plot probability mass function
    bars = axes[1, 0].bar(sum_values, sum_probs, alpha=0.7, color='green')
    axes[1, 0].set_xlabel('Sum of Dice (X)')
    axes[1, 0].set_ylabel('P(X = x)')
    axes[1, 0].set_title('Random Variable: Sum of Two Dice')
    axes[1, 0].set_xticks(sum_values)
    axes[1, 0].grid(True, alpha=0.3)
    
    # Add probability labels
    for bar, prob in zip(bars, sum_probs):
        height = bar.get_height()
        axes[1, 0].text(bar.get_x() + bar.get_width()/2., height + 0.005,
                       f'{prob:.3f}', ha='center', va='bottom', fontsize=8)
    
    print(f"   Sample space size: {len(dice_outcomes)}")
    print(f"   Range of X: {set(dice_sums)}")
    print(f"   P(X = 7) = {sum_probs[5]:.3f} (most likely)")
    print(f"   E[X] = {np.sum(sum_values * sum_probs):.2f}")
    
    # 4. Convergence theorems
    print("\n4. Convergence Theorems: When Limits Commute with Integrals")
    print("   Monotone Convergence, Dominated Convergence, Fatou's Lemma")
    
    # Demonstrate limit-integral interchange (f_n decreases to 0 a.e. and is dominated by 1_{[0,1]})
    x_conv = np.linspace(0, 2, 1000)
    
    # Sequence of functions f_n(x) = x^n * 1_{[0,1]}(x)
    n_values = [1, 2, 5, 10, 20]
    colors_conv = plt.cm.viridis(np.linspace(0, 1, len(n_values)))
    
    for i, n in enumerate(n_values):
        f_n = np.where(x_conv <= 1, x_conv**n, 0)
        axes[1, 1].plot(x_conv, f_n, color=colors_conv[i], linewidth=2, 
                       label=rf'$f_{{{n}}}(x) = x^{{{n}}} \cdot 1_{{[0,1]}}$')
    
    # Limit function
    f_limit = np.where(x_conv < 1, 0, np.where(x_conv == 1, 1, 0))
    axes[1, 1].plot(x_conv, f_limit, 'r--', linewidth=3, label='Limit function')
    
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f_n(x)')
    axes[1, 1].set_title('Dominated Convergence Example')
    axes[1, 1].legend(fontsize=8)
    axes[1, 1].grid(True, alpha=0.3)
    axes[1, 1].set_ylim(-0.1, 1.1)
    
    # Calculate integrals
    integrals = [1/(n+1) for n in n_values]  # ∫₀¹ x^n dx = 1/(n+1)
    limit_integral = 0  # ∫ limit function = 0
    
    print(f"   Integrals of f_n: {[f'{val:.4f}' for val in integrals]}")
    print(f"   Limit of integrals: {limit_integral} (∫ limit function)")
    print("   DCT: lim ∫f_n = ∫(lim f_n) because |f_n| ≤ 1_[0,1]; here f_n ↓, so the MCT for f_n ↑ does not apply directly")
    
    # 5. Radon-Nikodym theorem
    print("\n5. Radon-Nikodym Theorem: Existence of Density Functions")
    print("   When does dν = f dμ exist?")
    
    # Example: relationship between different probability measures
    x_rn = np.linspace(-3, 3, 1000)
    
    # Reference measure: standard normal
    mu_density = norm.pdf(x_rn, 0, 1)
    
    # Absolutely continuous measure: shifted normal
    nu_density = norm.pdf(x_rn, 1, 1)
    
    # Radon-Nikodym derivative
    rn_derivative = nu_density / mu_density
    
    axes[1, 2].plot(x_rn, mu_density, 'b-', linewidth=2, label='μ: N(0,1)')
    axes[1, 2].plot(x_rn, nu_density, 'r-', linewidth=2, label='ν: N(1,1)')
    axes[1, 2].plot(x_rn, rn_derivative, 'g--', linewidth=2, label='dν/dμ')
    axes[1, 2].set_xlabel('x')
    axes[1, 2].set_ylabel('Density')
    axes[1, 2].set_title('Radon-Nikodym Derivative')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)
    
    # Verify the relationship
    # ∫ g dν = ∫ g (dν/dμ) dμ for any measurable g
    test_function = x_rn**2  # g(x) = x²
    
    # Numerical integration (truncated to [-3, 3], so ∫ x² dν comes out below its exact value of 2)
    dx = x_rn[1] - x_rn[0]
    integral_nu = np.sum(test_function * nu_density * dx)
    integral_mu_weighted = np.sum(test_function * rn_derivative * mu_density * dx)
    
    print(f"   ∫ x² dν = {integral_nu:.4f}")
    print(f"   ∫ x² (dν/dμ) dμ = {integral_mu_weighted:.4f}")
    print(f"   Difference: {abs(integral_nu - integral_mu_weighted):.6f}")
    
    plt.tight_layout()
    plt.show()
    
    print("\n🎯 Key Measure Theory Concepts:")
    print("• σ-algebras ensure mathematical consistency")
    print("• Lebesgue integration generalizes Riemann integration")
    print("• Probability is a special case of measure theory")
    print("• Random variables are measurable functions")
    print("• Convergence theorems enable limit-integral interchange")
    print("• Radon-Nikodym theorem formalizes density functions")

demonstrate_measure_theory_foundations()

## 2. Applications in Machine Learning Theory

### PAC Learning and Uniform Convergence
**Probably Approximately Correct (PAC)** learning uses measure theory to formalize when learning algorithms will succeed.

**Key Concepts**:
- **Empirical Risk Minimization (ERM)**: Choose hypothesis minimizing training error
- **Uniform Convergence**: Training error converges to true error uniformly over hypothesis class
- **VC Dimension**: Measure of hypothesis class complexity
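Shattering can be checked by brute force for very simple hypothesis classes. The sketch below uses indicator functions of closed intervals [a, b] on the real line, a class chosen here purely for illustration because its VC dimension is known to be 2; the function names are hypothetical and the points are assumed to be distinct.

```python
def interval_labelings(points):
    """All labelings of `points` realizable by indicators of closed intervals [a, b]."""
    pts = sorted(points)
    # Candidate endpoints: below all points, between neighbours, above all points.
    # Any interval induces the same labeling as one with endpoints from this list.
    cuts = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    labelings = set()
    for a in cuts:
        for b in cuts:
            # When a > b the interval is empty and every point is labeled 0
            labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings

def is_shattered(points):
    """A set is shattered when all 2^n labelings are realizable."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([0.0, 1.0]))        # True: two points can be shattered
print(is_shattered([0.0, 1.0, 2.0]))   # False: the labeling (1, 0, 1) is impossible
```

The same idea (enumerate labelings, count them) underlies the growth function used in VC generalization bounds, although enumeration is only feasible for toy classes like this one.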

### Rademacher Complexity
Measures how well a function class can fit random noise:
```
R_m(ℱ) = E[sup_{f∈ℱ} (1/m) Σᵢ σᵢ f(xᵢ)]
```
where σᵢ are Rademacher random variables (±1 with equal probability).
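For a finite function class evaluated on a fixed sample, the expectation over σ can be approximated by Monte Carlo. The following is a rough sketch under that assumption: `empirical_rademacher` is a hypothetical helper, and it takes the class as a matrix whose row j holds the values f_j(x₁), ..., f_j(x_m).

```python
from itertools import product

import numpy as np

def empirical_rademacher(predictions, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_σ[ sup_f (1/m) Σᵢ σᵢ f(xᵢ) ] for a finite class.

    predictions: array of shape (n_functions, m).
    """
    rng = np.random.default_rng(seed)
    m = predictions.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # correlations[k, j] = (1/m) Σᵢ sigma[k, i] * f_j(xᵢ)
    correlations = sigma @ predictions.T / m
    # Supremum over the class for each draw, then average over draws
    return correlations.max(axis=1).mean()

m = 4
single_function = np.ones((1, m))                                 # one constant function
all_sign_patterns = np.array(list(product([-1.0, 1.0], repeat=m)))  # every ±1 labeling

print(empirical_rademacher(single_function))     # close to 0: cannot fit noise
print(empirical_rademacher(all_sign_patterns))   # exactly 1: fits any noise pattern
```

The two extremes bracket the quantity: a single function has complexity zero in expectation, while a class containing every sign pattern always matches σ perfectly.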

### Concentration Inequalities
Control how random variables deviate from their expectations:
- **Hoeffding's inequality**: For bounded random variables
- **McDiarmid's inequality**: For functions with bounded differences
- **Azuma's inequality**: For martingales
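Hoeffding's inequality bounds a tail probability, so the natural numerical check compares it with an empirical tail frequency at a fixed ε. The sketch below assumes Bernoulli(p) samples, which are bounded in [0, 1]; the function name and default parameters are illustrative choices.

```python
import numpy as np

def hoeffding_check(n, epsilon, p=0.6, n_experiments=20000, seed=0):
    """Compare the empirical frequency of |X̄ - p| ≥ ε with 2·exp(-2nε²)."""
    rng = np.random.default_rng(seed)
    # Each binomial draw is the sum of n Bernoulli(p) variables, so dividing by n gives X̄
    sample_means = rng.binomial(n, p, size=n_experiments) / n
    # Small tolerance so that deviations exactly equal to ε survive floating-point rounding
    empirical_tail = np.mean(np.abs(sample_means - p) >= epsilon - 1e-12)
    # For variables in [0, 1] we have (b - a)² = 1; the bound is vacuous above 1
    bound = min(1.0, 2 * np.exp(-2 * n * epsilon**2))
    return empirical_tail, bound

for n in (10, 50, 100, 200):
    tail, bound = hoeffding_check(n, epsilon=0.1)
    print(f"n={n:4d}  empirical tail={tail:.4f}  Hoeffding bound={bound:.4f}")
```

The bound is distribution-free, so it is usually loose for any particular distribution; what matters is the exponential decay in n.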

### Reproducing Kernel Hilbert Spaces (RKHS)
Functional analysis foundation for kernel methods:
- **Hilbert space**: Complete inner product space
- **Reproducing property**: ⟨k(·,x), f⟩ = f(x)
- **Representer theorem**: The regularized solution lies in the span of the kernel functions k(·, xᵢ) at the training points
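The representer theorem is what makes kernel methods computable: a regularized least-squares fit in an infinite-dimensional RKHS reduces to solving a linear system for one coefficient per training point. Here is a minimal sketch for kernel ridge regression with a 1-D RBF kernel; the function names, the toy data, and the values of `gamma` and `lam` are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel_matrix(a, b, gamma=1.0):
    """k(x, z) = exp(-gamma * (x - z)²) for 1-D inputs a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def fit_kernel_ridge(x_train, y_train, gamma=1.0, lam=1e-3):
    """Minimize Σᵢ (f(xᵢ) - yᵢ)² + lam·||f||² over the RKHS.

    By the representer theorem f = Σᵢ αᵢ k(·, xᵢ), with (K + lam·I) α = y.
    """
    K = rbf_kernel_matrix(x_train, x_train, gamma)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def predict(x_new, x_train, alpha, gamma=1.0):
    """Evaluate f(x) = Σᵢ αᵢ k(x, xᵢ)."""
    return rbf_kernel_matrix(x_new, x_train, gamma) @ alpha

x_train = np.array([-1.5, -0.5, 0.5, 1.5])
y_train = np.array([1.0, -1.0, 1.0, -1.0])

alpha = fit_kernel_ridge(x_train, y_train)
fitted = predict(x_train, x_train, alpha)
rkhs_norm_sq = alpha @ rbf_kernel_matrix(x_train, x_train) @ alpha  # ||f||² = αᵀKα

print("dual coefficients:", np.round(alpha, 3))
print("fitted values:", np.round(fitted, 3))
print("squared RKHS norm:", round(float(rkhs_norm_sq), 3))
```

Increasing `lam` trades training fit for a smaller RKHS norm, which is the functional-analytic version of preferring smoother functions.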

In [None]:
def demonstrate_ml_theory_applications():
    """Explore measure theory applications in ML theory"""
    
    print("🤖 Measure Theory in Machine Learning Theory")
    print("=" * 45)
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # 1. Empirical risk vs true risk
    print("\n1. Empirical Risk Minimization")
    print("   How training error relates to true error")
    
    # Generate synthetic data
    np.random.seed(42)
    n_samples = 1000
    X, y = make_classification(n_samples=n_samples, n_features=2, n_redundant=0,
                              n_informative=2, n_clusters_per_class=1, random_state=42)
    
    # Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train models with different complexities
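    # Use fractions of the largest training fold: with cv=3 each fit sees only about 2/3 of
    # X_train, so absolute sizes up to len(X_train) would make learning_curve raise an error
    train_sizes = np.logspace(np.log10(0.02), 0, 20)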
    
    models = {
        'Low complexity (RF depth=3)': RandomForestClassifier(max_depth=3, n_estimators=10, random_state=42),
        'Medium complexity (RF depth=10)': RandomForestClassifier(max_depth=10, n_estimators=10, random_state=42),
        'High complexity (RF depth=None)': RandomForestClassifier(max_depth=None, n_estimators=10, random_state=42)
    }
    
    colors = ['blue', 'green', 'red']
    
    for i, (name, model) in enumerate(models.items()):
        # Learning curves
        train_sizes_actual, train_scores, test_scores = learning_curve(
            model, X_train, y_train, train_sizes=train_sizes, cv=3,
            scoring='accuracy', random_state=42, n_jobs=-1
        )
        
        train_mean = train_scores.mean(axis=1)
        train_std = train_scores.std(axis=1)
        test_mean = test_scores.mean(axis=1)
        test_std = test_scores.std(axis=1)
        
        axes[0, 0].plot(train_sizes_actual, train_mean, 'o-', color=colors[i], 
                       linewidth=2, markersize=4, label=f'{name} (train)')
        axes[0, 0].fill_between(train_sizes_actual, train_mean - train_std,
                               train_mean + train_std, alpha=0.1, color=colors[i])
        
        axes[0, 0].plot(train_sizes_actual, test_mean, 's--', color=colors[i],
                       linewidth=2, markersize=4, label=f'{name} (test)')
        axes[0, 0].fill_between(train_sizes_actual, test_mean - test_std,
                               test_mean + test_std, alpha=0.1, color=colors[i])
    
    axes[0, 0].set_xlabel('Training Set Size')
    axes[0, 0].set_ylabel('Accuracy')
    axes[0, 0].set_title('Learning Curves: Empirical vs True Risk')
    axes[0, 0].set_xscale('log')
    axes[0, 0].legend(fontsize=8)
    axes[0, 0].grid(True, alpha=0.3)
    
    print(f"   Training samples: {len(X_train)}")
    print(f"   Test samples: {len(X_test)}")
    print(f"   Observe: High complexity models overfit with small data")
    
    # 2. Concentration inequalities
    print("\n2. Concentration Inequalities")
    print("   How sample means concentrate around true means")
    
    # Demonstrate Hoeffding's inequality
    # Generate bounded random variables [0, 1]
    true_mean = 0.6
    sample_sizes = [10, 50, 100, 500, 1000]
    n_experiments = 1000
    
    deviations = []
    hoeffding_bounds = []
    
    for n in sample_sizes:
        # Run experiments
        sample_means = []
        for _ in range(n_experiments):
            samples = np.random.uniform(0, 1, n)  # Bounded in [0,1]
            samples = (samples < true_mean).astype(float)  # Convert to Bernoulli
            sample_means.append(np.mean(samples))
        
        sample_means = np.array(sample_means)
        empirical_deviation = np.mean(np.abs(sample_means - true_mean))
        deviations.append(empirical_deviation)
        
        # Hoeffding bound: P(|X̄ - μ| ≥ ε) ≤ 2exp(-2nε²/(b-a)²)
        # This bounds a tail probability at a fixed ε, not the mean deviation itself
        epsilon = 0.1  # We want P(|X̄ - μ| ≥ ε)
        hoeffding_bound = min(1.0, 2 * np.exp(-2 * n * epsilon**2))  # (b-a)² = 1; cap the vacuous case at 1
        hoeffding_bounds.append(hoeffding_bound)
    
    axes[0, 1].loglog(sample_sizes, deviations, 'bo-', linewidth=2, 
                     markersize=6, label='Empirical deviation')
    axes[0, 1].loglog(sample_sizes, hoeffding_bounds, 'r--', linewidth=2,
                     label=f'Hoeffding bound (ε={epsilon})')
    axes[0, 1].loglog(sample_sizes, 1/np.sqrt(np.array(sample_sizes)), 'g:', 
                     linewidth=2, label='1/√n (CLT rate)')
    axes[0, 1].set_xlabel('Sample Size (n)')
    axes[0, 1].set_ylabel('Mean |X̄ - μ| / Tail Probability Bound')
    axes[0, 1].set_title('Concentration of Sample Means')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    print(f"   True mean: {true_mean}")
    print(f"   Empirical deviations: {[f'{d:.4f}' for d in deviations]}")
    print(f"   Hoeffding bounds: {[f'{b:.4f}' for b in hoeffding_bounds]}")
    
    # 3. VC dimension illustration
    print("\n3. VC Dimension: Measuring Model Complexity")
    print("   How many points can a hypothesis class shatter?")
    
    # Demonstrate VC dimension for linear classifiers in 2D
    np.random.seed(42)
    
    # VC dimension of linear classifiers in 2D is 3
    # Show that we can shatter 3 points but not 4
    
    # 3 points that can be shattered
    points_3 = np.array([[0, 0], [1, 0], [0.5, 1]])
    
    # All possible labelings of 3 points
    all_labelings_3 = []
    for i in range(2**3):
        labeling = [(i >> j) & 1 for j in range(3)]
        all_labelings_3.append(labeling)
    
    # Plot the 3 points
    axes[0, 2].scatter(points_3[:, 0], points_3[:, 1], s=100, c='red', 
                      edgecolors='black', linewidth=2, zorder=5)
    
    # Annotate points
    for i, (px, py) in enumerate(points_3):  # px, py avoid shadowing the label array y
        axes[0, 2].annotate(f'P{i+1}', (px, py), xytext=(5, 5), 
                          textcoords='offset points', fontsize=10)
    
    axes[0, 2].set_xlim(-0.5, 1.5)
    axes[0, 2].set_ylim(-0.5, 1.5)
    axes[0, 2].set_xlabel('x₁')
    axes[0, 2].set_ylabel('x₂')
    axes[0, 2].set_title('VC Dimension Example: 3 Points\n(Can be shattered by linear classifier)')
    axes[0, 2].grid(True, alpha=0.3)
    axes[0, 2].set_aspect('equal')
    
    print(f"   Linear classifiers in 2D: VC dimension = 3")
    print(f"   Can represent all {len(all_labelings_3)} labelings of 3 points")
    print(f"   Cannot shatter any set of 4 points (a consequence of Radon's theorem)")
    
    # 4. Rademacher complexity
    print("\n4. Rademacher Complexity")
    print("   Measuring how well functions fit random noise")
    
    # Empirical Rademacher complexity for linear functions
    n_samples_rad = 100
    n_trials = 500
    dimensions = [1, 2, 5, 10, 20, 50]
    
    rademacher_complexities = []
    theoretical_bounds = []
    
    for d in dimensions:
        rad_values = []
        
        for trial in range(n_trials):
            # Generate random data
            X_rad = np.random.randn(n_samples_rad, d)
            X_rad = X_rad / np.linalg.norm(X_rad, axis=1, keepdims=True)  # Normalize
            
            # Generate Rademacher variables
            sigma = np.random.choice([-1, 1], size=n_samples_rad)
            
            # Supremum over linear functions x ↦ ⟨w, x⟩ with ||w||₂ ≤ 1:
            # sup_w (1/n) Σᵢ σᵢ⟨w, xᵢ⟩ = ||X^T σ||₂ / n
            rademacher_value = np.linalg.norm(X_rad.T @ sigma) / n_samples_rad
            rad_values.append(rademacher_value)
        
        empirical_rad = np.mean(rad_values)
        rademacher_complexities.append(empirical_rad)
        
        # Upper bound √(d/m); for unit-norm inputs the sharper bound 1/√m also holds,
        # which is why the empirical curve stays nearly flat in d
        theoretical_bound = np.sqrt(d / n_samples_rad)
        theoretical_bounds.append(theoretical_bound)
    
    axes[1, 0].loglog(dimensions, rademacher_complexities, 'bo-', linewidth=2,
                     markersize=6, label='Empirical Rademacher complexity')
    axes[1, 0].loglog(dimensions, theoretical_bounds, 'r--', linewidth=2,
                     label='Theoretical bound √(d/m)')
    axes[1, 0].set_xlabel('Dimension (d)')
    axes[1, 0].set_ylabel('Rademacher Complexity')
    axes[1, 0].set_title(f'Rademacher Complexity vs Dimension\n(n={n_samples_rad} samples)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    print(f"   Sample size: {n_samples_rad}")
    print(f"   Empirical complexities: {[f'{r:.4f}' for r in rademacher_complexities[:4]]}...")
    print(f"   Theoretical bounds: {[f'{t:.4f}' for t in theoretical_bounds[:4]]}...")
    
    # 5. Kernel methods and RKHS
    print("\n5. Reproducing Kernel Hilbert Spaces")
    print("   Functional analysis foundation of kernel methods")
    
    # Demonstrate different kernels
    from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
    
    # Generate 1D data
    x_kernel = np.linspace(-2, 2, 100).reshape(-1, 1)
    x_train_kernel = np.array([-1.5, -0.5, 0.5, 1.5]).reshape(-1, 1)
    y_train_kernel = np.array([1, -1, 1, -1])
    
    # Different kernels
    kernels = {
        'RBF (γ=1)': lambda X1, X2: rbf_kernel(X1, X2, gamma=1),
        'RBF (γ=5)': lambda X1, X2: rbf_kernel(X1, X2, gamma=5),
        'Polynomial (d=2)': lambda X1, X2: polynomial_kernel(X1, X2, degree=2, coef0=1)
    }
    
    colors_kernel = ['blue', 'green', 'red']
    
    for i, (name, kernel_func) in enumerate(kernels.items()):
        # Compute kernel matrix
        K_train = kernel_func(x_train_kernel, x_train_kernel)
        K_test = kernel_func(x_kernel, x_train_kernel)
        
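        # Solve for dual coefficients with a tiny ridge term for numerical stability
        # (the degree-2 polynomial Gram matrix on 4 points has rank ≤ 3, so it is singular)
        try:
            alpha = np.linalg.solve(K_train + 1e-6 * np.eye(len(K_train)), y_train_kernel)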
            # Prediction: f(x) = Σ αᵢ k(x, xᵢ)
            y_pred = K_test @ alpha
            
            axes[1, 1].plot(x_kernel.flatten(), y_pred, color=colors_kernel[i],
                           linewidth=2, label=name)
        except np.linalg.LinAlgError:
            print(f"   Warning: Singular kernel matrix for {name}")
    
    # Plot training data
    axes[1, 1].scatter(x_train_kernel.flatten(), y_train_kernel, 
                      c='black', s=100, edgecolors='white', linewidth=2,
                      zorder=5, label='Training data')
    
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].set_title('Kernel Functions in RKHS')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    print(f"   Training points: {x_train_kernel.flatten()}")
    print(f"   Training labels: {y_train_kernel}")
    print(f"   Different kernels create different function spaces")
    
    # 6. Generalization bounds
    print("\n6. Generalization Bounds")
    print("   Connecting empirical risk to true risk")
    
    # Demonstrate how bounds depend on sample size and complexity
    sample_sizes_gen = np.logspace(1, 4, 50)
    vc_dimensions = [5, 10, 20, 50]
    confidence = 0.05  # δ, the failure probability (bounds hold with probability ≥ 95%)
    
    for i, vc_dim in enumerate(vc_dimensions):
        # VC generalization bound (simplified)
        # With probability ≥ 1-δ: |R(h) - R̂(h)| ≤ √((VC_dim * log(m/VC_dim) + log(1/δ)) / m)
        bounds = []
        for m in sample_sizes_gen:
            if m > vc_dim:  # Valid regime
                bound = np.sqrt((vc_dim * np.log(m/vc_dim) + np.log(1/confidence)) / m)
                bounds.append(bound)
            else:
                bounds.append(np.nan)
        
        axes[1, 2].loglog(sample_sizes_gen, bounds, linewidth=2,
                         label=f'VC dim = {vc_dim}')
    
    axes[1, 2].set_xlabel('Sample Size (m)')
    axes[1, 2].set_ylabel('Generalization Bound')
    axes[1, 2].set_title('VC Generalization Bounds')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)
    
    print(f"   Confidence level: {1-confidence:.0%}")
    print(f"   Bounds decrease as O(√(VC_dim * log(m) / m))")
    print(f"   Higher complexity → looser bounds")
    print(f"   More data → tighter bounds")
    
    plt.tight_layout()
    plt.show()
    
    print("\n🎯 Key ML Theory Concepts:")
    print("• ERM works when empirical risk uniformly converges to true risk")
    print("• Concentration inequalities bound deviations from expectations")
    print("• VC dimension measures the complexity of hypothesis classes")
    print("• Rademacher complexity measures overfitting to random noise")
    print("• RKHS provides the functional analysis foundation for kernels")
    print("• Generalization bounds connect sample complexity to model complexity")

demonstrate_ml_theory_applications()

## Conclusion: The Mathematical Foundation

### What We've Accomplished
Through this journey into **applied measure theory**, you've built the rigorous mathematical foundation that underlies modern machine learning theory. You now understand:

#### Core Measure Theory
- **Measure spaces**: The triple (Ω, ℱ, μ) that makes probability rigorous
- **Lebesgue integration**: A more general approach than Riemann integration
- **Random variables**: Measurable functions from probability spaces to real numbers
- **Convergence theorems**: When we can interchange limits and integrals

#### Machine Learning Applications
- **PAC learning**: Formal framework for learnable problems
- **Uniform convergence**: When empirical risk converges to true risk
- **VC dimension**: Measuring the complexity of hypothesis classes
- **Concentration inequalities**: Controlling deviations from expectations
- **RKHS theory**: The functional analysis behind kernel methods

### Why This Matters
Measure theory isn't just abstract mathematics—it's the **invisible foundation** that makes machine learning theoretically sound:

1. **Rigorous probability**: Avoids paradoxes and logical inconsistencies
2. **Generalization theory**: Explains why ML algorithms work
3. **Kernel methods**: Provides the mathematical framework for SVMs and Gaussian processes
4. **Deep learning theory**: Enables analysis of neural network expressivity and generalization

### The Broader Picture
You've now completed a comprehensive journey through the mathematical foundations of machine learning:

- **Linear algebra**: The computational engine
- **Calculus**: The optimization toolkit
- **Probability**: The uncertainty framework
- **Statistics**: The inference methodology
- **Optimization**: The learning algorithms
- **Information theory**: The complexity measures
- **Bayesian methods**: The uncertainty quantification
- **Stochastic processes**: The temporal modeling
- **Measure theory**: The rigorous foundation

### Moving Forward
With these mathematical tools, you're equipped to:
- Understand cutting-edge ML research papers
- Develop new algorithms with theoretical guarantees
- Debug and improve existing methods
- Bridge the gap between theory and practice

Remember: **mathematics is not the enemy of intuition—it's the language that makes intuition precise and reliable.**

🎓 **Congratulations on mastering the mathematical foundations of machine learning!** 🎓