# Statistics and Statistical Inference for Machine Learning

This notebook covers essential statistical concepts for machine learning:
- Descriptive statistics and data exploration
- Hypothesis testing and p-values
- Confidence intervals and statistical inference
- Model evaluation and validation
- Bias, variance, and the bias-variance tradeoff
- A/B testing and experimental design

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.model_selection import cross_val_score, learning_curve, validation_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression, make_classification
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

## 1. Descriptive Statistics and Data Exploration

Understanding your data is the first step in any ML project.

In [None]:
# Generate sample dataset
np.random.seed(42)
n_samples = 1000

# Create mixed distributions to demonstrate different statistical properties
data = {
    'normal': np.random.normal(100, 15, n_samples),
    'skewed': np.random.exponential(2, n_samples),
    'bimodal': np.concatenate([np.random.normal(80, 10, n_samples//2),
                              np.random.normal(120, 8, n_samples//2)]),
    'uniform': np.random.uniform(50, 150, n_samples)
}

df = pd.DataFrame(data)

# Calculate descriptive statistics
def calculate_stats(data):
    """Calculate comprehensive descriptive statistics"""
    stats_dict = {
        'mean': np.mean(data),
        'median': np.median(data),
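        # For continuous data every value is typically unique, so the mode is
        # rarely informative (SciPy then returns the smallest value).
        'mode': stats.mode(data, keepdims=True)[0][0],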
        'std': np.std(data, ddof=1),
        'variance': np.var(data, ddof=1),
        'skewness': stats.skew(data),
        'kurtosis': stats.kurtosis(data),  # excess (Fisher) kurtosis: 0 for a normal distribution
        'min': np.min(data),
        'max': np.max(data),
        'q25': np.percentile(data, 25),
        'q75': np.percentile(data, 75),
        'iqr': np.percentile(data, 75) - np.percentile(data, 25)
    }
    return stats_dict

# Visualize distributions and statistics
fig, axes = plt.subplots(2, 4, figsize=(20, 12))

colors = ['blue', 'red', 'green', 'purple']

for i, (name, values) in enumerate(data.items()):
    stats_dict = calculate_stats(values)
    
    # Histogram with statistics
    axes[0, i].hist(values, bins=50, alpha=0.7, color=colors[i], density=True, edgecolor='black')
    axes[0, i].axvline(stats_dict['mean'], color='red', linestyle='-', linewidth=2, label=f'Mean: {stats_dict["mean"]:.1f}')
    axes[0, i].axvline(stats_dict['median'], color='green', linestyle='--', linewidth=2, label=f'Median: {stats_dict["median"]:.1f}')
    
    axes[0, i].set_title(f'{name.title()} Distribution', fontweight='bold')
    axes[0, i].set_xlabel('Value')
    axes[0, i].set_ylabel('Density')
    axes[0, i].legend()
    axes[0, i].grid(True, alpha=0.3)
    
    # Box plot
    bp = axes[1, i].boxplot(values, patch_artist=True, widths=0.6)
    bp['boxes'][0].set_facecolor(colors[i])
    bp['boxes'][0].set_alpha(0.7)
    
    axes[1, i].set_title(f'{name.title()}: Skew={stats_dict["skewness"]:.2f}', fontweight='bold')
    axes[1, i].set_ylabel('Value')
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Create summary statistics table
stats_df = pd.DataFrame({name: calculate_stats(values) for name, values in data.items()}).T
print("Descriptive Statistics Summary:")
print("=" * 50)
print(stats_df.round(2))

# Interpretation
print("\nInterpretations:")
print("- Normal: Symmetric, skewness ≈ 0, mean ≈ median")
print("- Skewed: Right-tailed, positive skewness, mean > median")
print("- Bimodal: Two peaks, often negative excess kurtosis")
print("- Uniform: Flat distribution, negative excess kurtosis (about -1.2)")

## 2. Hypothesis Testing

Statistical tests to make decisions about populations from sample data.
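
To make the mechanics concrete, here is a minimal sketch of the one-sample t-test computed by hand and checked against `scipy.stats.ttest_1samp`. The data, the seed, and the names `t_manual` and `p_manual` are illustrative assumptions, not part of the notebook's own examples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # arbitrary seed for the sketch
sample = rng.normal(105, 15, size=50)    # illustrative data only
mu0 = 100                                # hypothesized population mean (H0)

n = sample.size
se = sample.std(ddof=1) / np.sqrt(n)     # standard error of the mean
t_manual = (sample.mean() - mu0) / se    # t = (x̄ - μ0) / (s / √n)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

The two lines of output should agree, which shows that the p-value is just the two-sided tail area of a t distribution with n - 1 degrees of freedom.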

In [None]:
# Hypothesis testing examples

# 1. One-sample t-test
def one_sample_ttest_demo():
    """Demonstrate one-sample t-test"""
    # H0: population mean = 100
    # H1: population mean ≠ 100
    
    sample_data = np.random.normal(105, 15, 50)  # True mean = 105
    hypothesized_mean = 100
    
    t_stat, p_value = stats.ttest_1samp(sample_data, hypothesized_mean)
    
    print("One-Sample T-Test")
    print("=" * 20)
    print(f"Sample mean: {np.mean(sample_data):.2f}")
    print(f"Hypothesized mean: {hypothesized_mean}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Conclusion: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'} (α=0.05)")
    
    return sample_data, t_stat, p_value

# 2. Two-sample t-test
def two_sample_ttest_demo():
    """Demonstrate two-sample t-test"""
    # H0: mean1 = mean2
    # H1: mean1 ≠ mean2
    
    group1 = np.random.normal(100, 15, 50)
    group2 = np.random.normal(108, 15, 45)  # Different mean
    
    t_stat, p_value = stats.ttest_ind(group1, group2)
    
    print("\nTwo-Sample T-Test")
    print("=" * 20)
    print(f"Group 1 mean: {np.mean(group1):.2f}")
    print(f"Group 2 mean: {np.mean(group2):.2f}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Conclusion: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'} (α=0.05)")
    
    return group1, group2, t_stat, p_value

# 3. Chi-square test
def chi_square_test_demo():
    """Demonstrate chi-square test for independence"""
    # Test independence between two categorical variables
    observed = np.array([[20, 30, 15],   # Group A
                        [25, 20, 10],   # Group B
                        [15, 25, 20]])  # Group C
    
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
    
    print("\nChi-Square Test of Independence")
    print("=" * 35)
    print(f"Chi-square statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Degrees of freedom: {dof}")
    print(f"Conclusion: {'Reject H0: evidence of association' if p_value < 0.05 else 'Fail to reject H0: no evidence of association'} (α=0.05)")
    
    return observed, expected, chi2_stat, p_value

# Run demonstrations
sample1, t1, p1 = one_sample_ttest_demo()
group1, group2, t2, p2 = two_sample_ttest_demo()
observed, expected, chi2, p_chi = chi_square_test_demo()

# Visualize hypothesis tests
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# One-sample t-test visualization
x = np.linspace(np.min(sample1), np.max(sample1), 100)
axes[0, 0].hist(sample1, bins=15, alpha=0.7, density=True, color='skyblue', edgecolor='black')
axes[0, 0].axvline(np.mean(sample1), color='red', linestyle='-', linewidth=2, label=f'Sample mean: {np.mean(sample1):.2f}')
axes[0, 0].axvline(100, color='green', linestyle='--', linewidth=2, label='H0 mean: 100')
axes[0, 0].set_title(f'One-Sample T-Test (p={p1:.4f})', fontweight='bold')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Two-sample t-test visualization
axes[0, 1].hist(group1, bins=15, alpha=0.6, density=True, color='blue', label=f'Group 1 (μ={np.mean(group1):.1f})', edgecolor='black')
axes[0, 1].hist(group2, bins=15, alpha=0.6, density=True, color='red', label=f'Group 2 (μ={np.mean(group2):.1f})', edgecolor='black')
axes[0, 1].set_title(f'Two-Sample T-Test (p={p2:.4f})', fontweight='bold')
axes[0, 1].set_xlabel('Value')
axes[0, 1].set_ylabel('Density')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# P-value distribution
# Simulate multiple tests to show p-value distribution under null hypothesis
n_simulations = 1000
p_values_null = []
for _ in range(n_simulations):
    null_sample = np.random.normal(100, 15, 50)  # True null hypothesis
    _, p_val = stats.ttest_1samp(null_sample, 100)
    p_values_null.append(p_val)

axes[1, 0].hist(p_values_null, bins=20, alpha=0.7, density=True, color='lightgreen', edgecolor='black')
axes[1, 0].axhline(y=1, color='red', linestyle='--', linewidth=2, label='Uniform under H0')
axes[1, 0].axvline(0.05, color='red', linestyle='-', linewidth=2, label='α = 0.05')
axes[1, 0].set_title('P-value Distribution Under H0', fontweight='bold')
axes[1, 0].set_xlabel('P-value')
axes[1, 0].set_ylabel('Density')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

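# Statistical power (1 - Type II error rate) as a function of effect size
effect_sizes = np.linspace(0, 10, 100)  # true mean shift in raw units
power_values = []

# scipy.stats has no ttest_power function, so compute the power of a
# two-sided one-sample t-test from the noncentral t distribution.
n_power, alpha_power = 50, 0.05
df_power = n_power - 1
t_crit_power = stats.t.ppf(1 - alpha_power / 2, df_power)

for effect in effect_sizes:
    nc = (effect / 15) * np.sqrt(n_power)  # noncentrality = Cohen's d * sqrt(n)
    power = stats.nct.sf(t_crit_power, df_power, nc) + stats.nct.cdf(-t_crit_power, df_power, nc)
    power_values.append(power)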

axes[1, 1].plot(effect_sizes, power_values, 'b-', linewidth=3, label='Statistical Power')
axes[1, 1].axhline(y=0.8, color='red', linestyle='--', linewidth=2, label='Power = 0.8')
axes[1, 1].axhline(y=0.05, color='orange', linestyle='--', linewidth=2, label='Type I Error = 0.05')
axes[1, 1].set_title('Statistical Power vs Effect Size', fontweight='bold')
axes[1, 1].set_xlabel('Effect Size (mean shift; σ = 15)')
axes[1, 1].set_ylabel('Power (1 - Type II Error)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Confidence Intervals

Quantifying uncertainty in statistical estimates.
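
As a quick reference, the sketch below builds a 95% t-interval for a mean in two ways: from the formula x̄ ± t·s/√n, and with `scipy.stats.t.interval`. The sample and seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)           # arbitrary seed for the sketch
sample = rng.normal(100, 15, size=50)    # illustrative data only

mean = sample.mean()
sem = stats.sem(sample)                  # s / sqrt(n), with ddof=1
df = sample.size - 1

# Manual interval: mean ± t_crit * standard error
t_crit = stats.t.ppf(0.975, df)
manual_ci = (mean - t_crit * sem, mean + t_crit * sem)

# Same interval from SciPy (confidence level passed positionally)
scipy_ci = stats.t.interval(0.95, df, loc=mean, scale=sem)

print(f"manual: [{manual_ci[0]:.3f}, {manual_ci[1]:.3f}]")
print(f"scipy:  [{scipy_ci[0]:.3f}, {scipy_ci[1]:.3f}]")
```

The 95% refers to the procedure: across repeated samples, about 95% of intervals built this way contain the true mean. It is not the probability that this particular interval does.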

In [None]:
# Confidence interval demonstrations

def simulate_confidence_intervals(true_mean=100, true_std=15, sample_size=50, 
                                confidence_level=0.95, n_simulations=100):
    """Simulate confidence intervals to show coverage probability"""
    
    alpha = 1 - confidence_level
    t_critical = stats.t.ppf(1 - alpha/2, sample_size - 1)
    
    intervals = []
    sample_means = []
    contains_true_mean = []
    
    for _ in range(n_simulations):
        # Generate sample
        sample = np.random.normal(true_mean, true_std, sample_size)
        sample_mean = np.mean(sample)
        sample_std = np.std(sample, ddof=1)
        
        # Calculate confidence interval
        margin_error = t_critical * (sample_std / np.sqrt(sample_size))
        ci_lower = sample_mean - margin_error
        ci_upper = sample_mean + margin_error
        
        intervals.append((ci_lower, ci_upper))
        sample_means.append(sample_mean)
        contains_true_mean.append(ci_lower <= true_mean <= ci_upper)
    
    return intervals, sample_means, contains_true_mean

# Run simulation
intervals, means, coverage = simulate_confidence_intervals(n_simulations=100)
coverage_rate = np.mean(coverage)

print("Confidence Interval Simulation Results:")
print(f"Empirical coverage rate: {coverage_rate:.1%} (Expected: 95%)")
print(f"Number of intervals containing true mean: {sum(coverage)}/{len(coverage)}")

# Visualize confidence intervals
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 12))

# Plot first 50 confidence intervals
n_show = 50
for i in range(n_show):
    color = 'green' if coverage[i] else 'red'
    ax1.plot([intervals[i][0], intervals[i][1]], [i, i], color=color, linewidth=2, alpha=0.7)
    ax1.plot(means[i], i, 'o', color=color, markersize=4)

ax1.axvline(100, color='blue', linestyle='--', linewidth=3, label='True mean')
ax1.set_xlabel('Value')
ax1.set_ylabel('Sample Number')
ax1.set_title(f'95% Confidence Intervals (First {n_show} samples)\nGreen: Contains true mean, Red: Misses true mean', 
             fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bootstrap confidence intervals
def bootstrap_ci(data, n_bootstrap=1000, confidence_level=0.95):
    """Calculate bootstrap confidence interval"""
    bootstrap_means = []
    n = len(data)
    
    for _ in range(n_bootstrap):
        bootstrap_sample = np.random.choice(data, size=n, replace=True)
        bootstrap_means.append(np.mean(bootstrap_sample))
    
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_means, 100 * alpha/2)
    ci_upper = np.percentile(bootstrap_means, 100 * (1 - alpha/2))
    
    return bootstrap_means, ci_lower, ci_upper

# Compare different CI methods
sample_data = np.random.normal(100, 15, 100)
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)

# T-distribution CI
t_crit = stats.t.ppf(0.975, len(sample_data) - 1)
t_ci = (sample_mean - t_crit * sample_std/np.sqrt(len(sample_data)),
        sample_mean + t_crit * sample_std/np.sqrt(len(sample_data)))

# Bootstrap CI
bootstrap_means, boot_ci_lower, boot_ci_upper = bootstrap_ci(sample_data)

# Plot bootstrap distribution and CIs
ax2.hist(bootstrap_means, bins=50, alpha=0.7, density=True, color='lightblue', 
         edgecolor='black', label='Bootstrap distribution')
ax2.axvline(sample_mean, color='red', linestyle='-', linewidth=2, label=f'Sample mean: {sample_mean:.2f}')
ax2.axvline(100, color='green', linestyle='--', linewidth=2, label='True mean: 100')
ax2.axvline(t_ci[0], color='orange', linestyle=':', linewidth=2, label=f'T-CI: [{t_ci[0]:.2f}, {t_ci[1]:.2f}]')
ax2.axvline(t_ci[1], color='orange', linestyle=':', linewidth=2)
ax2.axvline(boot_ci_lower, color='purple', linestyle='-.', linewidth=2, label=f'Bootstrap CI: [{boot_ci_lower:.2f}, {boot_ci_upper:.2f}]')
ax2.axvline(boot_ci_upper, color='purple', linestyle='-.', linewidth=2)

ax2.set_xlabel('Sample Mean')
ax2.set_ylabel('Density')
ax2.set_title('Bootstrap Distribution and Confidence Intervals', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nConfidence Interval Comparison:")
print(f"T-distribution CI: [{t_ci[0]:.3f}, {t_ci[1]:.3f}]")
print(f"Bootstrap CI: [{boot_ci_lower:.3f}, {boot_ci_upper:.3f}]")
print(f"CI width difference: {(t_ci[1] - t_ci[0]) - (boot_ci_upper - boot_ci_lower):.3f}")

## 4. Model Evaluation and Cross-Validation

Statistical methods for assessing model performance and generalization.
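
To show what k-fold cross-validation does under the hood, here is a minimal NumPy-only sketch. `kfold_mse` is a hypothetical helper written for this illustration, the data are synthetic, and the folds are shuffled (an assumption of the sketch; scikit-learn's default K-fold split does not shuffle).

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Return the held-out MSE of a polynomial fit for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)  # shuffled index folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)  # fit on k-1 folds
        pred = np.polyval(coeffs, x[test_idx])                   # predict held-out fold
        scores.append(np.mean((y[test_idx] - pred) ** 2))
    return np.array(scores)

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, 200)
y = 3.0 * x + rng.normal(0, 1.0, 200)    # synthetic linear data, noise std = 1

scores = kfold_mse(x, y, degree=1)
print(f"MSE per fold: {np.round(scores, 3)}")
print(f"Mean ± std:   {scores.mean():.3f} ± {scores.std(ddof=1):.3f}")
```

Every observation is used for testing exactly once, and the spread across folds gives a rough sense of how stable the estimate is.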

In [None]:
# Generate regression dataset
X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)
X = X.flatten()

# Different model complexities
models = {
    'Linear': Pipeline([('poly', PolynomialFeatures(1)), ('reg', LinearRegression())]),
    'Polynomial (degree 3)': Pipeline([('poly', PolynomialFeatures(3)), ('reg', LinearRegression())]),
    'Polynomial (degree 10)': Pipeline([('poly', PolynomialFeatures(10)), ('reg', LinearRegression())]),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Cross-validation evaluation
def evaluate_models(models, X, y, cv_folds=5):
    """Evaluate models using cross-validation"""
    results = {}
    
    for name, model in models.items():
        # Reshape X for sklearn if needed
        X_reshaped = X.reshape(-1, 1) if X.ndim == 1 else X
        
        # Cross-validation scores
        cv_scores = cross_val_score(model, X_reshaped, y, cv=cv_folds, 
                                  scoring='neg_mean_squared_error')
        cv_scores = -cv_scores  # Convert to positive MSE
        
        results[name] = {
            'cv_scores': cv_scores,
            'mean_cv_score': np.mean(cv_scores),
            'std_cv_score': np.std(cv_scores),
            'model': model
        }
    
    return results

# Evaluate models
results = evaluate_models(models, X, y)

# Learning curves
def plot_learning_curves(models, X, y):
    """Plot learning curves for different models"""
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.flatten()
    
    X_reshaped = X.reshape(-1, 1) if X.ndim == 1 else X
    
    for idx, (name, model) in enumerate(models.items()):
        train_sizes, train_scores, val_scores = learning_curve(
            model, X_reshaped, y, cv=5, n_jobs=-1,
            train_sizes=np.linspace(0.1, 1.0, 10),
            scoring='neg_mean_squared_error'
        )
        
        # Convert to positive MSE
        train_scores = -train_scores
        val_scores = -val_scores
        
        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        val_mean = np.mean(val_scores, axis=1)
        val_std = np.std(val_scores, axis=1)
        
        axes[idx].plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
        axes[idx].fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                              alpha=0.1, color='blue')
        
        axes[idx].plot(train_sizes, val_mean, 'o-', color='red', label='Validation score')
        axes[idx].fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                              alpha=0.1, color='red')
        
        axes[idx].set_xlabel('Training Set Size')
        axes[idx].set_ylabel('Mean Squared Error')
        axes[idx].set_title(f'Learning Curve: {name}', fontweight='bold')
        axes[idx].legend()
        axes[idx].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_learning_curves(models, X, y)

# Model comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Cross-validation scores comparison
model_names = list(results.keys())
cv_means = [results[name]['mean_cv_score'] for name in model_names]
cv_stds = [results[name]['std_cv_score'] for name in model_names]

bars = ax1.bar(model_names, cv_means, yerr=cv_stds, capsize=5, alpha=0.7, 
              color=['blue', 'green', 'red', 'orange'], edgecolor='black')
ax1.set_ylabel('Mean Squared Error')
ax1.set_title('Cross-Validation Performance Comparison', fontweight='bold')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# Add value labels on bars
for bar, mean, std in zip(bars, cv_means, cv_stds):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + std + 5,
             f'{mean:.0f}±{std:.0f}', ha='center', va='bottom', fontweight='bold')

# Box plots of CV scores
cv_scores_list = [results[name]['cv_scores'] for name in model_names]
bp = ax2.boxplot(cv_scores_list, labels=model_names, patch_artist=True)

colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightsalmon']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

ax2.set_ylabel('Mean Squared Error')
ax2.set_title('Distribution of CV Scores', fontweight='bold')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print results
print("Model Evaluation Results:")
print("=" * 50)
for name, result in results.items():
    print(f"{name:20}: MSE = {result['mean_cv_score']:.1f} ± {result['std_cv_score']:.1f}")

# Statistical significance test
best_model = min(results.keys(), key=lambda x: results[x]['mean_cv_score'])
print(f"\nBest model: {best_model}")

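# Paired t-test between best model and others.
# Caveat: CV folds share training data, so fold scores are not independent and
# these p-values tend to be optimistic; treat them as a rough guide only.
print("\nStatistical significance tests (paired t-test, approximate):")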
best_scores = results[best_model]['cv_scores']
for name, result in results.items():
    if name != best_model:
        t_stat, p_value = stats.ttest_rel(best_scores, result['cv_scores'])
        significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
        print(f"{best_model} vs {name:15}: t={t_stat:6.2f}, p={p_value:.4f} {significance}")

## 5. Bias-Variance Tradeoff

Understanding the fundamental tradeoff in machine learning model complexity.
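
The simulation in this section rests on the standard decomposition of expected squared error. Assuming data generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and taking expectations over training sets and noise, the error at a point $x$ splits into three terms:

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}
$$

In the code these correspond to `bias_squared`, `variance`, and `noise = noise_level ** 2`, with the expectation over training sets approximated by averaging over the simulated datasets.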

In [None]:
# Bias-variance decomposition simulation
def bias_variance_decomposition(n_datasets=100, n_test_points=50, noise_level=0.3):
    """Simulate bias-variance decomposition for polynomial models"""
    
    # True function
    def true_function(x):
        return 1.5 * x + 0.5 * x**2 - 0.1 * x**3
    
    # Test points
    x_test = np.linspace(-2, 2, n_test_points)
    y_true = true_function(x_test)
    
    # Different polynomial degrees
    degrees = [1, 3, 5, 10]
    results = {}
    
    for degree in degrees:
        predictions = []
        
        for _ in range(n_datasets):
            # Generate training data
            x_train = np.random.uniform(-2, 2, 50)
            y_train = true_function(x_train) + np.random.normal(0, noise_level, 50)
            
            # Fit polynomial model
            coeffs = np.polyfit(x_train, y_train, degree)
            y_pred = np.polyval(coeffs, x_test)
            predictions.append(y_pred)
        
        predictions = np.array(predictions)
        
        # Calculate bias, variance, and noise
        mean_prediction = np.mean(predictions, axis=0)
        bias_squared = (mean_prediction - y_true) ** 2
        variance = np.var(predictions, axis=0)
        noise = noise_level ** 2
        
        total_error = bias_squared + variance + noise
        
        results[degree] = {
            'predictions': predictions,
            'mean_prediction': mean_prediction,
            'bias_squared': np.mean(bias_squared),
            'variance': np.mean(variance),
            'noise': noise,
            'total_error': np.mean(total_error)
        }
    
    return x_test, y_true, results

# Run bias-variance analysis
x_test, y_true, bv_results = bias_variance_decomposition()

# Visualize bias-variance tradeoff
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

colors = ['blue', 'green', 'red', 'purple']

for idx, (degree, color) in enumerate(zip([1, 3, 5, 10], colors)):
    result = bv_results[degree]
    
    # Plot some individual predictions
    for i in range(min(20, len(result['predictions']))):
        axes[idx].plot(x_test, result['predictions'][i], color=color, alpha=0.1)
    
    # Plot mean prediction and true function
    axes[idx].plot(x_test, result['mean_prediction'], color=color, linewidth=3, 
                  label=f'Mean prediction (degree {degree})')
    axes[idx].plot(x_test, y_true, 'k--', linewidth=2, label='True function')
    
    axes[idx].set_title(f'Degree {degree}: Bias²={result["bias_squared"]:.3f}, '
                       f'Var={result["variance"]:.3f}', fontweight='bold')
    axes[idx].set_xlabel('x')
    axes[idx].set_ylabel('y')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Bias-Variance Tradeoff: Individual Predictions and Averages', 
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Summary plot of bias-variance tradeoff
degrees = list(bv_results.keys())
bias_squared = [bv_results[d]['bias_squared'] for d in degrees]
variance = [bv_results[d]['variance'] for d in degrees]
noise = [bv_results[d]['noise'] for d in degrees]
total_error = [bv_results[d]['total_error'] for d in degrees]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bias-variance components
ax1.plot(degrees, bias_squared, 'o-', color='red', linewidth=2, markersize=8, label='Bias²')
ax1.plot(degrees, variance, 's-', color='blue', linewidth=2, markersize=8, label='Variance')
ax1.plot(degrees, noise, '^-', color='green', linewidth=2, markersize=8, label='Noise')
ax1.plot(degrees, total_error, 'd-', color='black', linewidth=3, markersize=8, label='Total Error')

ax1.set_xlabel('Polynomial Degree (Model Complexity)')
ax1.set_ylabel('Error')
ax1.set_title('Bias-Variance Decomposition', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

# Stacked bar chart
width = 0.6
ax2.bar(degrees, bias_squared, width, label='Bias²', color='red', alpha=0.7)
ax2.bar(degrees, variance, width, bottom=bias_squared, label='Variance', color='blue', alpha=0.7)
ax2.bar(degrees, noise, width, bottom=np.array(bias_squared) + np.array(variance), 
       label='Noise', color='green', alpha=0.7)

ax2.set_xlabel('Polynomial Degree')
ax2.set_ylabel('Error Components')
ax2.set_title('Error Decomposition (Stacked)', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print numerical results
print("Bias-Variance Decomposition Results:")
print("=" * 60)
print(f"{'Degree':>8} {'Bias²':>10} {'Variance':>10} {'Noise':>8} {'Total':>10}")
print("-" * 60)
for degree in degrees:
    result = bv_results[degree]
    print(f"{degree:>8} {result['bias_squared']:>10.4f} {result['variance']:>10.4f} "
          f"{result['noise']:>8.4f} {result['total_error']:>10.4f}")

optimal_degree = min(degrees, key=lambda d: bv_results[d]['total_error'])
print(f"\nOptimal polynomial degree: {optimal_degree}")
print(f"Minimum total error: {bv_results[optimal_degree]['total_error']:.4f}")

## 6. A/B Testing and Experimental Design

Statistical methods for comparing treatments and making decisions.
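
A useful sanity check on the two-proportion z-test: its squared statistic equals the Pearson chi-square statistic on the 2×2 table of conversions, provided the continuity correction is turned off. The sketch below verifies this on hypothetical counts chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, chosen only for illustration
conv_c, n_c = 100, 1000   # control: conversions, visitors
conv_t, n_t = 125, 1000   # treatment: conversions, visitors

# Two-proportion z-test with a pooled standard error
p_c, p_t = conv_c / n_c, conv_t / n_t
pooled = (conv_c + conv_t) / (n_c + n_t)
se = np.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se
p_z = 2 * stats.norm.sf(abs(z))

# Chi-square test on the 2x2 table, without Yates' continuity correction
table = np.array([[conv_c, n_c - conv_c],
                  [conv_t, n_t - conv_t]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.4f}, chi2 = {chi2:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi-square) = {p_chi2:.4f}")
```

The two p-values match because a squared standard normal variable follows a chi-square distribution with one degree of freedom.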

In [None]:
# A/B Testing simulation
def ab_test_simulation(control_rate=0.10, treatment_rate=0.12, n_control=1000, n_treatment=1000):
    """Simulate A/B test with conversion rates"""
    
    # Generate data
    control_conversions = np.random.binomial(n_control, control_rate)
    treatment_conversions = np.random.binomial(n_treatment, treatment_rate)
    
    # Calculate rates
    control_conv_rate = control_conversions / n_control
    treatment_conv_rate = treatment_conversions / n_treatment
    
    # Statistical test (two-proportion z-test)
    pooled_rate = (control_conversions + treatment_conversions) / (n_control + n_treatment)
    pooled_se = np.sqrt(pooled_rate * (1 - pooled_rate) * (1/n_control + 1/n_treatment))
    
    z_stat = (treatment_conv_rate - control_conv_rate) / pooled_se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    
    # Confidence interval for difference
    diff = treatment_conv_rate - control_conv_rate
    se_diff = np.sqrt(control_conv_rate * (1 - control_conv_rate) / n_control +
                     treatment_conv_rate * (1 - treatment_conv_rate) / n_treatment)
    ci_lower = diff - 1.96 * se_diff
    ci_upper = diff + 1.96 * se_diff
    
    return {
        'control_conversions': control_conversions,
        'treatment_conversions': treatment_conversions,
        'control_rate': control_conv_rate,
        'treatment_rate': treatment_conv_rate,
        'difference': diff,
        'z_stat': z_stat,
        'p_value': p_value,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'significant': p_value < 0.05
    }

# Power analysis for A/B testing
def ab_test_power_analysis(control_rate=0.10, effect_sizes=None, alpha=0.05, power=0.8):
    """Calculate required sample sizes for different effect sizes"""
    if effect_sizes is None:
        effect_sizes = np.array([0.001, 0.005, 0.01, 0.02, 0.03, 0.05])
    
    sample_sizes = []
    
    for effect in effect_sizes:
        treatment_rate = control_rate + effect
        
        # Calculate required sample size using formula
        p_avg = (control_rate + treatment_rate) / 2
        z_alpha = stats.norm.ppf(1 - alpha/2)
        z_beta = stats.norm.ppf(power)
        
        n = (z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) + 
             z_beta * np.sqrt(control_rate * (1 - control_rate) + treatment_rate * (1 - treatment_rate)))**2 / effect**2
        
        sample_sizes.append(int(np.ceil(n)))
    
    return effect_sizes, sample_sizes

# Run A/B test examples
test_scenarios = [
    # With 1,000 users per group, a 10% -> 12% lift is detected only about 30% of the time
    (0.10, 0.12, "Larger effect (+2 pp)"),
    (0.10, 0.105, "Small effect"),
    (0.10, 0.10, "No effect")
]

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Run multiple A/B tests
for idx, (control_rate, treatment_rate, description) in enumerate(test_scenarios[:3]):
    if idx < 3:
        result = ab_test_simulation(control_rate, treatment_rate)
        
        ax = axes[idx//2, idx%2]
        
        # Bar plot of conversion rates
        rates = [result['control_rate'], result['treatment_rate']]
        labels = ['Control', 'Treatment']
        colors = ['lightblue', 'lightgreen' if result['significant'] else 'lightcoral']
        
        bars = ax.bar(labels, rates, color=colors, edgecolor='black', alpha=0.7)
        
        # Add error bars (confidence intervals)
        for i, (rate, n) in enumerate([(result['control_rate'], 1000), (result['treatment_rate'], 1000)]):
            se = np.sqrt(rate * (1 - rate) / n)
            ci = 1.96 * se
            ax.errorbar(i, rate, yerr=ci, fmt='none', color='black', capsize=5)
        
        # Add value labels
        for bar, rate in zip(bars, rates):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height + 0.002,
                   f'{rate:.3f}', ha='center', va='bottom', fontweight='bold')
        
        ax.set_ylabel('Conversion Rate')
        ax.set_title(f'{description}\np-value: {result["p_value"]:.4f}, '
                    f'Difference: {result["difference"]:.3f}', fontweight='bold')
        ax.grid(True, alpha=0.3)
        
        print(f"{description}:")
        print(f"  Control rate: {result['control_rate']:.4f}")
        print(f"  Treatment rate: {result['treatment_rate']:.4f}")
        print(f"  Difference: {result['difference']:.4f} [{result['ci_lower']:.4f}, {result['ci_upper']:.4f}]")
        print(f"  Z-statistic: {result['z_stat']:.3f}")
        print(f"  P-value: {result['p_value']:.4f}")
        print(f"  Significant: {result['significant']}")
        print()

# Power analysis plot
ax = axes[1, 1]
effect_sizes, sample_sizes = ab_test_power_analysis()

ax.plot(effect_sizes * 100, sample_sizes, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Effect Size (percentage points)')
ax.set_ylabel('Required Sample Size per Group')
ax.set_title('Sample Size Requirements for A/B Tests\n(80% Power, 5% Significance)', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Add annotations
for effect, n in zip(effect_sizes, sample_sizes):
    ax.annotate(f'{n:,}', (effect*100, n), textcoords="offset points", 
               xytext=(0,10), ha='center', fontsize=10)

plt.tight_layout()
plt.show()

# Sequential testing simulation
def sequential_ab_test(control_rate=0.10, treatment_rate=0.12, max_n=2000, alpha=0.05):
    """Simulate sequential A/B testing"""
    n_points = []
    p_values = []
    
    control_successes = 0
    treatment_successes = 0
    
    for n in range(50, max_n + 1, 50):  # Check every 50 samples, up to and including max_n
        # Add new samples
        new_control = np.random.binomial(50, control_rate)
        new_treatment = np.random.binomial(50, treatment_rate)
        
        control_successes += new_control
        treatment_successes += new_treatment
        
        # Calculate current rates and test
        control_conv_rate = control_successes / n
        treatment_conv_rate = treatment_successes / n
        
        if n > 100:  # Minimum sample size
            pooled_rate = (control_successes + treatment_successes) / (2 * n)
            pooled_se = np.sqrt(pooled_rate * (1 - pooled_rate) * (2/n))
            
            if pooled_se > 0:
                z_stat = (treatment_conv_rate - control_conv_rate) / pooled_se
                p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
            else:
                p_value = 1.0
            
            n_points.append(n)
            p_values.append(p_value)
    
    return n_points, p_values

# Run sequential test
n_seq, p_seq = sequential_ab_test()

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(n_seq, p_seq, 'b-', linewidth=2, label='P-value over time')
ax.axhline(y=0.05, color='red', linestyle='--', linewidth=2, label='Significance threshold (α=0.05)')
ax.fill_between(n_seq, 0, 0.05, alpha=0.2, color='red', label='Rejection region')

ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('P-value')
ax.set_title('Sequential A/B Testing: P-value Evolution', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

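# Find the first look at which the p-value drops below 0.05.
# Caveat: stopping at the first significant look ("peeking") inflates the
# Type I error rate well above the nominal 5%. A real sequential design needs
# adjusted thresholds (e.g. alpha spending or group-sequential boundaries).
significant_at = None
for n, p in zip(n_seq, p_seq):
    if p < 0.05:
        significant_at = n
        break

if significant_at:
    print(f"P-value first dropped below 0.05 at n = {significant_at} per group (uncorrected for repeated looks)")
else:
    print("P-value never dropped below 0.05 within the sample size limit")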

print(f"Final p-value: {p_seq[-1]:.4f}")

## Key Takeaways

1. **Descriptive Statistics**: Foundation for understanding data distributions
2. **Hypothesis Testing**: Statistical framework for making decisions under uncertainty
3. **Confidence Intervals**: Quantify uncertainty in parameter estimates
4. **Cross-Validation**: Essential for honest model evaluation
5. **Bias-Variance Tradeoff**: Fundamental principle governing model complexity
6. **A/B Testing**: Rigorous experimental design for comparing treatments
7. **Statistical Power**: Important for planning experiments and interpreting results
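
One caveat for takeaway 6: checking an uncorrected p-value repeatedly, as in the sequential A/B plot, raises the chance of a false positive. The sketch below simulates A/A tests (no true difference) and compares "stop at the first p < 0.05" with a single test at the end. `peeking_false_positive_rate` is a hypothetical helper, and the batch size, number of looks, and seed are assumptions of the sketch rather than the notebook's exact settings.

```python
import numpy as np
from scipy import stats

def peeking_false_positive_rate(rate=0.10, batch=50, n_looks=39,
                                alpha=0.05, n_sims=2000, seed=0):
    """Estimate false positive rates under H0 with and without peeking."""
    rng = np.random.default_rng(seed)
    # Cumulative conversions per arm; both arms share the same true rate
    c = rng.binomial(batch, rate, size=(n_sims, n_looks)).cumsum(axis=1)
    t = rng.binomial(batch, rate, size=(n_sims, n_looks)).cumsum(axis=1)
    n = batch * np.arange(1, n_looks + 1)        # per-group n at each look
    pooled = (c + t) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(se > 0, (t - c) / n / se, 0.0)
    p = 2 * stats.norm.sf(np.abs(z))
    any_look = (p < alpha).any(axis=1).mean()    # reject at any interim look
    final_look = (p[:, -1] < alpha).mean()       # reject only at the last look
    return any_look, final_look

any_look, final_look = peeking_false_positive_rate()
print(f"False positive rate with peeking:   {any_look:.1%}")
print(f"False positive rate, final look only: {final_look:.1%}")
```

The final-look rate should sit near the nominal 5%, while the peeking rate is typically several times higher. This is why sequential designs use adjusted thresholds.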

## Applications in Machine Learning

- **Model Selection**: Use statistical tests to compare model performance
- **Feature Selection**: Statistical significance of features
- **Hyperparameter Tuning**: Cross-validation for parameter optimization
- **Confidence Estimation**: Bootstrap methods for uncertainty quantification
- **Experimental Design**: A/B testing for ML system improvements
- **Outlier Detection**: Statistical methods for identifying anomalies
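
As one simple instance of statistical outlier detection, the sketch below applies the 1.5 × IQR rule (Tukey's fences), reusing the quartiles from the descriptive statistics section. `iqr_outliers` is a hypothetical helper and the data are made up for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    lower, upper = q25 - k * iqr, q75 + k * iqr
    return (values < lower) | (values > upper)

data = np.array([10, 12, 11, 13, 12, 11, 10, 95])  # made-up data with one extreme value
mask = iqr_outliers(data)
print("Outliers:", data[mask])
```

The rule is distribution-free but crude: on heavily skewed data, such as the exponential sample earlier, it flags many legitimate points in the long tail.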

## Common Pitfalls

- **Multiple Comparisons**: Adjust p-values when testing many hypotheses
- **Data Snooping**: Avoid overfitting to validation data
- **Cherry Picking**: Report all analyses, not just significant results
- **Correlation vs Causation**: Statistical association ≠ causal relationship
- **Sample Size**: Ensure adequate power for meaningful conclusions
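
For the multiple-comparisons pitfall, here is a minimal sketch of two common corrections that control the family-wise error rate: Bonferroni, and the uniformly more powerful Holm step-down procedure. The helper names and the p-values are illustrative assumptions.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(np.argsort(p)):
        if p[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

p_vals = [0.001, 0.012, 0.020, 0.040, 0.300]   # made-up p-values
print("Uncorrected:", [p < 0.05 for p in p_vals])
print("Bonferroni: ", bonferroni(p_vals).tolist())
print("Holm:       ", holm(p_vals).tolist())
```

On these values, four tests pass the uncorrected 0.05 threshold, Bonferroni keeps one, and Holm keeps two.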

## Next Steps

- Study non-parametric statistical methods
- Learn about multiple testing correction
- Explore Bayesian statistics and MCMC
- Practice with real-world datasets
- Understand causal inference methods

---

**Statistical Concepts Covered:**
- Descriptive statistics and data exploration
- Hypothesis testing and p-values
- Confidence intervals and bootstrap methods
- Cross-validation and model evaluation
- Bias-variance decomposition
- Experimental design and A/B testing
- Statistical power and sample size calculation