# 📊 Statistics for Machine Learning

Welcome to the statistical foundations of ML! Statistics helps us understand data, draw inferences, and validate models.

**Learning Goals:**
- Master descriptive statistics
- Understand hypothesis testing
- Learn statistical inference
- Apply confidence intervals
- Validate ML models statistically

**Sources:**
- "An Introduction to Statistical Learning" - James et al. (2021)
- "Statistics for Machine Learning" - Lantz (2019)
- "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# load_boston was removed in scikit-learn 1.2 and is not used in this notebook
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette('husl')
np.random.seed(42)

print("✅ Libraries loaded successfully!")
print(f"NumPy: {np.__version__}, Pandas: {pd.__version__}")

## 📈 Part 1: Descriptive Statistics - Understanding Your Data

Before building models, you MUST understand your data!

**Key Concepts:**
- Central tendency (mean, median, mode)
- Spread (variance, standard deviation, IQR)
- Distribution shape (skewness, kurtosis)
- Relationships (correlation, covariance)

**Source:** "An Introduction to Statistical Learning" Chapter 2
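
Two details behind the concepts above are easy to miss: NumPy's `np.var` and `np.std` divide by `n` by default (`ddof=0`), and the mean reacts to extreme values far more than the median does. The following is a minimal sketch on a small made-up sample (the values are hypothetical and chosen only for illustration):

```python
import numpy as np

# Hypothetical small sample, chosen only for illustration
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(sample)             # ddof=0: divides by n
sample_var = np.var(sample, ddof=1)  # ddof=1: divides by n - 1

print(f"Population variance (ddof=0): {pop_var:.3f}")
print(f"Sample variance (ddof=1):     {sample_var:.3f}")

# Add one extreme value: the mean shifts a lot, the median barely moves
with_outlier = np.append(sample, 100.0)
print(f"Mean:   {sample.mean():.2f} -> {with_outlier.mean():.2f}")
print(f"Median: {np.median(sample):.2f} -> {np.median(with_outlier):.2f}")
```

When the data is a sample from a larger population, `ddof=1` is usually the appropriate choice; the gap between the two shrinks as `n` grows.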

In [None]:
# Generate sample data: Student exam scores
np.random.seed(42)
scores = np.concatenate([
    np.random.normal(75, 10, 50),   # Class A: mean=75, std=10
    np.random.normal(82, 8, 50),    # Class B: mean=82, std=8
    np.random.exponential(15, 20) + 60  # Some outliers
])
scores = np.clip(scores, 0, 100)  # Ensure scores are in [0, 100]

print("📊 Descriptive Statistics: Exam Scores")
print("="*60)

# Central Tendency
mean_score = np.mean(scores)
median_score = np.median(scores)
mode_result = stats.mode(scores.round(), keepdims=True)
mode_score = mode_result.mode[0]

print("\n📍 Central Tendency (typical values):")
print(f"  Mean (average): {mean_score:.2f}")
print(f"  Median (middle value): {median_score:.2f}")
print(f"  Mode (most frequent): {mode_score:.2f}")

# Spread/Dispersion (ddof=1 gives the sample variance and standard deviation)
variance = np.var(scores, ddof=1)
std_dev = np.std(scores, ddof=1)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

print("\n📏 Spread (variability):")
print(f"  Variance: {variance:.2f}")
print(f"  Standard Deviation: {std_dev:.2f}")
print(f"  Range: [{scores.min():.2f}, {scores.max():.2f}]")
print(f"  IQR (Q3 - Q1): {iqr:.2f}")

# Shape (stats.kurtosis returns excess kurtosis, which is 0 for a normal distribution)
skewness = stats.skew(scores)
kurtosis = stats.kurtosis(scores)

print("\n📐 Shape:")
print(f"  Skewness: {skewness:.2f}", "(negative = left skew, positive = right skew)")
print(f"  Kurtosis: {kurtosis:.2f}", "(excess kurtosis: > 0 = heavier tails than normal, < 0 = lighter tails)")

In [None]:
# Visualize descriptive statistics
fig = plt.figure(figsize=(18, 10))

# 1. Histogram with statistics
ax1 = plt.subplot(2, 3, 1)
ax1.hist(scores, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
ax1.axvline(mean_score, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_score:.1f}')
ax1.axvline(median_score, color='green', linestyle='--', linewidth=2, label=f'Median: {median_score:.1f}')
ax1.set_xlabel('Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution with Central Tendency')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Box plot
ax2 = plt.subplot(2, 3, 2)
box_parts = ax2.boxplot(scores, vert=True, patch_artist=True)
box_parts['boxes'][0].set_facecolor('lightblue')
ax2.set_ylabel('Score')
ax2.set_title('Box Plot\n(Shows median, quartiles, outliers)')
ax2.grid(True, alpha=0.3, axis='y')

# Add annotations
ax2.text(1.2, median_score, f'Median: {median_score:.1f}', fontsize=10)
ax2.text(1.2, q1, f'Q1: {q1:.1f}', fontsize=10)
ax2.text(1.2, q3, f'Q3: {q3:.1f}', fontsize=10)

# 3. Violin plot
ax3 = plt.subplot(2, 3, 3)
parts = ax3.violinplot([scores], positions=[1], showmeans=True, showmedians=True)
ax3.set_ylabel('Score')
ax3.set_title('Violin Plot\n(Combines box plot + density)')
ax3.set_xticks([1])
ax3.set_xticklabels(['Scores'])
ax3.grid(True, alpha=0.3, axis='y')

# 4. Cumulative distribution
ax4 = plt.subplot(2, 3, 4)
sorted_scores = np.sort(scores)
cumulative = np.arange(1, len(sorted_scores) + 1) / len(sorted_scores)
ax4.plot(sorted_scores, cumulative, linewidth=2, color='purple')
ax4.axhline(0.5, color='red', linestyle='--', alpha=0.5, label='Median')
ax4.axvline(median_score, color='red', linestyle='--', alpha=0.5)
ax4.set_xlabel('Score')
ax4.set_ylabel('Cumulative Probability')
ax4.set_title('Cumulative Distribution Function (CDF)')
ax4.legend()
ax4.grid(True, alpha=0.3)

# 5. Q-Q plot (check normality)
ax5 = plt.subplot(2, 3, 5)
stats.probplot(scores, dist="norm", plot=ax5)
ax5.set_title('Q-Q Plot\n(Check if data is normally distributed)')
ax5.grid(True, alpha=0.3)

# 6. Summary statistics table
ax6 = plt.subplot(2, 3, 6)
ax6.axis('off')

summary_data = [
    ['Statistic', 'Value'],
    ['Count', f'{len(scores):.0f}'],
    ['Mean', f'{mean_score:.2f}'],
    ['Median', f'{median_score:.2f}'],
    ['Std Dev', f'{std_dev:.2f}'],
    ['Min', f'{scores.min():.2f}'],
    ['25%', f'{q1:.2f}'],
    ['75%', f'{q3:.2f}'],
    ['Max', f'{scores.max():.2f}'],
    ['Skewness', f'{skewness:.2f}'],
    ['Kurtosis', f'{kurtosis:.2f}']
]

table = ax6.table(cellText=summary_data, cellLoc='left', loc='center',
                  colWidths=[0.5, 0.3])
table.auto_set_font_size(False)
table.set_fontsize(11)
table.scale(1, 2)

# Style header row
for i in range(2):
    table[(0, i)].set_facecolor('lightblue')
    table[(0, i)].set_text_props(weight='bold')

ax6.set_title('Summary Statistics', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Box plots show outliers clearly")
print("  • Q-Q plot: points close to the line = approximately normal")
print("  • CDF shows cumulative probabilities")
print("  • Violin plot combines density + quartiles")

### 1.1 Correlation & Covariance: Relationships Between Variables
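
Covariance and correlation are closely related: Pearson's r is the covariance divided by the product of the two standard deviations, which makes it unit-free. A minimal sketch on synthetic data (the seed and the coefficients are arbitrary assumptions, not values used elsewhere in this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Sample covariance (np.cov uses ddof=1 by default)
cov_xy = np.cov(x, y)[0, 1]

# Pearson r = cov(x, y) / (std(x) * std(y)), with matching ddof
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"cov = {cov_xy:.3f}, r (manual) = {r_manual:.3f}, r (numpy) = {r_numpy:.3f}")

# Rescaling x (for example, changing units) changes the covariance but not r
cov_scaled = np.cov(100 * x, y)[0, 1]
r_scaled = np.corrcoef(100 * x, y)[0, 1]
print(f"after scaling x by 100: cov = {cov_scaled:.3f}, r = {r_scaled:.3f}")
```

This is why correlation, rather than raw covariance, is the usual choice when comparing relationships across features measured in different units.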

In [None]:
# Generate correlated data
np.random.seed(42)
n_samples = 100

# Create different types of relationships
x = np.random.randn(n_samples)
y_strong_pos = 2*x + np.random.randn(n_samples)*0.5  # Strong positive
y_weak_pos = 0.5*x + np.random.randn(n_samples)*2    # Weak positive
y_strong_neg = -2*x + np.random.randn(n_samples)*0.5 # Strong negative
y_no_corr = np.random.randn(n_samples)                # No correlation
y_nonlinear = x**2 + np.random.randn(n_samples)*0.5  # Nonlinear

# Calculate correlations
corr_strong_pos = np.corrcoef(x, y_strong_pos)[0, 1]
corr_weak_pos = np.corrcoef(x, y_weak_pos)[0, 1]
corr_strong_neg = np.corrcoef(x, y_strong_neg)[0, 1]
corr_no = np.corrcoef(x, y_no_corr)[0, 1]
corr_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]

print("🔗 Correlation Analysis")
print("="*60)
print(f"\nStrong Positive: r = {corr_strong_pos:.3f}")
print(f"Weak Positive: r = {corr_weak_pos:.3f}")
print(f"Strong Negative: r = {corr_strong_neg:.3f}")
print(f"No Correlation: r = {corr_no:.3f}")
print(f"Nonlinear: r = {corr_nonlinear:.3f}")
print("\n💡 Note: Correlation only measures LINEAR relationships!")

# Visualize different correlations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

datasets = [
    (x, y_strong_pos, f'Strong Positive\nr = {corr_strong_pos:.3f}'),
    (x, y_weak_pos, f'Weak Positive\nr = {corr_weak_pos:.3f}'),
    (x, y_strong_neg, f'Strong Negative\nr = {corr_strong_neg:.3f}'),
    (x, y_no_corr, f'No Correlation\nr = {corr_no:.3f}'),
    (x, y_nonlinear, f'Nonlinear\nr = {corr_nonlinear:.3f}\n(Correlation misleading!)'),
]

for idx, (x_data, y_data, title) in enumerate(datasets):
    ax = axes[idx // 3, idx % 3]
    ax.scatter(x_data, y_data, alpha=0.6, s=30)
    
    # Fit line
    z = np.polyfit(x_data, y_data, 1)
    p = np.poly1d(z)
    ax.plot(x_data, p(x_data), "r--", linewidth=2, alpha=0.8)
    
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

# Last subplot: correlation interpretation guide
ax = axes[1, 2]
ax.axis('off')

guide_text = """
📊 Correlation Coefficient (r)

Range: -1 to +1

+1.0 = Perfect positive
+0.7 to +1.0 = Strong positive
+0.3 to +0.7 = Moderate positive
-0.3 to +0.3 = Weak/None
-0.7 to -0.3 = Moderate negative
-1.0 to -0.7 = Strong negative
-1.0 = Perfect negative

⚠️ Warnings:
• Correlation ≠ Causation!
• Only measures linear relationships
• Sensitive to outliers
"""

ax.text(0.1, 0.5, guide_text, fontsize=11, family='monospace',
        verticalalignment='center')

plt.tight_layout()
plt.show()

In [None]:
# Real-world example: Correlation matrix with heatmap
# Create synthetic housing data
np.random.seed(42)
n = 200

# Generate correlated features
size = np.random.uniform(1000, 3000, n)
bedrooms = np.round(size/500 + np.random.randn(n)*0.5).astype(int)
age = np.random.uniform(0, 50, n)
distance_to_city = np.random.uniform(1, 30, n)
price = (size * 200 + bedrooms * 20000 - age * 1000 - 
         distance_to_city * 3000 + np.random.randn(n) * 50000)

# Create DataFrame
housing_data = pd.DataFrame({
    'Price ($1000s)': price / 1000,
    'Size (sqft)': size,
    'Bedrooms': bedrooms,
    'Age (years)': age,
    'Distance to City (miles)': distance_to_city
})

# Compute correlation matrix
corr_matrix = housing_data.corr()

print("🏠 Housing Data Correlation Analysis")
print("="*60)
print("\nCorrelation Matrix:")
print(corr_matrix.round(3))

# Visualize correlation matrix (single panel; the scatter plots get their own figure below)
fig, ax = plt.subplots(figsize=(9, 7))

# Heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            fmt='.2f', ax=ax)
ax.set_title('Correlation Heatmap\n(Red = positive, Blue = negative)',
             fontsize=14, fontweight='bold')
fig.tight_layout()

# Create mini scatter plots
fig2, mini_axes = plt.subplots(2, 2, figsize=(12, 10))

# Price vs Size
mini_axes[0, 0].scatter(housing_data['Size (sqft)'], housing_data['Price ($1000s)'], alpha=0.5)
mini_axes[0, 0].set_xlabel('Size (sqft)')
mini_axes[0, 0].set_ylabel('Price ($1000s)')
mini_axes[0, 0].set_title(f'Price vs Size (r={corr_matrix.loc["Price ($1000s)", "Size (sqft)"]:.3f})')
mini_axes[0, 0].grid(True, alpha=0.3)

# Price vs Age
mini_axes[0, 1].scatter(housing_data['Age (years)'], housing_data['Price ($1000s)'], alpha=0.5, color='orange')
mini_axes[0, 1].set_xlabel('Age (years)')
mini_axes[0, 1].set_ylabel('Price ($1000s)')
mini_axes[0, 1].set_title(f'Price vs Age (r={corr_matrix.loc["Price ($1000s)", "Age (years)"]:.3f})')
mini_axes[0, 1].grid(True, alpha=0.3)

# Price vs Distance
mini_axes[1, 0].scatter(housing_data['Distance to City (miles)'], housing_data['Price ($1000s)'], alpha=0.5, color='green')
mini_axes[1, 0].set_xlabel('Distance to City (miles)')
mini_axes[1, 0].set_ylabel('Price ($1000s)')
mini_axes[1, 0].set_title(f'Price vs Distance (r={corr_matrix.loc["Price ($1000s)", "Distance to City (miles)"]:.3f})')
mini_axes[1, 0].grid(True, alpha=0.3)

# Size vs Bedrooms
mini_axes[1, 1].scatter(housing_data['Size (sqft)'], housing_data['Bedrooms'], alpha=0.5, color='red')
mini_axes[1, 1].set_xlabel('Size (sqft)')
mini_axes[1, 1].set_ylabel('Bedrooms')
mini_axes[1, 1].set_title(f'Size vs Bedrooms (r={corr_matrix.loc["Size (sqft)", "Bedrooms"]:.3f})')
mini_axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 ML Insights:")
print("  • Strong correlations suggest predictive features")
print("  • Multicollinearity: highly correlated features can cause issues")
print("  • Use correlation for feature selection")

## 🧪 Part 2: Hypothesis Testing - Making Decisions from Data

**The Scientific Method for ML:**
1. State a hypothesis (claim about data)
2. Collect data and calculate test statistic
3. Compute p-value
4. Make decision (reject or fail to reject hypothesis)

**Common Tests:**
- t-test (compare means)
- Chi-square (categorical data)
- ANOVA (compare multiple groups)
- A/B testing

**Source:** "Statistics for Machine Learning" - Lantz, Chapter 3
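
The cells that follow demonstrate the t-test. For the other tests in the list, here is a minimal sketch of a chi-square test of independence and a one-way ANOVA with SciPy. The contingency table and the three groups are made-up illustration data, not results from any real model:

```python
import numpy as np
from scipy import stats

# Chi-square test of independence on a hypothetical 2x2 table:
# rows = model version, columns = (correct, incorrect) prediction counts
table = np.array([[420, 80],
                  [450, 50]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square: statistic = {chi2:.3f}, dof = {dof}, p = {p_chi2:.4f}")

# One-way ANOVA on three hypothetical groups of accuracy scores
rng = np.random.default_rng(0)  # arbitrary seed
group_a = rng.normal(0.85, 0.03, 30)
group_b = rng.normal(0.87, 0.03, 30)
group_c = rng.normal(0.86, 0.03, 30)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")
```

Both follow the same four steps: state the null hypothesis (independence, or equal group means), compute a test statistic, obtain a p-value, and compare it with the chosen significance level.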

In [None]:
# Example: A/B Testing for ML Models
print("🔬 Hypothesis Testing: A/B Test for Model Performance")
print("="*60)

# Scenario: Two ML models, which is better?
np.random.seed(42)

# Model A accuracies over 30 runs
model_a_scores = np.random.normal(0.85, 0.03, 30)

# Model B accuracies over 30 runs (slightly better)
model_b_scores = np.random.normal(0.87, 0.03, 30)

print("\n📊 Data:")
print(f"  Model A: mean = {model_a_scores.mean():.4f}, std = {model_a_scores.std():.4f}")
print(f"  Model B: mean = {model_b_scores.mean():.4f}, std = {model_b_scores.std():.4f}")
print(f"  Difference: {model_b_scores.mean() - model_a_scores.mean():.4f}")

# Hypothesis Testing
print("\n🧪 Hypothesis Test:")
print("  H₀ (Null): Model A and B have same performance")
print("  H₁ (Alternative): Model B is better than Model A")
print("  Significance level (α): 0.05")

# Perform a one-sided independent t-test to match H1: B > A (requires SciPy >= 1.6)
t_statistic, p_value = stats.ttest_ind(model_b_scores, model_a_scores,
                                       alternative='greater')

print(f"\n📈 Test Results:")
print(f"  t-statistic: {t_statistic:.4f}")
print(f"  p-value: {p_value:.4f}")

if p_value < 0.05:
    print(f"\n✅ REJECT null hypothesis (p < 0.05)")
    print(f"  → Model B is SIGNIFICANTLY better than Model A")
else:
    print(f"\n❌ FAIL TO REJECT null hypothesis (p ≥ 0.05)")
    print(f"  → No significant difference between models")

# Effect size (Cohen's d), using sample standard deviations (ddof=1)
pooled_std = np.sqrt((model_a_scores.std(ddof=1)**2 + model_b_scores.std(ddof=1)**2) / 2)
cohens_d = (model_b_scores.mean() - model_a_scores.mean()) / pooled_std

print(f"\n📏 Effect Size (Cohen's d): {cohens_d:.4f}")
# Conventional thresholds (Cohen): 0.2 = small, 0.5 = medium, 0.8 = large
if abs(cohens_d) < 0.2:
    effect = "negligible"
elif abs(cohens_d) < 0.5:
    effect = "small"
elif abs(cohens_d) < 0.8:
    effect = "medium"
else:
    effect = "large"
print(f"  → {effect.capitalize()} practical difference")

In [None]:
# Visualize hypothesis test
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Distributions
axes[0].hist(model_a_scores, bins=15, alpha=0.7, label='Model A', color='blue', edgecolor='black')
axes[0].hist(model_b_scores, bins=15, alpha=0.7, label='Model B', color='red', edgecolor='black')
axes[0].axvline(model_a_scores.mean(), color='blue', linestyle='--', linewidth=2, label=f'Mean A: {model_a_scores.mean():.3f}')
axes[0].axvline(model_b_scores.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean B: {model_b_scores.mean():.3f}')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Model Performance Distributions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Box plots for comparison
axes[1].boxplot([model_a_scores, model_b_scores], labels=['Model A', 'Model B'],
                patch_artist=True,
                boxprops=dict(facecolor='lightblue'),
                medianprops=dict(color='red', linewidth=2))
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Box Plot Comparison')
axes[1].grid(True, alpha=0.3, axis='y')

# Plot 3: t-distribution with test statistic
df = len(model_a_scores) + len(model_b_scores) - 2  # degrees of freedom
x = np.linspace(-4, 4, 100)
y = stats.t.pdf(x, df)

axes[2].plot(x, y, 'b-', linewidth=2, label='t-distribution')
axes[2].fill_between(x[x >= stats.t.ppf(0.95, df)], 0, 
                      stats.t.pdf(x[x >= stats.t.ppf(0.95, df)], df),
                      alpha=0.3, color='red', label='Rejection region (α=0.05)')
axes[2].axvline(t_statistic, color='green', linestyle='--', linewidth=2, 
                label=f't-stat: {t_statistic:.2f}')
axes[2].set_xlabel('t-value')
axes[2].set_ylabel('Probability Density')
axes[2].set_title(f't-Test Visualization\np-value = {p_value:.4f}')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Concepts:")
print("  • p-value: Probability of a difference at least this extreme if the null hypothesis is true")
print("  • α = 0.05: We accept a 5% chance of a false positive (Type I error)")
print("  • Statistical significance ≠ practical significance")
print("  • Always report effect size along with p-value!")

## 📊 Part 3: Confidence Intervals - Quantifying Uncertainty

**Intuition:** A confidence interval gives a range of plausible values for a parameter.

**95% CI Interpretation:**
If we repeated this experiment 100 times, about 95 of the intervals would contain the true parameter.

**ML Applications:**
- Model performance estimates
- Prediction intervals
- Feature importance ranges

**Source:** "An Introduction to Statistical Learning" Chapter 3
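
The "repeat the experiment" interpretation can be checked by simulation. The sketch below assumes a known "true" mean and standard deviation (hypothetical values picked for illustration), draws many samples, builds a 95% t-interval from each, and counts how often the interval contains the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # arbitrary seed
true_mean, true_std = 0.88, 0.04   # hypothetical "true" accuracy distribution
n, n_experiments = 50, 2000

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, n - 1)

# Each row is one simulated experiment of n trials
samples = rng.normal(true_mean, true_std, size=(n_experiments, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# An interval "covers" when the true mean lies between its bounds
covered = (means - t_crit * ses <= true_mean) & (true_mean <= means + t_crit * ses)
coverage = covered.mean()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land close to 0.95. Note that the 95% refers to the long-run behaviour of the procedure, not to the probability that any single computed interval contains the parameter.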

In [None]:
# Example: Confidence interval for model accuracy
print("📊 Confidence Intervals for ML Model Performance")
print("="*60)

# Simulate model evaluation
np.random.seed(42)
n_trials = 50
model_accuracies = np.random.normal(0.88, 0.04, n_trials)

# Calculate statistics
mean_acc = model_accuracies.mean()
std_acc = model_accuracies.std(ddof=1)  # sample std
se_acc = std_acc / np.sqrt(n_trials)     # standard error

# 95% confidence interval (t-distribution)
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha/2, n_trials - 1)

margin_of_error = t_critical * se_acc
ci_lower = mean_acc - margin_of_error
ci_upper = mean_acc + margin_of_error

print(f"\n📈 Model Performance Statistics:")
print(f"  Number of trials: {n_trials}")
print(f"  Mean accuracy: {mean_acc:.4f}")
print(f"  Standard deviation: {std_acc:.4f}")
print(f"  Standard error: {se_acc:.4f}")

print(f"\n🎯 95% Confidence Interval:")
print(f"  [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"  Margin of error: ±{margin_of_error:.4f}")

print(f"\n💡 Interpretation:")
print(f"  We are 95% confident that the true mean accuracy")
print(f"  is between {ci_lower:.2%} and {ci_upper:.2%}")

# Calculate different confidence levels
confidence_levels = [0.90, 0.95, 0.99]
intervals = []

print(f"\n📊 Different Confidence Levels:")
for conf in confidence_levels:
    t_crit = stats.t.ppf(1 - (1-conf)/2, n_trials - 1)
    margin = t_crit * se_acc
    lower = mean_acc - margin
    upper = mean_acc + margin
    intervals.append((lower, upper, margin))
    print(f"  {conf*100:.0f}% CI: [{lower:.4f}, {upper:.4f}] (±{margin:.4f})")

In [None]:
# Visualize confidence intervals
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Distribution with CI
axes[0].hist(model_accuracies, bins=20, alpha=0.7, color='skyblue', 
             edgecolor='black', density=True)

# Overlay normal distribution
x = np.linspace(model_accuracies.min(), model_accuracies.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mean_acc, std_acc), 'r-', 
             linewidth=2, label='Normal fit')

# Mark CI
axes[0].axvline(mean_acc, color='green', linestyle='-', linewidth=2, 
                label=f'Mean: {mean_acc:.3f}')
axes[0].axvline(ci_lower, color='orange', linestyle='--', linewidth=2, 
                label=f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
axes[0].axvline(ci_upper, color='orange', linestyle='--', linewidth=2)
axes[0].fill_betweenx([0, axes[0].get_ylim()[1]], ci_lower, ci_upper, 
                       alpha=0.2, color='orange')

axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Density')
axes[0].set_title('Model Accuracy Distribution with 95% CI')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Multiple confidence levels
y_pos = np.arange(len(confidence_levels))
colors = ['lightgreen', 'lightblue', 'lightcoral']

for i, (conf, (lower, upper, margin)) in enumerate(zip(confidence_levels, intervals)):
    axes[1].barh(i, upper - lower, left=lower, height=0.5, 
                 color=colors[i], alpha=0.7, edgecolor='black',
                 label=f'{conf*100:.0f}% CI')
    axes[1].plot(mean_acc, i, 'ko', markersize=10)

axes[1].axvline(mean_acc, color='red', linestyle='--', linewidth=2, 
                alpha=0.7, label='Mean')
axes[1].set_yticks(y_pos)
axes[1].set_yticklabels([f'{int(c*100)}%' for c in confidence_levels])
axes[1].set_xlabel('Accuracy')
axes[1].set_ylabel('Confidence Level')
axes[1].set_title('Confidence Intervals at Different Levels\n(Higher confidence = wider interval)')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Wider CI = more uncertainty")
print("  • Higher confidence level = wider interval")
print("  • Larger sample size = narrower interval")
print("  • CI gives range, not just point estimate")

## 🎯 Part 4: Statistical Validation of ML Models

**Why Statistics Matters in ML:**
- Evaluate if improvements are real or due to chance
- Compare multiple models rigorously
- Report results with uncertainty
- Avoid overfitting

**Key Techniques:**
- Cross-validation
- Bootstrap
- Permutation tests
- Multiple testing correction

**Source:** "The Elements of Statistical Learning" Chapter 7
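
Of the techniques listed, permutation tests are not demonstrated in the cells below, so here is a minimal sketch. `permutation_test` is a hypothetical helper written for this illustration (it is not a library function), and the two score arrays are synthetic:

```python
import numpy as np

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so shuffle and re-split
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add 1 to numerator and denominator so the p-value is never exactly zero
    return observed, (count + 1) / (n_permutations + 1)

# Synthetic accuracy scores for two hypothetical models
rng = np.random.default_rng(1)
scores_a = rng.normal(0.85, 0.03, 30)
scores_b = rng.normal(0.90, 0.03, 30)

obs_diff, p_perm = permutation_test(scores_b, scores_a)
print(f"Observed difference: {obs_diff:.4f}, permutation p-value: {p_perm:.4f}")
```

The appeal of this approach is that it makes no normality assumption: the null distribution is built directly from the data by shuffling labels. The cost is computation, and the smallest reportable p-value is bounded by the number of permutations.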

In [None]:
# Cross-validation with statistical analysis
print("🔬 Statistical Validation of ML Models")
print("="*60)

# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                          n_redundant=5, random_state=42)

# Train multiple models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

# Perform 10-fold cross-validation
cv_results = {}
n_folds = 10

print("\n🔄 Performing 10-fold Cross-Validation...\n")

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=n_folds, scoring='accuracy')
    cv_results[name] = scores
    
    mean_score = scores.mean()
    std_score = scores.std(ddof=1)  # sample std across folds
    se_score = std_score / np.sqrt(n_folds)
    
    # Approximate 95% CI (fold scores are not fully independent, so treat it as a rough guide)
    t_crit = stats.t.ppf(0.975, n_folds - 1)
    ci_lower = mean_score - t_crit * se_score
    ci_upper = mean_score + t_crit * se_score
    
    print(f"{name}:")
    print(f"  Mean: {mean_score:.4f} ± {std_score:.4f}")
    print(f"  95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print()

# Identify the best model by mean cross-validation accuracy
best_model = max(cv_results, key=lambda k: cv_results[k].mean())
best_scores = cv_results[best_model]

print(f"\n🏆 Best Model: {best_model}")
print(f"   Mean Accuracy: {best_scores.mean():.4f}")

# Compare best model with others
print(f"\n🔬 Statistical Comparisons (Paired t-test):")
print("="*60)

for name, scores in cv_results.items():
    if name != best_model:
        t_stat, p_val = stats.ttest_rel(best_scores, scores)
        
        print(f"\n{best_model} vs {name}:")
        print(f"  Mean difference: {best_scores.mean() - scores.mean():.4f}")
        print(f"  t-statistic: {t_stat:.4f}")
        print(f"  p-value: {p_val:.4f}")
        
        if p_val < 0.05:
            print(f"  ✅ {best_model} is SIGNIFICANTLY better")
        else:
            print(f"  ❌ No significant difference")

In [None]:
# Visualize cross-validation results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Box plots
positions = np.arange(len(cv_results))
data_to_plot = [scores for scores in cv_results.values()]

bp = axes[0].boxplot(data_to_plot, labels=cv_results.keys(), 
                     patch_artist=True, showmeans=True)

# Color boxes
colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Cross-Validation Results\n(Box = IQR, Line = Median, Triangle = Mean)', 
                  fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Means with 95% CI error bars
means = [scores.mean() for scores in cv_results.values()]
stds = [scores.std(ddof=1) for scores in cv_results.values()]  # sample std across folds
ses = [std / np.sqrt(n_folds) for std in stds]
t_crit = stats.t.ppf(0.975, n_folds - 1)
ci_margins = [t_crit * se for se in ses]

x_pos = np.arange(len(cv_results))
axes[1].bar(x_pos, means, yerr=ci_margins, capsize=10, alpha=0.7,
           color=colors, edgecolor='black', linewidth=1.5,
           error_kw={'linewidth': 2, 'ecolor': 'darkred'})

axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(cv_results.keys(), rotation=45, ha='right')
axes[1].set_ylabel('Mean Accuracy', fontsize=12)
axes[1].set_title('Mean Accuracy with 95% Confidence Intervals\n(Error bars show uncertainty)', 
                  fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, (mean, margin) in enumerate(zip(means, ci_margins)):
    axes[1].text(i, mean + margin + 0.01, f'{mean:.3f}', 
                ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Best Practices:")
print("  • Prefer cross-validation to a single train/test split")
print("  • Report mean ± std or 95% CI")
print("  • Use paired t-test for cross-validation folds (same data splits)")
print("  • Consider multiple testing correction (Bonferroni)")
print("  • Statistical significance + practical significance both matter")

## 🎮 Interactive Exercises

### Exercise 1: Descriptive Statistics

In [None]:
# TODO: Complete this exercise

# Given dataset: House prices
house_prices = np.array([250000, 320000, 180000, 450000, 290000, 
                         310000, 275000, 520000, 195000, 340000,
                         1500000])  # Note: one outlier!

# Task 1: Calculate mean, median, and standard deviation
mean_price = None  # YOUR CODE HERE
median_price = None  # YOUR CODE HERE
std_price = None  # YOUR CODE HERE

# Task 2: Identify outliers using IQR method
# Rule: Outlier if value < Q1 - 1.5*IQR or value > Q3 + 1.5*IQR
q1 = None  # YOUR CODE HERE
q3 = None  # YOUR CODE HERE
iqr = None  # YOUR CODE HERE
outliers = None  # YOUR CODE HERE (return array of outlier values)

# Task 3: Calculate z-scores and identify outliers (|z| > 3)
z_scores = None  # YOUR CODE HERE
z_outliers = None  # YOUR CODE HERE

print("✅ Solutions:")
print(f"Mean: ${mean_price:,.0f}" if mean_price else "Not computed")
print(f"Median: ${median_price:,.0f}" if median_price else "Not computed")
print(f"Std Dev: ${std_price:,.0f}" if std_price else "Not computed")
print(f"IQR outliers: {outliers}")
print(f"Z-score outliers: {z_outliers}")

### Exercise 2: Hypothesis Testing

In [None]:
# TODO: Complete this exercise

# Scenario: Testing if a new feature improves model performance
baseline_scores = np.array([0.82, 0.85, 0.83, 0.86, 0.84, 0.81, 0.87, 0.83, 0.85, 0.84])
new_feature_scores = np.array([0.86, 0.88, 0.85, 0.89, 0.87, 0.86, 0.90, 0.85, 0.88, 0.87])

# Task 1: Perform paired t-test
# H0: New feature doesn't improve performance
# H1: New feature improves performance
t_statistic = None  # YOUR CODE HERE
p_value = None  # YOUR CODE HERE

# Task 2: Calculate effect size (Cohen's d for paired samples)
# d = mean(differences) / std(differences)
differences = None  # YOUR CODE HERE
cohens_d = None  # YOUR CODE HERE

# Task 3: Calculate 95% CI for the difference
mean_diff = None  # YOUR CODE HERE
se_diff = None  # YOUR CODE HERE (standard error of differences)
ci_lower = None  # YOUR CODE HERE
ci_upper = None  # YOUR CODE HERE

print("✅ Solutions:")
print(f"t-statistic: {t_statistic}" if t_statistic else "Not computed")
print(f"p-value: {p_value}" if p_value else "Not computed")
print(f"Cohen's d: {cohens_d}" if cohens_d else "Not computed")
print(f"95% CI for difference: [{ci_lower}, {ci_upper}]" if ci_lower else "Not computed")

if p_value and p_value < 0.05:
    print("✅ New feature significantly improves performance!")
else:
    print("❌ No significant improvement")

### Exercise 3: Bootstrap Confidence Interval

In [None]:
# TODO: Implement bootstrap confidence interval

def bootstrap_ci(data, statistic_func, n_bootstrap=10000, confidence=0.95):
    """
    Calculate bootstrap confidence interval
    
    Parameters:
    - data: original sample
    - statistic_func: function to compute statistic (e.g., np.mean)
    - n_bootstrap: number of bootstrap samples
    - confidence: confidence level
    
    Returns:
    - (lower_bound, upper_bound)
    """
    bootstrap_statistics = []
    
    for _ in range(n_bootstrap):
        # YOUR CODE HERE
        # 1. Resample with replacement
        # 2. Calculate statistic on resample
        # 3. Append to bootstrap_statistics
        pass
    
    # YOUR CODE HERE
    # Calculate percentiles for CI
    alpha = 1 - confidence
    lower = None  # (alpha/2) percentile
    upper = None  # (1 - alpha/2) percentile
    
    return lower, upper

# Test your implementation
sample_data = np.array([85, 90, 78, 92, 88, 76, 95, 89, 84, 91])

mean_ci = bootstrap_ci(sample_data, np.mean, n_bootstrap=10000)
median_ci = bootstrap_ci(sample_data, np.median, n_bootstrap=10000)

print("✅ Bootstrap Confidence Intervals:")
print(f"Mean: {np.mean(sample_data):.2f}, 95% CI: {mean_ci}" if mean_ci[0] else "Not computed")
print(f"Median: {np.median(sample_data):.2f}, 95% CI: {median_ci}" if median_ci[0] else "Not computed")

## 🎓 Summary & Next Steps

### ✅ What You've Learned:

**Descriptive Statistics:**
- Central tendency (mean, median, mode)
- Spread (variance, std, IQR)
- Correlation and relationships
- Visualizing distributions

**Inferential Statistics:**
- Hypothesis testing framework
- p-values and significance
- Confidence intervals
- Effect sizes

**ML Validation:**
- Cross-validation
- Model comparison
- Reporting uncertainty
- Statistical significance

### 🔑 Key Formulas:

1. **Standard Error**: $SE = \frac{\sigma}{\sqrt{n}}$

2. **Confidence Interval**: $\bar{x} \pm t_{\alpha/2} \cdot SE$

3. **t-statistic**: $t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\text{diff}}}$

4. **Cohen's d**: $d = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\text{pooled}}}$

5. **Correlation**: $r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$
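
As a sanity check, formulas 1, 2, and 5 can be evaluated by hand and compared with library results. This is a minimal sketch on a tiny made-up dataset; it uses the sample standard deviation `s` as the estimate of σ in the standard error:

```python
import numpy as np
from scipy import stats

# Tiny hypothetical dataset, chosen so the arithmetic is easy to follow
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(y)

# Formula 1: standard error of the mean (sample std estimates sigma)
se = y.std(ddof=1) / np.sqrt(n)

# Formula 2: 95% confidence interval for the mean of y
t_crit = stats.t.ppf(0.975, n - 1)
ci_manual = (y.mean() - t_crit * se, y.mean() + t_crit * se)
ci_scipy = stats.t.interval(0.95, n - 1, loc=y.mean(), scale=se)

# Formula 5: Pearson correlation from deviations about the means
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"SE = {se:.4f}")
print(f"95% CI (manual): [{ci_manual[0]:.4f}, {ci_manual[1]:.4f}]")
print(f"r (manual) = {r_manual:.4f}, r (numpy) = {r_numpy:.4f}")
```

Here the deviations give a numerator of 6 and a denominator of √(10 · 6) = √60, so r = 6/√60 ≈ 0.775, matching `np.corrcoef`.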

### 🚀 Next Steps:

1. **[Data Processing](04_data_processing.ipynb)** - Clean and prepare data
2. **[Classical ML](05_classical_ml.ipynb)** - Apply statistical ML algorithms
3. **[Deep Learning](06_deep_learning.ipynb)** - Neural networks and beyond

### 📖 Recommended Reading:

- "An Introduction to Statistical Learning" - James et al.
- "Statistics for Machine Learning" - Lantz
- "The Elements of Statistical Learning" - Hastie et al.
- "Statistical Rethinking" - McElreath

### 💪 Challenge Problems:

1. Implement an A/B testing framework for ML models
2. Create a bootstrap resampling routine that works for any statistic
3. Build a statistical power calculator
4. Implement multiple testing correction (Bonferroni, FDR)

**Remember**: Statistics is essential for rigorous ML. Never report results without uncertainty measures! 📊