# 19. Statistical Machine Learning

## Learning Objectives
- Understand the relationship between statistics and machine learning
- Apply statistical concepts in machine learning
- Use statistical validation for ML models
- Understand the bias-variance tradeoff and its implications
- Apply cross-validation and the bootstrap to model validation
- Understand overfitting and underfitting
- Apply regularization to reduce overfitting
- Understand model selection and evaluation metrics
- Apply statistical learning theory in practice
- Use confidence intervals and hypothesis testing in ML
- Understand resampling methods and their applications
- Apply ensemble methods on a statistical foundation
- Understand feature selection and dimensionality reduction
- Apply statistical validation to model comparison
- Use statistical inference in machine learning

## Topics
1. **Introduction to Statistical Machine Learning**
   - Relationship between statistics and machine learning
   - Statistical learning theory
   - Supervised vs unsupervised learning
   - Parametric vs non-parametric methods
   - Model complexity and generalization

2. **Bias-Variance Tradeoff**
   - Concepts of bias and variance
   - Bias-variance decomposition
   - Overfitting and underfitting
   - Model complexity vs performance
   - Practical implications

3. **Cross-Validation and Bootstrap**
   - K-fold cross-validation
   - Leave-one-out cross-validation
   - Bootstrap sampling
   - Validation strategies
   - Nested cross-validation

4. **Regularization (Ridge, Lasso, Elastic Net)**
   - Ridge regression (L2 regularization)
   - Lasso regression (L1 regularization)
   - Elastic Net regularization
   - Regularization parameter tuning
   - Feature selection with regularization

5. **Model Selection and Evaluation**
   - Information criteria (AIC, BIC)
   - Cross-validation metrics
   - Model comparison techniques
   - Statistical significance testing
   - Multiple comparison correction

6. **Resampling Methods**
   - Bootstrap confidence intervals
   - Permutation tests
   - Jackknife estimation
   - Monte Carlo methods
   - Statistical significance testing

7. **Ensemble Methods**
   - Bagging (Bootstrap Aggregating)
   - Random Forest
   - Boosting methods
   - Stacking
   - Voting classifiers

8. **Feature Selection and Dimensionality Reduction**
   - Filter methods
   - Wrapper methods
   - Embedded methods
   - Principal Component Analysis (PCA)
   - Feature importance

9. **Statistical Validation in ML**
   - Hypothesis testing for model comparison
   - Confidence intervals for predictions
   - Statistical significance in feature selection
   - Multiple testing correction
   - Power analysis

10. **Applications in Data Analysis**
    - Medical diagnosis
    - Financial modeling
    - Marketing analytics
    - Scientific research
    - Business intelligence


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold, LeaveOneOut, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from scipy import stats
from scipy.stats import bootstrap
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set_style("whitegrid")

print("Libraries imported successfully!")
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
print("Scikit-learn version:", __import__('sklearn').__version__)


## 1. Introduction to Statistical Machine Learning

### 1.1 Relationship Between Statistics and Machine Learning

**Statistical machine learning** combines statistical principles with machine learning techniques to build predictive models that are robust and interpretable.

#### 1.1.1 The Statistical Perspective
- **Inference**: Drawing conclusions about a population from a sample
- **Uncertainty Quantification**: Measuring the uncertainty in predictions
- **Hypothesis Testing**: Testing for statistical significance
- **Confidence Intervals**: Reporting a plausible range for an estimate
- **Model Validation**: Validating models with statistical methods

#### 1.1.2 The Machine Learning Perspective
- **Prediction**: Producing accurate predictions
- **Pattern Recognition**: Recognizing patterns in data
- **Automation**: Automating decision-making processes
- **Scalability**: Handling data at large scale
- **Feature Engineering**: Identifying and constructing relevant features

#### 1.1.3 Where the Two Fields Converge
- **Statistical Learning Theory**: Mathematical theory of learning from data
- **Empirical Risk Minimization**: Minimizing the average loss on the sample
- **Generalization**: A model's ability to perform on new data
- **Regularization**: Techniques for preventing overfitting
- **Cross-Validation**: Validating models when data is limited

### 1.2 Statistical Learning Theory

#### 1.2.1 Basic Concepts
**Statistical learning theory** provides the mathematical foundation for machine learning, focusing on:

1. **Generalization Error**: Error on data the model has never seen
2. **Sample Complexity**: How much data is needed to reach a given accuracy
3. **Learning Rate**: How fast the excess risk shrinks as the sample size grows (not the optimizer step size)
4. **Convergence**: Whether and how the learning algorithm converges

#### 1.2.2 Vapnik-Chervonenkis (VC) Theory
- **VC Dimension**: A measure of model capacity
- **VC Bound**: An upper bound on the generalization error
- **Structural Risk Minimization**: A principle for choosing among models of increasing capacity
- **Margin Theory**: Margin-based analysis for support vector machines

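#### 1.2.3 Empirical Risk Minimization (ERM)
```
R(f) = E[L(f(X), Y)]  # True risk
R_emp(f) = (1/n) Σ L(f(x_i), y_i)  # Empirical risk
f̂ = argmin_{f ∈ F} R_emp(f)  # ERM solution over hypothesis class F
```

Where:
- R(f) = true risk (expected loss over the data distribution)
- R_emp(f) = empirical risk (average loss on the n training examples)
- L = loss function
- f = model function
- F = hypothesis class searched by the learner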
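
The gap between the two risks can be seen numerically. The sketch below is illustrative only: the data-generating process `y = 2x + noise`, the fixed predictor `f`, and the helper names are assumptions made for this example, and the "true" risk is approximated by averaging the loss over a very large fresh sample.

```python
import random

def squared_loss(prediction, target):
    """L(f(x), y) for squared error."""
    return (prediction - target) ** 2

def f(x):
    """Hypothetical fixed predictor under evaluation (it is not fitted here)."""
    return 2.0 * x

def sample_pair(rng):
    """Assumed data-generating process: y = 2x + Gaussian noise with sd 0.5."""
    x = rng.uniform(-1.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 0.5)
    return x, y

def empirical_risk(predictor, data):
    """R_emp(f): average loss of the predictor over a finite sample."""
    return sum(squared_loss(predictor(x), y) for x, y in data) / len(data)

rng = random.Random(0)
train = [sample_pair(rng) for _ in range(30)]        # small training sample
big = [sample_pair(rng) for _ in range(200_000)]     # stand-in for the population

r_emp = empirical_risk(f, train)          # noisy estimate from 30 points
r_true_approx = empirical_risk(f, big)    # close to the noise variance 0.25

print(f"R_emp (n=30):      {r_emp:.4f}")
print(f"R (Monte Carlo):   {r_true_approx:.4f}")
```

Because `f` is fixed in advance, `r_emp` is an unbiased but noisy estimate of the true risk. Once a model is chosen by minimizing `R_emp` on the same data, the empirical risk tends to be optimistic, which is why held-out evaluation is needed.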

### 1.3 Supervised vs Unsupervised Learning

#### 1.3.1 Supervised Learning
**Goal**: Learn a mapping from inputs to known outputs

**Characteristics**:
- Labeled data (X, y)
- The aim is prediction or classification
- Evaluated against ground truth
- Examples: regression, classification

**Statistical Methods**:
- **Regression**: Linear regression, polynomial regression
- **Classification**: Logistic regression, Naive Bayes
- **Validation**: Cross-validation, holdout validation
- **Inference**: Confidence intervals, hypothesis testing

#### 1.3.2 Unsupervised Learning
**Goal**: Learn structure in data without labels

**Characteristics**:
- Unlabeled data (X)
- The aim is exploration and discovery
- Evaluated with internal metrics
- Examples: clustering, dimensionality reduction

**Statistical Methods**:
- **Clustering**: K-means, hierarchical clustering
- **Dimensionality Reduction**: PCA, ICA, t-SNE
- **Density Estimation**: Kernel density estimation
- **Anomaly Detection**: Statistical outlier detection

### 1.4 Parametric vs Non-Parametric Methods

#### 1.4.1 Parametric Methods
**Characteristics**:
- Fixed number of parameters, independent of the sample size
- Assume a functional form
- More efficient with small datasets
- Easy to interpret

**Examples**:
- **Linear Regression**: y = β₀ + β₁x₁ + ... + βₖxₖ + ε
- **Logistic Regression**: P(y=1|x) = 1/(1 + e^(-βᵀx))
- **Naive Bayes**: P(y|x) ∝ P(y) ∏ P(xᵢ|y)

**Advantages**:
- Clear interpretation
- Confidence intervals for parameters
- Hypothesis testing
- Efficient with small datasets

**Limitations**:
- Rigid assumptions about the functional form
- May not suit complex data
- Often sensitive to outliers (for example, least-squares fits)

#### 1.4.2 Non-Parametric Methods
**Characteristics**:
- Effective number of parameters grows with the data
- No fixed assumption about the functional form
- More flexible
- Need more data

**Examples**:
- **k-Nearest Neighbors**: Prediction based on the k closest training points
- **Decision Trees**: Hierarchical if-then rules
- **Kernel Support Vector Machines**: Maximum-margin hyperplane in a kernel-induced feature space
- **Neural Networks**: Compositions of non-linear functions (parametric for a fixed architecture, but often grouped here because of their flexibility)

**Advantages**:
- High flexibility
- Can handle complex data
- No strong distributional assumptions required
- Some methods (for example, tree-based models) are fairly robust to outliers in the inputs

**Limitations**:
- Harder to interpret
- Need a lot of data
- High risk of overfitting
- High computational cost
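
A small contrast between the two families, as a sketch under stated assumptions: the quadratic data-generating process, the helper names `fit_line` and `knn_predict`, and the choice `k=5` are all illustrative. The straight line has two parameters no matter how much data arrives, while k-NN keeps the whole training set and adapts locally.

```python
import random

def fit_line(xs, ys):
    """Parametric: closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def knn_predict(xs, ys, x0, k=5):
    """Non-parametric: average the targets of the k training points nearest x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

# Assumed toy data: a quadratic signal that a straight line cannot represent.
rng = random.Random(1)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.3) for x in xs]

intercept, slope = fit_line(xs, ys)
grid = [-2.0, 0.0, 2.0]
line_pred = [intercept + slope * x for x in grid]
knn_pred = [knn_predict(xs, ys, x) for x in grid]

for x, lp, kp in zip(grid, line_pred, knn_pred):
    print(f"x={x:+.1f}  truth={x**2:.2f}  line={lp:.2f}  kNN={kp:.2f}")
```

On this data the line underfits (high bias) while k-NN tracks the curve; with very few points or a small `k`, the k-NN predictions would instead become noisy.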

### 1.5 Model Complexity and Generalization

#### 1.5.1 Model Complexity
**Definition**: The number of parameters, or more generally the capacity of a model to fit data

**Contributing Factors**:
- **Number of Parameters**: More parameters usually means a more complex model
- **Model Flexibility**: Ability to represent non-linear relationships
- **Feature Space**: Dimension and type of the features
- **Regularization**: Techniques that control complexity

#### 1.5.2 Generalization
**Definition**: A model's ability to perform on data it has never seen

**Contributing Factors**:
- **Training Data Size**: More data generally improves generalization
- **Model Complexity**: Complexity should match the data
- **Noise Level**: Amount of noise in the data
- **Feature Quality**: Relevance and quality of the features

#### 1.5.3 Bias-Variance Tradeoff
```
Expected Error = Bias² + Variance + Noise
```

- **Bias**: Error caused by model assumptions that are too simple
- **Variance**: Error caused by the model's sensitivity to the particular training sample
- **Noise**: Irreducible error inherent in the data

#### 1.5.4 Overfitting and Underfitting

##### Overfitting
**Definition**: The model is too complex and learns the noise in the training data

**Symptoms**:
- Low training error, high validation error
- Large gap between training and validation performance
- Model is overly sensitive to the training data

**Remedies**:
- Regularization (L1, L2)
- Early stopping
- Dropout (for neural networks)
- Data augmentation
- Cross-validation (to detect overfitting and choose complexity; it does not reduce overfitting by itself)

##### Underfitting
**Definition**: The model is too simple to learn the pattern in the data

**Symptoms**:
- High training error
- Validation error is also high
- Model fails to capture relationships in the data

**Remedies**:
- Increase model complexity
- Feature engineering
- Reduce regularization
- Use a more powerful model

### 1.6 Statistical Validation in ML

#### 1.6.1 Cross-Validation
**Goal**: Estimate model performance on data it has never seen

**Methods**:
- **k-Fold CV**: The data is split into k subsets; each serves once as the validation set
- **Leave-One-Out CV**: Each observation serves once as a single-point test set
- **Stratified CV**: Preserves class proportions in every fold
- **Time Series CV**: Validation that respects temporal order
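
The splitting logic behind k-fold and leave-one-out CV can be sketched in a few lines. The generator `k_fold_indices` below is a hypothetical helper written for illustration, not a library function; in practice `sklearn.model_selection.KFold` plays this role.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) for each of k folds.

    Indices are shuffled once, then dealt round-robin into k folds so that
    fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# 5-fold CV on 10 samples: every sample is validated exactly once.
splits = list(k_fold_indices(10, 5))
for train, val in splits:
    print("train:", sorted(train), "val:", sorted(val))

# Leave-one-out CV is the special case k = n_samples.
loo_splits = list(k_fold_indices(10, 10))
```

Stratified and time-series variants change only how the folds are formed: the first balances class labels within each fold, the second forbids training on observations that come after the validation period.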

#### 1.6.2 Bootstrap
**Goal**: Estimate sampling distributions and confidence intervals

**Methods**:
- **Bootstrap Sampling**: Sampling with replacement from the observed data
- **Bootstrap Confidence Intervals**: CIs for parameters or statistics
- **Bootstrap Model Selection**: Selecting models using bootstrap estimates
- **Bagging**: Bootstrap aggregating for ensembles
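
As a minimal sketch of the percentile bootstrap, the code below resamples a toy dataset with replacement and reads the interval off the sorted resampled statistics. The helper `bootstrap_ci`, the Gaussian toy sample, and the simple index-based percentiles are assumptions for illustration; `scipy.stats.bootstrap` offers more refined interval types.

```python
import random
import statistics

def bootstrap_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic`.

    Each resample draws len(data) points with replacement; the interval
    endpoints are taken from the sorted resampled statistics.
    """
    rng = random.Random(seed)
    n = len(data)
    boot_stats = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower = boot_stats[int((alpha / 2) * n_resamples)]
    upper = boot_stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Assumed toy sample: 100 draws from a normal distribution (mean 10, sd 2).
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(100)]

ci_low, ci_high = bootstrap_ci(sample, statistics.mean)
print(f"sample mean: {statistics.mean(sample):.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

The same function works for statistics without a convenient closed-form standard error, such as the median, by passing `statistics.median` instead.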

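#### 1.6.3 Statistical Significance Testing
**Goal**: Test whether differences between models are statistically significant

**Methods**:
- **Paired t-test**: Compares two models on the same folds or test examples (fold scores share training data, so p-values are approximate)
- **ANOVA**: Compares several models (the same independence caveat applies)
- **McNemar's Test**: Compares two classifiers on the same test set using the cases where they disagree
- **Friedman Test**: Rank-based comparison of several models across datasets or folds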
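
McNemar's test is simple enough to compute by hand. The sketch below uses the exact (binomial) two-sided version; the function name `mcnemar_exact` and the correctness vectors are made up for illustration.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test for two classifiers on the same test set.

    Only discordant pairs matter:
      b = cases where A is right and B is wrong
      c = cases where A is wrong and B is right
    Under H0 (equal error rates), b follows Binomial(b + c, 0.5).
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical per-example correctness for two classifiers on 32 test cases.
correct_a = [True] * 8 + [False] * 1 + [True] * 20 + [False] * 3
correct_b = [False] * 8 + [True] * 1 + [True] * 20 + [False] * 3

b, c, p_value = mcnemar_exact(correct_a, correct_b)
print(f"b={b}, c={c}, p-value={p_value:.4f}")
```

Here the classifiers disagree on 9 cases, 8 of them in favor of A, which gives p = 20/512 ≈ 0.039. The 23 cases where both are right or both are wrong do not affect the result.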

### 1.7 Practical Applications

#### 1.7.1 Medical Diagnosis
- **Disease Prediction**: Models that support diagnosis
- **Drug Discovery**: Identifying new drug compounds
- **Medical Imaging**: Analysis of medical images
- **Clinical Trials**: Design and analysis of clinical trials

#### 1.7.2 Financial Modeling
- **Risk Assessment**: Credit risk scoring
- **Algorithmic Trading**: Automated trading
- **Fraud Detection**: Detecting fraudulent activity
- **Portfolio Optimization**: Optimizing portfolios

#### 1.7.3 Marketing Analytics
- **Customer Segmentation**: Grouping customers
- **Recommendation Systems**: Recommending items
- **Churn Prediction**: Predicting customer churn
- **A/B Testing**: Controlled experiments

#### 1.7.4 Scientific Research
- **Drug Discovery**: Finding new drugs
- **Climate Modeling**: Modeling the climate
- **Genomics**: Analysis of genomic data
- **Particle Physics**: Analysis of particle data

### 1.8 Best Practices

#### 1.8.1 Data Preparation
- **Data Quality**: Make sure the data is of high quality
- **Missing Values**: Handle missing values appropriately
- **Outliers**: Identify and handle outliers
- **Feature Engineering**: Build meaningful features

#### 1.8.2 Model Selection
- **Start Simple**: Begin with a simple model
- **Cross-Validation**: Use CV for evaluation
- **Regularization**: Use regularization to prevent overfitting
- **Ensemble Methods**: Consider ensemble methods

#### 1.8.3 Validation
- **Holdout Validation**: Set aside a test set that is never used for tuning
- **Cross-Validation**: Use CV for model selection
- **Statistical Testing**: Test for statistical significance
- **Confidence Intervals**: Report CIs for predictions

#### 1.8.4 Interpretation
- **Feature Importance**: Analyze which features matter
- **Model Diagnostics**: Check model assumptions
- **Sensitivity Analysis**: Analyze sensitivity to inputs and settings
- **Uncertainty Quantification**: Quantify uncertainty


In [None]:
# 1.9 Demonstrasi Kode: Pengantar Machine Learning Statistik

print("=== DEMONSTRASI PENGANTAR MACHINE LEARNING STATISTIK ===\n")

# 1. Membuat Data Simulasi
print("1. Membuat Data Simulasi")
print("-" * 50)

# Set random seed untuk reproducibility
np.random.seed(42)

# Membuat data dengan hubungan non-linear
n_samples = 200
X = np.random.uniform(-3, 3, n_samples).reshape(-1, 1)
y = 0.5 * X.flatten()**3 - 2 * X.flatten()**2 + X.flatten() + np.random.normal(0, 0.5, n_samples)

print(f"Jumlah sampel: {n_samples}")
print(f"Shape X: {X.shape}")
print(f"Shape y: {y.shape}")
print(f"Range X: [{X.min():.2f}, {X.max():.2f}]")
print(f"Range y: [{y.min():.2f}, {y.max():.2f}]")

# 2. Split Data
print("\n2. Split Data")
print("-" * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training ratio: {X_train.shape[0]/n_samples:.2f}")
print(f"Test ratio: {X_test.shape[0]/n_samples:.2f}")

# 3. Model Comparison: Parametric vs Non-Parametric
print("\n3. Model Comparison: Parametric vs Non-Parametric")
print("-" * 50)

# Parametric: Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)

# Non-Parametric: Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluasi model
linear_mse = mean_squared_error(y_test, y_pred_linear)
rf_mse = mean_squared_error(y_test, y_pred_rf)

linear_r2 = r2_score(y_test, y_pred_linear)
rf_r2 = r2_score(y_test, y_pred_rf)

print("Linear Regression (Parametric):")
print(f"  MSE: {linear_mse:.4f}")
print(f"  R²: {linear_r2:.4f}")

print("\nRandom Forest (Non-Parametric):")
print(f"  MSE: {rf_mse:.4f}")
print(f"  R²: {rf_r2:.4f}")

# 4. Cross-Validation
print("\n4. Cross-Validation")
print("-" * 50)

# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validation untuk Linear Regression
linear_scores = cross_val_score(linear_model, X, y, cv=kfold, scoring='neg_mean_squared_error')
linear_cv_mse = -linear_scores.mean()
linear_cv_std = linear_scores.std()

# Cross-validation untuk Random Forest
rf_scores = cross_val_score(rf_model, X, y, cv=kfold, scoring='neg_mean_squared_error')
rf_cv_mse = -rf_scores.mean()
rf_cv_std = rf_scores.std()

print("K-Fold Cross-Validation (5 folds):")
print(f"Linear Regression:")
print(f"  CV MSE: {linear_cv_mse:.4f} ± {linear_cv_std:.4f}")

print(f"\nRandom Forest:")
print(f"  CV MSE: {rf_cv_mse:.4f} ± {rf_cv_std:.4f}")

# 5. Bootstrap Confidence Intervals
print("\n5. Bootstrap Confidence Intervals")
print("-" * 50)

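# Bootstrap for the R² score
def r2_bootstrap(X_tr, y_tr, X_te, y_te, model, n_bootstrap=1000):
    """Refit `model` on bootstrap resamples of the training set and score
    each fit on the fixed test set (the spread reflects training-set
    variability only)."""
    r2_scores = []
    for _ in range(n_bootstrap):
        # Bootstrap sample: draw n training rows with replacement
        indices = np.random.choice(len(X_tr), len(X_tr), replace=True)
        X_boot = X_tr[indices]
        y_boot = y_tr[indices]

        # Fit on the resample, evaluate on the held-out test set
        model.fit(X_boot, y_boot)
        y_pred = model.predict(X_te)
        r2_scores.append(r2_score(y_te, y_pred))

    return np.array(r2_scores)

# Bootstrap for Random Forest (1000 refits of 50 trees; this can take a while)
rf_r2_bootstrap = r2_bootstrap(X_train, y_train, X_test, y_test,
                               RandomForestRegressor(n_estimators=50, random_state=42))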

# Hitung confidence intervals
rf_r2_ci = np.percentile(rf_r2_bootstrap, [2.5, 97.5])
rf_r2_mean = np.mean(rf_r2_bootstrap)

print(f"Random Forest R² Bootstrap (1000 iterations):")
print(f"  Mean R²: {rf_r2_mean:.4f}")
print(f"  95% CI: [{rf_r2_ci[0]:.4f}, {rf_r2_ci[1]:.4f}]")

# 6. Model Complexity Analysis
print("\n6. Model Complexity Analysis")
print("-" * 50)

# Analisis kompleksitas dengan polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

degrees = [1, 2, 3, 4, 5]
train_scores = []
test_scores = []

for degree in degrees:
    # Polynomial regression
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
    # Fit dan predict
    poly_model.fit(X_train, y_train)
    
    # Training score
    y_train_pred = poly_model.predict(X_train)
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_scores.append(train_mse)
    
    # Test score
    y_test_pred = poly_model.predict(X_test)
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_scores.append(test_mse)

print("Polynomial Regression Complexity Analysis:")
print("Degree | Train MSE | Test MSE  | Gap")
print("-" * 40)
for i, degree in enumerate(degrees):
    gap = test_scores[i] - train_scores[i]
    print(f"{degree:6d} | {train_scores[i]:8.4f} | {test_scores[i]:8.4f} | {gap:6.4f}")

# 7. Visualisasi
print("\n7. Visualisasi")
print("-" * 50)

# Plot 1: Data dan Model Fits
plt.figure(figsize=(15, 12))

plt.subplot(2, 3, 1)
plt.scatter(X, y, alpha=0.6, label='Data')
X_plot = np.linspace(-3, 3, 100).reshape(-1, 1)
y_plot_linear = linear_model.predict(X_plot)
y_plot_rf = rf_model.predict(X_plot)
plt.plot(X_plot, y_plot_linear, 'r-', label='Linear Regression', linewidth=2)
plt.plot(X_plot, y_plot_rf, 'g-', label='Random Forest', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Model Fits')
plt.legend()
plt.grid(True)

# Plot 2: Residuals
plt.subplot(2, 3, 2)
residuals_linear = y_test - y_pred_linear
residuals_rf = y_test - y_pred_rf
plt.scatter(y_pred_linear, residuals_linear, alpha=0.6, label='Linear Regression')
plt.scatter(y_pred_rf, residuals_rf, alpha=0.6, label='Random Forest')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.legend()
plt.grid(True)

# Plot 3: Cross-Validation Scores
plt.subplot(2, 3, 3)
# cross_val_score returns negative MSE; negate so the axis really shows MSE
plt.boxplot([-linear_scores, -rf_scores], labels=['Linear', 'Random Forest'])
plt.ylabel('CV MSE')
plt.title('Cross-Validation Scores')
plt.grid(True)

# Plot 4: Bootstrap Distribution
plt.subplot(2, 3, 4)
plt.hist(rf_r2_bootstrap, bins=30, alpha=0.7, edgecolor='black')
plt.axvline(rf_r2_mean, color='red', linestyle='--', label=f'Mean: {rf_r2_mean:.4f}')
plt.axvline(rf_r2_ci[0], color='orange', linestyle='--', label=f'95% CI: [{rf_r2_ci[0]:.4f}, {rf_r2_ci[1]:.4f}]')
plt.axvline(rf_r2_ci[1], color='orange', linestyle='--')
plt.xlabel('R² Score')
plt.ylabel('Frequency')
plt.title('Bootstrap R² Distribution')
plt.legend()
plt.grid(True)

# Plot 5: Model Complexity
plt.subplot(2, 3, 5)
plt.plot(degrees, train_scores, 'o-', label='Training MSE', linewidth=2)
plt.plot(degrees, test_scores, 's-', label='Test MSE', linewidth=2)
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.title('Model Complexity vs Performance')
plt.legend()
plt.grid(True)

# Plot 6: Feature Importance (Random Forest)
plt.subplot(2, 3, 6)
feature_importance = rf_model.feature_importances_
plt.bar(range(len(feature_importance)), feature_importance)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importance (Random Forest)')
plt.grid(True)

plt.tight_layout()
plt.show()

# 8. Statistical Tests
print("\n8. Statistical Tests")
print("-" * 50)

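# Paired t-test for model comparison.
# Caveat: the folds share training data, so the per-fold scores are not
# independent and the p-value is only approximate (5 folds also give low power).
from scipy.stats import ttest_rel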

# Perbandingan MSE
mse_linear = mean_squared_error(y_test, y_pred_linear)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Hitung MSE untuk setiap fold
linear_mse_folds = []
rf_mse_folds = []

for train_idx, val_idx in kfold.split(X):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
    y_fold_train, y_fold_val = y[train_idx], y[val_idx]
    
    # Linear model
    linear_model.fit(X_fold_train, y_fold_train)
    y_pred_linear_fold = linear_model.predict(X_fold_val)
    linear_mse_folds.append(mean_squared_error(y_fold_val, y_pred_linear_fold))
    
    # Random Forest model
    rf_model.fit(X_fold_train, y_fold_train)
    y_pred_rf_fold = rf_model.predict(X_fold_val)
    rf_mse_folds.append(mean_squared_error(y_fold_val, y_pred_rf_fold))

# Paired t-test
t_stat, p_value = ttest_rel(linear_mse_folds, rf_mse_folds)

print("Paired t-test untuk perbandingan model:")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4f}")

if p_value < 0.05:
    print("  → Ada perbedaan signifikan antara model (p < 0.05)")
else:
    print("  → Tidak ada perbedaan signifikan antara model (p ≥ 0.05)")

# 9. Summary
print("\n9. Summary")
print("-" * 50)

print("Hasil Analisis Machine Learning Statistik:")
print(f"1. Data: {n_samples} samples, 1 feature")
print(f"2. Split: {X_train.shape[0]} training, {X_test.shape[0]} test")
print(f"3. Linear Regression: MSE = {linear_mse:.4f}, R² = {linear_r2:.4f}")
print(f"4. Random Forest: MSE = {rf_mse:.4f}, R² = {rf_r2:.4f}")
print(f"5. Cross-Validation: Linear MSE = {linear_cv_mse:.4f} ± {linear_cv_std:.4f}")
print(f"6. Bootstrap CI: Random Forest R² = {rf_r2_mean:.4f} [{rf_r2_ci[0]:.4f}, {rf_r2_ci[1]:.4f}]")
print(f"7. Statistical Test: t = {t_stat:.4f}, p = {p_value:.4f}")

print("\n" + "="*60)
print("DEMONSTRASI SELESAI")
print("="*60)


## 2. Bias-Variance Tradeoff

### 2.1 Basic Concepts of Bias and Variance

The **bias-variance tradeoff** is a fundamental concept in machine learning that describes how bias, variance, and total error relate in a predictive model.

#### 2.1.1 Definition of Bias
**Bias** is the error that arises when the model's assumptions are too simple to capture the true relationship in the data.

**Characteristics of Bias:**
- **High Bias**: The model is too simple to learn complex patterns
- **Low Bias**: The model is flexible enough to learn the patterns in the data
- **Bias Error**: A systematic error that persists regardless of which training sample is drawn

**Examples of High Bias:**
- Linear regression on non-linear data
- A model that is too simple for complex data
- Assumptions that are too rigid

#### 2.1.2 Definition of Variance
**Variance** is the error that arises from the model's sensitivity to small variations in the training data.

**Characteristics of Variance:**
- **High Variance**: The model is overly sensitive to the training data
- **Low Variance**: The model is stable when the training data changes
- **Variance Error**: An error that changes from one training sample to another

**Examples of High Variance:**
- A model that is too complex
- Overfitting the training data
- Sensitivity to noise in the data

#### 2.1.3 Noise
**Noise** is the error inherent in the data that no model can remove.

**Sources of Noise:**
- Measurement error
- Sampling error
- Random variation
- Data quality issues

### 2.2 Bias-Variance Decomposition

#### 2.2.1 Mathematical Formulation
```
Expected Error = Bias² + Variance + Noise
```

Where:
- **Expected Error**: The expected squared error on new data
- **Bias²**: The squared bias
- **Variance**: The variance of the predictions
- **Noise**: The irreducible error inherent in the data

#### 2.2.2 Detailed Decomposition
```
E[(y - f̂(x))²] = [E[f̂(x)] - f(x)]² + E[(f̂(x) - E[f̂(x)])²] + σ²
```

Where:
- y = observed value, y = f(x) + ε
- f̂(x) = prediction at x from a model fit on a random training set
- f(x) = true function
- σ² = noise variance, Var(ε)
- Expectations are taken over training sets and over the noise ε

#### 2.2.3 Interpretation
- **Bias²**: Error because the model, on average, misses the true function
- **Variance**: Error because the fitted model changes with the training data
- **Noise**: Error that cannot be avoided
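
The decomposition can be checked by simulation on a deliberately simple estimator. The sketch below is an illustration under assumptions: the target is a constant `mu`, the estimator is a shrunken sample mean `c * mean(train)`, and `decompose` is a hypothetical helper. For this setup the terms are known analytically: bias² = (c − 1)²μ², variance = c²σ²/n, noise = σ².

```python
import random

def decompose(c, mu=2.0, sigma=1.0, n=10, n_trials=20000, seed=0):
    """Monte Carlo estimate of bias², variance, noise, and total squared error
    for the estimator c * mean(train) of a constant target mu."""
    rng = random.Random(seed)
    estimates = []
    sq_errors = []
    for _ in range(n_trials):
        # Fresh training set and one fresh noisy observation per trial
        train = [mu + rng.gauss(0, sigma) for _ in range(n)]
        estimate = c * sum(train) / n
        y_new = mu + rng.gauss(0, sigma)
        estimates.append(estimate)
        sq_errors.append((y_new - estimate) ** 2)
    mean_est = sum(estimates) / n_trials
    bias_sq = (mean_est - mu) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
    noise = sigma ** 2  # known by construction
    mse = sum(sq_errors) / n_trials
    return bias_sq, variance, noise, mse

print("c    | bias²  | var    | noise  | sum    | simulated MSE")
for c in (1.0, 0.8, 0.5):
    bias_sq, variance, noise, mse = decompose(c)
    total = bias_sq + variance + noise
    print(f"{c:.1f}  | {bias_sq:.4f} | {variance:.4f} | {noise:.4f} | {total:.4f} | {mse:.4f}")
```

Shrinking (smaller `c`) lowers the variance term but introduces bias, and the simulated error on new observations should match the sum of the three terms up to Monte Carlo noise.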

### 2.3 Overfitting and Underfitting

#### 2.3.1 Overfitting (High Variance, Low Bias)
**Definition**: The model is too complex and learns the noise in the training data

**Symptoms of Overfitting:**
- Very low training error
- High validation error
- Large gap between training and validation performance
- Model is overly sensitive to the training data

**Causes of Overfitting:**
- Model is too complex
- Too little training data
- Noise in the training data
- No regularization

**Remedies for Overfitting:**
- Regularization (L1, L2)
- Early stopping
- Dropout (for neural networks)
- Data augmentation
- Cross-validation (to detect overfitting and choose complexity; it does not reduce overfitting by itself)
- Ensemble methods

#### 2.3.2 Underfitting (High Bias, Low Variance)
**Definition**: The model is too simple to learn the pattern in the data

**Symptoms of Underfitting:**
- High training error
- Validation error is also high
- Model fails to capture relationships in the data
- Performance does not improve with more data

**Causes of Underfitting:**
- Model is too simple
- Insufficient feature engineering
- Regularization is too strong
- Unrepresentative data

**Remedies for Underfitting:**
- Increase model complexity
- Feature engineering
- Reduce regularization
- Use a more powerful model
- Add informative features or capacity before adding data (more rows alone rarely help a high-bias model)

#### 2.3.3 Optimal Fitting
**Goal**: Reach the best balance between bias and variance

**Characteristics of Optimal Fitting:**
- Training and validation errors are close, and both are low
- Small gap between training and validation performance
- Model generalizes well
- Performance is stable on new data

### 2.4 Model Complexity vs Performance

#### 2.4.1 Learning Curves
**Learning curves** show how model performance changes with the amount of training data.

**Characteristics of Learning Curves:**
- **Training Score**: Typically decreases as data is added (training error rises, because more points are harder to fit exactly)
- **Validation Score**: Typically increases as data is added (validation error falls)
- **Gap**: The difference between training and validation performance, which usually narrows as data grows

#### 2.4.2 Complexity Curves
**Complexity curves** show how performance changes with model complexity.

**Characteristics of Complexity Curves:**
- **Training Performance**: Improves as complexity increases
- **Validation Performance**: Improves, then degrades
- **Optimal Point**: The complexity at which validation performance peaks

#### 2.4.3 Bias-Variance Tradeoff Curve
The **tradeoff curve** shows how bias and variance move against each other.

**Characteristics of the Tradeoff Curve:**
- **High Bias, Low Variance**: Simple models
- **Low Bias, High Variance**: Complex models
- **Optimal Point**: The best balance between the two

### 2.5 Practical Implications

#### 2.5.1 Model Selection
**Principle**: Choose the model with the best balance between bias and variance

**Strategy:**
- Start with a simple model
- Increase complexity gradually
- Use cross-validation for evaluation
- Consider ensemble methods

#### 2.5.2 Regularization
**Goal**: Control model complexity to prevent overfitting

**Regularization Methods:**
- **L1 Regularization (Lasso)**: Can set coefficients exactly to zero (feature selection)
- **L2 Regularization (Ridge)**: Shrinks coefficients toward zero
- **Elastic Net**: Combination of L1 and L2
- **Early Stopping**: Stops training before the model overfits
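
The different behavior of L1 and L2 penalties is easiest to see with a single feature and no intercept, where both have closed-form solutions. The sketch below is illustrative: the objectives are written in the comments, the function names are hypothetical, and the penalty scaling is an assumption of this example (scikit-learn's `Lasso`, for instance, scales the squared-error term by 1/(2n), so its `alpha` is not numerically the same as `lam` here).

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ols_1d(xs, ys):
    """Minimizes sum (y - b*x)^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx

def ridge_1d(xs, ys, lam):
    """Minimizes sum (y - b*x)^2 + lam * b^2  ->  b = sxy / (sxx + lam)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def lasso_1d(xs, ys, lam):
    """Minimizes 0.5 * sum (y - b*x)^2 + lam * |b|  ->  b = S(sxy, lam) / sxx."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return soft_threshold(sxy, lam) / sxx

# Toy data with exact slope 0.5 (sxx = 10, sxy = 5).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.0, -0.5, 0.0, 0.5, 1.0]

print("OLS:          ", ols_1d(xs, ys))
print("Ridge lam=10: ", ridge_1d(xs, ys, 10.0))
print("Lasso lam=2:  ", lasso_1d(xs, ys, 2.0))
print("Lasso lam=5:  ", lasso_1d(xs, ys, 5.0))
```

Ridge shrinks the coefficient smoothly and never reaches zero for a finite penalty, whereas lasso subtracts a fixed amount and sets the coefficient exactly to zero once the penalty exceeds |sxy|. That thresholding is what makes L1 usable for feature selection.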

#### 2.5.3 Cross-Validation
**Goal**: Estimate the model's true performance

**Cross-Validation Methods:**
- **k-Fold CV**: The data is split into k subsets; each serves once as the validation set
- **Leave-One-Out CV**: Each observation serves once as a single-point test set
- **Stratified CV**: Preserves class proportions in every fold
- **Time Series CV**: Validation that respects temporal order

#### 2.5.4 Ensemble Methods
**Goal**: Combine multiple models to reduce variance (bagging) or bias (boosting)

**Ensemble Methods:**
- **Bagging**: Bootstrap aggregating; mainly reduces variance
- **Boosting**: Sequential learning; mainly reduces bias
- **Stacking**: Meta-learning over base-model predictions
- **Voting**: Majority (or averaged-probability) voting

### 2.6 Mathematical Analysis

#### 2.6.1 Bias Analysis
**Bias** can be analyzed through:
- **Approximation Error**: The model class cannot represent the true function
- **Estimation Error**: Error from estimating parameters with limited data (this component is usually attributed to variance; it is listed here for contrast with approximation error)
- **Model Error**: Error from wrong model assumptions (misspecification)

#### 2.6.2 Variance Analysis
**Variance** can be analyzed through:
- **Sampling Variance**: Variation due to which training sample was drawn
- **Model Variance**: Variation due to model complexity
- **Parameter Variance**: Variation in the estimated parameters

#### 2.6.3 Tradeoff Analysis
The **tradeoff** can be analyzed through:
- **Bias-Variance Decomposition**: Splitting the error into bias and variance
- **Learning Curves**: Performance as a function of data size
- **Complexity Curves**: Performance as a function of model complexity

### 2.7 Practical Applications

#### 2.7.1 Model Selection
- **Linear Models**: Typically low variance, and high bias when the true relationship is non-linear
- **Tree Models**: Deep, unpruned trees have high variance and low bias
- **Ensemble Models**: Bagging lowers variance, boosting lowers bias
- **Neural Networks**: Bias and variance are adjustable through architecture and regularization

#### 2.7.2 Hyperparameter Tuning
- **Learning Rate**: Affects convergence and stability
- **Regularization Parameter**: Affects the bias-variance tradeoff
- **Model Complexity**: Affects model capacity
- **Ensemble Size**: Affects variance reduction

#### 2.7.3 Feature Engineering
- **Feature Selection**: Reduces variance
- **Feature Creation**: Reduces bias
- **Dimensionality Reduction**: Reduces variance
- **Feature Scaling**: Affects convergence

### 2.8 Best Practices

#### 2.8.1 Model Development
- **Start Simple**: Begin with a simple model
- **Iterative Improvement**: Increase complexity gradually
- **Cross-Validation**: Use CV for evaluation
- **Regularization**: Use regularization to control complexity

#### 2.8.2 Evaluation
- **Multiple Metrics**: Use several evaluation metrics
- **Learning Curves**: Analyze learning curves
- **Validation Curves**: Analyze validation curves
- **Statistical Testing**: Test whether model differences are significant

#### 2.8.3 Interpretation
- **Bias Analysis**: Analyze the sources of bias
- **Variance Analysis**: Analyze the sources of variance
- **Tradeoff Analysis**: Analyze where the tradeoff is optimal
- **Practical Implications**: Practical consequences for the application


In [None]:
# 2.9 Demonstrasi Kode: Bias-Variance Tradeoff

print("=== DEMONSTRASI BIAS-VARIANCE TRADEOFF ===\n")

# 1. Membuat Data Simulasi
print("1. Membuat Data Simulasi")
print("-" * 50)

# Set random seed untuk reproducibility
np.random.seed(42)

# True function: f(x) = 0.5x³ - 2x² + x (noise is added separately when generating y)
def true_function(x):
    return 0.5 * x**3 - 2 * x**2 + x

# Generate data
n_samples = 1000
X = np.random.uniform(-3, 3, n_samples)
y = true_function(X) + np.random.normal(0, 0.5, n_samples)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Jumlah sampel: {n_samples}")
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

# 2. Model dengan Berbagai Kompleksitas
print("\n2. Model dengan Berbagai Kompleksitas")
print("-" * 50)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Model dengan berbagai degree polynomial
degrees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
models = {}
train_scores = []
test_scores = []

for degree in degrees:
    # Polynomial regression
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
    # Fit model
    model.fit(X_train.reshape(-1, 1), y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train.reshape(-1, 1))
    y_test_pred = model.predict(X_test.reshape(-1, 1))
    
    # Scores
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    
    models[degree] = model
    train_scores.append(train_mse)
    test_scores.append(test_mse)
    
    print(f"Degree {degree:2d}: Train MSE = {train_mse:.4f}, Test MSE = {test_mse:.4f}")

# 3. Bias-Variance Analysis
print("\n3. Bias-Variance Analysis")
print("-" * 50)

from sklearn.base import clone

# Estimate bias² and variance for each model
def calculate_bias_variance(model, X_test, f_test, n_bootstrap=100):
    """Estimate bias² and variance of a model via bootstrap refits.

    f_test holds the noise-free targets true_function(X_test), which are
    known here only because the data are simulated. Resampling one training
    set approximates, but does not equal, drawing fresh training sets.
    """
    predictions = []

    for _ in range(n_bootstrap):
        # Draw a bootstrap sample from the training set
        indices = np.random.choice(len(X_train), len(X_train), replace=True)
        X_boot = X_train[indices].reshape(-1, 1)
        y_boot = y_train[indices]

        # Fit a fresh copy so the fitted models stored in `models` stay untouched
        boot_model = clone(model).fit(X_boot, y_boot)

        # Predict on the test set
        predictions.append(boot_model.predict(X_test.reshape(-1, 1)))

    predictions = np.array(predictions)

    # Bias²: squared gap between the average prediction and the true function
    mean_pred = np.mean(predictions, axis=0)
    bias_squared = np.mean((mean_pred - f_test)**2)
    # Variance: spread of the predictions across bootstrap fits
    variance = np.mean(np.var(predictions, axis=0))

    return bias_squared, variance

# Analyze a subset of the models
selected_degrees = [1, 3, 5, 7, 9]
bias_scores = []
variance_scores = []
total_scores = []

# Irreducible error: variance of the added noise, 0.5² = 0.25
noise = 0.25
# Noise-free targets, so that noise is not counted twice in the total
f_test = true_function(X_test)

print("Bias-Variance Analysis:")
print("Degree | Bias²    | Variance | Total   | Noise")
print("-" * 50)

for degree in selected_degrees:
    model = models[degree]
    bias_sq, variance = calculate_bias_variance(model, X_test, f_test)

    # Expected test error = bias² + variance + noise
    total_error = bias_sq + variance + noise

    bias_scores.append(bias_sq)
    variance_scores.append(variance)
    total_scores.append(total_error)

    print(f"{degree:6d} | {bias_sq:8.4f} | {variance:8.4f} | {total_error:7.4f} | {noise:6.4f}")

# 4. Learning Curves
print("\n4. Learning Curves")
print("-" * 50)

# Compute learning curves (R² vs sample size) for the degree 3 and degree 7 models
def learning_curve(model, X, y, train_sizes):
    train_scores = []
    val_scores = []
    
    for size in train_sizes:
        # Sample data
        indices = np.random.choice(len(X), size, replace=False)
        X_sample = X[indices].reshape(-1, 1)
        y_sample = y[indices]
        
        # Split the sample into training and validation parts
        X_train_sample, X_val_sample, y_train_sample, y_val_sample = train_test_split(
            X_sample, y_sample, test_size=0.3, random_state=42
        )
        
        # Fit model
        model.fit(X_train_sample, y_train_sample)
        
        # Scores
        train_score = model.score(X_train_sample, y_train_sample)
        val_score = model.score(X_val_sample, y_val_sample)
        
        train_scores.append(train_score)
        val_scores.append(val_score)
    
    return train_scores, val_scores

# Learning curves
train_sizes = [20, 50, 100, 200, 300, 400, 500, 600, 700]

# Degree 3: matches the true cubic function (well-specified baseline, not underfitting)
model_underfit = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),
    ('linear', LinearRegression())
])
train_scores_3, val_scores_3 = learning_curve(model_underfit, X_train, y_train, train_sizes)

# Degree 7: more flexible than the true function (prone to overfitting on small samples)
model_overfit = Pipeline([
    ('poly', PolynomialFeatures(degree=7)),
    ('linear', LinearRegression())
])
train_scores_7, val_scores_7 = learning_curve(model_overfit, X_train, y_train, train_sizes)

print("Learning Curves (R² Score):")
print("Size | Degree 3 Train | Degree 3 Val | Degree 7 Train | Degree 7 Val")
print("-" * 70)
for i, size in enumerate(train_sizes):
    print(f"{size:4d} | {train_scores_3[i]:12.4f} | {val_scores_3[i]:11.4f} | {train_scores_7[i]:13.4f} | {val_scores_7[i]:10.4f}")

# 5. Visualization
print("\n5. Visualization")
print("-" * 50)

# Plot 1: Model Fits
plt.figure(figsize=(20, 15))

plt.subplot(3, 3, 1)
X_plot = np.linspace(-3, 3, 100)
y_true = true_function(X_plot)
plt.scatter(X, y, alpha=0.3, label='Data', s=10)
plt.plot(X_plot, y_true, 'k-', label='True Function', linewidth=2)

# Plot several of the fitted models
for degree in [1, 3, 5, 7, 9]:
    model = models[degree]
    y_pred = model.predict(X_plot.reshape(-1, 1))
    plt.plot(X_plot, y_pred, label=f'Degree {degree}', linewidth=2)

plt.xlabel('X')
plt.ylabel('y')
plt.title('Model Fits at Varying Complexity')
plt.legend()
plt.grid(True)

# Plot 2: Bias-Variance Tradeoff
plt.subplot(3, 3, 2)
plt.plot(selected_degrees, bias_scores, 'o-', label='Bias²', linewidth=2, markersize=8)
plt.plot(selected_degrees, variance_scores, 's-', label='Variance', linewidth=2, markersize=8)
plt.plot(selected_degrees, total_scores, '^-', label='Total Error', linewidth=2, markersize=8)
plt.xlabel('Polynomial Degree')
plt.ylabel('Error')
plt.title('Bias-Variance Tradeoff')
plt.legend()
plt.grid(True)

# Plot 3: Learning Curves
plt.subplot(3, 3, 3)
plt.plot(train_sizes, train_scores_3, 'o-', label='Degree 3 Train', linewidth=2)
plt.plot(train_sizes, val_scores_3, 's-', label='Degree 3 Val', linewidth=2)
plt.plot(train_sizes, train_scores_7, 'o-', label='Degree 7 Train', linewidth=2)
plt.plot(train_sizes, val_scores_7, 's-', label='Degree 7 Val', linewidth=2)
plt.xlabel('Training Size')
plt.ylabel('R² Score')
plt.title('Learning Curves')
plt.legend()
plt.grid(True)

# Plot 4: Training vs Test Error
plt.subplot(3, 3, 4)
plt.plot(degrees, train_scores, 'o-', label='Training Error', linewidth=2)
plt.plot(degrees, test_scores, 's-', label='Test Error', linewidth=2)
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.title('Training vs Test Error')
plt.legend()
plt.grid(True)

# Plot 5: Error Gap
plt.subplot(3, 3, 5)
error_gap = np.array(test_scores) - np.array(train_scores)
plt.plot(degrees, error_gap, 'o-', label='Error Gap', linewidth=2, color='red')
plt.xlabel('Polynomial Degree')
plt.ylabel('Error Gap')
plt.title('Overfitting Indicator')
plt.legend()
plt.grid(True)

# Plot 6: Model Complexity vs Performance
plt.subplot(3, 3, 6)
plt.plot(degrees, train_scores, 'o-', label='Training', linewidth=2)
plt.plot(degrees, test_scores, 's-', label='Test', linewidth=2)
plt.axvline(x=3, color='green', linestyle='--', label='Optimal', linewidth=2)
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.title('Model Complexity vs Performance')
plt.legend()
plt.grid(True)

# Plot 7: Residuals Analysis (Degree 3)
plt.subplot(3, 3, 7)
model_3 = models[3]
y_pred_3 = model_3.predict(X_test.reshape(-1, 1))
residuals_3 = y_test - y_pred_3
plt.scatter(y_pred_3, residuals_3, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals (Degree 3)')
plt.grid(True)

# Plot 8: Residuals Analysis (Degree 7)
plt.subplot(3, 3, 8)
model_7 = models[7]
y_pred_7 = model_7.predict(X_test.reshape(-1, 1))
residuals_7 = y_test - y_pred_7
plt.scatter(y_pred_7, residuals_7, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals (Degree 7)')
plt.grid(True)

# Plot 9: Optimal Model
plt.subplot(3, 3, 9)
optimal_degree = 3
optimal_model = models[optimal_degree]
y_pred_optimal = optimal_model.predict(X_plot.reshape(-1, 1))
plt.scatter(X, y, alpha=0.3, label='Data', s=10)
plt.plot(X_plot, y_true, 'k-', label='True Function', linewidth=2)
plt.plot(X_plot, y_pred_optimal, 'r-', label=f'Optimal Model (Degree {optimal_degree})', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Optimal Model')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# 6. Statistical Analysis
print("\n6. Statistical Analysis")
print("-" * 50)

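# Paired t-test for model comparison
from scipy.stats import ttest_rel

# Compare degree 3 vs degree 7
model_3_scores = []
model_7_scores = []

# Per-fold cross-validation scores. Folds share training data, so the scores
# are not independent and the p-value below is only a rough guide.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)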

for train_idx, val_idx in kfold.split(X_train):
    X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
    y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]
    
    # Model degree 3
    model_3.fit(X_fold_train.reshape(-1, 1), y_fold_train)
    y_pred_3_fold = model_3.predict(X_fold_val.reshape(-1, 1))
    mse_3 = mean_squared_error(y_fold_val, y_pred_3_fold)
    model_3_scores.append(mse_3)
    
    # Model degree 7
    model_7.fit(X_fold_train.reshape(-1, 1), y_fold_train)
    y_pred_7_fold = model_7.predict(X_fold_val.reshape(-1, 1))
    mse_7 = mean_squared_error(y_fold_val, y_pred_7_fold)
    model_7_scores.append(mse_7)

# Paired t-test
t_stat, p_value = ttest_rel(model_3_scores, model_7_scores)

print("Paired t-test (Degree 3 vs Degree 7):")
print(f"  Degree 3 MSE: {np.mean(model_3_scores):.4f} ± {np.std(model_3_scores):.4f}")
print(f"  Degree 7 MSE: {np.mean(model_7_scores):.4f} ± {np.std(model_7_scores):.4f}")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4f}")

if p_value < 0.05:
    print("  → The models differ significantly (p < 0.05)")
else:
    print("  → No significant difference between the models (p ≥ 0.05)")

# 7. Summary
print("\n7. Summary")
print("-" * 50)

# Find the degree with the lowest test MSE.
# Selecting on the test set makes optimal_mse an optimistic estimate.
optimal_idx = np.argmin(test_scores)
optimal_degree = degrees[optimal_idx]
optimal_mse = test_scores[optimal_idx]

print("Bias-Variance Tradeoff Results:")
print(f"1. Optimal model: Degree {optimal_degree}")
print(f"2. Optimal MSE: {optimal_mse:.4f}")
print(f"3. Bias² (Degree 3): {bias_scores[1]:.4f}")
print(f"4. Variance (Degree 3): {variance_scores[1]:.4f}")
print(f"5. Bias² (Degree 7): {bias_scores[3]:.4f}")
print(f"6. Variance (Degree 7): {variance_scores[3]:.4f}")
print(f"7. Error Gap (Degree 3): {test_scores[2] - train_scores[2]:.4f}")
print(f"8. Error Gap (Degree 7): {test_scores[6] - train_scores[6]:.4f}")

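print("\nInterpretation:")
true_degree = 3  # degree of true_function
if optimal_degree < true_degree:
    print("  → Selected model is simpler than the true function: underfitting (high bias, low variance)")
elif optimal_degree > true_degree:
    print("  → Selected model is more flexible than the true function; its edge over degree 3 may be sampling noise")
else:
    print("  → Selected model matches the true cubic function (balanced bias-variance)")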

print("\n" + "="*60)
print("DEMONSTRATION COMPLETE")
print("="*60)
