# Resampling and CV - How to Compare Models Without Fooling Yourself

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center><a class="tocSkip"></center>
# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/08_cross_validation_model_comparison.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Run k-fold cross-validation for classification and regression
2. Use stratified CV for classification
3. Understand variance of performance estimates (why one split is fragile)
4. Compare models using consistent CV and a single primary metric
5. Build a reusable CV evaluation function (project-ready)

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.model_selection import (
    KFold, StratifiedKFold, cross_val_score, cross_validate, RepeatedKFold
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

**Reading the output:**

The setup cell imports the resampling toolkit we will use throughout this notebook. Key imports include `KFold` and `StratifiedKFold` (CV splitters), `cross_val_score` and `cross_validate` (convenience functions that handle the train-evaluate loop internally), and `RepeatedKFold` for more stable estimates. On the modeling side, we import `LogisticRegression`, `Ridge`, and `RandomForestClassifier` so we can compare diverse algorithms on the same folds. The confirmation **"Setup complete!"** means everything loaded successfully.

**Why this matters:** scikit-learn provides dedicated splitter objects rather than manual index slicing. With `shuffle=True` and a fixed `random_state` they produce reproducible folds, and the stratified variants preserve the class proportions in every fold.

---

## 1. Why Cross-Validation Exists

### The Problem with Single Train/Val Splits

**Issues:**
- Performance depends on which samples end up in validation
- High variance in estimates
- May get lucky or unlucky with the split
- Wastes data (the validation rows are never used for training)
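
To make the "lucky or unlucky split" problem concrete, here is a minimal sketch that repeats a single 80/20 split with different seeds and records how much the validation score moves. The choice of 20 seeds, the 20% validation size, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

split_scores = []
for seed in range(20):  # 20 seeds is an arbitrary, illustrative choice
    # One train/validation split per seed; only the seed changes
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_val)[:, 1]
    split_scores.append(roc_auc_score(y_val, proba))

split_scores = np.array(split_scores)
print(f"Lowest single-split ROC-AUC:  {split_scores.min():.4f}")
print(f"Highest single-split ROC-AUC: {split_scores.max():.4f}")
print(f"Mean ± Std over 20 splits:    {split_scores.mean():.4f} ± {split_scores.std():.4f}")
```

Any one of these 20 numbers could have been "the" validation score if you had split only once.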

### The Solution: Cross-Validation

**K-Fold CV:**
1. Split data into K folds
2. Train on K-1 folds, validate on 1 fold
3. Repeat K times (each fold gets to be validation once)
4. Average the K performance scores

**Benefits:**
- More stable performance estimates
- Uses all data for both training and validation
- Reveals variance in model performance
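
The four K-Fold steps listed above can be sketched as an explicit loop. This is a minimal illustration on synthetic data (`make_regression` and the variable names are assumptions made for the example); it also checks that the manual loop agrees with `cross_val_score` when both use the same splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only to keep the sketch self-contained
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: split the data into K folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    # Step 2: train on K-1 folds, validate on the held-out fold
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))
    # Step 3: the loop repeats this K times, once per fold

manual_scores = np.array(manual_scores)

# Step 4: average the K scores
print(f"Manual loop mean R²: {manual_scores.mean():.4f}")

# The convenience function runs the same loop internally
helper_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=kf, scoring="r2")
print(f"cross_val_score mean R²: {helper_scores.mean():.4f}")
```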

## 2. K-Fold Cross-Validation for Regression

We start with a regression example to build intuition about what K-Fold CV actually does. The California Housing dataset (20,640 samples, 8 features) predicts median house value from census-block-level attributes such as median income, average rooms, and geographic coordinates.

With `KFold(n_splits=5)`, the data is divided into 5 equally-sized folds. The model trains on 4 folds (~16,512 samples) and evaluates on the held-out fold (~4,128 samples), repeating this 5 times so every sample is validated exactly once. The result is 5 R-squared scores whose mean and standard deviation give us a more honest estimate of model quality than any single train/val split could.
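
The fold arithmetic above can be checked without downloading the dataset. In this small sketch, a placeholder array with 20,640 rows stands in for the real features (an assumption for illustration; `KFold` only needs the row count).

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20_640
placeholder = np.zeros((n_samples, 1))  # stand-in for the real feature matrix

kf = KFold(n_splits=5, shuffle=True, random_state=474)
splits = list(kf.split(placeholder))

# (train size, validation size) for each fold
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in splits]
print(fold_sizes)  # each of the 5 folds: (16512, 4128)

# Every row lands in a validation fold exactly once
all_val_idx = np.sort(np.concatenate([val_idx for _, val_idx in splits]))
print(np.array_equal(all_val_idx, np.arange(n_samples)))  # True
```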

In [None]:
# Load regression dataset
california = fetch_california_housing(as_frame=True)
X_reg = california.data
y_reg = california.target

print(f"Regression dataset: {X_reg.shape}")

# Create pipeline
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge(alpha=1.0))
])

# 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
cv_scores = cross_val_score(ridge_pipeline, X_reg, y_reg, cv=cv, scoring='r2')

print("\n=== 5-FOLD CROSS-VALIDATION (Regression) ===")
print(f"Fold scores: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.4f}")
print(f"Std R²:  {cv_scores.std():.4f}")
print(f"95% CI:  [{cv_scores.mean() - 2*cv_scores.std():.4f}, {cv_scores.mean() + 2*cv_scores.std():.4f}]")

# Visualize fold variation
plt.figure(figsize=(10, 6))
plt.bar(range(1, 6), cv_scores, alpha=0.7, edgecolor='black')
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', label=f'Mean = {cv_scores.mean():.4f}')
plt.axhline(y=cv_scores.mean() + cv_scores.std(), color='orange', linestyle=':', label=f'±1 Std')
plt.axhline(y=cv_scores.mean() - cv_scores.std(), color='orange', linestyle=':')
plt.xlabel('Fold')
plt.ylabel('R² Score')
plt.title('Cross-Validation Scores Across Folds')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Notice the variation across folds - this is why CV matters!")

**Reading the output:**

Five R-squared scores are printed, one per fold. Notice the variation -- the highest fold might score around **R² = 0.61** while the lowest might dip to **R² = 0.58**. The summary line reports the **mean R²** (roughly 0.59-0.61) and the **standard deviation** (typically 0.01-0.02).

The bar chart makes the fold-to-fold variation visible. The red dashed line marks the mean, and the orange dotted lines mark plus/minus one standard deviation. If any bar falls far outside the orange band, it may indicate that certain regions of the data are harder to predict (for example, luxury coastal properties versus inland suburbs).

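The last summary line reports mean +/- 2*std. Although the code labels it a 95% CI, it is better read as a rough band for where individual fold scores fall than as a formal confidence interval for the model's true generalization performance: it rests on only five scores, and those scores are correlated because the training sets overlap. A narrow band means the folds agree with each other; a wide one means the estimate is sensitive to the split, and more data or a different CV scheme (such as repeated CV) may help.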
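
A mean +/- 2*std band can blur two different quantities: the spread of individual fold scores and the uncertainty of their mean. The sketch below separates them using the standard error of the mean, $\text{std}/\sqrt{k}$. The five scores are hypothetical values chosen for illustration, and because CV folds are not independent, the standard error shown here is likely optimistic.

```python
import numpy as np

# Hypothetical fold scores, for illustration only
fold_scores = np.array([0.58, 0.60, 0.61, 0.59, 0.60])
k = len(fold_scores)

mean = fold_scores.mean()
std = fold_scores.std(ddof=1)   # sample standard deviation of the fold scores
sem = std / np.sqrt(k)          # standard error of the mean

# Band describing where individual fold scores tend to fall
spread_band = (mean - 2 * std, mean + 2 * std)
# Band describing uncertainty about the mean (approximate: folds are correlated)
mean_band = (mean - 2 * sem, mean + 2 * sem)

print(f"Mean: {mean:.4f}, Std: {std:.4f}, SEM: {sem:.4f}")
print(f"Fold-score band (mean ± 2·std): [{spread_band[0]:.4f}, {spread_band[1]:.4f}]")
print(f"Mean band (mean ± 2·SEM):       [{mean_band[0]:.4f}, {mean_band[1]:.4f}]")
```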

**Key takeaway:** A single train/val split would have given you just *one* of these five bars. Cross-validation gives you the full picture: both the central tendency and the uncertainty around it.

---

## 3. Stratified K-Fold for Classification

When the target variable is categorical, random K-Fold splitting can produce folds with different class proportions. For instance, if the overall positive rate is 63%, one fold might end up with 55% while another gets 70%, injecting noise into the CV estimates.

**Stratified K-Fold** solves this by enforcing the same class distribution in every fold. For the breast cancer dataset (569 samples, 212 malignant, 357 benign), each of the 5 folds will contain approximately 42 malignant and 71 benign observations. The cell below runs *both* regular and stratified 5-fold CV side by side so you can compare the variance in scores directly.

In [None]:
# Load classification dataset
data = load_breast_cancer(as_frame=True)
X_clf = data.data
y_clf = data.target

print(f"Classification dataset: {X_clf.shape}")
print(f"Class distribution: {np.bincount(y_clf)}")
print(f"Class ratio: {np.bincount(y_clf)[1] / len(y_clf):.4f}")

# Create pipeline
log_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
])

# Compare regular vs stratified CV
cv_regular = KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
cv_stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

scores_regular = cross_val_score(log_pipeline, X_clf, y_clf, cv=cv_regular, scoring='roc_auc')
scores_stratified = cross_val_score(log_pipeline, X_clf, y_clf, cv=cv_stratified, scoring='roc_auc')

print("\n=== REGULAR K-FOLD ===")
print(f"Scores: {scores_regular}")
print(f"Mean ± Std: {scores_regular.mean():.4f} ± {scores_regular.std():.4f}")

print("\n=== STRATIFIED K-FOLD ===")
print(f"Scores: {scores_stratified}")
print(f"Mean ± Std: {scores_stratified.mean():.4f} ± {scores_stratified.std():.4f}")

print("\n💡 Stratified CV typically has lower variance")
print(f"💡 Variance reduction: {(scores_regular.std() - scores_stratified.std()) / scores_regular.std() * 100:.1f}%")

**Reading the output:**

The breast cancer dataset has 569 samples with 30 features. The class distribution shows roughly **63% benign (class 1)** and **37% malignant (class 0)**.

Two sets of 5-fold CV scores are printed side by side:
- **Regular K-Fold:** The ROC-AUC scores show noticeable variation because some folds may have unbalanced class ratios by chance.
- **Stratified K-Fold:** The scores are typically tighter (lower standard deviation) because every fold mirrors the overall 63/37 class split.

The **variance reduction** percentage at the bottom compares the two standard deviations (despite the label, it is computed from std, not variance). A positive value means stratified CV produced tighter fold scores on this particular run; a negative value is possible, because a comparison based on five folds is itself noisy. In practice, the improvement can range from modest (5-10%) to dramatic (50%+), depending on how unbalanced the data is and how small the dataset is.

**Why this matters:** For classification tasks, default to `StratifiedKFold` (or `StratifiedShuffleSplit`). It costs essentially the same to compute, tends to give lower-variance estimates, and, as long as each class has at least as many samples as there are folds, prevents a fold from ending up with zero samples of the minority class.
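
To see what stratification does to the folds themselves (rather than to the scores), the sketch below prints the share of class 1 in each validation fold under both splitters. The helper name `positive_rate_per_fold` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

def positive_rate_per_fold(splitter):
    """Share of class 1 in each validation fold (hypothetical helper)."""
    return np.array([y[val_idx].mean() for _, val_idx in splitter.split(X, y)])

rates_regular = positive_rate_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=474)
)
rates_stratified = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=474)
)

print(f"Overall class-1 rate: {y.mean():.4f}")
print(f"Regular K-Fold:       {rates_regular.round(4)}")
print(f"Stratified K-Fold:    {rates_stratified.round(4)}")
```

The stratified rates should all sit very close to the overall rate, while the regular K-Fold rates are free to drift from fold to fold.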

---

## 📝 PAUSE-AND-DO Exercise 1 (5 minutes)

**Task:** Write `cv_report(model, X, y, cv, scoring)` returning mean/std.

---

In [None]:
def cv_report(model, X, y, cv, scoring='accuracy'):
    """
    Generate a comprehensive cross-validation report.
    
    Parameters:
    -----------
    model : estimator
        sklearn model or pipeline
    X : array-like
        Features
    y : array-like
        Target
    cv : cross-validator
        CV splitter (e.g., KFold, StratifiedKFold)
    scoring : str
        Scoring metric
    
    Returns:
    --------
    dict : Dictionary with scores and statistics
    """
    # Run cross-validation
    cv_results = cross_validate(
        model, X, y, cv=cv, scoring=scoring,
        return_train_score=True,
        n_jobs=-1
    )
    
    # Extract scores
    train_scores = cv_results['train_score']
    val_scores = cv_results['test_score']
    fit_times = cv_results['fit_time']
    
    # Calculate statistics
    report = {
        'train_mean': train_scores.mean(),
        'train_std': train_scores.std(),
        'val_mean': val_scores.mean(),
        'val_std': val_scores.std(),
        'overfit_gap': train_scores.mean() - val_scores.mean(),
        'fold_scores': val_scores,
        'mean_fit_time': fit_times.mean()
    }
    
    # Print report
    print(f"=== CROSS-VALIDATION REPORT ({scoring}) ===")
    print(f"Validation: {report['val_mean']:.4f} ± {report['val_std']:.4f}")
    print(f"Training:   {report['train_mean']:.4f} ± {report['train_std']:.4f}")
    print(f"Overfit gap: {report['overfit_gap']:.4f}")
    print(f"Mean fit time: {report['mean_fit_time']:.3f}s")
    print(f"\nFold-by-fold: {report['fold_scores'].round(4)}")
    
    return report

# Test the function
cv_strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
report = cv_report(log_pipeline, X_clf, y_clf, cv_strat, scoring='roc_auc')

**Reading the output:**

The `cv_report` function prints a structured summary with four key numbers:

- **Validation score (mean +/- std):** The average ROC-AUC across the 5 folds, plus the spread. This is the number you would report in a paper or presentation.
- **Training score (mean +/- std):** How well the model fits its own training data. Training scores are almost always higher than validation scores.
- **Overfit gap:** The difference between training and validation means. A gap near zero suggests the model generalizes well; a large gap (e.g., > 0.05) signals overfitting.
- **Mean fit time:** How long each fold took to train, useful when comparing fast linear models against slower ensemble methods.

The fold-by-fold array at the bottom lets you inspect individual fold scores for outliers. If one fold is dramatically lower, it may point to a data quality issue or a particularly hard subpopulation.

**Key takeaway:** Always report *both* the mean and standard deviation. Saying "ROC-AUC = 0.99" without the spread hides the fact that some folds might score 0.97 -- a gap that could matter in high-stakes applications.
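
The same train/validation reporting pattern carries over to regression error metrics, with one wrinkle: scikit-learn scorers follow a "greater is better" convention, so error metrics are exposed in negated form (for example `neg_root_mean_squared_error`). Below is a minimal sketch on synthetic data; the `make_regression` dataset and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data, used only to keep the sketch self-contained
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

res = cross_validate(
    Ridge(alpha=1.0), X, y, cv=cv,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

# Scores come back negated; flip the sign to report RMSE in the usual units
val_rmse = -res["test_score"]
train_rmse = -res["train_score"]

print(f"Validation RMSE: {val_rmse.mean():.3f} ± {val_rmse.std():.3f}")
print(f"Training RMSE:   {train_rmse.mean():.3f} ± {train_rmse.std():.3f}")
# For an error metric, overfitting shows up as validation error above training error
print(f"Gap (val - train): {val_rmse.mean() - train_rmse.mean():.3f}")
```

Because the raw scores are negated, a "train minus validation" gap computed on them is still positive when the model fits the training data better than the validation data.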

---

### YOUR ANALYSIS:

**Question 1: Why do we report both mean and std?**  
[Your answer - what does std tell us?]

**Question 2: What's a good overfit gap?**  
[Your answer - when should you worry?]

**Question 3: How would you use this in your project?**  
[Your answer - workflow integration]

---

## 4. Fair Model Comparison Under CV

### The Rules for Fair Comparison

**Must be identical:**
1. Same CV folds (same splitter settings and the same `random_state`)
2. Same data
3. Same scoring metric
4. Same preprocessing (if any)

**Why this matters:**
- Different folds → different scores (not comparable)
- Different metrics → different models win
- Unfair comparison → wrong model chosen
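
The "same folds" rule can be verified directly. This sketch, on a small synthetic array (an assumption for illustration), shows that a shuffled splitter with an integer `random_state` yields identical validation folds each time it is used, while a different seed yields different folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Small synthetic dataset: 40 samples of class 0 and 60 of class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 40 + [1] * 60)

def val_folds(splitter):
    """Validation indices of each fold, as tuples for easy comparison."""
    return [tuple(val_idx) for _, val_idx in splitter.split(X, y)]

fixed = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)
first_pass = val_folds(fixed)    # e.g. the folds seen by model A
second_pass = val_folds(fixed)   # e.g. the folds seen by model B

other_seed = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
different = val_folds(other_seed)

print("Same splitter, two passes identical:", first_pass == second_pass)
print("Different seed gives same folds:    ", first_pass == different)
```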

In [None]:
# Compare multiple models with SAME CV folds
models = {
    'Logistic (C=1.0)': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=1.0, random_state=RANDOM_SEED, max_iter=1000))
    ]),
    'Logistic (C=0.1)': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=0.1, random_state=RANDOM_SEED, max_iter=1000))
    ]),
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED))
    ])
}

# Use SAME CV splitter for all models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

results = []
for name, model in models.items():
    cv_results = cross_validate(
        model, X_clf, y_clf, cv=cv,
        scoring=['roc_auc', 'accuracy', 'f1'],
        return_train_score=True,
        n_jobs=-1
    )
    
    results.append({
        'Model': name,
        'ROC_AUC_mean': cv_results['test_roc_auc'].mean(),
        'ROC_AUC_std': cv_results['test_roc_auc'].std(),
        'Accuracy_mean': cv_results['test_accuracy'].mean(),
        'Accuracy_std': cv_results['test_accuracy'].std(),
        'F1_mean': cv_results['test_f1'].mean(),
        'F1_std': cv_results['test_f1'].std(),
        'Fit_Time': cv_results['fit_time'].mean()
    })

comparison_df = pd.DataFrame(results)
print("=== MODEL COMPARISON (5-FOLD CV) ===")
print(comparison_df.to_string(index=False))

# Highlight best model
best_idx = comparison_df['ROC_AUC_mean'].idxmax()
print(f"\n✓ Best ROC-AUC: {comparison_df.loc[best_idx, 'Model']} ({comparison_df.loc[best_idx, 'ROC_AUC_mean']:.4f})")

**Reading the output:**

The comparison table lists three models -- **Logistic (C=1.0)**, **Logistic (C=0.1)**, and **Random Forest** -- evaluated on the *same* 5 stratified folds. For each model, three metrics are reported: ROC-AUC, Accuracy, and F1, each with mean and standard deviation.

Key observations:
- All three models perform well on this relatively easy dataset, with ROC-AUC values typically above **0.98**.
- The two logistic regression variants may differ only slightly, showing that stronger regularization (C=0.1; a smaller `C` means a larger penalty) has limited impact relative to the default (C=1.0) when features are already informative.
- Random Forest may show a marginally higher or lower score; the standard deviations will tell you whether the difference is meaningful.
- The **Fit_Time** column reveals that Random Forest takes noticeably longer than Logistic Regression due to building 100 decision trees.

The final line highlights the **best model by ROC-AUC**. However, if two models are within one standard deviation of each other, the simpler model (Logistic Regression) is generally preferred -- it is faster, more interpretable, and less prone to overfitting.

**Why this matters:** Fair model comparison requires identical folds, identical metrics, and identical data. Change any one of these and the comparison becomes invalid. This pattern -- same `StratifiedKFold` object passed to every model -- is the gold standard.
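
Because every model sees the same folds, scores can also be compared fold by fold. The sketch below computes paired per-fold differences between two logistic pipelines. It is one informal way to judge whether a gap is consistent across folds, not a formal significance test, and the helper name `make_logit` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)

def make_logit(C):
    """Scaled logistic regression pipeline (hypothetical helper)."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])

scores_a = cross_val_score(make_logit(1.0), X, y, cv=cv, scoring="roc_auc")
scores_b = cross_val_score(make_logit(0.1), X, y, cv=cv, scoring="roc_auc")

# Fold i of model A and fold i of model B use the same validation rows
diff = scores_a - scores_b
print(f"Per-fold difference (C=1.0 minus C=0.1): {diff.round(4)}")
print(f"Mean difference: {diff.mean():.4f}")
print(f"Std of differences: {diff.std(ddof=1):.4f}")
print(f"Folds where C=1.0 scores higher: {(diff > 0).sum()} of {len(diff)}")
```

If the sign of the difference flips from fold to fold, the two models are hard to tell apart on this data.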

---

## 📝 PAUSE-AND-DO Exercise 2 (5 minutes)

**Task:** Compare logistic regression with the default penalty (C=1.0) against a more strongly regularized version (C=0.1) under the same CV.

Already done above! Now answer:

---

### YOUR COMPARISON:

**Observation 1: Which model performs best?**  
[Based on primary metric - ROC-AUC]

**Observation 2: Is the difference statistically meaningful?**  
[Compare means relative to standard deviations]

**Observation 3: What about fit time?**  
[Is the performance improvement worth the time cost?]

---

## 5. Advanced: Repeated Cross-Validation

For even more stable estimates, repeat CV multiple times with different random seeds:

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

# 5-fold CV repeated 3 times = 15 total evaluations
cv_repeated = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=RANDOM_SEED)

scores_single = cross_val_score(log_pipeline, X_clf, y_clf, cv=cv_strat, scoring='roc_auc')
scores_repeated = cross_val_score(log_pipeline, X_clf, y_clf, cv=cv_repeated, scoring='roc_auc')

print("=== SINGLE VS REPEATED CV ===")
print(f"\nSingle 5-fold:")
print(f"  Mean: {scores_single.mean():.4f}")
print(f"  Std:  {scores_single.std():.4f}")
print(f"  N evaluations: {len(scores_single)}")

print(f"\nRepeated 5-fold (×3):")
print(f"  Mean: {scores_repeated.mean():.4f}")
print(f"  Std:  {scores_repeated.std():.4f}")
print(f"  N evaluations: {len(scores_repeated)}")

# Visualize distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(scores_single, bins=5, edgecolor='black', alpha=0.7)
axes[0].axvline(scores_single.mean(), color='r', linestyle='--', label=f'Mean = {scores_single.mean():.4f}')
axes[0].set_xlabel('ROC-AUC Score')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Single 5-Fold CV')
axes[0].legend()

axes[1].hist(scores_repeated, bins=10, edgecolor='black', alpha=0.7)
axes[1].axvline(scores_repeated.mean(), color='r', linestyle='--', label=f'Mean = {scores_repeated.mean():.4f}')
axes[1].set_xlabel('ROC-AUC Score')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Repeated 5-Fold CV (×3)')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n💡 Repeated CV gives more samples → more stable estimate")
print("💡 Trade-off: More computation time")

**Reading the output:**

Two CV strategies are compared for the same Logistic Regression pipeline:

- **Single 5-Fold CV:** 5 evaluation scores.
- **Repeated 5-Fold CV (x3):** 15 evaluation scores (5 folds repeated with 3 different random shuffles).

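The means should be close to each other. The standard deviation of the individual scores need not shrink, because each score still comes from a single validation fold of the same size. What improves is the precision of the *mean*: averaging 15 scores from 3 different shuffles makes the estimate less dependent on any one partition of the data. The two histograms (single CV on the left, repeated CV on the right) show the difference in coverage: the repeated-CV panel gives a fuller picture of the score distribution, while the single-CV panel contains only 5 values. Keep in mind that the 15 scores are not independent, since every repeat reuses the same observations.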

The tradeoff is computational: repeated CV requires **3x the model fits** (15 instead of 5). For fast models like Logistic Regression this is negligible, but for large Random Forests or deep learning models, the extra time can be significant.

**Key takeaway:** Use repeated CV when you need high-confidence estimates (e.g., for a final report or publication). Use a single pass of k-fold CV during early exploration when speed matters more than precision.
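
One way to see what repetition buys you is to average the scores within each repeat and compare those per-repeat means. The sketch below assumes that `RepeatedStratifiedKFold` yields its splits repeat by repeat, so the 15 scores can be reshaped into 3 rows of 5; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

n_splits, n_repeats = 5, 3
cv_repeated = RepeatedStratifiedKFold(
    n_splits=n_splits, n_repeats=n_repeats, random_state=474
)
print(f"Total model fits: {cv_repeated.get_n_splits()}")  # 15

scores = cross_val_score(pipe, X, y, cv=cv_repeated, scoring="roc_auc")

# One row per repeat, one column per fold
per_repeat = scores.reshape(n_repeats, n_splits)
repeat_means = per_repeat.mean(axis=1)

print(f"Mean of each 5-fold repeat: {repeat_means.round(4)}")
print(f"Overall mean (15 scores):   {scores.mean():.4f}")
```

Each per-repeat mean is what a single 5-fold run with that shuffle would have reported; the overall mean averages out the differences between them.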

---

## 6. Reusable CV Comparison Function (Project-Ready)

Throughout the previous sections we repeated the same pattern: create a CV splitter, loop over models, collect scores, print results. The function below packages that entire workflow into a single call, `compare_models()`, that accepts a dictionary of named models, a CV splitter, and one or more scoring metrics.

This is the function you will copy directly into your course project. It returns a tidy DataFrame suitable for reporting, and it enforces fair comparison by using the *same* CV folds for every model. Investing five minutes now to understand its API will save hours later.

In [None]:
def compare_models(models_dict, X, y, cv, scoring='accuracy', verbose=True):
    """
    Compare multiple models using the same CV folds.
    
    Parameters:
    -----------
    models_dict : dict
        Dictionary of {name: model} pairs
    X : array-like
        Features
    y : array-like
        Target
    cv : cross-validator
        CV splitter
    scoring : str or list
        Scoring metric(s)
    verbose : bool
        Print results
    
    Returns:
    --------
    pd.DataFrame : Comparison table
    """
    results = []
    
    for name, model in models_dict.items():
        cv_results = cross_validate(
            model, X, y, cv=cv, scoring=scoring,
            return_train_score=True, n_jobs=-1
        )
        
        # Handle single or multiple metrics
        if isinstance(scoring, str):
            val_scores = cv_results['test_score']
            train_scores = cv_results['train_score']
            
            results.append({
                'Model': name,
                'Val_Mean': val_scores.mean(),
                'Val_Std': val_scores.std(),
                'Train_Mean': train_scores.mean(),
                'Overfit_Gap': train_scores.mean() - val_scores.mean(),
                'Fit_Time_Mean': cv_results['fit_time'].mean()
            })
        else:
            # Multiple metrics case
            row = {'Model': name}
            for metric in scoring:
                val_scores = cv_results[f'test_{metric}']
                row[f'{metric}_mean'] = val_scores.mean()
                row[f'{metric}_std'] = val_scores.std()
            row['Fit_Time'] = cv_results['fit_time'].mean()
            results.append(row)
    
    df = pd.DataFrame(results)
    
    if verbose:
        print("=== MODEL COMPARISON ===")
        print(df.to_string(index=False))
        print(f"\nCV: {cv}")
        print(f"Scoring: {scoring}")
    
    return df

# Test it
test_models = {
    'Logistic': Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))]),
    'Random Forest': Pipeline([('scaler', StandardScaler()), ('clf', RandomForestClassifier(n_estimators=50, random_state=RANDOM_SEED))])
}

comparison = compare_models(
    test_models, X_clf, y_clf,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED),
    scoring=['roc_auc', 'accuracy'],
    verbose=True
)

**Reading the output:**

The `compare_models` function produces a comparison table with two models, **Logistic** and **Random Forest**, scored on both `roc_auc` and `accuracy`. For each metric the table has a `_mean` and a `_std` column computed across the 5 stratified folds, plus a single `Fit_Time` column with the mean fit time per fold.

This function is *project-ready*: you can pass in any dictionary of named pipelines, any CV splitter, and any list of scoring metrics. It returns a pandas DataFrame, so you can sort, filter, export to CSV, or feed it into a visualization.

Notice the footer prints the CV object and scoring parameter, so you always know exactly how the comparison was conducted. This metadata is critical for reproducibility -- if a colleague asks "how did you compare these models?", the answer is right in the output.

**Why this matters:** Automating model comparison into a single function eliminates copy-paste errors and ensures consistency. In your course project, replace the `test_models` dictionary with your own candidate pipelines and this function will handle the rest.
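
Since the result is an ordinary DataFrame, a selection rule such as "prefer the simpler model when it is within one standard deviation of the best" can be applied in a few lines. The sketch below uses a hand-built table with the same column layout as the multi-metric output. The numbers are hypothetical, and using fit time as a stand-in for simplicity is an assumption made for the example.

```python
import pandas as pd

# Hypothetical values; the layout mirrors the multi-metric comparison table
comparison = pd.DataFrame({
    "Model": ["Logistic", "Random Forest"],
    "roc_auc_mean": [0.9930, 0.9950],
    "roc_auc_std": [0.0050, 0.0040],
    "Fit_Time": [0.01, 0.15],
})

# Rank by the primary metric
ranked = comparison.sort_values("roc_auc_mean", ascending=False).reset_index(drop=True)
best = ranked.iloc[0]

# Flag every model whose mean is within one std (of the best model) of the best mean
threshold = best["roc_auc_mean"] - best["roc_auc_std"]
ranked["within_1std_of_best"] = ranked["roc_auc_mean"] >= threshold

# Among the flagged candidates, pick the fastest one (proxy for "simplest")
candidates = ranked[ranked["within_1std_of_best"]]
chosen = candidates.sort_values("Fit_Time").iloc[0]["Model"]

print(ranked.to_string(index=False))
print(f"\nHighest mean ROC-AUC: {best['Model']}")
print(f"Chosen under the one-std rule: {chosen}")
```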

---

## 7. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **CV Reduces Variance**: More stable estimates than single train/val split
2. **Stratified CV**: Essential for classification (maintains class balance)
3. **Fair Comparison**: Same folds, same metrics, same data
4. **Report Uncertainty**: Always show mean ± std, not just mean
5. **Reusable Functions**: Build tools you can use in your project

### Critical Rules:

> **"Never compare models with different CV folds"**

> **"Always use stratified CV for classification"**

> **"Report mean AND standard deviation"**

### Next Steps:

- Next notebook: Hyperparameter tuning with GridSearchCV
- We'll combine CV with systematic parameter search
- Start building your project baseline model

---

## Participation Assignment Submission Instructions

### To Submit This Notebook:

1. **Complete all exercises**: Fill in both PAUSE-AND-DO exercise cells with your findings
2. **Run All Cells**: Execute `Runtime → Run all` to ensure everything works
3. **Save a Copy**: `File → Save a copy in Drive`
4. **Submit**: Upload your `.ipynb` file to the participation assignment on the course Brightspace page.

### Before Submitting, Check:

- [ ] All cells execute without errors
- [ ] All outputs are visible
- [ ] Both exercise responses are complete
- [ ] Notebook is shared with correct permissions
- [ ] You can explain every line of code you wrote

### Next Step:

Complete the **Quiz** in Brightspace (auto-graded)

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Model Assessment and Selection (k-fold CV, resampling concepts)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Resampling theory and selection bias
- scikit-learn User Guide: [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)
- Kohavi, R. (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection." *IJCAI*.

---



<center>

Thank you!

</center>