<a href="https://colab.research.google.com/github/asifahsaan/data-preprocessing-beginners/blob/main/02_model_selection_with_crossval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 02 — Model Selection with Cross-Validation
In this notebook, we’ll learn how to:
- Use `cross_val_score` for reliable model evaluation
- Compare different classifiers using k-fold cross-validation
- Read the mean and spread of fold scores as a first look at bias-variance trade-offs

## 1. Load Dataset

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load sample classification dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

## 2. Compare Models with Cross-Validation

In [None]:
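from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Logistic regression and SVM are sensitive to feature scale, so each is wrapped
# in a pipeline that standardizes the features inside every training fold.
models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
    "RandomForest": RandomForestClassifier(random_state=42),  # fixed seed for repeatable scores
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# cv=5 runs 5-fold cross-validation; classifiers are scored by accuracy by default
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: Mean Accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")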
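To make the `cv=5` argument less of a black box, here is a minimal sketch of the loop that `cross_val_score` runs for a classifier. It assumes the default behaviour for an integer `cv` with a classifier, which is `StratifiedKFold` without shuffling. The names `X_demo`, `y_demo`, `demo_model`, and `manual_scores` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and model (hypothetical names, not the notebook's X, y, models)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds, no shuffling: the assumed equivalent of cv=5 for a classifier
splitter = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in splitter.split(X_demo, y_demo):
    # Fit a fresh, unfitted copy on the training folds only
    fold_model = clone(demo_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score on the held-out fold (accuracy for classifiers)
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5)

print("Manual loop:     ", np.round(manual_scores, 3))
print("cross_val_score: ", np.round(library_scores, 3))
```

Each fold serves as the test set exactly once, and the model never sees that fold during fitting. Because the scaler sits inside the pipeline, it is also refit on the training folds only.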

## 3. Visualize Cross-Validation Scores

In [None]:
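import matplotlib.pyplot as plt

# Collect the five per-fold accuracies for each model
cv_results = {}
for name, model in models.items():
    cv_results[name] = cross_val_score(model, X, y, cv=5)

# One box per model, in dictionary order; tick labels are set separately
# so the call does not depend on the Matplotlib version's boxplot keywords
fig, ax = plt.subplots()
ax.boxplot(list(cv_results.values()))
ax.set_xticklabels(list(cv_results.keys()))
ax.set_ylabel("Accuracy")
ax.set_title("Cross-Validation Accuracy by Model")
ax.grid(True)
plt.show()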
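The box plot shows only validation accuracy. One rough way to tie the scores to bias and variance is to compare training and validation accuracy per model with `cross_validate(..., return_train_score=True)`. The sketch below uses hypothetical names (`X_demo`, `y_demo`, `demo_models`, `gaps`) and a smaller forest to keep it quick. Treat the gap as a heuristic rather than a formal decomposition.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and models (hypothetical names, not the notebook's own variables)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
demo_models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

gaps = {}
for name, model in demo_models.items():
    # return_train_score=True adds accuracy on the training folds to the result dict
    result = cross_validate(model, X_demo, y_demo, cv=5, return_train_score=True)
    train_mean = result["train_score"].mean()
    valid_mean = result["test_score"].mean()
    fold_std = result["test_score"].std()
    gaps[name] = train_mean - valid_mean
    print(f"{name}: train = {train_mean:.3f}, validation = {valid_mean:.3f}, "
          f"gap = {gaps[name]:.3f}, fold std = {fold_std:.3f}")
```

As a rule of thumb, low accuracy on both training and validation folds suggests high bias (underfitting). A large train-validation gap or a wide spread across folds suggests high variance (overfitting or sensitivity to the split).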

## Summary
- Used `cross_val_score` with 5-fold cross-validation
- Compared Logistic Regression, Random Forest, and SVM
- Visualized score distributions to evaluate performance consistency

## What’s Next?
In the next notebook:
**`03_classification_metrics.ipynb`** — we’ll explore accuracy, precision, recall, and F1-score for deeper model evaluation.