**Programmer:** python_scripts (Abhijith Warrier)

**PYTHON SCRIPT TO _PLOT LEARNING CURVES TO DETECT UNDERFITTING & OVERFITTING_. 🐍📈🤖**

This script demonstrates how **learning curves** help you visualize whether a model is **underfitting, overfitting, or just right**.
By tracking training and validation scores as the training set grows, you can judge how well the model generalizes.

### 📦 Import Required Libraries

We’ll use `learning_curve` with a stratified CV splitter and a simple pipeline.

In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

### 🧩 Load Dataset

We’ll use the Iris dataset for simplicity.

In [2]:
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
print("X shape:", X.shape, "| y classes:", np.unique(y))

X shape: (150, 4) | y classes: [0 1 2]


### ⚙️ Build Model Pipeline

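Logistic regression (like SVMs) benefits from feature scaling, so the scaler goes inside the pipeline, where it is refit on the training portion of every split. `max_iter` is raised to 1000 so the solver has room to converge.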

In [3]:
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    # multi_class is left at its default; the parameter is deprecated since scikit-learn 1.5
    ("logreg", LogisticRegression(max_iter=1000))
])

### 🧪 Define Stratified CV & Train Sizes

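Use **StratifiedKFold** with shuffling to preserve class balance in each fold.
Start `train_sizes` at **0.3** so that even the smallest training subset (about 36 of the 120 training samples per fold) is very likely to contain all three classes.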

In [4]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_sizes = np.linspace(0.3, 1.0, 8)  # 30% → 100%, 8 points
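
As a quick sanity check on the splitter, the sketch below counts the classes in every fold and estimates how many samples each fraction corresponds to. The names `fold_counts`, `n_train` and `approx_sizes` are introduced here for illustration, and the sizes are a rounded approximation rather than the exact values `learning_curve` reports.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Class counts in the training and validation part of every fold
fold_counts = [
    (np.bincount(y[train_idx]), np.bincount(y[val_idx]))
    for train_idx, val_idx in cv.split(X, y)
]
for i, (train_counts, val_counts) in enumerate(fold_counts, start=1):
    print(f"Fold {i}: train={train_counts}, validation={val_counts}")

# Approximate absolute subset size for each fraction of one training fold
n_train = len(y) - len(y) // cv.get_n_splits()
approx_sizes = np.round(np.linspace(0.3, 1.0, 8) * n_train).astype(int)
print("Approximate training subset sizes:", approx_sizes)
```

With 50 samples per class and 5 splits, every validation fold should hold 10 samples of each class, which is what stratification is meant to guarantee.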

### 📊 Compute Learning Curves

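Collect training and validation (CV) accuracy at increasing sample sizes.
`shuffle=True` matters here: Iris is ordered by class, and `learning_curve` takes each subset from the start of the training indices, so without shuffling the smallest subsets could contain a single class.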

In [5]:
sizes, train_scores, test_scores = learning_curve(
    estimator=model,
    X=X,
    y=y,
    cv=cv,
    train_sizes=train_sizes,
    scoring="accuracy",
    n_jobs=-1,
    shuffle=True,
    random_state=42
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)
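
To make the shape of the returned arrays concrete, here is a minimal, self-contained sketch that repeats the setup and prints one row per training size. The `gap` variable is an addition for illustration (mean training accuracy minus mean CV accuracy); it is not part of the original script.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

sizes, train_scores, test_scores = learning_curve(
    model, X, y,
    cv=cv,
    train_sizes=np.linspace(0.3, 1.0, 8),
    scoring="accuracy",
    shuffle=True,
    random_state=42,
)

# One row per training size, one column per CV fold
print("train_scores shape:", train_scores.shape)

# Averaging over axis=1 collapses the folds, leaving one value per size
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
gap = train_mean - test_mean
for n, tr, te, g in zip(sizes, train_mean, test_mean, gap):
    print(f"n={n:3d}  train={tr:.3f}  cv={te:.3f}  gap={g:+.3f}")
```

Reading the numbers before plotting can help confirm that the curves are built from the values you expect.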



### 📈 Plot Learning Curves

Plot mean training and cross-validation accuracy, with shaded bands showing ±1 standard deviation across folds.

In [6]:
plt.figure(figsize=(8, 6))
plt.plot(sizes, train_mean, "o-", label="Training score")
plt.plot(sizes, test_mean, "o-", label="Cross-validation score")
plt.fill_between(sizes, train_mean - train_std, train_mean + train_std, alpha=0.15)
plt.fill_between(sizes, test_mean - test_std, test_mean + test_std, alpha=0.15)
plt.title("Learning Curves — Logistic Regression (Iris)")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.show()

### 🧠 How to Read This

- **Underfitting**: both curves plateau at a **low** accuracy with little gap between them → model too simple (high bias); add features or complexity, since more data alone rarely helps.
- **Overfitting**: **large gap** (train ≫ CV) that persists even at large sizes → high variance; reduce complexity, regularize, or add data.
- **Good fit**: both curves converge at a **high** accuracy with a small gap → balanced bias/variance.
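
The three cases above can be turned into a rough rule of thumb. The sketch below is only an illustration: `diagnose_fit` is a hypothetical helper, the thresholds `min_score=0.90` and `max_gap=0.05` are arbitrary assumptions that depend on the problem, and the example curves are made-up numbers, not results from this notebook.

```python
def diagnose_fit(train_mean, cv_mean, min_score=0.90, max_gap=0.05):
    """Hypothetical heuristic that looks only at the last point of each curve.

    min_score and max_gap are assumed thresholds; tune them per problem.
    """
    final_train = float(train_mean[-1])
    final_cv = float(cv_mean[-1])
    gap = final_train - final_cv

    if gap > max_gap:
        # Training score stays well above the CV score
        return "overfitting"
    if final_cv < min_score:
        # Curves have met, but at a low score
        return "underfitting"
    return "good fit"


# Made-up curves, for illustration only
low_and_close = diagnose_fit([0.70, 0.68, 0.67], [0.60, 0.64, 0.66])
wide_gap = diagnose_fit([1.00, 1.00, 0.99], [0.70, 0.78, 0.82])
high_and_close = diagnose_fit([0.99, 0.98, 0.97], [0.93, 0.95, 0.96])
print(low_and_close, wide_gap, high_and_close, sep=" | ")
```

A single end-point check like this ignores the shape and the spread of the curves, so treat it as a complement to the plot rather than a replacement.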

### 📝 Tips

- If the curves still look noisy at small sizes, raise the lower bound (e.g. `np.linspace(0.4, 1.0, 7)`) or use `n_splits=3` so each validation fold is larger.
- Try different estimators (e.g., `SVC`, `DecisionTreeClassifier`) to compare bias/variance behavior.
- Use the same pattern on your own datasets to diagnose learning dynamics quickly.
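
One way to follow the estimator tip is to loop over a few candidates and compare their scores at the full training size. This is a minimal sketch with default hyperparameters; the `candidates` and `results` dictionaries are names chosen here, not part of the original script.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scale-sensitive models get a scaler; the tree does not need one
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, estimator in candidates.items():
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y,
        cv=cv,
        train_sizes=np.linspace(0.3, 1.0, 8),
        scoring="accuracy",
        shuffle=True,
        random_state=42,
    )
    # Keep the mean scores at the largest training size
    final_train = train_scores.mean(axis=1)[-1]
    final_cv = test_scores.mean(axis=1)[-1]
    results[name] = (final_train, final_cv)
    print(f"{name:20s} train={final_train:.3f}  cv={final_cv:.3f}  gap={final_train - final_cv:+.3f}")
```

An unpruned decision tree typically fits its training data almost perfectly, so its gap is a useful reference point for what higher variance looks like.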