
# Iris Classification with Linear Regression + Resampling (Oversampling & SMOTE)

**Goal:** Follow the assignment:
1. Split the Iris dataset into training, validation, and test sets (80%/10%/10% and 70%/15%/15%).
2. Create a **linear regression** model that predicts categories using a One-vs-Rest (OvR) approach.
3. Make one class artificially minority by downsampling, then balance the training set with:
   - **Random oversampling** (duplicate minority samples).
   - **SMOTE** (interpolate synthetic samples) with two settings:
     - `k_neighbors=1` (each synthetic point is interpolated toward the single nearest minority neighbor)
     - `k_neighbors=5` (the imblearn default; interpolation toward one of the five nearest minority neighbors)
4. Train a classifier for each choice of minority class and compare their performance.
5. Evaluate on the *validation* and *test* sets (kept untouched) and visualize the confusion matrices.

> **Why linear regression for classification?**  
> The assignment requires it. We fit one regressor per class, with target 1 if the sample belongs to that class and 0 otherwise. At prediction time, we pick the class whose regressor outputs the highest score.
> This is an **OvR Linear Regression** classifier. It is not the most robust choice for multi-class problems (logistic regression or an SVM is usually better), but it demonstrates the pipeline and the effects of resampling clearly.
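
The one-regressor-per-class idea can also be sketched in a few lines. The snippet below is a minimal illustration, not the notebook's `OVRLinearRegression` class: it assumes that fitting a single `LinearRegression` on a 0/1 indicator matrix is an acceptable stand-in for the per-class loop, since scikit-learn fits one least-squares model per target column. The variable names (`Y`, `scores`) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Indicator targets: Y[i, k] is 1.0 if sample i belongs to class k, else 0.0
Y = (y[:, None] == classes[None, :]).astype(float)

# A 2-D target makes LinearRegression fit one least-squares model per column
reg = LinearRegression().fit(X, Y)
scores = reg.predict(X)                      # shape (n_samples, n_classes)

# Predict the class whose regressor gives the highest score
y_pred = classes[np.argmax(scores, axis=1)]

print("scores shape:", scores.shape)
print("training accuracy:", (y_pred == y).mean())
```

The scores are regression outputs, not probabilities: they can fall below 0 or above 1, so only their ranking is used.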


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from imblearn.over_sampling import RandomOverSampler, SMOTE



### Helper functions

We encapsulate utilities to keep the notebook clean:
- `split_train_val_test`: stratified splitting into train/val/test sets using a two-stage split.
- `OVRLinearRegression`: one-vs-rest linear regression wrapper.
- `evaluate`: accuracy, a per-class report (from which we read macro-F1), and the confusion matrix.
- `class_counts` & `plot_confusion_matrix`: basic visualizations without seaborn.
- `downsample_one_class`: create intentional class imbalance for experiments.
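
The two-stage split can be sketched as follows. This is a hedged variant rather than the helper defined in this notebook: it assumes integer holdout counts instead of fractions, and it uses made-up features with the same shape and class balance as Iris. The names `n_holdout` and `n_test` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for Iris: 150 samples, 3 balanced classes, made-up features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = np.repeat([0, 1, 2], 50)

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1
n = len(y)
n_holdout = round((val_ratio + test_ratio) * n)  # samples held out of training
n_test = round(test_ratio * n)                   # samples reserved for the test set

# Stage 1: carve the holdout (validation + test) off the full data
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=n_holdout, stratify=y, random_state=42
)
# Stage 2: divide the holdout into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=n_test, stratify=y_tmp, random_state=42
)

print(len(y_train), len(y_val), len(y_test))
print(np.bincount(y_train), np.bincount(y_val), np.bincount(y_test))
```

Passing integer counts sidesteps floating-point rounding of the split sizes, and `stratify` keeps the class proportions the same in every subset.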


In [None]:

def split_train_val_test(X, y, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, random_state=42):
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-6, "Ratios must sum to 1."
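    # Size the holdout from val_ratio + test_ratio; 1 - train_ratio picks up float
    # error (1 - 0.7 == 0.30000000000000004), which can shift the split by a sample.
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=(val_ratio + test_ratio), stratify=y, random_state=random_state
    )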
    val_size = val_ratio / (val_ratio + test_ratio)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=(1 - val_size), stratify=y_temp, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

class OVRLinearRegression:
    def __init__(self):
        self.models_ = {}
        self.classes_ = None

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        for k in self.classes_:
            y_binary = (y == k).astype(float)
            model = LinearRegression()
            model.fit(X, y_binary)
            self.models_[k] = model
        return self

    def predict_scores(self, X):
        scores = []
        for k in self.classes_:
            s = self.models_[k].predict(X)
            scores.append(s.reshape(-1, 1))
        return np.hstack(scores)

    def predict(self, X):
        scores = self.predict_scores(X)
        idx = np.argmax(scores, axis=1)
        return self.classes_[idx]

def evaluate(y_true, y_pred, class_names):
    acc = accuracy_score(y_true, y_pred)
    report = classification_report(y_true, y_pred, target_names=class_names, output_dict=True, zero_division=0)
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(class_names))))
    return acc, report, cm

def plot_confusion_matrix(cm, class_names, title):
    plt.figure()
    plt.imshow(cm, interpolation="nearest")
    plt.title(title)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names, rotation=45)
    plt.yticks(tick_marks, class_names)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], "d"), ha="center", va="center")
    plt.tight_layout()
    plt.show()

def class_counts(y, class_names, title):
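    # Reindex so a class with zero samples still gets a bar (and lengths match)
    counts = pd.Series(y).value_counts().reindex(range(len(class_names)), fill_value=0)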
    plt.figure()
    plt.bar(range(len(class_names)), counts.values)
    plt.title(title)
    plt.xlabel("Class")
    plt.ylabel("Count")
    plt.xticks(range(len(class_names)), class_names, rotation=45)
    plt.tight_layout()
    plt.show()
    return counts

def downsample_one_class(X, y, target_class, keep_n=10, random_state=42):
    rng = np.random.default_rng(random_state)
    mask_target = (y == target_class)
    idx_all = np.arange(len(y))
    idx_target = idx_all[mask_target]
    idx_other = idx_all[~mask_target]
    if keep_n >= len(idx_target):
        chosen_target = idx_target
    else:
        chosen_target = rng.choice(idx_target, size=keep_n, replace=False)
    new_idx = np.concatenate([chosen_target, idx_other])
    rng.shuffle(new_idx)
    return X[new_idx], y[new_idx]



### Pipeline: load → split → scale → baseline → imbalance + resampling → compare

We will run the full pipeline for both 80/10/10 and 70/15/15 splits.  
For each split:
1. Train a **baseline** OvR Linear Regression (no resampling).
2. For each class (setosa, versicolor, virginica):
   - Make it minority by downsampling to 10 samples in the training set.
   - Balance using:
     - **RandomOverSampler** (duplicates).
     - **SMOTE** with `k_neighbors=1` (pair-based) and `k_neighbors=5` (nearest-neighbor).
3. Evaluate on validation and test (unmodified) and record metrics.
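
The difference between the two balancing strategies can be sketched in plain NumPy. This is an illustration under stated assumptions, not imblearn's implementation: the data are made up, and `smote_like` is a hypothetical helper that mimics the core interpolation step only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up 2-D data: 40 majority samples and 10 minority samples
X_major = rng.normal(loc=0.0, size=(40, 2))
X_minor = rng.normal(loc=3.0, size=(10, 2))
n_needed = len(X_major) - len(X_minor)       # extra minority rows needed to balance

# Random oversampling: draw existing minority rows with replacement (exact duplicates)
dup_idx = rng.integers(0, len(X_minor), size=n_needed)
X_dup = X_minor[dup_idx]

def smote_like(X_min, n_new, k, rng):
    """Hypothetical helper: interpolate toward one of the k nearest minority neighbours."""
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))             # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

X_syn_k1 = smote_like(X_minor, n_needed, k=1, rng=rng)
X_syn_k5 = smote_like(X_minor, n_needed, k=5, rng=rng)

print("distinct rows after duplication:", len(np.unique(X_dup, axis=0)))
print("synthetic rows (k=1, k=5):", X_syn_k1.shape, X_syn_k5.shape)
```

Duplication can never produce more distinct points than the original minority set, whereas interpolation fills in the segments between neighbouring minority samples. With `k=1` each sample has only one possible partner; with `k=5` it has five, so the synthetic points spread over more segments.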


In [None]:


def run_pipeline_for_split(train_ratio, val_ratio, test_ratio, random_state=42):
    iris = load_iris(as_frame=True)
    df = iris.frame.copy()
    X = df[iris.feature_names].values
    y = df["target"].values
    class_names = list(iris.target_names)

    X_train, X_val, X_test, y_train, y_val, y_test = split_train_val_test(
        X, y, train_ratio, val_ratio, test_ratio, random_state=random_state
    )

    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_val_s = scaler.transform(X_val)
    X_test_s = scaler.transform(X_test)

    rows = []

    # Baseline
    base = OVRLinearRegression().fit(X_train_s, y_train)
    y_val_pred = base.predict(X_val_s)
    y_test_pred = base.predict(X_test_s)
    val_acc, val_report, val_cm = evaluate(y_val, y_val_pred, class_names)
    test_acc, test_report, test_cm = evaluate(y_test, y_test_pred, class_names)

    rows.append({
        "split": f"{int(train_ratio*100)}/{int(val_ratio*100)}/{int(test_ratio*100)}",
        "scenario": "Baseline (no resampling)",
        "minority_class": "N/A",
        "val_accuracy": val_acc,
        "test_accuracy": test_acc,
        "val_macro_f1": val_report["macro avg"]["f1-score"],
        "test_macro_f1": test_report["macro avg"]["f1-score"]
    })

    plot_confusion_matrix(test_cm, class_names, f"Confusion Matrix - Baseline (Test) [{int(train_ratio*100)}/{int(val_ratio*100)}/{int(test_ratio*100)}]")

    # RandomOverSampler experiments
    for cls_idx, cls_name in enumerate(class_names):
        X_train_imb, y_train_imb = downsample_one_class(X_train, y_train, target_class=cls_idx, keep_n=10, random_state=random_state)
        X_train_imb_s = scaler.transform(X_train_imb)

        ros = RandomOverSampler(random_state=random_state)
        X_ros, y_ros = ros.fit_resample(X_train_imb_s, y_train_imb)

        clf = OVRLinearRegression().fit(X_ros, y_ros)
        y_val_pred = clf.predict(X_val_s)
        y_test_pred = clf.predict(X_test_s)
        val_acc, val_report, val_cm = evaluate(y_val, y_val_pred, class_names)
        test_acc, test_report, test_cm = evaluate(y_test, y_test_pred, class_names)

        rows.append({
            "split": f"{int(train_ratio*100)}/{int(val_ratio*100)}/{int(test_ratio*100)}",
            "scenario": "RandomOverSampler",
            "minority_class": cls_name,
            "val_accuracy": val_acc,
            "test_accuracy": test_acc,
            "val_macro_f1": val_report["macro avg"]["f1-score"],
            "test_macro_f1": test_report["macro avg"]["f1-score"]
        })

    # SMOTE experiments
    for k_neighbors, label in [(1, "SMOTE (k_neighbors=1)"), (5, "SMOTE (k_neighbors=5)")]:
        for cls_idx, cls_name in enumerate(class_names):
            X_train_imb, y_train_imb = downsample_one_class(X_train, y_train, target_class=cls_idx, keep_n=10, random_state=random_state)
            X_train_imb_s = scaler.transform(X_train_imb)

            smote = SMOTE(random_state=random_state, k_neighbors=k_neighbors)
            X_sm, y_sm = smote.fit_resample(X_train_imb_s, y_train_imb)

            clf = OVRLinearRegression().fit(X_sm, y_sm)
            y_val_pred = clf.predict(X_val_s)
            y_test_pred = clf.predict(X_test_s)
            val_acc, val_report, val_cm = evaluate(y_val, y_val_pred, class_names)
            test_acc, test_report, test_cm = evaluate(y_test, y_test_pred, class_names)

            rows.append({
                "split": f"{int(train_ratio*100)}/{int(val_ratio*100)}/{int(test_ratio*100)}",
                "scenario": label,
                "minority_class": cls_name,
                "val_accuracy": val_acc,
                "test_accuracy": test_acc,
                "val_macro_f1": val_report["macro avg"]["f1-score"],
                "test_macro_f1": test_report["macro avg"]["f1-score"]
            })

    return pd.DataFrame(rows)

results_80 = run_pipeline_for_split(0.8, 0.1, 0.1, random_state=42)
results_70 = run_pipeline_for_split(0.7, 0.15, 0.15, random_state=42)
all_results = pd.concat([results_80, results_70], ignore_index=True).sort_values(by=["split", "scenario", "minority_class"]).reset_index(drop=True)
all_results



### How to read the results

- **Baseline**: Model trained on the original (balanced) Iris training set.  
- **RandomOverSampler**: We artificially made one class the minority by downsampling it to 10 samples in the **training** set, then oversampled (duplicated) it until the classes were balanced.  
- **SMOTE**: Same imbalance as above, but we generate synthetic in-between samples instead of duplicates.
  - `k_neighbors=1` interpolates between each minority sample and its single nearest minority neighbor; this is the closest match to the instruction "take any two samples and interpolate".
  - `k_neighbors=5` picks the partner from the five nearest minority neighbors, generating more diverse synthetic samples.

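- The **validation/test** sets are **never resampled**; because the splits are stratified, they stay (approximately) balanced and reflect generalization.
- Use **accuracy** and **macro-F1** to compare; macro-F1 averages the per-class F1 scores with equal weight, so a class the model handles poorly pulls it down regardless of that class's size.
- Check the **confusion matrices** to see which classes are confused. Only the baseline test matrices are plotted above; for a resampled scenario, pass the `test_cm` returned by `evaluate` to `plot_confusion_matrix`.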
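
To see why macro-F1 is worth reporting next to accuracy, consider a small worked example. The labels below are made up for illustration (they are not results from this notebook), and the evaluation set is deliberately imbalanced, unlike the stratified sets used above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 45 / 45 / 10 samples, and a model that never predicts class 2
y_true = np.array([0] * 45 + [1] * 45 + [2] * 10)
y_pred = np.array([0] * 45 + [1] * 45 + [1] * 10)

acc = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}")        # 0.900
print("per-class F1 =", per_class_f1) # 1.0, 0.9 and 0.0 for classes 0, 1, 2
print(f"macro-F1 = {macro_f1:.3f}")   # 0.633
```

Accuracy stays at 0.90 even though class 2 is missed entirely, while macro-F1 drops to about 0.63 because the zero F1 for class 2 counts as much as the other two classes.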
