
# Phase 2 — Supervised Learning (StudentsPerformance.csv)

This notebook fulfills the **Phase 2** requirements:
- Choose ≥2 supervised models with justification  
- Implement training (with clean preprocessing)  
- Evaluate and compare (Accuracy, Precision, Recall, F1, plus optional cross-validation)  
- Interpret results and explain which model performed best and why  

> Place this file in your repo under: `/Supervised_Learning/Phase2_Supervised_Learning.ipynb`


In [None]:

# 0) Imports & basic setup
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

pd.set_option('display.max_columns', None)


## 1) Load data

In [None]:

# Make sure StudentsPerformance.csv is in the same folder as this notebook
df = pd.read_csv("StudentsPerformance.csv")
df.head()


## 2) Create target label (Pass/Fail)

In [None]:

score_cols = ["math score", "reading score", "writing score"]
assert set(score_cols).issubset(df.columns), "Score columns not found in CSV."

df["average"] = df[score_cols].mean(axis=1)
df["performance"] = (df["average"] >= 60).astype(int)  # 1=Pass, 0=Fail

df[score_cols + ["average","performance"]].describe()


## 3) Features/Target split

In [None]:

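y = df["performance"].copy()
# The label is derived from the three scores, so keeping them (or their average)
# as features would leak the target. Drop them along with the label itself.
X = df.drop(columns=["performance", "average"] + score_cols)

# Split the remaining columns by dtype: numeric vs. everything else
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()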

num_cols, cat_cols


## 4) Preprocessing (ColumnTransformer + Pipelines)

In [None]:

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
# Note: sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols)
    ],
    remainder="drop"
)
preprocessor


## 5) Train/Test split

In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape, y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)


## 6) Models: SVM (RBF) + Decision Tree

In [None]:

svm_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("clf", SVC(kernel="rbf", probability=False, random_state=42))
])

dt_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("clf", DecisionTreeClassifier(max_depth=None, random_state=42))
])

models = {
    "SVM (RBF)": svm_clf,
    "DecisionTree": dt_clf
}
list(models.keys())


## 7) Train & Evaluate

In [None]:

results = []

for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    print(f"\n=== {name} ===")
    print("Accuracy :", round(acc, 4))
    print("Precision:", round(prec, 4))
    print("Recall   :", round(rec, 4))
    print("F1-score :", round(f1, 4))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, digits=4))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    results.append((name, acc, prec, rec, f1))

comp_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1"]).sort_values("F1", ascending=False)
comp_df
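
The four metrics above all follow from the confusion matrix. As a minimal sketch, using made-up labels rather than the notebook's data (`y_true_demo` and `y_pred_demo` are hypothetical), precision, recall, and F1 can be recomputed by hand and checked against scikit-learn's helpers:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = Pass, 0 = Fail); not taken from the dataset
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_demo = tp / (tp + fp)   # of the predicted passes, how many were real
recall_demo = tp / (tp + fn)      # of the real passes, how many were found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

# The hand-computed values should agree with scikit-learn's helpers
assert np.isclose(precision_demo, precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall_demo, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(f1_demo, f1_score(y_true_demo, y_pred_demo))

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision_demo:.4f} recall={recall_demo:.4f} f1={f1_demo:.4f}")
```

Reading the matrix this way makes it easier to say which kind of error (false passes or missed passes) drives a model's score.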


## 8) Optional: 5-Fold Cross-Validation (F1)

In [None]:

cv_summary = {}
for name, pipe in models.items():
    # Preprocessing lives inside the pipeline, so it is refit on each training fold
    cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    cv_summary[name] = {"mean": cv_scores.mean(), "std": cv_scores.std()}
pd.DataFrame(cv_summary).T
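
If more than one metric is wanted per fold, `cross_validate` accepts a list of scorers. The sketch below is one possible variant, not part of the required cells: it runs on synthetic data from `make_classification` (an assumption, so that it works without the CSV) and uses a shuffled `StratifiedKFold`. The names `X_demo`, `y_demo`, and `demo_pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and labels (assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=42
)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", SVC(kernel="rbf")),
])

# Shuffled, stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]
cv_out = cross_validate(demo_pipe, X_demo, y_demo, cv=cv, scoring=metrics)

for metric in metrics:
    scores = cv_out[f"test_{metric}"]
    print(f"{metric:<9}: mean={scores.mean():.4f} std={scores.std():.4f}")
```

To apply the same idea to the notebook, the pipelines in `models` and the real `X`, `y` could be passed in place of the demo objects.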



## 9) Results Interpretation (write here)
- **Which model performed best?** Explain using F1 and other metrics.
- **Why might it be better?** (e.g., non-linear boundaries in SVM, over/underfitting in DT).
- **What features likely matter?** (Consider `test preparation course`, `lunch`, etc.)
- **Any limitations?** (dataset size, class balance, need for hyperparameter tuning).
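
One way to approach the feature question is permutation importance: shuffle one column at a time on held-out data and measure how much the score drops. The following is a rough sketch on a toy frame, not the real dataset. The column values, the `noise` column, and the label rule are all invented for illustration; only the column names `test preparation course` and `lunch` come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 300

# Toy categorical features (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "test preparation course": rng.choice(["none", "completed"], size=n),
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "noise": rng.choice(["a", "b", "c"], size=n),
})

# Invented label rule: depends on the first two columns, never on "noise"
signal = (toy["test preparation course"] == "completed").astype(int) \
    + (toy["lunch"] == "standard").astype(int)
y_toy = (signal + rng.normal(0, 0.5, size=n) >= 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

pipe = Pipeline(steps=[
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), list(toy.columns)),
    ])),
    ("clf", DecisionTreeClassifier(max_depth=3, random_state=42)),
])
pipe.fit(X_tr, y_tr)

# Permute raw columns (before encoding) so importances map to original features
result = permutation_importance(
    pipe, X_te, y_te, n_repeats=10, random_state=42, scoring="f1"
)
importances = pd.Series(result.importances_mean, index=X_te.columns)
print(importances.sort_values(ascending=False))
```

Because the whole pipeline is passed in, each importance refers to an original column rather than to an individual one-hot indicator, which tends to be easier to discuss in the write-up.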

> If you have time, add a grid search cell to tune SVM (C, gamma) and Decision Tree (max_depth, min_samples_split).
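
A grid search cell along those lines might look like the sketch below. It is only a starting point under stated assumptions: the data is synthetic (`make_classification`) so the cell runs without the CSV, and the grid values are arbitrary examples, not recommended settings. The `clf__` prefix matches the step name `"clf"` used in the pipelines above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

searches = {
    "SVM (RBF)": GridSearchCV(
        Pipeline(steps=[("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))]),
        param_grid={"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01, 0.1]},
        scoring="f1",
        cv=5,
    ),
    "DecisionTree": GridSearchCV(
        Pipeline(steps=[("clf", DecisionTreeClassifier(random_state=42))]),
        param_grid={
            "clf__max_depth": [3, 5, None],
            "clf__min_samples_split": [2, 10, 20],
        },
        scoring="f1",
        cv=5,
    ),
}

best = {}
for name, search in searches.items():
    # Tune on the training split only; the test split stays untouched
    search.fit(X_tr, y_tr)
    best[name] = (search.best_params_, search.best_score_)
    print(f"{name}: best CV F1={search.best_score_:.4f} params={search.best_params_}")
```

In the notebook itself, the existing `svm_clf` and `dt_clf` pipelines could be passed to `GridSearchCV` with the same parameter names, fitting on `X_train`, `y_train`.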



## 10) Algorithm Selection & Justification (write here)
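- **SVM (RBF):** can model non-linear decision boundaries; a reasonable choice for small-to-medium tabular datasets, provided numeric features are scaled and categorical features are one-hot encoded (the RBF kernel is distance-based, so it is sensitive to feature scale).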
- **Decision Tree:** simple and interpretable baseline; shows if simple rules can separate classes; fast to train and understand.
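
To back up the interpretability point, a fitted tree can be printed as plain if/else rules with `export_text`. The sketch below uses synthetic data and placeholder feature names (`f0` to `f3`), both of which are assumptions for illustration, and caps the depth at 2 so the output stays short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the real features (assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

# A shallow tree keeps the rule list readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

For the notebook's pipeline, the fitted tree is available as `dt_clf.named_steps["clf"]`, and the encoded column names can be read from `dt_clf.named_steps["prep"].get_feature_names_out()` after fitting.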



---

### ✅ Phase 2 Checklist
- [ ] Two supervised models with clear justification  
- [ ] Clean preprocessing via ColumnTransformer + Pipelines  
- [ ] Train/Test split (and optional cross-validation)  
- [ ] Metrics: Accuracy, Precision, Recall, F1 + Confusion Matrix  
- [ ] A short interpretation explaining which model is best and why  
- [ ] Save this notebook under `/Supervised_Learning/Phase2_Supervised_Learning.ipynb` in your GitHub repo  

