<a href="https://colab.research.google.com/github/engAziz04/SWE485-Project-Group2/blob/main/phase2_supervised_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Phase 2 — Supervised Learning (StudentsPerformance.csv)

This notebook fulfills the **Phase 2** requirements:
- Choose ≥2 supervised models with justification  
- Implement training (with clean preprocessing)  
- Evaluate & compare (Accuracy, Precision, Recall, F1, + optional CV)  
- Interpret results and explain which model performed best and why  

> Place this file in your repo under: `/Supervised_Learning/Phase2_Supervised_Learning.ipynb`


In [3]:

# 0) Imports & basic setup
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

pd.set_option('display.max_columns', None)


## 1) Load data

In [4]:

# Make sure StudentsPerformance.csv is in the same folder as this notebook
df = pd.read_csv("StudentsPerformance.csv")
df.head()


,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


## 2) Create target label (Pass/Fail)

In [5]:

score_cols = ["math score", "reading score", "writing score"]
assert set(score_cols).issubset(df.columns), "Score columns not found in CSV."

df["average"] = df[score_cols].mean(axis=1)
df["performance"] = (df["average"] >= 60).astype(int)  # 1=Pass, 0=Fail

df[score_cols + ["average","performance"]].describe()


,math score,reading score,writing score,average,performance
count,1000.0,1000.0,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054,67.770667,0.715
std,15.16308,14.600192,15.195657,14.257326,0.45164
min,0.0,17.0,10.0,9.0,0.0
25%,57.0,59.0,57.75,58.333333,0.0
50%,66.0,70.0,69.0,68.333333,1.0
75%,77.0,79.0,79.0,77.666667,1.0
max,100.0,100.0,100.0,100.0,1.0


## 3) Features/Target split

In [6]:

y = df["performance"].copy()
X = df.drop(columns=["performance", "average"])

# Numeric columns are scaled later; object (string) columns are one-hot encoded.
num_cols = [c for c in X.columns if X[c].dtype != "O"]
cat_cols = [c for c in X.columns if X[c].dtype == "O"]

num_cols, cat_cols


(['math score', 'reading score', 'writing score'],
 ['gender',
  'race/ethnicity',
  'parental level of education',
  'lunch',
  'test preparation course'])

## 4) Preprocessing (ColumnTransformer + Pipelines)

In [7]:

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols)
    ],
    remainder="drop"
)
preprocessor


## 5) Train/Test split

In [8]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape, y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)


((800, 8),
 (200, 8),
 performance
 1    0.715
 0    0.285
 Name: proportion, dtype: float64,
 performance
 1    0.715
 0    0.285
 Name: proportion, dtype: float64)

## 6) Models: SVM (RBF) + Decision Tree

In [9]:

svm_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("clf", SVC(kernel="rbf", probability=False, random_state=42))
])

dt_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("clf", DecisionTreeClassifier(max_depth=None, random_state=42))
])

models = {
    "SVM (RBF)": svm_clf,
    "DecisionTree": dt_clf
}
list(models.keys())


['SVM (RBF)', 'DecisionTree']

## 7) Train & Evaluate

In [10]:

results = []

for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    print(f"\n=== {name} ===")
    print("Accuracy :", round(acc, 4))
    print("Precision:", round(prec, 4))
    print("Recall   :", round(rec, 4))
    print("F1-score :", round(f1, 4))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, digits=4))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    results.append((name, acc, prec, rec, f1))

comp_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1"]).sort_values("F1", ascending=False)
comp_df



=== SVM (RBF) ===
Accuracy : 0.97
Precision: 0.9928
Recall   : 0.965
F1-score : 0.9787

Classification Report:
              precision    recall  f1-score   support

           0     0.9180    0.9825    0.9492        57
           1     0.9928    0.9650    0.9787       143

    accuracy                         0.9700       200
   macro avg     0.9554    0.9737    0.9639       200
weighted avg     0.9715    0.9700    0.9703       200

Confusion Matrix:
[[ 56   1]
 [  5 138]]

=== DecisionTree ===
Accuracy : 0.96
Precision: 0.9655
Recall   : 0.979
F1-score : 0.9722

Classification Report:
              precision    recall  f1-score   support

           0     0.9455    0.9123    0.9286        57
           1     0.9655    0.9790    0.9722       143

    accuracy                         0.9600       200
   macro avg     0.9555    0.9457    0.9504       200
weighted avg     0.9598    0.9600    0.9598       200

Confusion Matrix:
[[ 52   5]
 [  3 140]]


,Model,Accuracy,Precision,Recall,F1
0,SVM (RBF),0.97,0.992806,0.965035,0.978723
1,DecisionTree,0.96,0.965517,0.979021,0.972222


## 8) Optional: 5-Fold Cross-Validation (F1)

In [11]:

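# Each pipeline is refit inside every fold, so the scaler and encoder
# are never fitted on that fold's validation rows.
cv_summary = {}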
for name, pipe in models.items():
    cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    cv_summary[name] = {"mean": cv_scores.mean(), "std": cv_scores.std()}
pd.DataFrame(cv_summary).T


,mean,std
SVM (RBF),0.985989,0.007416
DecisionTree,0.977436,0.007451



## 9) Results Interpretation

Which model performed best?
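On the 200-sample test set, the SVM (RBF) model scored highest on accuracy (97% vs. 96%), precision (0.993 vs. 0.966), and F1 (0.9787 vs. 0.9722), while the Decision Tree had the higher recall (0.979 vs. 0.965). The gap is small: the SVM misclassified 6 samples and the Decision Tree 8.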
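
The figures quoted here can be re-derived by hand from the confusion matrices printed in section 7. The sketch below does that arithmetic in plain Python. The helper name `binary_metrics` is introduced only for illustration, and the matrices are copied from the notebook output, using scikit-learn's `[[TN, FP], [FN, TP]]` layout for labels 0 and 1.

```python
# Confusion matrices copied from the section 7 output.
# Layout: [[TN, FP], [FN, TP]] with 0 = Fail and 1 = Pass.
confusion = {
    "SVM (RBF)": [[56, 1], [5, 138]],
    "DecisionTree": [[52, 5], [3, 140]],
}


def binary_metrics(matrix):
    """Return (accuracy, precision, recall, F1) for the positive class."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1


metrics = {name: binary_metrics(m) for name, m in confusion.items()}
for name, (acc, prec, rec, f1) in metrics.items():
    print(f"{name}: acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} f1={f1:.4f}")
# SVM (RBF): acc=0.9700 prec=0.9928 rec=0.9650 f1=0.9787
# DecisionTree: acc=0.9600 prec=0.9655 rec=0.9790 f1=0.9722
```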

Why might it be better?
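The Pass/Fail label is defined as an average score of at least 60, so the true decision boundary is a plane in the three score features. The RBF SVM, trained on standardized scores, can approximate that boundary smoothly.
In contrast, the Decision Tree splits on one feature at a time, so it has to approximate the same boundary with a staircase of axis-aligned thresholds, and with `max_depth=None` it is free to overfit the training set.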
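
One way to see the difference between the two model families is to fit both on data where the label follows the same averaging rule as in section 2. The sketch below is a minimal illustration under stated assumptions: the three score columns are replaced by synthetic, independent normal draws (not the real CSV), and the categorical features are left out.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the three score columns.
rng = np.random.default_rng(42)
scores = rng.normal(loc=67, scale=15, size=(1000, 3)).clip(0, 100)
labels = (scores.mean(axis=1) >= 60).astype(int)  # same Pass/Fail rule as section 2

X_tr, X_te, y_tr, y_te = train_test_split(
    scores, labels, test_size=0.2, random_state=42, stratify=labels
)

# RBF SVM on standardized inputs vs. an unpruned tree on raw inputs.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

svm_acc = svm.score(X_te, y_te)
tree_acc = tree.score(X_te, y_te)
print(f"SVM (RBF) test accuracy    : {svm_acc:.3f}")
print(f"Decision tree test accuracy: {tree_acc:.3f} (depth {tree.get_depth()})")
```

The depth printed for the tree gives a rough sense of how many axis-aligned splits it needs to trace a boundary that is a single plane. The exact numbers depend on the synthetic data and will differ from the notebook's results.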

What features likely matter?
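The math, reading, and writing scores dominate by construction: the label is computed directly from their average, so together they determine it completely.
Categorical features such as test preparation course and lunch type can contribute only indirectly, and no feature-importance analysis was run in this notebook to measure their effect.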
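
Feature influence could be measured rather than assumed, for example with permutation importance. The sketch below is one possible approach, not something the notebook ran. It uses a small synthetic frame (an assumption) with the three score columns and a `lunch` column that is random by construction, so its importance should come out near zero here. On the real data, the fitted pipeline and `X_test`, `y_test` would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in data, not StudentsPerformance.csv.
rng = np.random.default_rng(0)
n = 600
score_cols = ["math score", "reading score", "writing score"]
demo = pd.DataFrame({
    "math score": rng.normal(66, 15, n).clip(0, 100),
    "reading score": rng.normal(69, 15, n).clip(0, 100),
    "writing score": rng.normal(68, 15, n).clip(0, 100),
    "lunch": rng.choice(["standard", "free/reduced"], n),  # unrelated to the label
})
target = (demo[score_cols].mean(axis=1) >= 60).astype(int)

prep = ColumnTransformer([
    ("num", StandardScaler(), score_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
])
model = Pipeline([("prep", prep), ("clf", SVC(kernel="rbf"))]).fit(demo, target)

# Shuffle one input column at a time and record the drop in F1.
result = permutation_importance(
    model, demo, target, scoring="f1", n_repeats=10, random_state=0
)
importance = pd.Series(result.importances_mean, index=demo.columns)
print(importance.sort_values(ascending=False))
```

Because the importance is computed on the raw input columns of the pipeline, each one-hot-encoded feature is reported once, under its original column name.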

Any limitations?
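The dataset is relatively small (1000 samples) and moderately imbalanced (71.5% Pass, 28.5% Fail); the train/test split is stratified to preserve that ratio. The three scores that define the label are also used as input features, so the task is close to deterministic and the high metrics should be read with that in mind.
Additional hyperparameter tuning (e.g., SVM: C, gamma; Decision Tree: max_depth, min_samples_split) could further improve results.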
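
The tuning mentioned here could be set up with `GridSearchCV`. The following is a minimal sketch, assuming a simplified pipeline (scaler plus SVM) and synthetic score data. The grid values are illustrative guesses, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the three score columns.
rng = np.random.default_rng(1)
X_demo = rng.normal(67, 15, size=(400, 3)).clip(0, 100)
y_demo = (X_demo.mean(axis=1) >= 60).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__gamma": ["scale", 0.01, 0.1],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", round(search.best_score_, 4))
```

In the notebook, the search would be fitted on `X_train`, `y_train` only, so the test set stays untouched for the final comparison. For the Decision Tree pipeline, the grid keys would be `clf__max_depth` and `clf__min_samples_split`.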

Cross-Validation:
5-fold cross-validation on the full dataset gave a mean F1 of 0.986 ± 0.007 for the SVM and 0.977 ± 0.007 for the Decision Tree.
The ranking matches the test-set result, and the fold-to-fold spread is nearly identical for the two models.
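
Section 8 reports F1 only. If the other metrics are wanted per fold as well, `cross_validate` accepts a list of scorers. The sketch below shows the idea on synthetic data (an assumption) with a plain Decision Tree; shuffling the folds with a fixed seed is a choice made here, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the three score columns.
rng = np.random.default_rng(2)
X_demo = rng.normal(67, 15, size=(500, 3)).clip(0, 100)
y_demo = (X_demo.mean(axis=1) >= 60).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]
cv_out = cross_validate(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=cv, scoring=scoring
)

# One row per metric; std is the population std, as with cv_scores.std() in section 8.
summary = pd.DataFrame(
    [(m, cv_out[f"test_{m}"].mean(), cv_out[f"test_{m}"].std()) for m in scoring],
    columns=["metric", "mean", "std"],
).set_index("metric")
print(summary.round(4))
```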


## 10) Algorithm Selection & Justification

SVM (RBF):
Selected as the primary model because it can capture non-linear patterns and performs robustly on small to medium-sized tabular datasets.
Because the RBF kernel is distance-based, it relies on the scaling and one-hot encoding applied in the preprocessing step.

Decision Tree:
Used as a baseline model due to its simplicity, interpretability, and fast training time.
It helps visualize how features contribute to predictions and serves as a strong reference for comparing other algorithms.
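
The interpretability argument can be made concrete by printing a tree's learned rules with `export_text`. The sketch below assumes synthetic score data and caps the depth at 3 so the printout stays short; the notebook's tree is unpruned (`max_depth=None`) and would produce a much longer listing.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: synthetic stand-in for the three score columns.
rng = np.random.default_rng(3)
feature_names = ["math score", "reading score", "writing score"]
X_demo = rng.normal(67, 15, size=(500, 3)).clip(0, 100)
y_demo = (X_demo.mean(axis=1) >= 60).astype(int)

# A shallow tree, used here only to keep the rule listing readable.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Each line is a threshold test on one feature; leaves show the predicted class.
rules = export_text(shallow_tree, feature_names=feature_names)
print(rules)
```

For the notebook's pipeline, the encoded feature names could likely be taken from `dt_clf.named_steps["prep"].get_feature_names_out()` after fitting and passed as `feature_names`; that step was not part of the original notebook.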


## 11) Results Summary & Conclusion

Both supervised models — SVM (RBF) and Decision Tree — were trained and evaluated on the StudentsPerformance.csv dataset.
After preprocessing (encoding and scaling), both models achieved excellent predictive performance.

| Model | Accuracy | Precision | Recall | F1 | Cross-Val F1 (mean ± std) |
| --- | --- | --- | --- | --- | --- |
| SVM (RBF) | 0.97 | 0.993 | 0.965 | 0.979 | 0.986 ± 0.007 |
| Decision Tree | 0.96 | 0.966 | 0.979 | 0.972 | 0.977 ± 0.007 |

Interpretation:
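The SVM model slightly outperformed the Decision Tree in test accuracy, test F1, and mean cross-validated F1, while the Decision Tree had slightly higher test recall. The cross-validation spread was essentially the same for both models (std of about 0.007).
The SVM's smooth decision boundary gave it a small advantage that held up under cross-validation.
Both models generalized well to held-out data. Since the score columns used as features also define the label, these results mainly show how well each model recovers the averaging rule.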

Conclusion:
The SVM (RBF) is selected as the best model for this task due to its higher accuracy and F1 on the test set and its higher mean F1 in 5-fold cross-validation.