# Regression Models for Student Performance Prediction

In this notebook, we build several regression models to predict the final grade (G3) of students based on their:
- demographics  
- past grades  
- study time  
- absences  
- family background  
- behavioral attributes  

We will evaluate multiple algorithms:
- Linear Regression  
- k-Nearest Neighbors (KNN)  
- Decision Tree Regression  
- Random Forest Regression  
- Support Vector Regression 

Metrics used:
- MAE (Mean Absolute Error)
- RMSE (Root Mean Squared Error)
- R² Score
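
As a quick reference for how these three metrics relate, here is a minimal sketch that computes each one by hand with NumPy and checks it against scikit-learn. The grade values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical grades on the 0-20 scale, for illustration only.
y_true = np.array([10.0, 12.0, 14.0, 8.0])
y_pred = np.array([11.0, 12.0, 13.0, 10.0])

errors = y_true - y_pred

# MAE: average size of the error, in grade points.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus the ratio of residual variance to the variance around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

# The hand-computed values should agree with scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

MAE and RMSE are in the same units as G3, so they read directly as "grade points off"; R² is unitless.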



In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


In [5]:
df = pd.read_csv("../data/processed/student_data_cleaned.csv")
df.head()


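|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | statut_reussite |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 18 | 1 | 0 | 0 | 4 | 4 | 0 | 4 | ... | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 | 1 |
| 1 | 0 | 0 | 17 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | ... | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 | 1 |
| 2 | 0 | 0 | 15 | 1 | 1 | 1 | 1 | 1 | 0 | 2 | ... | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 | 1 |
| 3 | 0 | 0 | 15 | 1 | 0 | 1 | 4 | 2 | 1 | 3 | ... | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 | 1 |
| 4 | 0 | 0 | 16 | 1 | 0 | 1 | 3 | 3 | 2 | 2 | ... | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 | 1 |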


In [6]:
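# Features: every column except the target G3.
# Caution: `statut_reussite` (pass/fail status) stays in X. If it was derived
# from G3, it leaks the target and inflates the scores reported below.
X = df.drop("G3", axis=1)
y = df["G3"]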

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
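
The split above keeps every non-target column as a feature. If `statut_reussite` is a pass/fail flag computed from G3 (an assumption; the notebook does not say how it was built), it should be dropped along with the target. A minimal sketch on a made-up toy frame, with the pass threshold of 10 also assumed:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mimicking a few columns of the dataset; all values are made up.
toy = pd.DataFrame({
    "G1": [9, 12, 14, 11, 7, 15, 10, 13, 8, 16],
    "G2": [10, 12, 14, 12, 8, 15, 11, 13, 9, 17],
    "absences": [4, 2, 6, 0, 10, 0, 3, 1, 8, 0],
    "G3": [10, 12, 14, 12, 7, 16, 11, 13, 9, 17],
})

# Assumption: the status flag is derived from the target (pass if G3 >= 10).
toy["statut_reussite"] = (toy["G3"] >= 10).astype(int)

# Drop the target and anything computed from it before building X.
leaky_columns = ["G3", "statut_reussite"]
X_toy = toy.drop(columns=leaky_columns)
y_toy = toy["G3"]

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(list(X_toy.columns))
```

Whether G1 and G2 should also be excluded depends on the use case: they are legitimate predictors if earlier grades are known at prediction time.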


In [7]:
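# Standardize features for the distance-based KNN model; the other models
# below are trained on the unscaled features.
# The scaler is fitted on the training split only, then reused on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)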
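
To make the fit/transform distinction concrete, the following sketch uses synthetic data (not the student dataset) to show that `transform` on the test split reuses the training mean and standard deviation rather than recomputing them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train and test feature matrices.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(80, 2))
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean_ and scale_
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses them

# Training columns end up with mean 0 and standard deviation 1.
print(X_train_demo_scaled.mean(axis=0), X_train_demo_scaled.std(axis=0))

# The test split is scaled with the *training* statistics, so its own
# mean and standard deviation are only approximately 0 and 1.
manual = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(manual, X_test_demo_scaled))
```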


## Linear Regression

In [8]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)


## KNN Regression

In [10]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

y_pred_knn = knn.predict(X_test_scaled)
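
Because KNN depends on a separately fitted scaler, one alternative is to bundle the two steps so they cannot get out of sync. This is a sketch of that design on synthetic data, not the notebook's own approach; `make_pipeline` is a real scikit-learn helper, while the data and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, where unscaled distances
# would be dominated by the second column.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# The pipeline fits the scaler on the training rows only and applies the
# same scaling automatically at prediction time.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_pipeline.fit(X_demo[:80], y_demo[:80])

y_pred_demo = knn_pipeline.predict(X_demo[80:])
print(y_pred_demo.shape)
```

A single pipeline object can also be saved as one file, so the scaler and the model always travel together.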


## Decision Tree Regression

In [11]:
dt = DecisionTreeRegressor(max_depth=6, random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)


## Random Forest Regression

In [12]:
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
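
The model list in the introduction includes Support Vector Regression, but none of the cells here fit one. The following is a minimal sketch of how such a model could be trained, using synthetic data; the kernel, `C`, and `epsilon` values are illustrative assumptions, not tuned settings from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem standing in for the student data.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 20, size=(200, 2))
y_demo = 0.3 * X_demo[:, 0] + 0.7 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# SVR is sensitive to feature scale, so it is paired with a scaler here.
# Hyperparameters are placeholders chosen for illustration.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X_demo_train, y_demo_train)

y_pred_svr_demo = svr.predict(X_demo_test)
print(y_pred_svr_demo[:5])
```

On the real data, the equivalent would be fitting on `X_train_scaled` and predicting on `X_test_scaled`, as done for KNN.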


## Evaluation Function

In [13]:
def evaluate_model(y_true, y_pred, name):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)

    print(f"=== {name} ===")
    print(f"MAE:  {mae:.3f}")
    print(f"RMSE: {rmse:.3f}")
    print(f"R²:   {r2:.3f}")
    print("\n")


## Evaluate All Models

In [14]:
evaluate_model(y_test, y_pred_lr, "Linear Regression")
evaluate_model(y_test, y_pred_knn, "KNN Regressor")
evaluate_model(y_test, y_pred_dt, "Decision Tree Regressor")
evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")


=== Linear Regression ===
MAE:  0.731
RMSE: 1.138
R²:   0.867


=== KNN Regressor ===
MAE:  1.366
RMSE: 1.760
R²:   0.683


=== Decision Tree Regressor ===
MAE:  0.521
RMSE: 0.677
R²:   0.953


=== Random Forest Regressor ===
MAE:  0.652
RMSE: 1.030
R²:   0.891




## Compare Models in a Table

In [15]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "KNN", "Decision Tree", "Random Forest"],
    "MAE": [
        mean_absolute_error(y_test, y_pred_lr),
        mean_absolute_error(y_test, y_pred_knn),
        mean_absolute_error(y_test, y_pred_dt),
        mean_absolute_error(y_test, y_pred_rf),
    ],
    "RMSE": [
        np.sqrt(mean_squared_error(y_test, y_pred_lr)),
        np.sqrt(mean_squared_error(y_test, y_pred_knn)),
        np.sqrt(mean_squared_error(y_test, y_pred_dt)),
        np.sqrt(mean_squared_error(y_test, y_pred_rf)),
    ],
    "R2 Score": [
        r2_score(y_test, y_pred_lr),
        r2_score(y_test, y_pred_knn),
        r2_score(y_test, y_pred_dt),
        r2_score(y_test, y_pred_rf),
    ]
})

results


|   | Model | MAE | RMSE | R2 Score |
|---|---|---|---|---|
| 0 | Linear Regression | 0.730562 | 1.137547 | 0.867304 |
| 1 | KNN | 1.366154 | 1.759545 | 0.682517 |
| 2 | Decision Tree | 0.521314 | 0.67721 | 0.952971 |
| 3 | Random Forest | 0.652038 | 1.030381 | 0.891128 |
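
The comparison table repeats each metric call once per model. A more compact way to build the same kind of table is to compute the metrics in one helper and loop over a dictionary of predictions. The sketch below uses hypothetical names (`compute_metrics`, "Model A", "Model B") and made-up predictions:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def compute_metrics(y_true, y_pred):
    """Return the three metrics used in this notebook as a dict."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "R2 Score": r2_score(y_true, y_pred),
    }


# Made-up targets and predictions, for illustration only.
y_true_demo = np.array([10.0, 12.0, 14.0, 8.0])
predictions = {
    "Model A": np.array([11.0, 12.0, 13.0, 10.0]),
    "Model B": np.array([10.0, 12.0, 14.0, 9.0]),
}

# One row per model, sorted so the lowest RMSE comes first.
rows = [
    {"Model": name, **compute_metrics(y_true_demo, pred)}
    for name, pred in predictions.items()
]
results_demo = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(results_demo)
```

In the notebook, `predictions` would map each model name to `y_pred_lr`, `y_pred_knn`, `y_pred_dt`, and `y_pred_rf`.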


## Save Best Model for Deployment

In [17]:
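import joblib

# Persist the Random Forest as the deployment model. On this test split the
# Decision Tree scored better on all three metrics (see the table above), so
# swap in `dt` if the goal is to ship the top-scoring model.
joblib.dump(rf, "../models/regression_best.pkl")

# The scaler was only used for KNN. `rf` was trained on unscaled features,
# so do not apply the scaler before calling rf.predict at inference time.
joblib.dump(scaler, "../models/scaler.pkl")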


['../models/scaler.pkl']
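
To check that a saved model can be restored and used, a round trip through `joblib.dump` and `joblib.load` is a reasonable smoke test. The sketch below trains a small forest on synthetic data and writes to a temporary directory, so the file name `regression_demo.pkl` is hypothetical and nothing under `../models/` is touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem; the real notebook would use X_train / y_train.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 20, size=(50, 3))
y_demo = X_demo.mean(axis=1)

model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save to a temporary location, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "regression_demo.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

At inference time the input must have the same columns, in the same order, as the features used for training.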