Predicting student exam outcomes using machine learning and educational analytics.
The goal of this project is to build a regression model that predicts a student's final exam grade (G3) using behavioral, academic, and lifestyle features from the UCI Student Performance Dataset. Beyond prediction accuracy, this project focuses on understanding which factors most influence academic performance, and how data-driven insights can support early intervention for at-risk students.
- Source: UCI Machine Learning Repository
- File used: student-mat.csv
- Samples: 395
- Target variable: G3 (final grade, 0–20)
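One loading detail worth noting: the UCI student files are semicolon-delimited, so `sep=";"` is needed with `pandas.read_csv`. A minimal, self-contained sketch (a two-row inline sample stands in for the real `student-mat.csv`):

```python
import io
import pandas as pd

# Two-row stand-in for student-mat.csv: the real UCI file uses the same
# semicolon delimiter and includes the G1/G2/G3 grade columns.
sample = io.StringIO(
    "school;sex;age;studytime;Dalc;Walc;absences;G1;G2;G3\n"
    "GP;F;18;2;1;1;6;5;6;6\n"
    "GP;F;17;2;1;1;4;5;5;6\n"
)

# sep=";" is required; the default comma delimiter would parse one giant column
df = pd.read_csv(sample, sep=";")
print(df.shape)         # (2, 10)
print(df["G3"].mean())  # 6.0
```

For the full dataset the same call is `pd.read_csv("student-mat.csv", sep=";")`.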
The dataset includes categories such as:

- Academic: study time, past grades (G1, G2), absences
- Family/Social: family support, living situation
- Lifestyle: weekday/weekend alcohol use
- Demographic: age, parental education
- Data Cleaning & Exploration (EDA)
- Feature Engineering
  - avg_previous_grade = (G1 + G2) / 2
  - Weekalc = (Dalc + Walc) / 2
- One-Hot Encoding for categorical variables
- Feature Scaling with StandardScaler
- Train/Test Split (test_size=0.2, random_state=42)
- Regression Models
  - Baseline: Linear Regression
  - Improved: Polynomial Regression (degree = 2)
- Evaluation Metrics: MAE, RMSE, R² score
- Model Diagnostics
  - Correlation heatmap
  - Predictions vs Actual
  - Residual plots
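The two engineered features above are plain column arithmetic. A minimal sketch on a tiny synthetic frame (the real columns come from student-mat.csv):

```python
import pandas as pd

# Tiny stand-in frame with the grade and alcohol columns the project uses
df = pd.DataFrame({
    "G1": [10, 14, 6],
    "G2": [12, 15, 7],
    "Dalc": [1, 2, 4],  # workday alcohol use (scale 1-5)
    "Walc": [2, 3, 5],  # weekend alcohol use (scale 1-5)
})

# avg_previous_grade: mean of the two period grades, capturing the academic trend
df["avg_previous_grade"] = (df["G1"] + df["G2"]) / 2

# Weekalc: weekday and weekend alcohol use combined into one lifestyle indicator
df["Weekalc"] = (df["Dalc"] + df["Walc"]) / 2

print(df["avg_previous_grade"].tolist())  # [11.0, 14.5, 6.5]
print(df["Weekalc"].tolist())             # [1.5, 2.5, 4.5]
```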
- Inspected dataset structure with .info() and .describe()
- Checked distributions and missing values
- Identified key predictors through a correlation heatmap → avg_previous_grade strongly correlated with G3
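Numerically, the heatmap step reduces to ranking `df.corr()` against the target. A sketch with made-up values chosen so past grades track G3 closely, mirroring the finding above:

```python
import pandas as pd

# Synthetic stand-in values; the real numbers come from the UCI dataset
df = pd.DataFrame({
    "avg_previous_grade": [6.0, 9.5, 11.0, 14.5, 17.0],
    "Weekalc":            [4.5, 2.0, 3.5, 2.5, 1.5],
    "absences":           [10, 2, 8, 4, 6],
    "G3":                 [6, 10, 11, 15, 18],
})

# Correlation of every feature with the target, strongest (by magnitude) first
corr_with_target = (
    df.corr()["G3"].drop("G3").sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value puts the dominant predictor on top regardless of sign, which is how a negative lifestyle correlation and a positive academic one can be ranked together.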
Created two meaningful features based on domain intuition:

| Feature | Description |
|---|---|
| avg_previous_grade | Average of G1 and G2, capturing the academic trend |
| Weekalc | Combined alcohol use, a lifestyle pattern indicator |
- One-hot encoded categorical columns (schoolsup)
- Scaled numerical features using StandardScaler
- Split into training and testing sets before scaling
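The split-before-scaling ordering matters: fitting StandardScaler on the training split only keeps test-set statistics from leaking into preprocessing. A minimal sketch with synthetic data standing in for the real columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset: one numeric and one categorical column
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "avg_previous_grade": rng.uniform(0, 20, 100),
    "schoolsup": rng.choice(["yes", "no"], 100),
})

# One-hot encode the categorical column (drop_first avoids a redundant column)
df = pd.get_dummies(df, columns=["schoolsup"], drop_first=True)

# Split FIRST, then fit the scaler on the training portion only
train, test = train_test_split(df, test_size=0.2, random_state=42)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)  # reuses training mean/std: no leakage

print(train_scaled.mean(axis=0).round(6))  # ~0 for each training column
```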
Trained two models:

- Baseline Linear Regression
- Polynomial Regression (degree=2), using PolynomialFeatures() to capture mild nonlinearity

Evaluated using MAE, RMSE, and R² (results are shown below).
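The baseline-vs-polynomial comparison can be sketched end to end on a small synthetic regression problem (the reported metrics come from the real dataset, not this toy data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in: the target depends mildly nonlinearly on one feature
rng = np.random.default_rng(42)
X = rng.uniform(0, 20, size=(200, 1))
y = 0.8 * X[:, 0] + 0.01 * X[:, 0] ** 2 + rng.normal(0, 1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "poly2": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))  # RMSE as sqrt of MSE
    print(f"{name}: MAE={mae:.3f} RMSE={rmse:.3f} R2={r2_score(y_te, pred):.3f}")
```

Wrapping PolynomialFeatures and LinearRegression in one pipeline keeps the degree-2 expansion tied to the model, so both candidates expose the same fit/predict interface for comparison.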
Generated three key plots:
- Correlation heatmap
- Actual vs Predicted scatterplots
- Residual plots for error analysis
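The actual-vs-predicted and residual plots can be sketched with matplotlib as below (Agg backend so it renders headless; `y_test`/`y_pred` here are placeholder arrays standing in for the real model outputs):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder predictions standing in for the trained model's output
rng = np.random.default_rng(42)
y_test = rng.uniform(0, 20, 79)
y_pred = y_test + rng.normal(0, 2, 79)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs predicted: points near the diagonal mean accurate predictions
ax1.scatter(y_test, y_pred, alpha=0.6)
ax1.plot([0, 20], [0, 20], "r--", label="ideal")
ax1.set(xlabel="Actual G3", ylabel="Predicted G3", title="Actual vs Predicted")
ax1.legend()

# Residual plot: a patternless band around zero suggests no systematic bias
residuals = y_test - y_pred
ax2.scatter(y_pred, residuals, alpha=0.6)
ax2.axhline(0, color="r", linestyle="--")
ax2.set(xlabel="Predicted G3", ylabel="Residual", title="Residuals")

fig.savefig("diagnostics.png", dpi=120)
```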
avg_previous_grade was by far the strongest predictor of final exam performance.
Alcohol consumption (Weekalc) had a mild negative correlation.
Absences and study time showed weak linear relationships.
Interpretation: Past academic performance matters far more than lifestyle or support features.
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 1.6751 | 2.2543 | 0.7522 |
| Polynomial Regression (degree=2) | 1.6076 | 2.2319 | 0.7571 |
The polynomial model offered a small but consistent improvement across all metrics.
Both models tracked actual grades closely, with the polynomial model producing slightly tighter clustering around the ideal line.
Residuals were centered around zero with no strong pattern, indicating:
- No major bias
- Linearity assumptions hold reasonably well
- Mild heteroscedasticity at high grades (common in student datasets)
✔ Achieved an R² of ~0.76 on the polynomial model
✔ Predicted final grades within ~1.6 points MAE
✔ Engineered features noticeably improved performance
✔ Past grades dominate prediction strength
✔ Visualization confirmed the model’s reliability and interpretability
This project demonstrates the ability to:

- Work with real-world educational datasets
- Apply practical regression techniques
- Engineer meaningful features from domain knowledge
- Evaluate and compare multiple models
- Create insightful visualizations for interpretability
- Write clean, reproducible ML code
- Explain model behavior clearly and professionally