# Student Score Prediction

---
### Objective
Build a robust machine learning pipeline to **predict students' exam scores** based on multiple academic and lifestyle factors.

### Workflow
1. Data Loading & Overview
2. Exploratory Data Analysis (EDA) with Rich Visualizations
3. Feature Engineering & Preprocessing
4. Model Training: Linear → Ridge → Lasso → Polynomial Regression
5. Evaluation & Comparison Dashboard
6. Save Model for Deployment

### Dataset
[Student Performance Factors – Kaggle](https://www.kaggle.com/datasets/lainguyn123/student-performance-factors)

In [None]:
# ============================================================
# IMPORTS
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import warnings
import joblib
import os

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance

warnings.filterwarnings('ignore')

# ── Style ──────────────────────────────────────────────────
plt.rcParams.update({
    'figure.facecolor': '#0f0f1a',
    'axes.facecolor': '#1a1a2e',
    'axes.edgecolor': '#444',
    'axes.labelcolor': 'white',
    'xtick.color': '#aaa',
    'ytick.color': '#aaa',
    'text.color': 'white',
    'grid.color': '#2a2a3e',
    'grid.linestyle': '--',
    'grid.alpha': 0.5,
    'font.family': 'DejaVu Sans'
})

PALETTE = ['#7B2FBE', '#00D4FF', '#FF6B6B', '#FFD93D', '#6BCB77', '#4D96FF']
sns.set_palette(PALETTE)

print('Libraries loaded successfully!')
print(f'NumPy: {np.__version__} | Pandas: {pd.__version__}')

## 1. Data Loading & Overview

In [None]:
# ── Load Data ────────────────────────────────────────────────
# If running on Kaggle, path will be: /kaggle/input/student-performance-factors/StudentPerformanceFactors.csv
try:
    df = pd.read_csv('/kaggle/input/student-performance-factors/StudentPerformanceFactors.csv')
except FileNotFoundError:
    # Fallback: generate synthetic data matching the real dataset schema
    np.random.seed(42)
    n = 6607
    df = pd.DataFrame({
        'Hours_Studied': np.random.randint(1, 44, n),
        'Attendance': np.random.randint(60, 100, n),
        'Sleep_Hours': np.random.randint(4, 10, n),
        'Previous_Scores': np.random.randint(50, 100, n),
        'Tutoring_Sessions': np.random.randint(0, 8, n),
        'Physical_Activity': np.random.randint(0, 6, n),
        'Parental_Involvement': np.random.choice(['Low', 'Medium', 'High'], n),
        'Access_to_Resources': np.random.choice(['Low', 'Medium', 'High'], n),
        'Motivation_Level': np.random.choice(['Low', 'Medium', 'High'], n),
        'Internet_Access': np.random.choice(['Yes', 'No'], n),
        'Family_Income': np.random.choice(['Low', 'Medium', 'High'], n),
        'Teacher_Quality': np.random.choice(['Low', 'Medium', 'High'], n),
        'School_Type': np.random.choice(['Public', 'Private'], n),
        'Peer_Influence': np.random.choice(['Negative', 'Neutral', 'Positive'], n),
        'Gender': np.random.choice(['Male', 'Female'], n),
        'Learning_Disabilities': np.random.choice(['Yes', 'No'], n),
        'Extracurricular_Activities': np.random.choice(['Yes', 'No'], n),
        'Parental_Education_Level': np.random.choice(['High School', 'College', 'Postgraduate'], n),
        'Distance_from_Home': np.random.choice(['Near', 'Moderate', 'Far'], n),
    })
    # Create realistic target
    df['Exam_Score'] = (
        40
        + df['Hours_Studied'] * 0.8
        + (df['Attendance'] - 75) * 0.3
        + df['Previous_Scores'] * 0.25
        + df['Tutoring_Sessions'] * 1.2
        + df['Sleep_Hours'] * 0.5
        + np.random.normal(0, 3, n)
    ).clip(55, 101).astype(int)
    print('Using synthetic data (upload real dataset on Kaggle)')

print(f'Dataset Shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
df.head()

In [None]:
# ── Basic Statistics ─────────────────────────────────────────
print('=' * 60)
print('DATASET STATISTICS')
print('=' * 60)
print(f'  Rows:         {df.shape[0]:,}')
print(f'  Columns:      {df.shape[1]}')
print(f'  Missing vals: {df.isnull().sum().sum()}')
print(f'  Duplicates:   {df.duplicated().sum()}')
print(f'  Target Range: {df["Exam_Score"].min()} – {df["Exam_Score"].max()}')
print(f'  Target Mean:  {df["Exam_Score"].mean():.2f}')
print('=' * 60)

df.describe().style.background_gradient(cmap='Blues').format(precision=2)

## 2. Exploratory Data Analysis

In [None]:
# ── Target Distribution ────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
fig.suptitle('Target Variable: Exam Score Distribution', 
             fontsize=16, fontweight='bold', y=1.02, color='white')

# Histogram
axes[0].hist(df['Exam_Score'], bins=30, color='#7B2FBE', edgecolor='#00D4FF', alpha=0.85, linewidth=0.8)
axes[0].axvline(df['Exam_Score'].mean(), color='#FFD93D', linewidth=2, linestyle='--', label=f'Mean: {df["Exam_Score"].mean():.1f}')
axes[0].axvline(df['Exam_Score'].median(), color='#FF6B6B', linewidth=2, linestyle='-.', label=f'Median: {df["Exam_Score"].median():.1f}')
axes[0].set_title('Distribution', fontsize=13)
axes[0].set_xlabel('Exam Score')
axes[0].set_ylabel('Count')
axes[0].legend()
axes[0].grid(True)

# Boxplot
bp = axes[1].boxplot(df['Exam_Score'], vert=True, patch_artist=True,
                     boxprops=dict(facecolor='#7B2FBE', alpha=0.7),
                     medianprops=dict(color='#FFD93D', linewidth=2),
                     whiskerprops=dict(color='#00D4FF'),
                     capprops=dict(color='#00D4FF'),
                     flierprops=dict(marker='o', color='#FF6B6B', markersize=4))
axes[1].set_title('Box Plot', fontsize=13)
axes[1].set_ylabel('Exam Score')
axes[1].set_xticklabels(['Exam Score'])
axes[1].grid(True)

plt.tight_layout()
plt.savefig('target_distribution.png', dpi=150, bbox_inches='tight', facecolor='#0f0f1a')
plt.show()
print(f'Skewness: {df["Exam_Score"].skew():.3f} | Kurtosis: {df["Exam_Score"].kurt():.3f}')

In [None]:
# ── Correlation Heatmap ────────────────────────────────────────
num_cols = df.select_dtypes(include=np.number).columns.tolist()
corr = df[num_cols].corr()

fig, ax = plt.subplots(figsize=(12, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(250, 10, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            annot=True, fmt='.2f', square=True, ax=ax,
            annot_kws={'size': 9}, linewidths=0.5, linecolor='#0f0f1a',
            cbar_kws={'shrink': 0.8})

ax.set_title('Correlation Heatmap – Numerical Features', fontsize=15, fontweight='bold', pad=15)
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150, bbox_inches='tight', facecolor='#0f0f1a')
plt.show()

# Top correlations with target
target_corr = corr['Exam_Score'].drop('Exam_Score').sort_values(key=abs, ascending=False)
print('\nTop Feature Correlations with Exam Score:')
for feat, val in target_corr.items():
    bar = '█' * int(abs(val) * 20)
    sign = '+' if val > 0 else '-'
    print(f'  {feat:<22} {sign}{bar} ({val:.3f})')

In [None]:
# ── Scatter Plots: Key Features vs Score ─────────────────────
key_features = ['Hours_Studied', 'Attendance', 'Previous_Scores', 'Sleep_Hours']
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Key Features vs Exam Score', fontsize=16, fontweight='bold', y=1.01)

for ax, feat, color in zip(axes.flatten(), key_features, PALETTE):
    ax.scatter(df[feat], df['Exam_Score'], alpha=0.3, s=15, color=color)
    # Trend line
    z = np.polyfit(df[feat], df['Exam_Score'], 1)
    p = np.poly1d(z)
    x_line = np.linspace(df[feat].min(), df[feat].max(), 100)
    ax.plot(x_line, p(x_line), '--', color='#FFD93D', linewidth=2, label='Trend')
    corr_val = df[feat].corr(df['Exam_Score'])
    ax.set_title(f'{feat} | r = {corr_val:.3f}', fontsize=12)
    ax.set_xlabel(feat)
    ax.set_ylabel('Exam Score')
    ax.legend()
    ax.grid(True)

plt.tight_layout()
plt.savefig('scatter_plots.png', dpi=150, bbox_inches='tight', facecolor='#0f0f1a')
plt.show()

In [None]:
# ── Categorical Feature Analysis ──────────────────────────────
cat_cols = ['Parental_Involvement', 'Motivation_Level', 'Teacher_Quality',
            'Access_to_Resources', 'Peer_Influence', 'Family_Income']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Categorical Features vs Exam Score', fontsize=16, fontweight='bold', y=1.01)

for ax, col, color in zip(axes.flatten(), cat_cols, PALETTE):
    order = df.groupby(col)['Exam_Score'].median().sort_values(ascending=False).index
    means = df.groupby(col)['Exam_Score'].mean().reindex(order)
    bars = ax.bar(means.index, means.values, color=color, alpha=0.85, edgecolor='white', linewidth=0.5)
    ax.set_title(col.replace('_', ' '), fontsize=11)
    ax.set_xlabel('')
    ax.set_ylabel('Avg Exam Score')
    ax.grid(True, axis='y')
    for bar, val in zip(bars, means.values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3, 
                f'{val:.1f}', ha='center', va='bottom', fontsize=9, color='white')

plt.tight_layout()
plt.savefig('categorical_analysis.png', dpi=150, bbox_inches='tight', facecolor='#0f0f1a')
plt.show()

## 3. Feature Engineering & Preprocessing

In [None]:
# ── Data Cleaning ──────────────────────────────────────────────
df_clean = df.copy()

# Fill missing values (assign back instead of using chained inplace fillna,
# which is deprecated in recent pandas and may not modify df_clean)
for col in df_clean.select_dtypes(include=np.number).columns:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())
for col in df_clean.select_dtypes(include='object').columns:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

# Drop duplicates
df_clean.drop_duplicates(inplace=True)

# ── Encode Categorical Features ────────────────────────────────
ordinal_maps = {
    'Parental_Involvement': {'Low': 0, 'Medium': 1, 'High': 2},
    'Access_to_Resources': {'Low': 0, 'Medium': 1, 'High': 2},
    'Motivation_Level': {'Low': 0, 'Medium': 1, 'High': 2},
    'Family_Income': {'Low': 0, 'Medium': 1, 'High': 2},
    'Teacher_Quality': {'Low': 0, 'Medium': 1, 'High': 2},
    'Distance_from_Home': {'Near': 0, 'Moderate': 1, 'Far': 2},
    'Peer_Influence': {'Negative': 0, 'Neutral': 1, 'Positive': 2},
    'Parental_Education_Level': {'High School': 0, 'College': 1, 'Postgraduate': 2},
}

binary_maps = {
    'Internet_Access': {'Yes': 1, 'No': 0},
    'Extracurricular_Activities': {'Yes': 1, 'No': 0},
    'Learning_Disabilities': {'Yes': 1, 'No': 0},
    'Gender': {'Male': 1, 'Female': 0},
    'School_Type': {'Private': 1, 'Public': 0},
}

for col, mapping in {**ordinal_maps, **binary_maps}.items():
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].map(mapping)

# ── Feature Engineering ────────────────────────────────────────
df_clean['Study_Efficiency'] = df_clean['Hours_Studied'] * df_clean['Attendance'] / 100
df_clean['Wellbeing_Score'] = df_clean['Sleep_Hours'] + df_clean['Physical_Activity']
df_clean['Resource_Score'] = df_clean['Access_to_Resources'] + df_clean['Internet_Access']
df_clean['Support_Score'] = df_clean['Parental_Involvement'] + df_clean['Teacher_Quality'] + df_clean['Tutoring_Sessions']

print('Feature engineering complete!')
print(f' Final features: {df_clean.shape[1] - 1}')
print(f' Dataset size: {df_clean.shape[0]:,} rows')
df_clean.head(3)
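
`Series.map` silently turns any label that is missing from the mapping into `NaN`, so a typo or an unexpected category in the CSV would only surface later as a model error. The sketch below shows one way to guard the encoding step. The helper name `map_with_check` and the demo frame are illustrative assumptions, not part of the notebook above.

```python
import pandas as pd


def map_with_check(series: pd.Series, mapping: dict) -> pd.Series:
    """Map categories to integers and fail loudly on labels missing from the mapping."""
    # Hypothetical helper: existing NaNs are ignored, only unknown labels raise.
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped categories in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)


# Small demo frame standing in for df_clean
demo = pd.DataFrame({"Motivation_Level": ["Low", "High", "Medium", "High"]})
levels = {"Low": 0, "Medium": 1, "High": 2}

encoded = map_with_check(demo["Motivation_Level"], levels)
print(encoded.tolist())

# An unexpected label raises instead of becoming NaN
try:
    map_with_check(pd.Series(["Low", "Very High"], name="Motivation_Level"), levels)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

In the encoding loop above, the same check could replace the bare `.map(mapping)` call if strict validation is wanted.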

In [None]:
# ── Train/Test Split ───────────────────────────────────────────
TARGET = 'Exam_Score'
FEATURES = [c for c in df_clean.columns if c != TARGET]

X = df_clean[FEATURES]
y = df_clean[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

print(f'Training set:   {X_train.shape[0]:,} samples')
print(f'Testing set:    {X_test.shape[0]:,} samples')
print(f'Features used:  {len(FEATURES)}')
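
Because the scaler above is fitted on the whole training set, any later cross-validation on `X_train_sc` lets each validation fold influence the scaling statistics. A minimal sketch of the alternative, assuming synthetic stand-in data (`X_demo`, `y_demo`), wraps the scaler and the model in a `Pipeline` so that scaling is refit inside every fold:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 features, linear signal plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 60 + X_demo @ np.array([3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For standardization the difference in scores is usually small, but the pipeline form also keeps preprocessing and model together as a single object for saving.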

## 4. Model Training & Comparison

In [None]:
# ── Train Multiple Models ──────────────────────────────────────
def evaluate_model(name, model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    cv = cross_val_score(model, X_tr, y_tr, cv=5, scoring='r2')
    return {
        'Model': name,
        'R² Train': model.score(X_tr, y_tr),
        'R² Test': r2_score(y_te, preds),
        'MAE': mean_absolute_error(y_te, preds),
        'RMSE': np.sqrt(mean_squared_error(y_te, preds)),
        'CV R² Mean': cv.mean(),
        'CV R² Std': cv.std(),
        'Predictions': preds
    }

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
}

results = []
for name, model in models.items():
    res = evaluate_model(name, model, X_train_sc, y_train, X_test_sc, y_test)
    results.append(res)
    print(f'{name}: R²={res["R² Test"]:.4f} | MAE={res["MAE"]:.3f} | RMSE={res["RMSE"]:.3f}')

# Polynomial Regression (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_tr_poly = poly.fit_transform(X_train_sc)
X_te_poly = poly.transform(X_test_sc)
poly_model = Ridge(alpha=1.0)  # Ridge prevents overfitting with poly features
res_poly = evaluate_model('Polynomial (deg=2)', poly_model, X_tr_poly, y_train, X_te_poly, y_test)
results.append(res_poly)
print(f'Polynomial Regression: R²={res_poly["R² Test"]:.4f} | MAE={res_poly["MAE"]:.3f} | RMSE={res_poly["RMSE"]:.3f}')

results_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'Predictions'} for r in results])
print('\nModel Comparison:')
results_df.style.highlight_max(subset=['R² Test', 'CV R² Mean'], color='#2d6a2d').highlight_min(subset=['MAE', 'RMSE'], color='#2d6a2d').format(precision=4)

In [None]:
# ── Model Comparison Dashboard ─────────────────────────────────
fig = plt.figure(figsize=(20, 12))
fig.suptitle('Model Performance Dashboard', fontsize=18, fontweight='bold', y=1.01)

gs = fig.add_gridspec(2, 3, hspace=0.4, wspace=0.35)

model_names = results_df['Model'].tolist()
colors = PALETTE[:len(model_names)]

# R² Score
ax1 = fig.add_subplot(gs[0, 0])
bars = ax1.bar(model_names, results_df['R² Test'], color=colors, alpha=0.85, edgecolor='white')
ax1.set_title('R² Score (Test)', fontweight='bold')
ax1.set_ylim(0, 1)
ax1.set_xticklabels(model_names, rotation=15, ha='right', fontsize=8)
for bar, val in zip(bars, results_df['R² Test']):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, f'{val:.3f}', ha='center', fontsize=9)
ax1.grid(True, axis='y')

# MAE
ax2 = fig.add_subplot(gs[0, 1])
bars = ax2.bar(model_names, results_df['MAE'], color=colors, alpha=0.85, edgecolor='white')
ax2.set_title('Mean Absolute Error (↓ better)', fontweight='bold')
ax2.set_xticklabels(model_names, rotation=15, ha='right', fontsize=8)
for bar, val in zip(bars, results_df['MAE']):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{val:.2f}', ha='center', fontsize=9)
ax2.grid(True, axis='y')

# RMSE
ax3 = fig.add_subplot(gs[0, 2])
bars = ax3.bar(model_names, results_df['RMSE'], color=colors, alpha=0.85, edgecolor='white')
ax3.set_title('RMSE (↓ better)', fontweight='bold')
ax3.set_xticklabels(model_names, rotation=15, ha='right', fontsize=8)
for bar, val in zip(bars, results_df['RMSE']):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{val:.2f}', ha='center', fontsize=9)
ax3.grid(True, axis='y')

# Best model: Actual vs Predicted
best_idx = results_df['R² Test'].idxmax()
best_name = results_df.loc[best_idx, 'Model']
best_preds = results[best_idx]['Predictions']

ax4 = fig.add_subplot(gs[1, 0:2])
ax4.scatter(y_test.values[:200], best_preds[:200], alpha=0.6, s=25, color='#7B2FBE', label='Predictions')
mn, mx = min(y_test.min(), best_preds.min()), max(y_test.max(), best_preds.max())
ax4.plot([mn, mx], [mn, mx], '--', color='#FFD93D', linewidth=2, label='Perfect Prediction')
ax4.set_title(f'Actual vs Predicted – {best_name}', fontweight='bold')
ax4.set_xlabel('Actual Score')
ax4.set_ylabel('Predicted Score')
ax4.legend()
ax4.grid(True)

# Residuals
ax5 = fig.add_subplot(gs[1, 2])
residuals = y_test.values - best_preds
ax5.hist(residuals, bins=30, color='#00D4FF', edgecolor='white', alpha=0.8)
ax5.axvline(0, color='#FFD93D', linewidth=2, linestyle='--')
ax5.set_title('Residuals Distribution', fontweight='bold')
ax5.set_xlabel('Residual')
ax5.grid(True)

plt.savefig('model_dashboard.png', dpi=150, bbox_inches='tight', facecolor='#0f0f1a')
plt.show()
print(f'\nBest Model: {best_name} | R²={results_df.loc[best_idx, "R² Test"]:.4f}')

In [None]:
# ── Feature Importance ─────────────────────────────────────────
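# Coefficient magnitudes map onto FEATURES only for the models fitted on X_train_sc;
# the polynomial model uses expanded features, so it is reported and skipped.
best_model = models.get(best_name)
if best_model is None:
    print(f'{best_name} is not in `models`; skipping the coefficient plot.')
else: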
    best_model.fit(X_train_sc, y_train)
    importances = np.abs(best_model.coef_)
    feat_imp = pd.Series(importances, index=FEATURES).sort_values(ascending=True).tail(15)

    fig, ax = plt.subplots(figsize=(12, 7))
    colors_imp = plt.cm.viridis(np.linspace(0.3, 1, len(feat_imp)))
    bars = ax.barh(feat_imp.index, feat_imp.values, color=colors_imp, edgecolor='white', alpha=0.85)
    ax.set_title('Feature Importance (Coefficient Magnitude)', fontsize=14, fontweight='bold')
    ax.set_xlabel('Absolute Coefficient')
    ax.grid(True, axis='x')
    for bar, val in zip(bars, feat_imp.values):
        ax.text(bar.get_width() + 0.001, bar.get_y() + bar.get_height()/2,
                f'{val:.3f}', va='center', fontsize=9)
    plt.tight_layout()
    plt.savefig('feature_importance.png', dpi=150, bbox_inches='tight', facecolor='#0f0f1a')
    plt.show()
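
Coefficient magnitudes only apply to the plain linear models. When the polynomial model scores best, permutation importance is one model-agnostic alternative: it shuffles one input column at a time and measures the drop in score. The sketch below uses synthetic data with made-up column names (`strong`, `weak`, `noise`) purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): one strong, one weak, one irrelevant feature
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["strong", "weak", "noise"])
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Degree-2 polynomial Ridge, mirroring the model family used in the notebook
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)
model.fit(X_tr, y_tr)

# Importance is measured on held-out data, per original input column
result = permutation_importance(
    model, X_te, y_te, scoring="r2", n_repeats=10, random_state=0
)
ranking = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(ranking)
```

Because the importance is reported per original column rather than per expanded polynomial term, the ranking stays readable even for the degree-2 model.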

## 5. Save Model for Deployment

In [None]:
# ── Save Artifacts ─────────────────────────────────────────────
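# Retrain a Ridge model on the full data. Note: Ridge is saved regardless of
# which model scored best above, and the scaler is refit on all rows.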
final_model = Ridge(alpha=1.0)
X_all_sc = scaler.fit_transform(X)
final_model.fit(X_all_sc, y)

joblib.dump(final_model, 'student_model.pkl')
joblib.dump(scaler, 'student_scaler.pkl')

# Save feature list
import json
with open('student_features.json', 'w') as f:
    json.dump(FEATURES, f)

print('Model saved: student_model.pkl')
print('Scaler saved: student_scaler.pkl')
print('Features saved: student_features.json')

# ── Final Summary ──────────────────────────────────────────────
best_r2 = results_df['R² Test'].max()
best_mae = results_df.loc[results_df['R² Test'].idxmax(), 'MAE']
print('\n' + '='*55)
print('STUDENT SCORE PREDICTION — FINAL RESULTS')
print('='*55)
print(f'  Best Model:  {best_name}')
print(f'  R² Score:    {best_r2:.4f}  ({best_r2*100:.1f}% variance explained)')
print(f'  MAE:         {best_mae:.2f} points')
print(f'  Interpretation: Predictions are on avg {best_mae:.1f} pts off')
print('='*55)
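
For deployment, the three saved files have to be loaded together and the input columns put in the saved feature order before scaling. The following is a minimal sketch of that loading step, not a definitive serving implementation. The function `predict_scores` is a hypothetical helper, and the demo writes small stand-in artifacts with only two features into a temporary directory so the block runs on its own.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler


def predict_scores(rows: pd.DataFrame, artifact_dir: Path) -> np.ndarray:
    """Load model, scaler and feature list, then predict for already-encoded rows."""
    model = joblib.load(artifact_dir / "student_model.pkl")
    scaler = joblib.load(artifact_dir / "student_scaler.pkl")
    with open(artifact_dir / "student_features.json") as f:
        features = json.load(f)
    missing = [c for c in features if c not in rows.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    # Reorder columns to match the order used at training time
    return model.predict(scaler.transform(rows[features]))


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Stand-in artifacts (assumption): two features and a noiseless linear target
    rng = np.random.default_rng(1)
    train = pd.DataFrame({
        "Hours_Studied": rng.integers(1, 44, 200),
        "Attendance": rng.integers(60, 100, 200),
    })
    target = 40 + 0.8 * train["Hours_Studied"] + 0.3 * train["Attendance"]
    scaler = StandardScaler().fit(train)
    model = Ridge(alpha=1.0).fit(scaler.transform(train), target)
    joblib.dump(model, tmp_dir / "student_model.pkl")
    joblib.dump(scaler, tmp_dir / "student_scaler.pkl")
    (tmp_dir / "student_features.json").write_text(json.dumps(list(train.columns)))

    # Columns arrive in a different order; the helper reorders them
    new_rows = pd.DataFrame({"Attendance": [90], "Hours_Studied": [20]})
    preds = predict_scores(new_rows, tmp_dir)
    print(preds.shape)
```

With the real artifacts, incoming rows would also need the same categorical encoding and engineered columns (`Study_Efficiency`, `Wellbeing_Score`, `Resource_Score`, `Support_Score`) that were applied before training.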

## Key Insights

| Finding | Insight |
|---|---|
| **Hours Studied** | Among the strongest positive predictors of exam score (see the correlation ranking) |
| **Attendance** | Students with higher attendance tend to score higher |
| **Previous Scores** | Positively associated with exam score: past performance carries some signal for future results |
| **Teacher Quality** | Higher-rated teacher quality goes with higher average scores in the categorical charts |
| **Polynomial Features** | Can capture non-linear relationships; whether they improve accuracy is shown in the model comparison table |

### Recommendations
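- Higher attendance and more study hours are associated with higher scores; targets such as **>80% attendance** and **>20 study hours/week** are rules of thumb, not thresholds estimated by the models
- Tutoring sessions are positively associated with outcomes; check the coefficient plot for the size of the effect
- Sleep and physical activity feed into `Wellbeing_Score`; their link to exam scores should be read from the correlation and importance outputs rather than assumed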