# Prosit 3: Predicting and Supporting Student Success
## Supervised Learning: Classification + Regression

**Objective**: Build predictive models leveraging the **longitudinal/time-series nature** of student data.

**Dual Approach**:
1. **Classification**: Predict probation risk and Dean's List eligibility
2. **Regression**: Predict future GPA/CGPA based on historical performance

**Data Structure**:
- 538,147 records from 12,207 unique students  
- Average 44 records per student (courses across semesters)
- Temporal tracking via `StudentRef`

**Ashesi Policies** (Student Handbook 2022/2023):
- **Dean's List**: Semester GPA ≥ 3.5
- **Probation**: Cumulative GPA < 2.0
- **Dismissal**: Failure to make normal progress OR two consecutive semesters on probation
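
The policy thresholds above can be written as a small rule function. This is only a sketch on toy data: the column names `gpa` and `cgpa` and the helper `flag_policy_status` are illustrative assumptions, and the "failure to make normal progress" clause is not modelled because the handbook criteria for it are not quoted here.

```python
import pandas as pd

DEANS_LIST_MIN_GPA = 3.5   # semester GPA threshold quoted above
PROBATION_MAX_CGPA = 2.0   # cumulative GPA threshold quoted above


def flag_policy_status(records: pd.DataFrame) -> pd.DataFrame:
    """Add policy flags to ONE student's records, sorted by semester.

    Hypothetical helper; expects columns 'gpa' and 'cgpa'.
    """
    out = records.copy()
    out["deans_list"] = out["gpa"] >= DEANS_LIST_MIN_GPA
    out["probation"] = out["cgpa"] < PROBATION_MAX_CGPA
    # Two consecutive probation semesters: this semester and the previous one.
    previous = out["probation"].shift(1, fill_value=False)
    out["dismissal_flag"] = out["probation"] & previous
    return out


# Toy trajectory for a single student (made-up values).
toy = pd.DataFrame({
    "semester": [1, 2, 3, 4],
    "gpa": [2.20, 1.60, 1.70, 3.60],
    "cgpa": [2.20, 1.90, 1.83, 2.28],
})
flags = flag_policy_status(toy)
print(flags[["semester", "deans_list", "probation", "dismissal_flag"]])
```

For a multi-student table, the same function could be applied per student after sorting by semester.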

# Part 1: Data Preparation

## 1. Setup & Library Imports

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Classification models
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Model selection and evaluation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler

# Classification metrics
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve
)

# Regression metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

# Utilities
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Model persistence
import pickle
import os
from datetime import datetime

print("✅ Libraries loaded successfully!")
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Libraries loaded successfully!
Timestamp: 2025-12-15 01:14:24


## 2. Data Loading & Exploration

In [2]:
# Load merged cleaned encoded data (has StudentRef for temporal analysis)
df = pd.read_csv('../data/merged_cleaned_encoded.csv', low_memory=False)

print(f"Dataset shape: {df.shape}")
print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
print(f"\nUnique students: {df['StudentRef'].nunique():,}")
print(f"Average records per student: {len(df) / df['StudentRef'].nunique():.2f}")

df.head()

Dataset shape: (538147, 56)
Rows: 538,147 | Columns: 56

Unique students: 12,207
Average records per student: 44.09


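|   | Applicant | Application | Created date | Submitted date | Offer type | Offer course name | StudentRef | Latest Education Level | Education - Block 1: Level of education | Education - Block 2: Level of education | ... | GPA_y | CGPA_y | Subject Credit | Calculate toward Graduation Criteria? | Course offering plan name | Student Type | Student Status | Nationality | Admission Year | Program |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 557 | 470 | 15/01/2018 5:20 | 02/06/2018 18:15 | 3 | 2 | Sd25fcbb18e84f890 | 4 | 4 | 2 | ... | NaN | NaN | NaN | NaN | 2 | NaN | 5 | 0 | 7 | 9 |
| 1 | 566 | 473 | 15/01/2018 19:28 | 14/08/2018 12:33 | 3 | 5 | Sfd5f545f824e3b45 | 4 | 4 | 2 | ... | NaN | NaN | NaN | NaN | 2 | NaN | 5 | 0 | 7 | 9 |
| 2 | 569 | 479 | 15/01/2018 21:10 | 14/08/2018 21:05 | 3 | 4 | Se4a2f9bcf28873f3 | 4 | 4 | 2 | ... | NaN | NaN | NaN | NaN | 2 | NaN | 5 | 24 | 7 | 9 |
| 3 | 572 | 482 | 15/01/2018 21:19 | 16/08/2018 21:04 | 3 | 1 | S2c5748435b37f518 | 4 | 4 | 2 | ... | NaN | NaN | NaN | NaN | 2 | NaN | 5 | 0 | 7 | 9 |
| 4 | 575 | 485 | 15/01/2018 23:47 | 21/01/2018 2:16 | 3 | 6 | Sa3d7b10c3d22ffe0 | 2 | 4 | 2 | ... | NaN | NaN | NaN | NaN | 2 | NaN | 5 | 17 | 7 | 9 |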


In [3]:
# Check key columns
print("Key Academic Columns:")
academic_cols = [c for c in df.columns if 'GPA' in c or 'Grade' in c or 'Mark' in c]
print(academic_cols[:10])

print("\nTemporal Columns:")
temporal_cols = [c for c in df.columns if 'Semester' in c or 'Year' in c or 'Yeargroup' in c]
print(temporal_cols)

print("\nMissing values in key columns:")
print(df[['StudentRef', 'GPA_y', 'CGPA_y', 'Mark', 'Yeargroup']].isnull().sum())

Key Academic Columns:
['GPA_x', 'CGPA_x', 'Mark', 'Grade', 'Grade point', 'Grade system', 'GPA_y', 'CGPA_y']

Temporal Columns:
['Extra question: Exam Year', 'Yeargroup', 'Semester/Year_x', 'Academic Year_x', 'Semester/Year_y', 'Academic Year_y', 'Admission Year']

Missing values in key columns:
StudentRef        0
GPA_y         10449
CGPA_y        10449
Mark          13777
Yeargroup     10302
dtype: int64


## 2.5. Merge Clustering Results from Prosit 2

We'll merge the clustering assignments from Prosit 2 to enrich our features.

In [4]:
# Load clustering results from Prosit 2
print("Loading clustering results...")
df_cluster = pd.read_csv('../results/prosit 2/clustering_results.csv')

print(f"Clustering data shape: {df_cluster.shape}")
print(f"\nClustering columns available:")
cluster_cols = [c for c in df_cluster.columns if 'Cluster' in c]
print(cluster_cols)

# Check overlap with main data
common_cols = list(set(df.columns) & set(df_cluster.columns))
print(f"\nCommon columns for merging: {len(common_cols)}")

Loading clustering results...
Clustering data shape: (538147, 36)

Clustering columns available:
['KMeans_Cluster', 'Hierarchical_Cluster', 'DBSCAN_Cluster', 'GMM_Cluster']

Common columns for merging: 32


In [5]:
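# Merge clustering results with main data
# We match records on key academic columns. These keys are not unique per
# record, so the left join below is many-to-many and multiplies rows
# (compare the merged and original shapes printed after the merge).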
merge_keys = ['Mark', 'GPA_y', 'CGPA_y', 'Grade point', 'Subject Credit', 
              'Yeargroup', 'Semester/Year_y', 'Academic Year_y']

# Keep only clustering columns from df_cluster
cluster_features = ['KMeans_Cluster', 'Hierarchical_Cluster', 'GMM_Cluster']
merge_cols = merge_keys + cluster_features

# Merge
df_merged = df.merge(
    df_cluster[merge_cols],
    on=merge_keys,
    how='left',
    suffixes=('', '_cluster')
)

print(f"\nMerged data shape: {df_merged.shape}")
print(f"Original data shape: {df.shape}")

# Check how many records got clustering assignments
for col in cluster_features:
    if col in df_merged.columns:
        matched = df_merged[col].notna().sum()
        print(f"{col}: {matched:,} records matched ({matched/len(df_merged)*100:.1f}%)")

# Replace original df with merged version
df = df_merged.copy()
print("\n✅ Clustering results merged successfully!")


Merged data shape: (7027249, 59)
Original data shape: (538147, 56)
KMeans_Cluster: 7,013,472 records matched (99.8%)
Hierarchical_Cluster: 7,013,472 records matched (99.8%)
GMM_Cluster: 7,013,472 records matched (99.8%)

✅ Clustering results merged successfully!
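
The merged frame has 7,027,249 rows against 538,147 in the original, which is the signature of a many-to-many join: the merge keys repeat on both sides, so every duplicate pairs with every match. The toy sketch below reproduces the effect and shows two possible guards. The row-position join assumes both files keep the same row order, which the identical row counts suggest but which is not verified in this notebook.

```python
import pandas as pd

# Toy frames with duplicated key combinations on both sides (made-up values).
left = pd.DataFrame({
    "StudentRef": ["S1", "S2", "S3"],
    "Mark": [80, 80, 65],
    "GPA_y": [3.5, 3.5, 2.1],
})
right = pd.DataFrame({
    "Mark": [80, 80, 65],
    "GPA_y": [3.5, 3.5, 2.1],
    "KMeans_Cluster": [0, 0, 1],
})
keys = ["Mark", "GPA_y"]

# Unvalidated merge: each duplicated key pairs with every match.
exploded = left.merge(right, on=keys, how="left")
print(len(left), "->", len(exploded))

# Guard 1: ask pandas to fail fast when the right-hand keys are not unique.
try:
    left.merge(right, on=keys, how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print("validated merge succeeded:", merge_ok)

# Guard 2 (assumption: identical row order in both files): align on the index.
aligned = left.join(right[["KMeans_Cluster"]])
print(len(aligned))
```

A row-count assertion right after the merge (`len(df_merged) == len(df)`) would be a cheap additional check.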


In [6]:
# Visualize clustering distributions
fig = make_subplots(rows=1, cols=3, subplot_titles=('K-Means', 'Hierarchical', 'GMM'))

for i, col in enumerate(['KMeans_Cluster', 'Hierarchical_Cluster', 'GMM_Cluster'], 1):
    if col in df.columns:
        cluster_dist = df[col].value_counts().sort_index()
        fig.add_trace(
            go.Bar(x=cluster_dist.index, y=cluster_dist.values, name=col.replace('_Cluster', '')),
            row=1, col=i
        )

fig.update_layout(
    title='Clustering Assignments Distribution',
    height=400,
    showlegend=False
)
fig.update_xaxes(title_text="Cluster ID")
fig.update_yaxes(title_text="Count")
fig.show()

## 3. Temporal Data Analysis

In [7]:
# Analyze student trajectories
student_stats = df.groupby('StudentRef').agg({
    'Mark': ['count', 'mean', 'std'],
    'GPA_y': ['mean', 'std', 'min', 'max'],
    'CGPA_y': ['mean', 'std', 'min', 'max'],
    'Semester/Year_y': ['min', 'max'],
    'Yeargroup': 'first'
}).reset_index()

student_stats.columns = ['_'.join(col).strip('_') for col in student_stats.columns.values]

print("Student-Level Statistics:")
print(f"Total students: {len(student_stats):,}")
print(f"\nRecords per student:")
print(student_stats['Mark_count'].describe())

print(f"\nAverage GPA distribution:")
print(student_stats['GPA_y_mean'].describe())

student_stats.head()

Student-Level Statistics:
Total students: 12,207

Records per student:
count    12207.000000
mean       574.545097
std       1966.434412
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      56736.000000
Name: Mark_count, dtype: float64

Average GPA distribution:
count    2166.000000
mean        2.959982
std         0.611796
min         0.000000
25%         2.632610
50%         3.044985
75%         3.406672
max         4.000000
Name: GPA_y_mean, dtype: float64


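|   | StudentRef | Mark_count | Mark_mean | Mark_std | GPA_y_mean | GPA_y_std | GPA_y_min | GPA_y_max | CGPA_y_mean | CGPA_y_std | CGPA_y_min | CGPA_y_max | Semester/Year_y_min | Semester/Year_y_max | Yeargroup_first |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S0001d2c78a76136f | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN |
| 1 | S000239c28ac17cd5 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN |
| 2 | S00023dfc670bba5d | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN |
| 3 | S000384e636664a25 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN |
| 4 | S00039f6fd1b74390 | 1224 | 81.869118 | 6.044741 | 3.805588 | 0.084686 | 3.65 | 3.9 | 3.819706 | 0.037938 | 3.78 | 3.88 | 1 | 3 | 2026.0 |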


In [8]:
# Visualize sample student trajectories
fig = go.Figure()

# Select up to 10 students with sufficient records
# (cap the sample size by the number of eligible students, not all students)
eligible = student_stats[student_stats['Mark_count'] >= 20]
sample_students = eligible.sample(min(10, len(eligible)), random_state=42)['StudentRef']

for student_id in sample_students:
    student_data = df[df['StudentRef'] == student_id].sort_values('Semester/Year_y')
    semester_gpa = student_data.groupby('Semester/Year_y')['GPA_y'].first().reset_index()
    
    fig.add_trace(go.Scatter(
        x=semester_gpa['Semester/Year_y'],
        y=semester_gpa['GPA_y'],
        mode='lines+markers',
        name=f'Student {student_id[:8]}...',
        opacity=0.7
    ))

fig.update_layout(
    title='Sample Student GPA Trajectories Over Time',
    xaxis_title='Semester',
    yaxis_title='GPA',
    height=500
)
fig.show()

## 4. Feature Engineering

In [9]:
# Create temporal features for each student-semester combination
print("Creating temporal features...")

# Sort by student and semester
df_sorted = df.sort_values(['StudentRef', 'Semester/Year_y']).reset_index(drop=True)

# Group by student and semester to get one record per student-semester
student_semester = df_sorted.groupby(['StudentRef', 'Semester/Year_y']).agg({
    'GPA_y': 'first',
    'CGPA_y': 'first',
    'Mark': 'mean',  # Average mark for that semester
    'Subject Credit': 'sum',  # Total credits
    'Yeargroup': 'first',
    'Academic Year_y': 'first'
}).reset_index()

print(f"Student-semester records: {len(student_semester):,}")
student_semester.head(10)

Creating temporal features...
Student-semester records: 15,856


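|   | StudentRef | Semester/Year_y | GPA_y | CGPA_y | Mark | Subject Credit | Yeargroup | Academic Year_y |
|---|---|---|---|---|---|---|---|---|
| 0 | S0001d2c78a76136f | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |
| 1 | S000239c28ac17cd5 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |
| 2 | S00023dfc670bba5d | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |
| 3 | S000384e636664a25 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |
| 4 | S00039f6fd1b74390 | 1 | 3.78 | 3.82 | 80.303 | 324.0 | 2026.0 | 5 |
| 5 | S00039f6fd1b74390 | 2 | 3.88 | 3.88 | 83.897222 | 486.0 | 2026.0 | 4 |
| 6 | S00039f6fd1b74390 | 3 | 3.65 | 3.78 | 78.395 | 180.0 | 2026.0 | 5 |
| 7 | S00064f4260078781 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |
| 8 | S00083696232828a8 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |
| 9 | S000bf694fde2ff22 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 |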


In [10]:
# Create historical features (lag features)
print("Creating lag features...")

# Sort by student and semester
student_semester = student_semester.sort_values(['StudentRef', 'Semester/Year_y'])

# Create lag features (previous semester performance)
student_semester['GPA_prev'] = student_semester.groupby('StudentRef')['GPA_y'].shift(1)
student_semester['CGPA_prev'] = student_semester.groupby('StudentRef')['CGPA_y'].shift(1)
student_semester['Mark_prev'] = student_semester.groupby('StudentRef')['Mark'].shift(1)

# Create trend features
student_semester['GPA_change'] = student_semester['GPA_y'] - student_semester['GPA_prev']
student_semester['CGPA_change'] = student_semester['CGPA_y'] - student_semester['CGPA_prev']

# Semester count (how many semesters completed)
student_semester['semester_count'] = student_semester.groupby('StudentRef').cumcount() + 1

print("\nFeatures created:")
print(student_semester.columns.tolist())
print(f"\nRecords with complete lag features: {student_semester['GPA_prev'].notna().sum():,}")

student_semester.head(10)

Creating lag features...

Features created:
['StudentRef', 'Semester/Year_y', 'GPA_y', 'CGPA_y', 'Mark', 'Subject Credit', 'Yeargroup', 'Academic Year_y', 'GPA_prev', 'CGPA_prev', 'Mark_prev', 'GPA_change', 'CGPA_change', 'semester_count']

Records with complete lag features: 3,649


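|   | StudentRef | Semester/Year_y | GPA_y | CGPA_y | Mark | Subject Credit | Yeargroup | Academic Year_y | GPA_prev | CGPA_prev | Mark_prev | GPA_change | CGPA_change | semester_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S0001d2c78a76136f | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |
| 1 | S000239c28ac17cd5 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |
| 2 | S00023dfc670bba5d | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |
| 3 | S000384e636664a25 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |
| 4 | S00039f6fd1b74390 | 1 | 3.78 | 3.82 | 80.303 | 324.0 | 2026.0 | 5 | NaN | NaN | NaN | NaN | NaN | 1 |
| 5 | S00039f6fd1b74390 | 2 | 3.88 | 3.88 | 83.897222 | 486.0 | 2026.0 | 4 | 3.78 | 3.82 | 80.303 | 0.1 | 0.06 | 2 |
| 6 | S00039f6fd1b74390 | 3 | 3.65 | 3.78 | 78.395 | 180.0 | 2026.0 | 5 | 3.88 | 3.88 | 83.897222 | -0.23 | -0.1 | 3 |
| 7 | S00064f4260078781 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |
| 8 | S00083696232828a8 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |
| 9 | S000bf694fde2ff22 | 0 | NaN | NaN | NaN | 0.0 | NaN | 7 | NaN | NaN | NaN | NaN | NaN | 1 |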
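
To make the lag logic above easier to follow, here is a toy sketch of `groupby(...).shift(1)`: the shift happens within each student, so one student's last semester never leaks into the next student's first row. The `GPA_mean_so_far` column is an extra, hypothetical feature (mean of all earlier semesters) that is not part of this notebook.

```python
import pandas as pd

# Toy student-semester table (made-up values), sorted by student and semester.
toy = pd.DataFrame({
    "StudentRef": ["A", "A", "A", "B", "B"],
    "semester": [1, 2, 3, 1, 2],
    "GPA": [3.0, 3.4, 3.1, 2.0, 2.5],
}).sort_values(["StudentRef", "semester"])

by_student = toy.groupby("StudentRef")["GPA"]

# Previous-semester GPA: NaN on each student's first row.
toy["GPA_prev"] = by_student.shift(1)

# Semester-over-semester change.
toy["GPA_change"] = toy["GPA"] - toy["GPA_prev"]

# Hypothetical extra feature: mean GPA over all EARLIER semesters only.
toy["GPA_mean_so_far"] = by_student.transform(
    lambda s: s.shift(1).expanding().mean()
)

print(toy)
```

Shifting before taking the expanding mean keeps the current semester out of its own history feature.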


## 5. Target Variable Definition

In [11]:
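# Define targets for CLASSIFICATION
# Note: a comparison against NaN evaluates to False, so rows with a missing
# CGPA_y / GPA_y are labelled 0 here rather than left undefined.
student_semester['Probation_Risk'] = (student_semester['CGPA_y'] < 2.0).astype(int)
student_semester['Deans_List'] = (student_semester['GPA_y'] >= 3.5).astype(int)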

print("CLASSIFICATION TARGETS:")
print("="*60)
print(f"\nProbation Risk Distribution:")
print(student_semester['Probation_Risk'].value_counts())
print(f"Percentage at risk: {student_semester['Probation_Risk'].mean()*100:.2f}%")

print(f"\nDean's List Distribution:")
print(student_semester['Deans_List'].value_counts())
print(f"Percentage eligible: {student_semester['Deans_List'].mean()*100:.2f}%")

# Define targets for REGRESSION
# We'll predict NEXT semester's GPA/CGPA
student_semester['Next_GPA'] = student_semester.groupby('StudentRef')['GPA_y'].shift(-1)
student_semester['Next_CGPA'] = student_semester.groupby('StudentRef')['CGPA_y'].shift(-1)

print("\n" + "="*60)
print("REGRESSION TARGETS:")
print(f"\nRecords with next semester data: {student_semester['Next_GPA'].notna().sum():,}")
print(f"Next GPA range: {student_semester['Next_GPA'].min():.2f} to {student_semester['Next_GPA'].max():.2f}")

CLASSIFICATION TARGETS:

Probation Risk Distribution:
Probation_Risk
0    15489
1      367
Name: count, dtype: int64
Percentage at risk: 2.31%

Dean's List Distribution:
Deans_List
0    14169
1     1687
Name: count, dtype: int64
Percentage eligible: 10.64%

REGRESSION TARGETS:

Records with next semester data: 3,649
Next GPA range: 0.00 to 4.00


## 6. Feature Selection & Preprocessing

In [13]:
# Select features for modeling
# We'll use: current performance, historical performance, trends, and temporal info.
# Clustering columns are added only if they are present in df_complete.

# Remove rows with missing lag features (first semester for each student)
df_complete = student_semester.dropna(subset=['GPA_prev', 'CGPA_prev']).copy()
print(f"\nRecords with complete features: {len(df_complete):,}")

feature_cols = [
    'GPA_y', 'CGPA_y', 'Mark', 'Subject Credit',
    'GPA_prev', 'CGPA_prev', 'Mark_prev',
    'GPA_change', 'CGPA_change',
    'semester_count', 'Yeargroup', 'Academic Year_y'
]

# Add clustering features if available
clustering_cols = ['KMeans_Cluster', 'Hierarchical_Cluster', 'GMM_Cluster']
for col in clustering_cols:
    if col in df_complete.columns:
        feature_cols.append(col)
        print(f"✅ Added {col} to features")

print(f"\nTotal features selected: {len(feature_cols)}")
print(feature_cols)


Records with complete features: 3,649

Total features selected: 12
['GPA_y', 'CGPA_y', 'Mark', 'Subject Credit', 'GPA_prev', 'CGPA_prev', 'Mark_prev', 'GPA_change', 'CGPA_change', 'semester_count', 'Yeargroup', 'Academic Year_y']
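
Only 12 features are selected above, and no cluster column is among them: the student-semester aggregation in Section 4 does not carry the cluster labels forward, so the `if col in df_complete.columns` check never passes. One possible way to bring them along is to take the most frequent label per student-semester. The sketch below uses toy data, and the helper `most_common` is a hypothetical name.

```python
import pandas as pd


def most_common(values: pd.Series):
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = values.dropna().mode()
    return modes.iloc[0] if not modes.empty else float("nan")


# Toy course-level rows with a cluster label per row (made-up values).
course_rows = pd.DataFrame({
    "StudentRef": ["A", "A", "A", "B", "B"],
    "Semester/Year_y": [1, 1, 1, 1, 1],
    "KMeans_Cluster": [2, 2, 0, 1, None],
})

# One cluster label per student-semester.
cluster_per_semester = (
    course_rows
    .groupby(["StudentRef", "Semester/Year_y"])["KMeans_Cluster"]
    .agg(most_common)
    .reset_index()
)
print(cluster_per_semester)
```

The resulting frame could then be merged into `student_semester` on `StudentRef` and `Semester/Year_y`, which is a one-to-one key at that level.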


# Part 2: Classification Models (Probation Risk)

## 7. Data Preparation for Classification

In [16]:
# Combine features and target for consistent NaN handling
df_classification_prepared = df_complete[feature_cols + ['Probation_Risk']].dropna()

X_class = df_classification_prepared[feature_cols].copy()
y_class = df_classification_prepared['Probation_Risk'].copy()

print(f"Classification dataset size: {len(X_class):,}")
print(f"Target distribution:\n{y_class.value_counts()}")

# Train-test split (stratified)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class,
    test_size=0.2,
    random_state=42,
    stratify=y_class
)

# Scale features
scaler_c = StandardScaler()
X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)

print(f"\nTrain set: {len(X_train_c):,}")
print(f"Test set: {len(X_test_c):,}")
print("✅ Classification data prepared!")

Classification dataset size: 3,638
Target distribution:
Probation_Risk
0    3426
1     212
Name: count, dtype: int64

Train set: 2,910
Test set: 728
✅ Classification data prepared!
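
Because each student contributes several semester rows, a plain random split can place the same student in both the training and the test set. A group-aware split is one way to avoid that. The sketch below uses synthetic data, with `groups` standing in for `StudentRef`. Note that `GroupShuffleSplit` does not stratify on the target; scikit-learn's `StratifiedGroupKFold` is an alternative when class balance matters.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
n_rows = 200

# Synthetic stand-ins: 50 "students", each with several rows.
groups = rng.integers(0, 50, size=n_rows)
X = rng.normal(size=(n_rows, 3))
y = rng.integers(0, 2, size=n_rows)

# test_size here is the share of GROUPS held out, not the share of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No student appears on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print("students in both sets:", len(shared))
```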


## 8. Baseline Classification Model

In [17]:
# Baseline Logistic Regression
baseline_c = LogisticRegression(random_state=42, max_iter=1000, n_jobs=-1)
baseline_c.fit(X_train_c_scaled, y_train_c)

y_pred_c = baseline_c.predict(X_test_c_scaled)
y_proba_c = baseline_c.predict_proba(X_test_c_scaled)[:, 1]

print("Baseline Classification Results:")
print("="*60)
print(classification_report(y_test_c, y_pred_c, target_names=['Not at Risk', 'At Risk']))
print(f"\nROC-AUC: {roc_auc_score(y_test_c, y_proba_c):.4f}")

Baseline Classification Results:
              precision    recall  f1-score   support

 Not at Risk       1.00      1.00      1.00       686
     At Risk       1.00      0.93      0.96        42

    accuracy                           1.00       728
   macro avg       1.00      0.96      0.98       728
weighted avg       1.00      1.00      1.00       728


ROC-AUC: 0.9999


## 9. Advanced Classification Models

In [18]:
# Train multiple classification models
print("Training classification models...")

# Ridge (L2)
ridge_c = LogisticRegressionCV(penalty='l2', cv=5, random_state=42, max_iter=1000, n_jobs=-1)
ridge_c.fit(X_train_c_scaled, y_train_c)
print("✅ Ridge trained")

# Lasso (L1)
lasso_c = LogisticRegressionCV(penalty='l1', solver='saga', cv=5, random_state=42, max_iter=2000, n_jobs=-1)
lasso_c.fit(X_train_c_scaled, y_train_c)
print("✅ Lasso trained")

# Random Forest
rf_c = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
rf_c.fit(X_train_c_scaled, y_train_c)
print("✅ Random Forest trained")

# Gradient Boosting
gb_c = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb_c.fit(X_train_c_scaled, y_train_c)
print("✅ Gradient Boosting trained")

print("\nAll classification models trained!")

Training classification models...
✅ Ridge trained
✅ Lasso trained
✅ Random Forest trained
✅ Gradient Boosting trained

All classification models trained!


## 10. Classification Model Comparison

In [19]:
# Evaluate all classification models
models_c = {
    'Baseline': baseline_c,
    'Ridge': ridge_c,
    'Lasso': lasso_c,
    'Random Forest': rf_c,
    'Gradient Boosting': gb_c
}

results_c = []
for name, model in models_c.items():
    y_pred = model.predict(X_test_c_scaled)
    y_proba = model.predict_proba(X_test_c_scaled)[:, 1] if hasattr(model, 'predict_proba') else model.decision_function(X_test_c_scaled)
    
    results_c.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test_c, y_pred),
        'Precision': precision_score(y_test_c, y_pred),
        'Recall': recall_score(y_test_c, y_pred),
        'F1-Score': f1_score(y_test_c, y_pred),
        'ROC-AUC': roc_auc_score(y_test_c, y_proba)
    })

comparison_c = pd.DataFrame(results_c).sort_values('F1-Score', ascending=False)
print("Classification Model Performance:")
print("="*80)
print(comparison_c.to_string(index=False))
print(f"\n🏆 Best Model: {comparison_c.iloc[0]['Model']}")

Classification Model Performance:
            Model  Accuracy  Precision   Recall  F1-Score  ROC-AUC
    Random Forest  1.000000        1.0 1.000000  1.000000 1.000000
Gradient Boosting  1.000000        1.0 1.000000  1.000000 1.000000
            Ridge  0.998626        1.0 0.976190  0.987952 1.000000
         Baseline  0.995879        1.0 0.928571  0.962963 0.999931
            Lasso  0.995879        1.0 0.928571  0.962963 1.000000

🏆 Best Model: Random Forest


In [20]:
# Visualize classification results
fig = go.Figure()

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
for metric in metrics:
    fig.add_trace(go.Bar(
        name=metric,
        x=comparison_c['Model'],
        y=comparison_c[metric],
        text=comparison_c[metric].round(3),
        textposition='auto'
    ))

fig.update_layout(
    title='Classification Model Performance Comparison',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    height=500,
    yaxis=dict(range=[0, 1])
)
fig.show()
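
The perfect scores in the comparison above have a simple explanation: `Probation_Risk` is defined as `CGPA_y < 2.0`, and `CGPA_y` is itself one of the input features, so the models only have to rediscover a threshold. The sketch below illustrates this on synthetic data and then shows one possible leakage-free framing, which predicts probation in the next semester from the current one. The column names `Next_CGPA` and `Next_Probation` are assumptions, not part of this notebook.

```python
import numpy as np
import pandas as pd

# 1) Same-semester label: a deterministic function of a feature.
rng = np.random.default_rng(0)
cgpa = rng.uniform(1.0, 4.0, size=500)
same_semester_label = (cgpa < 2.0).astype(int)
rule_prediction = (cgpa < 2.0).astype(int)  # no model needed
rule_accuracy = float((rule_prediction == same_semester_label).mean())
print("accuracy of the threshold rule:", rule_accuracy)

# 2) Forward-looking label: probation status in the NEXT semester.
panel = pd.DataFrame({
    "StudentRef": ["A", "A", "A", "B", "B"],
    "semester": [1, 2, 3, 1, 2],
    "CGPA_y": [2.5, 2.1, 1.9, 3.0, 3.2],
})
panel["Next_CGPA"] = panel.groupby("StudentRef")["CGPA_y"].shift(-1)

# Drop each student's last semester, where the future is unknown,
# instead of letting NaN silently become "not at risk".
labelled = panel.dropna(subset=["Next_CGPA"]).copy()
labelled["Next_Probation"] = (labelled["Next_CGPA"] < 2.0).astype(int)
print(labelled)
```

With a forward-looking target, current-semester `GPA_y` and `CGPA_y` become legitimate predictors, and the reported metrics would reflect genuine early-warning performance.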

# Part 3: Regression Models (GPA Prediction)

## 11. Data Preparation for Regression

In [21]:
# Prepare data for Next GPA regression
df_regression = df_complete.dropna(subset=['Next_GPA']).copy()

X_reg = df_regression[feature_cols].copy()
y_reg = df_regression['Next_GPA'].copy()

print(f"Regression dataset size: {len(X_reg):,}")
print(f"Target (Next GPA) statistics:")
print(y_reg.describe())

# Train-test split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg,
    test_size=0.2,
    random_state=42
)

# Scale features
scaler_r = StandardScaler()
X_train_r_scaled = scaler_r.fit_transform(X_train_r)
X_test_r_scaled = scaler_r.transform(X_test_r)

print(f"\nTrain set: {len(X_train_r):,}")
print(f"Test set: {len(X_test_r):,}")
print("✅ Regression data prepared!")

Regression dataset size: 1,516
Target (Next GPA) statistics:
count    1516.000000
mean        3.011234
std         0.827914
min         0.000000
25%         2.500000
50%         3.170000
75%         3.560000
max         4.000000
Name: Next_GPA, dtype: float64

Train set: 1,212
Test set: 304
✅ Regression data prepared!


## 12. Baseline Regression Model

In [22]:
# Baseline Linear Regression
baseline_r = LinearRegression()
baseline_r.fit(X_train_r_scaled, y_train_r)

y_pred_r = baseline_r.predict(X_test_r_scaled)

print("Baseline Regression Results:")
print("="*60)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_r, y_pred_r)):.4f}")
print(f"MAE:  {mean_absolute_error(y_test_r, y_pred_r):.4f}")
print(f"R²:   {r2_score(y_test_r, y_pred_r):.4f}")

Baseline Regression Results:
RMSE: 0.6714
MAE:  0.4811
R²:   0.2534


## 13. Advanced Regression Models

In [23]:
# Train multiple regression models
print("Training regression models...")

# Ridge
ridge_r = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)
ridge_r.fit(X_train_r_scaled, y_train_r)
print(f"✅ Ridge trained (alpha={ridge_r.alpha_:.4f})")

# Lasso
lasso_r = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5, random_state=42, n_jobs=-1)
lasso_r.fit(X_train_r_scaled, y_train_r)
print(f"✅ Lasso trained (alpha={lasso_r.alpha_:.4f})")

# Elastic Net
elastic_r = ElasticNetCV(alphas=np.logspace(-3, 1, 10), l1_ratio=[0.1, 0.5, 0.7, 0.9], cv=5, random_state=42, n_jobs=-1)
elastic_r.fit(X_train_r_scaled, y_train_r)
print(f"✅ Elastic Net trained (alpha={elastic_r.alpha_:.4f}, l1_ratio={elastic_r.l1_ratio_:.2f})")

# Random Forest
rf_r = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
rf_r.fit(X_train_r_scaled, y_train_r)
print("✅ Random Forest trained")

# Gradient Boosting
gb_r = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb_r.fit(X_train_r_scaled, y_train_r)
print("✅ Gradient Boosting trained")

print("\nAll regression models trained!")

Training regression models...
✅ Ridge trained (alpha=6.2506)
✅ Lasso trained (alpha=0.0010)
✅ Elastic Net trained (alpha=0.0028, l1_ratio=0.10)
✅ Random Forest trained
✅ Gradient Boosting trained

All regression models trained!


## 14. Regression Model Comparison

In [24]:
# Evaluate all regression models
models_r = {
    'Linear Regression': baseline_r,
    'Ridge': ridge_r,
    'Lasso': lasso_r,
    'Elastic Net': elastic_r,
    'Random Forest': rf_r,
    'Gradient Boosting': gb_r
}

results_r = []
for name, model in models_r.items():
    y_pred = model.predict(X_test_r_scaled)
    
    results_r.append({
        'Model': name,
        'RMSE': np.sqrt(mean_squared_error(y_test_r, y_pred)),
        'MAE': mean_absolute_error(y_test_r, y_pred),
        'R²': r2_score(y_test_r, y_pred)
    })

comparison_r = pd.DataFrame(results_r).sort_values('R²', ascending=False)
print("Regression Model Performance:")
print("="*80)
print(comparison_r.to_string(index=False))
print(f"\n🏆 Best Model: {comparison_r.iloc[0]['Model']}")

Regression Model Performance:
            Model     RMSE      MAE       R²
    Random Forest 0.657020 0.470514 0.285053
            Lasso 0.670722 0.480329 0.254921
      Elastic Net 0.670997 0.480822 0.254310
            Ridge 0.671213 0.481137 0.253830
Linear Regression 0.671396 0.481083 0.253423
Gradient Boosting 0.698372 0.485308 0.192225

🏆 Best Model: Random Forest


In [25]:
# Visualize regression results
fig = make_subplots(rows=1, cols=3, subplot_titles=('RMSE (lower is better)', 'MAE (lower is better)', 'R² (higher is better)'))

fig.add_trace(go.Bar(x=comparison_r['Model'], y=comparison_r['RMSE'], name='RMSE'), row=1, col=1)
fig.add_trace(go.Bar(x=comparison_r['Model'], y=comparison_r['MAE'], name='MAE'), row=1, col=2)
fig.add_trace(go.Bar(x=comparison_r['Model'], y=comparison_r['R²'], name='R²'), row=1, col=3)

fig.update_layout(
    title='Regression Model Performance Comparison',
    height=500,
    showlegend=False
)
fig.update_xaxes(tickangle=45)
fig.show()
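
To put the regression scores above in context, it can help to compare them with a naive "persistence" baseline that predicts next semester's GPA to be the same as the current one. The helper below is a hypothetical sketch on toy numbers. In this notebook it could be called as `persistence_baseline(X_test_r['GPA_y'], y_test_r)`, since `X_test_r` is the unscaled test frame.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def persistence_baseline(current_gpa, next_gpa):
    """Score the naive forecast 'next GPA equals current GPA'."""
    current_gpa = np.asarray(current_gpa, dtype=float)
    next_gpa = np.asarray(next_gpa, dtype=float)
    rmse = float(np.sqrt(mean_squared_error(next_gpa, current_gpa)))
    r2 = float(r2_score(next_gpa, current_gpa))
    return {"RMSE": rmse, "R2": r2}


# Toy values (made up) for four students.
scores = persistence_baseline(
    current_gpa=[3.0, 2.5, 3.8, 1.9],
    next_gpa=[3.2, 2.4, 3.6, 2.3],
)
print(scores)
```

A trained model only adds value to the extent that it beats this baseline on the same test rows.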

## 15. Prediction Visualization

In [27]:
# Visualize predictions vs actual for ALL regression models
from plotly.subplots import make_subplots

# Create 2x3 subplot grid for 6 models
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=[name for name in models_r.keys()],
    specs=[[{'type': 'scatter'}]*3, [{'type': 'scatter'}]*3]
)

# Get min/max for consistent axes
all_actuals = y_test_r
all_predictions = []
for model in models_r.values():
    all_predictions.extend(model.predict(X_test_r_scaled))

min_val = min(all_actuals.min(), min(all_predictions))
max_val = max(all_actuals.max(), max(all_predictions))

# Plot each model
positions = [(1,1), (1,2), (1,3), (2,1), (2,2), (2,3)]
for (row, col), (name, model) in zip(positions, models_r.items()):
    y_pred = model.predict(X_test_r_scaled)
    
    # Scatter plot of predictions
    fig.add_trace(
        go.Scatter(
            x=y_test_r,
            y=y_pred,
            mode='markers',
            name=name,
            marker=dict(size=3, opacity=0.5),
            showlegend=False,
            text=[f'Actual: {a:.2f}<br>Predicted: {p:.2f}' for a, p in zip(y_test_r, y_pred)],
            hovertemplate='%{text}<extra></extra>'
        ),
        row=row, col=col
    )
    
    # Perfect prediction line
    fig.add_trace(
        go.Scatter(
            x=[min_val, max_val],
            y=[min_val, max_val],
            mode='lines',
            line=dict(dash='dash', color='red', width=1),
            showlegend=False
        ),
        row=row, col=col
    )
    
    # Add R² score as annotation
    r2 = r2_score(y_test_r, y_pred)
    fig.add_annotation(
        text=f'R² = {r2:.3f}',
        xref=f'x{(row-1)*3+col}', yref=f'y{(row-1)*3+col}',
        x=min_val + 0.1, y=max_val - 0.2,
        showarrow=False,
        font=dict(size=10, color='black'),
        bgcolor='rgba(255,255,255,0.8)',
        row=row, col=col
    )

# Update layout
fig.update_layout(
    title_text='Actual vs Predicted GPA - All Regression Models',
    height=800,
    showlegend=False
)

# Update axes labels
for i in range(1, 7):
    fig.update_xaxes(title_text='Actual Next Semester GPA', row=(i-1)//3+1, col=(i-1)%3+1)
    fig.update_yaxes(title_text='Predicted GPA', row=(i-1)//3+1, col=(i-1)%3+1)

fig.show()

# Part 4: Model Persistence & Summary

## 16. Save Models

In [28]:
# Create models directory
models_dir = '../models/prosit_3_enhanced'
os.makedirs(models_dir, exist_ok=True)

print("Saving models...")

# Save classification models
for name, model in models_c.items():
    filename = f"{models_dir}/{name.lower().replace(' ', '_')}_classifier.pkl"
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    print(f"✅ Saved {filename}")

# Save regression models
for name, model in models_r.items():
    filename = f"{models_dir}/{name.lower().replace(' ', '_')}_regressor.pkl"
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    print(f"✅ Saved {filename}")

# Save scalers
with open(f"{models_dir}/scaler_classification.pkl", 'wb') as f:
    pickle.dump(scaler_c, f)
with open(f"{models_dir}/scaler_regression.pkl", 'wb') as f:
    pickle.dump(scaler_r, f)

# Save feature names
with open(f"{models_dir}/feature_names.pkl", 'wb') as f:
    pickle.dump(feature_cols, f)

print("\n✅ All models and artifacts saved!")
print(f"Location: {models_dir}")

Saving models...
✅ Saved ../models/prosit_3_enhanced/baseline_classifier.pkl
✅ Saved ../models/prosit_3_enhanced/ridge_classifier.pkl
✅ Saved ../models/prosit_3_enhanced/lasso_classifier.pkl
✅ Saved ../models/prosit_3_enhanced/random_forest_classifier.pkl
✅ Saved ../models/prosit_3_enhanced/gradient_boosting_classifier.pkl
✅ Saved ../models/prosit_3_enhanced/linear_regression_regressor.pkl
✅ Saved ../models/prosit_3_enhanced/ridge_regressor.pkl
✅ Saved ../models/prosit_3_enhanced/lasso_regressor.pkl
✅ Saved ../models/prosit_3_enhanced/elastic_net_regressor.pkl
✅ Saved ../models/prosit_3_enhanced/random_forest_regressor.pkl
✅ Saved ../models/prosit_3_enhanced/gradient_boosting_regressor.pkl

✅ All models and artifacts saved!
Location: ../models/prosit_3_enhanced
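
Saved artifacts are only useful if they can be restored together: the scaler and the model have to be loaded as a pair and applied in the same order as during training. The round trip below is a sketch on synthetic data with a temporary directory; the file names are illustrative and do not match the paths above. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([0.5, -0.2]) + 3.0

scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Save both artifacts, as the notebook does for its models and scalers.
    for name, obj in {"scaler": scaler, "model": model}.items():
        with open(tmp_dir / f"{name}.pkl", "wb") as f:
            pickle.dump(obj, f)

    # Restore them and reproduce the training-time pipeline order.
    with open(tmp_dir / "scaler.pkl", "rb") as f:
        loaded_scaler = pickle.load(f)
    with open(tmp_dir / "model.pkl", "rb") as f:
        loaded_model = pickle.load(f)

new_rows = X[:3]
restored_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
original_pred = model.predict(scaler.transform(new_rows))
print(np.allclose(restored_pred, original_pred))
```

Wrapping the scaler and estimator in a single scikit-learn `Pipeline` before pickling would remove the risk of pairing a model with the wrong scaler.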


## 17. Final Summary

In [29]:
# Final summary
print("="*80)
print("PROSIT 3: SUPERVISED LEARNING - FINAL SUMMARY")
print("="*80)

print("\n📊 CLASSIFICATION (Probation Risk Prediction):")
print(f"   Best Model: {comparison_c.iloc[0]['Model']}")
print(f"   F1-Score: {comparison_c.iloc[0]['F1-Score']:.4f}")
print(f"   ROC-AUC: {comparison_c.iloc[0]['ROC-AUC']:.4f}")

print("\n📈 REGRESSION (Next Semester GPA Prediction):")
print(f"   Best Model: {comparison_r.iloc[0]['Model']}")
print(f"   R²: {comparison_r.iloc[0]['R²']:.4f}")
print(f"   RMSE: {comparison_r.iloc[0]['RMSE']:.4f}")

print("\n✅ Models trained: 11 (5 classification + 6 regression)")
print(f"✅ Models saved to: {models_dir}")
print(f"✅ Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n" + "="*80)
print("🎓 Ready for deployment and ethical use!")
print("="*80)

PROSIT 3: SUPERVISED LEARNING - FINAL SUMMARY

📊 CLASSIFICATION (Probation Risk Prediction):
   Best Model: Random Forest
   F1-Score: 1.0000
   ROC-AUC: 1.0000

📈 REGRESSION (Next Semester GPA Prediction):
   Best Model: Random Forest
   R²: 0.2851
   RMSE: 0.6570

✅ Models trained: 11 (5 classification + 6 regression)
✅ Models saved to: ../models/prosit_3_enhanced
✅ Timestamp: 2025-12-15 01:26:03

🎓 Ready for deployment and ethical use!


## Ethical Considerations

### Responsible AI Deployment

Before deploying these models:

1. **Human-in-the-Loop**: Use predictions to inform advisors, not make automatic decisions
2. **Transparency**: Explain predictions to students and advisors
3. **Positive Framing**: Frame as "eligible for support" not "at-risk"
4. **Regular Audits**: Monitor for bias and model drift
5. **Privacy**: Protect student data and comply with applicable privacy regulations (e.g., FERPA where it applies)
6. **Opt-Out**: Allow students to decline intervention
7. **Fairness**: Regularly check performance across demographic groups
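
The fairness check in item 7 can be made concrete by computing a metric separately for each group. The sketch below reports recall per group on toy labels; the helper `recall_by_group` and the group values are hypothetical, and which attributes to audit is a policy decision outside this notebook.

```python
import pandas as pd
from sklearn.metrics import recall_score


def recall_by_group(y_true, y_pred, group) -> pd.Series:
    """Recall of the positive class, computed separately for each group."""
    frame = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": group})
    return pd.Series({
        name: recall_score(part["y_true"], part["y_pred"], zero_division=0)
        for name, part in frame.groupby("group")
    })


# Toy labels and predictions (made up) for two groups of three students.
group_recall = recall_by_group(
    y_true=[1, 1, 0, 1, 1, 0],
    y_pred=[1, 1, 0, 1, 0, 0],
    group=["a", "a", "a", "b", "b", "b"],
)
print(group_recall)
```

A large gap between groups would mean that at-risk students in one group are missed more often than in another.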

### Limitations

- Models are based on historical data and may not capture all factors affecting student success
- Temporal features assume consistent semester progression
- First-semester records have no lag features and are dropped, so the models cannot score students in their first semester
- External factors (personal circumstances, health, etc.) are not captured

### Recommendations

- Use models as **decision support tools**, not decision makers
- Combine predictions with advisor expertise and student input
- Regularly retrain models with new data
- Monitor prediction accuracy and update as needed
- Ensure diverse representation in training data