# 🤖 Step 3: Model Training and Evaluation

Goal: Train regression models on the student performance data and select the one with the lowest test-set RMSE.

**Tasks:**
1. Load training and testing data from `data/processed/`.
2. Train the following models:
   - Linear Regression
   - Random Forest Regressor
   - XGBoost Regressor (optional)
3. Evaluate using:
   - Mean Absolute Error (MAE)
   - Root Mean Squared Error (RMSE)
   - R-squared (R² Score)
4. Compare performance across models.
5. Select and save the best model to `models/best_model.pkl`.
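
As a reference for the three metrics in task 3, here is a minimal sketch that computes them by hand using only the standard library. The function name `regression_metrics` and the sample values are made up for illustration; the notebook itself relies on the `sklearn.metrics` functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the average squared error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Hypothetical values, not taken from the student dataset
y_true = [10, 12, 14, 16]
y_pred = [11, 12, 13, 18]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower MAE and RMSE are better, while R² is better the closer it is to 1. RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.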

**Bonus:**
- Use joblib or pickle for saving the model.
- Include visualizations like predicted vs actual scatter plots.
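
One possible way to draw the predicted vs actual scatter plot mentioned in the bonus is sketched below. The helper name `plot_predicted_vs_actual` and the synthetic demo data are assumptions for illustration only; they are not part of this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_predicted_vs_actual(y_true, y_pred, title):
    """Scatter predictions against true values, with a y = x reference line."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, alpha=0.6)

    # Points on the dashed diagonal are perfect predictions
    lo = min(np.min(y_true), np.min(y_pred))
    hi = max(np.max(y_true), np.max(y_pred))
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")

    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    ax.set_title(title)
    return fig, ax


# Synthetic demo data (hypothetical, not the student dataset)
rng = np.random.default_rng(0)
y_true_demo = rng.uniform(0, 20, size=50)
y_pred_demo = y_true_demo + rng.normal(0, 2, size=50)
fig, ax = plot_predicted_vs_actual(y_true_demo, y_pred_demo, "Demo model")
```

In this notebook the same helper could presumably be called once per model, for example `plot_predicted_vs_actual(y_test, results[name]['y_pred'], name)` inside a loop over `results`.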


In [None]:
# 1. Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import os
try:
    from xgboost import XGBRegressor
    xgb_installed = True
except ImportError:
    xgb_installed = False
    print('XGBoost not installed, skipping XGBoost Regressor.')




In [26]:
# 2. Load training and testing data
X_train = pd.read_csv('../data/processed/X_train.csv')
X_test = pd.read_csv('../data/processed/X_test.csv')
y_train = pd.read_csv('../data/processed/y_train.csv').squeeze()
y_test = pd.read_csv('../data/processed/y_test.csv').squeeze()
print('Data loaded.')
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)

Data loaded.
X_train shape: (316, 41)
y_train shape: (316,)


In [27]:
# 3. Train and evaluate models
results = {}
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42)
}
if xgb_installed:
    models['XGBoost'] = XGBRegressor(random_state=42)

for name, model in models.items():
    print(f'\nTraining {name}...')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    # RMSE = sqrt(MSE), computed by hand so it runs on any scikit-learn version
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    results[name] = {'model': model, 'MAE': mae, 'RMSE': rmse, 'R2': r2, 'y_pred': y_pred}
    print(f'{name} - MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}')


Training Linear Regression...
Linear Regression - MAE: 1.647, RMSE: 2.378, R2: 0.724

Training Random Forest...
Random Forest - MAE: 1.181, RMSE: 1.957, R2: 0.813

Training XGBoost...
XGBoost - MAE: 1.196, RMSE: 2.133, R2: 0.778


In [28]:
# 4. Compare performance and select the best model

# Create a summary DataFrame
summary = pd.DataFrame({k: {'MAE': v['MAE'], 'RMSE': v['RMSE'], 'R2': v['R2']} for k, v in results.items()}).T
print('\nModel Performance Summary:')
print(summary)

# Select the best model (lowest RMSE)
best_model_name = summary['RMSE'].idxmin()
best_model = results[best_model_name]['model']
print(f'\nBest model: {best_model_name}')

# Save the best model
os.makedirs('../models', exist_ok=True)
joblib.dump(best_model, '../models/best_model.pkl')
print('Best model saved to ../models/best_model.pkl')


Model Performance Summary:
                        MAE      RMSE        R2
Linear Regression  1.646666  2.378370  0.724134
Random Forest      1.180506  1.957487  0.813131
XGBoost            1.196146  2.133497  0.778015

Best model: Random Forest
Best model saved to ../models/best_model.pkl
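
To check that a saved model can be used later, it can be reloaded with `joblib.load` and asked for predictions. The sketch below is self-contained and uses a tiny made-up dataset and a temporary directory; in this project the path would presumably be `../models/best_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset (y = 2x + 1), only to have a fitted model to save
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")
    joblib.dump(model, path)       # save the fitted estimator
    reloaded = joblib.load(path)   # load it back from disk

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickled estimators are generally only reliable when loaded with the same library versions used to save them, so it may be worth recording the scikit-learn version alongside the file.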
