# Step 3: Build Model
## 1. Model Selection
The target variable in this project is `score`, a continuous engagement score for each student. Because the outcome is numeric rather than a discrete class label, this is a **regression problem**, not a classification problem.

The goal is to predict the engagement score based on early behavioural and demographic features. Using regression allows the model to capture gradual differences in engagement levels instead of forcing students into a small number of discrete categories.

To explore different modelling approaches and compare performance, three supervised regression models will be tested:

1. **Linear Regression** – a simple, interpretable baseline model that assumes a linear relationship between features and engagement score.
2. **k-Nearest Neighbors Regressor (KNN)** – a non-parametric model that predicts scores based on similar students in the feature space, capturing local patterns.
3. **Random Forest Regressor** – an ensemble of decision trees that can model non-linear relationships and interactions between features, often providing strong performance on tabular data.

These models cover a range from simple and interpretable (linear regression) to more flexible and powerful (Random Forest), allowing for a meaningful comparison of bias–variance trade-offs.


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load the cleaned, preprocessed dataset
df = pd.read_csv("../data/cleaned_student_data.csv")

# 2. Define target (y)
target_col = "score"
y = df[target_col]

# 3. Define features (X): drop target and keep only numeric columns for modelling
X = df.drop(columns=[target_col])
X = X.select_dtypes(include=["int64", "float64"])

print("Final numeric feature columns used for modelling:")
print(X.columns.tolist())

print("\nShape of feature matrix X:", X.shape)
print("Shape of target vector y:", y.shape)

# 4. Train–test split: hold out 20% of rows for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

print("\nTraining set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])


Final numeric feature columns used for modelling:
['age', 'logged in', 'lessons', 'assignments', 'posts', 'orientation', 'total_activity', 'has_activity', 'lessons_assignments_ratio']

Shape of feature matrix X: (500, 9)
Shape of target vector y: (500,)

Training set size: 400
Test set size: 100


### Interpretation

The cleaned dataset was loaded and split into features (`X`) and target (`y`), where `score` is the engagement outcome to be predicted. Only numeric columns were kept, giving 9 features for 500 students. A train–test split assigned 80% of the rows (400) to training and held out 20% (100) for testing. Because the models never see the held-out rows during fitting, test-set metrics estimate performance on unseen students. The split alone does not rule out data leakage: any later preprocessing that learns from the data, such as scaling, should be fitted on the training rows only. The printed shapes confirm that the data is ready for model training in the next step.


## 2. Train baseline model

A baseline model provides a simple point of comparison to understand whether more complex models offer meaningful improvements. In regression problems, baseline models help establish the minimum acceptable performance before exploring advanced approaches.

In this step, the three regressors chosen during model selection are trained with scikit-learn's default hyperparameters:
- **Linear Regression** – a simple, interpretable model that assumes a linear relationship between features and engagement score.
- **k-Nearest Neighbors Regressor (KNN)** – a non-parametric model that predicts engagement based on the average score of similar students in the feature space.
- **Random Forest Regressor** – an ensemble of decision trees that can capture non-linear patterns and interactions between variables.

Fitting all three on the same training split gives a like-for-like comparison during evaluation and a reference point for later hyperparameter tuning.
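To make the idea of a minimum acceptable performance concrete, the sketch below fits a mean-only predictor with scikit-learn's `DummyRegressor`. It is an illustration only: the data are synthetic stand-ins (an assumption, not the student dataset), and names such as `X_demo` and `dummy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 rows, 9 features, noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 9))
y_demo = X_demo @ rng.normal(size=9) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The dummy model ignores the features and always predicts the training mean.
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_tr, y_tr)
y_pred_dummy = dummy.predict(X_te)

dummy_r2 = r2_score(y_te, y_pred_dummy)
print(f"Mean-predictor R² on the held-out split: {dummy_r2:.3f}")
```

A constant prediction can never score above zero R² on held-out data, so any model worth keeping should clear that floor by a comfortable margin.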


In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Initialize the baseline models
lin_reg = LinearRegression()
knn_reg = KNeighborsRegressor()
rf_reg = RandomForestRegressor(random_state=42)

# Train all models on the training set
lin_reg.fit(X_train, y_train)
knn_reg.fit(X_train, y_train)
rf_reg.fit(X_train, y_train)

print("Baseline models trained successfully.")


Baseline models trained successfully.


### Interpretation

All three baseline models were successfully trained on the training data. These baselines will serve as reference points for performance assessment and hyperparameter tuning in the next steps.


## 3. Evaluate model performance

To compare the baseline models, each one must be evaluated on the test dataset using regression-appropriate metrics. The following metrics will be used:

- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and true values. Lower values indicate better performance, but because errors are squared, MSE is sensitive to outliers.
- **Root Mean Squared Error (RMSE)**: The square root of MSE, expressed in the same units as the target variable, which makes it easier to interpret than MSE.
- **R-squared (R²)**: Represents the proportion of variance in the target explained by the model. Higher values indicate better fit, with 1.0 being a perfect fit; on held-out data, R² is negative when a model predicts worse than simply using the mean.

These metrics provide complementary insights: MSE and RMSE measure the magnitude of prediction errors, while R² measures how much of the variation in engagement score each model explains.
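As a worked illustration of the three metrics, the sketch below computes them by hand with NumPy and checks the results against scikit-learn. The five values in `y_true` and `y_hat` are made up for the example and do not come from the student data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example (assumption): five true scores and five predictions.
y_true = np.array([2.0, 3.5, 1.0, 4.0, 2.5])
y_hat = np.array([2.5, 3.0, 1.5, 3.5, 2.0])

residuals = y_true - y_hat

# MSE: mean of the squared residuals.
mse = np.mean(residuals ** 2)

# RMSE: square root of MSE, back in the units of the target.
rmse = np.sqrt(mse)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
# prints: MSE=0.250, RMSE=0.500, R²=0.781

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

Every residual here is ±0.5, so MSE is 0.25 and RMSE is 0.5, while R² compares the residual sum of squares (1.25) with the spread of the true values around their mean (5.70).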



In [6]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd

# Generate predictions
y_pred_lin = lin_reg.predict(X_test)
y_pred_knn = knn_reg.predict(X_test)
y_pred_rf = rf_reg.predict(X_test)

# Compute metrics once per model (RMSE is the square root of MSE)
predictions = {
    "Linear Regression": y_pred_lin,
    "KNN Regressor": y_pred_knn,
    "Random Forest Regressor": y_pred_rf,
}

rows = []
for name, y_pred in predictions.items():
    mse = mean_squared_error(y_test, y_pred)
    rows.append({
        "Model": name,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R² Score": r2_score(y_test, y_pred),
    })

results_df = pd.DataFrame(rows)
display(results_df)


|   | Model | MSE | RMSE | R² Score |
|---|---|---|---|---|
| 0 | Linear Regression | 0.423249 | 0.650576 | 0.635721 |
| 1 | KNN Regressor | 0.467857 | 0.684001 | 0.597327 |
| 2 | Random Forest Regressor | 0.428382 | 0.654509 | 0.631303 |


### Interpretation

The performance table shows how each baseline model performs across multiple regression metrics.
- **Lower MSE and RMSE** indicate more accurate predictions.
- **Higher R² values** indicate that the model explains more variance in engagement score.

These baseline results provide a reference point for assessing whether tuned or more advanced models offer meaningful performance improvements.


## 4. Interpret results

The baseline evaluation highlights several important differences across the three models:

### **4.1 Linear Regression performs the best overall**
Linear Regression achieves the:
- **lowest MSE (0.423)**
- **lowest RMSE (0.651)**
- **highest R² (0.636)**

This indicates that a linear model explains roughly 64% of the variance in engagement score on the test set. That a simple model performs this well suggests that the early behavioural features (logged-in hours, lessons, assignments) have a largely linear association with engagement. The margin over Random Forest is small, however (R² of 0.636 versus 0.631 on 100 test rows), so the ranking of these two models should be read as tentative.
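One way to probe a linear association is to inspect the fitted coefficients. The sketch below is a minimal illustration on synthetic data: the column names are borrowed from the feature list, but the values and the `true_weights` are invented for the demo and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in features (assumption): random values under borrowed column names.
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(
    rng.normal(size=(400, 3)),
    columns=["logged in", "lessons", "assignments"],
)

# Invented weights used only to generate a demo target.
true_weights = np.array([0.5, 0.3, 0.2])
y_demo = X_demo.to_numpy() @ true_weights + rng.normal(scale=0.1, size=400)

model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature, ordered by absolute size.
coef_table = pd.Series(model.coef_, index=X_demo.columns).sort_values(
    key=np.abs, ascending=False
)
print(coef_table.round(3))
```

Raw coefficients are only comparable across features when those features are on similar scales, so on real data it would be safer to standardise the inputs before reading relative importance from them.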

### **4.2 Random Forest performs similarly but does not outperform Linear Regression**
Random Forest achieves:
- MSE (0.428) and RMSE (0.655) nearly identical to Linear Regression (0.423 and 0.651)
- Slightly lower R² (0.631 versus 0.636)

Random Forest can capture non-linear patterns, so the lack of any gain over the linear model suggests one or more of the following:
- the dataset (400 training rows) is too small for the ensemble to show an advantage
- the underlying relationships are not strongly non-linear
- the default hyperparameters are not well suited to this data

This model may still improve with tuning in a later step.
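A tuning step of this kind could look roughly like the following sketch, which uses `GridSearchCV` with 5-fold cross-validation. The grid values, the synthetic data, and the choice of RMSE as the scoring metric are all assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption) with a mild non-linear term.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 9))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

# Hypothetical search grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

In practice the search would be run on the training split only, so that the test set stays untouched for the final comparison against the linear baseline.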

### **4.3 KNN performs the worst**
KNN has:
- the **highest MSE (0.468)**
- the **highest RMSE (0.684)**
- the **lowest R² (0.597)**

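This suggests that engagement scores are **not well predicted by local neighbourhood structure** in the feature space, at least as the model was fitted here. KNN depends on distances between students, and the training code above applies no feature scaling. Unless the features were already standardised during preprocessing, those with larger numeric ranges dominate the distance calculation. Uninformative or redundant features dilute the distances further, which also reduces the effectiveness of KNN.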
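The sensitivity of KNN to feature scale can be illustrated with a small sketch. It compares a plain `KNeighborsRegressor` against the same model behind a `StandardScaler` in a pipeline. The two-feature dataset is synthetic and deliberately exaggerated (an assumption for the demo): one informative small-scale feature and one uninformative large-scale feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the target depends only on the small-scale feature.
rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
large_scale_noise = rng.normal(scale=1000.0, size=n)
X_demo = np.column_stack([signal, large_scale_noise])
y_demo = 2.0 * signal + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Unscaled: distances are dominated by the large-scale noise feature.
knn_raw = KNeighborsRegressor().fit(X_tr, y_tr)

# Scaled: the pipeline fits the scaler on the training rows only.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor()).fit(X_tr, y_tr)

r2_raw = knn_raw.score(X_te, y_te)
r2_scaled = knn_scaled.score(X_te, y_te)
print(f"R² without scaling: {r2_raw:.3f}")
print(f"R² with scaling:    {r2_scaled:.3f}")
```

Wrapping the scaler and the model in one pipeline also keeps the scaling statistics from being computed on test rows, which matters when the same object is later passed to cross-validation.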

### **Key Insight**
Overall, Linear Regression provides the strongest baseline, though only marginally ahead of Random Forest, which indicates that the relationship between early activity and engagement is largely linear. Random Forest may improve with hyperparameter tuning, while KNN appears less suitable for this prediction task in its current form.

These findings will guide the next step: evaluating whether tuning more flexible models (such as Random Forest) can meaningfully outperform the linear baseline.
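Because a single 100-row test set leaves small gaps between models hard to interpret, one option is to compare models across several folds. The sketch below does this with `cross_val_score` on synthetic data; the data, the fold count, and the reduced `n_estimators` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 300 rows, 9 features, noisy linear target.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 9))
y_demo = X_demo @ rng.normal(size=9) + rng.normal(scale=0.5, size=300)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Shuffled 5-fold split, shared by both models so the folds are identical.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the gap between two mean scores is smaller than the fold-to-fold standard deviation, the difference is unlikely to be meaningful.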
