# Student Performance — 03 Simple Modeling

We fit and evaluate linear regression models for the final grade (G3) under two scenarios:
1) With early-period grades (G1, G2)
2) Without early-period grades (to avoid relying on information that is unavailable early in the course)

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("../data/student-mat-clean.csv")
df.head()

,school,sex,age,address,famsize,pstatus,medu,fedu,mjob,fjob,...,famrel,freetime,goout,dalc,walc,health,absences,g1,g2,g3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [3]:
target = "g3"

# Keep only numeric columns
df_num = df.select_dtypes(include=["number"]).copy()
df_num.columns.tolist()

['age',
 'medu',
 'fedu',
 'traveltime',
 'studytime',
 'failures',
 'famrel',
 'freetime',
 'goout',
 'dalc',
 'walc',
 'health',
 'absences',
 'g1',
 'g2',
 'g3']

In [4]:
def train_eval_linear(df_features, target_col="g3", test_size=0.3, random_state=42):
    """Fit LinearRegression on a train/test split and return (model, metrics dict)."""
    X = df_features.drop(columns=[target_col])
    y = df_features[target_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    results = {
        "train_r2": r2_score(y_train, y_train_pred),
        "test_r2": r2_score(y_test, y_test_pred),
        "train_mae": mean_absolute_error(y_train, y_train_pred),
        "test_mae": mean_absolute_error(y_test, y_test_pred),
        "n_features": X.shape[1],
        "features": list(X.columns),
    }
    return model, results

## Scenario 1: Late-stage prediction (WITH G1 and G2)

G1 and G2 are earlier-period grades and are expected to be strongly associated with G3.
This scenario reflects predicting final grades when previous grades are already available.
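
The expected association can be checked directly before modeling. The sketch below is one way to do it, assuming a frame with `g1`, `g2`, and `g3` columns; the helper name `grade_correlations` and the synthetic frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def grade_correlations(df, target_col="g3", grade_cols=("g1", "g2")):
    """Pearson correlation of each earlier grade with the target grade."""
    cols = list(grade_cols) + [target_col]
    return df[cols].corr()[target_col].drop(target_col)


# Synthetic stand-in data (NOT the student dataset): each grade builds on the previous one.
rng = np.random.default_rng(0)
g1 = rng.normal(loc=10, scale=3, size=300)
g2 = g1 + rng.normal(scale=1, size=300)
g3 = g2 + rng.normal(scale=1, size=300)
demo = pd.DataFrame({"g1": g1, "g2": g2, "g3": g3})

corr_demo = grade_correlations(demo)
print(corr_demo)
```

In the notebook itself, the equivalent call would be `grade_correlations(df_num)`.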

In [5]:
df_s1 = df_num.copy()  # includes g1 and g2 by default

model_s1, res_s1 = train_eval_linear(df_s1, target_col=target)
res_s1

{'train_r2': 0.849477654359615,
 'test_r2': 0.8049063890373245,
 'train_mae': 1.124463879341646,
 'test_mae': 1.3361179348327485,
 'n_features': 15,
 'features': ['age',
  'medu',
  'fedu',
  'traveltime',
  'studytime',
  'failures',
  'famrel',
  'freetime',
  'goout',
  'dalc',
  'walc',
  'health',
  'absences',
  'g1',
  'g2']}

## Scenario 2: Early-stage prediction (WITHOUT G1 and G2)

In this scenario we exclude G1 and G2 so that the model relies only on information available before any grades are recorded.
This reflects predicting final performance without access to earlier grade outcomes.

In [7]:
df_s2 = df_num.drop(columns=["g1", "g2"], errors="raise")

model_s2, res_s2 = train_eval_linear(df_s2, target_col=target)
res_s2

{'train_r2': 0.18434876017462432,
 'test_r2': 0.1572682895050036,
 'train_mae': 3.1085571658386124,
 'test_mae': 3.3996034050802906,
 'n_features': 13,
 'features': ['age',
  'medu',
  'fedu',
  'traveltime',
  'studytime',
  'failures',
  'famrel',
  'freetime',
  'goout',
  'dalc',
  'walc',
  'health',
  'absences']}

In [8]:
comparison = pd.DataFrame([res_s1, res_s2], index=["WITH_G1_G2", "WITHOUT_G1_G2"])
comparison[["n_features", "train_r2", "test_r2", "train_mae", "test_mae"]]

,n_features,train_r2,test_r2,train_mae,test_mae
WITH_G1_G2,15,0.849478,0.804906,1.124464,1.336118
WITHOUT_G1_G2,13,0.184349,0.157268,3.108557,3.399603
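
The figures above come from a single 70/30 split, so they depend on which rows land in the test set. A minimal sketch of a cross-validated alternative follows, assuming the same "all columns except the target are features" layout used by `train_eval_linear`. The function name `cv_eval_linear`, the choice of 5 folds, and the synthetic data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def cv_eval_linear(df_features, target_col="g3", n_splits=5, random_state=42):
    """Return mean and standard deviation of R^2 and MAE across shuffled K folds."""
    X = df_features.drop(columns=[target_col])
    y = df_features[target_col]
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = cross_validate(
        LinearRegression(), X, y, cv=cv,
        scoring=("r2", "neg_mean_absolute_error"),
    )
    mae = -scores["test_neg_mean_absolute_error"]  # scikit-learn reports MAE negated
    return {
        "cv_r2_mean": scores["test_r2"].mean(),
        "cv_r2_std": scores["test_r2"].std(),
        "cv_mae_mean": mae.mean(),
        "cv_mae_std": mae.std(),
    }


# Synthetic stand-in data (NOT the student dataset), only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["g3"] = 10 + 2 * demo["x1"] + rng.normal(scale=1.0, size=200)

cv_demo = cv_eval_linear(demo)
print(cv_demo)
```

Applied to the notebook's frames, this would be `cv_eval_linear(df_s1)` and `cv_eval_linear(df_s2)`. The fold-to-fold standard deviation gives a rough sense of how much the split alone moves each metric.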


## Interpretation

Including G1 and G2 improves performance substantially: test R² rises from about 0.16 to 0.80, and
test MAE falls from about 3.40 to 1.34 grade points. This is expected because earlier grades carry
strong information about the final grade.

However, this gain is only meaningful when those grades are available at prediction time. If the
goal is to predict final outcomes early (before any grading period), G1 and G2 would not yet exist,
and using them amounts to information leakage.

The model without G1/G2 is a more challenging but more honest early-stage scenario. Its performance
(test R² of about 0.16) should be read as how much of the variation in G3 can be explained from the
numeric background and study-related variables alone.
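
A trivial baseline is one way to put an early-stage MAE in context. The sketch below compares a linear model against always predicting the training-set mean, using the same split settings as `train_eval_linear`. The helper name `compare_to_mean_baseline` and the synthetic data are assumptions made for illustration; the notebook does not report a baseline.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def compare_to_mean_baseline(df_features, target_col="g3", test_size=0.3, random_state=42):
    """Test MAE of a predict-the-mean baseline versus a linear model on the same split."""
    X = df_features.drop(columns=[target_col])
    y = df_features[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    linear = LinearRegression().fit(X_train, y_train)
    return {
        "baseline_test_mae": mean_absolute_error(y_test, baseline.predict(X_test)),
        "linear_test_mae": mean_absolute_error(y_test, linear.predict(X_test)),
    }


# Synthetic stand-in data (NOT the student dataset), only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["g3"] = 10 + 3 * demo["x1"] + rng.normal(scale=1.0, size=200)

baseline_demo = compare_to_mean_baseline(demo)
print(baseline_demo)
```

In the notebook this would be `compare_to_mean_baseline(df_s2)`. If the linear model's test MAE is only slightly below the baseline's, the numeric features add little beyond the average grade.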

In [10]:
# Unstandardized coefficients sorted by absolute value; magnitudes depend on each feature's scale
coef_s1 = pd.Series(model_s1.coef_, index=res_s1["features"]).sort_values(key=abs, ascending=False)
coef_s2 = pd.Series(model_s2.coef_, index=res_s2["features"]).sort_values(key=abs, ascending=False)

coef_s1.head(10), coef_s2.head(10)

(g2            0.971373
 famrel        0.355621
 failures     -0.311482
 traveltime    0.188852
 fedu         -0.187993
 g1            0.181662
 age          -0.181545
 studytime    -0.149924
 goout         0.134596
 dalc         -0.120951
 dtype: float64,
 failures     -2.022431
 medu          0.662474
 goout        -0.616267
 walc          0.438868
 fedu         -0.358134
 studytime     0.341974
 traveltime   -0.239845
 dalc         -0.237556
 freetime      0.211398
 age          -0.179412
 dtype: float64)
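
These coefficients are in the raw units of each feature, so ranking them by absolute value mixes effect size with measurement scale. One common remedy is to z-score the features before fitting, so that each coefficient is the change in G3 per one standard deviation of the feature. The sketch below shows the idea on synthetic data. The helper `standardized_coefficients` is hypothetical, and for brevity it fits on all rows passed in rather than on a training split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(df_features, target_col="g3"):
    """Coefficients after z-scoring features: change in target per one SD of each feature."""
    X = df_features.drop(columns=[target_col])
    y = df_features[target_col]
    pipe = make_pipeline(StandardScaler(), LinearRegression())
    pipe.fit(X, y)
    coefs = pd.Series(pipe.named_steps["linearregression"].coef_, index=X.columns)
    return coefs.sort_values(key=abs, ascending=False)


# Synthetic stand-in data (NOT the student dataset): two features on very different scales.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x_small": rng.normal(scale=0.1, size=300),   # narrow range, large raw coefficient
    "x_large": rng.normal(scale=10.0, size=300),  # wide range, small raw coefficient
})
demo["g3"] = 10 * demo["x_small"] + 0.5 * demo["x_large"] + rng.normal(scale=0.5, size=300)

raw_coefs = pd.Series(
    LinearRegression().fit(demo[["x_small", "x_large"]], demo["g3"]).coef_,
    index=["x_small", "x_large"],
).sort_values(key=abs, ascending=False)
std_coefs = standardized_coefficients(demo)

print(raw_coefs)  # raw ranking is driven by units
print(std_coefs)  # standardized ranking reflects effect per SD
```

In this toy example the two rankings disagree: the raw coefficients put `x_small` first, while the standardized ones put `x_large` first. The same caution applies when reading the raw rankings printed above.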

## Next steps

- Add categorical variables using one-hot encoding to improve early-stage modeling.
- Check residuals and linear-model assumptions (linearity, constant variance) to judge how far the coefficients can be trusted.
- Summarize conclusions and limitations in the final notebook.
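
As a starting point for the one-hot encoding step, the sketch below wraps the encoding and the regression in a single scikit-learn pipeline, so that categories are learned from the training data only. It is a minimal sketch under stated assumptions: the helper `build_onehot_linear` is hypothetical, and the synthetic frame only borrows the column names `mjob` and `studytime` with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_onehot_linear(df_features, target_col="g3"):
    """Pipeline that one-hot encodes non-numeric columns and passes numeric ones through."""
    X = df_features.drop(columns=[target_col])
    cat_cols = X.select_dtypes(exclude=["number"]).columns.tolist()
    num_cols = X.select_dtypes(include=["number"]).columns.tolist()
    preprocess = ColumnTransformer([
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", "passthrough", num_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


# Synthetic stand-in data (NOT the student dataset), only to show the call.
rng = np.random.default_rng(0)
jobs = ["at_home", "teacher", "other", "health", "services"]
demo = pd.DataFrame({
    "mjob": rng.choice(jobs, size=200),
    "studytime": rng.integers(1, 5, size=200),
})
demo["g3"] = (
    8 + 2 * (demo["mjob"] == "teacher") + demo["studytime"] + rng.normal(scale=1.0, size=200)
)

model_demo = build_onehot_linear(demo)
model_demo.fit(demo.drop(columns=["g3"]), demo["g3"])
preds_demo = model_demo.predict(demo.drop(columns=["g3"]))
print(preds_demo[:5])
```

Because every category gets its own column alongside the intercept, the individual coefficients of this model are not uniquely determined. The pipeline is suited to comparing predictive performance rather than to reading off effects.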