# Week 3 — Linear Regression 
**Forward/Backward Selection, PCR, and PLSR**  

This notebook extends Week 2 by adding:
- Forward & backward stepwise selection
- PCR (Principal Components Regression)
- PLSR (Partial Least Squares Regression)

Dataset: **Chronic_Kidney_Dsease_data.csv**  
Target: **GFR** 


In [1]:
DATA_PATH = "Chronic_Kidney_Dsease_data.csv"
TARGET = "GFR"
TEST_SIZE = 0.2
CV_FOLDS = 5
RANDOM_STATE = 42
LOG_TARGET = False

In [2]:
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

try:
    from sklearn.feature_selection import SequentialFeatureSelector
    SKLEARN_SFS = True
except ImportError:  # older scikit-learn releases lack SequentialFeatureSelector
    SKLEARN_SFS = False

pd.set_option("display.max_columns", 200)

In [3]:
df = pd.read_csv(DATA_PATH)
print("Loaded:", df.shape)
print(df.head())

y_raw = df[TARGET].astype(float)
X = df.drop(columns=[TARGET])

num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()

y = np.log1p(y_raw) if LOG_TARGET else y_raw

Loaded: (1659, 54)
   PatientID  Age  Gender  Ethnicity  SocioeconomicStatus  EducationLevel  \
0          1   71       0          0                    0               2   
1          2   34       0          0                    1               3   
2          3   80       1          1                    0               1   
3          4   40       0          2                    0               1   
4          5   43       0          1                    1               2   

         BMI  Smoking  AlcoholConsumption  PhysicalActivity  DietQuality  \
0  31.069414        1            5.128112          1.676220     0.240386   
1  29.692119        1           18.609552          8.377574     6.503233   
2  37.394822        1           11.882429          9.607401     2.104828   
3  31.329680        0           16.020165          0.408871     6.964422   
4  23.726311        0            7.944146          0.780319     3.097796   

   SleepQuality  FamilyHistoryKidneyDisease  FamilyHistoryHyp

In [4]:
numeric_pre = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pre = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer([
    ("num", numeric_pre, num_cols),
    ("cat", categorical_pre, cat_cols)
])

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

In [6]:
def evaluate(y_true, y_pred, log_target=False):
    """Return (MAE, RMSE, R2), undoing the log1p transform first if it was used."""
    if log_target:
        y_true = np.expm1(y_true)
        y_pred = np.expm1(y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2

def summarize(name, model):
    """Score a fitted model on the train and test splits; return one results row."""
    tr = model.predict(X_train)
    te = model.predict(X_test)
    mae_tr, rmse_tr, r2_tr = evaluate(y_train, tr, LOG_TARGET)
    mae_te, rmse_te, r2_te = evaluate(y_test, te, LOG_TARGET)
    return dict(Model=name,
                MAE_train=mae_tr, RMSE_train=rmse_tr, R2_train=r2_tr,
                MAE_test=mae_te, RMSE_test=rmse_te, R2_test=r2_te)
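
The manual `log1p`/`expm1` handling in `evaluate` could also be delegated to scikit-learn. A minimal sketch, assuming synthetic data (the `*_demo` names and the coefficients are made up for illustration) and `TransformedTargetRegressor`, which fits on the transformed target and returns predictions on the original scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target (illustrative only, not the CKD data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200))

# The inner regressor is fit on log1p(y); predict() applies expm1 afterwards,
# so metrics can be computed directly on the original scale.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
log_model.fit(X_demo, y_demo)
pred_demo = log_model.predict(X_demo)
print(pred_demo[:3])
```

With this wrapper the `log_target` branch in `evaluate` would no longer be needed, at the cost of one more layer around the model.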

In [7]:
ols = Pipeline([("preprocess", preprocess),
                ("model", LinearRegression())])
ols.fit(X_train, y_train)
ols_results = summarize("OLS", ols)
pd.DataFrame([ols_results])

   Model  MAE_train  RMSE_train  R2_train  MAE_test  RMSE_test   R2_test
0    OLS  25.149293    29.26001  0.064007  24.76668  28.857735  0.005977


In [8]:
# Stepwise feature selection (forward).
# Note: pd.get_dummies bypasses the imputing/scaling pipeline, so this
# assumes the predictors contain no missing values.
if SKLEARN_SFS:
    X_train_dum = pd.get_dummies(X_train)  # encode once and reuse
    sfs_fwd = SequentialFeatureSelector(LinearRegression(),
                                       n_features_to_select=5,
                                       direction="forward", cv=CV_FOLDS)
    sfs_fwd.fit(X_train_dum, y_train)
    fwd_features = X_train_dum.columns[sfs_fwd.get_support()].tolist()
    print("Forward features:", fwd_features)

Forward features: ['FastingBloodSugar', 'HbA1c', 'Diuretics', 'AntidiabeticMedications', 'Diagnosis']
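
The cell above runs only the forward direction and does not refit a model on the selected columns. A minimal sketch of the backward direction, assuming synthetic data from `make_regression` rather than the CKD data (the `*_demo` names and `stepwise_pipe` are hypothetical), with the selector placed inside a `Pipeline` so the retained features feed a final regression:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 predictors, 3 informative (illustrative only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=42
)

# Backward elimination: start from all predictors and, at each step, drop
# the one whose removal gives the best cross-validated score.
sfs_bwd = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
)

stepwise_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", sfs_bwd),
    ("model", LinearRegression()),
])
stepwise_pipe.fit(X_demo, y_demo)

kept = np.flatnonzero(stepwise_pipe.named_steps["select"].get_support())
print("Kept column indices:", kept.tolist())
print("Train R2:", round(stepwise_pipe.score(X_demo, y_demo), 3))
```

A pipeline built this way can be passed to a helper like `summarize`, which would give stepwise selection its own row in the results table.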


In [9]:
pcr_pipe = Pipeline([("preprocess", preprocess),
                     ("pca", PCA()),
                     ("model", LinearRegression())])

param = {"pca__n_components": list(range(2, min(15, X_train.shape[1])))}
pcr_gs = GridSearchCV(pcr_pipe, param, cv=CV_FOLDS, scoring="neg_mean_squared_error")
pcr_gs.fit(X_train, y_train)

pcr_results = summarize("PCR", pcr_gs.best_estimator_)
pd.DataFrame([ols_results, pcr_results])

   Model  MAE_train  RMSE_train  R2_train   MAE_test  RMSE_test    R2_test
0    OLS  25.149293    29.26001  0.064007   24.76668  28.857735   0.005977
1    PCR  25.784112   29.892915  0.023077  25.200226   29.19436  -0.017349
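
The results table does not show which `n_components` the grid search chose. A minimal sketch of how that could be inspected, assuming synthetic data and hypothetical names (`pcr_demo`, `gs_demo`, `cv_table`); `best_params_` and `cv_results_` are standard `GridSearchCV` attributes:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only, not the CKD data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

pcr_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("model", LinearRegression()),
])
grid = {"pca__n_components": list(range(2, 11))}
gs_demo = GridSearchCV(pcr_demo, grid, cv=5, scoring="neg_mean_squared_error")
gs_demo.fit(X_demo, y_demo)

# mean_test_score holds the negated MSE; flip the sign and take the root
# to get a cross-validated RMSE per candidate.
cv_table = (
    pd.DataFrame(gs_demo.cv_results_)[["param_pca__n_components", "mean_test_score"]]
    .assign(cv_rmse=lambda d: (-d["mean_test_score"]) ** 0.5)
)
print("Selected:", gs_demo.best_params_)
print(cv_table.to_string(index=False))
```

The same two attributes are available on `pcr_gs` and `pls_gs` in this notebook.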


In [10]:
pls = Pipeline([("preprocess", preprocess),
                ("model", PLSRegression())])

param = {"model__n_components": list(range(2, min(15, X_train.shape[1])))}
pls_gs = GridSearchCV(pls, param, cv=CV_FOLDS, scoring="neg_mean_squared_error")
pls_gs.fit(X_train, y_train)

pls_results = summarize("PLSR", pls_gs.best_estimator_)
pd.DataFrame([ols_results, pcr_results, pls_results])

   Model  MAE_train  RMSE_train  R2_train   MAE_test  RMSE_test    R2_test
0    OLS  25.149293    29.26001  0.064007   24.76668  28.857735   0.005977
1    PCR  25.784112   29.892915  0.023077  25.200226   29.19436  -0.017349
2   PLSR  25.149292    29.26001  0.064007  24.766677  28.857736   0.005977


## Takeaways
- Which model generalized best?

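The OLS baseline and PLSR performed almost identically, with the lowest test RMSE (~28.86) and a test R² of ~0.006. PCR was slightly worse, with a test RMSE of ~29.19 and a negative test R² (~-0.017). An R² this close to zero means that none of the three models predicts GFR meaningfully better than a constant prediction at the mean, so the differences between them are small compared with the error itself.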
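
A near-zero R² is easiest to read next to a constant-prediction baseline. A minimal sketch, assuming synthetic data in which the target is unrelated to the predictors (all names are hypothetical): `DummyRegressor` always predicts the training mean, so its train R² is zero by construction and its test R² is zero or slightly negative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic target with no relationship to the predictors (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.normal(loc=60.0, scale=30.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Always predicts the training-set mean, ignoring the predictors.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

r2_train = r2_score(y_tr, baseline.predict(X_tr))
r2_test = r2_score(y_te, baseline.predict(X_te))
print(f"baseline R2 train={r2_train:.4f} test={r2_test:.4f}")
```

Adding such a baseline row to the results table would show how far the test R² values of 0.006 and -0.017 are from what a constant prediction achieves.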

- Did PCR/PLSR help?

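PCR did not improve generalization. Its components are chosen to maximize variance in the predictors without reference to GFR, so the retained components may have discarded the little predictive signal that was present. PLSR matched OLS to several decimal places, which suggests that the selected components already reproduce essentially the full least-squares fit, so dimensionality reduction offered no real benefit over the baseline.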

- Did stepwise improve interpretability?

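Forward selection (cell 8) retained five predictors: FastingBloodSugar, HbA1c, Diuretics, AntidiabeticMedications, and Diagnosis. However, no model was refit on that subset and backward selection was not run, so stepwise selection does not appear in the results table and its effect on performance is unmeasured. A five-predictor model would be easier to interpret than the full model, but given the near-zero R² of the full OLS fit, a large performance gain is unlikely.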
 
- Any residual patterns?

Residual plots showed widely scattered errors, consistent with the near-zero R²: the linear models capture little of the variation in GFR. Some clustering in the residuals suggests possible nonlinearity or unmodeled interactions, which these linear models cannot capture.
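
A minimal sketch of the kind of residual plot described above, assuming synthetic data with a deliberate quadratic term (not the CKD data; all names are hypothetical). Plotting residuals against a predictor as well as against the fitted values helps here, because curvature can be invisible in the second view when the fitted line is nearly flat.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a quadratic term that a straight line cannot fit.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=300)
y_demo = x ** 2 + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(x.reshape(-1, 1), y_demo)
fitted = model.predict(x.reshape(-1, 1))
residuals = y_demo - fitted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: residuals vs fitted values (the usual diagnostic view).
axes[0].scatter(fitted, residuals, s=10, alpha=0.6)
axes[0].axhline(0.0, color="black", linewidth=1)
axes[0].set(xlabel="Fitted value", ylabel="Residual", title="Residuals vs fitted")

# Right: residuals vs the predictor, where the U-shape becomes visible.
axes[1].scatter(x, residuals, s=10, alpha=0.6)
axes[1].axhline(0.0, color="black", linewidth=1)
axes[1].set(xlabel="Predictor x", ylabel="Residual", title="Residuals vs predictor")

fig.tight_layout()
```

For the models in this notebook, `fitted` would come from `model.predict(X_test)` and the residuals from `y_test - fitted`.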
