# Lesson 18: Capstone — NumPy-Only Mini ML Pipeline
**Goal (~15–20 min):** Build a tiny end-to-end regression pipeline using only NumPy.
- Generate synthetic data
- Split train/test
- Scale (fit on train, apply to test)
- Engineer features
- Fit linear regression (normal equation)
- Evaluate RMSE

## 1) Data Generation

In [None]:
import numpy as np
rng = np.random.default_rng(123)
N = 200
age = rng.integers(22, 61, size=N)
income_k = rng.integers(40, 181, size=N)
score = rng.integers(40, 101, size=N)
years_exp = np.clip((age-22)//3 + rng.integers(-1,3,size=N), 0, None)
# true target (credit-like): linear signal plus integer noise, clipped to [300, 850]
noise = rng.integers(-60, 61, size=N)
y_raw = 500 + income_k*2 + (score-70)*4 + years_exp*3 + noise
y = np.clip(y_raw, 300, 850).astype(float)
X = np.column_stack([age, income_k, score, years_exp]).astype(float)
header = ["age","income_k","score","years_experience"]

## 2) Train/Test Split

In [None]:
idx = rng.permutation(N)        # shuffle row indices once
tr, te = idx[:160], idx[160:]   # 160 train rows (80%), 40 test rows (20%)
X_tr, X_te = X[tr], X[te]
y_tr, y_te = y[tr], y[te]
print("train:", X_tr.shape, " test:", X_te.shape)
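
The shuffle-then-slice split above can be wrapped in a small helper so the test fraction is not hard-coded. This is a minimal sketch: the function name `train_test_split_np`, its parameters, and the toy arrays are illustrative assumptions, not part of the lesson code.

```python
import numpy as np

def train_test_split_np(X, y, test_frac=0.2, seed=0):
    """Hypothetical helper: shuffle row indices once, then slice into train/test."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    te, tr = idx[:n_test], idx[n_test:]
    return X[tr], X[te], y[tr], y[te]

# toy data: 10 rows, 2 features
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
X_a, X_b, y_a, y_b = train_test_split_np(X_demo, y_demo, test_frac=0.2, seed=0)
print("train:", X_a.shape, " test:", X_b.shape)
```

Because one permutation indexes both `X` and `y`, the rows and targets stay aligned, and every row lands in exactly one of the two sets.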

## 3) Fit/Transform Scaling (Standardization)

In [None]:
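def fit_standardizer(X):
    # learn per-column mean and std from the data passed in (train only)
    m = X.mean(axis=0)
    s = X.std(axis=0)
    s[s == 0] = 1.0  # avoid division by zero for constant columns
    return {"mean": m, "std": s}
def transform_standardizer(X, st):
    # apply previously fitted statistics; never refit on test data
    return (X - st["mean"]) / st["std"]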
st = fit_standardizer(X_tr)
X_tr_std = transform_standardizer(X_tr, st)
X_te_std = transform_standardizer(X_te, st)
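
A quick sanity check of the fit-on-train, apply-to-test pattern: after standardization the train columns should have mean 0 and standard deviation 1, while the test columns will only be close to that, because they reuse the train statistics. The sketch below uses made-up normal data and its own variable names (`X_train`, `X_test`), which are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(160, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

# fit on train only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0

# apply the same statistics to both sets
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print("train mean:", X_train_std.mean(axis=0).round(6))
print("train std: ", X_train_std.std(axis=0).round(6))
print("test mean: ", X_test_std.mean(axis=0).round(3))
print("test std:  ", X_test_std.std(axis=0).round(3))
```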

## 4) Feature Engineering

In [None]:
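# binary flag: 1.0 if age > 35 (thresholded on the raw, unscaled age)
age_flag = (X_tr[:,0] > 35).astype(float).reshape(-1,1)
age_flag_te = (X_te[:,0] > 35).astype(float).reshape(-1,1)
# min-max normalized score; min and range come from train only to avoid leakage
# note: this is an affine function of score, like the standardized score column
sc = X_tr[:,2]
sc_min = sc.min()
sc_rng = np.ptp(sc) if np.ptp(sc) != 0 else 1.0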
score_mm_tr = ((X_tr[:,2]-sc_min)/sc_rng).reshape(-1,1)
score_mm_te = ((X_te[:,2]-sc_min)/sc_rng).reshape(-1,1)
# build design matrices (standardized + engineered)
Xb_tr = np.hstack([np.ones((X_tr.shape[0],1)), X_tr_std, age_flag, score_mm_tr])
Xb_te = np.hstack([np.ones((X_te.shape[0],1)), X_te_std, age_flag_te, score_mm_te])
print("Xb_tr shape:", Xb_tr.shape)
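
One thing worth checking in a design matrix like this: the min-max score and the standardized score are both affine functions of the same raw column, so together with the intercept column they are linearly dependent. The sketch below shows this on a made-up `score` vector (an assumption for illustration) by comparing the column count with the numerical rank.

```python
import numpy as np

rng = np.random.default_rng(0)
score = rng.integers(40, 101, size=50).astype(float)

score_std = (score - score.mean()) / score.std()      # standardized
score_mm = (score - score.min()) / np.ptp(score)      # min-max scaled

# intercept + two rescalings of the same raw column
A = np.column_stack([np.ones_like(score), score_std, score_mm])
rank = np.linalg.matrix_rank(A)
print("columns:", A.shape[1], " rank:", rank)
```

A rank below the column count means the least-squares coefficients are not unique, which matters for how the linear system is solved.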

## 5) Linear Regression via Normal Equation

In [None]:
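# normal equation: (X^T X) theta = X^T y
XtX = Xb_tr.T @ Xb_tr
Xty = Xb_tr.T @ y_tr
# score_mm is an affine function of the standardized score column, so with the
# intercept X^T X is singular; lstsq returns the minimum-norm solution where
# np.linalg.solve would raise or return unstable coefficients
theta = np.linalg.lstsq(XtX, Xty, rcond=1e-10)[0]
print("theta shape:", theta.shape)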
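
When the design matrix has full column rank, solving the normal equation with `np.linalg.solve` and calling `np.linalg.lstsq` on the design matrix directly should agree. The sketch below checks this on made-up, well-conditioned data; `true_theta` and the noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# intercept column plus two independent features
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# normal equation: solve (X^T X) theta = X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
# least squares computed directly on X
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("solutions agree:", np.allclose(theta_ne, theta_ls))
print("theta:", theta_ne.round(3))
```

Forming `X.T @ X` squares the condition number of the problem, so `lstsq` on `X` is generally the more robust choice once features become correlated.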

## 6) Evaluate RMSE on Test

In [None]:
y_pred = Xb_te @ theta
rmse = np.sqrt(np.mean((y_te - y_pred)**2))
print("Test RMSE:", round(rmse, 3))
print("Pred sample:", y_pred[:5].round(2))
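
An RMSE value is easier to interpret next to a baseline. A common one is to predict the train-set mean for every test row. The sketch below uses a hypothetical `rmse` helper and a handful of made-up target values, not the notebook's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# toy targets (illustrative values only)
y_train = np.array([600.0, 650.0, 700.0, 750.0])
y_test = np.array([620.0, 680.0, 740.0])

# baseline: always predict the mean of the training targets
baseline_pred = np.full_like(y_test, y_train.mean())
print("baseline RMSE:", round(rmse(y_test, baseline_pred), 3))
```

A fitted model is only useful if its test RMSE is clearly below this baseline.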

## Exercise
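1) Replace standardization with min–max scaling (fit the min and range on train, then apply them to test).
2) Add an interaction feature, income_k * score, computed from the scaled columns, and refit.
3) Report the new test RMSE and compare it with the RMSE from the standardized pipeline.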
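
As a possible starting point for the exercise, the fit/transform pattern used for standardization carries over to min-max scaling. The helper names `fit_minmax` and `transform_minmax` and the toy rows below are assumptions for illustration; wiring them into the pipeline and refitting is left to you.

```python
import numpy as np

def fit_minmax(X):
    """Learn per-column min and range (hypothetical helper; call on train only)."""
    lo = X.min(axis=0)
    span = np.ptp(X, axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return {"min": lo, "range": span}

def transform_minmax(X, mm):
    """Apply previously fitted min and range."""
    return (X - mm["min"]) / mm["range"]

# toy rows with columns [income_k, score]
X_train = np.array([[40.0, 50.0], [100.0, 70.0], [180.0, 90.0]])
X_test = np.array([[60.0, 95.0]])

mm = fit_minmax(X_train)
Z_train = transform_minmax(X_train, mm)
Z_test = transform_minmax(X_test, mm)

# interaction feature built from the scaled columns
inter_train = (Z_train[:, 0] * Z_train[:, 1]).reshape(-1, 1)
inter_test = (Z_test[:, 0] * Z_test[:, 1]).reshape(-1, 1)
print("scaled test row:", Z_test.round(3))
```

Test values can fall outside [0, 1] when they lie beyond the train range. That is expected, because the scaler is fit on train only.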