Q: Looking at the project description and all the exercises, what is the best structure to have in the code? I don't need the code itself, but a structure and plan to follow for making the code.

Great question. Here’s a **simple, robust project structure** that maps *exactly* to the parts (a–h) in your brief, keeps leakage under control, and lets you reuse code cleanly.

---

# 1) Folder layout (minimal but scalable)

```
project1/
├─ data/                      # (optional) saved arrays if you want
├─ figures/                   # all plots saved here
├─ results/                   # CSV/JSON of metrics per run
├─ src/
│  ├─ data.py                 # data generation + feature building + split/scale
│  ├─ models.py               # OLS / Ridge closed-form + predict helpers (Lasso: iterative, see optim.py)
│  ├─ optim.py                # GD / Momentum / Adagrad / RMSprop / Adam / SGD
│  ├─ resampling.py           # bootstrap, k-fold CV, bias-variance utilities
│  ├─ metrics.py              # MSE, R2, L2-norm, condition number, etc.
│  ├─ plots.py                # all plotting helpers (no math here)
│  └─ utils.py                # seeds, config, reproducibility, saving, timers
├─ notebooks/
│  ├─ 01_OLS_degree.ipynb     # Part (a)
│  ├─ 02_Ridge_lambda.ipynb   # Part (b)
│  ├─ 03_GD_variants.ipynb    # Parts (c,d,e,f)
│  ├─ 04_BiasVariance.ipynb   # Part (g)
│  └─ 05_CrossValidation.ipynb # Part (h)
└─ README.md
```

If you want **super-minimal**, you can collapse to just:

* `src/models.py` (math & training),
* `src/plots.py` (figures),
* a couple of notebooks that import those.

---

# 2) Module responsibilities (what each file does)

## `src/data.py`

* `runge(x)`: the target function.
* `make_data(n, noise_sd, seed)`: build `(x, y)`.
* `build_features(x, degree, include_bias=False)`: polynomial design matrix.
* `train_test_scale(X, y, test_size, random_state)`: **single split**, compute **train-only** mean/std, return:

  * `X_tr, X_te, y_tr, y_te`
  * `X_tr_s, X_te_s` (scaled with **train** stats)
  * `y_mean, y_tr_c` (centered train targets)
  * (Optional) a small `Config` (seed, degree, noise, etc.).

**Why:** all leakage-sensitive steps live in one place; every experiment reuses the same safe split/scale pattern.
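
Here is a minimal sketch of how those four functions could fit together. It assumes the standard Runge function $1/(1+25x^2)$ on an evenly spaced grid over $[-1, 1]$ and a plain NumPy permutation split; the grid, the return order, and the zero-std guard are my assumptions, not something the brief fixes.

```python
import numpy as np

def runge(x):
    """Runge function 1 / (1 + 25 x^2)."""
    return 1.0 / (1.0 + 25.0 * x**2)

def make_data(n=100, noise_sd=0.1, seed=42):
    """Noisy samples of the Runge function (assumed grid: linspace on [-1, 1])."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n)
    y = runge(x) + noise_sd * rng.standard_normal(n)
    return x, y

def build_features(x, degree, include_bias=False):
    """Polynomial design matrix with columns x^1..x^degree (x^0 first if include_bias)."""
    start = 0 if include_bias else 1
    return np.column_stack([x**p for p in range(start, degree + 1)])

def train_test_scale(X, y, test_size=0.2, random_state=42):
    """Single split; mean/std and y_mean are computed from the training rows only."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(y))
    n_test = int(round(test_size * len(y)))
    te, tr = idx[:n_test], idx[n_test:]
    X_tr, X_te, y_tr, y_te = X[tr], X[te], y[tr], y[te]

    mu = X_tr.mean(axis=0)
    sd = X_tr.std(axis=0)
    sd[sd == 0] = 1.0                 # guard against constant columns
    X_tr_s = (X_tr - mu) / sd
    X_te_s = (X_te - mu) / sd         # test rows reuse the train statistics

    y_mean = y_tr.mean()
    y_tr_c = y_tr - y_mean            # centered train targets
    return X_tr, X_te, y_tr, y_te, X_tr_s, X_te_s, y_mean, y_tr_c
```

With `include_bias=False` and centered targets, the intercept is handled by `y_mean` rather than by a column of ones.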

## `src/models.py`

Closed-form and “one-shot” solvers:

* `fit_ols_closed(Xtr_s, ytr_c)` → `theta`.
* `fit_ridge_closed(Xtr_s, ytr_c, lam, n_factor=True)` → `theta`
  (`n_factor=True` solves $(X^\top X + n\lambda I)\theta = X^\top y_c$; `False` uses $\lambda I$ in place of $n\lambda I$. Document which one you use!)
* `predict_from_centered(Xs, theta, y_mean)` → `y_pred` (adds intercept back).
* Convenience **sweeps**:

  * `sweep_degree(X_full, y, deg_max)` → arrays of test MSE/R²/‖θ‖ vs degree (Part a).
  * `sweep_ridge(X_full, y, degree, lambdas)` → test MSE/R²/‖θ‖ vs λ (Part b).

**Why:** one import gives you everything you need for parts (a,b).
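
A possible shape for the three core helpers, as a sketch only. Using `np.linalg.lstsq` for OLS and `np.linalg.solve` for Ridge (rather than an explicit inverse) is my choice here, not a requirement of the brief.

```python
import numpy as np

def fit_ols_closed(Xtr_s, ytr_c):
    """Least-squares solution; lstsq copes better with ill-conditioned X than inv(X^T X)."""
    theta, *_ = np.linalg.lstsq(Xtr_s, ytr_c, rcond=None)
    return theta

def fit_ridge_closed(Xtr_s, ytr_c, lam, n_factor=True):
    """Solve (X^T X + c I) theta = X^T y_c with c = n*lam (n_factor=True) or c = lam."""
    n, p = Xtr_s.shape
    c = n * lam if n_factor else lam
    A = Xtr_s.T @ Xtr_s + c * np.eye(p)
    return np.linalg.solve(A, Xtr_s.T @ ytr_c)

def predict_from_centered(Xs, theta, y_mean):
    """Predict on scaled features and add the intercept (train mean of y) back."""
    return Xs @ theta + y_mean
```

The two λ conventions differ only by a factor of `n`, so `fit_ridge_closed(X, y, lam, n_factor=True)` matches `fit_ridge_closed(X, y, n * lam, n_factor=False)`.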

## `src/optim.py`

Gradient methods (Parts c–f):

* Generic loop: `gd(Xs, y_c, grad_fn, eta, num_iters, theta0=None)` → `theta_path` (the gradient is passed in as `grad_fn`, matching the signature in section 5).
* Add variants: `gd_momentum`, `adagrad`, `rmsprop`, `adam`, and `sgd` (mini-batch).
* Shared gradients:

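  * OLS: for the cost $\tfrac{1}{n}\|y - X\theta\|^2$, $\nabla_\theta = -\tfrac{2}{n}X^\top(y - X\theta)$.
  * Ridge: OLS gradient + $2\lambda\theta$ for the penalty $\lambda\|\theta\|^2$ (or $\lambda\theta$ if you penalize $\tfrac{\lambda}{2}\|\theta\|^2$). Keep it consistent with your `fit_ridge_closed`.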

**Why:** clean separation between *optimizers* and *models*. You can plug any gradient into any optimizer.
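
To make the "plug any gradient into any optimizer" idea concrete, here is a rough sketch. It assumes the cost $\tfrac{1}{n}\|y_c - X\theta\|^2 + \lambda\|\theta\|^2$, which is why the gradients carry the factors $\tfrac{2}{n}$ and $2\lambda$; with a different cost scaling the constants change accordingly.

```python
import numpy as np

def grad_ols(Xs, y_c, theta):
    """Gradient of (1/n) * ||y_c - Xs theta||^2."""
    n = Xs.shape[0]
    return -(2.0 / n) * Xs.T @ (y_c - Xs @ theta)

def grad_ridge(Xs, y_c, theta, lam):
    """OLS gradient plus the gradient of lam * ||theta||^2."""
    return grad_ols(Xs, y_c, theta) + 2.0 * lam * theta

def gd(Xs, y_c, grad_fn, eta, num_iters, theta0=None):
    """Plain gradient descent; returns the whole path, shape (num_iters + 1, p)."""
    theta = np.zeros(Xs.shape[1]) if theta0 is None else np.array(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(num_iters):
        theta = theta - eta * grad_fn(Xs, y_c, theta)
        path.append(theta.copy())
    return np.array(path)
```

For Ridge you would bind λ first, e.g. `gd(Xs, y_c, lambda X, y, t: grad_ridge(X, y, t, lam), eta, num_iters)`. Under this cost the iterates should approach the `n_factor=True` closed form, which makes a handy sanity check.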

## `src/resampling.py`

* `bootstrap_predictions(model_fit_fn, predict_fn, Xtr_s, ytr_c, Xte_s, y_mean, B)` → `(B, n_test)` matrix of predictions.
* `bias_variance_from_preds(P, y_true)` → bias², var, (and optional MSE) averaged over test points.
* `kfold_cv(X_full, y, degree, k, model='ols'|'ridge', lam=None)` → mean test MSE across folds.

**Why:** your bias–variance (Part g) and CV (Part h) use the same primitives, regardless of model.
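
A sketch of the two bootstrap primitives, under the assumption that `fit_fn(X, y)` returns `theta` and `pred_fn(X, theta, y_mean)` returns predictions. The `seed` argument is an addition of mine for reproducibility.

```python
import numpy as np

def bootstrap_predictions(fit_fn, pred_fn, Xtr_s, ytr_c, Xte_s, y_mean, B=300, seed=0):
    """Refit on B resamples (with replacement) of the training rows.

    Returns P with shape (B, n_test): one row of test predictions per resample.
    """
    rng = np.random.default_rng(seed)
    n = Xtr_s.shape[0]
    P = np.empty((B, Xte_s.shape[0]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # sample n row indices with replacement
        theta = fit_fn(Xtr_s[idx], ytr_c[idx])
        P[b] = pred_fn(Xte_s, theta, y_mean)
    return P

def bias_variance_from_preds(P, y_true):
    """Bias^2, variance and MSE, each averaged over the test points."""
    mean_pred = P.mean(axis=0)
    bias2 = np.mean((y_true - mean_pred) ** 2)
    var = np.mean(P.var(axis=0))
    mse = np.mean((P - y_true) ** 2)
    return bias2, var, mse
```

Because all three quantities are measured against the same `y_true`, `mse` equals `bias2 + var` up to rounding. If `y_true` holds noisy test targets rather than the noise-free function, the noise ends up inside `bias2`.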

## `src/metrics.py`

* `mse(y_true, y_pred)`, `r2(y_true, y_pred)`.
* `l2_norm(theta)`, `condition_number(X)`.
* (Optional) `noise_floor = sigma**2` helper for plots.

**Why:** no repeated metric code, easy unit testing.
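
These are short enough to write in a few lines each. The sketch below assumes 1-D NumPy arrays for the targets and returns plain Python floats so the values serialize cleanly.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def l2_norm(theta):
    """Euclidean norm of the parameter vector."""
    return float(np.linalg.norm(theta))

def condition_number(X):
    """2-norm condition number of the design matrix."""
    return float(np.linalg.cond(X))
```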

## `src/plots.py`

* `plot_mse_r2_vs_degree(degrees, mse, r2)`
* `plot_ridge_curves(lambdas, mse, r2)`
* `plot_theta_norms(x, norms, xlabel)`
* `plot_bias_variance(degrees_or_ns, bias2, var, mse=None)`
* stylistic knobs only—**no math** here.

**Why:** figures stay consistent, and each one takes a single call to reproduce.

## `src/utils.py`

* `set_seed(seed)`, small `save_json`, `save_csv`, `timestamped_fname`, `ensure_dir`.
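
One way these helpers might look, as a sketch. `make_rng` is a hypothetical stand-in for `set_seed` that returns a NumPy `Generator` instead of setting global state, and the timestamp format is just an example.

```python
import json
from datetime import datetime
from pathlib import Path

import numpy as np

def make_rng(seed=42):
    """One Generator per run; pass it to whatever needs randomness."""
    return np.random.default_rng(seed)

def ensure_dir(path):
    """Create the directory (and parents) if missing; return it as a Path."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p

def timestamped_fname(stem, ext="json"):
    """E.g. 'ridge_sweep_20250101_120000.json' (format is an arbitrary choice)."""
    return f"{stem}_{datetime.now():%Y%m%d_%H%M%S}.{ext}"

def save_json(obj, path):
    """Write obj as JSON; NumPy scalars go through float(), arrays need .tolist() first."""
    path = Path(path)
    ensure_dir(path.parent)
    with path.open("w") as f:
        json.dump(obj, f, indent=2, default=float)
    return path
```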

---

# 3) Notebook/script flow (what each notebook does)

### 01\_OLS\_degree.ipynb — Part (a)

1. `from src.data import make_data, build_features` and `from src.models import sweep_degree`
2. Make data; build `X_full` (degree=15).
3. `degrees, mse_deg, r2_deg, norms = sweep_degree(X_full, y, deg_max=15)`.
4. Plot MSE/R² vs degree; plot ‖θ‖ vs degree; optionally print condition numbers.
5. **Also**: “dependence on number of data points” → fix degree `p`, draw training subsets of size `m`, and average test MSE/R² over several draws per `m` (reuse a small helper in `models.py` or write a short loop in the notebook, but keep the split/scale logic from `data.py`).

### 02\_Ridge\_lambda.ipynb — Part (b)

1. Fix an “interesting” degree `p` (e.g., 10–12).
2. `lambdas = logspace(...)`.
3. `mse_r, r2_r, norms_r = sweep_ridge(X_full, y, degree=p, lambdas=lambdas)`.
4. Plot MSE/R² vs λ (log x-axis), plot ‖θ‖ vs λ. Compare to OLS line.

### 03\_GD\_variants.ipynb — Parts (c, d, e, f)

1. Build split/scale once.
2. For OLS and Ridge, run `gd` with different `eta` and `iters`, overlay convergence of MSE on train/test.
3. Add momentum/Adagrad/RMSprop/Adam. Compare speed/stability.
4. Implement the Lasso gradient (the L1 term is not differentiable at zero, so use a subgradient such as $\lambda\,\mathrm{sign}(\theta)$) and compare to Ridge/OLS.
5. Add `sgd` (mini-batch) and compare to full-batch GD.
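
As an illustration of steps 3 and 4, here is a sketch of Adam with the usual bias-corrected moment estimates, plus a Lasso subgradient. It assumes the cost $\tfrac{1}{n}\|y_c - X\theta\|^2 + \lambda\|\theta\|_1$ and the common default hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the function names follow the plan above but the exact signatures are my guess.

```python
import numpy as np

def grad_ols(Xs, y_c, theta):
    """Gradient of (1/n) * ||y_c - Xs theta||^2."""
    n = Xs.shape[0]
    return -(2.0 / n) * Xs.T @ (y_c - Xs @ theta)

def grad_lasso(Xs, y_c, theta, lam):
    """OLS gradient plus a subgradient of lam * ||theta||_1 (sign(0) = 0 is a valid choice)."""
    return grad_ols(Xs, y_c, theta) + lam * np.sign(theta)

def adam(Xs, y_c, grad_fn, eta=0.01, num_iters=1000,
         beta1=0.9, beta2=0.999, eps=1e-8, theta0=None):
    """Adam: running means of the gradient and squared gradient, both bias-corrected."""
    theta = np.zeros(Xs.shape[1]) if theta0 is None else np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # first moment estimate
    v = np.zeros_like(theta)   # second moment estimate
    for t in range(1, num_iters + 1):
        g = grad_fn(Xs, y_c, theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```

Since `adam` takes the same `grad_fn` as plain `gd`, switching between OLS, Ridge, and Lasso is just a matter of which gradient you pass in. Note that subgradient steps rarely produce exact zeros in `theta`, which is worth mentioning when you compare against a library Lasso.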

### 04\_BiasVariance.ipynb — Part (g)

1. Fix degree(s) and/or number of points.
2. Use `bootstrap_predictions` to get prediction matrix `P`.
3. Compute bias²/var via `bias_variance_from_preds(P, y_true)`; plot bias² & var vs degree and vs #points.
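4. Optionally confirm `MSE ≈ bias² + var + σ²`, with bias² measured against the noise-free function (if you measure it against noisy test targets, σ² is absorbed into the bias² term).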

### 05\_CrossValidation.ipynb — Part (h)

1. Implement k-fold CV for OLS, Ridge, Lasso (you can call `kfold_cv` or sklearn’s `KFold`).
2. Compare CV MSE with bootstrap MSE; discuss differences.
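
If you write `kfold_cv` yourself, a sketch could look like the following. It assumes `X_full` holds the columns $x^1, \dots, x^{\text{deg\_max}}$ without a bias column (so the first `degree` columns give the degree-`degree` model) and uses the $n\lambda$ Ridge convention; the `seed` argument is my addition. Scaling and centering are redone inside every fold so no statistics leak from the held-out rows.

```python
import numpy as np

def kfold_cv(X_full, y, degree, k=5, model="ols", lam=None, seed=0):
    """Mean held-out MSE over k folds, with train-only scaling in each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    X = X_full[:, :degree]
    scores = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])

        mu = X[tr].mean(axis=0)
        sd = X[tr].std(axis=0)
        sd = np.where(sd == 0, 1.0, sd)
        Xtr, Xte = (X[tr] - mu) / sd, (X[te] - mu) / sd
        y_mean = y[tr].mean()
        ytr_c = y[tr] - y_mean

        if model == "ols":
            theta, *_ = np.linalg.lstsq(Xtr, ytr_c, rcond=None)
        elif model == "ridge":
            if lam is None:
                raise ValueError("ridge needs a value for lam")
            A = Xtr.T @ Xtr + len(tr) * lam * np.eye(Xtr.shape[1])
            theta = np.linalg.solve(A, Xtr.T @ ytr_c)
        else:
            raise ValueError(f"unknown model: {model!r}")

        pred = Xte @ theta + y_mean
        scores.append(np.mean((y[te] - pred) ** 2))
    return float(np.mean(scores))
```

Lasso is left out of this sketch because it has no closed-form solution; you would plug in your gradient-based fit or a library estimator for that branch.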

---

# 4) Conventions you should lock down (and mention in your report)

* **Scaling & centering:** always fit mean/std on **train only**; center `y_train` (intercept added back to predictions).
* **Ridge λ convention:** choose one definition and be consistent.

  * If you minimize $\tfrac{1}{n}\|y - X\theta\|^2 + \lambda\|\theta\|^2$, your closed form is $\theta = (X^\top X + n\lambda I)^{-1}X^\top y_c$.
  * If you minimize $\|y - X\theta\|^2 + \lambda\|\theta\|^2$, it’s $\theta = (X^\top X + \lambda I)^{-1}X^\top y_c$.
    Document which one you use; don’t mix conventions across parts.
* **Reproducibility:** set the seed *once* per run (preferably one `np.random.default_rng(seed)` generator that you pass around; `np.random.seed` is the legacy alternative).
* **Fair comparisons:** use the **same split** and **same scaling stats** across degree/λ sweeps.
* **Saving artifacts:** save CSVs of sweeps to `results/` and plots to `figures/` (one line in each notebook).
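
For the report, the $n\lambda$ in the first Ridge convention can be justified in three lines by setting the gradient of the cost to zero (with $y_c$ the centered targets):

$$
\begin{aligned}
C(\theta) &= \tfrac{1}{n}\|y_c - X\theta\|^2 + \lambda\|\theta\|^2 \\
\nabla_\theta C &= -\tfrac{2}{n}X^\top(y_c - X\theta) + 2\lambda\theta = 0 \\
\Rightarrow\quad (X^\top X + n\lambda I)\,\theta &= X^\top y_c
\end{aligned}
$$

Dropping the $\tfrac{1}{n}$ in front of the squared error removes the factor $n$ and gives the second convention.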

---

# 5) Minimal function signatures (to guide your coding)

```python
# data.py
def make_data(n=100, noise_sd=0.1, seed=42): ...
def build_features(x, degree, include_bias=False): ...
def train_test_scale(X, y, test_size=0.2, random_state=42): ...

# models.py
def fit_ols_closed(Xtr_s, ytr_c): ...
def fit_ridge_closed(Xtr_s, ytr_c, lam, n_factor=True): ...
def predict_from_centered(Xs, theta, y_mean): ...
def sweep_degree(X_full, y, deg_max=15, test_size=0.2, random_state=42): ...
def sweep_ridge(X_full, y, degree, lambdas, test_size=0.2, random_state=42, n_factor=True): ...

# optim.py
def grad_ols(Xs, y_c, theta): ...
def grad_ridge(Xs, y_c, theta, lam, n_factor=True): ...
def gd(Xs, y_c, grad_fn, eta, num_iters, theta0=None): ...
# momentum/adagrad/rmsprop/adam/sgd variants ...

# resampling.py
def bootstrap_predictions(fit_fn, pred_fn, Xtr_s, ytr_c, Xte_s, y_mean, B=300): ...
def bias_variance_from_preds(P, y_true): ...
def kfold_cv(X_full, y, degree, k=5, model='ols', lam=None): ...

# metrics.py
def mse(y_true, y_pred): ...
def r2(y_true, y_pred): ...
def l2_norm(theta): ...
def condition_number(X): ...
```

---

## Final tip

Start with **part (a) OLS degree sweep** wired through `data.py`, `models.py`, and `plots.py`. Once that’s solid, **clone the pattern** for Ridge (part b), then add `optim.py` for gradient-based parts. Bias–variance and CV drop in naturally once the split/scale and predict conventions are consistent.

If you want, I can turn this plan into stub files with docstrings you can fill in next.
