## Dataset conventions 

- Canonical variant key: `chr_pos_ref_alt` (e.g., `17_43045705_T_C`) derived from `chrom`, `pos`, `ref`, `alt`.
- Variants are **missense SNVs** (single-nucleotide variants).
- Labels: strict ClinVar clinical significance mapping (Benign/Likely benign, B/LB → 0; Pathogenic/Likely pathogenic, P/LP → 1). Ambiguous categories are excluded or handled upstream in the cleaned table.
- Leakage control: **gene-disjoint** train/val/test splits (implemented in `scripts/make_week3_splits.py`).
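
The key and label conventions above can be sketched as follows. This is a minimal illustration rather than the project's cleaning code: the helper names `make_variant_key`, `is_snv`, and `map_clinvar_label`, and the exact significance strings in `LABEL_MAP`, are assumptions. The SNV check only confirms a single-base substitution; it does not verify that the variant is missense.

```python
from __future__ import annotations

# Assumed ClinVar significance strings; the cleaned table may spell them differently.
LABEL_MAP = {
    "Benign": 0,
    "Likely benign": 0,
    "Pathogenic": 1,
    "Likely pathogenic": 1,
}


def make_variant_key(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Join the four fields into the canonical chr_pos_ref_alt key."""
    return f"{chrom}_{pos}_{ref}_{alt}"


def is_snv(ref: str, alt: str) -> bool:
    """True when ref and alt are single, different nucleotides."""
    bases = {"A", "C", "G", "T"}
    return ref in bases and alt in bases and ref != alt


def map_clinvar_label(significance: str) -> int | None:
    """Return 0 (B/LB), 1 (P/LP), or None for any other category."""
    return LABEL_MAP.get(significance.strip())


print(make_variant_key("17", 43045705, "T", "C"))
print(map_clinvar_label("Likely pathogenic"))
print(map_clinvar_label("Uncertain significance"))
```

Rows for which `map_clinvar_label` returns `None` correspond to the ambiguous categories that the strict mapping leaves out.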

See also: `docs/dataset_orientation_dylan_tan.md`.
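
For orientation, a gene-disjoint split assigns every variant of a given gene to the same partition, so a model cannot be evaluated on genes it saw during training. The sketch below shows one simple way to do this with the standard library; it is not taken from `scripts/make_week3_splits.py`, and the function names, the split fractions, and the toy gene list are all assumptions.

```python
from __future__ import annotations

import random
from collections import defaultdict


def gene_disjoint_split(genes, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign each row to train/val/test so that no gene spans two splits.

    `genes` holds one gene symbol per row. Fractions apply to genes, not rows,
    so row-level split sizes are only approximate.
    """
    unique_genes = sorted(set(genes))
    random.Random(seed).shuffle(unique_genes)

    n_test = round(len(unique_genes) * test_frac)
    n_val = round(len(unique_genes) * val_frac)
    split_of_gene = {}
    for i, gene in enumerate(unique_genes):
        if i < n_test:
            split_of_gene[gene] = "test"
        elif i < n_test + n_val:
            split_of_gene[gene] = "val"
        else:
            split_of_gene[gene] = "train"
    return [split_of_gene[g] for g in genes]


def assert_gene_disjoint(genes, splits):
    """Raise ValueError if any gene appears in more than one split."""
    seen = defaultdict(set)
    for gene, split in zip(genes, splits):
        seen[gene].add(split)
    leaked = sorted(g for g, s in seen.items() if len(s) > 1)
    if leaked:
        raise ValueError(f"Genes present in multiple splits: {leaked}")


# Toy example: 13 rows spread over 10 genes.
genes = ["BRCA1", "BRCA1", "TP53", "TP53", "TP53", "PTEN", "MLH1",
         "MSH2", "ATM", "CHEK2", "PALB2", "RAD51C", "BRIP1"]
splits = gene_disjoint_split(genes, seed=42)
assert_gene_disjoint(genes, splits)
print(sorted(set(zip(genes, splits))))
```

Because the fractions are applied at the gene level, a few genes with many variants can skew row counts; a check like `assert_gene_disjoint` is cheap to run on any split table that carries a gene column.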

```python
from __future__ import annotations

from pathlib import Path
import pandas as pd

# Use the parent directory as the repo root when the notebook runs from notebooks/.
REPO_ROOT = Path.cwd().resolve().parents[0] if Path.cwd().name == 'notebooks' else Path.cwd().resolve()
DATA_PATH = REPO_ROOT / 'data/processed/week4_curated_dataset.parquet'

print('Repo root:', REPO_ROOT)
print('Week4 curated dataset exists:', DATA_PATH.exists())
```

```text
Repo root: /Users/angelhdmorenu/Desktop/EGN 6933 – Project in Applied Data Science/Machine Learning Classification of Pathogenic vs. Benign Missense Variants Using Protein Language Model Embeddings
Week4 curated dataset exists: True
```


```python
from collections import Counter

if not DATA_PATH.exists():
    print('Missing curated dataset:', DATA_PATH)
    print('Generate it via: python scripts/make_week4_curated_dataset.py')
else:
    df = pd.read_parquet(DATA_PATH)
    required = {'chr_pos_ref_alt', 'label', 'split', 'embedding'}
    missing = sorted(required - set(df.columns))
    if missing:
        raise ValueError(f'Missing required columns in curated dataset: {missing}')

    print('Rows:', len(df))
    print('Splits:', Counter(df['split']))
    print('Label balance overall:', df['label'].value_counts().to_dict())
    print('Label balance by split:')
    print(df.groupby('split')['label'].value_counts().unstack(fill_value=0))

    # Quick embedding sanity check: every row should share the first row's length.
    first_len = len(df['embedding'].iloc[0])
    print('Embedding dim (first row):', first_len)
    bad_dim = (df['embedding'].map(len) != first_len).sum()
    print('Rows with mismatched embedding length:', int(bad_dim))
```

```text
Rows: 5000
Splits: Counter({'train': 3958, 'val': 542, 'test': 500})
Label balance overall: {0: 3161, 1: 1839}
Label balance by split:
label     0     1
split            
test    316   184
train  2499  1459
val     346   196
Embedding dim (first row): 2560
Rows with mismatched embedding length: 0
```
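
The per-split counts in this output imply a positive (label 1) rate of roughly 36 to 37% in every split. That rate is also the precision a scorer with no skill would be expected to reach, so it is a useful reference line when reading precision-recall curves. A short check, using only the counts printed above:

```python
# (negatives, positives) per split, copied from the notebook output.
counts = {"train": (2499, 1459), "val": (346, 196), "test": (316, 184)}

rates = {}
for split, (neg, pos) in counts.items():
    total = neg + pos
    rates[split] = pos / total
    print(f"{split}: n={total}, positive rate={rates[split]:.3f}")
```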


## Week 5 baseline run commands

The main entrypoint for baselines is `scripts/baseline_train_eval.py`.

Run the Logistic Regression baseline:

```bash
python scripts/baseline_train_eval.py \
  --data data/processed/week4_curated_dataset.parquet \
  --out-json results/baseline_logreg_report.json
```

Run the Random Forest baseline with calibration, bootstrap CIs, and plots:

```bash
python scripts/baseline_train_eval.py \
  --model rf \
  --rf-max-depth 4 \
  --rf-n-estimators 200 \
  --data data/processed/week4_curated_dataset.parquet \
  --calibration platt \
  --bootstrap-iters 1000 \
  --plot-pr results/rf_pr_curves_cal_vs_uncal.png \
  --plot-reliability results/rf_reliability.png \
  --plot-scores-test results/test_score_distributions.png \
  --out-json results/baseline_rf_calibrated_report.json
```
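
As background for `--bootstrap-iters`, the sketch below shows a percentile bootstrap confidence interval: resample the evaluation rows with replacement, recompute the metric on each resample, and read the interval off the sorted values. It is an illustration under assumptions, not the logic of `scripts/baseline_train_eval.py`; the function names, the use of accuracy as a stand-in metric, and the toy labels are all hypothetical.

```python
from __future__ import annotations

import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (stand-in metric)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, iters=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `metric` over resampled rows."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(iters):
        # Draw n row indices with replacement and score the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * iters)]
    hi = stats[min(iters - 1, int((1 - alpha / 2) * iters))]
    return lo, hi


# Toy data: 100 rows with 20 wrong predictions (accuracy 0.8).
y_true = [0, 1] * 50
y_pred = list(y_true)
for i in range(20):
    y_pred[i] = 1 - y_pred[i]

point = accuracy(y_true, y_pred)
lo, hi = bootstrap_ci(y_true, y_pred, accuracy, iters=1000)
print(f"accuracy={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With only 500 test rows, intervals of this kind can be fairly wide, which is one reason to report them alongside point estimates.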

To assess split-seed robustness under gene-disjoint splitting, regenerate the splits with a different `--seed` using `scripts/make_week3_splits.py`, rebuild the Week 4 curated dataset (`scripts/make_week4_curated_dataset.py`), then rerun the baselines.
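
The regenerate, rebuild, rerun sequence can be scripted per seed. The sketch below only prints the commands unless `dry_run=False`. It assumes that `scripts/make_week3_splits.py` needs no arguments beyond `--seed` and that `scripts/make_week4_curated_dataset.py` runs without arguments; the per-seed report filename and the helper names are hypothetical.

```python
from __future__ import annotations

import subprocess

DATA = "data/processed/week4_curated_dataset.parquet"


def commands_for_seed(seed: int) -> list[list[str]]:
    """Commands to regenerate splits, rebuild Week 4, and rerun the baseline."""
    return [
        ["python", "scripts/make_week3_splits.py", "--seed", str(seed)],
        ["python", "scripts/make_week4_curated_dataset.py"],
        [
            "python", "scripts/baseline_train_eval.py",
            "--data", DATA,
            # Hypothetical per-seed report name so runs do not overwrite each other.
            "--out-json", f"results/baseline_logreg_seed{seed}.json",
        ],
    ]


def run_seeds(seeds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for seed in seeds:
        for cmd in commands_for_seed(seed):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)


run_seeds([0, 1, 2])
```

Keeping one report per seed makes it straightforward to compare metrics across splits afterwards.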