# 08 · Summary & Assessment prep

> **Purpose**: consolidate the full workflow, align outputs with Assessment 1, and generate a short submission checklist.

> **Learning objectives**
- Recap the end-to-end epidemiology workflow used in this module.
- Verify that your pipeline produces the minimum required outputs for A1.
- Prepare a clean export of figures and text (DAG + 500 words) for submission.

---

In [None]:
# Make sure the repo root (which has scripts/bootstrap.py) is on sys.path.
import sys, os, pathlib, subprocess

REPO_NAME = "fb2nep-epi"
REPO_URL  = "https://github.com/ggkuhnle/fb2nep-epi.git"
IN_COLAB  = "google.colab" in sys.modules

def ensure_repo_on_path():
    here = pathlib.Path.cwd()
    # Walk up a few levels to find scripts/bootstrap.py
    for p in [here, *here.parents]:
        if (p / "scripts" / "bootstrap.py").exists():
            os.chdir(p)                 # normalise CWD to repo root
            sys.path.append(str(p))     # ensure imports like "from scripts..." work
            return p
    # Not found locally: if on Colab, clone then chdir
    if IN_COLAB:
        # clone only if missing; clone into /content explicitly and fail loudly on error
        if not (pathlib.Path("/content") / REPO_NAME).exists():
            subprocess.run(["git", "clone", REPO_URL, f"/content/{REPO_NAME}"], check=True)
        os.chdir(f"/content/{REPO_NAME}")
        sys.path.append(os.getcwd())
        return pathlib.Path.cwd()
    # Otherwise, we can’t proceed
    raise FileNotFoundError("Could not find repo root containing scripts/bootstrap.py")

repo_root = ensure_repo_on_path()
print("Repo root:", repo_root)

In [None]:
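# Bootstrap: the cell above has already set the CWD to the repo root,
# so make sure the CWD itself (not its parent) is on sys.path, then import init.
import sys, pathlib
if str(pathlib.Path.cwd()) not in sys.path:
    sys.path.append(str(pathlib.Path.cwd()))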
from scripts.bootstrap import init
df, ctx = init()
df.head(2)

## 1) Recap — the workflow you implemented
1. **Intro & integrity**: load data, check ranges, flags ↔ dates, expected monotone signals.
2. **Describe the population**: robust *Table 1*; missingness exploration.
3. **Exposure analysis**: distributions; intake vs biomarker (construct validity).
4. **Theory → Model**: DAGs; decide a minimal sufficient adjustment set.
5. **Cross-sectional models**: transformations, non-linearity, diagnostics, VIF.
6. **Prospective models**: incident outcomes (logistic / survival), interpretation.
7. **Advanced**: confounding vs collider vs mediator; missing-data sensitivity.

Your **assessment** now asks you to pull a defensible thread through these steps for one clear question.

## 2) Reproducibility snapshot (optional in submission)
Capture versions to make your environment explicit.

In [None]:
import sys, numpy, pandas, matplotlib, statsmodels
print({
  'python': sys.version.split()[0],
  'numpy': numpy.__version__,
  'pandas': pandas.__version__,
  'matplotlib': matplotlib.__version__,
  'statsmodels': statsmodels.__version__
})
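
If you do include the snapshot, a variant that tolerates missing packages may be more robust than importing each one. The following is a minimal sketch, assuming the same four packages as above; the helper name `snapshot_versions` is hypothetical and not part of the course repository.

```python
import json
import sys
from importlib import metadata

def snapshot_versions(packages):
    """Return a dict of installed versions; packages that are missing map to None."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

snapshot = snapshot_versions(["numpy", "pandas", "matplotlib", "statsmodels"])
snapshot_json = json.dumps(snapshot, indent=2)
print(snapshot_json)
```

The JSON string can be pasted into an appendix or written to a file of your choice with `pathlib.Path(...).write_text(snapshot_json)`.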

## 3) Assessment 1 — checklist generator
This cell checks for the expected artefacts and gives you a quick status table.

**Expected at minimum**:
- A **DAG figure** (PNG/PDF) describing your reasoning & adjustment set.
- A **500-word** methods/results/interpretation/limitations text.
- One clear **Table 1** (CSV or embedded output) relevant to your question.
- One primary **logistic regression** on the chosen outcome (unadjusted + adjusted).

In [None]:
import pandas as pd, pathlib
root = pathlib.Path.cwd()  # repo root (the first cell normalised the CWD to it)
sub = root / 'submission'
# Table 1 is saved as table1_by_<OUTCOME>.csv, so match the pattern rather than a fixed name.
table1 = next(iter(sorted(sub.glob('table1_by_*.csv'))), sub / 'table1_by_OUTCOME.csv')
artefacts = {
    'Table 1 CSV (submission/table1_by_<OUTCOME>.csv)': table1,
    'DAG image (submission/dag.png)':                   sub / 'dag.png',
    'DAG PDF (optional)':                               sub / 'dag.pdf',
    'OR table (submission/primary_logistic_or.csv)':    sub / 'primary_logistic_or.csv',
    '500 words (submission/summary_500w.txt)':          sub / 'summary_500w.txt',
}
rows = []
for label, path in artefacts.items():
    exists = path.exists()
    size = path.stat().st_size if exists else 0
    rows.append({'Artefact': label, 'Exists': bool(exists), 'Bytes': int(size), 'Path': str(path)})
pd.DataFrame(rows)

If an item is missing, use the helper snippets below to create it. Use the versions you produced in earlier notebooks if you prefer — the point is *consistency* with your chosen question and DAG.

## 4) Helper — export your DAG image (template)
If you built a DAG in a previous notebook, re-run the cell there to regenerate the figure into `submission/`. Otherwise, adapt this minimal template.

_Note_: `networkx` is optional; feel free to export a diagram made elsewhere as long as it matches your model narrative.

In [None]:
from pathlib import Path
Path('submission').mkdir(exist_ok=True)
try:
    import networkx as nx, matplotlib.pyplot as plt
    G = nx.DiGraph()
    # EDIT THESE EDGES to match your final model
    G.add_edges_from([
        ('SES','Exposure'), ('SES','Outcome'),
        ('Age','Exposure'), ('Age','Outcome'),
        ('Smoking','Outcome'),
        ('Exposure','Outcome')
    ])
    pos = {'SES':(-1,1),'Age':(1,1),'Smoking':(1.8,0.7),'Exposure':(0,0),'Outcome':(0,-1)}
    plt.figure(figsize=(5.2,4.2))
    nx.draw(G, pos, with_labels=True, node_size=1600, node_color='#e6f2ff', arrows=True)
    plt.axis('off'); plt.tight_layout()
    plt.savefig('submission/dag.png', dpi=200)
    plt.savefig('submission/dag.pdf')
    print('Saved submission/dag.png and dag.pdf')
except Exception as e:
    print('DAG export skipped — install networkx/matplotlib or supply your own image. Error:', e)
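
Before fitting models, it can help to check that the edge list is actually acyclic and to see which nodes are shared causes of exposure and outcome. The sketch below uses plain Python (no `networkx`) and the template edges above; the helper names are hypothetical. Listing shared ancestors is only a heuristic for candidate confounders, not a formal back-door criterion check, so treat the output as a prompt for your own reasoning.

```python
# Edges copied from the template above; edit to match your own DAG.
edges = [
    ("SES", "Exposure"), ("SES", "Outcome"),
    ("Age", "Exposure"), ("Age", "Outcome"),
    ("Smoking", "Outcome"),
    ("Exposure", "Outcome"),
]

def parents_map(edges):
    """Map each node to the set of its direct parents."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(src, set())
        parents.setdefault(dst, set()).add(src)
    return parents

def ancestors(node, parents):
    """All nodes with a directed path into `node` (iterative search)."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen

def is_acyclic(parents):
    """The graph is acyclic if no node is its own ancestor."""
    return all(n not in ancestors(n, parents) for n in parents)

parents = parents_map(edges)
acyclic = is_acyclic(parents)
# Shared ancestors of exposure and outcome: candidate confounders.
common_causes = sorted(ancestors("Exposure", parents) & ancestors("Outcome", parents))
# Nodes downstream of the exposure: adjusting for these risks blocking or biasing the effect.
downstream = sorted(n for n in parents if "Exposure" in ancestors(n, parents))

print("Acyclic:", acyclic)
print("Common causes of Exposure and Outcome:", common_causes)
print("Downstream of Exposure:", downstream)
```

With the template edges, `Smoking` affects only the outcome, so it is not listed as a common cause; whether to include it (for precision) is a modelling decision you should justify in the text.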

## 5) Helper — logistic model export (tidy OR table)
Run your **unadjusted** and **adjusted** logistic regressions and save a simple OR table for the **primary exposure**. Edit `OUTCOME`, `EXPOSURE`, and `adj` to match your assessment choice.

In [None]:
import numpy as np, pandas as pd, statsmodels.api as sm
from patsy import dmatrices

# === EDIT THESE FOR YOUR QUESTION ===
OUTCOME  = 'Cancer_incident'          # e.g., 'Cancer_incident' or 'CVD_incident'
EXPOSURE = 'red_meat_g_d'             # e.g., 'red_meat_g_d' or 'salt_g_d'
adj = ['age','sex','BMI','SES_class','IMD_quintile','smoking_status']
# ===================================

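# Complete cases on outcome, exposure and all covariates, so both models use the same n.
dat = df[[OUTCOME, EXPOSURE] + adj].dropna().copy()
print('n (complete cases):', len(dat), 'of', len(df))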
def wrap_cat(d, v):
    return f"C({v})" if (d[v].dtype=='object' or str(d[v].dtype).startswith('category')) else v
rhs_adj = ' + '.join([EXPOSURE] + [wrap_cat(dat, v) for v in adj])

def fitlog(formula, data):
    y, X = dmatrices(formula, data=data, return_type='dataframe')
    return sm.Logit(y, X).fit(disp=False)

fit_u  = fitlog(f"{OUTCOME} ~ {EXPOSURE}", dat)
fit_a  = fitlog(f"{OUTCOME} ~ " + rhs_adj, dat)

def tidy_or(f):
    OR = np.exp(f.params).rename('OR')
    CI = np.exp(f.conf_int()).rename(columns={0:'2.5%',1:'97.5%'})
    return pd.concat([OR,CI], axis=1).round(3)

tab_u = tidy_or(fit_u).filter(like=EXPOSURE, axis=0)
tab_a = tidy_or(fit_a).filter(like=EXPOSURE, axis=0)
tab = pd.concat([tab_u.assign(Model='Unadjusted'), tab_a.assign(Model='Adjusted')])
tab = tab[['Model','OR','2.5%','97.5%']]
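import pathlib
pathlib.Path('submission').mkdir(exist_ok=True)  # the folder may not exist yet if the DAG helper was skipped
tab.to_csv('submission/primary_logistic_or.csv', index=True)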
tab
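
When the exposure is measured in small units (for example grams per day), the odds ratio per single unit can round to something uninformative. The following sketch shows the arithmetic behind an OR and its Wald confidence interval, and how to rescale it to a larger increment. The function name `odds_ratio`, the `per` argument and the numbers are illustrative assumptions, not values from the cohort; `per` is assumed to be positive.

```python
import math
from statistics import NormalDist

def odds_ratio(beta, se, per=1.0, level=0.95):
    """OR and Wald CI for a `per`-unit increase, from a log-odds coefficient and its SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    est = math.exp(per * beta)
    lo = math.exp(per * (beta - z * se))
    hi = math.exp(per * (beta + z * se))
    return est, lo, hi

# Illustrative numbers only: log-odds change per 1 g/day and its standard error.
beta, se = 0.002, 0.0008
per_1 = odds_ratio(beta, se)
per_50 = odds_ratio(beta, se, per=50)
print("per 1 g/d :", [round(x, 3) for x in per_1])
print("per 50 g/d:", [round(x, 3) for x in per_50])
```

Whichever increment you report, state it explicitly next to the OR (for example "per 50 g/day") so the table and the 500 words agree.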

## 6) Helper — Table 1 export
If you used the `make_table1` utility in notebook 02, re-use it here. Otherwise, this quick variant creates a minimal by-outcome table (continuous variables: mean, SD, median, n; categorical variables: counts and within-group proportions) and writes it to CSV. Edit the outcome and variable lists as needed.

In [None]:
import pandas as pd, numpy as np
from pathlib import Path
Path('submission').mkdir(exist_ok=True)

OUTCOME = OUTCOME  # keep aligned with the cell above
cont = ['age','BMI','SBP','energy_kcal','fruit_veg_g_d','red_meat_g_d','salt_g_d']
cat  = ['sex','smoking_status','physical_activity','SES_class','IMD_quintile','menopausal_status']

def simple_table1(data, group, cont, cat):
    """Long-format Table 1: one row per variable, level and outcome group."""
    pieces = []
    # continuous: mean / SD / median / n within each outcome group
    for v in cont:
        g = data.groupby(group)[v].agg(['mean','std','median','count']).round(2)
        g = g.rename(columns={'count': 'n'}).rename_axis('group').reset_index()
        g.insert(0, 'level', '')        # continuous variables have no levels
        g.insert(0, 'variable', v)
        pieces.append(g)
    # categorical: counts and within-group proportions for each level
    for v in cat:
        ct = data.groupby([group, v]).size().rename('n').reset_index()
        ct = ct.rename(columns={group: 'group', v: 'level'})
        ct['pct'] = (ct['n'] / ct.groupby('group')['n'].transform('sum')).round(3)
        ct.insert(0, 'variable', v)
        pieces.append(ct)
    out = pd.concat(pieces, ignore_index=True, sort=False)
    return out.set_index(['variable', 'level', 'group'])

t1 = simple_table1(df, OUTCOME, cont, cat)
t1.to_csv(f'submission/table1_by_{OUTCOME}.csv')
t1.head(12)

## 7) Helper — 500 words template
Run this to create `submission/summary_500w.txt`, then edit the text to fit your analysis. Keep to **≤ 500 words**; the final `assert` raises an error if the text in the cell exceeds the limit.

In [None]:
from pathlib import Path
text = (
    "Title: [Your exposure → outcome question here]\n\n"
    "Methods: We analysed N=[…] adults from the FB2NEP synthetic cohort. The primary outcome was […]. "
    "The exposure was […], with construct validity supported by […]. We specified a DAG and selected a minimal "
    "adjustment set: […]. We fit unadjusted and adjusted logistic regressions; sensitivity checks included […] (e.g., simple imputation vs CC).\n\n"
    "Results: [Key numbers: Table 1 highlights; unadjusted OR; adjusted OR with 95% CI; brief direction of change.]\n\n"
    "Interpretation: The adjusted association suggests […]. Given measurement error and residual confounding, "
    "the true effect may be […]. Findings align/contrast with […].\n\n"
    "Limitations: Synthetic data; exposure misclassification; potential MNAR missingness; model misspecification; generalisability.\n"
)
Path('submission').mkdir(exist_ok=True)
Path('submission/summary_500w.txt').write_text(text, encoding='utf-8')
print('Wrote submission/summary_500w.txt (edit this file to finalise text).')
assert len(text.split()) <= 500, 'Keep the summary ≤ 500 words.'
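
The assert above checks the text in the cell, not the file you later edit by hand. A minimal sketch for counting words in the saved file is given below; the helper names `count_words` and `check_limit` are hypothetical. It counts whitespace-separated tokens, which may differ slightly from the rule used by markers or by a word processor.

```python
import tempfile
from pathlib import Path

def count_words(path):
    """Whitespace-delimited word count of a UTF-8 text file."""
    return len(Path(path).read_text(encoding="utf-8").split())

def check_limit(path, limit=500):
    """Return the word count and a short status string."""
    n = count_words(path)
    status = "OK" if n <= limit else f"over by {n - limit}"
    return n, status

# Demo on a temporary file so the sketch runs anywhere;
# in the notebook you would pass 'submission/summary_500w.txt' instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "summary_demo.txt"
    demo.write_text("Methods: we analysed a synthetic cohort.\n", encoding="utf-8")
    demo_result = check_limit(demo)
    tight_result = check_limit(demo, limit=4)

print(demo_result)
print(tight_result)
```

Headings such as "Methods:" count as words here, so leave yourself a small margin below the limit.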

## 8) Self-check vs marking rubric

- **Data handling & clarity (25%)**: Does your Table 1 match the question? Missingness patterns noted? Units/coding clear?
- **Correctness & interpretation (30%)**: Are models appropriate (logistic/survival as needed)? OR/CI reported correctly?
- **Causal reasoning (25%)**: Is the DAG coherent? Is the adjustment set minimal and justified (no colliders/mediators unless direct effect intended)?
- **Communication (20%)**: Is the DAG legible, the table readable, and the 500-word text concise and precise (British English)?

Final tip: make sure the **n** used in models is explicit (after exclusions/imputation).
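
One way to make the analytic **n** explicit is a small flow table showing how many rows remain after each complete-case restriction. The sketch below assumes a complete-case analysis; the function name `flow_counts` and the toy column names are hypothetical and used for illustration only.

```python
import pandas as pd

def flow_counts(data, outcome, exposure, covariates):
    """Rows remaining after each successive complete-case restriction."""
    steps = [
        ("All rows", []),
        (f"Non-missing {outcome}", [outcome]),
        (f"+ non-missing {exposure}", [outcome, exposure]),
        ("+ non-missing covariates", [outcome, exposure, *covariates]),
    ]
    rows = []
    for label, cols in steps:
        n = len(data.dropna(subset=cols)) if cols else len(data)
        rows.append({"Step": label, "n": n})
    return pd.DataFrame(rows)

# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "outcome":  [0, 1, 0, None, 1, 0],
    "exposure": [10.0, None, 30.0, 40.0, 50.0, 60.0],
    "age":      [50, 61, None, 45, 70, 58],
})
flow = flow_counts(toy, "outcome", "exposure", ["age"])
print(flow)
```

In the notebook, the same function could be called with your data frame, `OUTCOME`, `EXPOSURE` and `adj`; the last row should match the n of the data passed to the models.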

> ## Key takeaways
>
> - Your analysis should tell a coherent causal story from **DAG → model → result → interpretation**.
> - Prefer **simplicity with justification** over complicated models without diagnostics.
> - Export your artefacts to `submission/` and verify with the checklist before uploading.