# Experiment: PRJDB36442 MetaPhlAn QC + Pilot M/S analysis

Objective:
- State the question you want to answer.
- Define the success criteria.


## Data inputs
This notebook reads **small, auditable** outputs produced by the cloud MetaPhlAn run.
- Sample sheet: `results/processed/metadata/PRJDB36442_sample_sheet.tsv`
- Alpha diversity: `results/processed/analysis/PRJDB36442/alpha_diversity.tsv`
- Species differential (CLR + permutation): `results/processed/analysis/PRJDB36442/species_differential_clr.tsv`

If these files do not exist, run the pipeline steps documented in `docs/PIPELINE_REPRO.md`.
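
A missing column would otherwise surface only as a `KeyError` deep inside a later cell, so it can help to check each header up front. The following is a minimal sketch, assuming the column names the cells below read (`group`, `shannon`, `richness`, `clade_name`, `cohens_d`, `prev_M`, `prev_S`). `missing_columns` is a hypothetical helper, not part of the pipeline, and the demo header (including `sample_id`) is made up for illustration.

```python
import csv
import io

def missing_columns(handle, required):
    """Return the required column names that are absent from a TSV header row."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader, [])  # an empty file yields an empty header
    return [c for c in required if c not in header]

# Demo on an in-memory TSV with a made-up header (not real pipeline output)
demo = io.StringIO("sample_id\tgroup\tshannon\n")
print(missing_columns(demo, ['group', 'shannon', 'richness']))  # prints ['richness']
```

To apply it to a real file, something like `with PATH_ALPHA.open(newline='') as fh: missing_columns(fh, ['group', 'shannon', 'richness'])` should work once the paths in the next cell are defined.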


In [None]:
from pathlib import Path
import csv

ROOT = Path('.').resolve()
PATH_SAMPLE_SHEET = ROOT / 'results/processed/metadata/PRJDB36442_sample_sheet.tsv'
PATH_ALPHA = ROOT / 'results/processed/analysis/PRJDB36442/alpha_diversity.tsv'
PATH_DIFF = ROOT / 'results/processed/analysis/PRJDB36442/species_differential_clr.tsv'

for p in [PATH_SAMPLE_SHEET, PATH_ALPHA, PATH_DIFF]:
    print(p, 'exists=' + str(p.exists()), 'size=' + str(p.stat().st_size if p.exists() else 0))


In [None]:
# Quick check: group counts
# Use an explicit '\t' escape and close the file handle after reading
with PATH_SAMPLE_SHEET.open(newline='') as fh:
    rows = list(csv.DictReader(fh, delimiter='\t'))
counts = {}
for r in rows:
    g = r['group']
    counts[g] = counts.get(g, 0) + 1
print('n_samples=', len(rows))
print('group_counts=', counts)


In [None]:
# Alpha diversity summary (Shannon, richness) by group
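with PATH_ALPHA.open(newline='') as fh:
    alpha = list(csv.DictReader(fh, delimiter='\t'))
by = {'M': [], 'S': []}
for r in alpha:
    # setdefault avoids a KeyError if a group label other than M/S appears
    by.setdefault(r['group'], []).append(r)

def mean(vals):
    # Return NaN instead of raising ZeroDivisionError when a group is empty
    return sum(vals) / len(vals) if vals else float('nan')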

for g, rs in by.items():
    sh = [float(x['shannon']) for x in rs]
    rich = [int(x['richness']) for x in rs]
    print(g, 'n=', len(rs), 'shannon_mean=', round(mean(sh), 4), 'richness_mean=', round(mean(rich), 1))
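
Group means alone say little about spread, which matters when the groups are pilot-sized. The following sketch adds the median and sample standard deviation using only the standard library. `summarize` is an illustrative helper, and the demo values are made up rather than taken from the results.

```python
import statistics

def summarize(vals):
    """Return n, mean, median, and sample SD; fields are None when undefined."""
    n = len(vals)
    if n == 0:
        return {'n': 0, 'mean': None, 'median': None, 'sd': None}
    return {
        'n': n,
        'mean': statistics.fmean(vals),
        'median': statistics.median(vals),
        # The sample standard deviation needs at least two observations
        'sd': statistics.stdev(vals) if n > 1 else None,
    }

# Demo with made-up values (not real Shannon indices)
print(summarize([1.0, 2.0, 3.0]))
```

With the `by` dictionary from the cell above, `summarize([float(x['shannon']) for x in by['M']])` would give the same fields for one group.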


In [None]:
# Top species by effect size (Cohen's d) from the pilot CLR analysis
with PATH_DIFF.open(newline='') as fh:
    diff = list(csv.DictReader(fh, delimiter='\t'))
for r in diff:
    r['_d'] = float(r['cohens_d'])
    # Treat a missing or empty q-value as non-significant (q = 1.0)
    r['_q'] = float(r['q_fdr']) if r.get('q_fdr') else 1.0

diff_sorted = sorted(diff, key=lambda x: abs(x['_d']), reverse=True)
for r in diff_sorted[:20]:
    species = r['clade_name'].split('|')[-1]
    print(species, 'd=', round(r['_d'], 3), 'q=', round(r['_q'], 3), 'prev_M=', r['prev_M'], 'prev_S=', r['prev_S'])
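
The `cohens_d` column ranks species by a standardized mean difference on CLR-transformed abundances. This notebook does not show how the pipeline computes it, so the following is only a sketch under stated assumptions: a pseudocount of `1e-6` added before taking logs, the pooled sample standard deviation as the denominator, and a first-minus-second sign convention. The function names are illustrative, and both groups need at least two values.

```python
import math
import statistics

def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio: each log value minus the mean log of the sample."""
    # The pseudocount is an assumption; it keeps log() defined for zero abundances
    logs = [math.log(x + pseudocount) for x in abundances]
    center = statistics.fmean(logs)
    return [v - center for v in logs]

def cohens_d(a, b):
    """Cohen's d with pooled sample SD; positive when mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return math.nan  # undefined when neither group varies
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

# Demo with made-up values: group means 2 and 3, each with variance 1
print(cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # prints -1.0
```

CLR values sum to zero within a sample by construction, which is a quick sanity check on any transformed profile.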


In [None]:
# Setup: imports and reproducibility
import random
import statistics

SEED = 7
random.seed(SEED)
SEED


## Plan

- Hypothesis:
- Variables to sweep:
- Metrics to record:


In [None]:
# Define parameters and lightweight helpers
sample_size = 20
values = [random.random() for _ in range(sample_size)]
summary = {
    "count": len(values),
    "mean": statistics.fmean(values),
    "min": min(values),
    "max": max(values),
}
summary


## Results

- Key observations:
- Surprises or failure modes:
- Decision: continue, pivot, or stop:


In [None]:
# Record findings in a minimal, copy-pasteable structure
result = {
    "seed": SEED,
    "mean": summary["mean"],
    "range": summary["max"] - summary["min"],
}
result


## Next steps

- What to try next:
- What to document elsewhere (PRD, notes, issue):
