This repository is the orchestration layer for benchmark reruns:
- dataset preparation
- SLURM job generation and stage launch
- statistics refresh (including bootstrap and paired tests)
- manuscript table and figure refresh
Model-side execution is handled by the sibling ../fms-ehrs repository.
| Repo | Responsibility |
|---|---|
| input-representation-benchmark | experiment matrix, stage orchestration, stats refresh, paper refresh |
| ../fms-ehrs | tokenizer, training, hidden-state extraction, downstream prediction scripts |
| ../697b81f1f269207e5416f18d/MLHC | manuscript source and compiled paper |
- Stage -1 (MEDS extraction)
  - script: benchmarks/mimic-meds-extraction/scripts/01_extract_meds_full.sh
  - launcher: slurm/01_phase0_extract_meds.sh
- Stage 0 (tokenization + base outcomes)
  - orchestration: pipeline/run_experiments.py + slurm/03_stage0_tokenize_and_outcomes_meds.sh
  - model repo call: ../fms-ehrs/fms_ehrs/scripts/tokenize_w_config.py
  - outcome join: pipeline/scripts/extract_outcomes_meds.py
- Stage 0.5 (optional extended outcomes)
  - script: pipeline/scripts/extract_extended_outcomes.py
  - launchers: slurm/13_refresh_all_extended_outcomes.sh, slurm/13_refresh_exp3_extended_outcomes.sh
- Stage 1 (representation training)
  - launchers: slurm/04_exp1_stage1_tune_packed.sh, slurm/07_exp2_stage1_train_representation.sh, slurm/08_exp3_stage1_train_representation.sh
  - model repo calls: ../fms-ehrs/fms_ehrs/scripts/tune_model.py, ../fms-ehrs/fms_ehrs/scripts/train_representation.py
- Stage 2 (hidden-state extraction)
  - launchers: slurm/09_run_stage2_gpu2_extract.sh, slurm/ref_qse/09_extract_reps.sh
  - model repo call: ../fms-ehrs/fms_ehrs/scripts/extract_hidden_states.py
- Stage 3 (downstream probes)
  - launchers: slurm/11_run_stage3_tier2q_lr.sh, slurm/ref_qse/10_xfer_rep_based_preds.sh
  - model repo call: ../fms-ehrs/fms_ehrs/scripts/transfer_rep_based_preds.py
- Statistics refresh
  - orchestration: pipeline/scripts/regenerate_aligned_family_stats.py + slurm/15_submit_aligned_family_stats.sh
  - model repo backend: ../fms-ehrs/fms_ehrs/scripts/aggregate_version_preds.py
- Paper refresh
  - scripts: paper/scripts/generate_mlhc_appendix_tables.py, paper/scripts/generate_mlhc_appendix_outcome_descriptives.py, paper/scripts/generate_mlhc_paper_figures.py
For the full file-by-file stage walkthrough, see PIPELINE.md.
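As a hedged sketch of how stage launch can chain SLURM jobs (a hypothetical helper; pipeline/run_experiments.py may orchestrate this differently), dependency-aware submission looks like:

```python
# Illustrative sketch of dependency-chained SLURM submission.
# Hypothetical helper: not the repo's actual orchestration logic.
from typing import List, Optional

def build_sbatch_command(script: str, dep_job_id: Optional[str] = None) -> List[str]:
    """Build an sbatch invocation, chaining on a prior job if one is given."""
    cmd = ["sbatch", "--parsable"]  # --parsable makes sbatch print only the job id
    if dep_job_id is not None:
        cmd.append(f"--dependency=afterok:{dep_job_id}")
    cmd.append(script)
    return cmd

# Chain Stage 0 -> Stage 1 -> Stage 2; commands are built but not executed here.
stages = [
    "slurm/03_stage0_tokenize_and_outcomes_meds.sh",
    "slurm/07_exp2_stage1_train_representation.sh",
    "slurm/09_run_stage2_gpu2_extract.sh",
]
prev_id = None
commands = []
for i, script in enumerate(stages):
    commands.append(build_sbatch_command(script, prev_id))
    prev_id = str(1000 + i)  # placeholder: real code would parse the sbatch output
print(commands[1])
```

In a real run, each `prev_id` would come from the stdout of the previous `sbatch --parsable` call rather than a placeholder.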
Experiment 3 note:
The arm-construction step rewrites CLIF-covered LAB, VITAL, MEDICATION, and INFUSION codes upstream, but the tokenizer used for the reported Experiment 3 runs reads only the LAB and VITAL event blocks. See pipeline/scripts/build_exp3_meds_semantics_arms.py together with ../fms-ehrs/fms_ehrs/config/mimic-meds-exp3-icu.yaml.
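As a hedged illustration of that coverage gap (the block names and counts below are invented for the example, not taken from the repo's mapping metadata), one can tally how many remapped events fall in blocks the tokenizer actually reads:

```python
# Hypothetical audit of which remapped event blocks are visible to a
# tokenizer that reads only LAB and VITAL events. All numbers illustrative.
remap_counts = {  # events remapped per block (invented example values)
    "LAB": 1200, "VITAL": 800, "MEDICATION": 500, "INFUSION": 300,
}
tokenizer_blocks = {"LAB", "VITAL"}  # blocks the Exp3 tokenizer config reads

visible = sum(n for block, n in remap_counts.items() if block in tokenizer_blocks)
total = sum(remap_counts.values())
print(f"{visible}/{total} remapped events visible to the tokenizer")
```

The real counts come from artifacts/runs/exp3/arms/meds_mapped/mappings/meta.json, described in the artifact table below.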
The paper appendix keeps a short compute summary. The fuller stage-by-stage resource record for the reported MLHC runs is:
- Stage 0 (tokenization + outcomes, CPU-only): 8 CPUs, 300 GB RAM.
- Stage 0E (evaluation retokenization, CPU-only): 8 CPUs, 80 GB RAM.
- Stage 1 (generative medical event model training, GPU): 4x NVIDIA A100-PCIE-40GB for Exp1, 2x A100 for Exp2 and Exp3, and 1x A100 for the xVal-affine reruns; 8-16 CPUs, 128-256 GB RAM.
- Stage 2 (representation extraction, GPU): 1x NVIDIA A100-PCIE-40GB, 8 CPUs, 128 GB RAM.
- Stage 3 (logistic regression probes, CPU-only): 8 CPUs, 256 GB RAM.
All reported jobs were single-node runs. All reported Stage 1 runs used FlashAttention-2; those precision and kernel choices affected runtime and memory, but not the model objective or evaluation definitions.
| File name | What it stores | First use |
|---|---|---|
| artifacts/runs/statistics/paper_stats_run_artifacts/all_family_metrics.csv | Master point estimates and 95% CIs for every reported handle, outcome, and metric. | First stop for nearly every AUROC, Spearman rho, AUPRC, Brier, and ECE value in paper.tex. |
| artifacts/runs/statistics/paper_stats_run_artifacts/all_family_pairwise.csv | Paired permutation deltas and BH-adjusted p-values for all within-family handle comparisons. | Use for inferential comparisons and non-baseline pairwise checks. |
| artifacts/runs/statistics/paper_stats_run_artifacts/all_family_pairwise_baseline.csv | Baseline-centered subset of pairwise results. | Main-text delta heatmaps and text that compares each experiment to its shared reference arm. |
| ../697b81f1f269207e5416f18d/MLHC/generated/appendix_binary_outcome_descriptives.tex | Binary outcome prevalence, eligible N, and descriptive coverage. | Outcome sample sizes and appendix descriptive tables. |
| ../697b81f1f269207e5416f18d/MLHC/generated/appendix_regression_outcome_descriptives.tex | Regression outcome eligible N, summary ranges, and descriptive coverage. | Regression sample sizes and appendix descriptive tables. |
| ../697b81f1f269207e5416f18d/MLHC/generated/appendix_binary_sweep.tex | Manuscript binary outcome sweep table with point estimates and CIs. | Appendix binary metrics included by LaTeX. |
| ../697b81f1f269207e5416f18d/MLHC/generated/appendix_regression_sweep.tex | Manuscript regression sweep table with point estimates and CIs. | Appendix regression metrics included by LaTeX. |
| ../697b81f1f269207e5416f18d/MLHC/generated/appendix_stats_coverage.tex | Coverage summary for which stats surfaces are included in the paper appendix. | Appendix stats-coverage table. |
| ../697b81f1f269207e5416f18d/MLHC/figures/ | Final paper figures consumed by LaTeX. | What the manuscript actually renders. |
| artifacts/runs/models/exp1_*, artifacts/runs/models/exp2_*, artifacts/runs/models/exp3_* | Model configs, checkpoints, vocab sizes, representation metadata, and training logs. | Model inventory, vocab sizes, parameter counts, and pretraining-loss source. |
| artifacts/runs/tokenized/mimiciv-3.1_meds_70-10-20/ and artifacts/runs/exp3/ tokenized directories | Tokenized Parquet timelines and vocabularies. | Sequence lengths, overflow fractions, realized bin counts, and zero-frequency token audits. |
| artifacts/runs/exp3/arms/meds_mapped/mappings/meta.json | Train-split event remapping counts for Experiment 3. | LAB/VITAL remapping percentages reported in paper.tex. |
| outputs/diagnostics_xval_zero_audit/xval_zero_out_results.json | xVal near-zero suppression ratios and counts. | Mechanistic xVal trench claims. Regenerate with slurm/xval_zero_out_audit_tier1q.sh. |
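A minimal stdlib sketch of querying the master metrics table; the column names (handle, outcome, metric, point, ci_lo, ci_hi) are assumptions for illustration, so check the real CSV header before relying on them:

```python
import csv
import io

# Stand-in for all_family_metrics.csv; the columns and values below are
# invented for the example, not read from the real artifact.
sample = """handle,outcome,metric,point,ci_lo,ci_hi
discrete_tt,mortality,auroc,0.842,0.831,0.853
meds,mortality,auroc,0.851,0.840,0.862
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Collect the AUROC point estimate per handle.
auroc = {r["handle"]: float(r["point"]) for r in rows if r["metric"] == "auroc"}
print(auroc)
```

Against the real file, replace `io.StringIO(sample)` with an open file handle on the CSV path listed above.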
| Surface | Builder / implementation |
|---|---|
| Aligned family metrics and CIs | pipeline/scripts/regenerate_aligned_family_stats.py |
| Baseline-centered pairwise stats | paper/scripts/recompute_baseline_pairwise_view.py |
| Appendix sweep and coverage tables | paper/scripts/generate_mlhc_appendix_tables.py |
| Appendix descriptive tables | paper/scripts/generate_mlhc_appendix_outcome_descriptives.py |
| Main-text and appendix figures | paper/scripts/generate_mlhc_paper_figures.py |
| Outcome extraction and clinical definitions | pipeline/scripts/extract_outcomes_meds.py, pipeline/scripts/extract_extended_outcomes.py |
| xVal zero-out diagnostic | pipeline/scripts/diagnostics/diag_xval_zero_out.py, slurm/xval_zero_out_audit_tier1q.sh |
| Soft-vs-discrete contextual CE diagnostic | pipeline/scripts/diagnostics/diag_attention_washout.py |
| Clinical boundary probe | pipeline/scripts/diagnostics/diag_clinical_boundary_probe.py |
| Embedding geometry / sparsity diagnostics | pipeline/scripts/diagnostics/diag_embedding_geometry.py, pipeline/scripts/diagnostics/diag_potassium_geometry_grid.py, pipeline/scripts/diagnostics/diag_sparsity_distance.py |
| Model-side tokenization and representation mechanics | ../fms-ehrs/fms_ehrs/scripts/tokenize_w_config.py, ../fms-ehrs/fms_ehrs/framework/tokenizer.py, ../fms-ehrs/fms_ehrs/framework/soft_discretization.py, ../fms-ehrs/fms_ehrs/framework/xval.py, ../fms-ehrs/fms_ehrs/framework/model_wrapper.py, ../fms-ehrs/fms_ehrs/scripts/train_representation.py |
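The pairwise stats surfaces report BH-adjusted p-values; here is a minimal, self-contained Benjamini-Hochberg sketch (an illustrative reimplementation, not the repo's code):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (illustrative only)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adj = [0.0] * m
    prev = 1.0
    # Walk from the largest rank down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adj[i] = prev
    return adj

print(bh_adjust([0.01, 0.04, 0.03, 0.20]))
```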
| Paper area | Primary artifact files | Supporting code / metadata |
|---|---|---|
| Abstract, intro takeaways, conclusion | artifacts/runs/statistics/paper_stats_run_artifacts/all_family_metrics.csv, artifacts/runs/statistics/paper_stats_run_artifacts/all_family_pairwise_baseline.csv, artifacts/runs/models/exp1_*, artifacts/runs/models/exp2_*, artifacts/runs/models/exp3_* | paper/scripts/generate_mlhc_paper_figures.py |
| Outcome definitions and eligible sample sizes | ../697b81f1f269207e5416f18d/MLHC/generated/appendix_binary_outcome_descriptives.tex, ../697b81f1f269207e5416f18d/MLHC/generated/appendix_regression_outcome_descriptives.tex, ../697b81f1f269207e5416f18d/MLHC/paper.tex Table tab:outcomes_sources | pipeline/scripts/extract_outcomes_meds.py, pipeline/scripts/extract_extended_outcomes.py |
| Cohort and extraction details | artifacts/runs/tokenized/mimiciv-3.1_meds_70-10-20/, artifacts/runs/exp3/, ../697b81f1f269207e5416f18d/MLHC/paper.tex Table tab:cohort_summary | pipeline/scripts/align_cohorts.py, benchmarks/mimic-meds-extraction/ configs and scripts |
| Experiment 1 claims | artifacts/runs/statistics/paper_stats_run_artifacts/all_family_metrics.csv, artifacts/runs/statistics/paper_stats_run_artifacts/all_family_pairwise_baseline.csv, ../697b81f1f269207e5416f18d/MLHC/generated/appendix_binary_sweep.tex, ../697b81f1f269207e5416f18d/MLHC/generated/appendix_regression_sweep.tex | paper/scripts/generate_mlhc_paper_figures.py |
| Experiment 2 claims | artifacts/runs/statistics/paper_stats_run_artifacts/all_family_metrics.csv, artifacts/runs/statistics/paper_stats_run_artifacts/all_family_pairwise_baseline.csv, artifacts/runs/models/exp2_*, outputs/diagnostics_xval_zero_audit/xval_zero_out_results.json | ../fms-ehrs/fms_ehrs/framework/xval.py, ../fms-ehrs/fms_ehrs/framework/soft_discretization.py, pipeline/scripts/diagnostics/diag_xval_zero_out.py, pipeline/scripts/diagnostics/diag_attention_washout.py |
| Experiment 3 claims | artifacts/runs/statistics/paper_stats_run_artifacts/all_family_metrics.csv, artifacts/runs/statistics/paper_stats_run_artifacts/all_family_pairwise_baseline.csv, artifacts/runs/exp3/arms/meds_mapped/mappings/meta.json, artifacts/runs/models/exp3_* | paper/scripts/recompute_baseline_pairwise_view.py |
| Main-text and appendix figures | ../697b81f1f269207e5416f18d/MLHC/figures/ | paper/scripts/generate_mlhc_paper_figures.py |
| Appendix tables | ../697b81f1f269207e5416f18d/MLHC/generated/ | paper/scripts/generate_mlhc_appendix_tables.py, paper/scripts/generate_mlhc_appendix_outcome_descriptives.py |
Deprecated XGBoost baseline material:
- deprecated/scripts/xgboost_baseline.py
- deprecated/slurm/xgboost_baseline.sh
- deprecated/outputs/xgboost_baseline/xgboost_baseline_results.json
For Exp2/Exp3 stats regeneration:
- bootstrap samples: 2000
- paired permutation tests: enabled with permutation_n=2000 when requested
- baseline-only pairwise mode: pass --baseline_handle (discrete_tt for Exp2, meds for Exp3)
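A self-contained sketch of a paired sign-flip permutation test and a percentile bootstrap CI at those sample counts (illustrative only; the repo's implementation lives in pipeline/scripts/regenerate_aligned_family_stats.py and may differ in detail):

```python
import random
from statistics import mean

def paired_permutation_p(diffs, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation test on paired per-subject deltas."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired delta's sign is exchangeable.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p == 0

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean."""
    rng = random.Random(seed)
    stats = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented per-subject metric deltas for the example.
diffs = [0.012, -0.004, 0.020, 0.008, 0.015, -0.002, 0.010, 0.006]
p = paired_permutation_p(diffs)
lo, hi = bootstrap_ci(diffs)
print(p, lo, hi)
```

In practice the deltas would be per-subject differences between two handles on the same outcome, and the p-values would then be BH-adjusted across the family.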
Use slurm/15_submit_aligned_family_stats.sh to generate the local stats rerun jobfiles under slurm/generated/statistics/.
The completed strict-parity rerun sheets are archived under deprecated/slurm/strict_parity_exp23/generated/demo/.
Use the input-rep environment:

```bash
conda activate input-rep
```

Unit + contract tests:

```bash
pytest pipeline/tests/unit
```

If pytest is not available, you can still run the core contract checks directly:

```bash
python - <<'PY'
import sys
sys.path.insert(0, ".")
from pipeline.tests.unit import test_pipeline_script_contracts as t
t.test_every_pipeline_script_has_main_and_entry_guard()
t.test_every_pipeline_script_has_dryrun_wrapper()
t.test_dryrun_manifest_matches_pipeline_scripts()
print("pipeline contract checks passed")
PY
```

Dry-run wrappers (one wrapper per pipeline script):

```bash
bash pipeline/tests/dryrun/run_all.sh
```

Optional execute-mode dry-runs for scripts configured with safe CLI checks:

```bash
IRB_DRYRUN_EXECUTE_MODE=1 bash pipeline/tests/dryrun/run_all.sh
```

| Path | Role |
|---|---|
| pipeline/ | stage orchestration scripts and control logic |
| pipeline/tests/unit/ | unit and contract tests for orchestration |
| pipeline/tests/dryrun/ | dry-run wrappers for each pipeline script |
| paper/ | manuscript table/figure builders |
| utilities/ | non-stage helper scripts and QC |
| slurm/ | stage launchers and local generated jobfiles |
| artifacts/ | run outputs |
| deprecated/ | archived CLIF/UCMC/legacy material |
- PIPELINE.md: stage-by-stage run notes
- pipeline/tests/README.md: orchestration test and dry-run details
- slurm/README.md: launcher layout
- docs/layout.md: repository layout notes
- docs/surface_inventory.md: active vs utility vs archived surfaces
When docs and scripts disagree, follow the scripts and then update the docs.