OpenADMET PXR Blind Challenge — Structure-Track Method Report

Submission: v10.1_failureproofed.zip — 184 PDB files, ligand residue LIG. A uniform 5-model consensus-central ("medoid") selection (RF3, AlphaFold3, Boltz2, Protenix, ESMFold2), then failure-proofed against the OST scorer's ligand-bond NaN mode. Baseline: openadmet_structure_v3.zip — the unmodified 3-model consensus output (first one we submitted to the leaderboard).

TL;DR

For each of the 184 blinded fragments we predict the PXR–ligand complex with independent co-folding models across many seeds — the base trio RF3, AlphaFold3, Boltz2, plus two architecturally-independent models added in v10, Protenix and ESMFold2 — then pick the structure most central to the set of models that agree (the consensus medoid). When the base trio doesn't agree we fall back to the best cross-model pair. 179 / 184 (97.3 %) of submitted structures have ≥ 2 base-trio models agreeing within 1 Å, and v10's medoid additionally incorporates the two independent models on the 133 / 184 ligands where at least one of them joins the base agreement.

Pipeline

Step	What
Inputs	PXR-LBD protein sequence (293 aa) + ligand SMILES from `pxr-challenge_structure_TEST_BLINDED.csv` (184 ligands).
Predictions	Base trio: RoseTTAFold3, AlphaFold3, Boltz2 — same protein + ligand inputs, ~150 seeds per model accumulated across rounds. v10 adds two architecturally-independent co-folding models, Protenix and ESMFold2, run on all 184 ligands.
Iteration	Each round adds many fresh seeds × 3 models for every ligand that does not yet have a 3-way agreement. For some ligands it took more than 200 seeds before a pose appeared on which all 3 models agreed, but the success distribution had started to converge by about 200 seeds, so we usually stopped the pipeline there.
Agreement check	For every cross-model pair of structures of the same ligand: Kabsch-align on protein Cα, then compute symmetry-aware heavy-atom ligand RMSD (RDKit graph-automorphism enumeration). A 3-way agreement is a (RF3, AF3, Boltz2) triplet where all three pairwise RMSDs are ≤ 1.0 Å.
Selection (medoid)	For 3-way-agreed ligands we pick the medoid: the structure minimising the mean cross-model RMSD to the whole agreed model set (i.e. the most consensus-central pose), rather than the single tightest triplet used by the v3.5 baseline. In v7 the medoid is restricted to AF3 structures, because most benchmarks place AF3 as slightly more accurate than RF3 and Boltz2; from v8 onward every model is treated equally. v10 extends the medoid set to five models (adding Protenix and ESMFold2 when they join the base agreement). For fallback ligands we keep the lowest-RMSD cross-model pair, preferring the AF3-side member (then RF3, then Boltz2).
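The agreement check can be sketched as follows. This is a minimal illustration, not the code in scripts/: it assumes the ligand coordinates are already in a common frame (the Kabsch superposition on protein Cα has been done upstream), and that the ligand's graph automorphisms are supplied as index permutations. With RDKit these could plausibly be obtained from `mol.GetSubstructMatches(mol, uniquify=False)`, but that wiring is an assumption here. The names `symmetric_rmsd` and `best_triplet` are hypothetical.

```python
import math
from itertools import product

def symmetric_rmsd(ref, mob, automorphisms):
    """Minimum heavy-atom RMSD over the supplied ligand automorphisms.

    ref, mob: lists of (x, y, z) tuples in the same atom order, already
        expressed in a common frame.
    automorphisms: iterable of permutations; perm[i] is the atom of `mob`
        matched to atom i of `ref`. Should include the identity.
    """
    best = math.inf
    for perm in automorphisms:
        sq = sum(math.dist(ref[i], mob[j]) ** 2 for i, j in enumerate(perm))
        best = min(best, math.sqrt(sq / len(ref)))
    return best

def best_triplet(poses, automorphisms, cutoff=1.0):
    """Tightest (RF3, AF3, Boltz2) triplet whose three pairwise RMSDs are
    all <= cutoff. Returns (max_pairwise_rmsd, (i, j, k)) or None."""
    best = None
    ranges = [range(len(poses[m])) for m in ("RF3", "AF3", "Boltz2")]
    for i, j, k in product(*ranges):
        a, b, c = poses["RF3"][i], poses["AF3"][j], poses["Boltz2"][k]
        worst = max(
            symmetric_rmsd(a, b, automorphisms),
            symmetric_rmsd(a, c, automorphisms),
            symmetric_rmsd(b, c, automorphisms),
        )
        if worst <= cutoff and (best is None or worst < best[0]):
            best = (worst, (i, j, k))
    return best

# Toy 3-atom ligand (hypothetical): atoms 1 and 2 are symmetry-equivalent.
autos = [(0, 1, 2), (0, 2, 1)]
rf3 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
af3 = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # atoms 1 and 2 swapped
boltz2 = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5), (-1.0, 0.0, 0.5)]  # shifted 0.5 Å in z
triplet = best_triplet({"RF3": [rf3], "AF3": [af3], "Boltz2": [boltz2]}, autos)
```

Without the automorphism enumeration, the RF3 and AF3 toy poses would look about 1.6 Å apart even though they are the same pose with two equivalent atoms relabelled, which is why the symmetry-aware RMSD matters at a 1.0 Å threshold.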

Tiering of the 184 submitted structures (1.0 Å threshold)

Tier	Definition	n	%
3-model agreement	RF3 + AF3 + Boltz2 within 1.0 Å on a common triplet	105	57.1
2-model agreement	No 3-way triplet, best cross-model pair ≤ 1.0 Å	75	40.8
Fallback (no agreement)	Best cross-model pair > 1.0 Å — lowest-RMSD pair taken anyway	4	2.2
Total		184	100

179 / 184 (97.3 %) of submitted structures have at least two of the three models agreeing within 1 Å. Of the 79 ligands that did not reach 3-way agreement, 59 have a best cross-model pair ≤ 0.5 Å — these are tight pairs that only just missed the third model. The four true-fallback structures have best pairs in the range 1.0 – 2.17 Å.
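The tier definitions in the table reduce to two comparisons per ligand. A minimal sketch, assuming each ligand has already been summarised by the largest pairwise RMSD of its tightest triplet and by its best cross-model pair RMSD (`assign_tier` and `tier_shares` are hypothetical names):

```python
from collections import Counter

CUTOFF = 1.0  # Å, the agreement threshold used in this report

def assign_tier(triplet_max_rmsd, best_pair_rmsd, cutoff=CUTOFF):
    """Classify one ligand.

    triplet_max_rmsd: largest pairwise RMSD of the tightest
        (RF3, AF3, Boltz2) triplet, or None if no triplet exists.
    best_pair_rmsd: lowest RMSD over all cross-model pairs.
    """
    if triplet_max_rmsd is not None and triplet_max_rmsd <= cutoff:
        return "3-model agreement"
    if best_pair_rmsd <= cutoff:
        return "2-model agreement"
    return "Fallback (no agreement)"

def tier_shares(tiers):
    """Map each tier to (count, percent rounded to one decimal)."""
    counts = Counter(tiers)
    total = sum(counts.values())
    return {t: (n, round(100 * n / total, 1)) for t, n in counts.items()}

# Toy input (hypothetical values, not challenge data).
toy = {"lig_a": (0.6, 0.3), "lig_b": (1.8, 0.4), "lig_c": (None, 2.17)}
tiers = {name: assign_tier(*vals) for name, vals in toy.items()}
```

Feeding `tier_shares` the counts from the table (105, 75, 4) reproduces the percentage column: 57.1, 40.8 and 2.2.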

The agreement tiers above are a property of the base-trio consensus — identical for every selection variant (v3.5, v7, v8, v10). The variants differ only in which structure is taken within each tier (and, from v8, from which model; v10 additionally lets a Protenix or ESMFold2 pose be the representative when it joins the base agreement — on 133 / 184 ligands one does).

Chosen-model distribution per variant:

Variant	AF3	RF3	Boltz2	Protenix	ESMFold2	Note
v10.1 (submitted)	80	51	27	8	18	v10 + OST failure-proofing: 4 base picks swapped to the nearest bond-clean pose (3× RF3→AF3, 1× AF3→AF3)
v10	77	54	27	8	18	uniform 5-model-agnostic medoid across all 184; adds the two independent models
v8 (prior best)	75	64	45	–	–	model-agnostic medoid — the most-central structure of any base-trio model wins
v7	159	25	0	–	–	AF3-only medoid; same mix as v3.5; beat v3.5 on the leaderboard

Why set the threshold at 1 Å? The ligands in this dataset are small and rigid, so we judged the common 2 Å threshold too permissive; inspected in PyMOL, a 2 Å RMSD pose also looks much worse than a 1 Å one. Not every ligand has a pose on which all 3 models agree at this threshold. This suggests that the models have learned different distributions over protein–ligand complexes, which is why agreement among all 3 may be a good signal that the pose is correct.

v3 → v3.5 → v7 → v8 → v10 → v10.1

v3.5 (two-ligand Rosetta rescue). Two structures — x02859-1 and x03063-1 — failed initial OST scoring on v3, despite passing every structural sanity check (correct molecular graphs perceived from coordinates, no overlapping atoms, no ligand–protein clash). We re-prepared both with Rosetta cartesian cst-FastRelax (coordinate_constraint = 0.1) using MMFF94-charged Generic-FF ligand params and substituted the two relaxed PDBs into v3. The ligand and pocket moved ~0.4 Å on average (steric relief, no significant pose change).

v4 / v5 / v6 (ipTM-based selection). Like v3, but the representative was chosen by ipTM, with a preference for AF3 structures. These ended up with worse scores than v3.5.

v7 (medoid selection). Within each 3-way-agreed ligand, v7 replaces v3.5's single-tightest-triplet AF3 with the medoid AF3 (lowest mean cross-model RMSD to the agreed RF3+Boltz2 set). This is RMSD-driven and more consensus-central, and it improved on v3.5 on the leaderboard. v7 differs from v3.5 on 62 / 184 ligands (all AF3→AF3); the 2-model / fallback picks and the two Rosetta-relaxed structures carry over unchanged.
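The medoid rule can be illustrated with a small sketch. It assumes the pairwise symmetry-aware RMSDs between agreed poses are already available in a lookup table, and scores each candidate by its mean RMSD to the agreed poses of the other models. Restricting the pool to AF3 corresponds to the v7 behaviour described above; leaving it unrestricted corresponds to the model-agnostic variant. The function name and data layout are hypothetical.

```python
from statistics import mean

def pick_medoid(candidates, rmsd, allowed_models=None):
    """Return the most consensus-central pose.

    candidates: list of (model, pose_id) tuples in the agreed set.
    rmsd: dict mapping frozenset({pose_a, pose_b}) -> ligand RMSD in Å
        for every cross-model pair.
    allowed_models: optional set of model names the winner may come from.
    """
    def mean_cross_model_rmsd(c):
        others = [o for o in candidates if o[0] != c[0]]
        return mean(rmsd[frozenset((c, o))] for o in others)

    pool = [c for c in candidates
            if allowed_models is None or c[0] in allowed_models]
    return min(pool, key=mean_cross_model_rmsd)

# Toy agreed set (hypothetical RMSDs in Å).
a1, a2 = ("AF3", "a1"), ("AF3", "a2")
r1, b1 = ("RF3", "r1"), ("Boltz2", "b1")
rmsd = {
    frozenset((a1, r1)): 0.4, frozenset((a1, b1)): 0.6,
    frozenset((a2, r1)): 0.3, frozenset((a2, b1)): 0.3,
    frozenset((r1, b1)): 0.1,
}
agreed = [a1, a2, r1, b1]

af3_only = pick_medoid(agreed, rmsd, allowed_models={"AF3"})
agnostic = pick_medoid(agreed, rmsd)
```

In this toy set the AF3-only medoid is `a2` (mean 0.3 Å versus 0.5 Å for `a1`), while the unrestricted medoid is the RF3 pose `r1`, which shows how dropping the AF3 preference can change the chosen model without changing the consensus cluster.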

v8 (medoid + model-agnostic) — the prior leaderboard best. Like v7, but with no AF3-first preference: the medoid is taken over all models (and extended to the 2-model tier), so the most consensus-central structure of any model — AF3, RF3, or Boltz2 — wins. This swings the chosen-model mix to AF3 75 / RF3 64 / Boltz2 45 and differs from v7 on 127 / 184 ligands. v8's submitted file was v8_rescued.zip: the model-agnostic picks, except for three ligands whose selected pose OST could not score — x02872-1 and x03325-1 fall back to their v7 poses, and x03004-1 is Rosetta cst-relaxed (each flagged via source_submission in the manifest). OST mis-perceives a bond on tight strained-ring contacts in those raw poses, which breaks ligand-graph matching; the fallback poses score cleanly. (v10.1 below generalises exactly this rescue.)

v10 (uniform 5-model-agnostic medoid) — adds two independent models. The base trio (RF3, AF3, Boltz2) are correlated — all AF3-style, MSA-based co-folding diffusion models — so their mutual agreement partly reflects shared bias. v10 adds two architecturally-independent co-folding models, Protenix and ESMFold2, run on all 184 ligands, and takes the medoid over all five. Per ligand it anchors on the v8 pick, forms one representative per agreeing base model, then admits a Protenix and/or ESMFold2 pose only if it lands within 1.0 Å (symmetry-aware ligand RMSD) of every base representative; the 5-model medoid is the member with the lowest mean cross-model RMSD. If neither independent model joins, v10 keeps the v8 pick verbatim, so the diff is bounded. The independent models join on 133 / 184 ligands; v10 differs from v8 on 67 / 184, but every change is sub-Ångström (max 0.81 Å; 0 moves > 2 Å) — the medoid just selects a more consensus-central representative of the same tightly-agreeing cluster. Submitted-model mix: AF3 77 / RF3 54 / Boltz2 27 / ESMFold2 18 / Protenix 8.
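The admission rule and the bounded fallback to the v8 pick can be sketched as below. This is an illustration under stated assumptions: one representative per agreeing base model has already been formed, the pairwise RMSDs are precomputed, and the names (`admit_independent`, `five_model_pick`) are hypothetical rather than taken from the v10 builder.

```python
from statistics import mean

def admit_independent(base_reps, extra, rmsd, cutoff=1.0):
    """Keep only the independent-model poses that lie within `cutoff` Å
    of every base representative."""
    return [e for e in extra
            if all(rmsd[frozenset((e, b))] <= cutoff for b in base_reps)]

def five_model_pick(v8_pick, base_reps, extra, rmsd, cutoff=1.0):
    """Return the medoid over base representatives plus admitted poses,
    or the v8 pick unchanged when no independent model joins."""
    joined = admit_independent(base_reps, extra, rmsd, cutoff)
    if not joined:
        return v8_pick
    members = list(base_reps) + joined

    def mean_rmsd(m):
        return mean(rmsd[frozenset((m, o))] for o in members if o != m)

    return min(members, key=mean_rmsd)

# Toy example (hypothetical RMSDs in Å); poses are labelled by model name.
rmsd = {
    frozenset(("AF3", "RF3")): 0.6,
    frozenset(("Protenix", "AF3")): 0.3,
    frozenset(("Protenix", "RF3")): 0.4,
    frozenset(("ESMFold2", "AF3")): 1.4,  # too far from AF3: not admitted
    frozenset(("ESMFold2", "RF3")): 0.5,
}
pick = five_model_pick("AF3", ["AF3", "RF3"], ["Protenix", "ESMFold2"], rmsd)
unchanged = five_model_pick("AF3", ["AF3", "RF3"], ["ESMFold2"], rmsd)
```

Requiring the new pose to be within the cutoff of every base representative, rather than of any one of them, is what keeps the selected pose inside the cluster the base models already agreed on.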

v10.1 (OST failure-proofing) — the submission. The official OST scorer returns NaN (a 20 Å BiSyRMSD / 0 LDDT-PLI penalty) whenever its geometry-based bond perception reads an extra ligand bond — the strained-ring false-bond mode. A deposited crystal has clean geometry, so OST perceives its true graph; a pose OST reads as over-bonded can therefore never graph-match the crystal. This is detectable with no ground truth: count the OST-perceived heavy-atom bonds of the LIG residue and compare to the true count from the ligand SMILES (RDKit). Four v10 picks carry exactly one such false bond — x02872-1, x03004-1, x03063-1, x03325-1 (x03063-1 was even a failure v3.5 had hand-fixed and the automated v10 silently reintroduced). For each, v10.1 ranks every candidate pose by ligand-RMSD to v10's pick and swaps in the nearest pose that passes the bond check (each < 0.9 Å away); the other 180 picks are byte-identical to v10. Because a NaN-penalised pose can only improve or tie when replaced by a scoreable one, v10.1 ≥ v10 on both leaderboard halves by construction. The submitted file is v10.1_failureproofed.zip; the four swaps are flagged via source_submission in the manifest.
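The swap logic can be sketched as follows. The sketch assumes the OST-perceived heavy-atom bond count of each candidate pose has already been computed inside the OpenStructure container, and that the expected count comes from the ligand SMILES (with RDKit, `Chem.MolFromSmiles(smiles).GetNumBonds()` would be one plausible way to get it). The function name and the dictionaries are hypothetical.

```python
def failure_proof(pick, candidates, perceived_bonds, expected_bonds, rmsd_to_pick):
    """Return `pick` if it passes the bond check, otherwise the nearest
    candidate that does.

    perceived_bonds: dict pose_id -> bond count perceived from coordinates.
    expected_bonds: true heavy-atom bond count from the ligand SMILES.
    rmsd_to_pick: dict pose_id -> ligand RMSD (Å) to the original pick.
    """
    if perceived_bonds[pick] == expected_bonds:
        return pick  # already scoreable, leave untouched
    clean = [c for c in candidates if perceived_bonds[c] == expected_bonds]
    if not clean:
        return pick  # nothing better available; keep the original
    return min(clean, key=lambda c: rmsd_to_pick[c])

# Toy ligand (hypothetical numbers): the pick carries one false bond.
expected = 14
perceived = {"pick": 15, "pose_a": 14, "pose_b": 15, "pose_c": 14}
dist = {"pose_a": 0.7, "pose_b": 0.2, "pose_c": 0.4}
swapped = failure_proof("pick", ["pose_a", "pose_b", "pose_c"],
                        perceived, expected, dist)
```

Here `pose_b` is closest to the original pick but also over-bonded, so the sketch selects `pose_c`, the nearest pose that passes the check. A pick that already passes is returned unchanged, which matches the report's statement that the other 180 structures are identical to v10.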

Per-ligand provenance

The manifests/ directory holds the full per-ligand record (184 rows each):

manifests/v10.1_failureproofed_submission_manifest.csv — what was submitted (v10 5-model medoid + OST failure-proofing; source_submission marks the 4 bond-swapped ligands)
manifests/v10_medoid_5way_submission_manifest.csv — the raw v10 5-model picks before the OST failure-proofing
manifests/v8_rescued_submission_manifest.csv — the prior best (v8 model-agnostic medoid; source_submission / relaxed mark the 3 OST-rescued ligands)
manifests/v8_medoid_agnostic_submission_manifest.csv — the raw v8 picks before the OST rescue
manifests/v7_medoid_submission_manifest.csv — the prior v7 (AF3-only medoid) submission
manifests/v3.5_submission_manifest.csv · manifests/v3_submission_manifest.csv — the v3.5 submission and v3 consensus baseline

Each row records, per ligand: the selection (source/tier/pick_method, chosen_model), the medoid score (medoid_mean_rmsd, v7/v8/v10), the chosen model's confidence (iptm, ptm, plddt, ranking_score), and the cross-model agreement RMSDs (best_max_rmsd_triplet, rmsd_rf3_af3, rmsd_rf3_boltz2, rmsd_af3_boltz2). The v10 / v10.1 manifests add the built-ligand bond counts (lig_covalent_bonds vs expected_bonds) that the OST failure-proofing checks. Missing fields use the literal string none. Absolute cluster paths (chosen_path) and the selecting seed/sample (chosen_seed / chosen_sample, plus the fallback pair_seed_* / pair_sample_*) are withheld while the challenge is live — the retained columns (chosen_model, the confidence metrics, and the cross-model RMSDs) document the method without revealing the exact draws. Full provenance + prediction files are available on request.
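A manifest can be read with the standard library alone. The sketch below maps the literal string none to `None` and lists rows whose built-ligand bond count differs from the expected one. The column names `lig_covalent_bonds`, `expected_bonds` and `chosen_model` are taken from the description above; the `ligand_id` column name and the inline sample rows are assumptions made for illustration.

```python
import csv
import io

def load_manifest(handle):
    """Read manifest rows, converting the literal 'none' to None."""
    return [
        {key: (None if value == "none" else value) for key, value in row.items()}
        for row in csv.DictReader(handle)
    ]

def bond_mismatches(rows):
    """Rows where both bond counts are present and disagree."""
    out = []
    for row in rows:
        built, expected = row["lig_covalent_bonds"], row["expected_bonds"]
        if built is not None and expected is not None and int(built) != int(expected):
            out.append(row)
    return out

# Inline sample (hypothetical rows, not challenge data).
sample = io.StringIO(
    "ligand_id,chosen_model,lig_covalent_bonds,expected_bonds\n"
    "lig_a,AF3,14,14\n"
    "lig_b,RF3,15,14\n"
    "lig_c,Boltz2,none,none\n"
)
rows = load_manifest(sample)
flagged = bond_mismatches(rows)
```

For a real file, the same functions would be called on `open(path, newline="")` in place of the `io.StringIO` sample.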

Running the code

The pipeline lives in scripts/: prediction round submission, cross-model agreement scoring, the v7/v8 selection + submission builders, and the v10 / v10.1 (5-model medoid → OST failure-proofing) builders. None of it is hard-coded to a particular machine:

Inputs (model predictions + consensus tables) are read from a single base directory given by the OADMET_PREDICTIONS_DIR environment variable (or the --predictions-dir flag). Outputs go to --out-dir / OADMET_OUT_DIR (default .); the blinded ligand CSV is passed via --ligand-csv / OADMET_LIGAND_CSV.
v10 adds the two independent models via --protenix-dir / --esmfold2-dir (OADMET_PROTENIX_DIR / OADMET_ESMFOLD2_DIR). v10.1 runs the OST bond check inside your OpenStructure container — pass --ost-sif to build_v10_1_failureproof.py.
Running the three models is installation-specific and left as a stub. submit_round.py prepares every per-model input (RF3 .json, AlphaFold3 .json, Boltz2 .yaml + a manifest) and the round/seed bookkeeping, then calls launch_models() — which is a clearly-marked placeholder. Because everyone's GPU setup differs (local, SLURM, cloud, different containers), we do not ship our cluster-specific launcher; add your own RF3/AF3/Boltz2 invocation in launch_models() (the reference hyperparameters are documented in its docstring).

Pipeline order and per-flag details are in scripts/README.md.
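The flag-or-environment-variable convention described above can be reproduced with `argparse` by using the environment value as each flag's default. This is a sketch of the pattern only; the actual scripts may wire it differently, and `build_parser` is a hypothetical name.

```python
import argparse
import os

def build_parser(env=os.environ):
    """Flags override environment variables, which override built-in defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the shared path options")
    parser.add_argument("--predictions-dir",
                        default=env.get("OADMET_PREDICTIONS_DIR"),
                        help="base directory with model predictions and consensus tables")
    parser.add_argument("--out-dir",
                        default=env.get("OADMET_OUT_DIR", "."),
                        help="output directory (default: current directory)")
    parser.add_argument("--ligand-csv",
                        default=env.get("OADMET_LIGAND_CSV"),
                        help="blinded ligand CSV")
    return parser

# Example with an explicit environment mapping (hypothetical paths).
demo_env = {"OADMET_PREDICTIONS_DIR": "/data/preds"}
args = build_parser(demo_env).parse_args(["--out-dir", "results"])
```

Passing the environment as a parameter keeps the precedence easy to test: an explicit flag wins, then the environment variable, then the default.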

Tools

RF3 — Institute for Protein Design (UW)
AlphaFold3 — DeepMind
Boltz2 — MIT
Protenix — ByteDance (v10: 4th, independent model)
ESMFold2 — Meta AI (v10: 5th, independent model)
biotite, atomworks — for loading and processing structures
RDKit, OpenBabel, OpenStructure — open source (OpenStructure = the OST scorer used for the v10.1 bond check)
Rosetta — cartesian FastRelax for the v3 → v3.5 rescue

AI use acknowledgement

This project made extensive use of Claude Code (Anthropic) as an autonomous coding-and-analysis agent — it wrote and refactored the pipeline scripts, analysed the cross-model agreement and selection results, and managed the multi-round prediction → agreement → selection pipeline largely autonomously (including root-causing the OST scoring failures and the medoid selection study).
