Submission: v10.1_failureproofed.zip — 184 PDB files, ligand residue LIG. A
uniform 5-model consensus-central ("medoid") selection (RF3, AlphaFold3, Boltz2,
Protenix, ESMFold2), then failure-proofed against the OST scorer's ligand-bond NaN mode.
Baseline: openadmet_structure_v3.zip — the unmodified 3-model consensus output (first one we submitted to the leaderboard).
For each of the 184 blinded fragments we predict the PXR–ligand complex with independent co-folding models across many seeds — the base trio RF3, AlphaFold3, Boltz2, plus two architecturally-independent models added in v10, Protenix and ESMFold2 — then pick the structure most central to the set of models that agree (the consensus medoid). When the base trio doesn't agree we fall back to the best cross-model pair. 179 / 184 (97.3 %) of submitted structures have ≥ 2 base-trio models agreeing within 1 Å, and v10's medoid additionally incorporates the two independent models on the 133 / 184 ligands where at least one of them joins the base agreement.
| Step | What |
|---|---|
| Inputs | PXR-LBD protein sequence (293 aa) + ligand SMILES from pxr-challenge_structure_TEST_BLINDED.csv (184 ligands). |
| Predictions | Base trio: RoseTTAFold3, AlphaFold3, Boltz2 — same protein + ligand inputs, ~150 seeds per model accumulated across rounds. v10 adds two architecturally-independent co-folding models, Protenix and ESMFold2, run on all 184 ligands. |
| Iteration | Each round adds many fresh seeds × 3 models for any ligand that does not yet have a 3-way agreement. For some ligands it took more than 200 seeds before we found a pose on which all 3 models agreed, but the success distribution had started to converge by 200 seeds, so we usually stopped the pipeline there. |
| Agreement check | For every cross-model pair of structures of the same ligand: Kabsch-align on protein Cα, then compute symmetry-aware heavy-atom ligand RMSD (RDKit graph-automorphism enumeration). A 3-way agreement is a (RF3, AF3, Boltz2) triplet where all three pairwise RMSDs are ≤ 1.0 Å. |
| Selection (medoid) | For 3-way-agreed ligands we pick the medoid: the structure minimising the mean cross-model RMSD to the whole agreed model set (i.e. the most consensus-central pose), rather than the single tightest triplet used by the v3.5 baseline. In v7 the representative is the AF3 structure, because most benchmarks place AF3 as slightly more accurate than RF3 and Boltz2; from v8 onward every model is treated the same. v10 extends the medoid set to five models (adding Protenix + ESMFold2 when they join the base agreement). For fallback ligands we keep the lowest-RMSD cross-model pair, preferring the AF3-side member (then RF3, then Boltz2). |
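The agreement check and medoid pick in the table can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: it assumes the symmetry-aware ligand RMSDs have already been computed after Cα alignment and are stored in a dictionary, and the names `pair`, `is_three_way`, and `medoid`, the `(model, seed)` pose tuples, and the RMSD values are all hypothetical.

```python
from itertools import combinations

CUTOFF = 1.0  # Å, the agreement threshold quoted in the table


def pair(a, b):
    """Order-independent lookup key for a pair of poses."""
    return frozenset((a, b))


def is_three_way(triplet, rmsd, cutoff=CUTOFF):
    """True when all three pairwise ligand RMSDs are within the cutoff."""
    return all(rmsd[pair(a, b)] <= cutoff for a, b in combinations(triplet, 2))


def medoid(poses, rmsd):
    """Pose with the lowest mean RMSD to the poses of the other models."""
    def mean_cross_model(p):
        others = [rmsd[pair(p, q)] for q in poses if q[0] != p[0]]
        return sum(others) / len(others)
    return min(poses, key=mean_cross_model)


# Made-up RMSDs (Å) for one ligand; poses are (model, seed) tuples.
af3, rf3, boltz2 = ("AF3", 11), ("RF3", 4), ("Boltz2", 27)
rmsd = {
    pair(af3, rf3): 0.4,
    pair(af3, boltz2): 0.6,
    pair(rf3, boltz2): 0.9,
}

agreed = is_three_way((af3, rf3, boltz2), rmsd)
best = medoid([af3, rf3, boltz2], rmsd)
print(agreed, best)
```

In this toy case the AF3 pose has the lowest mean cross-model RMSD (0.5 Å), so it is the medoid even though no model is given priority.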
| Tier | Definition | n | % |
|---|---|---|---|
| 3-model agreement | RF3 + AF3 + Boltz2 within 1.0 Å on a common triplet | 105 | 57.1 |
| 2-model agreement | No 3-way triplet, best cross-model pair ≤ 1.0 Å | 75 | 40.8 |
| Fallback (no agreement) | Best cross-model pair > 1.0 Å — lowest-RMSD pair taken anyway | 4 | 2.2 |
| Total | | 184 | 100 |
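The tier definitions above reduce to two threshold tests per ligand. The sketch below assumes each ligand has already been summarised by the max pairwise RMSD of its best triplet and the RMSD of its best cross-model pair; the function name `classify_tier` and the example values are hypothetical.

```python
from collections import Counter

CUTOFF = 1.0  # Å


def classify_tier(best_triplet_max_rmsd, best_pair_rmsd, cutoff=CUTOFF):
    """Assign the agreement tier from the table's definitions.

    best_triplet_max_rmsd: largest pairwise RMSD within the best
        (RF3, AF3, Boltz2) triplet, or None if no triplet was evaluated.
    best_pair_rmsd: RMSD of the best cross-model pair.
    """
    if best_triplet_max_rmsd is not None and best_triplet_max_rmsd <= cutoff:
        return "3-model agreement"
    if best_pair_rmsd <= cutoff:
        return "2-model agreement"
    return "Fallback (no agreement)"


# Made-up summaries: (best triplet max RMSD, best pair RMSD) in Å.
examples = [(0.7, 0.3), (1.8, 0.45), (None, 0.9), (2.6, 1.4)]
tiers = [classify_tier(t, p) for t, p in examples]
counts = Counter(tiers)
print(counts)
```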
179 / 184 (97.3 %) of submitted structures have at least two of the three models agreeing within 1 Å. Of the 79 ligands that did not reach 3-way agreement, 59 have a best cross-model pair ≤ 0.5 Å — these are tight pairs that only just missed the third model. The four true-fallback structures have best pairs in the range 1.0 – 2.17 Å.
The agreement tiers above are a property of the base-trio consensus — identical for every selection variant (v3.5, v7, v8, v10). The variants differ only in which structure is taken within each tier (and, from v8, from which model; v10 additionally lets a Protenix or ESMFold2 pose be the representative when it joins the base agreement — on 133 / 184 ligands one does).
Chosen-model distribution per variant:
| Variant | AF3 | RF3 | Boltz2 | Protenix | ESMFold2 | Note |
|---|---|---|---|---|---|---|
| v10.1 (submitted) | 80 | 51 | 27 | 8 | 18 | v10 + OST failure-proofing: 4 base picks swapped to the nearest bond-clean pose (3× RF3→AF3, 1× AF3→AF3) |
| v10 | 77 | 54 | 27 | 8 | 18 | uniform 5-model-agnostic medoid across all 184; adds the two independent models |
| v8 (prior best) | 75 | 64 | 45 | – | – | model-agnostic medoid — the most-central structure of any base-trio model wins |
| v7 | 159 | 25 | 0 | – | – | AF3-only medoid; same mix as v3.5; beat v3.5 on the leaderboard |
Why set the threshold at 1 Å? The ligands in this dataset are small and rigid, so we considered the common 2 Å threshold too permissive; viewed in PyMOL, a 2 Å RMSD pose also looks much worse than a 1 Å one. Not every ligand has a pose on which all 3 models agree at this threshold. This suggests that the models have learned different distributions over protein-ligand complexes, which is why agreement among all 3 may be a good signal that the pose is correct.
v3.5 (two-ligand Rosetta rescue). Two structures — x02859-1 and x03063-1 —
failed initial OST scoring on v3, despite passing every structural sanity check
(correct molecular graphs perceived from coordinates, no overlapping atoms, no
ligand–protein clash). We re-prepared both with Rosetta cartesian cst-FastRelax
(coordinate_constraint = 0.1) using MMFF94-charged Generic-FF ligand params and substituted
the two relaxed PDBs into v3. The ligand and pocket moved ~0.4 Å on average (steric relief,
no significant pose change).
v4/5/6 (ipTM-based selection). Like v3, but structures were chosen by ipTM with a preference for AF3 structures. These ended up with worse scores than v3.5.
v7 (medoid selection). Within each 3-way-agreed ligand, v7 replaces v3.5's single-tightest-triplet AF3 with the medoid AF3 (lowest mean cross-model RMSD to the agreed RF3+Boltz2 set). This is RMSD-driven and more consensus-central, and it improved on v3.5 on the leaderboard. v7 differs from v3.5 on 62 / 184 ligands (all AF3→AF3); the 2-model / fallback picks and the two Rosetta-relaxed structures carry over unchanged.
v8 (medoid + model-agnostic) — the prior leaderboard best. Like v7, but with no AF3-first
preference: the medoid is taken over all models (and extended to the 2-model tier), so the
most consensus-central structure of any model — AF3, RF3, or Boltz2 — wins. This swings the
chosen-model mix to AF3 75 / RF3 64 / Boltz2 45 and differs from v7 on 127 / 184 ligands.
v8's submitted file was v8_rescued.zip: the model-agnostic picks, except for three
ligands whose selected pose OST could not score — x02872-1 and x03325-1 fall back to
their v7 poses, and x03004-1 is Rosetta cst-relaxed (each flagged via source_submission
in the manifest). OST mis-perceives a bond on tight strained-ring contacts in those raw poses,
which breaks ligand-graph matching; the fallback poses score cleanly. (v10.1 below generalises
exactly this rescue.)
v10 (uniform 5-model-agnostic medoid) — adds two independent models. The base trio (RF3, AF3, Boltz2) are correlated — all AF3-style, MSA-based co-folding diffusion models — so their mutual agreement partly reflects shared bias. v10 adds two architecturally-independent co-folding models, Protenix and ESMFold2, run on all 184 ligands, and takes the medoid over all five. Per ligand it anchors on the v8 pick, forms one representative per agreeing base model, then admits a Protenix and/or ESMFold2 pose only if it lands within 1.0 Å (symmetry-aware ligand RMSD) of every base representative; the 5-model medoid is the member with the lowest mean cross-model RMSD. If neither independent model joins, v10 keeps the v8 pick verbatim, so the diff is bounded. The independent models join on 133 / 184 ligands; v10 differs from v8 on 67 / 184, but every change is sub-Ångström (max 0.81 Å; 0 moves > 2 Å) — the medoid just selects a more consensus-central representative of the same tightly-agreeing cluster. Submitted-model mix: AF3 77 / RF3 54 / Boltz2 27 / ESMFold2 18 / Protenix 8.
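The admission rule and bounded fallback described above can be sketched as follows. This is an illustration under stated assumptions: one representative per model, RMSDs precomputed and keyed by model pair, and a hypothetical function name `v10_pick`; the RMSD values are made up.

```python
CUTOFF = 1.0  # Å


def v10_pick(v8_pick, base_reps, independents, rmsd, cutoff=CUTOFF):
    """Return the model whose representative is the 5-model medoid.

    base_reps: models with an agreeing base-trio representative.
    independents: independent models with a candidate pose.
    rmsd: dict keyed by frozenset({model_a, model_b}) -> ligand RMSD in Å.
    """
    # An independent pose joins only if it is within the cutoff of
    # EVERY base representative.
    joined = [
        m for m in independents
        if all(rmsd[frozenset((m, b))] <= cutoff for b in base_reps)
    ]
    if not joined:
        return v8_pick  # nothing joined: keep the v8 pick unchanged

    members = list(base_reps) + joined

    def mean_rmsd(m):
        others = [rmsd[frozenset((m, o))] for o in members if o != m]
        return sum(others) / len(others)

    return min(members, key=mean_rmsd)


# Made-up RMSDs (Å): Protenix joins, ESMFold2 is too far from RF3.
rmsd = {
    frozenset(("AF3", "RF3")): 0.5,
    frozenset(("AF3", "Boltz2")): 0.6,
    frozenset(("RF3", "Boltz2")): 0.8,
    frozenset(("Protenix", "AF3")): 0.3,
    frozenset(("Protenix", "RF3")): 0.4,
    frozenset(("Protenix", "Boltz2")): 0.5,
    frozenset(("ESMFold2", "AF3")): 0.7,
    frozenset(("ESMFold2", "RF3")): 1.4,
    frozenset(("ESMFold2", "Boltz2")): 0.9,
    frozenset(("Protenix", "ESMFold2")): 0.6,
}
base = ["AF3", "RF3", "Boltz2"]
pick = v10_pick("AF3", base, ["Protenix", "ESMFold2"], rmsd)
unchanged = v10_pick("AF3", base, ["ESMFold2"], rmsd)
print(pick, unchanged)
```

Here the Protenix pose becomes the medoid (mean 0.4 Å to the three base representatives), while the second call returns the v8 pick because the only independent candidate fails the admission test.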
v10.1 (OST failure-proofing) — the submission. The official OST scorer returns NaN (a 20 Å
BiSyRMSD / 0 LDDT-PLI penalty) whenever its geometry-based bond perception reads an extra
ligand bond — the strained-ring false-bond mode. A deposited crystal has clean geometry, so OST
perceives its true graph; a pose OST reads as over-bonded can therefore never graph-match the
crystal. This is detectable with no ground truth: count the OST-perceived heavy-atom bonds of
the LIG residue and compare to the true count from the ligand SMILES (RDKit). Four v10 picks
carry exactly one such false bond — x02872-1, x03004-1, x03063-1, x03325-1 (x03063-1 was
even a failure v3.5 had hand-fixed and the automated v10 silently reintroduced). For each, v10.1
ranks every candidate pose by ligand-RMSD to v10's pick and swaps in the nearest pose that passes
the bond check (each < 0.9 Å away); the other 180 picks are byte-identical to v10. Because a
NaN-penalised pose can only improve or tie when replaced by a scoreable one, v10.1 ≥ v10 on both
leaderboard halves by construction. The submitted file is v10.1_failureproofed.zip; the four
swaps are flagged via source_submission in the manifest.
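The swap logic above can be sketched without any structural toolkit once the two bond counts are known. The sketch assumes each candidate pose already carries its OST-perceived heavy-atom bond count and its ligand RMSD to the v10 pick; the dictionary keys, the function name `failure_proof`, and the example numbers are hypothetical. Treating "passes the bond check" as an exact count match, and keeping the original pick when no clean candidate exists, are assumptions of this sketch rather than documented behaviour.

```python
def failure_proof(pick, candidates, expected_bonds):
    """Return the pick itself, or the nearest bond-clean candidate.

    pick / candidates: dicts with 'id', 'perceived_bonds', 'rmsd_to_pick'.
    expected_bonds: heavy-atom bond count from the ligand SMILES
        (e.g. via RDKit's Chem.MolFromSmiles(smiles).GetNumBonds()).
    """
    if pick["perceived_bonds"] == expected_bonds:
        return pick  # already scoreable, leave untouched

    clean = [c for c in candidates if c["perceived_bonds"] == expected_bonds]
    if not clean:
        return pick  # assumption: nothing better available, keep the pick

    # Nearest clean pose by ligand RMSD to the original pick.
    return min(clean, key=lambda c: c["rmsd_to_pick"])


# Made-up example: the pick carries one extra perceived bond.
pick = {"id": "pose_a", "perceived_bonds": 17, "rmsd_to_pick": 0.0}
candidates = [
    {"id": "pose_b", "perceived_bonds": 17, "rmsd_to_pick": 0.2},
    {"id": "pose_c", "perceived_bonds": 16, "rmsd_to_pick": 0.6},
    {"id": "pose_d", "perceived_bonds": 16, "rmsd_to_pick": 0.8},
]
chosen = failure_proof(pick, candidates, expected_bonds=16)
print(chosen["id"])
```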
The manifests/ directory holds the full per-ligand record (184 rows each):
- manifests/v10.1_failureproofed_submission_manifest.csv: what was submitted (v10 5-model medoid + OST failure-proofing; source_submission marks the 4 bond-swapped ligands)
- manifests/v10_medoid_5way_submission_manifest.csv: the raw v10 5-model picks before the OST failure-proofing
- manifests/v8_rescued_submission_manifest.csv: the prior best (v8 model-agnostic medoid; source_submission / relaxed mark the 3 OST-rescued ligands)
- manifests/v8_medoid_agnostic_submission_manifest.csv: the raw v8 picks before the OST rescue
- manifests/v7_medoid_submission_manifest.csv: the prior v7 (AF3-only medoid) submission
- manifests/v3.5_submission_manifest.csv · manifests/v3_submission_manifest.csv: the v3.5 submission and v3 consensus baseline
Each row records, per ligand: the selection (source/tier/pick_method,
chosen_model), the medoid score
(medoid_mean_rmsd, v7/v8/v10), the chosen model's confidence (iptm, ptm,
plddt, ranking_score), and the cross-model agreement RMSDs
(best_max_rmsd_triplet, rmsd_rf3_af3, rmsd_rf3_boltz2, rmsd_af3_boltz2).
The v10 / v10.1 manifests add the built-ligand bond counts (lig_covalent_bonds
vs expected_bonds) that the OST failure-proofing checks.
Missing fields use the literal string none. Absolute cluster paths
(chosen_path) and the selecting seed/sample (chosen_seed / chosen_sample,
plus the fallback pair_seed_* / pair_sample_*) are withheld while the
challenge is live — the retained columns (chosen_model, the confidence
metrics, and the cross-model RMSDs) document the method without revealing the
exact draws. Full provenance + prediction files are available on request.
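A manifest with this layout can be read with the standard library alone. The sketch below is illustrative: the column names chosen_model, iptm, lig_covalent_bonds, and expected_bonds come from the description above, while the `ligand_id` column name, the helper `load_manifest`, and every value in the inline sample are made up for the example.

```python
import csv
import io
from collections import Counter


def load_manifest(handle):
    """Read manifest rows, mapping the literal string 'none' to None."""
    return [
        {k: (None if v == "none" else v) for k, v in row.items()}
        for row in csv.DictReader(handle)
    ]


# Inline stand-in for a manifest file; values are invented.
sample = io.StringIO(
    "ligand_id,chosen_model,iptm,lig_covalent_bonds,expected_bonds\n"
    "ligA,AF3,0.91,14,14\n"
    "ligB,RF3,none,17,16\n"
    "ligC,AF3,0.88,12,12\n"
)
rows = load_manifest(sample)

# Chosen-model mix, as in the per-variant distribution table.
mix = Counter(r["chosen_model"] for r in rows)

# Ligands whose built pose has more bonds than the SMILES implies.
over_bonded = [
    r["ligand_id"]
    for r in rows
    if r["lig_covalent_bonds"] is not None
    and r["expected_bonds"] is not None
    and int(r["lig_covalent_bonds"]) > int(r["expected_bonds"])
]
print(mix, over_bonded)
```

To use it on a real file, replace the `io.StringIO` sample with `open(path, newline="")`.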
The pipeline lives in scripts/: prediction round submission,
cross-model agreement scoring, the v7/v8 selection + submission builders, and the
v10 / v10.1 (5-model medoid → OST failure-proofing) builders. None of it is
hard-coded to a particular machine:
- Inputs (model predictions + consensus tables) are read from a single base directory given by the OADMET_PREDICTIONS_DIR environment variable (or the --predictions-dir flag). Outputs go to --out-dir / OADMET_OUT_DIR (default .); the blinded ligand CSV is passed via --ligand-csv / OADMET_LIGAND_CSV.
- v10 adds the two independent models via --protenix-dir / --esmfold2-dir (OADMET_PROTENIX_DIR / OADMET_ESMFOLD2_DIR). v10.1 runs the OST bond check inside your OpenStructure container: pass --ost-sif to build_v10_1_failureproof.py.
- Running the three models is installation-specific and left as a stub. submit_round.py prepares every per-model input (RF3 .json, AlphaFold3 .json, Boltz2 .yaml + a manifest) and the round/seed bookkeeping, then calls launch_models(), which is a clearly-marked placeholder. Because everyone's GPU setup differs (local, SLURM, cloud, different containers), we do not ship our cluster-specific launcher; add your own RF3/AF3/Boltz2 invocation in launch_models() (the reference hyperparameters are documented in its docstring).
Pipeline order and per-flag details are in scripts/README.md.
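The flag / environment-variable pairing described above is a common pattern that can be sketched with argparse. This is not the scripts' actual parser: the function name `build_parser` is hypothetical, and the precedence shown (an explicit flag overrides the environment variable, which overrides the default) is an assumption, since the text only states that either may be used.

```python
import argparse
import os


def build_parser(env=None):
    """Parser whose defaults come from OADMET_* environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="illustrative config sketch")
    parser.add_argument(
        "--predictions-dir",
        default=env.get("OADMET_PREDICTIONS_DIR"),
        help="base directory with model predictions and consensus tables",
    )
    parser.add_argument(
        "--out-dir",
        default=env.get("OADMET_OUT_DIR", "."),
        help="output directory (falls back to the current directory)",
    )
    parser.add_argument(
        "--ligand-csv",
        default=env.get("OADMET_LIGAND_CSV"),
        help="blinded ligand CSV",
    )
    return parser


# The environment supplies the predictions dir; the flag sets the out dir.
fake_env = {"OADMET_PREDICTIONS_DIR": "/data/preds"}
args = build_parser(fake_env).parse_args(["--out-dir", "results"])
defaults = build_parser({}).parse_args([])
print(args.predictions_dir, args.out_dir, defaults.out_dir)
```

Passing the environment as a plain mapping keeps the sketch testable without touching the real process environment.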
- RF3 — Institute for Protein Design (UW)
- AlphaFold3 — DeepMind
- Boltz2 — MIT
- Protenix — ByteDance (v10: 4th, independent model)
- ESMFold2 — Meta AI (v10: 5th, independent model)
- biotite, atomworks — for loading and processing structures
- RDKit, OpenBabel, OpenStructure — open source (OpenStructure = the OST scorer used for the v10.1 bond check)
- Rosetta — cartesian FastRelax for the v3 → v3.5 rescue
This project made extensive use of Claude Code (Anthropic) as an autonomous coding-and-analysis agent — it wrote and refactored the pipeline scripts, analysed the cross-model agreement and selection results, and managed the multi-round prediction → agreement → selection pipeline largely autonomously (including root-causing the OST scoring failures and the medoid selection study).