fix(rf3): fix issues with 5I09 due to chain breaks and add associated tests#190
fix(rf3): fix issues with 5I09 due to chain breaks and add associated tests#190k-chrispens merged 3 commits intomainfrom
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThe PR updates EDM sampler documentation to clarify Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Fixes an RF3 atom ordering/count mismatch observed on 5I09 (chain breaks / terminal atoms) by aligning RF3Wrapper’s model_atom_array with RF3’s own inference pipeline output, and adds regression + integration tests to guard against future misalignment.
Changes:
- Add RF3 regression tests asserting
model_atom_arraylength/order matchesatom_to_token_map, and thatOXT/hydrogens are absent. - Add a no-guidance sampling integration test that checks basic peptide geometry after a short trajectory across wrappers/structures (including 5I09 density).
- Update
RF3Wrapper.featurize()to sourcemodel_atom_arrayfrom RF3 pipeline output and require exact atom-count agreement.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
tests/models/test_rf3_atom_ordering.py |
New RF3-specific regression tests for atom count/order and filtered atoms (OXT/H). |
tests/integration/test_no_guidance_geometry.py |
New integration test running short no-guidance sampling and validating backbone bond continuity. |
tests/conftest.py |
Adds structure_5i09_density session fixture used by new tests. |
src/sampleworks/models/rf3/wrapper.py |
Uses RF3 pipeline’s atom_array as model_atom_array and enforces exact match with atom_to_token_map. |
src/sampleworks/core/samplers/edm.py |
Clarifies that gamma_min=0.2 differs from AF3 defaults in code comments/docstring. |
pixi.lock |
Dependency lockfile updates (package/build revisions). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/conftest.py (1)
485-487: Add a docstring for the new shared fixture.This fixture is part of the shared test surface, so it should carry the repo's standard NumPy-style docstring too.
Proposed docstring
`@pytest.fixture`(scope="session") def structure_5i09_density(resources_dir: Path) -> dict: + """Return the parsed 5I09 density-input structure. + + Returns + ------- + dict + Parsed Atomworks structure for the 5I09 density fixture. + """ return parse(resources_dir / "5I09" / "5I09_single_001_density_input.cif", ccd_mirror_path=None)As per coding guidelines, "Always include NumPy-style docstrings for every function and class."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/conftest.py` around lines 485 - 487, The new pytest fixture structure_5i09_density is missing a NumPy-style docstring; add a concise NumPy-style docstring directly above the def structure_5i09_density(...) that describes the fixture purpose (returns parsed density structure for 5I09), its parameters (resources_dir: Path), the return type (dict), and any important notes (uses parse with ccd_mirror_path=None and the file "5I09_single_001_density_input.cif"); ensure the docstring follows the repo's NumPy conventions and is placed inside the tests/conftest.py next to the fixture definition.tests/integration/test_no_guidance_geometry.py (1)
79-88: Prefer asserting geometry through the public/reconciled output path.
conditioning.model_atom_arrayis wrapper-internal state, so this test will fail on harmless internal reordering or reconciliation changes even when the user-visible sampled structure is still correct. Building theAtomArrayfrom the same reconciled output path the product uses will keep this black-box and more stable.As per coding guidelines, "Write black-box tests that verify behavior, not implementation. Test public interfaces with realistic inputs and avoid mocking."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_no_guidance_geometry.py` around lines 79 - 88, Don't inspect wrapper-internal state via features.conditioning.model_atom_array; instead reconstruct the AtomArray through the same public/reconciled output path the product uses (i.e. run the same postprocessing/reconciliation on state and obtain the reconciled AtomArray), then set/assert its coordinates (instead of directly setting output_aa.coord from state) so the test verifies the public output; update references that currently use features.conditioning.model_atom_array and output_aa.coord to use the reconciled-output factory/function that the wrapper exposes (apply the same transform to state as the runtime does) while keeping wrapper_type and structure_fixture in the assertion message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/core/samplers/edm.py`:
- Line 116: Update the EDMSamplerConfig class docstring to stop claiming
defaults match AF3 and explicitly document that gamma_min is overridden to 0.2
(not AF3's 1.0); locate the EDMSamplerConfig docstring and change its wording to
reflect the non-AF3 default for the gamma_min parameter and mention the
overridden value so the inline note at the gamma_min definition and the class
docstring are consistent.
---
Nitpick comments:
In `@tests/conftest.py`:
- Around line 485-487: The new pytest fixture structure_5i09_density is missing
a NumPy-style docstring; add a concise NumPy-style docstring directly above the
def structure_5i09_density(...) that describes the fixture purpose (returns
parsed density structure for 5I09), its parameters (resources_dir: Path), the
return type (dict), and any important notes (uses parse with
ccd_mirror_path=None and the file "5I09_single_001_density_input.cif"); ensure
the docstring follows the repo's NumPy conventions and is placed inside the
tests/conftest.py next to the fixture definition.
In `@tests/integration/test_no_guidance_geometry.py`:
- Around line 79-88: Don't inspect wrapper-internal state via
features.conditioning.model_atom_array; instead reconstruct the AtomArray
through the same public/reconciled output path the product uses (i.e. run the
same postprocessing/reconciliation on state and obtain the reconciled
AtomArray), then set/assert its coordinates (instead of directly setting
output_aa.coord from state) so the test verifies the public output; update
references that currently use features.conditioning.model_atom_array and
output_aa.coord to use the reconciled-output factory/function that the wrapper
exposes (apply the same transform to state as the runtime does) while keeping
wrapper_type and structure_fixture in the assertion message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a0ac8e6a-bc2c-4e66-9190-fe008eda9a05
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
src/sampleworks/core/samplers/edm.pysrc/sampleworks/models/rf3/wrapper.pytests/conftest.pytests/integration/test_no_guidance_geometry.pytests/models/test_rf3_atom_ordering.pytests/resources/5I09/5I09_single_001_density_input.cif
marcuscollins
left a comment
There was a problem hiding this comment.
Take a look at my comments and see what you think. I think there are improvements we could make still, but this is definitely forward progress. (And I need to add tests for some of my residue numbering operations as well...)
| torch.manual_seed(42) | ||
| state = wrapper.initialize_from_prior(batch_size=1, features=features) | ||
|
|
||
| for i in range(self.NUM_STEPS): |
There was a problem hiding this comment.
Is there some way to run this without explicitly including this loop here? It would be better to test the actual code path we use during inference.
There was a problem hiding this comment.
This is the inference code path, we do a for loop over the schedule (e.g. src/sampleworks/scalers/pure_guidance.py:108)
| # Generous bounds for a 20-step stochastic sample. | ||
| # Ideal backbone bonds are 1.33–1.52 Å | ||
| # We're a little more generous because the short trajectory. | ||
| con_mask = filter_linear_bond_continuity(bb, min_len=1.1, max_len=1.7) |
There was a problem hiding this comment.
Could we be more strict, say by looking at only the peptide bond?
There was a problem hiding this comment.
You can get strict ranges from RDKit, for instance (see the bond length script)
There was a problem hiding this comment.
Yeah, I just figure for speed of testing we might not expect all models to pass the most rigorous geometry checks (but 5I09 was failing this before)
| "structure_6ni6_density", | ||
| ], | ||
| ) | ||
| def test_no_oxt_atoms_in_model_atom_array(self, rf3_wrapper, structure_fixture, request): |
There was a problem hiding this comment.
Shouldn't there be one OXT per chain? Does RF3's pipeline actually remove the final terminal oxygen?
There was a problem hiding this comment.
There should, but it is removed by most models (except protenix)
There was a problem hiding this comment.
As a reference, I presume this is done to standardize all the residues such that a residue token corresponds to a specific set of atoms always. I think protenix does something a bit fancier as their outputs do have OXT, but I don't know the specifics (only that I had to deal with them requiring it)
| (``atom_to_token_map``). | ||
| """ | ||
|
|
||
| import numpy as np |
There was a problem hiding this comment.
This maybe doesn't need to be in here for purposes of this PR, but I feel like these tests are still not stringent enough. (I admit there are basically no tests yet for my CIF patching operations, which pose similar risks). We probably should actually check the alignment and make sure the atoms are correct and match each other.
There was a problem hiding this comment.
I'll add an issue for this
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/models/protenix/structure_processing.py`:
- Around line 353-369: The current code collapses gaps by using
enumerate(start=1) after filtering, causing sequence offsets to be lost and
misaligning PTM serialization; instead compute 1-indexed sequence positions
relative to the chain subsequence start (like raw_res_id - min_res_id + 1). In
the branch where valid_positions and chain_id are present, capture the raw
res_ids (from chain_array.res_id at starts[:-1]), compute min_res_id =
min(chain_valid), and build pos_res_pairs by pairing (raw_id - min_res_id + 1)
with the corresponding res_name only for raw_id in chain_valid (use the same
res_ids and res_names variables) so pos_res_pairs preserves original residue
offsets for get_sequences() and detect_modifications().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 71aac233-8564-41ec-91f2-d6776d9a89ba
📒 Files selected for processing (1)
src/sampleworks/models/protenix/structure_processing.py
| # Filter by valid_positions (res_id membership) before | ||
| # assigning sequential positions, so that the 1 indexed | ||
| # indices match character positions in the sequence | ||
| # string produced by get_sequences() | ||
| if valid_positions is not None and chain_id in valid_positions: | ||
| chain_valid = valid_positions[chain_id] | ||
| min_valid = min(chain_valid) if chain_valid else 1 | ||
| valid_seq_positions = {r - min_valid + 1 for r in chain_valid} | ||
| pos_res_pairs = [ | ||
| (p, r) for p, r in pos_res_pairs if p in valid_seq_positions | ||
| ] | ||
| if hasattr(chain_array, "res_id"): | ||
| res_ids = cast(np.ndarray, chain_array.res_id)[starts[:-1]].tolist() | ||
| res_names = [ | ||
| rn | ||
| for rn, rid in zip(res_names, res_ids, strict=True) | ||
| if rid in chain_valid | ||
| ] | ||
|
|
||
| # 1 indexed positions that correspond to | ||
| # character indices in the sequence string. | ||
| pos_res_pairs = list(enumerate(res_names, start=1)) |
There was a problem hiding this comment.
Preserve sequence offsets when renumbering filtered residues.
After the res_id filter, enumerate(start=1) collapses internal gaps. get_sequences() still exports the subsequence from min(chain_valid) to max(chain_valid), so a chain with valid residues {41, 42, 45} has sequence positions 1, 2, 5, not 1, 2, 3. That makes detect_modifications() serialize PTMs onto the wrong residue for broken chains. This should mirror the raw_res_id - min_res_id + 1 mapping already used later for covalent bonds.
💡 Proposed fix
- if valid_positions is not None and chain_id in valid_positions:
- chain_valid = valid_positions[chain_id]
- if hasattr(chain_array, "res_id"):
- res_ids = cast(np.ndarray, chain_array.res_id)[starts[:-1]].tolist()
- res_names = [
- rn
- for rn, rid in zip(res_names, res_ids, strict=True)
- if rid in chain_valid
- ]
-
- # 1 indexed positions that correspond to
- # character indices in the sequence string.
- pos_res_pairs = list(enumerate(res_names, start=1))
+ if valid_positions is not None and chain_id in valid_positions:
+ chain_valid = valid_positions[chain_id]
+ if hasattr(chain_array, "res_id"):
+ res_ids = cast(np.ndarray, chain_array.res_id)[starts[:-1]].tolist()
+ min_res_id = min(chain_valid) if chain_valid else 1
+ pos_res_pairs = [
+ (int(rid) - min_res_id + 1, rn)
+ for rn, rid in zip(res_names, res_ids, strict=True)
+ if rid in chain_valid
+ ]
+ else:
+ pos_res_pairs = list(enumerate(res_names, start=1))
+ else:
+ pos_res_pairs = list(enumerate(res_names, start=1))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/models/protenix/structure_processing.py` around lines 353 -
369, The current code collapses gaps by using enumerate(start=1) after
filtering, causing sequence offsets to be lost and misaligning PTM
serialization; instead compute 1-indexed sequence positions relative to the
chain subsequence start (like raw_res_id - min_res_id + 1). In the branch where
valid_positions and chain_id are present, capture the raw res_ids (from
chain_array.res_id at starts[:-1]), compute min_res_id = min(chain_valid), and
build pos_res_pairs by pairing (raw_id - min_res_id + 1) with the corresponding
res_name only for raw_id in chain_valid (use the same res_ids and res_names
variables) so pos_res_pairs preserves original residue offsets for
get_sequences() and detect_modifications().
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/test_no_guidance_geometry.py`:
- Around line 99-102: Replace the EN DASH character in the comment "Ideal
backbone bonds are 1.33–1.52 Å" with a standard hyphen-minus so the range reads
"1.33-1.52 Å"; locate the comment near the call to filter_linear_bond_continuity
(con_mask = filter_linear_bond_continuity(bb, min_len=1.1, max_len=1.7)) and
edit that comment only to use the ASCII hyphen-minus character.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d6a4394f-9a52-422f-bd79-13ade0bb424e
📒 Files selected for processing (3)
src/sampleworks/core/samplers/edm.pytests/integration/test_no_guidance_geometry.pytests/models/test_rf3_atom_ordering.py
✅ Files skipped from review due to trivial changes (1)
- src/sampleworks/core/samplers/edm.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/models/test_rf3_atom_ordering.py
…protein chain I accidentally added a stale density_input CIF
03fec81 to
292d42b
Compare
Summary by CodeRabbit
Bug Fixes
Documentation
gamma_minvalue specifications.