Fix problem with heterogeneous HETATM/ATOM in the same position in different altlocs#195
Fix problem with heterogeneous HETATM/ATOM in the same position in different altlocs#195marcuscollins merged 4 commits intomainfrom
Conversation
📝 WalkthroughWalkthroughAdded a CIF preprocessing utility that detects positions with both ATOM and HETATM records having different residue names, removes conflicting HETATM rows (writing a temporary fixed CIF when needed), and integrates this step into the guidance script before parsing. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/sampleworks/utils/cif_utils.py (1)
227-231: Temporary file cleanup consideration.The temporary file is created with
delete=Falseand persists after the process exits. This is necessary for downstream parsing but may accumulate orphan files over time. Consider documenting this behavior or implementing cleanup in callers for long-running processes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/cif_utils.py` around lines 227 - 231, The code in the block that creates a temporary CIF file using tempfile.NamedTemporaryFile(..., delete=False) and assigns tmp_path (tmp_path = Path(tmp_file.name)) leaves files behind; update the implementation or docs to avoid orphan files by either 1) documenting this behavior in the function docstring (mention that a temp file is created with delete=False and callers are responsible for removing tmp_path) or 2) changing the call sites that use tmp_path (the downstream parser consumers) to explicitly unlink(tmp_path) after parsing; alternatively implement a try/finally in the function that yields tmp_path to callers and unlinks it in finally so temporary files are always cleaned up. Ensure references to tempfile.NamedTemporaryFile and tmp_path (and the surrounding function name) are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 232-235: The code contains duplicated calls to
save_structure_to_cif and logger.info (both save_structure_to_cif(fixed_array,
tmp_path) and logger.info(f"Wrote altloc-fixed CIF to temporary file:
{tmp_path}") appear twice); remove the second duplicate pair so only one call to
save_structure_to_cif and one logger.info remain (locate the duplicate lines
near the end of the altloc-fixing routine in cif_utils.py where
save_structure_to_cif and logger.info are invoked).
---
Nitpick comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 227-231: The code in the block that creates a temporary CIF file
using tempfile.NamedTemporaryFile(..., delete=False) and assigns tmp_path
(tmp_path = Path(tmp_file.name)) leaves files behind; update the implementation
or docs to avoid orphan files by either 1) documenting this behavior in the
function docstring (mention that a temp file is created with delete=False and
callers are responsible for removing tmp_path) or 2) changing the call sites
that use tmp_path (the downstream parser consumers) to explicitly
unlink(tmp_path) after parsing; alternatively implement a try/finally in the
function that yields tmp_path to callers and unlinks it in finally so temporary
files are always cleaned up. Ensure references to tempfile.NamedTemporaryFile
and tmp_path (and the surrounding function name) are updated accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e6c405d9-fa3d-43ac-a18d-7af00da1ae6d
📒 Files selected for processing (3)
src/sampleworks/utils/cif_utils.pysrc/sampleworks/utils/guidance_script_utils.pytests/utils/test_cif_utils.py
k-chrispens
left a comment
There was a problem hiding this comment.
Looks like a solid fix, the tests could take another pass and the coderabbit comment should be addressed, but will approve and let you take care of that
| This occurs when a residue has a modified form (e.g. CSO, cysteic acid) as some | ||
| altlocs and the canonical form (e.g. CYS) as another altloc at the same sequence | ||
| position. Atomworks treats these as two sequential residues rather than alternates, | ||
| inserting a spurious extra residue into the sequence fed to Boltz2. |
There was a problem hiding this comment.
Would it be worth keeping an issue or something open for us so that we can track changes in the upstream? And should we file an atomworks issue?
| @pytest.fixture | ||
| def cif_mixed(tmp_path) -> Path: | ||
| """CIF mimicking the 6NI6 bug: CYS (ATOM) and CSO (HETATM) share residue 101.""" | ||
| return _write_cif( | ||
| [ | ||
| _atom("A", 100, "VAL", hetero=False), | ||
| _atom("A", 101, "CYS", hetero=False), # canonical — keep | ||
| _atom("A", 101, "CSO", hetero=True), # modified altloc — remove | ||
| _atom("A", 102, "ALA", hetero=False), | ||
| ], | ||
| tmp_path / "mixed.cif", |
There was a problem hiding this comment.
We actually already have 6NI6 as a fixture. I know at least in my hands Claude loves to add fixtures and mocks willy nilly, I usually have to go back and tell it to only use existing fixtures and utilities if I am having it write the tests
… same residue position Atomworks infers chain sequences from the atom array when no entity_poly_seq is present. If a residue has both a canonical ATOM record (e.g. CYS, altloc A) and a modified HETATM record (e.g. CSO, altlocs B/C) at the same sequence position, atomworks treats them as two sequential residues and inserts a spurious extra residue into the sequence passed to Boltz2. Adds resolve_mixed_hetatm_atom_altlocs() in cif_utils.py, which loads the CIF with all altlocs, detects positions where ATOM and HETATM records coexist with different residue names, logs a warning per affected position, removes the HETATM records, and returns a corrected temporary CIF for atomworks to parse. Called in get_reward_function_and_structure() before atomworks.parse(). Fixes #194 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…M and ATOM mixed in an altloc, resolves #194
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
scripts/eval/run_and_process_tortoize.py (1)
43-49: Consider explicit directory creation before CSV writes for defensive robustness.Lines 43–49 depend on
args.grid_search_results_pathbeing a valid, existing directory. Whileparse_eval_args()already ensures it is aPathobject (viatype=Path), andsetup_evaluation_parameters()at line 31 implicitly requires it to exist (by scanning it), there is no explicit guarantee that it persists or that subdirectory creation is handled. Adding an explicitmkdir()call makes the intent clearer and adds a safety layer.Suggested approach
all_residue_results, all_protein_results = tuple(zip(*tortoize_results, strict=True)) + results_dir = Path(args.grid_search_results_path) + results_dir.mkdir(parents=True, exist_ok=True) + - output_file = args.grid_search_results_path / "tortoize_residues.csv" + output_file = results_dir / "tortoize_residues.csv" pd.concat(all_residue_results).to_csv(output_file, index=False) logger.info(f"Residue results saved to {output_file}") - output_file = args.grid_search_results_path / "tortoize_protein_stats.csv" + output_file = results_dir / "tortoize_protein_stats.csv" pd.concat(all_protein_results).to_csv(output_file, index=False) logger.info(f"Protein-level stats saved to {output_file}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/run_and_process_tortoize.py` around lines 43 - 49, Ensure the output directory exists before writing CSVs: call mkdir(parents=True, exist_ok=True) on args.grid_search_results_path prior to the pd.concat(...).to_csv calls so that writing "tortoize_residues.csv" and "tortoize_protein_stats.csv" cannot fail if the directory is missing; you can add this immediately before the first output_file assignment (keeping existing logger.info calls and references to all_residue_results and all_protein_results unchanged).src/sampleworks/utils/cif_utils.py (1)
159-187: Clarify temp-file lifecycle in the function contract.The function creates a persistent temp file (
delete=False) as a side effect. Please explicitly document caller cleanup responsibility in the docstring (or enforce cleanup via a higher-level try/finally contract).As per coding guidelines, "Only comment when truly necessary (but ALWAYS annotate complex array shapes and note side effects)."
Also applies to: 230-237
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/cif_utils.py` around lines 159 - 187, Update the docstring of resolve_mixed_hetatm_atom_altlocs to explicitly state that when a temp file is created it uses delete=False and is persistent, and that the caller is responsible for removing the returned temporary file (or alternately change the implementation to ensure automatic cleanup); include a brief “Side effects” note annotating the persistent temp-file lifecycle and any other persistent-temp-file helper in this module so callers know they must clean up returned temp Paths (apply the same docstring/behavior change to the other temp-file-creating function in this file).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 201-223: The code currently groups rows by (chain_id, res_id) only
and may conflate distinct residues that differ by insertion code or alternate
location; update the grouping and masks to include insertion-code and altloc
fields (e.g., insertion_code / ins_code and alt_loc / altloc) wherever chain_id
and res_id are used. Concretely, when iterating unique residues and building
pos_mask, require equality on chain_id, res_id, and the insertion/altloc fields;
when computing atom_res_names and hetatm_res_names and when applying
keep_mask[pos_mask & hetero] = False, narrow the mask to the matching
insertion-code/altloc subset so only the intended HETATM rows for that exact
residue instance are removed. Ensure you reference the existing variables
(chain_id, res_id, res_name, hetero, keep_mask) and add the insertion/altloc
arrays in the same comparisons.
---
Nitpick comments:
In `@scripts/eval/run_and_process_tortoize.py`:
- Around line 43-49: Ensure the output directory exists before writing CSVs:
call mkdir(parents=True, exist_ok=True) on args.grid_search_results_path prior
to the pd.concat(...).to_csv calls so that writing "tortoize_residues.csv" and
"tortoize_protein_stats.csv" cannot fail if the directory is missing; you can
add this immediately before the first output_file assignment (keeping existing
logger.info calls and references to all_residue_results and all_protein_results
unchanged).
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 159-187: Update the docstring of resolve_mixed_hetatm_atom_altlocs
to explicitly state that when a temp file is created it uses delete=False and is
persistent, and that the caller is responsible for removing the returned
temporary file (or alternately change the implementation to ensure automatic
cleanup); include a brief “Side effects” note annotating the persistent
temp-file lifecycle and any other persistent-temp-file helper in this module so
callers know they must clean up returned temp Paths (apply the same
docstring/behavior change to the other temp-file-creating function in this
file).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4f771d5d-50b2-4e8e-9cc7-2620d88c20e7
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
scripts/eval/run_and_process_tortoize.pysrc/sampleworks/utils/cif_utils.pysrc/sampleworks/utils/guidance_script_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/sampleworks/utils/guidance_script_utils.py
| for chain in np.unique(chain_id): | ||
| for rid in np.unique(res_id[chain_id == chain]): | ||
| pos_mask = (chain_id == chain) & (res_id == rid) | ||
| has_no_hetatm = np.any(~hetero[pos_mask]) | ||
| has_hetatm = np.any(hetero[pos_mask]) | ||
|
|
||
| if not (has_no_hetatm and has_hetatm): | ||
| # there are either only HETATM or only ATOM records at this position, or none at all | ||
| continue | ||
|
|
||
| atom_res_names = np.unique(res_name[pos_mask & ~hetero]) | ||
| hetatm_res_names = np.unique(res_name[pos_mask & hetero]) | ||
|
|
||
| if set(atom_res_names) == set(hetatm_res_names): | ||
| continue # Same residue name on both — not the case we're fixing | ||
|
|
||
| logger.warning( | ||
| f"Chain {chain}, residue {rid}: found mixed ATOM {list(atom_res_names)} " | ||
| f"and HETATM {list(hetatm_res_names)} records with different residue names " | ||
| f"at the same sequence position. Removing HETATM records to prevent " | ||
| f"atomworks from inserting a duplicate residue into the Boltz2 input sequence." | ||
| ) | ||
| keep_mask[pos_mask & hetero] = False |
There was a problem hiding this comment.
Narrow the match key before deleting HETATM rows.
This logic keys only on (chain_id, res_id). If insertion codes or non-altloc same-number residues are present, distinct residues can be merged into one bucket and legitimate HETATM rows can be dropped. Please include insertion-code/altloc disambiguation before applying keep_mask[pos_mask & hetero] = False.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/cif_utils.py` around lines 201 - 223, The code
currently groups rows by (chain_id, res_id) only and may conflate distinct
residues that differ by insertion code or alternate location; update the
grouping and masks to include insertion-code and altloc fields (e.g.,
insertion_code / ins_code and alt_loc / altloc) wherever chain_id and res_id are
used. Concretely, when iterating unique residues and building pos_mask, require
equality on chain_id, res_id, and the insertion/altloc fields; when computing
atom_res_names and hetatm_res_names and when applying keep_mask[pos_mask &
hetero] = False, narrow the mask to the matching insertion-code/altloc subset so
only the intended HETATM rows for that exact residue instance are removed.
Ensure you reference the existing variables (chain_id, res_id, res_name, hetero,
keep_mask) and add the insertion/altloc arrays in the same comparisons.
2274268 to
caf7b1b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/sampleworks/utils/cif_utils.py (1)
201-223:⚠️ Potential issue | 🟠 MajorDon't key this cleanup on
chain_id/res_idalone.Lines 201-223 still collapse everything to one
(chain_id, res_id)bucket before deleting allHETATMrows there. That is broader than the altloc-specific bug in this PR: same-number residues that are not alternate locations of the same site can still be conflated, andkeep_mask[pos_mask & hetero] = Falsewill drop legitimate records. Please narrow the match to a residue instance before masking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/cif_utils.py` around lines 201 - 223, The current cleanup buckets by (chain_id, res_id) and deletes all HETATM rows for that bucket, which can remove legitimate non-altloc residues; refine the mask to target a specific residue instance by including the per-instance identifiers (e.g., alt/location and insertion-code fields present in this parser) in addition to chain_id and res_id before computing atom_res_names/hetatm_res_names and before setting keep_mask; change pos_mask to a more specific res_instance_mask (or create one) that ANDs chain_id == chain, res_id == rid, and the altloc/insertion-code equality, then use keep_mask[res_instance_mask & hetero] = False and use res_instance_mask when computing atom_res_names/hetatm_res_names so only the actual alternate-location residue is removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/utils/test_cif_utils.py`:
- Around line 19-35: The synthetic CIF fixtures lack alternate location (altloc)
and insertion-code annotations, so modify the helper functions to include at
least one atom with a non-empty altloc and/or ins_code: update _atom to accept
altloc and ins_code parameters (default empty) and set those attributes on the
Atom, and update _write_cif to ensure the resulting array annotations include
"altloc" and "ins_code" (e.g., set_annotation("altloc", ...) and
set_annotation("ins_code", ...)) before calling save_structure_to_cif so
generated CIFs exercise alternate-location/insertion-code handling; reference
_atom, _write_cif, Atom, set_annotation, and save_structure_to_cif when making
changes.
---
Duplicate comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 201-223: The current cleanup buckets by (chain_id, res_id) and
deletes all HETATM rows for that bucket, which can remove legitimate non-altloc
residues; refine the mask to target a specific residue instance by including the
per-instance identifiers (e.g., alt/location and insertion-code fields present
in this parser) in addition to chain_id and res_id before computing
atom_res_names/hetatm_res_names and before setting keep_mask; change pos_mask to
a more specific res_instance_mask (or create one) that ANDs chain_id == chain,
res_id == rid, and the altloc/insertion-code equality, then use
keep_mask[res_instance_mask & hetero] = False and use res_instance_mask when
computing atom_res_names/hetatm_res_names so only the actual alternate-location
residue is removed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a00680bd-83a5-4a6e-95bf-7735c4e060bd
📒 Files selected for processing (4)
scripts/eval/run_and_process_tortoize.pysrc/sampleworks/utils/cif_utils.pysrc/sampleworks/utils/guidance_script_utils.pytests/utils/test_cif_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/sampleworks/utils/guidance_script_utils.py
- scripts/eval/run_and_process_tortoize.py
Summary by CodeRabbit
Bug Fixes
Integration
Tests
Chores