Skip to content

Fix problem with heterogeneous HETATM/ATOM in the same position in different altlocs#195

Merged
marcuscollins merged 4 commits intomainfrom
claude/fix-mixed-hetatm
Mar 31, 2026
Merged

Fix problem with heterogeneous HETATM/ATOM in the same position in different altlocs#195
marcuscollins merged 4 commits intomainfrom
claude/fix-mixed-hetatm

Conversation

@marcuscollins
Copy link
Copy Markdown
Collaborator

@marcuscollins marcuscollins commented Mar 28, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Detects and resolves mixed ATOM/HETATM entries at the same residue positions, removing conflicting HETATM atoms and returning a corrected structure path when changes occur; logs warnings identifying affected residues and modified names.
  • Integration

    • Structure loading now pre-processes inputs with the resolver and removes any temporary resolved files after use.
  • Tests

    • Added comprehensive tests for clean, mixed, standalone ligand, multi-mixed, and real-file cases, including warning assertions.
  • Chores

    • Adjusted evaluation script output locations to write results into the configured results path.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 28, 2026

📝 Walkthrough

Walkthrough

Added a CIF preprocessing utility that detects positions with both ATOM and HETATM records having different residue names, removes conflicting HETATM rows (writing a temporary fixed CIF when needed), and integrates this step into the guidance script before parsing.

Changes

Cohort / File(s) Summary
Core Utility
src/sampleworks/utils/cif_utils.py
New resolve_mixed_hetatm_atom_altlocs() implemented: loads CIF (preserving altlocs), normalizes AtomArrayStack to an AtomArray, detects (chain, res_id) positions with both ATOM and HETATM whose residue name sets differ, logs warnings for each mixed position, filters out HETATM atoms at those positions using a boolean mask, and writes a temporary .cif when changes occur. Added imports: tempfile, numpy as np, AtomArrayStack, and save_structure_to_cif.
Integration
src/sampleworks/utils/guidance_script_utils.py
get_reward_function_and_structure() now calls resolve_mixed_hetatm_atom_altlocs(Path(structure_path)) and uses the returned path for atomworks.parse(); when a temporary path is returned, it attempts to unlink() the temp file after parsing.
Tests
tests/utils/test_cif_utils.py
Added comprehensive test module validating no-op behavior, removal of HETATM at mixed positions, preservation of unrelated residues, handling of standalone ligands/same-resname cases, multiple mixed positions, expected warning messages, and an integration check using CIF resource 6NI6. New fixtures/helpers included.
Script output paths
scripts/eval/run_and_process_tortoize.py
CSV output paths updated to be rooted at args.grid_search_results_path instead of the current working directory (file names unchanged).

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Guidance as Guidance Script
participant CIFUtil as cif_utils.resolve_mixed_hetatm_atom_altlocs
participant FS as Filesystem
participant Parser as atomworks.parse
participant Logger as Logger

Guidance->>CIFUtil: call resolve_mixed_hetatm_atom_altlocs(path)
CIFUtil->>FS: load CIF (preserve altlocs)
CIFUtil->>CIFUtil: normalize AtomArrayStack -> AtomArray
CIFUtil->>CIFUtil: detect positions with ATOM+HETATM and differing resnames
alt mixed positions found
    CIFUtil->>Logger: warn about each mixed (chain,resid) and het resname
    CIFUtil->>CIFUtil: filter out HETATM atoms at mixed positions
    CIFUtil->>FS: write fixed temp .cif and return temp path
else no fixes
    CIFUtil->>FS: return original path
end
Guidance->>Parser: call atomworks.parse(resolved_path)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • k-chrispens

Poem

🐇 I hop through altloc fields where atoms quarrel and play,

ATOMs keep their places while HETATMs stray,
I nibble the clashes, write a tidy new file,
leave residues neat so the parser can smile,
and twirl with a carrot at the end of the day. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: resolving mixed HETATM/ATOM records at the same residue position, which is the core functionality added across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/fix-mixed-hetatm

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/sampleworks/utils/cif_utils.py (1)

227-231: Temporary file cleanup consideration.

The temporary file is created with delete=False and persists after the process exits. This is necessary for downstream parsing but may accumulate orphan files over time. Consider documenting this behavior or implementing cleanup in callers for long-running processes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/cif_utils.py` around lines 227 - 231, The code in the
block that creates a temporary CIF file using tempfile.NamedTemporaryFile(...,
delete=False) and assigns tmp_path (tmp_path = Path(tmp_file.name)) leaves files
behind; update the implementation or docs to avoid orphan files by either 1)
documenting this behavior in the function docstring (mention that a temp file is
created with delete=False and callers are responsible for removing tmp_path) or
2) changing the call sites that use tmp_path (the downstream parser consumers)
to explicitly unlink(tmp_path) after parsing; alternatively implement a
try/finally in the function that yields tmp_path to callers and unlinks it in
finally so temporary files are always cleaned up. Ensure references to
tempfile.NamedTemporaryFile and tmp_path (and the surrounding function name) are
updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 232-235: The code contains duplicated calls to
save_structure_to_cif and logger.info (both save_structure_to_cif(fixed_array,
tmp_path) and logger.info(f"Wrote altloc-fixed CIF to temporary file:
{tmp_path}") appear twice); remove the second duplicate pair so only one call to
save_structure_to_cif and one logger.info remain (locate the duplicate lines
near the end of the altloc-fixing routine in cif_utils.py where
save_structure_to_cif and logger.info are invoked).

---

Nitpick comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 227-231: The code in the block that creates a temporary CIF file
using tempfile.NamedTemporaryFile(..., delete=False) and assigns tmp_path
(tmp_path = Path(tmp_file.name)) leaves files behind; update the implementation
or docs to avoid orphan files by either 1) documenting this behavior in the
function docstring (mention that a temp file is created with delete=False and
callers are responsible for removing tmp_path) or 2) changing the call sites
that use tmp_path (the downstream parser consumers) to explicitly
unlink(tmp_path) after parsing; alternatively implement a try/finally in the
function that yields tmp_path to callers and unlinks it in finally so temporary
files are always cleaned up. Ensure references to tempfile.NamedTemporaryFile
and tmp_path (and the surrounding function name) are updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e6c405d9-fa3d-43ac-a18d-7af00da1ae6d

📥 Commits

Reviewing files that changed from the base of the PR and between bcfb45c and 11ca78e.

📒 Files selected for processing (3)
  • src/sampleworks/utils/cif_utils.py
  • src/sampleworks/utils/guidance_script_utils.py
  • tests/utils/test_cif_utils.py

Comment thread src/sampleworks/utils/cif_utils.py Outdated
Copy link
Copy Markdown
Collaborator

@k-chrispens k-chrispens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a solid fix, the tests could take another pass and the coderabbit comment should be addressed, but will approve and let you take care of that

Comment on lines +163 to +166
This occurs when a residue has a modified form (e.g. CSO, cysteic acid) as some
altlocs and the canonical form (e.g. CYS) as another altloc at the same sequence
position. Atomworks treats these as two sequential residues rather than alternates,
inserting a spurious extra residue into the sequence fed to Boltz2.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth keeping an issue or something open for us so that we can track changes in the upstream? And should we file an atomworks issue?

Comment thread src/sampleworks/utils/cif_utils.py Outdated
Comment on lines +65 to +75
@pytest.fixture
def cif_mixed(tmp_path) -> Path:
"""CIF mimicking the 6NI6 bug: CYS (ATOM) and CSO (HETATM) share residue 101."""
return _write_cif(
[
_atom("A", 100, "VAL", hetero=False),
_atom("A", 101, "CYS", hetero=False), # canonical — keep
_atom("A", 101, "CSO", hetero=True), # modified altloc — remove
_atom("A", 102, "ALA", hetero=False),
],
tmp_path / "mixed.cif",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually already have 6NI6 as a fixture. I know at least in my hands Claude loves to add fixtures and mocks willy nilly, I usually have to go back and tell it to only use existing fixtures and utilities if I am having it write the tests

Comment thread tests/utils/test_cif_utils.py
marcuscollins and others added 3 commits March 30, 2026 21:36
… same residue position

Atomworks infers chain sequences from the atom array when no entity_poly_seq
is present. If a residue has both a canonical ATOM record (e.g. CYS, altloc A)
and a modified HETATM record (e.g. CSO, altlocs B/C) at the same sequence
position, atomworks treats them as two sequential residues and inserts a
spurious extra residue into the sequence passed to Boltz2.

Adds resolve_mixed_hetatm_atom_altlocs() in cif_utils.py, which loads the CIF
with all altlocs, detects positions where ATOM and HETATM records coexist with
different residue names, logs a warning per affected position, removes the
HETATM records, and returns a corrected temporary CIF for atomworks to parse.
Called in get_reward_function_and_structure() before atomworks.parse().

Fixes #194

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
scripts/eval/run_and_process_tortoize.py (1)

43-49: Consider explicit directory creation before CSV writes for defensive robustness.

Lines 43–49 depend on args.grid_search_results_path being a valid, existing directory. While parse_eval_args() already ensures it is a Path object (via type=Path), and setup_evaluation_parameters() at line 31 implicitly requires it to exist (by scanning it), there is no explicit guarantee that it persists or that subdirectory creation is handled. Adding an explicit mkdir() call makes the intent clearer and adds a safety layer.

Suggested approach
     all_residue_results, all_protein_results = tuple(zip(*tortoize_results, strict=True))
 
+    results_dir = Path(args.grid_search_results_path)
+    results_dir.mkdir(parents=True, exist_ok=True)
+
-    output_file = args.grid_search_results_path / "tortoize_residues.csv"
+    output_file = results_dir / "tortoize_residues.csv"
     pd.concat(all_residue_results).to_csv(output_file, index=False)
     logger.info(f"Residue results saved to {output_file}")
 
-    output_file = args.grid_search_results_path / "tortoize_protein_stats.csv"
+    output_file = results_dir / "tortoize_protein_stats.csv"
     pd.concat(all_protein_results).to_csv(output_file, index=False)
     logger.info(f"Protein-level stats saved to {output_file}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/run_and_process_tortoize.py` around lines 43 - 49, Ensure the
output directory exists before writing CSVs: call mkdir(parents=True,
exist_ok=True) on args.grid_search_results_path prior to the
pd.concat(...).to_csv calls so that writing "tortoize_residues.csv" and
"tortoize_protein_stats.csv" cannot fail if the directory is missing; you can
add this immediately before the first output_file assignment (keeping existing
logger.info calls and references to all_residue_results and all_protein_results
unchanged).
src/sampleworks/utils/cif_utils.py (1)

159-187: Clarify temp-file lifecycle in the function contract.

The function creates a persistent temp file (delete=False) as a side effect. Please explicitly document caller cleanup responsibility in the docstring (or enforce cleanup via a higher-level try/finally contract).

As per coding guidelines, "Only comment when truly necessary (but ALWAYS annotate complex array shapes and note side effects)."

Also applies to: 230-237

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/cif_utils.py` around lines 159 - 187, Update the
docstring of resolve_mixed_hetatm_atom_altlocs to explicitly state that when a
temp file is created it uses delete=False and is persistent, and that the caller
is responsible for removing the returned temporary file (or alternately change
the implementation to ensure automatic cleanup); include a brief “Side effects”
note annotating the persistent temp-file lifecycle and any other
persistent-temp-file helper in this module so callers know they must clean up
returned temp Paths (apply the same docstring/behavior change to the other
temp-file-creating function in this file).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 201-223: The code currently groups rows by (chain_id, res_id) only
and may conflate distinct residues that differ by insertion code or alternate
location; update the grouping and masks to include insertion-code and altloc
fields (e.g., insertion_code / ins_code and alt_loc / altloc) wherever chain_id
and res_id are used. Concretely, when iterating unique residues and building
pos_mask, require equality on chain_id, res_id, and the insertion/altloc fields;
when computing atom_res_names and hetatm_res_names and when applying
keep_mask[pos_mask & hetero] = False, narrow the mask to the matching
insertion-code/altloc subset so only the intended HETATM rows for that exact
residue instance are removed. Ensure you reference the existing variables
(chain_id, res_id, res_name, hetero, keep_mask) and add the insertion/altloc
arrays in the same comparisons.

---

Nitpick comments:
In `@scripts/eval/run_and_process_tortoize.py`:
- Around line 43-49: Ensure the output directory exists before writing CSVs:
call mkdir(parents=True, exist_ok=True) on args.grid_search_results_path prior
to the pd.concat(...).to_csv calls so that writing "tortoize_residues.csv" and
"tortoize_protein_stats.csv" cannot fail if the directory is missing; you can
add this immediately before the first output_file assignment (keeping existing
logger.info calls and references to all_residue_results and all_protein_results
unchanged).

In `@src/sampleworks/utils/cif_utils.py`:
- Around line 159-187: Update the docstring of resolve_mixed_hetatm_atom_altlocs
to explicitly state that when a temp file is created it uses delete=False and is
persistent, and that the caller is responsible for removing the returned
temporary file (or alternately change the implementation to ensure automatic
cleanup); include a brief “Side effects” note annotating the persistent
temp-file lifecycle and any other persistent-temp-file helper in this module so
callers know they must clean up returned temp Paths (apply the same
docstring/behavior change to the other temp-file-creating function in this
file).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f771d5d-50b2-4e8e-9cc7-2620d88c20e7

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2a661 and 2274268.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • scripts/eval/run_and_process_tortoize.py
  • src/sampleworks/utils/cif_utils.py
  • src/sampleworks/utils/guidance_script_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/sampleworks/utils/guidance_script_utils.py

Comment on lines +201 to +223
for chain in np.unique(chain_id):
for rid in np.unique(res_id[chain_id == chain]):
pos_mask = (chain_id == chain) & (res_id == rid)
has_no_hetatm = np.any(~hetero[pos_mask])
has_hetatm = np.any(hetero[pos_mask])

if not (has_no_hetatm and has_hetatm):
# there are either only HETATM or only ATOM records at this position, or none at all
continue

atom_res_names = np.unique(res_name[pos_mask & ~hetero])
hetatm_res_names = np.unique(res_name[pos_mask & hetero])

if set(atom_res_names) == set(hetatm_res_names):
continue # Same residue name on both — not the case we're fixing

logger.warning(
f"Chain {chain}, residue {rid}: found mixed ATOM {list(atom_res_names)} "
f"and HETATM {list(hetatm_res_names)} records with different residue names "
f"at the same sequence position. Removing HETATM records to prevent "
f"atomworks from inserting a duplicate residue into the Boltz2 input sequence."
)
keep_mask[pos_mask & hetero] = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Narrow the match key before deleting HETATM rows.

This logic keys only on (chain_id, res_id). If insertion codes or non-altloc same-number residues are present, distinct residues can be merged into one bucket and legitimate HETATM rows can be dropped. Please include insertion-code/altloc disambiguation before applying keep_mask[pos_mask & hetero] = False.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/cif_utils.py` around lines 201 - 223, The code
currently groups rows by (chain_id, res_id) only and may conflate distinct
residues that differ by insertion code or alternate location; update the
grouping and masks to include insertion-code and altloc fields (e.g.,
insertion_code / ins_code and alt_loc / altloc) wherever chain_id and res_id are
used. Concretely, when iterating unique residues and building pos_mask, require
equality on chain_id, res_id, and the insertion/altloc fields; when computing
atom_res_names and hetatm_res_names and when applying keep_mask[pos_mask &
hetero] = False, narrow the mask to the matching insertion-code/altloc subset so
only the intended HETATM rows for that exact residue instance are removed.
Ensure you reference the existing variables (chain_id, res_id, res_name, hetero,
keep_mask) and add the insertion/altloc arrays in the same comparisons.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/sampleworks/utils/cif_utils.py (1)

201-223: ⚠️ Potential issue | 🟠 Major

Don't key this cleanup on chain_id/res_id alone.

Lines 201-223 still collapse everything to one (chain_id, res_id) bucket before deleting all HETATM rows there. That is broader than the altloc-specific bug in this PR: same-number residues that are not alternate locations of the same site can still be conflated, and keep_mask[pos_mask & hetero] = False will drop legitimate records. Please narrow the match to a residue instance before masking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/cif_utils.py` around lines 201 - 223, The current
cleanup buckets by (chain_id, res_id) and deletes all HETATM rows for that
bucket, which can remove legitimate non-altloc residues; refine the mask to
target a specific residue instance by including the per-instance identifiers
(e.g., alt/location and insertion-code fields present in this parser) in
addition to chain_id and res_id before computing atom_res_names/hetatm_res_names
and before setting keep_mask; change pos_mask to a more specific
res_instance_mask (or create one) that ANDs chain_id == chain, res_id == rid,
and the altloc/insertion-code equality, then use keep_mask[res_instance_mask &
hetero] = False and use res_instance_mask when computing
atom_res_names/hetatm_res_names so only the actual alternate-location residue is
removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/utils/test_cif_utils.py`:
- Around line 19-35: The synthetic CIF fixtures lack alternate location (altloc)
and insertion-code annotations, so modify the helper functions to include at
least one atom with a non-empty altloc and/or ins_code: update _atom to accept
altloc and ins_code parameters (default empty) and set those attributes on the
Atom, and update _write_cif to ensure the resulting array annotations include
"altloc" and "ins_code" (e.g., set_annotation("altloc", ...) and
set_annotation("ins_code", ...)) before calling save_structure_to_cif so
generated CIFs exercise alternate-location/insertion-code handling; reference
_atom, _write_cif, Atom, set_annotation, and save_structure_to_cif when making
changes.

---

Duplicate comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 201-223: The current cleanup buckets by (chain_id, res_id) and
deletes all HETATM rows for that bucket, which can remove legitimate non-altloc
residues; refine the mask to target a specific residue instance by including the
per-instance identifiers (e.g., alt/location and insertion-code fields present
in this parser) in addition to chain_id and res_id before computing
atom_res_names/hetatm_res_names and before setting keep_mask; change pos_mask to
a more specific res_instance_mask (or create one) that ANDs chain_id == chain,
res_id == rid, and the altloc/insertion-code equality, then use
keep_mask[res_instance_mask & hetero] = False and use res_instance_mask when
computing atom_res_names/hetatm_res_names so only the actual alternate-location
residue is removed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a00680bd-83a5-4a6e-95bf-7735c4e060bd

📥 Commits

Reviewing files that changed from the base of the PR and between 2274268 and caf7b1b.

📒 Files selected for processing (4)
  • scripts/eval/run_and_process_tortoize.py
  • src/sampleworks/utils/cif_utils.py
  • src/sampleworks/utils/guidance_script_utils.py
  • tests/utils/test_cif_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/sampleworks/utils/guidance_script_utils.py
  • scripts/eval/run_and_process_tortoize.py

Comment thread tests/utils/test_cif_utils.py
@marcuscollins marcuscollins merged commit 15de1e9 into main Mar 31, 2026
14 of 26 checks passed
@k-chrispens k-chrispens deleted the claude/fix-mixed-hetatm branch April 22, 2026 00:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants