
refactor(eval): unify rscc script structure loading with generate_synthetic_density.py #196

Closed
DorisMai wants to merge 1 commit into main from dm/issue-113

Conversation

Contributor

@DorisMai DorisMai commented Mar 28, 2026

Address issue #113 (one of the TODO comments in rscc_grid_search_script.py).

Specific changes include:

  • Replaces parse() + get_asym_unit_from_structure() with the shared load_structure_with_altlocs()
  • Adds an altloc parameter to load_structure_with_altlocs() ("all" / "occupancy" / "first") so callers can select the biotite mode
  • Loads the refined structure with altloc="all" for density calculation (matching generate_synthetic_density.py) and with altloc="occupancy" for alignment
  • Removes the inline b_factor and coord validation, which should instead be handled by extract_density_inputs_from_atomarray inside compute_density_from_atomarray
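
For readers unfamiliar with the three modes, the following is a minimal, self-contained sketch of what "all", "occupancy", and "first" selection could mean at the residue level. It is not the sampleworks or biotite implementation: the `Atom` record, the `select_altlocs` helper, and the choice to sum occupancies per altloc ID within a residue are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Markers treated as "no alternate location" (an assumption for this sketch).
NO_ALTLOC = ("", ".", "?")


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record; the real code works on biotite AtomArrays."""
    chain: str
    res_id: int
    name: str
    altloc: str
    occupancy: float


def select_altlocs(atoms, altloc="all"):
    """Illustrative altloc selection; a stand-in, not the project's function."""
    if altloc not in ("all", "occupancy", "first"):
        raise ValueError(f"unknown altloc mode: {altloc!r}")
    if altloc == "all":
        return list(atoms)

    # Per residue: altloc IDs in order of appearance, with summed occupancy.
    totals = defaultdict(dict)
    for atom in atoms:
        if atom.altloc in NO_ALTLOC:
            continue
        per_res = totals[(atom.chain, atom.res_id)]
        per_res[atom.altloc] = per_res.get(atom.altloc, 0.0) + atom.occupancy

    # Pick one altloc ID per residue according to the requested mode.
    chosen = {}
    for residue, per_res in totals.items():
        if altloc == "first":
            chosen[residue] = next(iter(per_res))
        else:  # "occupancy"
            chosen[residue] = max(per_res, key=per_res.get)

    # Atoms without an altloc are always kept.
    return [
        a for a in atoms
        if a.altloc in NO_ALTLOC or a.altloc == chosen[(a.chain, a.res_id)]
    ]
```

With this sketch, "all" keeps duplicate atoms (useful when every conformer should contribute to the density), while "occupancy" and "first" return one atom per position, which is what an alignment step typically needs.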

Summary by CodeRabbit

  • Improvements
    • Enhanced handling of alternate location representations in structural data loading.
    • Added flexibility to configure how alternate locations are processed during structure loading, with options for all representations, occupancy-weighted, or first representation.

Contributor

coderabbitai Bot commented Mar 28, 2026

Walkthrough

This PR updates structure loading to support flexible altloc selection strategies. The load_structure_with_altlocs utility function now accepts an altloc parameter to choose between "all", "occupancy", or "first" representations. The evaluation script is refactored to load dual structure representations: one for alignment and one for density computation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Utility Function Enhancement: src/sampleworks/utils/atom_array_utils.py | Added altloc parameter to load_structure_with_altlocs with options "all", "occupancy", "first" (default: "all"), allowing flexible altloc selection strategy. Updated docstring to reflect new parameter behavior. |
| Script Refactoring: scripts/eval/rscc_grid_search_script.py | Migrated from atomworks.io.parser.parse to load_structure_with_altlocs for structure loading. Implemented dual-representation workflow: highest-occupancy structure for alignment, all-altloc structure for density computation. Reference caching changed from AtomArrayStack to AtomArray. Removed coordinate/b-factor validation logic tied to old parser path. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • fix issue 17 #115: Addresses related altloc and occupancy data flow handling with overlapping API changes to load_structure_with_altlocs.

Suggested reviewers

  • marcuscollins

Poem

🐰 Two structures waltz through altloc arrays,
One leads the dance, one shows the ways!
Occupancy-weighted and highest-choice too,
Alignment and density—we've got the view! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective of the PR: unifying structure loading between two scripts by refactoring rscc_grid_search_script.py to use the same approach as generate_synthetic_density.py. |


@DorisMai DorisMai marked this pull request as ready for review March 28, 2026 23:26
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/eval/rscc_grid_search_script.py (1)

136-140: Cache the refined-structure loads per trial.

Lines 139-140 parse the same refined.cif twice for every selection. On proteins with several selections, that adds redundant I/O to an already expensive evaluation pass. Cache the altloc="all" and altloc="occupancy" structures once per trial; if you keep the in-place coordinate update at Line 213, copy the all-altloc array before reusing it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/rscc_grid_search_script.py` around lines 136 - 140, The code is
re-parsing trial.refined_cif_path twice for every selection; cache the results
of load_structure_with_altlocs(trial.refined_cif_path, altloc="all") and
load_structure_with_altlocs(trial.refined_cif_path, altloc="occupancy") once per
trial (e.g., store as atom_array_all_altlocs and atom_array) and reuse them for
all selections instead of reloading; if atom_array_all_altlocs is later modified
in-place (see the coordinate update that mutates arrays), make and use a
shallow/deep copy of atom_array_all_altlocs before mutation so the cached
version remains reusable across selections.
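
The caching suggestion above can be sketched as follows. This is only an illustration under stated assumptions: `load_structure` is a placeholder for the real, expensive CIF parse, plain dictionaries stand in for atom arrays, and `cached_load` and `structures_for_selection` are hypothetical names, not functions from the script.

```python
import copy
from functools import lru_cache


def load_structure(path, altloc):
    """Placeholder loader standing in for the expensive CIF parse."""
    load_structure.calls += 1
    return {"path": path, "altloc": altloc, "coord": [[0.0, 0.0, 0.0]]}


load_structure.calls = 0  # counts how often the "file" is actually parsed


@lru_cache(maxsize=None)
def cached_load(path, altloc):
    """Parse each (path, altloc) pair at most once."""
    return load_structure(path, altloc)


def structures_for_selection(path):
    """Return per-selection structures without re-parsing the file."""
    # Deep-copy the all-altloc structure so that in-place coordinate
    # updates in one selection do not leak into the cached object.
    all_altlocs = copy.deepcopy(cached_load(path, "all"))
    # The occupancy structure is assumed to be read-only here.
    occupancy = cached_load(path, "occupancy")
    return all_altlocs, occupancy
```

In a long grid search, calling `cached_load.cache_clear()` after each trial would keep the cache scoped to one trial; an explicit per-trial dictionary would work equally well.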
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/rscc_grid_search_script.py`:
- Around line 169-171: The code currently calls remove_atoms_with_any_nan_coords
on atom_array_all_altlocs before compute_density_from_atomarray, which can
change the density input and hide invalid structures; instead, revert removal
for atom_array_all_altlocs so compute_density_from_atomarray receives the
original data and only apply remove_atoms_with_any_nan_coords to alignment-only
arrays (e.g., the arrays used for alignment/registration), leaving
atom_array_all_altlocs untouched until after shared validation has run; locate
remove_atoms_with_any_nan_coords and the call sites involving
atom_array_all_altlocs and compute_density_from_atomarray to implement this
change.
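
A minimal sketch of the separation described above, assuming coordinates are plain `(x, y, z)` tuples rather than atom arrays. `split_for_alignment_and_density` is a hypothetical helper; the real script would use `remove_atoms_with_any_nan_coords` on the alignment arrays only.

```python
import math


def has_nan(coord):
    """True if any component of an (x, y, z) coordinate is NaN."""
    return any(math.isnan(c) for c in coord)


def split_for_alignment_and_density(coords):
    """Filter NaN atoms for alignment only; leave density input untouched."""
    # Alignment cannot use atoms with undefined positions, so drop them here.
    alignment_coords = [c for c in coords if not has_nan(c)]
    # The density input is passed through unchanged so that downstream
    # validation can still detect (and report) an invalid structure.
    density_coords = list(coords)
    return alignment_coords, density_coords
```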


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0f656e8-3739-459e-b0d5-48d548243b82

📥 Commits

Reviewing files that changed from the base of the PR and between bcfb45c and cca24a7.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • scripts/eval/rscc_grid_search_script.py
  • src/sampleworks/utils/atom_array_utils.py

Comment thread scripts/eval/rscc_grid_search_script.py
Collaborator

@marcuscollins marcuscollins left a comment


I think this PR introduces a bug: it looks like you're loading just the first model in the Trial refined CIF file (see inline comments), which would mean we'd compute the density from that first model only, rather than as the average over the entire generated ensemble of structures.

# Load refined structure twice:
# - all altlocs (occupancy-weighted) for density computation
# - highest-occupancy altloc for alignment (no duplicate atoms)
atom_array_all_altlocs = load_structure_with_altlocs(trial.refined_cif_path)
Collaborator


FWIW these shouldn't have altlocs (the current structure predictors produce structures with no altlocs, although I suppose this could change.) I'm curious why you chose this path?

Collaborator


Actually, this could be a problem, because load_structure_with_altlocs currently returns only the first model in the CIF/PDB file. Later on, we need to handle multiple models (each of our output CIF files contains the entire ensemble, but its members are represented as models, not as altlocs).
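
To illustrate why loading only the first model changes the result, here is a toy sketch. It assumes a 1D "density" built from unit Gaussians and represents each model as a list of atom positions; `model_density` and `ensemble_density` are hypothetical helpers and not the project's `compute_density_from_atomarray`.

```python
import math


def model_density(coords, grid):
    """Toy 1D density: sum of unit Gaussians centered at each atom position."""
    return [sum(math.exp(-0.5 * (x - c) ** 2) for c in coords) for x in grid]


def ensemble_density(models, grid):
    """Average the per-model densities over all models in the ensemble."""
    if not models:
        raise ValueError("ensemble must contain at least one model")
    per_model = [model_density(coords, grid) for coords in models]
    # zip(*per_model) walks the grid points; average across models at each.
    return [sum(values) / len(per_model) for values in zip(*per_model)]
```

For two models with a single atom at 0.0 and 1.0 respectively, the ensemble average is symmetric across the two grid points, whereas the first model alone is peaked at 0.0, so an RSCC computed from the first model only would generally differ. In biotite, `get_structure` returns an AtomArrayStack with all models when `model` is left unset, which is one way to obtain the full ensemble.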

As part of the PR, could you include a test that makes sure this code works correctly? I can point you to files if you need example input. But make sure that the new version produces the same output as the old version.
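
One possible shape for such a regression check, as a sketch only: the helper name `assert_same_output` and the tolerances are assumptions, and the inputs would be per-selection RSCC values produced by the old and new code paths on the same example files.

```python
import math


def assert_same_output(old_values, new_values, rel_tol=1e-6, abs_tol=1e-9):
    """Fail if the new code path deviates from the old one beyond tolerance."""
    if len(old_values) != len(new_values):
        raise AssertionError(
            f"length mismatch: {len(old_values)} != {len(new_values)}"
        )
    for i, (old, new) in enumerate(zip(old_values, new_values)):
        if not math.isclose(old, new, rel_tol=rel_tol, abs_tol=abs_tol):
            raise AssertionError(f"value {i} differs: old={old}, new={new}")
```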

Contributor Author


Oh, I thought the predicted ensemble would be represented as altlocs. Can you point me to files that currently represent the ensemble predictions?

Comment thread scripts/eval/rscc_grid_search_script.py
@DorisMai DorisMai closed this Mar 31, 2026
@DorisMai DorisMai deleted the dm/issue-113 branch March 31, 2026 04:38