Remove phenix clashscore script and add occupancy/density utility improvements by marcuscollins · Pull Request #112 · diff-use/sampleworks

marcuscollins · 2026-02-27T07:24:59Z

The one possibly controversial thing here is in density_utils.py. I expand arrays like elements and b-factors to have a matching batch dimension. I can update to use the new match_batch if you think it is worthwhile.

Otherwise:
--removed a duplicate script (for phenix clashscores)
--handled some quirks of our initial 40-protein run
--made it possible to search for files other than "refined.cif" (necessary since we patch those files and rename them)
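The last item above can be sketched as follows. This is a minimal illustration, assuming results live in a directory tree and that the target name is passed down as a plain argument; `find_result_files` and its parameters are hypothetical names, not the actual signature of `scan_grid_search_results`.

```python
from pathlib import Path


def find_result_files(root: Path, target_filename: str = "refined.cif") -> list[Path]:
    """Recursively collect files under `root` whose name matches `target_filename`.

    `target_filename` may be a literal file name or a glob pattern, so a
    patched and renamed output can be found without changing the scanner.
    """
    # rglob walks the whole tree; sorting keeps the result order deterministic.
    return sorted(p for p in root.rglob(target_filename) if p.is_file())
```

With the default the sketch keeps the old behavior; passing a different name (for example a renamed, patched file) changes only what is matched, not how the tree is scanned.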

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed input shape mismatches in density processing
- Improved occupancy detection for native structures
Improvements
- Enhanced error messages with actionable guidance
- Added configurable occupancy levels for grid search evaluation
Chores
- Removed obsolete clashscore processing script

coderabbitai · 2026-02-27T07:25:13Z

Walkthrough

The PR removes an outdated phenix clashscore processing script and refactors several utility modules. Changes include adding occupancy level imports, enhancing error guidance, enforcing input validation in CIF utilities, and fixing tensor shape mismatches in density utilities.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Script Removal: `scripts/run_and_process_phenix_clashscore.py` | Removed entire script that handled phenix.clashscore execution, JSON parsing, and CSV generation for grid search experiments. |
| Grid Search & Occupancy Updates: `src/sampleworks/eval/grid_search_eval_utils.py`, `src/sampleworks/eval/occupancy_utils.py` | Added OCCUPANCY_LEVELS import and --occupancies CLI option; modified scan_grid_search_results to propagate target_filename properly; introduced native directory branch in occupancy parsing to set occ_a to 0.5. |
| Metrics & Utility Refinements: `src/sampleworks/metrics/lddt.py`, `src/sampleworks/utils/cif_utils.py`, `src/sampleworks/utils/density_utils.py` | Expanded LDDT error message with guidance; enforced input sequence length consistency in CIF altloc key extraction using strict=True; expanded elements, b_factors, and occupancies tensors to match batch dimensions in density utilities. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Adding new evaluation scripts and moving eval scripts to a new home. #106: Adds run_and_process_phenix_clashscore.py to scripts/eval—directly related to the removal of the same script from scripts/.
New clash score script and improvements to the CIF patching script. #87: Adds run_and_process_phenix_clashscore.py script with CIF patching behavior—mirrors the removed script and overlaps in occupancy/CIF handling.

Suggested reviewers

k-chrispens

Poem

🐰 A clashscore script hops away,

While occupancies find a brighter day,

Tensors now match their rightful shape,

Validation strict—no data escape!

Clean refactors, changes made with care,

The codebase bounces, light as air! 🌟

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main changes: removing the phenix clashscore script and adding occupancy/density utility improvements, matching the changeset content. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/metrics/lddt.py`:
- Around line 349-352: The RuntimeError message raised when checking
predicted_aa_stack concatenates two sentences without proper
punctuation/spacing; update the string in the raise RuntimeError inside the
relevant block (the one referencing predicted_aa_stack and mask()) to add a
period and a space so it reads "...mask(). You should read in atom arrays..."
ensuring the error text is grammatically correct and clear.
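The issue flagged here is a common side effect of Python's implicit concatenation of adjacent string literals. The sketch below uses an illustrative message, not the actual text in `lddt.py`, to show how the two sentences run together and how the fix reads.

```python
# Adjacent string literals are joined with no separator, so a message split
# across lines needs its own punctuation and trailing space.
broken = (
    "No atoms left in predicted_aa_stack after mask()"
    "You should read in atom arrays with matching atoms."
)

fixed = (
    "No atoms left in predicted_aa_stack after mask(). "
    "You should read in atom arrays with matching atoms."
)
```

In `broken` the joined text contains `mask()You`, while `fixed` reads `mask(). You`, which is the form the review comment asks for.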

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8653b14 and 2c138ba.

📒 Files selected for processing (6)

scripts/run_and_process_phenix_clashscore.py
src/sampleworks/eval/grid_search_eval_utils.py
src/sampleworks/eval/occupancy_utils.py
src/sampleworks/metrics/lddt.py
src/sampleworks/utils/cif_utils.py
src/sampleworks/utils/density_utils.py

💤 Files with no reviewable changes (1)

scripts/run_and_process_phenix_clashscore.py

marcuscollins · 2026-02-27T18:23:02Z

        atom_array, device
    )

+    # need to make sure these all have the same batch dimension or the transformer will fail.


@k-chrispens this may no longer be needed with the changes you made in real_space_density.py, but I'm going to leave it here for now and then test everything in my next PR, if that's okay with you.

I don't think I made any changes affecting this

I do think it would be clearer to do match_batch here in the next PR though, esp. since these are potentially already at batch size n_model from an AtomArrayStack, which could lead to errors that we would catch and report well in match_batch
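The behavior being asked for can be sketched as follows. This is not the project's `match_batch`; its real signature is not shown in this thread. Plain Python lists stand in for tensors so the sketch runs without torch, and the function name, arguments, and error text are assumptions for illustration only.

```python
def match_batch_sketch(values, batch_size, name="values"):
    """Return `values` as a list of exactly `batch_size` rows.

    - a flat list (no batch dimension) is repeated `batch_size` times
    - a single row is repeated `batch_size` times
    - an input that already has `batch_size` rows is passed through
    - any other row count raises ValueError with a descriptive message
    """
    is_batched = len(values) > 0 and isinstance(values[0], (list, tuple))
    rows = [list(row) for row in values] if is_batched else [list(values)]

    if len(rows) == batch_size:
        return rows
    if len(rows) == 1:
        return [list(rows[0]) for _ in range(batch_size)]
    raise ValueError(
        f"{name}: cannot match batch size {batch_size}; got {len(rows)} rows"
    )
```

The point of routing every expansion through one helper is the last branch: an input that already carries a batch dimension of the wrong size (for example `n_model` rows from an AtomArrayStack) fails loudly with a named error rather than being expanded incorrectly.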

coderabbitai

🧹 Nitpick comments (1)

src/sampleworks/eval/occupancy_utils.py (1)
17-20: Consider documenting the "native" case in the docstring.

The new "native" branch is acknowledged as a hack, but it would help future maintainers to add an example in the docstring (e.g., '1vme_native' -> ('1vme', 0.5)). This clarifies expected behavior until the directory naming is properly addressed.
📝 Proposed docstring update
     """Extract protein name and occupancy from directory name.
 
     Examples:
     - '1vme_0.5occA_0.5occB' -> ('1vme', 0.5)
     - '6b8x_1.0occA' -> ('6b8x', 1.0)
     - '5sop_1.0occB' -> ('5sop', 0.0)
+    - '1vme_native' -> ('1vme', 0.5)  # hack: assumes native means 0.5 occupancy
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/eval/occupancy_utils.py` around lines 17 - 20, Update the
docstring of the function in src/sampleworks/eval/occupancy_utils.py that parses
directory names (the one using the dir_name variable) to document the new
"native" branch: add a short example such as "'1vme_native' -> ('1vme', 0.5)"
and a brief note that this is a temporary hack until directory naming is fixed
so future maintainers understand expected behavior when "native" appears in
dir_name.
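The documented examples can be reproduced with a short parser. This is a sketch under stated assumptions, not the code in `occupancy_utils.py`: the function name `parse_protein_and_occupancy` is hypothetical, and the rule that an occB-only directory yields `1 - occB` is inferred from the `'5sop_1.0occB' -> ('5sop', 0.0)` example.

```python
import re

_OCC_A = re.compile(r"(\d+(?:\.\d+)?)occA")
_OCC_B = re.compile(r"(\d+(?:\.\d+)?)occB")


def parse_protein_and_occupancy(dir_name: str) -> tuple[str, float]:
    """Return (protein, occupancy of altloc A) parsed from a directory name."""
    protein = dir_name.split("_", 1)[0]

    # Temporary hack described in the review: "native" is treated as 0.5.
    if "native" in dir_name:
        return protein, 0.5

    match_a = _OCC_A.search(dir_name)
    if match_a:
        return protein, float(match_a.group(1))

    # Assumption: with only occB given, altloc A holds the remainder.
    match_b = _OCC_B.search(dir_name)
    if match_b:
        return protein, 1.0 - float(match_b.group(1))

    raise ValueError(f"Cannot parse occupancy from directory name: {dir_name!r}")
```

Checking the `native` branch first keeps the hack in one visible place, which makes it easy to remove once the directory naming is fixed.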

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c138ba and 2026961.

📒 Files selected for processing (6)

scripts/run_and_process_phenix_clashscore.py
src/sampleworks/eval/grid_search_eval_utils.py
src/sampleworks/eval/occupancy_utils.py
src/sampleworks/metrics/lddt.py
src/sampleworks/utils/cif_utils.py
src/sampleworks/utils/density_utils.py

💤 Files with no reviewable changes (1)

scripts/run_and_process_phenix_clashscore.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/sampleworks/utils/cif_utils.py
src/sampleworks/utils/density_utils.py

k-chrispens

Looks good, I do think it's worth switching to match_batch wherever we do these expansions just so it is a bit clearer what we're trying to do and to catch potential errors, but that can come later.

k-chrispens · 2026-02-27T23:03:54Z

        atom_array, device
    )

+    # need to make sure these all have the same batch dimension or the transformer will fail.


I do think it would be clearer to do match_batch here in the next PR though, esp. since these are potentially already at batch size n_model from an AtomArrayStack, which could lead to errors that we would catch and report well in match_batch

marcuscollins requested a review from k-chrispens February 27, 2026 07:25

coderabbitai Bot reviewed Feb 27, 2026

Comment thread src/sampleworks/metrics/lddt.py

marcuscollins changed the title ~~Cleaning up little things to make all the analysis scripts run.~~ Remove phenix clashscore script and add occupancy/density utility improvements Feb 27, 2026

marcuscollins added 2 commits February 27, 2026 10:20

remove duplicate script, pass target filename pattern through when sc…

de7c02b

…anning for results

Various little cleanups needed to make analysis scripts run error free

2026961

marcuscollins force-pushed the mdc-cleanup branch from 33084d6 to 2026961 Compare February 27, 2026 18:21

marcuscollins commented Feb 27, 2026

coderabbitai Bot reviewed Feb 27, 2026

k-chrispens approved these changes Feb 27, 2026

k-chrispens merged commit 14c38a2 into main Feb 27, 2026
1 check passed

k-chrispens deleted the mdc-cleanup branch February 27, 2026 23:06

This was referenced Feb 28, 2026

fix issue 17 #115

Merged

Creates a new way to test multiple selections for each protein; parallelizes LDDT script #116

Merged

coderabbitai Bot mentioned this pull request Mar 13, 2026

code(eval): refactoring eval scripts, primarily unifying arguments #157

Merged

coderabbitai Bot mentioned this pull request Mar 30, 2026

Fix problem with heterogeneous HETATM/ATOM in the same position in different altlocs #195

Merged
