
Remove phenix clashscore script and add occupancy/density utility improvements #112

Merged
k-chrispens merged 2 commits into main from mdc-cleanup
Feb 27, 2026

Conversation

@marcuscollins
Collaborator

@marcuscollins marcuscollins commented Feb 27, 2026

The one possibly controversial thing here is in density_utils.py. I expand arrays like elements and b-factors to have a matching batch dimension. I can update to use the new match_batch if you think it is worthwhile.
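A minimal sketch of the kind of expansion described above, using NumPy as a stand-in for the tensor library. The helper name `expand_to_batch`, the array names, and the shapes are illustrative assumptions, not the code in density_utils.py.

```python
import numpy as np


def expand_to_batch(per_atom: np.ndarray, batch_size: int) -> np.ndarray:
    """Give a per-atom array of shape (n_atoms,) a leading batch dimension.

    Hypothetical helper for illustration only.
    """
    # broadcast_to returns a read-only view of shape (batch_size, n_atoms);
    # copy() turns it into an independent, writable array.
    return np.broadcast_to(per_atom, (batch_size, per_atom.shape[0])).copy()


coords = np.zeros((4, 3, 3))                # (batch, n_atoms, xyz)
b_factors = np.array([10.0, 20.0, 30.0])    # (n_atoms,)

b_factors_batched = expand_to_batch(b_factors, coords.shape[0])
print(b_factors_batched.shape)  # (4, 3)
```

This plain expansion assumes the input has no batch dimension yet; it does not check for an input that is already batched at a different size.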

Otherwise:
--removed a duplicate script (for phenix clashscores)
--handling some quirks of our initial 40 protein run
--making it possible to search for files other than "refined.cif" (necessary since we patch those files and rename)
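To illustrate the last point, here is a hedged sketch of a recursive search with a configurable file name. The function `find_structure_files` and the name `refined_patched.cif` are hypothetical; the PR's actual search logic lives in `scan_grid_search_results` and is not shown on this page.

```python
import tempfile
from pathlib import Path


def find_structure_files(root: Path, target_filename: str = "refined.cif") -> list[Path]:
    """Recursively collect files under root named target_filename.

    Note: rglob treats its argument as a glob pattern, so names containing
    characters such as '*' or '[' would need escaping.
    """
    return sorted(p for p in root.rglob(target_filename) if p.is_file())


# Small self-contained demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run_a").mkdir()
    (root / "run_a" / "refined.cif").write_text("data_a\n")
    (root / "run_b").mkdir()
    (root / "run_b" / "refined_patched.cif").write_text("data_b\n")

    default_hits = [p.relative_to(root).as_posix() for p in find_structure_files(root)]
    patched_hits = [
        p.relative_to(root).as_posix()
        for p in find_structure_files(root, "refined_patched.cif")
    ]

print(default_hits)  # ['run_a/refined.cif']
print(patched_hits)  # ['run_b/refined_patched.cif']
```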

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed input shape mismatches in density processing
    • Improved occupancy detection for native structures
  • Improvements

    • Enhanced error messages with actionable guidance
    • Added configurable occupancy levels for grid search evaluation
  • Chores

    • Removed obsolete clashscore processing script
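A configurable occupancy option of the kind listed above could look roughly like the following `argparse` sketch. The values in `OCCUPANCY_LEVELS` are placeholders (the real constant is not shown on this page), and `build_parser` is a hypothetical name.

```python
import argparse

# Placeholder values; the project's real OCCUPANCY_LEVELS is an assumption here.
OCCUPANCY_LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an --occupancies option that defaults to all levels."""
    parser = argparse.ArgumentParser(description="Grid search evaluation (sketch)")
    parser.add_argument(
        "--occupancies",
        type=float,
        nargs="+",
        default=list(OCCUPANCY_LEVELS),
        help="Occupancy levels to evaluate, separated by spaces.",
    )
    return parser


args = build_parser().parse_args(["--occupancies", "0.5", "1.0"])
print(args.occupancies)  # [0.5, 1.0]
```

Omitting the flag falls back to the full list of placeholder levels.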

@coderabbitai
Contributor

coderabbitai Bot commented Feb 27, 2026

📝 Walkthrough

The PR removes an outdated phenix clashscore processing script and refactors several utility modules. Changes include adding occupancy level imports, enhancing error guidance, enforcing input validation in CIF utilities, and fixing tensor shape mismatches in density utilities.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Script Removal: `scripts/run_and_process_phenix_clashscore.py` | Removed entire script that handled phenix.clashscore execution, JSON parsing, and CSV generation for grid search experiments. |
| Grid Search & Occupancy Updates: `src/sampleworks/eval/grid_search_eval_utils.py`, `src/sampleworks/eval/occupancy_utils.py` | Added OCCUPANCY_LEVELS import and --occupancies CLI option; modified scan_grid_search_results to propagate target_filename properly; introduced native directory branch in occupancy parsing to set occ_a to 0.5. |
| Metrics & Utility Refinements: `src/sampleworks/metrics/lddt.py`, `src/sampleworks/utils/cif_utils.py`, `src/sampleworks/utils/density_utils.py` | Expanded LDDT error message with guidance; enforced input sequence length consistency in CIF altloc key extraction using strict=True; expanded elements, b_factors, and occupancies tensors to match batch dimensions in density utilities. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested reviewers

  • k-chrispens

Poem

🐰 A clashscore script hops away,

While occupancies find a brighter day,

Tensors now match their rightful shape,

Validation strict—no data escape!

Clean refactors, changes made with care,

The codebase bounces, light as air! 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main changes: removing the phenix clashscore script and adding occupancy/density utility improvements, matching the changeset content. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/metrics/lddt.py`:
- Around line 349-352: The RuntimeError message raised when checking
predicted_aa_stack concatenates two sentences without proper
punctuation/spacing; update the string in the raise RuntimeError inside the
relevant block (the one referencing predicted_aa_stack and mask()) to add a
period and a space so it reads "...mask(). You should read in atom arrays..."
ensuring the error text is grammatically correct and clear.
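
The issue flagged here is typical of Python's implicit concatenation of adjacent string literals. The sketch below uses hypothetical message fragments (the real text in lddt.py is not reproduced on this page) to show how a missing period and space fuses two sentences.

```python
# Hypothetical fragments for illustration; not the actual message in lddt.py.
broken = (
    "predicted_aa_stack was not produced by mask()"
    "You should read in atom arrays first"
)
fixed = (
    "predicted_aa_stack was not produced by mask(). "
    "You should read in atom arrays first"
)

print(broken)  # the two sentences run together as "...mask()You should..."
print(fixed)   # reads "...mask(). You should..."
```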

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8653b14 and 2c138ba.

📒 Files selected for processing (6)
  • scripts/run_and_process_phenix_clashscore.py
  • src/sampleworks/eval/grid_search_eval_utils.py
  • src/sampleworks/eval/occupancy_utils.py
  • src/sampleworks/metrics/lddt.py
  • src/sampleworks/utils/cif_utils.py
  • src/sampleworks/utils/density_utils.py
💤 Files with no reviewable changes (1)
  • scripts/run_and_process_phenix_clashscore.py

Comment thread src/sampleworks/metrics/lddt.py
@marcuscollins marcuscollins changed the title from "Cleaning up little things to make all the analysis scripts run." to "Remove phenix clashscore script and add occupancy/density utility improvements" on Feb 27, 2026
atom_array, device
)

# need to make sure these all have the same batch dimension or the transformer will fail.
Collaborator Author

@k-chrispens this may no longer be needed with the changes you made in real_space_density.py, but I'm going to leave it here for now and test everything in my next PR, if that's okay with you.

Collaborator

I don't think I made any changes affecting this

Collaborator

I do think it would be clearer to do match_batch here in the next PR though, esp. since these are potentially already at batch size n_model from an AtomArrayStack, which could lead to errors that we would catch and report well in match_batch
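
To make the concern concrete, here is a rough sketch of a validating expansion helper, written with NumPy as a stand-in for the tensor library. It is not the project's `match_batch`, whose signature is not shown on this page; the name `match_batch_sketch`, its parameters, and the accepted shapes are all assumptions.

```python
import numpy as np


def match_batch_sketch(values: np.ndarray, batch_size: int, name: str = "values") -> np.ndarray:
    """Return values with a leading batch dimension equal to batch_size.

    Hypothetical stand-in. Accepts shapes (n_atoms,), (1, n_atoms),
    or (batch_size, n_atoms); anything else raises ValueError.
    """
    if values.ndim == 1:
        values = values[None, :]  # add a batch dimension of size 1
    if values.ndim != 2:
        raise ValueError(f"{name}: expected 1 or 2 dimensions, got shape {values.shape}")
    if values.shape[0] == batch_size:
        return values  # already batched correctly
    if values.shape[0] == 1:
        return np.broadcast_to(values, (batch_size, values.shape[1])).copy()
    # Already batched at a different size, e.g. n_model from a multi-model stack.
    raise ValueError(
        f"{name}: batch dimension {values.shape[0]} cannot be matched to {batch_size}"
    )


expanded = match_batch_sketch(np.ones(5), 4)
print(expanded.shape)  # (4, 5)

try:
    match_batch_sketch(np.ones((2, 5)), 4, name="occupancies")
except ValueError as exc:
    print(exc)
```

The point of the explicit checks is that an input already batched at a different size is reported by name rather than failing later inside the density computation.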

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/sampleworks/eval/occupancy_utils.py (1)

17-20: Consider documenting the "native" case in the docstring.

The new "native" branch is acknowledged as a hack, but it would help future maintainers to add an example in the docstring (e.g., '1vme_native' -> ('1vme', 0.5)). This clarifies expected behavior until the directory naming is properly addressed.

📝 Proposed docstring update
     """Extract protein name and occupancy from directory name.
 
     Examples:
     - '1vme_0.5occA_0.5occB' -> ('1vme', 0.5)
     - '6b8x_1.0occA' -> ('6b8x', 1.0)
     - '5sop_1.0occB' -> ('5sop', 0.0)
+    - '1vme_native' -> ('1vme', 0.5)  # hack: assumes native means 0.5 occupancy
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/eval/occupancy_utils.py` around lines 17 - 20, Update the
docstring of the function in src/sampleworks/eval/occupancy_utils.py that parses
directory names (the one using the dir_name variable) to document the new
"native" branch: add a short example such as "'1vme_native' -> ('1vme', 0.5)"
and a brief note that this is a temporary hack until directory naming is fixed
so future maintainers understand expected behavior when "native" appears in
dir_name.
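
The docstring examples above can be reproduced with a short parser like the following sketch. The function name `parse_dir_name_sketch` and the regular expression are assumptions; only the four input/output pairs come from the review, and the actual implementation in occupancy_utils.py may differ.

```python
import re

# Matches an occupancy value for conformer A, e.g. "_0.5occA".
_OCC_A = re.compile(r"_(\d+(?:\.\d+)?)occA")


def parse_dir_name_sketch(dir_name: str) -> tuple[str, float]:
    """Extract (protein name, occupancy of A) from a directory name."""
    protein = dir_name.split("_", 1)[0]
    if "native" in dir_name:
        # Temporary hack described in the review: native is taken to mean 0.5.
        return protein, 0.5
    match = _OCC_A.search(dir_name)
    # No occA token (only occB present) is treated as occupancy 0.0 for A.
    occ_a = float(match.group(1)) if match else 0.0
    return protein, occ_a


for name in ("1vme_0.5occA_0.5occB", "6b8x_1.0occA", "5sop_1.0occB", "1vme_native"):
    print(name, "->", parse_dir_name_sketch(name))
```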
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/sampleworks/eval/occupancy_utils.py`:
- Around line 17-20: Update the docstring of the function in
src/sampleworks/eval/occupancy_utils.py that parses directory names (the one
using the dir_name variable) to document the new "native" branch: add a short
example such as "'1vme_native' -> ('1vme', 0.5)" and a brief note that this is a
temporary hack until directory naming is fixed so future maintainers understand
expected behavior when "native" appears in dir_name.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c138ba and 2026961.

📒 Files selected for processing (6)
  • scripts/run_and_process_phenix_clashscore.py
  • src/sampleworks/eval/grid_search_eval_utils.py
  • src/sampleworks/eval/occupancy_utils.py
  • src/sampleworks/metrics/lddt.py
  • src/sampleworks/utils/cif_utils.py
  • src/sampleworks/utils/density_utils.py
💤 Files with no reviewable changes (1)
  • scripts/run_and_process_phenix_clashscore.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/sampleworks/utils/cif_utils.py
  • src/sampleworks/utils/density_utils.py

Collaborator

@k-chrispens k-chrispens left a comment

Looks good, I do think it's worth switching to match_batch wherever we do these expansions just so it is a bit clearer what we're trying to do and to catch potential errors, but that can come later.



2 participants