New clash score script and improvements to the CIF patching script. by marcuscollins · Pull Request #87 · diff-use/sampleworks

marcuscollins · 2026-02-11T00:19:14Z

…sure cif patching script doesn't write nan coordinates; reorg'ing dependencies in pyproject.toml

Summary by CodeRabbit

New Features
- Added a clashscore analysis pipeline with parallel processing and aggregated metrics export.
Dependencies
- Added a new "protenix" dependency group.
- Expanded the "analysis" group with ipython, marimo, matplotlib, seaborn and reordered items.
Improvements
- More robust CIF handling with NaN cleanup, safer file backups, and refined CLI defaults.


coderabbitai · 2026-02-11T00:19:32Z

📝 Walkthrough


Updated dependency groups in pyproject.toml (expanded analysis, added protenix), refactored CIF patching to use generic loader with NaN-coordinate filtering and backup copying, and added a new script to run and aggregate Phenix clashscore analyses in parallel.

Changes

Cohort / File(s)	Summary
Dependency Configuration `pyproject.toml`	Expanded `analysis` group (added `ipython`, `marimo`, `matplotlib`, `seaborn` and reordered entries) and added new `protenix` group (`protenix>=0.6.3`, `einx`, `triton`).
CIF Patching Logic `scripts/patch_input_cif_files.py`	Replaced direct CIF parsing with `load_any(...)`, added NaN-coordinate cleanup using `einx`/numpy, compute `label_entity_id` from polymer entity mapping, use `shutil.copy` for backups, and updated CLI arg names/defaults (`rcsb_regex`, `refined.cif`, added `depth`).
Phenix Clashscore Processing `scripts/run_and_process_phenix_clashscore.py`	New script: CLI via `parse_args()`, `main()` scans workspace and runs experiments in parallel (`joblib`), `process_one_experiment()` creates NaN-filtered CIF and invokes `phenix.clashscore`, and `process_clashscore_json_output()` flattens JSON to a DataFrame and writes `clashscore_metrics.csv`.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Main as main()
    participant Scanner as scan_grid_search_results()
    participant Parallel as joblib.Parallel
    participant Worker as process_one_experiment()
    participant Phenix as phenix.clashscore
    participant Parser as process_clashscore_json_output()

    User->>Main: invoke with workspace & n_jobs
    Main->>Scanner: locate experiments (grid_search_results)
    Scanner-->>Main: experiments[]
    Main->>Parallel: submit process_one_experiment jobs
    loop per experiment
        Parallel->>Worker: process_one_experiment(experiment)
        Worker->>Worker: create nonan.cif (filter NaN)
        Worker->>Phenix: run phenix.clashscore --json-filename
        Phenix-->>Worker: clashscore.json + logs
        Worker->>Parser: process_clashscore_json_output(clashscore.json)
        Parser-->>Worker: DataFrame (metrics)
        Worker-->>Parallel: return DataFrame
    end
    Parallel-->>Main: aggregated DataFrames
    Main->>Main: write clashscore_metrics.csv

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

New script that patches the CIF files we write #72: Related changes to scripts/patch_input_cif_files.py that modify CIF parsing, NaN handling, and entity_id adjustments.

Poem

🐰 I hop through CIFs where NaNs once lay,
I copy, clean, and send Phenix away,
Dependencies sprout — protenix and more,
Parallel clashscores tumble to the floor,
A joyful rabbit cheers this tidy day 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: introducing a new clash score script and improving the CIF patching script, which are the two primary focus areas of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

scripts/run_and_process_phenix_clashscore.py (1)

40-45: Chain the exception for a clean traceback.

The original FileNotFoundError is swallowed. Use raise ... from so callers see both the root cause and your friendlier message.

Proposed fix

     try:
         subprocess.call("phenix.clashscore", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-    except FileNotFoundError:
-        raise RuntimeError(
+    except FileNotFoundError as err:
+        raise RuntimeError(
             "phenix.clashscore is not available, make sure phenix is installed "
             " and that you have activated it, e.g. `source phenix-dir/phenix_env.sh`"
-        )
+        ) from err
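
A minimal sketch of the chaining behaviour this nitpick relies on. The helper name `check_tool_available` is hypothetical and not taken from the PR; the point is only that `raise ... from err` keeps the original `FileNotFoundError` on `__cause__`, so the traceback shows both errors.

```python
import subprocess


def check_tool_available(command: str) -> None:
    """Raise RuntimeError, chained to the original error, if `command` cannot be launched.

    Hypothetical helper for illustration; the name does not come from the PR.
    """
    try:
        # Launching an executable that does not exist raises FileNotFoundError.
        subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except FileNotFoundError as err:
        # `from err` stores the root cause on RuntimeError.__cause__.
        raise RuntimeError(f"{command} is not available") from err
```

A caller that catches the `RuntimeError` can still inspect `exc.__cause__` to see the underlying `FileNotFoundError`. Writing `from None` instead would suppress it.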


coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In `@scripts/patch_input_cif_files.py`:
- Around line 109-114: load_any can return either AtomArray or AtomArrayStack so
the NaN-filtering assumes 3-D coords and 3-D indexing will fail for AtomArray;
wrap or convert the returned asym_unit to an AtomArrayStack (use the existing
ensure_atom_array_stack function or an isinstance check for
AtomArray/AtomArrayStack) before calling einx.rearrange on asym_unit.coord and
before slicing asym_unit[:, mask], then after filtering convert back to
AtomArray if the original was a single array so downstream code expectations
remain unchanged.
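
The shape problem described above can be sketched with plain NumPy, independent of the PR's actual code. `nan_free_atom_mask` is a hypothetical helper; it assumes coordinates of shape `(n_atoms, 3)` for a single structure or `(n_models, n_atoms, 3)` for a stack, and promotes the former to the latter before reducing.

```python
import numpy as np


def nan_free_atom_mask(coord: np.ndarray) -> np.ndarray:
    """Return a boolean mask over atoms whose coordinates contain no NaN in any model.

    Accepts (n_atoms, 3) for a single structure or (n_models, n_atoms, 3) for a stack.
    Hypothetical helper for illustration only.
    """
    coord = np.asarray(coord, dtype=float)
    if coord.ndim == 2:
        # Promote a single structure to a one-model stack so the reduction is uniform.
        coord = coord[np.newaxis, ...]
    if coord.ndim != 3 or coord.shape[-1] != 3:
        raise ValueError(
            f"expected (n_atoms, 3) or (n_models, n_atoms, 3), got {coord.shape}"
        )
    # Keep an atom only if no model has a NaN in any of its x, y, z components.
    return ~np.isnan(coord).any(axis=(0, 2))
```

The resulting mask has length `n_atoms` in both cases, so it could index a single array as `array[mask]` and a stack as `stack[:, mask]`, leaving the original type unchanged.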

In `@scripts/run_and_process_phenix_clashscore.py`:
- Around line 44-48: Handle the case where scan_grid_search_results returns no
experiments by checking clashscore_metrics (the list produced by joblib.Parallel
calling process_one_experiment) before calling pd.concat: if the list is empty,
set clashscore_df to an empty DataFrame with the expected columns (or
pd.DataFrame()) instead of calling pd.concat([]); otherwise call
pd.concat(clashscore_metrics). After concatenation, check if clashscore_df is
empty and emit a log message indicating that all experiments produced empty
results (use the same logger used elsewhere in this script) so failures of
process_one_experiment/phenix are visible.
- Around line 64-65: The current check treats any non-zero retcode from the
`grep -v nan` call as a failure; change the logic in the block that examines
`retcode` (the return value from the grep subprocess) so only an actual grep
error (exit code 2) raises a RuntimeError referencing `logfile`, while exit code
1 (no matches / all lines are "nan") is handled as a valid case (e.g., treat as
empty result or continue processing). Update the conditional around `retcode` to
explicitly test for 2 instead of != 0 and add a clear comment near that check
explaining exit code 1 means “no match”.
- Line 61: The inline "grep -v nan" is too broad and can remove valid CIF lines;
replace it with a targeted NaN filter or, preferably, remove the grep entirely
and ensure the CIF is cleaned upstream by the existing patch_input_cif_files.py
preprocessing. Concretely: locate the command that builds the subprocess list
referencing experiment.refined_cif_path in run_and_process_phenix_clashscore.py
and either (A) remove the "grep -v nan" element and ensure the pipeline calls
patch_input_cif_files.py to produce a cleaned CIF before this script runs, or
(B) if you must keep an inline filter, change it to a restrictive regex such as
"grep -v -E '(^|[[:space:]])NaN([[:space:]]|$)'" so only standalone NaN float
tokens are excluded. Ensure test cases or a guard confirm the CIF is
preprocessed when choosing option A.
- Around line 60-63: The code leaks file descriptors by calling
file_with_no_nans.open("w") (and logfile.open("w")) and passing the file object
directly to subprocess.call without closing it; wrap those opens in a with-block
so the handle is closed after the subprocess returns (e.g. with
file_with_no_nans.open("w") as fh: subprocess.call([...], stdout=fh)), and build
the grep command as an explicit list (['grep','-v','nan',
experiment.refined_cif_path]) instead of using f-string.split() to avoid issues
with spaces and RUF010 warnings; update both occurrences that use
subprocess.call and file_*.open accordingly (refer to subprocess.call,
file_with_no_nans.open, logfile.open).
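
The exit-code and file-handle points above can be sketched together. This is an illustration under assumptions, not the script's code: the function names are hypothetical, and the broad `grep -v nan` filter is kept only to show the mechanics. It relies on grep's documented exit statuses: 0 when lines were selected, 1 when none were, and 2 or greater on error.

```python
import subprocess
from pathlib import Path


def grep_found_error(retcode: int) -> bool:
    """grep exits 0 when lines were selected, 1 when none were, and 2 or more on error."""
    return retcode >= 2


def write_lines_without_nan(source: Path, destination: Path) -> int:
    """Copy `source` to `destination`, dropping every line that contains "nan".

    Returns grep's exit code and raises RuntimeError only for a genuine grep failure.
    Hypothetical helper for illustration only.
    """
    # Explicit argument list: no str.split(), so paths with spaces stay intact.
    command = ["grep", "-v", "nan", str(source)]
    # The with-block closes the handle once the subprocess has returned.
    with destination.open("w") as handle:
        retcode = subprocess.call(command, stdout=handle)
    if grep_found_error(retcode):
        raise RuntimeError(f"grep failed with exit code {retcode} on {source}")
    # Exit code 1 only means no line survived the filter; the output file is empty.
    return retcode
```

Returning the exit code lets the caller decide whether an empty output file should be logged or skipped before `phenix.clashscore` is invoked.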

🧹 Nitpick comments (2)

scripts/patch_input_cif_files.py (1)

1-17: AtomArrayStack import appears unused after refactor.

AtomArrayStack is still imported on line 13 but is no longer referenced directly in the code. If the guard from the previous comment is not added, this import is dead.

scripts/run_and_process_phenix_clashscore.py (1)

13-34: parse_args is well-structured.

The docstring mentions "All eval scripts should use this same framework" — if that's the intent, consider moving this to a shared module (e.g., alongside grid_search_eval_utils) to avoid copy-paste across future scripts.

k-chrispens

Looks good! Made a few comments but merge when ready.

k-chrispens · 2026-02-11T04:07:04Z

+
+
+def main(args) -> None:
+    # TODO check that phenix is installed and commands are available, bail early if not.


This is kind of annoying to do in a way that ends up providing useful feedback to the user. From what I remember of calling phenix a lot in my early processing scripts, the way I ended up implementing it was just checking whether phenix was on PATH and erroring if not.
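
The PATH check described here could be sketched as follows. `require_on_path` is a hypothetical helper, not code from the PR; it uses only the standard-library `shutil.which`, which returns `None` when no matching executable is found.

```python
import shutil


def require_on_path(executable: str, hint: str = "") -> str:
    """Return the resolved path of `executable`, or raise if it is not on PATH.

    Hypothetical helper for illustration only.
    """
    resolved = shutil.which(executable)
    if resolved is None:
        message = f"{executable} was not found on PATH."
        if hint:
            # Append an optional remediation hint for the user.
            message = f"{message} {hint}"
        raise RuntimeError(message)
    return resolved
```

Called at the top of `main()`, for example as `require_on_path("phenix.clashscore", hint="Activate Phenix first, e.g. `source phenix-dir/phenix_env.sh`.")`, this bails out before any experiment is scanned or any subprocess is launched.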

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@scripts/run_and_process_phenix_clashscore.py`:
- Around line 80-82: The subprocess.call invocation building grep args
incorrectly concatenates strings (`"-viP" r"\bnan\b"`) so grep gets a single
malformed argument; in run_and_process_phenix_clashscore.py fix the argument
list passed to subprocess.call (where retcode is assigned) by separating the
flags and the pattern into distinct list elements (e.g., keep "-viP" and
r"\bnan\b" as two entries) so grep receives the option string and the regex
pattern as separate arguments; ensure you still pass
str(experiment.refined_cif_path) and stdout=fn unchanged.
- Around line 91-93: The subprocess.call invocation that sets retcode passes the
second argument with a leading space (" --json-filename"), causing
phenix.clashscore to receive a space-prefixed argv and fail to recognize the
flag; fix by making each CLI token a separate list element without leading
spaces so that subprocess.call([... "phenix.clashscore", str(file_with_no_nans),
"--json-filename", str(json_output)], stderr=fn) is used (locate the call to
subprocess.call that references retcode, file_with_no_nans, and json_output and
remove the leading space and split the option and its value into separate list
items).
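
Both issues come down to how the argument list is built, which a short standalone sketch can show. The function name and file paths below are hypothetical; the flags are the ones quoted in the comments above.

```python
# Adjacent string literals are concatenated at compile time, so this list has
# only two elements: grep would receive "-viP\bnan\b" as a single argument.
broken_args = ["grep", "-viP" r"\bnan\b"]

# With a comma between them, the flags and the pattern are distinct arguments.
fixed_args = ["grep", "-viP", r"\bnan\b"]


def clashscore_command(cif_path: str, json_path: str) -> list[str]:
    """Build the argv list with one token per element and no stray whitespace.

    Hypothetical helper for illustration only.
    """
    return ["phenix.clashscore", cif_path, "--json-filename", json_path]
```

Because `subprocess.call` passes each list element to the program verbatim when no shell is involved, a leading space or a missing comma changes what the program sees in its argv.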

🧹 Nitpick comments (2)

scripts/patch_input_cif_files.py (1)

57-78: Inconsistent naming between CLI args and main() parameters.

--cif-pattern maps to args.cif_pattern but the corresponding main() parameter is target_pattern; similarly --rcsb-pattern → args.rcsb_pattern vs rcsb_regex. This works because they're passed positionally on line 153, but the mismatch makes it harder to grep usages and reason about the interface.

Consider aligning the names — either rename the CLI flags or the function parameters.
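
scripts/run_and_process_phenix_clashscore.py (1)

37-48: Prefer subprocess.DEVNULL and make the exception chaining explicit.

Minor cleanup: subprocess.DEVNULL avoids manually managing a file handle, and an explicit `from` clause on the re-raise states the chaining intent. Note that `from None`, as proposed here, suppresses the original FileNotFoundError in the traceback, whereas `from err` would preserve it.

Proposed fix
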
-    fp = open("/dev/null", "w")
     try:
-        subprocess.call("phenix.clashscore", stdout=fp)
+        subprocess.call("phenix.clashscore", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
     except FileNotFoundError:
-        raise RuntimeError(
+        raise RuntimeError(
             "phenix.clashscore is not available, make sure phenix is installed "
             " and that you have activated it, e.g. `source phenix-dir/phenix_env.sh`"
-        )
-    finally:
-        fp.close()
+        ) from None

adding script to compute clashscores using phenix.clashscore; making sure cif patching script doesn't write nan coordinates; reorg'ing dependencies in pyproject.toml

0ddb18d

marcuscollins requested a review from k-chrispens February 11, 2026 00:19

coderabbitai Bot reviewed Feb 11, 2026


Addressing coderabbit f/b on PR 87

545e72c

k-chrispens approved these changes Feb 11, 2026


Addressing k.chrispen's f/b on PR 87

8732a23

coderabbitai Bot reviewed Feb 11, 2026


Comment thread scripts/run_and_process_phenix_clashscore.py


addressing more coderabbit f/b

2d90cc7

marcuscollins merged commit 6ee9f8b into main Feb 11, 2026
1 check passed

coderabbitai Bot mentioned this pull request Feb 12, 2026

Adding a script to compute bond length and angle outliers using peppr #89

Merged

k-chrispens deleted the mdc-add-phenix-clashscore branch February 12, 2026 21:02


