Skip to content

New clash score script and improvements to the CIF patching script.#87

Merged
marcuscollins merged 4 commits intomainfrom
mdc-add-phenix-clashscore
Feb 11, 2026
Merged

New clash score script and improvements to the CIF patching script.#87
marcuscollins merged 4 commits intomainfrom
mdc-add-phenix-clashscore

Conversation

@marcuscollins
Copy link
Copy Markdown
Collaborator

@marcuscollins marcuscollins commented Feb 11, 2026

…sure cif patching script doesn't write nan coordinates; reorg'ing dependencies in pyproject.toml

Summary by CodeRabbit

  • New Features

    • Added a clashscore analysis pipeline with parallel processing and aggregated metrics export.
  • Dependencies

    • Added a new "protenix" dependency group.
    • Expanded the "analysis" group with ipython, marimo, matplotlib, seaborn and reordered items.
  • Improvements

    • More robust CIF handling with NaN cleanup, safer file backups, and refined CLI defaults.

…sure cif patching script doesn't write nan coordinates; reorg'ing dependencies in pyproject.toml
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 11, 2026

📝 Walkthrough

Walkthrough

Updated dependency groups in pyproject.toml (expanded analysis, added protenix), refactored CIF patching to use generic loader with NaN-coordinate filtering and backup copying, and added a new script to run and aggregate Phenix clashscore analyses in parallel.

Changes

Cohort / File(s) Summary
Dependency Configuration
pyproject.toml
Expanded analysis group (added ipython, marimo, matplotlib, seaborn and reordered entries) and added new protenix group (protenix>=0.6.3, einx, triton).
CIF Patching Logic
scripts/patch_input_cif_files.py
Replaced direct CIF parsing with load_any(...), added NaN-coordinate cleanup using einx/numpy, compute label_entity_id from polymer entity mapping, use shutil.copy for backups, and updated CLI arg names/defaults (rcsb_regex, refined.cif, added depth).
Phenix Clashscore Processing
scripts/run_and_process_phenix_clashscore.py
New script: CLI via parse_args(), main() scans workspace and runs experiments in parallel (joblib), process_one_experiment() creates NaN-filtered CIF and invokes phenix.clashscore, and process_clashscore_json_output() flattens JSON to a DataFrame and writes clashscore_metrics.csv.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Main as main()
    participant Scanner as scan_grid_search_results()
    participant Parallel as joblib.Parallel
    participant Worker as process_one_experiment()
    participant Phenix as phenix.clashscore
    participant Parser as process_clashscore_json_output()

    User->>Main: invoke with workspace & n_jobs
    Main->>Scanner: locate experiments (grid_search_results)
    Scanner-->>Main: experiments[]
    Main->>Parallel: submit process_one_experiment jobs
    loop per experiment
        Parallel->>Worker: process_one_experiment(experiment)
        Worker->>Worker: create nonan.cif (filter NaN)
        Worker->>Phenix: run phenix.clashscore --json-filename
        Phenix-->>Worker: clashscore.json + logs
        Worker->>Parser: process_clashscore_json_output(clashscore.json)
        Parser-->>Worker: DataFrame (metrics)
        Worker-->>Parallel: return DataFrame
    end
    Parallel-->>Main: aggregated DataFrames
    Main->>Main: write clashscore_metrics.csv
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hop through CIFs where NaNs once lay,
I copy, clean, and send Phenix away,
Dependencies sprout — protenix and more,
Parallel clashscores tumble to the floor,
A joyful rabbit cheers this tidy day 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: introducing a new clash score script and improving the CIF patching script, which are the two primary focus areas of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch mdc-add-phenix-clashscore

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
scripts/run_and_process_phenix_clashscore.py (1)

40-45: Chain the exception for a clean traceback.

The original FileNotFoundError is swallowed. Use raise ... from so callers see both the root cause and your friendlier message.

Proposed fix
     try:
         subprocess.call("phenix.clashscore", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-    except FileNotFoundError:
-        raise RuntimeError(
+    except FileNotFoundError as err:
+        raise RuntimeError(
             "phenix.clashscore is not available, make sure phenix is installed "
             " and that you have activated it, e.g. `source phenix-dir/phenix_env.sh`"
-        )
+        ) from err

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@scripts/patch_input_cif_files.py`:
- Around line 109-114: load_any can return either AtomArray or AtomArrayStack so
the NaN-filtering assumes 3-D coords and 3-D indexing will fail for AtomArray;
wrap or convert the returned asym_unit to an AtomArrayStack (use the existing
ensure_atom_array_stack function or an isinstance check for
AtomArray/AtomArrayStack) before calling einx.rearrange on asym_unit.coord and
before slicing asym_unit[:, mask], then after filtering convert back to
AtomArray if the original was a single array so downstream code expectations
remain unchanged.

In `@scripts/run_and_process_phenix_clashscore.py`:
- Around line 44-48: Handle the case where scan_grid_search_results returns no
experiments by checking clashscore_metrics (the list produced by joblib.Parallel
calling process_one_experiment) before calling pd.concat: if the list is empty,
set clashscore_df to an empty DataFrame with the expected columns (or
pd.DataFrame()) instead of calling pd.concat([]); otherwise call
pd.concat(clashscore_metrics). After concatenation, check if clashscore_df is
empty and emit a log message indicating that all experiments produced empty
results (use the same logger used elsewhere in this script) so failures of
process_one_experiment/phenix are visible.
- Around line 64-65: The current check treats any non-zero retcode from the
`grep -v nan` call as a failure; change the logic in the block that examines
`retcode` (the return value from the grep subprocess) so only an actual grep
error (exit code 2) raises a RuntimeError referencing `logfile`, while exit code
1 (no matches / all lines are "nan") is handled as a valid case (e.g., treat as
empty result or continue processing). Update the conditional around `retcode` to
explicitly test for 2 instead of != 0 and add a clear comment near that check
explaining exit code 1 means “no match”.
- Line 61: The inline "grep -v nan" is too broad and can remove valid CIF lines;
replace it with a targeted NaN filter or, preferably, remove the grep entirely
and ensure the CIF is cleaned upstream by the existing patch_input_cif_files.py
preprocessing. Concretely: locate the command that builds the subprocess list
referencing experiment.refined_cif_path in run_and_process_phenix_clashscore.py
and either (A) remove the "grep -v nan" element and ensure the pipeline calls
patch_input_cif_files.py to produce a cleaned CIF before this script runs, or
(B) if you must keep an inline filter, change it to a restrictive regex such as
"grep -v -E '(^|[[:space:]])NaN([[:space:]]|$)'" so only standalone NaN float
tokens are excluded. Ensure test cases or a guard confirm the CIF is
preprocessed when choosing option A.
- Around line 60-63: The code leaks file descriptors by calling
file_with_no_nans.open("w") (and logfile.open("w")) and passing the file object
directly to subprocess.call without closing it; wrap those opens in a with-block
so the handle is closed after the subprocess returns (e.g. with
file_with_no_nans.open("w") as fh: subprocess.call([...], stdout=fh)), and build
the grep command as an explicit list (['grep','-v','nan',
experiment.refined_cif_path]) instead of using f-string.split() to avoid issues
with spaces and RUF010 warnings; update both occurrences that use
subprocess.call and file_*.open accordingly (refer to subprocess.call,
file_with_no_nans.open, logfile.open).
🧹 Nitpick comments (2)
scripts/patch_input_cif_files.py (1)

1-17: AtomArrayStack import appears unused after refactor.

AtomArrayStack is still imported on line 13 but is no longer referenced directly in the code. If the guard from the previous comment is not added, this import is dead.

scripts/run_and_process_phenix_clashscore.py (1)

13-34: parse_args is well-structured.

The docstring mentions "All eval scripts should use this same framework" — if that's the intent, consider moving this to a shared module (e.g., alongside grid_search_eval_utils) to avoid copy-paste across future scripts.

Comment thread scripts/patch_input_cif_files.py
Comment thread scripts/run_and_process_phenix_clashscore.py
Comment thread scripts/run_and_process_phenix_clashscore.py Outdated
Comment thread scripts/run_and_process_phenix_clashscore.py Outdated
Comment thread scripts/run_and_process_phenix_clashscore.py Outdated
Copy link
Copy Markdown
Collaborator

@k-chrispens k-chrispens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Made a few comments but merge when ready.



def main(args) -> None:
# TODO check that phenix is installed and commands are available, bail early if not.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of annoying to do in a way that ends up providing useful feedback to the user. from what I remember when I was calling phenix a lot in my early processing scripts - the way I ended up implementing then was just checking if phenix was on PATH and then erroring if not

Comment thread scripts/run_and_process_phenix_clashscore.py
Comment thread scripts/run_and_process_phenix_clashscore.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@scripts/run_and_process_phenix_clashscore.py`:
- Around line 80-82: The subprocess.call invocation building grep args
incorrectly concatenates strings (`"-viP" r"\bnan\b"`) so grep gets a single
malformed argument; in run_and_process_phenix_clashscore.py fix the argument
list passed to subprocess.call (where retcode is assigned) by separating the
flags and the pattern into distinct list elements (e.g., keep "-viP" and
r"\bnan\b" as two entries) so grep receives the option string and the regex
pattern as separate arguments; ensure you still pass
str(experiment.refined_cif_path) and stdout=fn unchanged.
- Around line 91-93: The subprocess.call invocation that sets retcode passes the
second argument with a leading space (" --json-filename"), causing
phenix.clashscore to receive a space-prefixed argv and fail to recognize the
flag; fix by making each CLI token a separate list element without leading
spaces so that subprocess.call([... "phenix.clashscore", str(file_with_no_nans),
"--json-filename", str(json_output)], stderr=fn) is used (locate the call to
subprocess.call that references retcode, file_with_no_nans, and json_output and
remove the leading space and split the option and its value into separate list
items).
🧹 Nitpick comments (2)
scripts/patch_input_cif_files.py (1)

57-78: Inconsistent naming between CLI args and main() parameters.

--cif-pattern maps to args.cif_pattern but the corresponding main() parameter is target_pattern; similarly --rcsb-patternargs.rcsb_pattern vs rcsb_regex. This works because they're passed positionally on line 153, but the mismatch makes it harder to grep usages and reason about the interface.

Consider aligning the names — either rename the CLI flags or the function parameters.

scripts/run_and_process_phenix_clashscore.py (1)

37-48: Prefer subprocess.DEVNULL and chain the exception.

Minor cleanup: subprocess.DEVNULL avoids manually managing a file handle, and chaining the caught exception preserves the traceback.

Proposed fix
-    fp = open("/dev/null", "w")
     try:
-        subprocess.call("phenix.clashscore", stdout=fp)
+        subprocess.call("phenix.clashscore", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
     except FileNotFoundError:
-        raise RuntimeError(
+        raise RuntimeError(
             "phenix.clashscore is not available, make sure phenix is installed "
             " and that you have activated it, e.g. `source phenix-dir/phenix_env.sh`"
-        )
-    finally:
-        fp.close()
+        ) from None

Comment thread scripts/run_and_process_phenix_clashscore.py
Comment thread scripts/run_and_process_phenix_clashscore.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants