New clash score script and improvements to the CIF patching script.#87
New clash score script and improvements to the CIF patching script.#87marcuscollins merged 4 commits intomainfrom
Conversation
…sure cif patching script doesn't write nan coordinates; reorg'ing dependencies in pyproject.toml
📝 WalkthroughWalkthroughUpdated dependency groups in Changes
Sequence Diagram(s)sequenceDiagram
participant User as User
participant Main as main()
participant Scanner as scan_grid_search_results()
participant Parallel as joblib.Parallel
participant Worker as process_one_experiment()
participant Phenix as phenix.clashscore
participant Parser as process_clashscore_json_output()
User->>Main: invoke with workspace & n_jobs
Main->>Scanner: locate experiments (grid_search_results)
Scanner-->>Main: experiments[]
Main->>Parallel: submit process_one_experiment jobs
loop per experiment
Parallel->>Worker: process_one_experiment(experiment)
Worker->>Worker: create nonan.cif (filter NaN)
Worker->>Phenix: run phenix.clashscore --json-filename
Phenix-->>Worker: clashscore.json + logs
Worker->>Parser: process_clashscore_json_output(clashscore.json)
Parser-->>Worker: DataFrame (metrics)
Worker-->>Parallel: return DataFrame
end
Parallel-->>Main: aggregated DataFrames
Main->>Main: write clashscore_metrics.csv
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@scripts/patch_input_cif_files.py`:
- Around line 109-114: load_any can return either AtomArray or AtomArrayStack so
the NaN-filtering assumes 3-D coords and 3-D indexing will fail for AtomArray;
wrap or convert the returned asym_unit to an AtomArrayStack (use the existing
ensure_atom_array_stack function or an isinstance check for
AtomArray/AtomArrayStack) before calling einx.rearrange on asym_unit.coord and
before slicing asym_unit[:, mask], then after filtering convert back to
AtomArray if the original was a single array so downstream code expectations
remain unchanged.
In `@scripts/run_and_process_phenix_clashscore.py`:
- Around line 44-48: Handle the case where scan_grid_search_results returns no
experiments by checking clashscore_metrics (the list produced by joblib.Parallel
calling process_one_experiment) before calling pd.concat: if the list is empty,
set clashscore_df to an empty DataFrame with the expected columns (or
pd.DataFrame()) instead of calling pd.concat([]); otherwise call
pd.concat(clashscore_metrics). After concatenation, check if clashscore_df is
empty and emit a log message indicating that all experiments produced empty
results (use the same logger used elsewhere in this script) so failures of
process_one_experiment/phenix are visible.
- Around line 64-65: The current check treats any non-zero retcode from the
`grep -v nan` call as a failure; change the logic in the block that examines
`retcode` (the return value from the grep subprocess) so only an actual grep
error (exit code 2) raises a RuntimeError referencing `logfile`, while exit code
1 (no matches / all lines are "nan") is handled as a valid case (e.g., treat as
empty result or continue processing). Update the conditional around `retcode` to
explicitly test for 2 instead of != 0 and add a clear comment near that check
explaining exit code 1 means “no match”.
- Line 61: The inline "grep -v nan" is too broad and can remove valid CIF lines;
replace it with a targeted NaN filter or, preferably, remove the grep entirely
and ensure the CIF is cleaned upstream by the existing patch_input_cif_files.py
preprocessing. Concretely: locate the command that builds the subprocess list
referencing experiment.refined_cif_path in run_and_process_phenix_clashscore.py
and either (A) remove the "grep -v nan" element and ensure the pipeline calls
patch_input_cif_files.py to produce a cleaned CIF before this script runs, or
(B) if you must keep an inline filter, change it to a restrictive regex such as
"grep -v -E '(^|[[:space:]])NaN([[:space:]]|$)'" so only standalone NaN float
tokens are excluded. Ensure test cases or a guard confirm the CIF is
preprocessed when choosing option A.
- Around line 60-63: The code leaks file descriptors by calling
file_with_no_nans.open("w") (and logfile.open("w")) and passing the file object
directly to subprocess.call without closing it; wrap those opens in a with-block
so the handle is closed after the subprocess returns (e.g. with
file_with_no_nans.open("w") as fh: subprocess.call([...], stdout=fh)), and build
the grep command as an explicit list (['grep','-v','nan',
experiment.refined_cif_path]) instead of using f-string.split() to avoid issues
with spaces and RUF010 warnings; update both occurrences that use
subprocess.call and file_*.open accordingly (refer to subprocess.call,
file_with_no_nans.open, logfile.open).
🧹 Nitpick comments (2)
scripts/patch_input_cif_files.py (1)
1-17:AtomArrayStackimport appears unused after refactor.
AtomArrayStackis still imported on line 13 but is no longer referenced directly in the code. If the guard from the previous comment is not added, this import is dead.scripts/run_and_process_phenix_clashscore.py (1)
13-34:parse_argsis well-structured.The docstring mentions "All eval scripts should use this same framework" — if that's the intent, consider moving this to a shared module (e.g., alongside
grid_search_eval_utils) to avoid copy-paste across future scripts.
k-chrispens
left a comment
There was a problem hiding this comment.
Looks good! Made a few comments but merge when ready.
|
|
||
|
|
||
| def main(args) -> None: | ||
| # TODO check that phenix is installed and commands are available, bail early if not. |
There was a problem hiding this comment.
This is kind of annoying to do in a way that ends up providing useful feedback to the user. from what I remember when I was calling phenix a lot in my early processing scripts - the way I ended up implementing then was just checking if phenix was on PATH and then erroring if not
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@scripts/run_and_process_phenix_clashscore.py`:
- Around line 80-82: The subprocess.call invocation building grep args
incorrectly concatenates strings (`"-viP" r"\bnan\b"`) so grep gets a single
malformed argument; in run_and_process_phenix_clashscore.py fix the argument
list passed to subprocess.call (where retcode is assigned) by separating the
flags and the pattern into distinct list elements (e.g., keep "-viP" and
r"\bnan\b" as two entries) so grep receives the option string and the regex
pattern as separate arguments; ensure you still pass
str(experiment.refined_cif_path) and stdout=fn unchanged.
- Around line 91-93: The subprocess.call invocation that sets retcode passes the
second argument with a leading space (" --json-filename"), causing
phenix.clashscore to receive a space-prefixed argv and fail to recognize the
flag; fix by making each CLI token a separate list element without leading
spaces so that subprocess.call([... "phenix.clashscore", str(file_with_no_nans),
"--json-filename", str(json_output)], stderr=fn) is used (locate the call to
subprocess.call that references retcode, file_with_no_nans, and json_output and
remove the leading space and split the option and its value into separate list
items).
🧹 Nitpick comments (2)
scripts/patch_input_cif_files.py (1)
57-78: Inconsistent naming between CLI args andmain()parameters.
--cif-patternmaps toargs.cif_patternbut the correspondingmain()parameter istarget_pattern; similarly--rcsb-pattern→args.rcsb_patternvsrcsb_regex. This works because they're passed positionally on line 153, but the mismatch makes it harder to grep usages and reason about the interface.Consider aligning the names — either rename the CLI flags or the function parameters.
scripts/run_and_process_phenix_clashscore.py (1)
37-48: Prefersubprocess.DEVNULLand chain the exception.Minor cleanup:
subprocess.DEVNULLavoids manually managing a file handle, and chaining the caught exception preserves the traceback.Proposed fix
- fp = open("/dev/null", "w") try: - subprocess.call("phenix.clashscore", stdout=fp) + subprocess.call("phenix.clashscore", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) except FileNotFoundError: - raise RuntimeError( + raise RuntimeError( "phenix.clashscore is not available, make sure phenix is installed " " and that you have activated it, e.g. `source phenix-dir/phenix_env.sh`" - ) - finally: - fp.close() + ) from None
…sure cif patching script doesn't write nan coordinates; reorg'ing dependencies in pyproject.toml
Summary by CodeRabbit
New Features
Dependencies
Improvements