fix issue 17#115
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough
Replace scalar occupancy with per-altloc occupancy mappings across dataclasses, utilities, reference-loading, scripts, and tests; add deterministic occupancy formatting/validation helpers; and remove the RSCC grid-search analysis module.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Pull request overview
This PR updates occupancy filename-string generation to support arbitrary altloc sets and refactors the RSCC grid search analysis notebook to use shared eval utilities/config loading, aligning the analysis workflow with the newer grid search dataset conventions (and dropping legacy 6B8X occupancy-string compatibility).
Changes:
- Generalize `occupancy_to_str()` to accept arbitrary altloc occupancies via keyword args and omit zero-occupancy altlocs.
- Update `ProteinConfig` path generation to use the new occupancy string format (removing legacy 6B8X-specific fallbacks).
- Refactor `rscc_grid_search_analysis.py` to rely on shared constants/utilities (`scan_grid_search_results`, `rscc`, `compute_density_from_atomarray`, CSV-driven protein configs) and export results under `GRID_SEARCH_DIR`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/sampleworks/eval/rscc_grid_search_analysis.py | Refactors RSCC analysis to use shared eval utilities/dataclasses/constants and CSV-driven configs; updates map loading/density computation and output pathing. |
| src/sampleworks/eval/occupancy_utils.py | Replaces two-altloc occupancy string helper with a generalized altloc-kwargs implementation. |
| src/sampleworks/eval/eval_dataclasses.py | Updates occupancy string generation in ProteinConfig to match the new generalized format and removes legacy 6B8X fallbacks. |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/sampleworks/eval/rscc_grid_search_analysis.py (1)
219-239: ⚠️ Potential issue | 🟠 Major: Guard summary aggregation when no experiments are processed.
If `results` is empty, `pd.DataFrame([r.__dict__ for r in results])` can be columnless, and line 234 then raises on missing `protein`/`scaler`/`rscc`.
🛠️ Suggested change
```diff
 def _(pd, results):
     df = pd.DataFrame([r.__dict__ for r in results])
@@
     print("Results Summary:")
     print(df.drop(drop_cols, axis=1, errors="ignore").head(20).to_string())
-    print("\n\nSummary Statistics by Protein and Scaler:")
-    summary = (
-        df.groupby(["protein", "scaler"])["rscc"]
-        .agg(["count", "mean", "std", "min", "max"])
-        .round(4)
-    )
-    print(summary)
+    _required = {"protein", "scaler", "rscc"}
+    if df.empty or not _required.issubset(df.columns):
+        print("\n\nSummary Statistics by Protein and Scaler: no rows to summarize")
+    else:
+        print("\n\nSummary Statistics by Protein and Scaler:")
+        summary = (
+            df.groupby(["protein", "scaler"])["rscc"]
+            .agg(["count", "mean", "std", "min", "max"])
+            .round(4)
+        )
+        print(summary)
     return (df,)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/rscc_grid_search_analysis.py` around lines 219 - 239, The aggregation fails when results is empty; modify the logic around df and summary to guard that case: after building df from results, check if df is empty (or if results is falsy) and either initialize df with the expected columns ("protein", "scaler", "rscc", plus any dropped cols like "exp_dir", "refined_cif_path", "base_map_path", "error", "protein_dir_name") or skip the groupby and set summary to an empty DataFrame/appropriate placeholder; ensure the prints still run without causing KeyError (use df.empty or "protein" in df.columns to decide), and return the same tuple shape (return (df,)) so callers get a consistent return type.
🧹 Nitpick comments (2)
src/sampleworks/eval/occupancy_utils.py (1)
72-77: Canonicalize altloc ordering before assembling the occupancy string.
Line 72 preserves caller insertion order, so equivalent occupancy sets can produce different strings (`A,B` vs `B,A`) and cause avoidable path/cache misses.
♻️ Suggested change
```diff
-    for label, occ in altloc_occupancies.items():
+    for label in sorted(altloc_occupancies):
+        occ = altloc_occupancies[label]
         if occ < 0:
             raise ValueError(f"Negative occupancy for altloc {label}: {occ}")
         occ = round(occ, 2)
         if abs(occ) > 1e-6:
             parts.append(f"{occ}occ{label}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/occupancy_utils.py` around lines 72 - 77, The loop over altloc_occupancies is using caller insertion order which yields non-deterministic occupancy strings; iterate over a canonical sorted order instead (e.g., iterate over sorted(altloc_occupancies.items()) or sorted(altloc_occupancies.keys())) before applying the existing checks and formatting so parts are appended deterministically; keep the ValueError for negative occupancies, the rounding to two decimals, and the abs(occ) > 1e-6 check but perform them in the sorted iteration that references altloc_occupancies and parts.
src/sampleworks/eval/rscc_grid_search_analysis.py (1)
681-686: Normalize occupancy precision before merging on float keys.
Line 684 merges on raw `occ_a` floats; tiny precision differences can silently drop correlation columns.
♻️ Suggested change
```diff
 if not df_base_vs_pure.empty:
+    _export_df = _export_df.copy()
+    _corr_df = df_base_vs_pure.copy()
+    _export_df["occ_a"] = _export_df["occ_a"].round(2)
+    _corr_df["occ_a"] = _corr_df["occ_a"].round(2)
     _export_df = pd.merge(
         _export_df,
-        df_base_vs_pure[["protein", "occ_a", "base_vs_1occA", "base_vs_1occB"]],
+        _corr_df[["protein", "occ_a", "base_vs_1occA", "base_vs_1occB"]],
         on=["protein", "occ_a"],
         how="left",
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/rscc_grid_search_analysis.py` around lines 681 - 686, The merge uses raw float key occ_a which can misalign due to precision; before calling pd.merge on ["protein","occ_a"], normalize/quantize occ_a in both _export_df and df_base_vs_pure (e.g., round to a consistent number of decimals or format/astype string) so the keys match reliably, then perform the pd.merge to bring in base_vs_1occA and base_vs_1occB. Ensure you update the same column name occ_a in both dataframes (or add a normalized key like occ_a_key) so pd.merge(_export_df, df_base_vs_pure[...], on=["protein","occ_a"]) succeeds without silent drops.
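A minimal pandas illustration of the float-key pitfall this comment describes; the frames and column names are hypothetical stand-ins for `_export_df` and `df_base_vs_pure`:

```python
import pandas as pd

# 0.1 + 0.2 != 0.3 exactly, so merging on raw float keys silently
# drops the correlation value for that row.
left = pd.DataFrame({"protein": ["6b8x"], "occ_a": [0.1 + 0.2]})
right = pd.DataFrame({"protein": ["6b8x"], "occ_a": [0.3], "base_vs_1occA": [0.95]})

naive = left.merge(right, on=["protein", "occ_a"], how="left")
assert naive["base_vs_1occA"].isna().all()  # key mismatch drops the value

# Rounding both keys to a fixed precision makes the join deterministic.
left["occ_a"] = left["occ_a"].round(2)
right["occ_a"] = right["occ_a"].round(2)
fixed = left.merge(right, on=["protein", "occ_a"], how="left")
assert fixed["base_vs_1occA"].iloc[0] == 0.95
```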
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/eval/eval_dataclasses.py`:
- Around line 55-56: The calls to occupancy_to_str (e.g., where occ_str =
occupancy_to_str(A=occupancy_a, B=1.0 - occupancy_a) that feed map_path =
self.base_map_dir / self.map_pattern.format(occ_str=occ_str)) can raise
ValueError and must be handled locally to preserve the declared Path | None
return; wrap the occupancy_to_str call in a try/except that catches ValueError
and returns/assigns None for map_path (or otherwise yields the Path | None
fallback) instead of allowing the exception to propagate, updating both sites
(the occ_str usage around map_path and the similar usage at lines ~78-79) to
follow this pattern.
---
Outside diff comments:
In `@src/sampleworks/eval/rscc_grid_search_analysis.py`:
- Around line 219-239: The aggregation fails when results is empty; modify the
logic around df and summary to guard that case: after building df from results,
check if df is empty (or if results is falsy) and either initialize df with the
expected columns ("protein", "scaler", "rscc", plus any dropped cols like
"exp_dir", "refined_cif_path", "base_map_path", "error", "protein_dir_name") or
skip the groupby and set summary to an empty DataFrame/appropriate placeholder;
ensure the prints still run without causing KeyError (use df.empty or "protein"
in df.columns to decide), and return the same tuple shape (return (df,)) so
callers get a consistent return type.
---
Nitpick comments:
In `@src/sampleworks/eval/occupancy_utils.py`:
- Around line 72-77: The loop over altloc_occupancies is using caller insertion
order which yields non-deterministic occupancy strings; iterate over a canonical
sorted order instead (e.g., iterate over sorted(altloc_occupancies.items()) or
sorted(altloc_occupancies.keys())) before applying the existing checks and
formatting so parts are appended deterministically; keep the ValueError for
negative occupancies, the rounding to two decimals, and the abs(occ) > 1e-6
check but perform them in the sorted iteration that references
altloc_occupancies and parts.
In `@src/sampleworks/eval/rscc_grid_search_analysis.py`:
- Around line 681-686: The merge uses raw float key occ_a which can misalign due
to precision; before calling pd.merge on ["protein","occ_a"], normalize/quantize
occ_a in both _export_df and df_base_vs_pure (e.g., round to a consistent number
of decimals or format/astype string) so the keys match reliably, then perform
the pd.merge to bring in base_vs_1occA and base_vs_1occB. Ensure you update the
same column name occ_a in both dataframes (or add a normalized key like
occ_a_key) so pd.merge(_export_df, df_base_vs_pure[...], on=["protein","occ_a"])
succeeds without silent drops.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/occupancy_utils.py
- src/sampleworks/eval/rscc_grid_search_analysis.py
marcuscollins left a comment
I see at least one place where we don't handle the possibility of more than two altlocs. I think we probably should deal with that, since you partially handle it in this PR.
Because there are two altloc dependencies in the LDDT eval script, I will postpone updating and merging this until that is merged
Force-pushed e0915dd to bffa7e6
Actionable comments posted: 4
♻️ Duplicate comments (2)
src/sampleworks/eval/structure_utils.py (1)
468-469: ⚠️ Potential issue | 🟠 Major: Default occupancy list is still hardcoded to two altlocs (A/B).
This fallback prevents true multi-altloc behavior and will miss valid reference combinations for proteins with 3+ altlocs.
🛠️ Proposed fix (fail fast instead of silently assuming A/B)
```diff
-    if occ_list is None:
-        occ_list = [{"A": 0.0, "B": 1.0}, {"A": 1.0, "B": 0.0}]
+    if occ_list is None:
+        raise ValueError(
+            "occ_list must be provided explicitly for multi-altloc proteins; "
+            "no safe default occupancy set exists."
+        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/structure_utils.py` around lines 468 - 469, The code currently hardcodes occ_list to two altlocs [{"A":0.0,"B":1.0},{"A":1.0,"B":0.0}] when occ_list is None, which prevents correct handling of proteins with 3+ altlocs; instead remove the A/B fallback and fail-fast: if occ_list is None raise a ValueError (or TypeError) explaining that occ_list must be provided (or compute it dynamically from the structure's altloc identifiers if that logic exists), and update callers to pass a generated occ_list; specifically change the occ_list default handling (the occ_list assignment block) to raise an error or call a utility that enumerates occupancy combinations from the structure's altloc set rather than assuming "A"/"B".
scripts/eval/lddt_evaluation_script.py (1)
223-225: ⚠️ Potential issue | 🟠 Major: Preloading is still hardcoded to two altlocs (A/B).
This loop cannot represent experiments with 3+ altloc occupancies, so generalized occupancy experiments will not be preloaded correctly.
🛠️ Suggested direction
```diff
-    for occ in args.occupancies:
-        altloc_occ = {"A": occ, "B": 1.0 - occ}
-        occ_key = tuple(sorted(altloc_occ.items()))
+    protein_occ_maps = {
+        exp.occ_key: dict(exp.occ_key)
+        for exp in all_experiments
+        if exp.protein.lower() == protein_key.lower()
+    }
+    for occ_key, altloc_occ in protein_occ_maps.items():
```
🧹 Nitpick comments (1)
scripts/eval/lddt_evaluation_script.py (1)
226-229: Load each reference structure once per occupancy, not once per selection.
`get_reference_atomarraystack()` is inside the selection loop, causing repeated disk loads of the same file.
♻️ Proposed refactor
```diff
-    for sel in protein_config.selection:
-        ref_path, reference_proteins = get_reference_atomarraystack(
-            protein_config, altloc_occ
-        )
-        if reference_proteins is None:
-            logger.warning(
-                f"Could not find ref structure for {protein_key} "
-                f"and occupancies {altloc_occ}"
-            )
-            continue
+    ref_path, reference_proteins = get_reference_atomarraystack(
+        protein_config, altloc_occ
+    )
+    if reference_proteins is None:
+        logger.warning(
+            f"Could not find ref structure for {protein_key} "
+            f"and occupancies {altloc_occ}"
+        )
+        continue
+    for sel in protein_config.selection:
         try:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lddt_evaluation_script.py` around lines 226 - 229, get_reference_atomarraystack() is being called inside the loop over protein_config.selection causing repeated disk loads; move the call out of the selection loop so each reference structure is loaded once per occupancy (altloc_occ) and reused for all selections—e.g., compute ref_path and reference_proteins once before iterating over protein_config.selection (or cache them keyed by altloc_occ) and then use those variables inside the loop that processes each sel.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/lddt_evaluation_script.py`:
- Around line 281-283: The cache lookup uses _exp.protein (original casing)
instead of the normalized protein variable, causing misses; update the
atom_array_key construction to use the normalized protein (the variable named
protein) together with _exp.occ_key when checking/reference_atom_arrays and
anywhere atom_array_key is created so preloaded references match the normalized
keys.
In `@src/sampleworks/eval/grid_search_eval_utils.py`:
- Around line 102-103: The current validation only checks for an empty
altloc_occupancies but allows mappings with all-zero occupancies; update the
condition around altloc_occupancies (the variable used in the if) to also reject
mappings where every value is zero by adding a check such as not any(v for v in
altloc_occupancies.values()) or all(v == 0.0 for v in
altloc_occupancies.values()); keep the existing checks for model ==
StructurePredictor.BOLTZ_2 and method is None unchanged (i.e., combine the new
all-zero check with the existing or clauses so all-zero mappings are treated as
invalid).
In `@src/sampleworks/eval/occupancy_utils.py`:
- Around line 7-9: The docstring and implementation disagree: the code currently
extracts the protein name with dir_name.split("_")[0] but the docstring claims
it should be "the first underscore-delimited token that does not match the
occupancy pattern." Update the implementation in occupancy_utils.py to actually
find the first token that does not match the occupancy regex (use the same
occupancy matching logic used elsewhere in this module) instead of always taking
the zeroth token, or alternatively change the docstring to state that the
protein name is simply the first underscore-delimited token; reference the
existing dir_name.split("_")[0] usage and the occupancy matching logic in this
module when making the change.
- Line 79: The lambda parameter name `l` in the sorted call over
`altloc_occupancies` triggers Ruff E741; update the key function in the loop
(the sorted(..., key=lambda l: str(l).upper())) to use a non-ambiguous parameter
name such as `label` or `item` (e.g., key=lambda label: str(label).upper()),
keeping the same logic and the loop header `for label in
sorted(altloc_occupancies, ...)` intact.
---
Duplicate comments:
In `@src/sampleworks/eval/structure_utils.py`:
- Around line 468-469: The code currently hardcodes occ_list to two altlocs
[{"A":0.0,"B":1.0},{"A":1.0,"B":0.0}] when occ_list is None, which prevents
correct handling of proteins with 3+ altlocs; instead remove the A/B fallback
and fail-fast: if occ_list is None raise a ValueError (or TypeError) explaining
that occ_list must be provided (or compute it dynamically from the structure's
altloc identifiers if that logic exists), and update callers to pass a generated
occ_list; specifically change the occ_list default handling (the occ_list
assignment block) to raise an error or call a utility that enumerates occupancy
combinations from the structure's altloc set rather than assuming "A"/"B".
---
Nitpick comments:
In `@scripts/eval/lddt_evaluation_script.py`:
- Around line 226-229: get_reference_atomarraystack() is being called inside the
loop over protein_config.selection causing repeated disk loads; move the call
out of the selection loop so each reference structure is loaded once per
occupancy (altloc_occ) and reused for all selections—e.g., compute ref_path and
reference_proteins once before iterating over protein_config.selection (or cache
them keyed by altloc_occ) and then use those variables inside the loop that
processes each sel.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: db0c9f95-2c33-4a5b-bc5f-693de9f973a1
📒 Files selected for processing (7)
- scripts/eval/lddt_evaluation_script.py
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/occupancy_utils.py
- src/sampleworks/eval/rscc_grid_search_analysis.py
- src/sampleworks/eval/structure_utils.py
- tests/eval/test_structure_utils.py
💤 Files with no reviewable changes (1)
- src/sampleworks/eval/rscc_grid_search_analysis.py
```python
logger.info("Pre-loading reference structures for each protein for coordinate extraction")
reference_atom_arrays = {}
for protein_key, protein_config in protein_configs.items():
    for occ, sel in itertools.product(args.occupancies, protein_config.selection):
```
Is there a reason not to use itertools.product? I used it to facilitate possible future parallelization with joblib.
```diff
-occ_str = occupancy_to_str(occupancy_a, use_6b8x_format=self.protein == "6b8x")
+try:
+    occ_str = occupancy_to_str(**altloc_occupancies)
+except ValueError as e:
```
when does this happen? Can we prevent it with input validation?
this happens when the altlocs have negative or >1 occupancy
This is where input validation can really help. Maybe make an issue to add this feature.
```diff
@@ -242,7 +242,7 @@ class TestGetReferenceAtomArrayStack:
     def test_returns_none_when_not_found(self, mock_protein_config):
```
(Not specific to this line): can you add tests with more than two altlocs?
Force-pushed 23017f4 to 8bf46d7
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/sampleworks/eval/occupancy_utils.py (1)
7-9: ⚠️ Potential issue | 🟡 Minor: Docstring and parser behavior still diverge.
Line 34 always takes the first underscore token, but lines 7-9 document selecting the first token that is not an occupancy token.
🛠️ Proposed fix
```diff
-    protein = dir_name.split("_")[0].lower()
+    tokens = dir_name.split("_")
+    occ_token = re.compile(r"^\d+\.?\d*occ[A-Za-z]$", re.IGNORECASE)
+    protein = next((tok for tok in tokens if not occ_token.match(tok)), tokens[0]).lower()
```
Also applies to: 34-34
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/occupancy_utils.py` around lines 7 - 9, The parser that extracts protein name and occupancy tokens from dir_name currently always selects the first underscore-delimited token as the protein name; update the logic so it instead scans dir_name.split("_") and selects the first token that does NOT match the occupancy pattern (the "{value}occ{label}" pattern), using the same regex or helper (e.g., is_occupancy_token) you use elsewhere to detect occupancy tokens, and keep the rest of tokens for occupancy parsing; also update the function docstring to match this behavior so it states "first underscore-delimited token that does not match the occupancy pattern."
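The behavior the docstring promises can be sketched in isolation. `protein_from_dir_name` is an illustrative name, and the `OCC_TOKEN` regex follows the pattern in the proposed fix above:

```python
import re

# Take the first underscore-delimited token that is NOT an occupancy
# token such as "0.7occA"; fall back to the first token otherwise.
OCC_TOKEN = re.compile(r"^\d+\.?\d*occ[A-Za-z]$", re.IGNORECASE)

def protein_from_dir_name(dir_name: str) -> str:
    tokens = dir_name.split("_")
    return next((t for t in tokens if not OCC_TOKEN.match(t)), tokens[0]).lower()

print(protein_from_dir_name("0.7occA_0.3occB_6B8X"))  # -> 6b8x
```

With the naive `split("_")[0]` approach, the same directory name would yield `0.7occa` instead of the protein name.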
🧹 Nitpick comments (1)
scripts/eval/lddt_evaluation_script.py (1)
225-228: Load each occupancy-specific reference once, then fan out by selection.
`get_reference_atomarraystack()` is currently inside the selection loop, so the same reference file is read repeatedly for each selection.
♻️ Proposed refactor
```diff
 for occ in args.occupancies:
     altloc_occ = {"A": occ, "B": 1.0 - occ}
     occ_key = tuple(sorted(altloc_occ.items()))
-    for sel in protein_config.selection:
-        ref_path, reference_proteins = get_reference_atomarraystack(
-            protein_config, altloc_occ
-        )
-        if reference_proteins is None:
-            logger.warning(
-                f"Could not find ref structure for {protein_key} "
-                f"and occupancies {altloc_occ}"
-            )
-            continue
+    ref_path, reference_proteins = get_reference_atomarraystack(
+        protein_config, altloc_occ
+    )
+    if reference_proteins is None:
+        logger.warning(
+            f"Could not find ref structure for {protein_key} "
+            f"and occupancies {altloc_occ}"
+        )
+        continue
+    for sel in protein_config.selection:
         try:
             logger.info(
                 f"Loaded ref structure for {protein_key} "
                 f"and occupancies {altloc_occ}: {ref_path}"
             )
```
Also applies to: 246-249
🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lddt_evaluation_script.py` around lines 225 - 228, The reference loader get_reference_atomarraystack(...) is being called inside the per-selection loop (for sel in protein_config.selection), causing the same file to be read multiple times; move the call out of that loop so you call get_reference_atomarraystack(protein_config, altloc_occ) once per occupancy and reuse the returned ref_path and reference_proteins for each sel, and apply the same change to the other duplicated block (the similar call at lines 246-249) so selection iterations only fan out from the already-loaded reference instead of reloading it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/lddt_evaluation_script.py`:
- Around line 223-224: The cache key construction keeps zero-valued
alternate-location entries which can differ from experiment keys parsed from
filenames; update the altloc_occ -> occ_key creation so you drop any entries
with zero (or negligibly small) occupancy before building occ_key (e.g., build
altloc_occ with only occ > 0 then set occ_key =
tuple(sorted(altloc_occ.items()))); apply the same canonicalization in the other
key-creation spot that mirrors this logic so the lookup that uses occ_key will
match preloaded data.
In `@src/sampleworks/eval/occupancy_utils.py`:
- Around line 79-97: The sum check on altloc_occupancies happens on the raw
input but later you round occupancies for output (in the loop that builds
parts), which can produce a filename whose rounded occupancies sum to >1; fix by
canonicalizing occupancies before validation and output: create a canonical_occ
dict (e.g., canonical_occ = {label: round(float(val), 2) for label, val in
altloc_occupancies.items()}) and use canonical_occ for the sum check, the
per-value 0..1 range check, and when building parts (replace uses of occ =
round(...) with occ = canonical_occ[label]), keeping the small tolerance check
(abs(occ) > 1e-6) as needed.
---
Duplicate comments:
In `@src/sampleworks/eval/occupancy_utils.py`:
- Around line 7-9: The parser that extracts protein name and occupancy tokens
from dir_name currently always selects the first underscore-delimited token as
the protein name; update the logic so it instead scans dir_name.split("_") and
selects the first token that does NOT match the occupancy pattern (the
"{value}occ{label}" pattern), using the same regex or helper (e.g.,
is_occupancy_token) you use elsewhere to detect occupancy tokens, and keep the
rest of tokens for occupancy parsing; also update the function docstring to
match this behavior so it states "first underscore-delimited token that does not
match the occupancy pattern."
---
Nitpick comments:
In `@scripts/eval/lddt_evaluation_script.py`:
- Around line 225-228: The reference loader get_reference_atomarraystack(...) is
being called inside the per-selection loop (for sel in
protein_config.selection), causing the same file to be read multiple times; move
the call out of that loop so you call
get_reference_atomarraystack(protein_config, altloc_occ) once per occupancy and
reuse the returned ref_path and reference_proteins for each sel, and apply the
same change to the other duplicated block (the similar call at lines 246-249) so
selection iterations only fan out from the already-loaded reference instead of
reloading it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aa4ef3b6-db08-4207-9969-a4060d9c77ab
📒 Files selected for processing (7)
- scripts/eval/lddt_evaluation_script.py
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/occupancy_utils.py
- src/sampleworks/eval/rscc_grid_search_analysis.py
- src/sampleworks/eval/structure_utils.py
- tests/eval/test_structure_utils.py
💤 Files with no reviewable changes (1)
- src/sampleworks/eval/rscc_grid_search_analysis.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/sampleworks/eval/grid_search_eval_utils.py
Force-pushed 8bf46d7 to 314483e
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/sampleworks/eval/structure_utils.py (1)
467-468: Default `occ_list` assumes two altlocs (A and B).
The default value `[{"A": 0.0, "B": 1.0}, {"A": 1.0, "B": 0.0}]` is consistent with the hardcoded pattern in `lddt_evaluation_script.py`. Per issue #138 mentioned in past comments, this limitation is tracked for future enhancement.
Consider extracting this default to a module-level constant (e.g., `DEFAULT_TWO_ALTLOC_OCCUPANCIES`) shared across files to ensure consistency and simplify future updates when supporting more altlocs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/structure_utils.py` around lines 467 - 468, The hardcoded default occ_list ( [{"A": 0.0, "B": 1.0}, {"A": 1.0, "B": 0.0}] ) should be extracted to a module-level constant named DEFAULT_TWO_ALTLOC_OCCUPANCIES and referenced instead of being inlined where occ_list is set (e.g., in structure_utils.py where occ_list is initialized) so the same constant can be imported and used by other modules (like lddt_evaluation_script.py) to guarantee consistency; add the constant near top-level of the module, replace the inline literal with DEFAULT_TWO_ALTLOC_OCCUPANCIES, and update other files to import and use that constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/eval/eval_dataclasses.py`:
- Line 16: The code still accesses the removed Experiment attribute _exp.occ_a;
update rscc_grid_search_script.py to read the renamed field
_exp.altloc_occupancies everywhere you reference occ_a (right now at the several
occurrences noted), ensuring any downstream logic expecting a mapping uses the
dict[str,float] shape of altloc_occupancies; search for _exp.occ_a and replace
with _exp.altloc_occupancies and verify any unpacking, defaulting, or key access
matches the new dict structure (adjust variable names or iteration if needed).
---
Nitpick comments:
In `@src/sampleworks/eval/structure_utils.py`:
- Around line 467-468: The hardcoded default occ_list ( [{"A": 0.0, "B": 1.0},
{"A": 1.0, "B": 0.0}] ) should be extracted to a module-level constant named
DEFAULT_TWO_ALTLOC_OCCUPANCIES and referenced instead of being inlined where
occ_list is set (e.g., in structure_utils.py where occ_list is initialized) so
the same constant can be imported and used by other modules (like
lddt_evaluation_script.py) to guarantee consistency; add the constant near
top-level of the module, replace the inline literal with
DEFAULT_TWO_ALTLOC_OCCUPANCIES, and update other files to import and use that
constant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 60918848-18d8-4441-8fd0-6410247f1db0
📒 Files selected for processing (7)
- scripts/eval/lddt_evaluation_script.py
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/occupancy_utils.py
- src/sampleworks/eval/rscc_grid_search_analysis.py
- src/sampleworks/eval/structure_utils.py
- tests/eval/test_structure_utils.py
💤 Files with no reviewable changes (1)
- src/sampleworks/eval/rscc_grid_search_analysis.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/eval/test_structure_utils.py
- src/sampleworks/eval/grid_search_eval_utils.py
…of altlocs and associated occupancies
Force-pushed 314483e to 6b76fd7
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/sampleworks/eval/occupancy_utils.py (1)
79-97: ⚠️ Potential issue | 🟠 Major: Canonicalize occupancies before validating and formatting them.
This still validates the raw float inputs and only rounds inside the output loop. Inputs like `A=0.335, B=0.335, C=0.33` pass the sum check, but serialize to `0.34occA_0.34occB_0.33occC`, which now represents a total occupancy of 1.01. Please normalize once first, then use that canonical mapping for both validation and string generation.
Proposed fix
```diff
-    if sum(altloc_occupancies.values()) > 1:
-        raise ValueError(
-            "Altloc occupancies cannot sum to more than 1, currently "
-            f"they sum to {sum(altloc_occupancies.values())}: {altloc_occupancies}"
-        )
-    if any(occ < 0 or occ > 1 for occ in altloc_occupancies.values()):
-        raise ValueError(
-            "Altloc occupancies must be between 0 and 1, currently "
-            f"they don't: {altloc_occupancies}"
-        )
-    parts = []
-    for label in sorted(altloc_occupancies, key=lambda label_name: str(label_name).upper()):
-        occ = altloc_occupancies[label]
-        occ = round(occ, 2)
-        if abs(occ) > 1e-6:
-            label_str = str(label).upper()
-            parts.append(f"{occ}occ{label_str}")
+    canonical_occupancies: dict[str, float] = {}
+    for label, occ in altloc_occupancies.items():
+        canonical_occ = round(float(occ), 2)
+        if canonical_occ < 0 or canonical_occ > 1:
+            raise ValueError(
+                "Altloc occupancies must be between 0 and 1, currently "
+                f"they don't: {altloc_occupancies}"
+            )
+        if abs(canonical_occ) > 1e-6:
+            canonical_occupancies[str(label).upper()] = canonical_occ
+
+    if not canonical_occupancies:
+        raise ValueError("At least one altloc must have non-zero occupancy")
+
+    total_occ = sum(canonical_occupancies.values())
+    if total_occ > 1 + 1e-6:
+        raise ValueError(
+            "Altloc occupancies cannot sum to more than 1, currently "
+            f"they sum to {total_occ}: {canonical_occupancies}"
+        )
+
+    parts = []
+    for label in sorted(canonical_occupancies):
+        parts.append(f"{canonical_occupancies[label]}occ{label}")
-    if not parts:
-        raise ValueError("At least one altloc must have non-zero occupancy")
     return "_".join(parts)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/eval/occupancy_utils.py` around lines 79 - 97, the code must canonicalize (normalize) altloc_occupancies before validating and formatting: compute total = sum(altloc_occupancies.values()), if total == 0 raise, then create a normalized mapping (e.g., norm = {label: val/total for label, val in altloc_occupancies.items()}) and use that mapping for the subsequent checks (sum, bounds check using norm values) and for generating parts; keep rounding (round(occ, 2)) only when producing the output strings (the parts list) so the validation uses canonical values and the serialized labels reflect the normalized occupancies; reference the existing variables altloc_occupancies, occ, parts and label when making these changes.

scripts/eval/lddt_evaluation_script.py (1)
222-224: ⚠️ Potential issue | 🟠 Major

Canonicalize the occupancy cache key, and stop hard-coding an A/B pair here.

This preload path still stores references under raw `{"A": occ, "B": 1.0 - occ}` keys. At the 0.0/1.0 boundaries that produces keys like `(("A", 0.0), ("B", 1.0))`, but experiments parsed from filenames like `1.0occB` use `_exp.occ_key == (("B", 1.0),)`, so the later lookup misses a valid cached reference. This block also still only models two altlocs, which means the generalized 3+ altloc path is still incomplete here.

Also applies to: 280-281
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lddt_evaluation_script.py` around lines 222 - 224, The occupancy cache key currently hard-codes an A/B pair (altloc_occ = {"A": occ, "B": 1.0 - occ}) and creates occ_key without normalizing absent/zero entries, causing mismatches with keys like (("B", 1.0),). Change the construction so you derive the altloc->occupancy mapping from the actual altloc labels present (or the keys of the occupancy dict) rather than assuming "A"/"B", then create occ_key as a canonical, sorted tuple of (altloc, occ) pairs filtering out entries with zero occupancy (e.g., occ > 0) so keys like (("B", 1.0),) and (("A",0.0),("B",1.0)) canonicalize to the same representation; apply the same fix to the other occurrence of this logic around the later block (lines ~280-281).
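The canonical key the review describes — sorted, zero-occupancy entries dropped — can be sketched with a hypothetical helper (illustrative only, not the script's actual code; the name `occ_cache_key` and the tolerance are assumptions):

```python
def occ_cache_key(altloc_occ: dict[str, float], tol: float = 1e-6) -> tuple:
    """Canonical cache key: drop ~zero occupancies, upper-case and sort labels."""
    return tuple(sorted(
        (str(label).upper(), round(float(occ), 2))
        for label, occ in altloc_occ.items()
        if occ > tol
    ))

# {"A": 0.0, "B": 1.0} and a bare {"B": 1.0} now map to the same key,
# so references preloaded at the 0.0/1.0 boundary are found again later.
```

With such a key, the preload side and the filename-parsing side agree regardless of how many altlocs each one happens to mention explicitly.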
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8ea7be87-1598-484e-ba57-8d92bf40b0e4
📒 Files selected for processing (8)
- scripts/eval/lddt_evaluation_script.py
- scripts/eval/rscc_grid_search_script.py
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/occupancy_utils.py
- src/sampleworks/eval/rscc_grid_search_analysis.py
- src/sampleworks/eval/structure_utils.py
- tests/eval/test_structure_utils.py
💤 Files with no reviewable changes (1)
- src/sampleworks/eval/rscc_grid_search_analysis.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/eval_dataclasses.py
```python
if occ_list is None:
    occ_list = [{"A": 0.0, "B": 1.0}, {"A": 1.0, "B": 0.0}]
```
The default occupancy sweep still hard-codes exactly two altlocs.
get_reference_structure_coords() is now part of the generalized altloc path, but its default still only loads {"A": 0.0, "B": 1.0} and {"A": 1.0, "B": 0.0}. scripts/eval/rscc_grid_search_script.py uses this default path, so any protein with a C/D/… altloc never contributes to the reference mask and the RSCC region is built from incomplete coordinates. Please either require occ_list explicitly here or derive it from the available reference occupancies instead of assuming two endpoints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/eval/structure_utils.py` around lines 464 - 465,
get_reference_structure_coords currently defaults occ_list to two hard-coded
endpoints [{"A":0.0,"B":1.0}, {"A":1.0,"B":0.0}] which misses additional
altlocs; change the function to either require occ_list be passed or, better,
derive occ_list from the reference structure’s altloc set (e.g., inspect the
reference occupancy keys or altloc identifiers present and generate a sweep of
occupancy dicts covering all present altlocs) so that all altlocs (A/B/C/...)
are considered when building the reference mask used by scripts like
rscc_grid_search_script.py; update the get_reference_structure_coords signature
and callers to ensure occ_list is provided or autogenerated from the structure’s
available occupancies.
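Deriving the sweep from the structure's own altloc labels, as the prompt suggests, could look like the following sketch (the helper name and signature are hypothetical — it just builds one "endpoint" occupancy dict per label):

```python
def occ_sweep_from_altlocs(altlocs) -> list[dict[str, float]]:
    """One occupancy dict per altloc: that altloc at 1.0, all others at 0.0."""
    # De-duplicate, drop blank altloc ids, and sort for deterministic order.
    labels = sorted({str(a).upper() for a in altlocs if str(a).strip()})
    return [
        {label: (1.0 if label == hot else 0.0) for label in labels}
        for hot in labels
    ]
```

For a two-altloc reference this reproduces the current hard-coded default; for an A/B/C structure it adds the third endpoint automatically instead of silently dropping C from the reference mask.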
leaving this one for now.
@marcuscollins, understood — I'll leave this one for now. Feel free to revisit the hard-coded two-altloc default in get_reference_structure_coords in a future PR when it becomes relevant.
(╯°□°)╯
🧠 Learnings used
Learnt from: CR
Repo: diff-use/sampleworks PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-04T20:19:01.356Z
Learning: Applies to src/sampleworks/**/*.py : Use AtomReconciler to handle atom count mismatches between model and structure representations
Learnt from: k-chrispens
Repo: diff-use/sampleworks PR: 73
File: src/sampleworks/core/samplers/protocol.py:1-18
Timestamp: 2026-02-02T19:13:51.594Z
Learning: In Python files across the repo, do not suggest or add inline or file-level suppressions for Ruff F722 (jaxtyping dimension strings). Instead rely on a global ignore configured in pyproject.toml (as done by the sampleworks project). Verify that F722 is included in the global ignore, and avoid recommending per-file # ruff: noqa: F722 directives unless explicitly documented as an exception.
Learnt from: marcuscollins
Repo: diff-use/sampleworks PR: 132
File: src/sampleworks/utils/guidance_script_utils.py:585-586
Timestamp: 2026-03-05T16:30:40.514Z
Learning: In Python code, if enums subclassing StrEnum are used (e.g., GuidanceType, StructurePredictor), they serialize to plain strings with json.dump and pickle without special handling. Do not flag these as non-serializable in code reviews. Treat StrEnum values as strings for JSON serialization and ensure tests cover that behavior; no extra pickle handling needed.
marcuscollins
left a comment
There are a couple CodeRabbit comments that you should probably double check, although I don't think they're big issues. Otherwise looks okay. I will leave it to you to judge and merge when ready.
```
Returns
-------
tuple[str, dict[str, float]]
    ``(protein, altloc_occupancies)`` where *altloc_occupancies* maps
```
Oh, I discovered that and meant to remove this comment. Sorry.
```python
if occ_list is None:
    occ_list = [{"A": 0.0, "B": 1.0}, {"A": 1.0, "B": 0.0}]
```
leaving this one for now.
If we're going to run this script ever again, it would be worth having the occupancy_to_str function generalized to more altlocs, so I did that and updated the scripts. This will now not be compatible with the old 6B8X occupancy strings, which we have abandoned with the new grid search set anyway.
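The generalized occupancy string can be exercised with a small sketch. The function below is illustrative only — it follows the canonicalize-first ordering CodeRabbit proposed, not necessarily the merged implementation:

```python
def occupancy_to_str(altloc_occupancies: dict[str, float]) -> str:
    """Serialize altloc occupancies as e.g. '0.3occA_0.7occB' (sketch)."""
    # Round once up front so validation and the output string agree.
    canonical = {
        str(label).upper(): round(float(occ), 2)
        for label, occ in altloc_occupancies.items()
    }
    # Zero-occupancy altlocs are omitted from the filename string.
    canonical = {k: v for k, v in canonical.items() if abs(v) > 1e-6}
    if not canonical:
        raise ValueError("At least one altloc must have non-zero occupancy")
    if any(v < 0 or v > 1 for v in canonical.values()):
        raise ValueError(f"Occupancies must lie in [0, 1]: {canonical}")
    total = sum(canonical.values())
    if total > 1 + 1e-6:
        raise ValueError(f"Occupancies sum to {total}: {canonical}")
    return "_".join(f"{canonical[k]}occ{k}" for k in sorted(canonical))

# Any number of altlocs works, unlike the abandoned two-altloc 6B8X strings.
```

Because the same rounded mapping feeds both the sum check and the string, inputs whose rounded values exceed 1.0 are rejected instead of silently serializing to an impossible total.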
Summary by CodeRabbit
Breaking Changes
Refactor
Bug Fixes
Tests