eval: script for classifying conformational changes #223
Conversation
⚠️ Warning: Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 57 seconds. We recommend spacing out commits to avoid hitting the rate limit. CodeRabbit enforces hourly rate limits for each developer per organization; paid plans have higher limits than the trial, open-source, and free plans, and further reviews are re-allowed after a brief timeout. See the FAQ for details.
📝 Walkthrough
Added a new CLI evaluation script to classify contiguous altloc spans in CIF structures by backbone lDDT and span size, a CIF-path resolver helper, and extended `.gitignore` patterns.
Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI
    participant CSV as Selections CSV
    participant Resolver as resolve_cif_path
    participant Loader as Structure Loader
    participant LDDT as AllAtomLDDT
    participant Output as CSV Writer
    CLI->>CSV: read selection rows
    loop for each row/selection
        CLI->>Resolver: resolve CIF path (row, --cif-root)
        Resolver-->>CLI: cif path
        CLI->>Loader: load CIF, extract altloc atom arrays
        Loader-->>CLI: altloc arrays & metadata
        CLI->>LDDT: build pairwise arrays & compute per-residue backbone lDDT
        LDDT-->>CLI: per-pair per-residue lDDT scores
        CLI->>CLI: classify span (side_chain_only / domain_shift / small_loop / large_loop)
        CLI->>Output: write classification row (incl. pair_lddts JSON)
    end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Pull request overview
Adds an evaluation-time script to classify contiguous altloc (alternate location) regions into conformational-change bins for downstream analysis, based on backbone altloc presence, span length, and pairwise backbone lDDT between altlocs.
Changes:
- Introduce `scripts/eval/classify_altloc_regions.py` to load structures with altlocs, score pairwise altloc comparisons, and emit a per-span classification CSV.
- Update `.gitignore` to ignore `initial_dataset_40*` directories and additional artifact patterns.
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/eval/classify_altloc_regions.py | New CLI script to classify altloc spans into side-chain-only / loop-size / domain-shift bins using pairwise backbone lDDT. |
| .gitignore | Expands ignore patterns for large datasets and artifacts (including CSVs). |
Actionable comments posted: 4
🧹 Nitpick comments (4)
scripts/eval/classify_altloc_regions.py (4)
356-360: Shadowing the outer `row: pd.Series` with the per-span `row: dict` is confusing.

`_process_structure` takes `row: pd.Series` and then rebinds `row` on line 358 inside the loop. It happens to be harmless today (the outer `row` isn't referenced after the loop), but the name collision makes the function harder to read, and one accidental reintroduction of `row["protein"]` below the loop would silently pick up the last span's dict. Rename one of them, e.g. `span_row, covered = out`.

♻️ Proposed rename:

-        row, covered = out
-        rows.append(row)
+        span_row, covered = out
+        rows.append(span_row)
         classified_res_ids.update(covered)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/classify_altloc_regions.py` around lines 356 - 360, The loop inside _process_structure shadows the outer parameter row: pd.Series by rebinding row to a per-span dict; rename the inner binding to avoid confusion and potential bugs (e.g., use span_row, covered = out) and update subsequent references in that loop to span_row so the outer row variable remains distinct; ensure variables like classified_res_ids and rows continue to receive the correct values (append span_row to rows or adjust as intended) and run tests to verify behavior unchanged.
134-162: Tighten the pair-array typing.

`dict[tuple[str, str], tuple[object, object]]` loses all information — every caller immediately uses these as `AtomArray`s. Annotating as `tuple[AtomArray, AtomArray]` (with `from biotite.structure import AtomArray`) makes `_mean_residue_lddt_for_pair`'s signature honest too, and lets the type checker catch the `None` branch on line 286 rather than relying on the runtime guard inside `_mean_residue_lddt_for_pair`.

🤖 Prompt for AI Agents
62-63: Import-time `sys.path.insert` is a hidden side effect; prefer a relative/package import.

Mutating `sys.path` at module import is exactly the kind of side effect the coding guidelines ask to be explicit about, and it also makes the `lddt_evaluation_script` import order-sensitive (anything earlier on `sys.path` with the same name wins). Since both scripts live under `scripts/eval/`, either:

- extract `translate_selection` into `src/sampleworks/...` and import it normally, or
- invoke via `python -m scripts.eval.classify_altloc_regions` with an `__init__.py` so a proper relative import works.

As per coding guidelines: "Only comment when truly necessary. ALWAYS annotate complex array shapes and note side effects."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/classify_altloc_regions.py` around lines 62 - 63, Remove the import-time sys.path.insert side-effect and replace it with a proper package-relative import or move of the utility: either (A) make scripts/eval a package (add __init__.py) and use a relative import from within classify_altloc_regions to import translate_selection from lddt_evaluation_script (e.g., from .lddt_evaluation_script import translate_selection), or (B) refactor translate_selection into the project src package (e.g., src.sampleworks...) and import it with an absolute import (e.g., from sampleworks.lddt_evaluation_script import translate_selection); update classify_altloc_regions to remove sys.path mutation and adjust any invocation (use python -m scripts.eval.classify_altloc_regions if relying on package execution).
339-344: Fragile heuristic for skipping the upstream union selection.

`if " or " in selection_str: continue` relies on a formatting contract in `find_altloc_selections.py` that isn't enforced anywhere — if that script ever emits individual spans with embedded `or` (e.g., once it grows residue-range syntax or multi-chain clauses), those spans will be silently dropped. Safer options:

- have `find_altloc_selections.py` emit the union under a distinct column (e.g., `union_selection`) and not splice it into `selection`, or
- tag the union with a sentinel prefix like `UNION:` that this script can test for explicitly.

At minimum, it's worth a comment here pointing at the exact contract in `find_altloc_selections.py` so future edits don't diverge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/classify_altloc_regions.py` around lines 339 - 344, The current heuristic in the loop that skips union selections by checking if " or " appears in selection_str is fragile; update the logic to look for an explicit sentinel (e.g., prefix "UNION:") that find_altloc_selections.py will emit for the combined union selection and skip only when selection_str.startswith("UNION:"), and add a short colocated comment referencing find_altloc_selections.py (and the sentinel contract) so future changes don't reintroduce the fragile " or " check; if you cannot change the producer, add a conservative fallback comment explaining the existing format dependency and assert/raise if a selection unexpectedly contains " or " but is not prefixed to make the contract explicit.
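The sentinel option can be sketched in a few lines. This is illustrative only: `UNION_PREFIX` and the matching producer-side change in `find_altloc_selections.py` are assumptions, not existing code.

```python
UNION_PREFIX = "UNION:"


def is_union_selection(selection_str: str) -> bool:
    """True if this row carries the combined union selection emitted upstream.

    Relies on an explicit sentinel prefix (an assumed contract with
    find_altloc_selections.py) instead of the fragile `" or " in ...` check,
    so individual spans that legitimately contain `or` are not dropped.
    """
    return selection_str.startswith(UNION_PREFIX)
```

A span-level selection such as `chain_id == 'A' and (res_id == 5 or res_id == 6)` would then pass through untouched, while only rows explicitly tagged by the producer are skipped.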
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Around line 227-230: The blanket "*.csv" ignore will cause tracked CSVs to be
ignored on adds; update .gitignore to add a negation for the specific tracked
CSV (protein_configs.csv) so that it remains committable despite the "*.csv"
rule—add a "!<path-to>/protein_configs.csv" negation entry immediately after the
"*.csv" pattern or remove the blanket "*.csv" and instead list only undesired
CSV patterns.
In `@scripts/eval/classify_altloc_regions.py`:
- Around line 292-298: The current RuntimeError raised when all values in
pair_lddts are non-finite aborts the whole batch; change _process_structure so
that instead of raising, it logs an error (including protein, selection_str,
backbone_altloc_res_ids and pair_lddts) and either returns a sentinel result
that marks the span classification as "unknown" or simply skips emitting that
span; optionally add a --strict flag checked in main to preserve the current
RuntimeError behavior when requested. Ensure references: pair_lddts,
finite_vals, _process_structure, main, selection_str, and
backbone_altloc_res_ids are updated so downstream CSV writing handles the
sentinel/unknown case gracefully.
- Around line 231-303: The code silently continues when sel_chain_ids contains
multiple chains which collapses residues across chains and miscomputes
backbone_altloc_res_ids and lDDT (affecting sel_chain_ids,
backbone_altloc_res_ids, covered_altloc_residues, _max_contiguous_run and
_mean_residue_lddt_for_pair); fix by making multi-chain selections fail fast:
replace the current warning branch (len(sel_chain_ids) != 1) with an explicit
error (raise ValueError or RuntimeError) that includes sel_chain_ids and
selection_str so callers must split the selection upstream, ensuring downstream
logic always operates on a single chain and that backbone_altloc_res_ids,
covered_altloc_residues and lDDT computations are correct.
- Around line 306-311: Add NumPy-style docstrings to the functions missing them:
_process_structure and main (and also update the helpers _resolve_cif_path,
_build_pairwise_altloc_arrays, _mean_residue_lddt_for_pair, and
_classify_selection which currently have only prose). For each function include
a one-line summary, a Parameters section listing each argument name, type and
brief description, a Returns section describing the return type and structure
(e.g., list[dict] for _process_structure), and any Raises or Notes if relevant;
keep wording consistent with existing docstrings in the file and ensure
parameter names exactly match the function signatures.
---
Nitpick comments:
In `@scripts/eval/classify_altloc_regions.py`:
- Around line 356-360: The loop inside _process_structure shadows the outer
parameter row: pd.Series by rebinding row to a per-span dict; rename the inner
binding to avoid confusion and potential bugs (e.g., use span_row, covered =
out) and update subsequent references in that loop to span_row so the outer row
variable remains distinct; ensure variables like classified_res_ids and rows
continue to receive the correct values (append span_row to rows or adjust as
intended) and run tests to verify behavior unchanged.
- Around line 134-162: Change the loose return/type annotation on
_build_pairwise_altloc_arrays from dict[tuple[str,str], tuple[object,object]] to
dict[tuple[str,str], tuple[AtomArray, AtomArray]] and import AtomArray from
biotite.structure; ensure the local pairs variable is typed the same way. Also
update the callee signature _mean_residue_lddt_for_pair to accept AtomArray
pairs (tuple[AtomArray,AtomArray]) instead of object so the type checker will
flag None branches earlier; keep the existing runtime logic that uses
select_altloc and filter_to_common_atoms unchanged.
- Around line 62-63: Remove the import-time sys.path.insert side-effect and
replace it with a proper package-relative import or move of the utility: either
(A) make scripts/eval a package (add __init__.py) and use a relative import from
within classify_altloc_regions to import translate_selection from
lddt_evaluation_script (e.g., from .lddt_evaluation_script import
translate_selection), or (B) refactor translate_selection into the project src
package (e.g., src.sampleworks...) and import it with an absolute import (e.g.,
from sampleworks.lddt_evaluation_script import translate_selection); update
classify_altloc_regions to remove sys.path mutation and adjust any invocation
(use python -m scripts.eval.classify_altloc_regions if relying on package
execution).
- Around line 339-344: The current heuristic in the loop that skips union
selections by checking if " or " appears in selection_str is fragile; update the
logic to look for an explicit sentinel (e.g., prefix "UNION:") that
find_altloc_selections.py will emit for the combined union selection and skip
only when selection_str.startswith("UNION:"), and add a short colocated comment
referencing find_altloc_selections.py (and the sentinel contract) so future
changes don't reintroduce the fragile " or " check; if you cannot change the
producer, add a conservative fallback comment explaining the existing format
dependency and assert/raise if a selection unexpectedly contains " or " but is
not prefixed to make the contract explicit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d085183b-96a0-4a58-8a56-0109b95fa812
📒 Files selected for processing (2)
- .gitignore
- scripts/eval/classify_altloc_regions.py
a58e610 to 815ae32 (Compare)
`def _max_contiguous_run(sorted_res_ids: list[int]) -> int:`
I believe this can be vectorized. Does it get used much?
no but easy to do with claude
`def _build_pairwise_altloc_arrays(`
You can use grid_search_utils.map_altlocs_to_stack instead, if you apply selections first. Are you trying to avoid using selections? It seems like this is used to filter selections, so you could apply the selection then run map_altlocs_to_stack. It would reduce the number of special functions needed.
that will drop selections where the whole structure has 3 altlocs, but some residues only have 2. Added comment on this.
Right, so if you do the selection first, then construct the stack on just that selection, it would work, wouldn't it?
`def _build_pairwise_altloc_arrays(atom_array, altloc_ids: list[str])`
Why not automatically do this over all discovered altlocs?
Do you mean by using detect_altlocs in here?
Yes. I anticipate this is because you are using the entire atom array here, and not a subset. I have always thought about these issues by localizing to the subset first, and only after that worrying about which altlocs are present. I find it easier to think about this way, especially because I don't think there's any clear way to know whether distant altloc stretches are related to each other.
`flat = [v[0] if isinstance(v, (list, tuple, np.ndarray)) else v for v in residue_scores.values()]`
This seems weird, for a couple reasons.
- I don't like the use of .values() here--I guess now it should be ordered, but I prefer to know that order.
- Is it really the case that the value is sometimes a plain number, and other times some list-like thing? If so, I don't recall any good reason for that, and we should fix it upstream.
OTOH, if there is a good reason for 2), you're silently suppressing that behavior here, which also isn't great.
I tried to make this clearer? There is no reason the value is a list-like thing, that was overly defensive on my part.
`"pair_lddts": json.dumps({}),`
why dump this to a string?
`# Domain shift: contiguous backbone-altloc run exceeds threshold (default 50).`
50 residues seems quite big to me--what about a single helix realignment? These are the kinds of hyperparameters I get worried about.
There is one "domain shift" in the dataset that I care about right now, and it is more like 130 residues, so this successfully detects it. In the future we should definitely make this more robust.
`# Single residue backbone altlocs cannot yield a meaningful lDDT, since you need at least two`
Huh? This comment seems.... wrong. Or I don't understand what it's trying to say.
There was a problem hiding this comment.
After looking into this more, this was a misunderstanding on my part of how the lDDT we were calculating works: I expected that we were only looking within a segment, not considering the neighborhood. That is not correct; the regular lDDT calculation uses the 15 Å neighborhood around each residue.
`if n_backbone < 2:`
I had the impression from above that "small_loop" referred to a small backbone displacement between altlocs which might include some more substantial side chain altlocs as well.
Yes, it should include this as well. I think that is what it does now, if the lDDT calculation works the way I think it does.
`row["pair_lddts"] = json.dumps(pair_lddts)`
again, why dump to a string?
We can then read the json string back in when using the CSV downstream. I suppose we could also store new columns in the CSV?
If this becomes a column in a data frame that you write out to CSV, it will automatically dump a dictionary to a string:
In [1]: import pandas as pd
In [2]: data = [{"thing": 1, "other": {'blah': 2, 'huh': "yesa"}}]
In [3]: df = pd.DataFrame(data)
In [4]: df
Out[4]:
   thing                       other
0      1  {'blah': 2, 'huh': 'yesa'}
In [5]: print(df.to_csv())
,thing,other
0,1,"{'blah': 2, 'huh': 'yesa'}"
I'm not sure whether it can be coaxed to read it back in as a dictionary.
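It can be coaxed back into a dictionary, as long as the column is written with `json.dumps` rather than pandas' default stringification: the default repr form uses single quotes and is not valid JSON, while a JSON-encoded column round-trips cleanly through `read_csv` converters. A sketch (column and key names illustrative):

```python
import io
import json

import pandas as pd

# Write the per-pair lDDT dict as a JSON string column...
rows = [{"protein": "1abc", "pair_lddts": json.dumps({"A-B": 0.91, "A-C": 0.42})}]
csv_text = pd.DataFrame(rows).to_csv(index=False)

# ...and decode it back into a real dict while reading.
df = pd.read_csv(io.StringIO(csv_text), converters={"pair_lddts": json.loads})
```

This keeps the CSV a single flat file while still giving downstream code structured access to the pairwise scores.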
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/classify_altloc_regions.py`:
- Around line 174-185: The code is constructing residue lookup keys as str(r)
but AllAtomLDDT uses chain+res_id (e.g. "A123"), so change the keys generation
to include the chain prefix (e.g. keys = [f"{chain}{r}" for r in residues] or
keys = [chain + str(r) for r in residues]) and use those keys for the missing
check, flat extraction (residue_scores[k][0]) and logging; update any references
to keys/residue_scores in this block (symbols: residue_scores, residues, keys,
chain, flat) so lookups match the AllAtomLDDT key format.
- Around line 162-169: AllAtomLDDT.compute is being passed full
predicted_atom_array_stack and ground_truth_atom_array_stack while relying on
the selection string to report "backbone" scores, but selection is applied after
feature construction so side-chain atoms still influence the lDDT; fix by
pre-filtering predicted_atom_array_stack and ground_truth_atom_array_stack to
only backbone atom rows (atom_name in ['C','CA','N','O']) before calling
AllAtomLDDT().compute, keep the residue/chain filtering (res_clause/chain) for
selection of reported residues, and ensure the filtered stacks preserve
corresponding atom ordering and indexing between prediction and ground truth
when passing them to compute.
In `@src/sampleworks/eval/grid_search_eval_utils.py`:
- Around line 27-38: The guard before constructing Path(pattern) needs to
explicitly detect NaN so NaN doesn't get passed to Path and trigger TypeError;
update the check around row["structure_pattern"] in grid_search_eval_utils.py to
use an explicit NaN check (e.g., pandas.isna(row["structure_pattern"]) or
math.isnan for numeric types) in addition to the existing truthiness test, and
raise the same ValueError if the value is missing/NaN; ensure the condition
references row["structure_pattern"] and the subsequent Path(...) call remains
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 280a4e1c-c30a-47f2-a7da-afa3a3d742ce
📒 Files selected for processing (3)
- .gitignore
- scripts/eval/classify_altloc_regions.py
- src/sampleworks/eval/grid_search_eval_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitignore
9be2ecf to f9be48e (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (3)
scripts/eval/classify_altloc_regions.py (3)
308-314: ⚠️ Potential issue | 🟡 Minor

Avoid aborting the whole batch for one unscorable span.

This `RuntimeError` bubbles out through `_process_structure()` and `main()`, so one problematic structure prevents writing classifications for the rest of the CSV. Consider logging and skipping the span, emitting an "unknown" classification, or gating strict failure behind a flag.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/classify_altloc_regions.py` around lines 308 - 314, The code currently raises a RuntimeError when no finite lDDT values are found (the finite_vals check over pair_lddts), which aborts the whole run via _process_structure() and main(); change this to handle the unscorable span gracefully by logging a warning that includes protein, selection_str and backbone_altloc_res_ids, then either emit an "unknown" classification for that span (so downstream CSV writing can continue) or skip the span and continue processing the next one; if desired, add a strict/fail-fast flag to preserve the current RuntimeError behavior when explicitly requested.
87-103: ⚠️ Potential issue | 🟡 Minor

Bring the new function docstrings up to NumPy style.

Several helpers only have prose summaries, and `_process_structure()`/`main()` have no docstring. Please add `Parameters`, `Returns`, and relevant `Raises`/`Notes` sections for the new functions.

As per coding guidelines, "Always include NumPy-style docstrings for every function and class."
Also applies to: 114-134, 152-158, 188-222, 322-327, 400-400
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/classify_altloc_regions.py` around lines 87 - 103, Several new helper functions (e.g., _max_contiguous_run and _chain_from_selection) and higher-level functions (_process_structure and main) lack NumPy-style docstrings; update each to include a concise summary plus Parameters, Returns, and Raises (and Notes if relevant). For _max_contiguous_run: document parameter sorted_res_ids (ndarray | list[int]), describe returned int (length of longest contiguous run) and any edge-case behavior (empty input returns 0). For _chain_from_selection: document selection: str, returned Optional[str], and note supported selection syntaxes. Add NumPy-style docstrings similarly for _process_structure and main describing inputs, outputs, side effects, and exceptions; also apply the same docstring format to the other helper functions flagged in the review (lines 114-134, 152-158, 188-222, 322-327, 400) so every function/class follows the project's NumPy-style docstring guideline.
162-169: ⚠️ Potential issue | 🟠 Major

Pre-filter to backbone atoms before computing "backbone lDDT."

The selection string limits reported residues, but the full atom arrays are still passed into lDDT feature construction, so side-chain atoms can influence a score reported as backbone-only.

Proposed direction:

+    backbone_mask = np.isin(gt_array.atom_name, BACKBONE_ATOM_TYPES)
+    if not backbone_mask.any():
+        return float("nan")
+
+    gt_array = gt_array[backbone_mask]
+    pred_array = pred_array[backbone_mask]
+
     res_clause = " or ".join(f"res_id == {r}" for r in residues)
     selection = f"chain_id == '{chain}' and ({res_clause}) and atom_name in ['C','CA','N','O']"
     try:
         result = AllAtomLDDT().compute(
             predicted_atom_array_stack=pred_array,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/classify_altloc_regions.py` around lines 162 - 169, The compute call to AllAtomLDDT uses full predicted and ground-truth atom stacks (predicted_atom_array_stack / ground_truth_atom_array_stack) so side-chain atoms still affect the "backbone" lDDT; filter those stacks to only backbone atoms before calling AllAtomLDDT().compute. Specifically, use the existing res_clause/selection logic (chain, residues, atom_name in ['C','CA','N','O']) to create filtered_pred_array and filtered_gt_array (keeping the same array shapes expected by AllAtomLDDT) and pass those filtered stacks as predicted_atom_array_stack and ground_truth_atom_array_stack to AllAtomLDDT().compute (you can keep or remove the selection arg afterwards).
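The core of that pre-filter is just a boolean mask over atom names, applied identically to both arrays so their atom ordering stays in sync. A minimal sketch (the constant name mirrors the one assumed in the proposed diff above):

```python
import numpy as np

BACKBONE_ATOM_TYPES = ("N", "CA", "C", "O")


def backbone_mask(atom_names: np.ndarray) -> np.ndarray:
    """Boolean mask selecting backbone atoms only.

    Apply the same mask to the predicted and ground-truth arrays before
    lDDT feature construction, so side-chain atoms never enter the score.
    """
    return np.isin(atom_names, BACKBONE_ATOM_TYPES)
```

Note that `np.isin` matches whole strings, so names like `"OXT"` or `"CB"` are excluded rather than prefix-matched.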
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/eval/grid_search_eval_utils.py`:
- Around line 20-26: The docstring for resolve_cif_path is currently prose;
replace it with a NumPy-style docstring that includes a short summary, a
Parameters section documenting row (pd.Series) and cif_root (Path | None), a
Returns section indicating Path, and a Notes (or Returns/Raises) paragraph
describing the resolution behavior (prefer row["structure"] then
row["structure_pattern"], and when using structure_pattern try cif_root/pattern
and cif_root/protein/pattern). Also document any raised exceptions (e.g.,
FileNotFoundError) if applicable and keep the existing descriptive details from
the current prose.
---
Duplicate comments:
In `@scripts/eval/classify_altloc_regions.py`:
- Around line 308-314: The code currently raises a RuntimeError when no finite
lDDT values are found (the finite_vals check over pair_lddts), which aborts the
whole run via _process_structure() and main(); change this to handle the
unscorable span gracefully by logging a warning that includes protein,
selection_str and backbone_altloc_res_ids, then either emit an "unknown"
classification for that span (so downstream CSV writing can continue) or skip
the span and continue processing the next one; if desired, add a
strict/fail-fast flag to preserve the current RuntimeError behavior when
explicitly requested.
- Around line 87-103: Several new helper functions (e.g., _max_contiguous_run
and _chain_from_selection) and higher-level functions (_process_structure and
main) lack NumPy-style docstrings; update each to include a concise summary plus
Parameters, Returns, and Raises (and Notes if relevant). For
_max_contiguous_run: document parameter sorted_res_ids (ndarray | list[int]),
describe returned int (length of longest contiguous run) and any edge-case
behavior (empty input returns 0). For _chain_from_selection: document selection:
str, returned Optional[str], and note supported selection syntaxes. Add
NumPy-style docstrings similarly for _process_structure and main describing
inputs, outputs, side effects, and exceptions; also apply the same docstring
format to the other helper functions flagged in the review (lines 114-134,
152-158, 188-222, 322-327, 400) so every function/class follows the project's
NumPy-style docstring guideline.
- Around line 162-169: The compute call to AllAtomLDDT uses full predicted and
ground-truth atom stacks (predicted_atom_array_stack /
ground_truth_atom_array_stack) so side-chain atoms still affect the "backbone"
lDDT; filter those stacks to only backbone atoms before calling
AllAtomLDDT().compute. Specifically, use the existing res_clause/selection logic
(chain, residues, atom_name in ['C','CA','N','O']) to create filtered_pred_array
and filtered_gt_array (keeping the same array shapes expected by AllAtomLDDT)
and pass those filtered stacks as predicted_atom_array_stack and
ground_truth_atom_array_stack to AllAtomLDDT().compute (you can keep or remove
the selection arg afterwards).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1fda720c-0dfc-4c3c-bcd2-8305a360aafc
📒 Files selected for processing (2)
- scripts/eval/classify_altloc_regions.py
- src/sampleworks/eval/grid_search_eval_utils.py
marcuscollins
left a comment
I'm okay with it as is, but I would favor a selection-first approach that re-uses existing methods. I'm 99% sure it would work and be less code to maintain.
Evaluation script for classifying conformational changes, which we use for downstream analysis