Updates to CIF patching code#104
Conversation
…ccupancies, write to a different file name (refined-patched.cif)
📝 WalkthroughWalkthroughThe CIF patching script now accepts grid-search inputs and PDB patterns, returns errors instead of raising, performs multi-chain residue mapping and NaN-safe coordinate/metadata fixes, and writes patched outputs as Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI Arguments
participant Main as main()
participant Dir as Directory Crawler
participant RefLoad as Reference Loader
participant TargetLoad as Target Loader
participant Mapper as Residue Mapper
participant Patcher as CIF Patcher
participant Writer as File Writer
participant Logger as Logger
CLI->>Main: provide input_dir, grid_search_input_dir, input_pdb_pattern, target_pattern
Main->>Dir: crawl for CIF files matching patterns
Dir->>Logger: log CIF match found
Dir->>Main: return CIF paths
Main->>RefLoad: load reference PDB/CIF from reference_dir
RefLoad->>Main: reference structure
Main->>TargetLoad: load target CIF
TargetLoad->>Main: target structure
Main->>Mapper: compute multi-chain residue mapping
Mapper->>Mapper: validate mapping consistency
Mapper-->>Main: mapping result or error
Main->>Patcher: apply mapping, sync pdbx_poly_seq_scheme, ensure metadata, remove NaN atoms
Patcher->>Patcher: adjust entity_poly and atom IDs
Patcher-->>Main: patched CIF object
Main->>Writer: write patched CIF as "-patched.cif"
Writer->>Logger: log output path
Main->>Logger: aggregate and log any returned errors
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/patch_input_cif_files.py`:
- Around line 192-199: The current backfill only adds occupancy/B-factor when
the entire column is missing; instead, inspect template.block["atom_site"] for
existing "occupancy" and "B_iso_or_equiv" columns and replace any per-row
missing entries (e.g. ".", "?", None, NaN, or empty strings) with the defaults
1.0 for occupancy and 20.0 for B_iso_or_equiv. Implement this by reading the
existing column values from template.block["atom_site"]["occupancy"] and
template.block["atom_site"]["B_iso_or_equiv"], mapping each value to itself
unless it matches a missing sentinel then substituting the default, and then
writing the cleaned list back as a CIFColumn (or mutating the existing
CIFColumn) using the same length based on template.block["atom_site"]["id"] to
preserve alignment.
- Around line 147-155: The current positional zip of cif_keys and ref_keys can
silently remap residues across chains; when building mapping = {cif_key:
ref_key[1] ...} validate chain consistency for each zipped pair (cif_key[0] ==
ref_key[0]) and abort with an error if any pair has mismatched chain ids, or
alternatively build the mapping keyed by the full (chain_id, res_id) tuple so
lookups use chain+residue; ensure atom_keys (from asym_unit.chain_id and
asym_unit.res_id) are looked up with the same tuple shape and keep using
zip(..., strict=True) for length checks.
- Around line 121-134: The remote call to fetch(rcsb_id, ...) should be moved
inside the existing try-except so network/file-not-found errors are caught and
handled like the CIF parsing errors; update the try block to call fetch(...) to
set rcsb_path before computing reference_path, and on exception include the
caught exception details (e) in the msg string returned and logged via
logger.warning; ensure load_any and ensure_atom_array_stack remain inside the
same try so any of fetch, load_any, or ensure_atom_array_stack failures produce
the same logged/returned error message.
k-chrispens
left a comment
There was a problem hiding this comment.
LGTM, coderabbit made a few suggestions that may or may not be worth addressing
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/patch_input_cif_files.py (1)
102-104: Use logger for each collected error instead ofprint().Line 104 bypasses structured logging and makes batch diagnostics harder to aggregate.
Proposed fix
if results: logger.error("The following errors occurred:") for r in results: - print(r) + logger.error(r)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/patch_input_cif_files.py` around lines 102 - 104, Replace the bare print(r) in scripts/patch_input_cif_files.py with structured logging: iterate over the collected results and call logger.error for each entry (for example logger.error("...: %s", r) or include r as structured data), removing the print; use logger.error(..., exc_info=True) when r contains exception info so errors are recorded consistently via the existing logger and are aggregatable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/patch_input_cif_files.py`:
- Around line 121-132: The except block can reference reference_path that may be
unassigned if input_pdb_pattern.format(...) raises; initialize or build the path
safely before the try or ensure the except uses a safe fallback. Specifically,
declare reference_path = None (or compute a safe reference_str) before the try
that contains input_pdb_pattern.format(...), then inside the except construct
the msg using that safe variable (or include the caught exception e) instead of
directly interpolating reference_path; adjust references around functions fetch,
load_any, cif_file, and ensure_atom_array_stack accordingly so the error path
never raises UnboundLocalError when logger.warning(msg) or return msg executes.
---
Nitpick comments:
In `@scripts/patch_input_cif_files.py`:
- Around line 102-104: Replace the bare print(r) in
scripts/patch_input_cif_files.py with structured logging: iterate over the
collected results and call logger.error for each entry (for example
logger.error("...: %s", r) or include r as structured data), removing the print;
use logger.error(..., exc_info=True) when r contains exception info so errors
are recorded consistently via the existing logger and are aggregatable.
| try: | ||
| reference_path = reference_dir / input_pdb_pattern.format(pdb_id=rcsb_id) | ||
| # fetch only downloads the file if it isn't already present. | ||
| rcsb_path = fetch(rcsb_id, format="cif", target_path=str(SAMPLEWORKS_CACHE)) | ||
|
|
||
| reference = load_any(reference_path) | ||
| asym_unit = load_any(cif_file) | ||
| asym_unit = ensure_atom_array_stack(asym_unit) | ||
| except Exception as e: | ||
| msg = f"Unable to read and parse either/both of {reference_path}, {cif_file}" | ||
| logger.warning(msg) | ||
| return msg |
There was a problem hiding this comment.
Guard the exception path against an unbound reference_path.
If Line 122 fails during input_pdb_pattern.format(...), Line 130 can raise UnboundLocalError, which breaks the intended “return error message” flow.
Proposed fix
- # Get the offset for residue numbering in the reference structure
- try:
- reference_path = reference_dir / input_pdb_pattern.format(pdb_id=rcsb_id)
+ # Get the offset for residue numbering in the reference structure
+ reference_path: Path | None = None
+ try:
+ reference_path = reference_dir / input_pdb_pattern.format(pdb_id=rcsb_id)
# fetch only downloads the file if it isn't already present.
rcsb_path = fetch(rcsb_id, format="cif", target_path=str(SAMPLEWORKS_CACHE))
reference = load_any(reference_path)
asym_unit = load_any(cif_file)
asym_unit = ensure_atom_array_stack(asym_unit)
- except Exception as e:
- msg = f"Unable to read and parse either/both of {reference_path}, {cif_file}"
+ except Exception as exc:
+ msg = (
+ f"Unable to fetch/read one of rcsb_id={rcsb_id}, "
+ f"reference_path={reference_path}, cif_file={cif_file}: {exc}"
+ )
logger.warning(msg)
return msg🧰 Tools
🪛 Ruff (0.15.2)
[warning] 129-129: Do not catch blind exception: Exception
(BLE001)
[error] 129-129: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/patch_input_cif_files.py` around lines 121 - 132, The except block
can reference reference_path that may be unassigned if
input_pdb_pattern.format(...) raises; initialize or build the path safely before
the try or ensure the except uses a safe fallback. Specifically, declare
reference_path = None (or compute a safe reference_str) before the try that
contains input_pdb_pattern.format(...), then inside the except construct the msg
using that safe variable (or include the caught exception e) instead of directly
interpolating reference_path; adjust references around functions fetch,
load_any, cif_file, and ensure_atom_array_stack accordingly so the error path
never raises UnboundLocalError when logger.warning(msg) or return msg executes.
Reminder: we patch our CIF files to make it easier/possible to work with them using tools from phenix, PDB-REDO, etc...
Changes to CIF patching:
Set the sequence ids to what they are in the input CIF used for guidance
Include the occupancy and B-factor columns, setting B to 20.0 if missing, occupancy to 1.0 if missing
Instead of backing up the original refined.cif, write the new CIF file as refined-patched.cif
Summary by CodeRabbit
New Features
Improvements