Skip to content

Updates to CIF patching code#104

Merged
marcuscollins merged 2 commits intomainfrom
updatd-patch-script
Feb 26, 2026
Merged

Updates to CIF patching code#104
marcuscollins merged 2 commits intomainfrom
updatd-patch-script

Conversation

@marcuscollins
Copy link
Copy Markdown
Collaborator

@marcuscollins marcuscollins commented Feb 26, 2026

Reminder: we patch our CIF files to make it easier/possible to work with them using tools from phenix, PDB-REDO, etc...
Changes to CIF patching:

Set the sequence ids to what they are in the input CIF used for guidance
Include the occupancy and B-factor columns, setting B to 20.0 if missing, occupancy to 1.0 if missing
Instead of backing up the original refined.cif, write the new CIF file as refined-patched.cif

Summary by CodeRabbit

  • New Features

    • Added command-line options to specify a grid-search input directory and PDB filename pattern.
    • Logs when matching input files are found during directory crawl.
  • Improvements

    • Patched CIFs are written with a "-patched" suffix, preserving originals.
    • Aggregated error reporting and more robust handling of read/parse failures.
    • Multi-chain residue mapping validation, coordinate flattening, and consistent removal of NaN atoms.
    • Ensures occupancy/B-factor metadata, synchronized sequence numbers, and unique atom IDs before writing.

…ccupancies, write to a different file name (refined-patched.cif)
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

The CIF patching script now accepts grid-search inputs and PDB patterns, returns errors instead of raising, performs multi-chain residue mapping and NaN-safe coordinate/metadata fixes, and writes patched outputs as -patched.cif files rather than modifying originals.

Changes

Cohort / File(s) Summary
CLI & Signatures
scripts/patch_input_cif_files.py
Added grid_search_input_dir and input_pdb_pattern CLI args; updated main() and patch_individual_cif_file() signatures to accept reference dir and input pattern.
Patching & Mapping Logic
scripts/patch_input_cif_files.py
Reworked patch flow to load reference PDB/CIF, perform multi-chain residue mapping and remapping, ensure occupancy/B_iso_or_equiv metadata, flatten coordinates, filter NaNs, align entity_poly when single polymer exists, and write -patched.cif.
Error Handling & Logging
scripts/patch_input_cif_files.py
Refactored to return error messages (no exceptions), aggregate and log multiple errors, and log CIF matches during directory crawl and final write locations.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Arguments
    participant Main as main()
    participant Dir as Directory Crawler
    participant RefLoad as Reference Loader
    participant TargetLoad as Target Loader
    participant Mapper as Residue Mapper
    participant Patcher as CIF Patcher
    participant Writer as File Writer
    participant Logger as Logger

    CLI->>Main: provide input_dir, grid_search_input_dir, input_pdb_pattern, target_pattern
    Main->>Dir: crawl for CIF files matching patterns
    Dir->>Logger: log CIF match found
    Dir->>Main: return CIF paths

    Main->>RefLoad: load reference PDB/CIF from reference_dir
    RefLoad->>Main: reference structure
    Main->>TargetLoad: load target CIF
    TargetLoad->>Main: target structure

    Main->>Mapper: compute multi-chain residue mapping
    Mapper->>Mapper: validate mapping consistency
    Mapper-->>Main: mapping result or error

    Main->>Patcher: apply mapping, sync pdbx_poly_seq_scheme, ensure metadata, remove NaN atoms
    Patcher->>Patcher: adjust entity_poly and atom IDs
    Patcher-->>Main: patched CIF object

    Main->>Writer: write patched CIF as "-patched.cif"
    Writer->>Logger: log output path
    Main->>Logger: aggregate and log any returned errors
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • k-chrispens

Poem

🐰 In folders where tiny CIFs hide and play,
I hop and match residues, chain by chain,
No more overwrites — I write patched stays,
I log each stumble, gather errors sane,
A rabbit's patch, neat files for your day.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Updates to CIF patching code' is vague and generic, using non-descriptive language that doesn't convey the specific improvements or key changes made to the patching script. Make the title more specific by highlighting a key change, such as 'Add multi-chain residue mapping and occupancy/B-factor handling to CIF patching' or 'Enhance CIF patching with grid-search and metadata improvements'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch updatd-patch-script

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/patch_input_cif_files.py`:
- Around line 192-199: The current backfill only adds occupancy/B-factor when
the entire column is missing; instead, inspect template.block["atom_site"] for
existing "occupancy" and "B_iso_or_equiv" columns and replace any per-row
missing entries (e.g. ".", "?", None, NaN, or empty strings) with the defaults
1.0 for occupancy and 20.0 for B_iso_or_equiv. Implement this by reading the
existing column values from template.block["atom_site"]["occupancy"] and
template.block["atom_site"]["B_iso_or_equiv"], mapping each value to itself
unless it matches a missing sentinel then substituting the default, and then
writing the cleaned list back as a CIFColumn (or mutating the existing
CIFColumn) using the same length based on template.block["atom_site"]["id"] to
preserve alignment.
- Around line 147-155: The current positional zip of cif_keys and ref_keys can
silently remap residues across chains; when building mapping = {cif_key:
ref_key[1] ...} validate chain consistency for each zipped pair (cif_key[0] ==
ref_key[0]) and abort with an error if any pair has mismatched chain ids, or
alternatively build the mapping keyed by the full (chain_id, res_id) tuple so
lookups use chain+residue; ensure atom_keys (from asym_unit.chain_id and
asym_unit.res_id) are looked up with the same tuple shape and keep using
zip(..., strict=True) for length checks.
- Around line 121-134: The remote call to fetch(rcsb_id, ...) should be moved
inside the existing try-except so network/file-not-found errors are caught and
handled like the CIF parsing errors; update the try block to call fetch(...) to
set rcsb_path before computing reference_path, and on exception include the
caught exception details (e) in the msg string returned and logged via
logger.warning; ensure load_any and ensure_atom_array_stack remain inside the
same try so any of fetch, load_any, or ensure_atom_array_stack failures produce
the same logged/returned error message.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 469d8e4 and 30c76ae.

📒 Files selected for processing (1)
  • scripts/patch_input_cif_files.py

Comment thread scripts/patch_input_cif_files.py Outdated
Comment thread scripts/patch_input_cif_files.py
Comment thread scripts/patch_input_cif_files.py
Copy link
Copy Markdown
Collaborator

@k-chrispens k-chrispens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, coderabbit made a few suggestions that may or may not be worth addressing

Comment thread scripts/patch_input_cif_files.py Outdated
Comment thread scripts/patch_input_cif_files.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/patch_input_cif_files.py (1)

102-104: Use logger for each collected error instead of print().

Line 104 bypasses structured logging and makes batch diagnostics harder to aggregate.

Proposed fix
     if results:
         logger.error("The following errors occurred:")
         for r in results:
-            print(r)
+            logger.error(r)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/patch_input_cif_files.py` around lines 102 - 104, Replace the bare
print(r) in scripts/patch_input_cif_files.py with structured logging: iterate
over the collected results and call logger.error for each entry (for example
logger.error("...: %s", r) or include r as structured data), removing the print;
use logger.error(..., exc_info=True) when r contains exception info so errors
are recorded consistently via the existing logger and are aggregatable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/patch_input_cif_files.py`:
- Around line 121-132: The except block can reference reference_path that may be
unassigned if input_pdb_pattern.format(...) raises; initialize or build the path
safely before the try or ensure the except uses a safe fallback. Specifically,
declare reference_path = None (or compute a safe reference_str) before the try
that contains input_pdb_pattern.format(...), then inside the except construct
the msg using that safe variable (or include the caught exception e) instead of
directly interpolating reference_path; adjust references around functions fetch,
load_any, cif_file, and ensure_atom_array_stack accordingly so the error path
never raises UnboundLocalError when logger.warning(msg) or return msg executes.

---

Nitpick comments:
In `@scripts/patch_input_cif_files.py`:
- Around line 102-104: Replace the bare print(r) in
scripts/patch_input_cif_files.py with structured logging: iterate over the
collected results and call logger.error for each entry (for example
logger.error("...: %s", r) or include r as structured data), removing the print;
use logger.error(..., exc_info=True) when r contains exception info so errors
are recorded consistently via the existing logger and are aggregatable.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30c76ae and 52fb5f2.

📒 Files selected for processing (1)
  • scripts/patch_input_cif_files.py

Comment on lines +121 to +132
try:
reference_path = reference_dir / input_pdb_pattern.format(pdb_id=rcsb_id)
# fetch only downloads the file if it isn't already present.
rcsb_path = fetch(rcsb_id, format="cif", target_path=str(SAMPLEWORKS_CACHE))

reference = load_any(reference_path)
asym_unit = load_any(cif_file)
asym_unit = ensure_atom_array_stack(asym_unit)
except Exception as e:
msg = f"Unable to read and parse either/both of {reference_path}, {cif_file}"
logger.warning(msg)
return msg
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard the exception path against an unbound reference_path.

If Line 122 fails during input_pdb_pattern.format(...), Line 130 can raise UnboundLocalError, which breaks the intended “return error message” flow.

Proposed fix
-    # Get the offset for residue numbering in the reference structure    
-    try:
-        reference_path = reference_dir / input_pdb_pattern.format(pdb_id=rcsb_id)
+    # Get the offset for residue numbering in the reference structure
+    reference_path: Path | None = None
+    try:
+        reference_path = reference_dir / input_pdb_pattern.format(pdb_id=rcsb_id)
         # fetch only downloads the file if it isn't already present.
         rcsb_path = fetch(rcsb_id, format="cif", target_path=str(SAMPLEWORKS_CACHE))
 
         reference = load_any(reference_path)
         asym_unit = load_any(cif_file)
         asym_unit = ensure_atom_array_stack(asym_unit)
-    except Exception as e:
-        msg = f"Unable to read and parse either/both of {reference_path}, {cif_file}"
+    except Exception as exc:
+        msg = (
+            f"Unable to fetch/read one of rcsb_id={rcsb_id}, "
+            f"reference_path={reference_path}, cif_file={cif_file}: {exc}"
+        )
         logger.warning(msg)
         return msg
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 129-129: Do not catch blind exception: Exception

(BLE001)


[error] 129-129: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/patch_input_cif_files.py` around lines 121 - 132, The except block
can reference reference_path that may be unassigned if
input_pdb_pattern.format(...) raises; initialize or build the path safely before
the try or ensure the except uses a safe fallback. Specifically, declare
reference_path = None (or compute a safe reference_str) before the try that
contains input_pdb_pattern.format(...), then inside the except construct the msg
using that safe variable (or include the caught exception e) instead of directly
interpolating reference_path; adjust references around functions fetch,
load_any, cif_file, and ensure_atom_array_stack accordingly so the error path
never raises UnboundLocalError when logger.warning(msg) or return msg executes.

@marcuscollins marcuscollins merged commit d10029b into main Feb 26, 2026
1 check passed
@marcuscollins marcuscollins deleted the updatd-patch-script branch February 26, 2026 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants