feat(ciffiles): store metadata about ensemble generation directly in output CIF files. by marcuscollins · Pull Request #209 · diff-use/sampleworks

marcuscollins · 2026-04-10T18:59:33Z

This PR:

- Adds a new method, sampleworks.utils.cif_utils.add_category_to_cif, that adds dictionary-like data to a biotite CIFFile object. The data can be a simple dictionary of keys and values, or a dictionary of list-like objects, which produces a CIF loop_ structure.
- Uses that method in sampleworks.utils.guidance_script_utils.save_everything to write the output of GuidanceConfig.as_dict() into the output CIF file.
- Uses that method in scripts/patch_output_cif_files.py to copy the job_metadata.json data in output directories into any patched CIF files, if it isn't already present.
- Adds tests for the new add_category_to_cif method, including a check that any category data so written is read in and written out again correctly.
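
To make the two shapes described above concrete, here is a minimal pure-Python sketch of how a dictionary of scalars versus a dictionary of list-like columns could map onto CIF text. It is not the PR's implementation (which builds a biotite CIFCategory); `render_category` and `_cif_value` are hypothetical names, and the quoting rules are deliberately simplified.

```python
from collections.abc import Mapping, Sequence


def _cif_value(value: object) -> str:
    """Render one value as a CIF token; None becomes the CIF unknown marker '?'."""
    if value is None:
        return "?"
    text = str(value)
    # Quote empty strings and values containing whitespace so they stay one token.
    # Simplification: embedded quotes and multi-line values are not handled.
    if text == "" or any(ch.isspace() for ch in text):
        return f"'{text}'"
    return text


def render_category(name: str, data: Mapping[str, object]) -> str:
    """Render data as CIF text: key-value pairs for scalars, loop_ for columns."""
    is_column = {
        key: isinstance(val, Sequence) and not isinstance(val, (str, bytes))
        for key, val in data.items()
    }
    if not any(is_column.values()):
        # Scalar case: one "_category.key value" line per entry.
        return "\n".join(f"_{name}.{key} {_cif_value(val)}" for key, val in data.items())
    if not all(is_column.values()):
        raise ValueError("mixing scalar and list-like values is not supported in this sketch")
    if len({len(col) for col in data.values()}) != 1:
        raise ValueError("all columns in a loop_ must have the same length")
    # Column case: a loop_ header listing the keys, then one row per index.
    header = ["loop_"] + [f"_{name}.{key}" for key in data]
    rows = [" ".join(_cif_value(v) for v in row) for row in zip(*data.values())]
    return "\n".join(header + rows)
```

For example, `render_category("sampleworks", {"step": [1, 2]})` yields a `loop_` header followed by one row per list element, while a dictionary of plain values yields one `_sampleworks.key value` line per entry.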

resolves #208

Summary by CodeRabbit

New Features
- Sampleworks trial metadata is now embedded into output CIFs and also emitted as a JSON sidecar.
Improvements
- Metadata insertion can add or replace structured categories in CIFs with conflict handling and validation.
- CLI help text for input path matching clarified.
- Output saving unified to produce consistent CIF + JSON metadata.
Tests
- Added tests covering metadata insertion, error cases, persistence, and updated save-flow tests.

coderabbitai · 2026-04-10T18:59:47Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b5f744d-2144-4ff4-9c48-66697a71d692

📥 Commits

Reviewing files that changed from the base of the PR and between 342058a and e8ed80d.

📒 Files selected for processing (1)

src/sampleworks/utils/guidance_script_utils.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/sampleworks/utils/guidance_script_utils.py

Walkthrough

Writes GuidanceConfig-derived metadata into output CIFs as a sampleworks CIF category, adds add_category_to_cif() for in-memory CIF mutation, centralizes metadata creation/writing in save_everything, and updates the CIF patching script to preserve or inject sampleworks metadata.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CIF Utilities `src/sampleworks/utils/cif_utils.py` | Add `add_category_to_cif(ciffile, data, category_name, overwrite=False, block_name=None)` to construct and insert a `CIFCategory` into a block; block selection validation, overwrite handling, and `None` -> `"?"` normalization. |
| Metadata Integration `src/sampleworks/utils/guidance_script_utils.py` | Change `save_everything` to accept `args: GuidanceConfig`; compute `metadata = args.as_dict()`; embed metadata into `refined.cif` via `add_category_to_cif`; write `job_metadata.json` into output dir; tighten related function signatures to require `GuidanceConfig`. |
| Output-patching script `scripts/patch_output_cif_files.py` | Update `patch_individual_cif_file` to prefer an existing `sampleworks` block from the source CIF, else load `job_metadata.json` from the CIF directory and add it with `add_category_to_cif`; extend `--input-pdb-pattern` CLI help text. |
| Tests `tests/utils/test_cif_utils.py`, `tests/utils/test_guidance_script_utils.py`, `tests/integration/test_mismatch_integration.py` | Add `TestAddCategoryToCif` covering single/multi-block, overwrite/error cases, persistence, and None/edge inputs; update tests to construct/pass `GuidanceConfig` to `save_everything()` and adjust integration test calls accordingly. |
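
The block selection and overwrite handling summarized for `add_category_to_cif` can be illustrated with a small sketch. This is an assumption-laden stand-in, not the code in `cif_utils.py`: the CIF is modeled as a plain dict of blocks, `add_category` is a hypothetical name, and the exact error types and the rule for multi-block files are guesses based on the parameter names and the listed test cases.

```python
from typing import Optional


def add_category(
    cif: dict,
    data: dict,
    category_name: str,
    overwrite: bool = False,
    block_name: Optional[str] = None,
) -> dict:
    """Insert data under category_name in one block of a dict-based stand-in CIF."""
    if not cif:
        raise ValueError("CIF has no blocks to add a category to")
    if block_name is None:
        # Assumption: a single block is selected implicitly, several are ambiguous.
        if len(cif) > 1:
            raise ValueError("block_name is required when the CIF has several blocks")
        block_name = next(iter(cif))
    elif block_name not in cif:
        raise KeyError(f"no block named {block_name!r}")
    block = cif[block_name]
    if category_name in block and not overwrite:
        # Assumption: refuse to clobber an existing category unless asked to.
        raise ValueError(f"category {category_name!r} exists; pass overwrite=True")
    block[category_name] = dict(data)
    return cif
```

Making `overwrite` opt-in means a rerun cannot silently replace metadata that was written by an earlier job, which is probably why the patching script checks for an existing category first.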

Sequence Diagram(s)

sequenceDiagram
  participant Guidance as Guidance runner
  participant Save as save_everything
  participant CIFUtil as add_category_to_cif
  participant FS as Filesystem
  participant Patcher as patch_output_cif_files

  Guidance->>Save: run with GuidanceConfig args
  Save->>CIFUtil: load template CIF and add "sampleworks" category (metadata)
  CIFUtil-->>Save: modified CIF
  Save->>FS: write refined.cif (with sampleworks)
  Save->>FS: write job_metadata.json
  Note right of FS: Later, patcher may read refined.cif or job_metadata.json
  Patcher->>FS: read refined.cif
  alt refined.cif contains sampleworks
    FS-->>Patcher: return CIF with sampleworks (preserve)
  else
    FS-->>Patcher: return CIF without sampleworks
    FS-->>Patcher: read job_metadata.json
    Patcher->>CIFUtil: add sampleworks category from JSON
    CIFUtil-->>Patcher: patched CIF
    Patcher->>FS: write patched CIF
  end
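
The `alt`/`else` branch in the diagram above amounts to a two-step lookup: keep the metadata already in the CIF if present, otherwise fall back to the JSON sidecar. A minimal sketch follows, assuming the source block can be treated as a dict of categories; `resolve_metadata` is a hypothetical helper, not a function in `patch_output_cif_files.py`.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_metadata(
    source_block: dict, cif_dir: Path, category: str = "sampleworks"
) -> Optional[dict]:
    """Return metadata for a patched CIF: existing category first, JSON sidecar second."""
    if category in source_block:
        # The source CIF already carries the metadata, so preserve it as-is.
        return source_block[category]
    sidecar = cif_dir / "job_metadata.json"
    if sidecar.is_file():
        # Older outputs only have the sidecar; load it so it can be embedded.
        return json.loads(sidecar.read_text())
    # Neither source is available; the caller decides whether that is an error.
    return None
```

Returning `None` instead of raising leaves the policy for outputs with no metadata at all to the caller; what the real script does in that case is not stated in this thread.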

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix(rf3): squashed RF3 bug that was blocking running on models with unresolved residues #140: Modifies guidance_script_utils.py and save_everything behavior — likely overlaps with metadata embedding and signature changes.
Write out job metadata to a file in the same directory as refined.cif #132: Previously wrote job metadata JSON from a different location; related because this PR centralizes JSON writing into save_everything.
Updates to CIF patching code #104: Changes CIF patching and CLI input_pdb_pattern handling; related to the patching script edits.

Suggested reviewers

k-chrispens

Poem

"I hopped through CIFs with a tiny pen,
Tucked guidance secrets inside now and then,
A category placed with a soft little thump,
Metadata snug in the crystalized dump,
🐰✨ — the rabbit winked and hopped again."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 69.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: storing metadata about ensemble generation directly in CIF files. |
| Linked Issues check | ✅ Passed | All PR objectives align with issue `#208` requirements: adds add_category_to_cif utility, writes GuidanceConfig metadata to CIF files, and includes round-trip tests. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the core objective of storing metadata in CIF files; no unrelated refactoring or scope creep detected. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/utils/test_cif_utils.py (1)

221-365: Comprehensive test coverage, but missing test for None value handling.

The test suite covers block selection, overwrite behavior, error cases, and round-trip persistence well. However, there's no explicit test verifying the None → "none" conversion behavior documented in the implementation.

💡 Suggested additional test

def test_none_values_converted(self, tmp_path):
    """None values in data dict should be converted to placeholder string."""
    atoms = [_atom("A", 1, "ALA", False)]
    cif_path = _write_cif(atoms, tmp_path / "test.cif")
    ciffile = CIFFile.read(str(cif_path))

    data = {"present": "value", "missing": None}
    add_category_to_cif(ciffile, data, "test_category")

    block = ciffile[list(ciffile.keys())[0]]
    category = block["test_category"]
    # Verify None was replaced (with "none" or "?" depending on implementation)
    assert "missing" in category

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/utils/test_cif_utils.py` around lines 221 - 365, Add a test that
verifies add_category_to_cif converts None values to the CIF placeholder: create
a CIF with one block, call add_category_to_cif with data containing a None
(e.g., {"present":"value","missing":None}) and assert the resulting category in
the block contains the "missing" key and that its value equals the expected
placeholder used by add_category_to_cif (e.g., "none" or "?"); add this as
test_none_values_converted alongside the other TestAddCategoryToCif methods to
ensure None→placeholder behavior is covered.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 314-317: Replace the non-standard "none" placeholder with the
CIF-standard missing-value marker: when building CIFCategory in the
CIFCategory(...) call (the category variable constructed from data and
category_name), map None values to "?" instead of "none" (i.e., use {k: v if v
is not None else "?" for k, v in data.items()}); if you need to distinguish
inapplicable values later, use "." per CIF rules, but for the current None
mapping use "?" to conform to CIF standard.

In `@src/sampleworks/utils/guidance_script_utils.py`:
- Line 268: The type hint allows argparse.Namespace but the code calls
args.as_dict() (which Namespace lacks), so either restrict the parameter to
GuidanceConfig or add a fallback: change the function handling the args
parameter to detect or coerce Namespace to a dict (e.g., if hasattr(args,
"as_dict"): cfg = args.as_dict() else: cfg = vars(args) or use getattr(args,
"as_dict", lambda: vars(args))()), and update type hint if you choose to forbid
Namespace; reference the GuidanceConfig type and the args.as_dict() call when
making the change.
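
The fallback suggested in the comment above could look roughly like this. It is a sketch only: `ExampleConfig` is a hypothetical stand-in for `GuidanceConfig` (whose real fields are not shown in this thread), and `metadata_from_args` is an illustrative name.

```python
import argparse
from dataclasses import asdict, dataclass


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for GuidanceConfig; only as_dict() matters here."""

    model: str
    steps: int

    def as_dict(self) -> dict:
        return asdict(self)


def metadata_from_args(args) -> dict:
    """Use as_dict() when available, otherwise fall back to vars() for a Namespace."""
    as_dict = getattr(args, "as_dict", None)
    if callable(as_dict):
        return dict(as_dict())
    # argparse.Namespace stores parsed options as instance attributes.
    return dict(vars(args))
```

According to the Changes table, the PR went with the other option the comment offers and tightened the signatures to require `GuidanceConfig`, which avoids carrying two code paths.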

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86cd0398-cff0-4e97-81eb-b0f95e64d0bb

📥 Commits

Reviewing files that changed from the base of the PR and between 14ba724 and 33dc325.

📒 Files selected for processing (6)

scripts/patch_output_cif_files.py
src/sampleworks/utils/cif_utils.py
src/sampleworks/utils/guidance_script_utils.py
tests/integration/test_mismatch_integration.py
tests/utils/test_cif_utils.py
tests/utils/test_guidance_script_utils.py

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

src/sampleworks/utils/guidance_script_utils.py (1)

331-332: ⚠️ Potential issue | 🟠 Major

Potential runtime break if any caller still passes argparse.Namespace.

Line 331 assumes args.as_dict() exists. Existing script paths still using parse_*_args() + direct run_guidance(...) with raw parser output will raise AttributeError here unless they were migrated to GuidanceConfig construction first.

#!/bin/bash
set -euo pipefail

echo "Check parser return style (Namespace-producing parse_args):"
rg -nP --type=py 'def parse_.*_args\(' src/sampleworks/utils/guidance_script_arguments.py -A20 \
  | rg -n 'def |return parser\.parse_args\(\)'

echo
echo "Check run_guidance() call sites in scripts:"
rg -nP --type=py '\brun_guidance\s*\(' scripts -C2

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/guidance_script_utils.py` around lines 331 - 332, The
code unconditionally calls args.as_dict() (in guidance_script_utils.py) which
will raise AttributeError for callers that still pass an argparse.Namespace
(e.g. parse_*_args() + run_guidance(...) sites); change the metadata extraction
to handle all forms: if args has as_dict use that, else if it's an
argparse.Namespace use vars(args), else if it's a GuidanceConfig (or other known
config class) call its appropriate to-dict method, so metadata is always a dict;
update the metadata assignment in run_guidance (or the function that performs
metadata = args.as_dict()) to perform this safe branching and normalize to a
plain dict before further use.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/cif_utils.py`:
- Around line 275-279: The docstring example is non-runnable because
add_category_to_cif expects an existing CIF block; update the example to create
or add a block to the CIFFile before calling add_category_to_cif (e.g., create
CIFFile(), add a block or use CIFFile().new_block(...) / ensure ciffile.block
contains at least one block), then call add_category_to_cif(ciffile, data,
"my_custom_data"); reference the CIFFile constructor and add_category_to_cif
function to locate and fix the example.
- Around line 314-317: The current CIFCategory construction replaces only
top-level None values in data but leaves None inside list-like columns (e.g.,
data entries that are lists/tuples), so update the normalization before creating
CIFCategory: walk each value in the data dict (the variable data used to build
category in code that constructs category = CIFCategory(...)), and if a value is
a list/tuple, map its elements replacing inner None with "?" (recursively if
nested lists are possible); leave non-list values handled as before. After
normalizing list elements, pass the cleaned dict into
CIFCategory(name=category_name, columns=...) so all nulls comply with CIF rules.
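
A minimal sketch of the normalization described in the second comment, assuming lists and tuples are the only list-like column types; `normalize_missing` and `normalize_columns` are hypothetical helper names, not functions from `cif_utils.py`.

```python
def normalize_missing(value, placeholder: str = "?"):
    """Replace None with the CIF unknown-value marker, descending into lists and tuples."""
    if value is None:
        return placeholder
    if isinstance(value, (list, tuple)):
        # Recurse so None nested inside a column is also replaced.
        return [normalize_missing(item, placeholder) for item in value]
    return value


def normalize_columns(data: dict) -> dict:
    """Return a copy of data that is safe to hand to a CIF category constructor."""
    return {key: normalize_missing(val) for key, val in data.items()}
```

The cleaned dictionary would then be passed on in place of the raw `data`; tuples come back as lists, which is harmless if the consumer only needs a sequence.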

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 940ad7a7-cb49-403a-9e2e-7a5c2eda48a1

📥 Commits

Reviewing files that changed from the base of the PR and between 33dc325 and 02258c5.

📒 Files selected for processing (3)

src/sampleworks/utils/cif_utils.py
src/sampleworks/utils/guidance_script_utils.py
tests/utils/test_cif_utils.py

🚧 Files skipped from review as they are similar to previous changes (1)

tests/utils/test_cif_utils.py

k-chrispens

Looks good! This is a cool new method

marcuscollins added 3 commits April 10, 2026 08:19

Used Claude to add a new method in cif_utils.py to add a Biotite CIFCategory to a CIFFile object

03355dd

feat(ciffiles): Adding method and tests to add categories to CIFFiles; fixing up Claude's implementation and tests.

98997f8

feat(ciffiles): Write Sampleworks metadata into CIF files produced by grid search and patching script; resolves #208

33dc325

marcuscollins requested a review from k-chrispens April 10, 2026 18:59

marcuscollins temporarily deployed to gpu-testing April 10, 2026 18:59 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 10, 2026

Comment thread src/sampleworks/utils/guidance_script_utils.py Outdated

Addressing PR 209 automated feedback

02258c5

marcuscollins had a problem deploying to gpu-testing April 10, 2026 21:23 — with GitHub Actions Error

coderabbitai Bot reviewed Apr 10, 2026

Comment thread src/sampleworks/utils/cif_utils.py

Comment thread src/sampleworks/utils/cif_utils.py

Better handling of None in CIF data we insert to CIFFile objects

342058a

marcuscollins had a problem deploying to gpu-testing April 10, 2026 21:47 — with GitHub Actions Error

Fix lint issue

e8ed80d

marcuscollins temporarily deployed to gpu-testing April 10, 2026 22:38 — with GitHub Actions Inactive

k-chrispens approved these changes Apr 10, 2026

k-chrispens merged commit e66b543 into main Apr 10, 2026
11 checks passed

k-chrispens deleted the mdc/store-metadata-in-cif branch April 10, 2026 23:14

k-chrispens merged 6 commits into main from mdc/store-metadata-in-cif
