Fix path serialization bug and first-run RF3 MSA bug by marcuscollins · Pull Request #176 · diff-use/sampleworks

marcuscollins · 2026-03-17T16:36:29Z

This PR contains two small fixes to our generation code:

- At some point we switched our config objects to use `Path` objects, rather than `str`, for paths to various files. We wrote out our metadata using `GuidanceConfig.__dict__` and `json.dump`, but `json` doesn't know how to serialize a `Path`. I therefore introduced a new method, `GuidanceConfig.as_dict`, which converts `Path` to `str`.
- When running RF3 alone, if the MSA cache is not fully populated, we call `_compute_msa` to get the MSA files, but we didn't make sure to use the `.a3m` format files. Those files are written, but the paths to the `.csv` files are returned. This patch fixes that so we get the correct files when using `MSAManager.get_msa`.
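
A minimal sketch of the first fix, for readers who want to see the failure and the workaround side by side. `ExampleConfig` and its fields are hypothetical stand-ins, not the real `GuidanceConfig`; only the idea of converting `Path` values to `str` before handing the dict to `json` comes from the description above.

```python
import json
from pathlib import Path
from typing import Any


class ExampleConfig:
    """Hypothetical stand-in for a config object that stores Path fields."""

    def __init__(self, structure: Path, density: Path, ensemble_size: int) -> None:
        self.structure = structure
        self.density = density
        self.ensemble_size = ensemble_size

    def as_dict(self) -> dict[str, Any]:
        """Return a JSON-safe copy of the instance dict (Path values become str)."""
        return {
            key: str(value) if isinstance(value, Path) else value
            for key, value in self.__dict__.items()
        }


config = ExampleConfig(Path("model.cif"), Path("map.ccp4"), ensemble_size=4)

# Dumping the raw __dict__ fails because json has no encoder for Path.
try:
    json.dumps(config.__dict__)
    raw_dict_serializable = True
except TypeError:
    raw_dict_serializable = False

# Dumping the converted dict works.
serialized = json.dumps(config.as_dict())
```

Converting every `Path` generically, rather than naming individual fields, is a design choice of this sketch; the PR itself may convert specific fields only.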

Summary by CodeRabbit

Bug Fixes
- Added strict validation for computed MSA cache contents, raising clear errors when expected files or formats are missing or mismatched.
Chores
- Improved job metadata serialization to emit only public, JSON-friendly data.
- Minor wording and comment clarifications in guidance-related utilities.

coderabbitai · 2026-03-17T16:37:16Z

📝 Walkthrough

Added a public as_dict() to GuidanceConfig to produce a JSON-safe dict (converting density and structure to strings). run_guidance_job_queue now serializes job.as_dict() instead of job.__dict__. MSA handling gains a private validator that checks cached MSA file pairs and content parity.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Guidance config serialization `src/sampleworks/utils/guidance_script_arguments.py` | Added `GuidanceConfig.as_dict(self) -> dict[str, Any]` which returns a copy of the instance dict with `density` and `structure` converted to their string forms for JSON-safe serialization. |
| Guidance job serialization usage `src/sampleworks/utils/guidance_script_utils.py` | Replaced JSON dumping of `job.__dict__` with `job.as_dict()` in `run_guidance_job_queue` to avoid serializing non-JSON-friendly fields. |
| MSA validation & flow `src/sampleworks/utils/msa.py` | Added private `_validate_msa_cache_contents(msa_hash: str, msa_dir: Path) -> None` that verifies `{hash}_{n}.csv` / `{hash}_{n}.a3m` pairs, CSV headers, and per-sequence parity; raises on mismatches. `_compute_msa` return is now assigned to `_` where its mapping was previously captured; added invocation of the new validator and minor comment adjustments. |
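
As a rough illustration of the pairing check summarized in the table, here is a sketch that only verifies each `{hash}_{n}.csv` has a matching `{hash}_{n}.a3m`. The function name `check_msa_pairs` and the file contents are hypothetical; the CSV header and per-sequence parity checks are left out because the file layouts are not shown in this conversation.

```python
import tempfile
from pathlib import Path


def check_msa_pairs(msa_hash: str, msa_dir: Path) -> int:
    """Hypothetical check that .csv and .a3m files for a hash come in pairs.

    Returns the number of pairs; raises FileNotFoundError if files are
    missing entirely or if any file lacks its counterpart.
    """
    csv_stems = {p.stem for p in msa_dir.glob(f"{msa_hash}_*.csv")}
    a3m_stems = {p.stem for p in msa_dir.glob(f"{msa_hash}_*.a3m")}
    if not csv_stems and not a3m_stems:
        raise FileNotFoundError(f"No MSA files for {msa_hash} in {msa_dir}")
    unpaired = csv_stems ^ a3m_stems  # stems present in only one format
    if unpaired:
        raise FileNotFoundError(f"Unpaired MSA files in {msa_dir}: {sorted(unpaired)}")
    return len(csv_stems)


with tempfile.TemporaryDirectory() as tmp:
    msa_dir = Path(tmp)
    for n in range(2):
        (msa_dir / f"abc123_{n}.csv").write_text("placeholder\n")
        (msa_dir / f"abc123_{n}.a3m").write_text(">query\nACDE\n")
    pair_count = check_msa_pairs("abc123", msa_dir)

    # Deleting one .a3m file should make the check fail.
    (msa_dir / "abc123_1.a3m").unlink()
    try:
        check_msa_pairs("abc123", msa_dir)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```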

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through dicts and strings so bright,

Paths smoothed for JSON, every field polite.
I checked each MSA pair, line by line,
Found matching headers — everything's fine.
A little thump: your data's now alright. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the two main fixes in the PR: path serialization issue and RF3 MSA bug, matching the core objectives. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/sampleworks/utils/msa.py (1)
290-294: Good defensive validation.

This check ensures that the MSA computation actually produced the expected files, providing a clear error message if something went wrong. This is especially valuable for the RF3 first-run case where we're now relying on side effects.
💡 Optional: Use generator expression instead of list comprehension
-            if not all([m.exists() for m in msa_path_dict.values()]):
+            if not all(m.exists() for m in msa_path_dict.values()):
Generator expressions are slightly more efficient with all() since they short-circuit on the first False without building the full list. The same applies to line 271.
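
The short-circuit behaviour mentioned above can be seen in a small standalone sketch. The `exists` helper and the file names are hypothetical; it simply records how many checks run under each form.

```python
calls: list[str] = []


def exists(name: str) -> bool:
    """Hypothetical existence check that records every call it receives."""
    calls.append(name)
    return name != "missing.a3m"


paths = ["missing.a3m", "b.a3m", "c.a3m"]

# List comprehension: every element is checked before all() sees the list.
calls.clear()
list_result = all([exists(p) for p in paths])
list_calls = len(calls)

# Generator expression: all() stops at the first False value.
calls.clear()
gen_result = all(exists(p) for p in paths)
gen_calls = len(calls)
```

Both forms return the same result; the generator version just avoids the extra checks and the intermediate list.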
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/msa.py` around lines 290 - 294, The all() calls
currently use list comprehensions which allocate a list unnecessarily; replace
all([m.exists() for m in msa_path_dict.values()]) with a generator expression
all(m.exists() for m in msa_path_dict.values()) in the FileNotFoundError check
(and likewise update the earlier similar check around the MSA pre-check),
keeping the same variables (msa_path_dict, m.exists(), self.msa_dir) and
preserving the existing error message and behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 191-195: The as_dict method in GuidanceScriptArguments builds an
output dictionary from self.__dict__ and converts Path-like fields but never
returns it; update the as_dict function (method name: as_dict) to return the
constructed output dictionary (variable: output) at the end so callers receive
the serialized dict (ensure you keep the existing conversions for density and
structure).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76bbbe28-f332-49cb-aeab-100c8757b345

📥 Commits

Reviewing files that changed from the base of the PR and between 0969184 and 2dafd3e.

📒 Files selected for processing (3)

src/sampleworks/utils/guidance_script_arguments.py
src/sampleworks/utils/guidance_script_utils.py
src/sampleworks/utils/msa.py

k-chrispens · 2026-03-17T16:55:34Z

            self.use_tweedie = args.use_tweedie
            self.ensemble_size = job.ensemble_size

+    def as_dict(self):


It would also be useful to provide a return type for completions, etc., as well as fix the CodeRabbit comment below.

k-chrispens · 2026-03-17T16:57:17Z

-                msa_path_dict = _compute_msa(
+                # do NOT capture this output, it will return paths to .csv files,
+                # which we don't necessarily want.
+                _ = _compute_msa(


If we don't want to capture the output, should this be void? I think this is the only place it is used currently. Either that or we return a dict of paths to all the generated files, something like that.

If you are okay with it, I want to leave that for a bigger cleanup of that method.

k-chrispens · 2026-03-17T16:59:16Z

                self._cache_hits += 1

+            # Check again that the files exist, if not raise an error.
+            if not all([m.exists() for m in msa_path_dict.values()]):


It seems like this might be useful to encapsulate either within the _compute_msa function or its own "private" checker function? What do you think?

Yeah, I can do that.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/msa.py`:
- Line 370: The call to _validate_msa_cache_contents(hash_key, self.msa_dir)
forces expensive full-CSV/A3M validation on every cache hit; change the logic in
the function (or the caller that uses hash_key and self.msa_dir) so deep
validation runs only when recomputing or when an explicit debug/integrity flag
is set (e.g., add a parameter like perform_deep_validation=False or check
self.debug_integrity before invoking _validate_msa_cache_contents), and ensure
the fast path for cache hits skips the full read/compare while preserving an
option to trigger the heavy validation during recompute or when the integrity
flag is true.
- Line 92: In the loop that iterates sequences using zip(csv_sequences,
a3m_sequences) (the for seq_idx, (csv_seq, a3m_seq) in enumerate(...) in
src/sampleworks/utils/msa.py), make the length invariant explicit by adding
strict=True to the zip call (i.e., zip(csv_sequences, a3m_sequences,
strict=True)); this enforces that the two iterables have equal length (already
validated earlier) and resolves Ruff B905.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 283066f3-7a70-449e-817f-e81b77a845b8

📥 Commits

Reviewing files that changed from the base of the PR and between 2dafd3e and c4bc7b3.

📒 Files selected for processing (3)

src/sampleworks/utils/guidance_script_arguments.py
src/sampleworks/utils/guidance_script_utils.py
src/sampleworks/utils/msa.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/sampleworks/utils/guidance_script_utils.py
src/sampleworks/utils/guidance_script_arguments.py

coderabbitai · 2026-03-17T23:06:38Z

            else:
                self._cache_hits += 1

+            _validate_msa_cache_contents(hash_key, self.msa_dir)


⚠️ Potential issue | 🟠 Major

Avoid full content validation on every cache hit.

Line 370 forces full CSV/A3M read+compare even on cache hits, which can turn the fast path into expensive I/O for large MSAs. Run deep validation only after recompute (or behind a debug/integrity flag).

Suggested patch

-            if not all([m.exists() for m in msa_path_dict.values()]):
+            if not all(m.exists() for m in msa_path_dict.values()):
                 # this will generate both a3m and csv files for us.
                 # do NOT capture this output, it will return paths to .csv files,
                 # which we don't necessarily want.
                 _ = _compute_msa(
                     data,
                     hash_key,  # this is the "target_id" argument to compute_msa
                     self.msa_dir,
                     self.msa_server_url,
                     msa_pairing_strategy,
                     msa_server_username=self.msa_server_username,
                     msa_server_password=self.msa_server_password,
                     api_key_header=self.api_key_header,
                     api_key_value=self.api_key_value,
                 )
                 self._api_calls += 1
+                _validate_msa_cache_contents(hash_key, self.msa_dir)
             else:
                 self._cache_hits += 1
-
-            _validate_msa_cache_contents(hash_key, self.msa_dir)
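
The control flow this comment asks for can be sketched in isolation. `TinyCache`, its `always_validate` flag, and the callables passed in are all hypothetical and are not part of `MSAManager`; the sketch only shows deep validation running after a recompute and being skipped on a plain cache hit unless explicitly requested.

```python
from typing import Callable


class TinyCache:
    """Hypothetical cache that deep-validates only after a recompute."""

    def __init__(
        self,
        compute: Callable[[str], str],
        validate: Callable[[str], None],
        always_validate: bool = False,
    ) -> None:
        self._compute = compute
        self._validate = validate
        self._always_validate = always_validate
        self._store: dict[str, str] = {}
        self.api_calls = 0
        self.cache_hits = 0

    def get(self, key: str) -> str:
        if key not in self._store:
            self._store[key] = self._compute(key)
            self.api_calls += 1
            self._validate(key)  # freshly computed: run the expensive check
        else:
            self.cache_hits += 1
            if self._always_validate:  # opt-in integrity check on the fast path
                self._validate(key)
        return self._store[key]


validated: list[str] = []
cache = TinyCache(compute=lambda k: k.upper(), validate=validated.append)
cache.get("abc")  # miss: computes and validates
cache.get("abc")  # hit: skips validation
```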

k-chrispens

LGTM

marcuscollins requested a review from k-chrispens March 17, 2026 16:36

marcuscollins had a problem deploying to gpu-testing March 17, 2026 16:36 — with GitHub Actions Error

coderabbitai Bot reviewed Mar 17, 2026

k-chrispens requested changes Mar 17, 2026

marcuscollins force-pushed the fix-path-serialization-bug branch from 2dafd3e to c3285f8 Compare March 17, 2026 22:57

marcuscollins had a problem deploying to gpu-testing March 17, 2026 22:57 — with GitHub Actions Error

marcuscollins added 3 commits March 17, 2026 16:00

fix: convert Path to str before writing out metadata since Path isn't JSON serializable. (6b8884a)

fix: make sure that if RF3 is run with an empty MSA cache, we use the correct .a3m MSA files (e9b9930)

fix:PR 176 feedback from k.chrispens. as_dict method should return something; add MSA cache validation (c4bc7b3)

marcuscollins force-pushed the fix-path-serialization-bug branch from c3285f8 to c4bc7b3 Compare March 17, 2026 23:01

marcuscollins temporarily deployed to gpu-testing March 17, 2026 23:01 — with GitHub Actions Inactive

coderabbitai Bot reviewed Mar 17, 2026

k-chrispens self-requested a review March 18, 2026 00:00

k-chrispens approved these changes Mar 18, 2026

marcuscollins merged commit a3dbccb into main Mar 18, 2026
4 checks passed

k-chrispens deleted the fix-path-serialization-bug branch April 22, 2026 00:26
