Fix path serialization bug and first-run RF3 MSA bug #176

Merged
marcuscollins merged 3 commits into main from fix-path-serialization-bug
Mar 18, 2026

Collaborator

@marcuscollins marcuscollins commented Mar 17, 2026

This PR contains two small fixes to our generation code:

  1. At some point we switched our config objects to use Path objects, rather than str for paths to various files. We wrote out our metadata using GuidanceConfig.__dict__ and json.dump, but json doesn't know how to serialize a Path. Therefore I introduced a new method GuidanceConfig.as_dict which converts Path to str.

  2. When running RF3 alone, if the MSA cache is not fully populated, we call _compute_msa to get the MSA files, but didn't make sure to use the .a3m format files. These files are written, but the paths to .csv files are returned. This patch contains a fix so that we get the correct files if we're using MSAManager.get_msa.
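The first fix can be illustrated with a minimal sketch. `GuidanceConfigSketch` is a hypothetical stand-in, not the real `GuidanceConfig`, and the file names are made up; the sketch also converts every `Path` value, which is an assumption rather than the PR's exact field list.

```python
import json
from pathlib import Path
from typing import Any


class GuidanceConfigSketch:
    """Hypothetical stand-in for GuidanceConfig; only the Path handling matters."""

    def __init__(self, structure: Path, density: Path, ensemble_size: int) -> None:
        self.structure = structure
        self.density = density
        self.ensemble_size = ensemble_size

    def as_dict(self) -> dict[str, Any]:
        # Copy the instance dict so the config itself keeps its Path objects.
        output = dict(self.__dict__)
        # Convert each Path value to str so json can serialize it.
        for key, value in output.items():
            if isinstance(value, Path):
                output[key] = str(value)
        return output


config = GuidanceConfigSketch(Path("model.cif"), Path("map.ccp4"), ensemble_size=4)

try:
    json.dumps(config.__dict__)
    raw_dump_failed = False
except TypeError:
    # The json module has no built-in encoder for pathlib.Path.
    raw_dump_failed = True

serialized = json.dumps(config.as_dict())
```

Keeping the conversion in a method, rather than passing `default=str` to `json.dump`, leaves the choice of which fields are written in one place next to the config definition.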

Summary by CodeRabbit

  • Bug Fixes

    • Added strict validation for computed MSA cache contents, raising clear errors when expected files or formats are missing or mismatched.
  • Chores

    • Improved job metadata serialization to emit only public, JSON-friendly data.
    • Minor wording and comment clarifications in guidance-related utilities.

Contributor

coderabbitai Bot commented Mar 17, 2026

Walkthrough

Added a public as_dict() to GuidanceConfig to produce a JSON-safe dict (converting density and structure to strings). run_guidance_job_queue now serializes job.as_dict() instead of job.__dict__. MSA handling gains a private validator that checks cached MSA file pairs and content parity.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Guidance config serialization: src/sampleworks/utils/guidance_script_arguments.py | Added GuidanceConfig.as_dict(self) -> dict[str, Any] which returns a copy of the instance dict with density and structure converted to their string forms for JSON-safe serialization. |
| Guidance job serialization usage: src/sampleworks/utils/guidance_script_utils.py | Replaced JSON dumping of job.__dict__ with job.as_dict() in run_guidance_job_queue to avoid serializing non-JSON-friendly fields. |
| MSA validation & flow: src/sampleworks/utils/msa.py | Added private _validate_msa_cache_contents(msa_hash: str, msa_dir: Path) -> None that verifies {hash}_{n}.csv / {hash}_{n}.a3m pairs, CSV headers, and per-sequence parity; raises on mismatches. _compute_msa return is now assigned to _ where its mapping was previously captured; added invocation of the new validator and minor comment adjustments. |
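A rough sketch of the pair check described for the validator might look like the following. `check_msa_pairs` is a hypothetical helper, not the merged `_validate_msa_cache_contents`; it covers only the `{hash}_{n}.csv` / `{hash}_{n}.a3m` pairing and skips the CSV header and per-sequence checks, whose details are not shown in this conversation. The hash `abc123` and the file contents are placeholders.

```python
import tempfile
from pathlib import Path


def check_msa_pairs(msa_hash: str, msa_dir: Path) -> None:
    """Hypothetical check: each {hash}_{n}.csv needs a {hash}_{n}.a3m and vice versa."""
    csv_stems = {p.stem for p in msa_dir.glob(f"{msa_hash}_*.csv")}
    a3m_stems = {p.stem for p in msa_dir.glob(f"{msa_hash}_*.a3m")}
    if not csv_stems and not a3m_stems:
        raise FileNotFoundError(f"No MSA files for {msa_hash} in {msa_dir}")
    # Symmetric difference: stems present in one format but not the other.
    unpaired = csv_stems ^ a3m_stems
    if unpaired:
        raise FileNotFoundError(f"Unpaired MSA files for {msa_hash}: {sorted(unpaired)}")


with tempfile.TemporaryDirectory() as tmp:
    msa_dir = Path(tmp)
    (msa_dir / "abc123_0.csv").write_text("placeholder\n")
    (msa_dir / "abc123_0.a3m").write_text(">query\nMKT\n")
    check_msa_pairs("abc123", msa_dir)  # complete pair, so no error is raised
    complete_ok = True

    (msa_dir / "abc123_1.csv").write_text("placeholder\n")  # no matching .a3m
    try:
        check_msa_pairs("abc123", msa_dir)
        unpaired_detected = False
    except FileNotFoundError:
        unpaired_detected = True
```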

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through dicts and strings so bright,

Paths smoothed for JSON, every field polite.
I checked each MSA pair, line by line,
Found matching headers — everything's fine.
A little thump: your data's now alright. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the two main fixes in the PR: path serialization issue and RF3 MSA bug, matching the core objectives. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/sampleworks/utils/msa.py (1)

290-294: Good defensive validation.

This check ensures that the MSA computation actually produced the expected files, providing a clear error message if something went wrong. This is especially valuable for the RF3 first-run case where we're now relying on side effects.

💡 Optional: Use generator expression instead of list comprehension
-            if not all([m.exists() for m in msa_path_dict.values()]):
+            if not all(m.exists() for m in msa_path_dict.values()):

Generator expressions are slightly more efficient with all() since they short-circuit on the first False without building the full list. The same applies to line 271.
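The short-circuit difference can be seen in a small self-contained sketch. The `exists` function here is a hypothetical stand-in for `Path.exists` that records how often it is called.

```python
calls: list[str] = []


def exists(name: str) -> bool:
    """Stand-in for Path.exists that records each call."""
    calls.append(name)
    return name != "missing"


names = ["missing", "a", "b"]

# List comprehension: every element is evaluated before all() sees the list.
calls.clear()
list_result = all([exists(n) for n in names])
list_calls = len(calls)

# Generator expression: all() stops at the first False.
calls.clear()
gen_result = all(exists(n) for n in names)
gen_calls = len(calls)
```

Both forms return the same boolean; only the number of `exists` calls differs when an early element fails.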

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/msa.py` around lines 290 - 294, The all() calls
currently use list comprehensions which allocate a list unnecessarily; replace
all([m.exists() for m in msa_path_dict.values()]) with a generator expression
all(m.exists() for m in msa_path_dict.values()) in the FileNotFoundError check
(and likewise update the earlier similar check around the MSA pre-check),
keeping the same variables (msa_path_dict, m.exists(), self.msa_dir) and
preserving the existing error message and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 191-195: The as_dict method in GuidanceConfig builds an
output dictionary from self.__dict__ and converts Path-like fields but never
returns it; update the as_dict function (method name: as_dict) to return the
constructed output dictionary (variable: output) at the end so callers receive
the serialized dict (ensure you keep the existing conversions for density and
structure).

---

Nitpick comments:
In `@src/sampleworks/utils/msa.py`:
- Around line 290-294: The all() calls currently use list comprehensions which
allocate a list unnecessarily; replace all([m.exists() for m in
msa_path_dict.values()]) with a generator expression all(m.exists() for m in
msa_path_dict.values()) in the FileNotFoundError check (and likewise update the
earlier similar check around the MSA pre-check), keeping the same variables
(msa_path_dict, m.exists(), self.msa_dir) and preserving the existing error
message and behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76bbbe28-f332-49cb-aeab-100c8757b345

📥 Commits

Reviewing files that changed from the base of the PR and between 0969184 and 2dafd3e.

📒 Files selected for processing (3)
  • src/sampleworks/utils/guidance_script_arguments.py
  • src/sampleworks/utils/guidance_script_utils.py
  • src/sampleworks/utils/msa.py

Comment thread src/sampleworks/utils/guidance_script_arguments.py Outdated
self.use_tweedie = args.use_tweedie
self.ensemble_size = job.ensemble_size

def as_dict(self):
Collaborator

would also be useful to provide return type for completions, etc.

Collaborator

as well as fix the coderabbit comment below

Collaborator Author

derp.

-                msa_path_dict = _compute_msa(
+                # do NOT capture this output, it will return paths to .csv files,
+                # which we don't necessarily want.
+                _ = _compute_msa(
Collaborator

If we don't want to capture the output, should this be void? I think this is the only place it is used currently. Either that or we return a dict of paths to all the generated files, something like that.

Collaborator Author

If you are okay with it, I want to leave that for a bigger cleanup of that method.

Comment thread src/sampleworks/utils/msa.py Outdated
self._cache_hits += 1

# Check again that the files exist, if not raise an error.
if not all([m.exists() for m in msa_path_dict.values()]):
Collaborator

It seems like this might be useful to encapsulate either within the _compute_msa function or its own "private" checker function? What do you think?

Collaborator Author

Yeah, I can do that.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/sampleworks/utils/msa.py`:
- Line 370: The call to _validate_msa_cache_contents(hash_key, self.msa_dir)
forces expensive full-CSV/A3M validation on every cache hit; change the logic in
the function (or the caller that uses hash_key and self.msa_dir) so deep
validation runs only when recomputing or when an explicit debug/integrity flag
is set (e.g., add a parameter like perform_deep_validation=False or check
self.debug_integrity before invoking _validate_msa_cache_contents), and ensure
the fast path for cache hits skips the full read/compare while preserving an
option to trigger the heavy validation during recompute or when the integrity
flag is true.
- Line 92: In the loop that iterates sequences using zip(csv_sequences,
a3m_sequences) (the for seq_idx, (csv_seq, a3m_seq) in enumerate(...) in
src/sampleworks/utils/msa.py), make the length invariant explicit by adding
strict=True to the zip call (i.e., zip(csv_sequences, a3m_sequences,
strict=True)); this enforces that the two iterables have equal length (already
validated earlier) and resolves Ruff B905.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 283066f3-7a70-449e-817f-e81b77a845b8

📥 Commits

Reviewing files that changed from the base of the PR and between 2dafd3e and c4bc7b3.

📒 Files selected for processing (3)
  • src/sampleworks/utils/guidance_script_arguments.py
  • src/sampleworks/utils/guidance_script_utils.py
  • src/sampleworks/utils/msa.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/sampleworks/utils/guidance_script_utils.py
  • src/sampleworks/utils/guidance_script_arguments.py

Comment thread src/sampleworks/utils/msa.py
else:
self._cache_hits += 1

_validate_msa_cache_contents(hash_key, self.msa_dir)
Contributor

⚠️ Potential issue | 🟠 Major

Avoid full content validation on every cache hit.

Line 370 forces full CSV/A3M read+compare even on cache hits, which can turn the fast path into expensive I/O for large MSAs. Run deep validation only after recompute (or behind a debug/integrity flag).

Suggested patch
-            if not all([m.exists() for m in msa_path_dict.values()]):
+            if not all(m.exists() for m in msa_path_dict.values()):
                 # this will generate both a3m and csv files for us.
                 # do NOT capture this output, it will return paths to .csv files,
                 # which we don't necessarily want.
                 _ = _compute_msa(
                     data,
                     hash_key,  # this is the "target_id" argument to compute_msa
                     self.msa_dir,
                     self.msa_server_url,
                     msa_pairing_strategy,
                     msa_server_username=self.msa_server_username,
                     msa_server_password=self.msa_server_password,
                     api_key_header=self.api_key_header,
                     api_key_value=self.api_key_value,
                 )
                 self._api_calls += 1
+                _validate_msa_cache_contents(hash_key, self.msa_dir)
             else:
                 self._cache_hits += 1
-
-            _validate_msa_cache_contents(hash_key, self.msa_dir)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampleworks/utils/msa.py` at line 370, The call to
_validate_msa_cache_contents(hash_key, self.msa_dir) forces expensive
full-CSV/A3M validation on every cache hit; change the logic in the function (or
the caller that uses hash_key and self.msa_dir) so deep validation runs only
when recomputing or when an explicit debug/integrity flag is set (e.g., add a
parameter like perform_deep_validation=False or check self.debug_integrity
before invoking _validate_msa_cache_contents), and ensure the fast path for
cache hits skips the full read/compare while preserving an option to trigger the
heavy validation during recompute or when the integrity flag is true.
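One way the flag-gated variant suggested above could look is sketched below. `MSACacheSketch`, its `deep_validation` flag, and the `files_present` argument are all hypothetical names for illustration; this is not the real `MSAManager`, and the recompute step is reduced to a counter.

```python
from pathlib import Path
from typing import Callable


class MSACacheSketch:
    """Hypothetical stand-in for the cache logic; not the real MSAManager."""

    def __init__(
        self,
        validator: Callable[[str, Path], None],
        deep_validation: bool = False,  # assumed flag name
    ) -> None:
        self.validator = validator
        self.deep_validation = deep_validation
        self.api_calls = 0
        self.cache_hits = 0

    def get(self, hash_key: str, msa_dir: Path, files_present: bool) -> None:
        if not files_present:
            # Recompute path: the real code would call _compute_msa here.
            self.api_calls += 1
            self.validator(hash_key, msa_dir)
        else:
            # Fast path: skip the full read/compare unless explicitly requested.
            self.cache_hits += 1
            if self.deep_validation:
                self.validator(hash_key, msa_dir)


validated: list[str] = []


def record(hash_key: str, msa_dir: Path) -> None:
    """Stand-in validator that only records which keys were checked."""
    validated.append(hash_key)


fast = MSACacheSketch(record)
fast.get("miss", Path("msa"), files_present=False)
fast.get("hit", Path("msa"), files_present=True)
fast_validated = list(validated)

validated.clear()
strict = MSACacheSketch(record, deep_validation=True)
strict.get("hit", Path("msa"), files_present=True)
strict_validated = list(validated)
```

With the default setting only the recomputed entry is validated; turning the flag on restores validation on cache hits for debugging or integrity checks.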

@k-chrispens k-chrispens self-requested a review March 18, 2026 00:00
Collaborator

@k-chrispens k-chrispens left a comment

LGTM

@marcuscollins marcuscollins merged commit a3dbccb into main Mar 18, 2026
4 checks passed
@k-chrispens k-chrispens deleted the fix-path-serialization-bug branch April 22, 2026 00:26