Fix path serialization bug and first-run RF3 MSA bug#176
Conversation
📝 WalkthroughWalkthroughAdded a public Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/sampleworks/utils/msa.py (1)
290-294: Good defensive validation.This check ensures that the MSA computation actually produced the expected files, providing a clear error message if something went wrong. This is especially valuable for the RF3 first-run case where we're now relying on side effects.
💡 Optional: Use generator expression instead of list comprehension
- if not all([m.exists() for m in msa_path_dict.values()]): + if not all(m.exists() for m in msa_path_dict.values()):Generator expressions are slightly more efficient with
all()since they short-circuit on the firstFalsewithout building the full list. The same applies to line 271.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/msa.py` around lines 290 - 294, The all() calls currently use list comprehensions which allocate a list unnecessarily; replace all([m.exists() for m in msa_path_dict.values()]) with a generator expression all(m.exists() for m in msa_path_dict.values()) in the FileNotFoundError check (and likewise update the earlier similar check around the MSA pre-check), keeping the same variables (msa_path_dict, m.exists(), self.msa_dir) and preserving the existing error message and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 191-195: The as_dict method in GuidanceScriptArguments builds an
output dictionary from self.__dict__ and converts Path-like fields but never
returns it; update the as_dict function (method name: as_dict) to return the
constructed output dictionary (variable: output) at the end so callers receive
the serialized dict (ensure you keep the existing conversions for density and
structure).
---
Nitpick comments:
In `@src/sampleworks/utils/msa.py`:
- Around line 290-294: The all() calls currently use list comprehensions which
allocate a list unnecessarily; replace all([m.exists() for m in
msa_path_dict.values()]) with a generator expression all(m.exists() for m in
msa_path_dict.values()) in the FileNotFoundError check (and likewise update the
earlier similar check around the MSA pre-check), keeping the same variables
(msa_path_dict, m.exists(), self.msa_dir) and preserving the existing error
message and behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 76bbbe28-f332-49cb-aeab-100c8757b345
📒 Files selected for processing (3)
src/sampleworks/utils/guidance_script_arguments.pysrc/sampleworks/utils/guidance_script_utils.pysrc/sampleworks/utils/msa.py
| self.use_tweedie = args.use_tweedie | ||
| self.ensemble_size = job.ensemble_size | ||
|
|
||
| def as_dict(self): |
There was a problem hiding this comment.
would also be useful to provide return type for completions, etc.
There was a problem hiding this comment.
as well as fix the coderabbit coment below
| msa_path_dict = _compute_msa( | ||
| # do NOT capture this output, it will return paths to .csv files, | ||
| # which we don't necessarily want. | ||
| _ = _compute_msa( |
There was a problem hiding this comment.
If we don't want to capture the output, should this be void? I think this is the only place it is used currently. Either that or we return a dict of paths to all the generated files, something like that.
There was a problem hiding this comment.
If you are okay with it, I want to leave that for a bigger cleanup of that method.
| self._cache_hits += 1 | ||
|
|
||
| # Check again that the files exist, if not raise an error. | ||
| if not all([m.exists() for m in msa_path_dict.values()]): |
There was a problem hiding this comment.
It seems like this might be useful to encapsulate either within the _compute_msa function or its own "private" checker function? What do you think?
There was a problem hiding this comment.
Yeah, I can do that.
2dafd3e to
c3285f8
Compare
… JSON serializable.
… correct .a3m MSA files
…mething; add MSA cache validation
c3285f8 to
c4bc7b3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/msa.py`:
- Line 370: The call to _validate_msa_cache_contents(hash_key, self.msa_dir)
forces expensive full-CSV/A3M validation on every cache hit; change the logic in
the function (or the caller that uses hash_key and self.msa_dir) so deep
validation runs only when recomputing or when an explicit debug/integrity flag
is set (e.g., add a parameter like perform_deep_validation=False or check
self.debug_integrity before invoking _validate_msa_cache_contents), and ensure
the fast path for cache hits skips the full read/compare while preserving an
option to trigger the heavy validation during recompute or when the integrity
flag is true.
- Line 92: In the loop that iterates sequences using zip(csv_sequences,
a3m_sequences) (the for seq_idx, (csv_seq, a3m_seq) in enumerate(...) in
src/sampleworks/utils/msa.py), make the length invariant explicit by adding
strict=True to the zip call (i.e., zip(csv_sequences, a3m_sequences,
strict=True)); this enforces that the two iterables have equal length (already
validated earlier) and resolves Ruff B905.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 283066f3-7a70-449e-817f-e81b77a845b8
📒 Files selected for processing (3)
src/sampleworks/utils/guidance_script_arguments.pysrc/sampleworks/utils/guidance_script_utils.pysrc/sampleworks/utils/msa.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/sampleworks/utils/guidance_script_utils.py
- src/sampleworks/utils/guidance_script_arguments.py
| else: | ||
| self._cache_hits += 1 | ||
|
|
||
| _validate_msa_cache_contents(hash_key, self.msa_dir) |
There was a problem hiding this comment.
Avoid full content validation on every cache hit.
Line 370 forces full CSV/A3M read+compare even on cache hits, which can turn the fast path into expensive I/O for large MSAs. Run deep validation only after recompute (or behind a debug/integrity flag).
Suggested patch
- if not all([m.exists() for m in msa_path_dict.values()]):
+ if not all(m.exists() for m in msa_path_dict.values()):
# this will generate both a3m and csv files for us.
# do NOT capture this output, it will return paths to .csv files,
# which we don't necessarily want.
_ = _compute_msa(
data,
hash_key, # this is the "target_id" argument to compute_msa
self.msa_dir,
self.msa_server_url,
msa_pairing_strategy,
msa_server_username=self.msa_server_username,
msa_server_password=self.msa_server_password,
api_key_header=self.api_key_header,
api_key_value=self.api_key_value,
)
self._api_calls += 1
+ _validate_msa_cache_contents(hash_key, self.msa_dir)
else:
self._cache_hits += 1
-
- _validate_msa_cache_contents(hash_key, self.msa_dir)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/msa.py` at line 370, The call to
_validate_msa_cache_contents(hash_key, self.msa_dir) forces expensive
full-CSV/A3M validation on every cache hit; change the logic in the function (or
the caller that uses hash_key and self.msa_dir) so deep validation runs only
when recomputing or when an explicit debug/integrity flag is set (e.g., add a
parameter like perform_deep_validation=False or check self.debug_integrity
before invoking _validate_msa_cache_contents), and ensure the fast path for
cache hits skips the full read/compare while preserving an option to trigger the
heavy validation during recompute or when the integrity flag is true.
This PR contains two small fixes to our generation code:
At some point we switched our config objects to use
Pathobjects, rather thanstrfor paths to various files. We wrote out our metadata usingGuidanceConfig.__dict__andjson.dump, butjsondoesn't know how to serialize aPath. Therefore I introduced a new methodGuidanceConfig.as_dictwhich convertsPathtostr.When running RF3 alone, if the MSA cache is not fully populated, we call
_compute_msato get the MSA files, but didn't make sure to use the.a3mformat files. These files are written, but the paths to.csvfiles are returned. This patch contains a fix so that we get the correct files if we're usingMSAManager.get_msa.Summary by CodeRabbit
Bug Fixes
Chores