feat(synthetic-sf): add script to generate synthetic structure factors#234
feat(synthetic-sf): add script to generate synthetic structure factors#234DorisMai wants to merge 7 commits into
Conversation
📝 WalkthroughWalkthroughAdds shared structure-loading and occupancy utilities, refactors density generation to use them, introduces a new SFC_Torch-based CLI (single and batch) to compute/write synthetic MTZs, updates pyproject dependencies/rules, and adds tests validating conversion and occupancy effects. ChangesSynthetic Structure Factor Generation with Refactored Utilities
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
src/sampleworks/eval/synthetic_utils.py (3)
158-158: 💤 Low valueAdd explanatory comment for type ignore.
Per coding guidelines, type ignores should include explanatory comments. The
# ty: ignore[invalid-argument-type]suppresses a type error but doesn't explain why it's safe.♻️ Proposed fix
- altloc_info = detect_altlocs(atom_array) # ty: ignore[invalid-argument-type] + # ty: ignore[invalid-argument-type] - atom_array is AtomArray after stripping ops, + # detect_altlocs signature may be overly narrow + altloc_info = detect_altlocs(atom_array)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/synthetic_utils.py` at line 158, The call to detect_altlocs(atom_array) uses a type ignore comment (# ty: ignore[invalid-argument-type]) without justification; update the invocation in synthetic_utils.py to keep the ignore but add a short explanatory comment after it explaining why the type error is safe (e.g., atom_array is known at runtime to match the expected type/has required attributes or originates from a validated source) and reference the specific symbols detect_altlocs and atom_array so future readers/linters understand the rationale.
81-91: ⚡ Quick winExtra
occ_valuesare silently ignored.When
len(occ_values) > len(altloc_info.altloc_ids), extra values are silently discarded without warning. This asymmetry with the "too few values" case (which warns) may surprise callers.Consider adding a warning for the excess values case:
♻️ Proposed fix
if len(occ_values) != len(altloc_info.altloc_ids): + if len(occ_values) > len(altloc_info.altloc_ids): + logger.warning( + f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. " + "Extra values will be ignored." + ) + occ_values = occ_values[:len(altloc_info.altloc_ids)] + else: - logger.warning( - f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. " - "The missing values are automatically set to 0." - ) - occ_values = occ_values + [0.0] * (len(altloc_info.altloc_ids) - len(occ_values)) + logger.warning( + f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. " + "The missing values are automatically set to 0." + ) + occ_values = occ_values + [0.0] * (len(altloc_info.altloc_ids) - len(occ_values))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/synthetic_utils.py` around lines 81 - 91, The code currently only warns when occ_values is shorter than altloc_info.altloc_ids but silently ignores extra occ_values; update the logic around occ_values handling (before the for loop that zips sorted(altloc_info.altloc_ids) with occ_values) to detect when len(occ_values) > len(altloc_info.altloc_ids) and emit a logger.warning indicating how many extra occupancy values were provided and that they will be ignored, then trim occ_values to the expected length; keep the existing branch that pads with zeros when occ_values is shorter and continue assigning occupancy using occupancy[altloc_info.atom_masks[altloc]] in the existing for altloc, occ loop so behavior remains consistent.
91-91: 💤 Low valueReturn type cast may be incorrect for
AtomArrayStackinputs.The function signature accepts
AtomArray | AtomArrayStack, butcast(AtomArray, result)always assertsAtomArray. If anAtomArrayStackis passed, this cast is misleading to type checkers and callers.If the function genuinely only handles
AtomArray, narrow the signature. Otherwise, remove the cast or make it conditional.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/synthetic_utils.py` at line 91, The return cast to AtomArray is incorrect when the function accepts AtomArray | AtomArrayStack: update the function (in synthetic_utils.py) so it either narrows the signature to AtomArray if it truly never returns a stack, or remove the unconditional cast and return result with its original type; alternatively, if both are valid, return a union type or perform a runtime check (e.g., isinstance(result, AtomArrayStack)) and cast/annotate conditionally so callers and type checkers see the correct AtomArray vs AtomArrayStack type instead of always using cast(AtomArray, result).tests/eval/test_generate_synthetic_sf.py (1)
39-44: 💤 Low valueAdd return type annotation for consistency.
The
stripped_atom_arrayfixture is missing a return type annotation, unlike the other fixtures.Suggested fix
+from biotite.structure import AtomArray + `@pytest.fixture`(scope="module") -def stripped_atom_array(resources_dir: Path): +def stripped_atom_array(resources_dir: Path) -> AtomArray: arr = load_structure_with_altlocs(resources_dir / "6b8x" / "6b8x_final.pdb")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/test_generate_synthetic_sf.py` around lines 39 - 44, stripped_atom_array fixture is missing a return type; update its signature to include the same return type used by the other fixtures (e.g. def stripped_atom_array(resources_dir: Path) -> AtomArray:) and add the appropriate type import if not present. Keep the body unchanged (calls to load_structure_with_altlocs, remove_hydrogens, remove_waters, keep_amino_acids, keep_polymer) but ensure the declared return type matches the actual returned value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/sampleworks/eval/generate_synthetic_sf.py`:
- Around line 442-460: The Parallel invocation may spawn multiple loky workers
that each initialize CUDA if device is a GPU, causing GPU contention; update the
code around the Parallel(...) call to detect GPU devices (check the passed-in
device variable for values like "cuda" or starting with "cuda:") and if a GPU is
detected and n_jobs > 1 either set n_jobs = 1 (safe default) or emit a clear
warning (e.g., via logging or warnings.warn) and fall back to n_jobs = 1; ensure
this logic is applied before calling Parallel and reference the Parallel(...)
call and the _process_single_row invocation so the change prevents multiple
worker processes from initializing separate CUDA contexts.
In `@tests/eval/test_generate_synthetic_sf.py`:
- Around line 100-122: These two GPU-dependent tests
(test_fprotein_matches_direct_gemmi and test_fprotein_changes_with_occupancy)
should be annotated with the pytest slow marker: add import pytest if missing
and place `@pytest.mark.slow` immediately above each test function definition that
uses the device fixture/SFcalculator so they are excluded from fast CI runs.
Ensure you only add the decorator (no other behavioral changes) above the def
lines for test_fprotein_matches_direct_gemmi and
test_fprotein_changes_with_occupancy.
---
Nitpick comments:
In `@src/sampleworks/eval/synthetic_utils.py`:
- Line 158: The call to detect_altlocs(atom_array) uses a type ignore comment (#
ty: ignore[invalid-argument-type]) without justification; update the invocation
in synthetic_utils.py to keep the ignore but add a short explanatory comment
after it explaining why the type error is safe (e.g., atom_array is known at
runtime to match the expected type/has required attributes or originates from a
validated source) and reference the specific symbols detect_altlocs and
atom_array so future readers/linters understand the rationale.
- Around line 81-91: The code currently only warns when occ_values is shorter
than altloc_info.altloc_ids but silently ignores extra occ_values; update the
logic around occ_values handling (before the for loop that zips
sorted(altloc_info.altloc_ids) with occ_values) to detect when len(occ_values) >
len(altloc_info.altloc_ids) and emit a logger.warning indicating how many extra
occupancy values were provided and that they will be ignored, then trim
occ_values to the expected length; keep the existing branch that pads with zeros
when occ_values is shorter and continue assigning occupancy using
occupancy[altloc_info.atom_masks[altloc]] in the existing for altloc, occ loop
so behavior remains consistent.
- Line 91: The return cast to AtomArray is incorrect when the function accepts
AtomArray | AtomArrayStack: update the function (in synthetic_utils.py) so it
either narrows the signature to AtomArray if it truly never returns a stack, or
remove the unconditional cast and return result with its original type;
alternatively, if both are valid, return a union type or perform a runtime check
(e.g., isinstance(result, AtomArrayStack)) and cast/annotate conditionally so
callers and type checkers see the correct AtomArray vs AtomArrayStack type
instead of always using cast(AtomArray, result).
In `@tests/eval/test_generate_synthetic_sf.py`:
- Around line 39-44: stripped_atom_array fixture is missing a return type;
update its signature to include the same return type used by the other fixtures
(e.g. def stripped_atom_array(resources_dir: Path) -> AtomArray:) and add the
appropriate type import if not present. Keep the body unchanged (calls to
load_structure_with_altlocs, remove_hydrogens, remove_waters, keep_amino_acids,
keep_polymer) but ensure the declared return type matches the actual returned
value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 724f9f85-5506-461e-aa16-c524bebf8acf
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
pyproject.tomlsrc/sampleworks/eval/generate_synthetic_density.pysrc/sampleworks/eval/generate_synthetic_sf.pysrc/sampleworks/eval/synthetic_utils.pytests/eval/test_generate_synthetic_sf.py
marcuscollins
left a comment
There was a problem hiding this comment.
Please have a look at the comments. I don't see anything major but there are some improvements to make. Also, there are merge conflicts that need to be resolved.
| # This is currently a sort of hacky way to remove ligands by keeping only polymer atoms | ||
| # TODO: there's probably a more robust way to do this |
There was a problem hiding this comment.
This comment should be moved up, I think it applies to line 155.
| ) | ||
| occ_mode = "custom" | ||
| try: | ||
| atom_array = assign_occupancies(atom_array, altloc_info, "custom", occ_values) |
There was a problem hiding this comment.
Having changed occ_mode to "custom", you may as well use that here.
| device=device, | ||
| ) | ||
| sfc.calc_fprotein() | ||
| return assert_numpy(sfc.Fprotein_asu) |
There was a problem hiding this comment.
Are there any numerical tests you can assert? This kind of test really only shows that the code runs without errors, which is a low bar.
There was a problem hiding this comment.
I see some below, it makes me wonder whether this test is redundant.
| def test_space_group_matches_pdb(self, gemmi_structure_from_atomarray, stripped_gemmi): | ||
| assert gemmi_structure_from_atomarray.spacegroup_hm == stripped_gemmi.spacegroup_hm | ||
|
|
||
| def test_atom_count_matches_pdb(self, gemmi_structure_from_atomarray, stripped_gemmi): |
There was a problem hiding this comment.
can you test whether the atoms are actually the same? Does order matter anywhere?
| == stripped_gemmi[0].count_atom_sites() | ||
| ) | ||
|
|
||
| def test_occupancy_change_is_applied(self, stripped_atom_array, stripped_gemmi): |
There was a problem hiding this comment.
you might also want to test some of the logger warnings related to occupancy are correctly generated.
| altloc_info = detect_altlocs(stripped_atom_array) | ||
|
|
||
| arr_default = assign_occupancies(stripped_atom_array, altloc_info, "default") | ||
| f_default = _compute_fprotein( |
There was a problem hiding this comment.
I would explicitly set the occupancy to something different here, just in case for some reason we change the underlying PDB files at some future point.
| [tool.pixi.environments] | ||
| analysis = {features = ["analysis"]} | ||
| analysis-dev = {features = ["analysis", "dev"]} | ||
| analysis-dev-sfc = {features = ["analysis", "dev", "sfc"]} |
There was a problem hiding this comment.
Is there a reason not to just make sfc part of the core requirements (I recall maybe there were dependency conflicts?) At some point we will want to use sfc more generally, right?
…ng gemmi structure.
…rd into synthetic_utils
f07cca8 to
7207a4e
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/sampleworks/eval/generate_synthetic_sf.py (3)
491-608: 💤 Low valueAdd docstring to
parse_args.Per coding guidelines, all functions should have NumPy-style docstrings.
def parse_args() -> argparse.Namespace: + """Parse command-line arguments for synthetic structure factor generation. + + Returns + ------- + argparse.Namespace + Parsed command-line arguments + """ parser = argparse.ArgumentParser(As per coding guidelines: "Always include NumPy-style docstrings for every function and class."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 491 - 608, The parse_args function lacks a NumPy-style docstring; add a concise docstring immediately under the def parse_args() line describing purpose, parameters (none), returns (argparse.Namespace) and any important behaviour (flags/groups like selection, occupancy-mode, simulate-solvent-and-scale, output vs output-dir, batch vs single-structure modes), following NumPy docstring conventions so tools and readers can understand the CLI interface; ensure the docstring mentions default behaviors and side effects (e.g., base-dir used only in batch mode) and stays brief.
179-212: 💤 Low valueAdd missing Returns section to docstring.
The function signature specifies
-> rs.DataSetbut the docstring lacks a corresponding Returns section. Per coding guidelines, NumPy-style docstrings should document return values.sigma_f_scale: float Scale factor to make a fake sigma column from the structure factor amplitude values so that SFcalculator can load the output MTZ file as synthetic reward without errors. The actual sigma values only matter when computing R-factor. output_path: Path | None If provided, write the dataset to this MTZ file path. + + Returns + ------- + rs.DataSet + Dataset with structure factor amplitudes, synthetic sigma column, and + optional R-free flags. """🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 179 - 212, Add a NumPy-style "Returns" section to the docstring of process_amplitudes_to_dataset describing that it returns an rs.DataSet containing the processed reflections (with columns for miller indices, structure factor amplitudes, generated sigma column scaled by sigma_f_scale, and R-free flags), and note that if output_path is provided the dataset is also written to an MTZ; reference the function name process_amplitudes_to_dataset and the return type rs.DataSet in the description and briefly mention any conditions (e.g., test_fraction=0 disables R-free test set).
611-663: 💤 Low valueAdd docstring to
main.Per coding guidelines, all functions should have NumPy-style docstrings.
def main() -> None: + """Entry point for synthetic structure factor generation CLI. + + Supports two modes: + - Single structure: Generate MTZ for one input file + - Batch: Process multiple structures from a CSV file + """ args = parse_args()As per coding guidelines: "Always include NumPy-style docstrings for every function and class."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 611 - 663, Add a NumPy-style docstring to the main() function describing purpose, parameters (none taken directly but mention CLI args from parse_args()), return value (None), and side effects (calls parse_args(), try_gpu(), process_batch(), _process_single_row(), exits on error, writes outputs), and include brief examples or usage notes if appropriate; ensure the docstring references behavior such as handling args.batch_csv versus args.structure, creation of BatchRowForMTZ, and that it may call process_batch(...) or _process_single_row(...) and call sys.exit(1) on missing inputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/sampleworks/eval/generate_synthetic_sf.py`:
- Around line 491-608: The parse_args function lacks a NumPy-style docstring;
add a concise docstring immediately under the def parse_args() line describing
purpose, parameters (none), returns (argparse.Namespace) and any important
behaviour (flags/groups like selection, occupancy-mode,
simulate-solvent-and-scale, output vs output-dir, batch vs single-structure
modes), following NumPy docstring conventions so tools and readers can
understand the CLI interface; ensure the docstring mentions default behaviors
and side effects (e.g., base-dir used only in batch mode) and stays brief.
- Around line 179-212: Add a NumPy-style "Returns" section to the docstring of
process_amplitudes_to_dataset describing that it returns an rs.DataSet
containing the processed reflections (with columns for miller indices, structure
factor amplitudes, generated sigma column scaled by sigma_f_scale, and R-free
flags), and note that if output_path is provided the dataset is also written to
an MTZ; reference the function name process_amplitudes_to_dataset and the return
type rs.DataSet in the description and briefly mention any conditions (e.g.,
test_fraction=0 disables R-free test set).
- Around line 611-663: Add a NumPy-style docstring to the main() function
describing purpose, parameters (none taken directly but mention CLI args from
parse_args()), return value (None), and side effects (calls parse_args(),
try_gpu(), process_batch(), _process_single_row(), exits on error, writes
outputs), and include brief examples or usage notes if appropriate; ensure the
docstring references behavior such as handling args.batch_csv versus
args.structure, creation of BatchRowForMTZ, and that it may call
process_batch(...) or _process_single_row(...) and call sys.exit(1) on missing
inputs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5e61e2df-c13c-4ee7-ac07-954e20a9b9d8
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
pyproject.tomlsrc/sampleworks/eval/generate_synthetic_density.pysrc/sampleworks/eval/generate_synthetic_sf.pysrc/sampleworks/eval/synthetic_utils.pytests/eval/test_generate_synthetic_sf.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/sampleworks/eval/synthetic_utils.py
- tests/eval/test_generate_synthetic_sf.py
- pyproject.toml
- src/sampleworks/eval/generate_synthetic_density.py
Summary
generate_synthetic_sf.pyto produce MTZ files of structure factors using SFcalculator (SFC), with optional bulk solvent + default scaling allowed as well as R-free flag generation. Assumes no anomalous scattering for now.generate_synthetic_density.pyin terms of supporting single_structure vs batch_csv generation, altloc occupancy overwrite, and selection and stripping of hydrogens/waters/ligands.assign_occupanciesand selection/stripping fromgenerate_density_sf.pyintoeval/synthetic_utilsfor reuse in the SF scripts.atomarray_to_gemmi. As this function is likely reused in prototyping the new reward function, test is added. Test is specific to SFC environment.Test plan
analysis-dev-sfc,boltz-dev-sfc, andrf3-dev-sfc, that eventually will be merged (or added to workflows).❗ There is dependency conflict with Protenix (Issue Dependency conflict betwen sfcalculator-torch and protenix #235 ), hence no corresponding env.
tests/eval/test_generate_synthetic_sf.pyNext steps
Summary by CodeRabbit
New Features
Refactor
Tests
Chores