Fix duplicate-sequence multimer mmCIF chain coordinates #576

Open
taivu1998 wants to merge 1 commit into aqlaboratory:main from taivu1998:tdv/issue-525-mmcif-duplicate-chain-coords

Summary

Fixes #525.

This PR fixes DataPipelineMultimer.process_mmcif for multimer mmCIF examples
that contain multiple chains with the same sequence. The pipeline already
deduplicates expensive per-sequence feature generation by caching features under
the sequence string, but the mmCIF path was caching a dictionary after
chain-specific structural labels had been added. Duplicate-sequence chains could
therefore reuse the first chain's all_atom_positions and all_atom_mask.

Root Cause

process_mmcif mixed two feature classes in the same sequence cache:

  • cacheable sequence-derived features from the monomer pipeline;
  • per-chain mmCIF labels from get_mmcif_features.

When a later chain had the same sequence, the pipeline deep-copied the cached
dictionary and skipped get_mmcif_features for that chain's own ID, so the chain
inherited the first chain's structural labels.
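
The failure mode can be reproduced in miniature. The sketch below is not the OpenFold code: `sequence_features`, `structure_features`, and `process_buggy` are hypothetical stand-ins, and the feature dictionaries are simplified to plain lists. It only illustrates how caching a dictionary after per-chain labels have been merged in lets a second chain inherit the first chain's coordinates.

```python
import copy

# Hypothetical stand-ins for illustration only; these are NOT OpenFold functions.
def sequence_features(seq):
    # Expensive, sequence-derived features that are safe to cache per sequence.
    return {"sequence": seq, "seq_length": len(seq)}

def structure_features(chain_id, coords_by_chain):
    # Per-chain structural labels that must never be shared between chains.
    return {"all_atom_positions": coords_by_chain[chain_id]}

def process_buggy(chains, coords_by_chain):
    cache = {}
    out = {}
    for chain_id, seq in chains.items():
        if seq in cache:
            # Cache hit: the copied dict already holds the FIRST chain's
            # coordinates, and structure_features is never called here.
            out[chain_id] = copy.deepcopy(cache[seq])
            continue
        feats = sequence_features(seq)
        feats.update(structure_features(chain_id, coords_by_chain))
        cache[seq] = feats  # Bug: structural labels end up in the sequence cache.
        out[chain_id] = feats
    return out

chains = {"A": "MKV", "B": "MKV"}
coords = {
    "A": [[0.0, 0.0, 0.0]] * 3,
    "B": [[5.0, 5.0, 5.0]] * 3,
}
buggy = process_buggy(chains, coords)
# Chain B now carries chain A's positions instead of its own.
```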

Changes

  • Keep duplicate-sequence caching for sequence-derived features.
  • Add a second pass that attaches mmCIF structural features once per actual
    chain.
  • Refresh auth_chain_id on cache hits so copied intermediate features remain
    internally consistent.
  • Add local shape guards for all_atom_positions and all_atom_mask lengths.
  • Add a focused regression test proving:
    • duplicate sequences still run _process_single_chain only once;
    • get_mmcif_features runs for both chains;
    • duplicate-sequence chains receive distinct positions and masks.
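
The shape of the change, and of the properties the regression test checks, can be sketched as follows. This is a simplified illustration under stated assumptions, not the patch itself: `sequence_features`, `structure_features`, `process_two_pass`, and the `calls` counter are hypothetical names, and the length guard compares against a made-up `seq_length` key rather than the real feature layout.

```python
import copy

# Call counters so the example can show how often each step runs.
calls = {"sequence": 0, "structure": 0}

# Hypothetical stand-ins for illustration only; these are NOT OpenFold functions.
def sequence_features(seq):
    calls["sequence"] += 1
    return {"sequence": seq, "seq_length": len(seq)}

def structure_features(chain_id, coords_by_chain):
    calls["structure"] += 1
    positions = coords_by_chain[chain_id]
    return {
        "all_atom_positions": positions,
        "all_atom_mask": [1.0] * len(positions),
    }

def process_two_pass(chains, coords_by_chain):
    cache = {}
    out = {}
    # Pass 1: sequence-derived features, computed once per distinct sequence.
    for chain_id, seq in chains.items():
        if seq not in cache:
            cache[seq] = sequence_features(seq)
        feats = copy.deepcopy(cache[seq])
        feats["auth_chain_id"] = chain_id  # Refresh the chain ID on every copy.
        out[chain_id] = feats
    # Pass 2: structural labels, attached once per actual chain.
    for chain_id, feats in out.items():
        struct = structure_features(chain_id, coords_by_chain)
        for key in ("all_atom_positions", "all_atom_mask"):
            # Shape guard: the label length must match the sequence length.
            if len(struct[key]) != feats["seq_length"]:
                raise ValueError(
                    f"{key} length {len(struct[key])} does not match "
                    f"sequence length {feats['seq_length']} for chain {chain_id}"
                )
        feats.update(struct)
    return out

chains = {"A": "MKV", "B": "MKV"}
coords = {
    "A": [[0.0, 0.0, 0.0]] * 3,
    "B": [[5.0, 5.0, 5.0]] * 3,
}
fixed = process_two_pass(chains, coords)
```

In this sketch the cache only ever holds sequence-derived features, so a cache hit cannot leak one chain's coordinates into another.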

Validation

Passed:

python3 -m compileall openfold/data/data_pipeline.py tests/test_data_pipeline_multimer.py
PYTHONPATH=/private/tmp/openfold-issue-525-test-deps-py310 python3 -m unittest tests.test_data_pipeline_multimer
git diff --check

Also passed a real fixture smoke test on tests/test_data/mmcifs/2q2k.cif,
confirming duplicate-sequence chains A and B keep distinct coordinates in
the merged multimer output.
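
A check of this kind can be approximated with a small helper. The sketch below is an assumption about how such a smoke test might be written, not the script that was run: `duplicate_chains_with_shared_coords` is a hypothetical name, and the input layout (chain ID mapped to a sequence and a list of positions) is a simplification of the real merged feature dictionary.

```python
from itertools import combinations

def duplicate_chains_with_shared_coords(chains):
    """Return pairs of chain IDs that share a sequence AND identical positions.

    `chains` maps chain_id -> (sequence, positions). This layout is a
    hypothetical simplification chosen for the example.
    """
    offenders = []
    for (id_a, (seq_a, pos_a)), (id_b, (seq_b, pos_b)) in combinations(
        chains.items(), 2
    ):
        # Same sequence is expected for homomers; same coordinates is the bug.
        if seq_a == seq_b and pos_a == pos_b:
            offenders.append((id_a, id_b))
    return offenders

good = {
    "A": ("MKV", [[0.0, 0.0, 0.0]] * 3),
    "B": ("MKV", [[5.0, 5.0, 5.0]] * 3),
}
bad = {
    "A": ("MKV", [[0.0, 0.0, 0.0]] * 3),
    "B": ("MKV", [[0.0, 0.0, 0.0]] * 3),
}
good_offenders = duplicate_chains_with_shared_coords(good)
bad_offenders = duplicate_chains_with_shared_coords(bad)
```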

Attempted broader existing tests, but this local environment is missing optional
runtime pieces unrelated to this patch:

  • tests.test_data_pipeline: missing attn_core_inplace_cuda
  • tests.test_multimer_datamodule: missing pytorch_lightning

taivu1998 marked this pull request as ready for review May 11, 2026 03:44

Development

Successfully merging this pull request may close these issues.

Potential bug in DataPipelineMultimer when there are chains with the same sequence
