C++ acceleration for mmCIF parsing (native "fast path" for coordinates) #5156

@Felixburton7

Hi all,

I've been working with large datasets (ribosomes, viral capsids) and am hitting performance limits with the current MMCIFParser. Even with FastMMCIFParser, the overhead of creating Python Atom objects for 200k+ entries is a bottleneck.

I'm interested in contributing a native C/C++ extension to speed this up.

Update suggestion and context:

This challenge was tackled in the AlphaFold 3 data pipeline (source). The biggest performance gains came from:

  1. Moving tokenization out of Python.
  2. The "Fast Path": Bypassing the creation of millions of Python objects when we only needed coordinates for ML models.
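To illustrate the kind of work item 1 refers to, here is a minimal pure-Python sketch of tokenizing a single mmCIF data line. It is a simplified illustration, not Biopython's actual tokenizer: the function name `tokenize_line` is hypothetical, and multi-line semicolon-delimited text fields are not handled. The per-character loop is the sort of code that tends to be slow in Python and is a natural candidate for a C implementation.

```python
def tokenize_line(line):
    """Split one mmCIF data line into tokens (simplified, hypothetical helper).

    Handles whitespace-separated values, single- or double-quoted values
    (a closing quote only counts if followed by whitespace or end of line),
    and a trailing '#' comment. Semicolon text blocks are NOT handled.
    """
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch == "#":
            break  # rest of the line is a comment
        elif ch in "'\"":
            # Scan to the matching quote that is followed by whitespace/EOL.
            j = i + 1
            while j < n and not (
                line[j] == ch and (j + 1 == n or line[j + 1].isspace())
            ):
                j += 1
            tokens.append(line[i + 1:j])
            i = j + 1
        else:
            # Unquoted value: read up to the next whitespace character.
            j = i
            while j < n and not line[j].isspace():
                j += 1
            tokens.append(line[i:j])
            i = j
    return tokens
```

Quoted atom names such as `"O5'"`, which are common in nucleic acid entries, are the main reason a plain `str.split()` is not sufficient in general.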

Proposal of changes

I propose implementing a lightweight extension (e.g., Bio.PDB._cmmcif) using the standard Python C API, similar to the existing ccealignmodule.c.

This would offer two modes:

  1. A drop-in replacement: accelerates the standard get_structure() by doing tokenization in C (est. 1.5-2x speedup).
  2. A NumPy "Fast Path": an additional method (e.g., parse_to_arrays()) that extracts coordinates/B-factors directly into NumPy arrays, skipping the object overhead entirely.
    • Target Speedup: ~50-100x (based on previous work).
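As a rough sketch of what the second mode might return, the pure-Python function below reads the `_atom_site` loop and collects coordinates and B-factors into flat typed buffers, without creating any per-atom objects. Everything here is an assumption for illustration: `parse_to_arrays` is only the name suggested above and not an existing Biopython API, the stdlib `array` module stands in for NumPy so the sketch has no dependencies, rows are split on whitespace (so quoted values containing spaces are not supported), and missing values (`?` or `.`) in the numeric columns are not handled.

```python
from array import array

# Columns the fast path extracts (standard mmCIF _atom_site item names).
WANTED = ("Cartn_x", "Cartn_y", "Cartn_z", "B_iso_or_equiv")


def parse_to_arrays(lines):
    """Return (coords, bfactors) from the _atom_site loop of an mmCIF file.

    coords is a flat array of doubles: x0, y0, z0, x1, y1, z1, ...
    bfactors holds one double per atom. Hypothetical sketch only.
    """
    columns = []   # item names of the _atom_site loop, in file order
    idx = None     # positions of the WANTED columns within each row
    coords = array("d")
    bfactors = array("d")
    for raw in lines:
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Header line of the loop: record the item name.
            columns.append(line.split()[0][len("_atom_site."):])
            continue
        if not columns:
            continue  # still before the _atom_site loop
        # After the header, any non-row line ends the loop.
        if not line or line[0] in "#_" or line.startswith(("loop_", "data_")):
            break
        if idx is None:
            idx = [columns.index(name) for name in WANTED]
        fields = line.split()  # assumption: no quoted values with spaces
        coords.extend(float(fields[i]) for i in idx[:3])
        bfactors.append(float(fields[idx[3]]))
    return coords, bfactors


SAMPLE = """data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.B_iso_or_equiv
ATOM 1 N 1.0 2.0 3.0 10.5
ATOM 2 CA 4.0 5.0 6.0 11.5
#
"""

coords, bfactors = parse_to_arrays(SAMPLE.splitlines())
```

If NumPy is available, `numpy.frombuffer(coords).reshape(-1, 3)` should give an N x 3 float64 view of the same buffer without copying. A C implementation could fill such an array directly, which is where the larger speedup over building Atom objects would be expected to come from.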

Other things considered

  • Gemmi: an excellent library, but it follows a DOM approach (creating objects for everything) and would introduce a new dependency. I am proposing a native solution that fits Biopython's existing build system.

Quick questions

  1. Is there interest in a "NumPy-only" fast path for ML workflows?
  2. Are there any blockers to adding a new C extension for parsing?

Note: This would be a clean-room implementation (written from scratch) to ensure full compatibility.

Happy to write the code or give more detail.
