Hi all,
I've been working with large datasets (ribosomes, viral capsids) and am hitting performance limits with the current MMCIFParser. Even with FastMMCIFParser, the overhead of creating Python Atom objects for 200k+ entries is a bottleneck.
I'm interested in contributing a native C/C++ extension to speed this up.
Update suggestion and context:
This challenge was tackled in the AlphaFold 3 data pipeline (source). The biggest performance gains came from:
- Moving tokenization out of Python.
- The "Fast Path": Bypassing the creation of millions of Python objects when we only needed coordinates for ML models.
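As a rough illustration of the second point, here is a toy comparison of the two paths. This is not the AlphaFold 3 code: `ToyAtom` and the `records` list are hypothetical stand-ins, and the sketch only shows where the per-atom objects come from, not how large the speedup is.

```python
import numpy as np


class ToyAtom:
    """Hypothetical stand-in for a per-atom Python object."""

    def __init__(self, name, coord, bfactor):
        self.name = name
        self.coord = coord
        self.bfactor = bfactor


# Toy records: (atom name, x, y, z, B-factor) as already-tokenized strings.
records = [
    ("CA", "1.0", "2.0", "3.0", "10.0"),
    ("CB", "4.0", "5.0", "6.0", "12.5"),
]

# Object path: one Python object (plus one small array) per atom,
# then a second pass to gather the coordinates.
atoms = [
    ToyAtom(name, np.array([float(x), float(y), float(z)]), float(b))
    for name, x, y, z, b in records
]
coords_from_objects = np.array([atom.coord for atom in atoms])

# Fast path: fill one preallocated array, with no per-atom objects.
coords_direct = np.empty((len(records), 3), dtype=np.float64)
for i, (_name, x, y, z, _b) in enumerate(records):
    coords_direct[i] = (float(x), float(y), float(z))
```

Both paths end with the same coordinate array; the fast path simply never allocates the intermediate objects.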
Proposal of changes
I propose implementing a lightweight extension (e.g., Bio.PDB._cmmcif) using the standard Python C API, similar to the existing ccealignmodule.c.
This would offer two modes:
- A drop-in replacement: accelerates the standard get_structure() by doing tokenization in C (est. 1.5-2x speedup).
- A NumPy "Fast Path": an additional method (e.g., parse_to_arrays()) that extracts coordinates/B-factors directly into NumPy arrays, skipping the object overhead entirely.
  - Target speedup: ~50-100x (based on previous work).
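To make the second mode concrete, here is a minimal pure-Python sketch of what a parse_to_arrays()-style call might return. Everything in it is an assumption for illustration: the function signature, the returned dictionary keys, and the naive whitespace tokenizer, which ignores quoted values containing spaces and multi-line text fields. The proposed extension would do this work in C; the sketch only shows the intended output shape.

```python
import io

import numpy as np


def parse_to_arrays(handle):
    """Hypothetical fast path: read the _atom_site loop into NumPy arrays.

    Sketch only: tokens are split on whitespace, so quoted values with
    embedded spaces and semicolon-delimited text fields are not handled.
    """
    columns = []  # _atom_site column names, in file order
    rows = []     # one list of string tokens per atom record
    for line in handle:
        line = line.strip()
        if line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1])
        elif columns:
            # Any blank line, comment, or new category ends the loop.
            if not line or line.startswith(("#", "loop_", "_", "data_")):
                break
            rows.append(line.split())
    if not rows:
        raise ValueError("no _atom_site records found")

    ix, iy, iz = (columns.index(c) for c in ("Cartn_x", "Cartn_y", "Cartn_z"))
    ib = columns.index("B_iso_or_equiv")
    table = np.array(rows, dtype=str)  # shape (n_atoms, n_columns)
    return {
        "coords": table[:, [ix, iy, iz]].astype(np.float64),
        "b_factors": table[:, ib].astype(np.float64),
    }


example = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.B_iso_or_equiv
ATOM 1 N 10.000 20.000 30.000 15.50
ATOM 2 CA 11.500 20.250 30.750 14.25
#
"""

arrays = parse_to_arrays(io.StringIO(example))
print(arrays["coords"].shape)
```

No Atom, Residue, or Chain objects are created here, which is the point of the fast path: callers that only need coordinates for ML models get contiguous arrays directly.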
Other things considered
- Gemmi is an excellent library, but it follows a DOM approach (creating objects for everything) and would introduce a new dependency. I am proposing a native solution that fits Biopython's existing build system.
Quick questions
- Is there interest in a "NumPy-only" fast path for ML workflows?
- Are there any blockers to adding a new C extension for parsing?
Note: This would be a clean-room implementation (written from scratch) to ensure full compatibility.
Happy to write the code or give more detail.