C++ acceleration for mmCIF parsing (native "fast path" for coordinates)

Hi all,

I've been working with large structures (ribosomes, viral capsids) and am hitting performance limits with the current `MMCIFParser`. Even with `FastMMCIFParser`, the overhead of creating Python `Atom` objects for structures with 200k+ atoms is a bottleneck.

I'm interested in contributing a native C/C++ extension to speed this up.

### Update suggestion and context: 

This challenge was tackled in the AlphaFold 3 data pipeline ([source](https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/parsers/cpp/cif_dict_lib.cc)). The biggest performance gains came from:

1. Moving tokenization out of Python.
2. The "Fast Path": Bypassing the creation of millions of Python objects when we only needed coordinates for ML models.
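To make point 1 concrete, here is a minimal pure-Python sketch of the kind of per-line tokenization a C extension would take over. It is illustrative only: `tokenize_line` and `_TOKEN` are hypothetical names, not Biopython or AlphaFold 3 code, and multi-line semicolon-delimited text fields are deliberately not handled.

```python
import re

# One token per match: quoted value, comment, or bare word.
# In mmCIF a closing quote only counts if followed by whitespace or end of line.
_TOKEN = re.compile(
    r"""
    '(.*?)'(?=\s|$)   |   # single-quoted value
    "(.*?)"(?=\s|$)   |   # double-quoted value
    (\#.*)            |   # comment running to end of line
    (\S+)                 # bare (unquoted) value
    """,
    re.VERBOSE,
)


def tokenize_line(line):
    """Split one mmCIF line into values (hypothetical helper)."""
    tokens = []
    for match in _TOKEN.finditer(line):
        single, double, comment, bare = match.groups()
        if comment is not None:
            break  # ignore the rest of the line
        if single is not None:
            tokens.append(single)
        elif double is not None:
            tokens.append(double)
        else:
            tokens.append(bare)
    return tokens


row = "ATOM 1 \"N A\" 'O5'' 10.5 # note"
print(tokenize_line(row))
```

Running a regex like this once per line, for every line of a multi-hundred-thousand-atom file, is the interpreter-bound work that moving tokenization into C would avoid.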

### Proposal of changes

It could be good to implement a lightweight extension (e.g., `Bio.PDB._cmmcif`) using the standard Python C API—similar to the existing `ccealignmodule.c`.

This would offer two modes:

1. **A drop-in replacement:** accelerates the standard `get_structure()` by doing tokenization in C (estimated 1.5-2x speedup).
2. **A NumPy "Fast Path":** an additional method (e.g., `parse_to_arrays()`) that extracts coordinates/B-factors directly into NumPy arrays, skipping the object overhead entirely.
    - *Target Speedup:* **~50-100x** (based on previous work).
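As a rough illustration of what mode 2 could return, here is a pure-Python reference sketch under stated assumptions. `parse_to_arrays` is the hypothetical name from this proposal, not an existing Biopython function. The sketch assumes `_atom_site` rows are whitespace-separated with no quoted values containing spaces, and it uses the standard-library `array` module as a stand-in for NumPy buffers so it runs without extra dependencies.

```python
import math
from array import array

# Standard mmCIF _atom_site items for coordinates and B-factors.
WANTED = ("Cartn_x", "Cartn_y", "Cartn_z", "B_iso_or_equiv")


def parse_to_arrays(lines):
    """Collect selected _atom_site columns into flat float arrays.

    Hypothetical reference for the proposed fast path: no Atom,
    Residue or Chain objects are created.
    """
    columns = []  # _atom_site column names, in file order
    out = {name: array("d") for name in WANTED}
    indices = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1].split()[0])
            continue
        if not columns:
            continue  # still before the _atom_site loop
        if not line or line[0] in "#_" or line.startswith(("loop_", "data_")):
            break  # end of the _atom_site table
        if indices is None:
            indices = [columns.index(name) for name in WANTED]
        fields = line.split()  # assumption: no spaces inside values
        for name, i in zip(WANTED, indices):
            value = fields[i]
            # mmCIF uses "." and "?" for inapplicable/unknown values.
            out[name].append(math.nan if value in (".", "?") else float(value))
    return out


sample = """data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.B_iso_or_equiv
ATOM 1 N 1.0 2.0 3.0 10.5
ATOM 2 CA 4.0 5.0 6.0 ?
#
""".splitlines()

arrays = parse_to_arrays(sample)
print(list(arrays["Cartn_x"]), list(arrays["B_iso_or_equiv"]))
```

Because `array("d")` exposes the buffer protocol, each column can be viewed from NumPy without copying via `numpy.frombuffer`. A native implementation would presumably fill NumPy arrays directly and handle quoting properly; this sketch only pins down the intended output shape.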

### Other things considered

- **Gemmi** is an excellent library, but it follows a DOM approach (creating objects for everything) and would introduce a new dependency. I am proposing a native solution that fits Biopython's existing build system.

### Quick questions

1. Is there interest in a "NumPy-only" fast path for ML workflows?
2. Are there any blockers to adding a new C extension for parsing?

*Note: This would be a clean-room implementation (written from scratch) to ensure full compatibility.*

Happy to write the code or give more detail. 
