A Python package for processing omol-25 data using MPI.
You can install this package locally:

    pip install -e .

This package provides three primary command-line interfaces:

Extract, process, and combine molecular data from an S3 bucket (or local directory):

    process_omol25 --help

- MPI Support: Add `--mpi` and run via `mpirun` to distribute tasks across multiple workers using hybrid RMA.
- Smart Restart: Add `--restart` to sweep the output directory, recover orphaned Parquet/XYZ pairs, and resume where the previous run stopped.
- Logging: Specify `--log-file my_log.log` to write log output to disk (existing logs are appended to, not overwritten).
- Batch Flushing: Use `--batch-size N` to control disk writes. If not specified, workers flush dynamically at 1% increments, with a strict minimum of 100 output structures.

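
The default flushing rule above can be illustrated with a small sketch. This is not the package's actual code: the function name `default_flush_interval`, the rounding (ceiling), and the assumption that the 1% is taken over a single total workload count are all hypothetical and only show how a "1% with a floor of 100" rule behaves.

```python
import math
from typing import Optional


def default_flush_interval(total_structures: int,
                           batch_size: Optional[int] = None) -> int:
    """Return how many output structures to accumulate before a disk write.

    Hypothetical helper: mirrors the documented rule, not the real code.
    """
    if batch_size is not None:
        # An explicit --batch-size N takes precedence over the default.
        return batch_size
    # Assumed reading of the default: 1% of the workload, never below 100.
    one_percent = math.ceil(total_structures / 100)
    return max(100, one_percent)


print(default_flush_interval(250_000))              # 2500
print(default_flush_interval(3_000))                # 100
print(default_flush_interval(3_000, batch_size=10)) # 10
```

Under this reading, small runs always flush in batches of at least 100 structures, while large runs scale the batch with the workload.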
Download the original raw orca.out files from S3 without running any processing on them:

    download_omol25 --help

Cross-reference a generated Parquet dataset with its corresponding ExtXYZ file to confirm there is no data corruption or structural mismatch:

    verify_processed_omol25 --parquet props_group.parquet --extxyz structs_group.xyz

- This aligns both tables on `geom_sha1` and flags any misassigned properties.
- Embedded timing metadata such as `process_time_s` is always excluded from the comparison to prevent false-positive errors.

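
The idea behind the check can be sketched in plain Python. This is only an illustration under stated assumptions, not the tool's implementation: `find_mismatches` and the `energy` field are hypothetical, records are modelled as simple dicts rather than real Parquet or ExtXYZ data, and values are compared with exact equality (a real check on floating-point properties would likely need a tolerance).

```python
def find_mismatches(parquet_rows, extxyz_rows,
                    key="geom_sha1", ignore=("process_time_s",)):
    """Join two lists of record dicts on `key` and list differing fields.

    Hypothetical helper for illustration only.
    """
    left = {row[key]: row for row in parquet_rows}
    right = {row[key]: row for row in extxyz_rows}
    problems = []

    # Structures that appear in only one of the two files.
    for k in sorted(left.keys() ^ right.keys()):
        side = "parquet" if k in left else "extxyz"
        problems.append((k, f"only in {side}"))

    # Structures in both files: compare every field except the join key
    # and the ignored timing metadata.
    for k in sorted(left.keys() & right.keys()):
        fields = (left[k].keys() | right[k].keys()) - {key, *ignore}
        for field in sorted(fields):
            if left[k].get(field) != right[k].get(field):
                problems.append((k, field))
    return problems


parquet_rows = [
    {"geom_sha1": "aaa", "energy": -1.5, "process_time_s": 0.8},
    {"geom_sha1": "bbb", "energy": -2.0, "process_time_s": 1.1},
]
extxyz_rows = [
    {"geom_sha1": "bbb", "energy": -2.5, "process_time_s": 9.9},
    {"geom_sha1": "aaa", "energy": -1.5, "process_time_s": 0.2},
]

print(find_mismatches(parquet_rows, extxyz_rows))  # [('bbb', 'energy')]
```

Joining on the geometry hash makes the comparison independent of row order, and dropping `process_time_s` keeps run-to-run timing differences from being reported as errors.
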
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.