Dataset.py refactoring#358
Merged
chrisiacovella merged 61 commits intochoderalab:mainfrom Jun 3, 2025
Merged
Conversation
…5dataset. moved units from curate to modelforge.utils.units
…t factory as it no longer really serves a purpose (DataModule, required by pytorch lightning, basically functions in the same way). This also include a separate dataset_cache_dir to save datasets, separate from the local_cache_dir. local_cache_dir should be unique for each run.
… the key info for the yaml files.
…openff code updated.
…energies for openff datasets
… read data toml files as we require properties of interest and properties assignment now (no longer any defaults).
…ing in conversions. mostly applies to adding energy units to dataset statistics and converting cutoffs to consistent units. partially updated documentation
… MultiplexedPath instead of path
Contributor
There was a problem hiding this comment.
Pull Request Overview
This PR refactors the dataset handling workflow by removing dataset‐specific child classes in favor of YAML–configured datasets, while also updating various curation scripts, documentation, and CI workflows.
- Integrates a VersionMetadata class for automated metadata creation in the curation scripts
- Updates dataset scripts to support local dataset definitions and new dataset versions
- Revises documentation and CI configurations to include Python 3.12 and improve instructions
Reviewed Changes
Copilot reviewed 142 out of 142 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| modelforge-curate/curate/datasets/scripts/curate_qm9.py | Updated version numbers and metadata generation for full and subset datasets |
| modelforge-curate/curate/datasets/scripts/curate_fe_II.py | Renamed version variable; adjusted output directories; updated metadata sections |
| modelforge-curate/curate/datasets/scripts/curate_ani2x.py | Updated version number and metadata creation; simplified comments |
| modelforge-curate/curate/datasets/scripts/curate_ani1x.py | New script to curate the ANI-1x dataset with VersionMetadata integration |
| modelforge-curate/curate/datasets/scripts/curate_PhAlkEthOH.py | Updated output paths and metadata blocks for multiple dataset subsets |
| modelforge-curate/curate/datasets/qm9_curation.py | Added new energy_of_lumo property parsing for QM9 records |
| modelforge-curate/curate/datasets/curation_baseclass.py | Fixed module import to new utils.units location |
| docs/, .github/workflows/, MANIFEST.in | Documentation and CI updates including Python 3.12 support and minor text improvements |
Comments suppressed due to low confidence (1)
modelforge-curate/modelforge/curate/datasets/scripts/curate_ani2x.py:117
- [nitpick] Consider simplifying the printed message to 'full dataset' to avoid redundancy.
print("full dataset dataset")
…_II.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
MarshallYan
approved these changes
May 30, 2025
This was referenced Jun 3, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request Summary
The general idea here is to refactor the way we handle datasets such that we no longer are required to create a unique child class of the HDF5Dataset class for each dataset. Rather than a unique class for each dataset, the code now relies upon yaml files that contain the key information (available properties, atomic self energies, etc.). These yaml files still contain the various versions of the dataset and link/doi to fetch the dataset. Note, this switch removes default properties of interest, which is keeping with the philosophy of requiring users to define all inputs. (note, this however, broke all the tests that relied on this implicit information).
We still maintain a list of "built-in" datasets, but this structure now allows us to support "local" datasets (i.e., those we haven't uploaded to zenodo). The yaml file allows a local_datset to be defined, with the same basic syntax, just you provide a path to the dataset on your local system.
This additionally drops support for the older hdf5 datafiles, now switching to those generated using the new dataset.curate module (i.e., that shifts validation to the time of creation of the dataset). These datasets provide a bit more information (specifically the property_type, i.e., length, energy, force, etc.) that is useful for performing any unit conversion that is necessary when loading. All supported datasets have been recurated and uploaded to zenodo, and yaml files updated to support new format.
To this end, the global unit system class has been moved from curate to the main module of modelforge to streamline the unit conversion/checking when loading in the datafile.
Key changes
Notable points that this PR has either accomplished or will accomplish.
Associated Issue(s)
Pull Request Checklist