Skip to content

Dataset.py refactoring#358

Merged
chrisiacovella merged 61 commits intochoderalab:mainfrom
chrisiacovella:ref-dataset
Jun 3, 2025
Merged

Dataset.py refactoring#358
chrisiacovella merged 61 commits intochoderalab:mainfrom
chrisiacovella:ref-dataset

Conversation

@chrisiacovella
Copy link
Copy Markdown
Member

@chrisiacovella chrisiacovella commented May 27, 2025

Pull Request Summary

The general idea here is to refactor the way we handle datasets such that we no longer are required to create a unique child class of the HDF5Dataset class for each dataset. Rather than a unique class for each dataset, the code now relies upon yaml files that contain the key information (available properties, atomic self energies, etc.). These yaml files still contain the various versions of the dataset and link/doi to fetch the dataset. Note, this switch removes default properties of interest, which is keeping with the philosophy of requiring users to define all inputs. (note, this however, broke all the tests that relied on this implicit information).

We still maintain a list of "built-in" datasets, but this structure now allows us to support "local" datasets (i.e., those we haven't uploaded to zenodo). The yaml file allows a local_datset to be defined, with the same basic syntax, just you provide a path to the dataset on your local system.

This additionally drops support for the older hdf5 datafiles, now switching to those generated using the new dataset.curate module (i.e., that shifts validation to the time of creation of the dataset). These datasets provide a bit more information (specifically the property_type, i.e., length, energy, force, etc.) that is useful for performing any unit conversion that is necessary when loading. All supported datasets have been recurated and uploaded to zenodo, and yaml files updated to support new format.

To this end, the global unit system class has been moved from curate to the main module of modelforge to streamline the unit conversion/checking when loading in the datafile.

Key changes

Notable points that this PR has either accomplished or will accomplish.

  • Set up curate scripts for a few additional datasets (ani1x, spice1 openff, spice2 openff)
  • regenerate and upload all new versions of the datasets to zenodo
  • switch all public facing calls that require units to using the global units (e.g., so a cutoff is converted to whatever unit is in the global units. this will allow us to presumably experiment with different unit systems and their impact on convergence)
  • add in helper functions that call the dataset .toml files (in tests/data/datasets) to define default properties of interests and properties associations for testing (previously these defaults were just part of the dataset class). This also makes it such that fewer pieces of code need to be updated if we want to test with a different version.
  • Update all the tests to work with the new scheme
  • Add in a class to make it easier to generate the version information that goes into a yaml file.

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

…5dataset. moved units from curate to modelforge.utils.units
…t factory as it no longer really serves a purpose (DataModule, required by pytorch lightning, basically functions in the same way). This also include a separate dataset_cache_dir to save datasets, separate from the local_cache_dir. local_cache_dir should be unique for each run.
… read data toml files as we require properties of interest and properties assignment now (no longer any defaults).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the dataset handling workflow by removing dataset‐specific child classes in favor of YAML–configured datasets, while also updating various curation scripts, documentation, and CI workflows.

  • Integrates a VersionMetadata class for automated metadata creation in the curation scripts
  • Updates dataset scripts to support local dataset definitions and new dataset versions
  • Revises documentation and CI configurations to include Python 3.12 and improve instructions

Reviewed Changes

Copilot reviewed 142 out of 142 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
modelforge-curate/curate/datasets/scripts/curate_qm9.py Updated version numbers and metadata generation for full and subset datasets
modelforge-curate/curate/datasets/scripts/curate_fe_II.py Renamed version variable; adjusted output directories; updated metadata sections
modelforge-curate/curate/datasets/scripts/curate_ani2x.py Updated version number and metadata creation; simplified comments
modelforge-curate/curate/datasets/scripts/curate_ani1x.py New script to curate the ANI-1x dataset with VersionMetadata integration
modelforge-curate/curate/datasets/scripts/curate_PhAlkEthOH.py Updated output paths and metadata blocks for multiple dataset subsets
modelforge-curate/curate/datasets/qm9_curation.py Added new energy_of_lumo property parsing for QM9 records
modelforge-curate/curate/datasets/curation_baseclass.py Fixed module import to new utils.units location
docs/, .github/workflows/, MANIFEST.in Documentation and CI updates including Python 3.12 support and minor text improvements
Comments suppressed due to low confidence (1)

modelforge-curate/modelforge/curate/datasets/scripts/curate_ani2x.py:117

  • [nitpick] Consider simplifying the printed message to 'full dataset' to avoid redundancy.
print("full dataset dataset")

Comment thread modelforge-curate/modelforge/curate/datasets/scripts/curate_fe_II.py Outdated
…_II.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment thread modelforge/custom_types.py
@chrisiacovella chrisiacovella merged commit 6d3272c into choderalab:main Jun 3, 2025
17 checks passed
@chrisiacovella chrisiacovella deleted the ref-dataset branch August 27, 2025 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants