Dataset.py refactoring by chrisiacovella · Pull Request #358 · choderalab/modelforge

chrisiacovella · 2025-05-27T06:49:17Z

Pull Request Summary

The general idea here is to refactor how we handle datasets so that we are no longer required to create a unique child class of the HDF5Dataset class for each dataset. Instead of a unique class per dataset, the code now relies on yaml files that contain the key information (available properties, atomic self energies, etc.). These yaml files still list the various versions of the dataset and the link/DOI used to fetch each one. Note that this switch removes the default properties of interest, which is in keeping with the philosophy of requiring users to define all inputs. (This did, however, break all the tests that relied on that implicit information.)

We still maintain a list of "built-in" datasets, but this structure now also lets us support "local" datasets (i.e., those we haven't uploaded to zenodo). The yaml file allows a local_dataset to be defined with the same basic syntax, except that you provide a path to the dataset on your local system.
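
As a rough illustration of the idea (not the code in this PR), the sketch below shows how one generic function can serve both built-in and local datasets once the dataset-specific details live in configuration rather than in child classes. The dictionaries stand in for what a yaml parser would return; every key name, URL, path, and the `resolve_source` helper are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any

# Stand-ins for what a yaml parser would return for two dataset files.
# All key names and values are hypothetical placeholders, not the real schema.
BUILT_IN: dict[str, Any] = {
    "dataset": "example_dataset",
    "available_properties": ["atomic_numbers", "positions", "total_energy"],
    "versions": {
        "full_v1": {"remote": "https://example.org/example_dataset_full_v1.hdf5.gz"},
    },
}

LOCAL: dict[str, Any] = {
    "dataset": "my_local_dataset",
    "available_properties": ["atomic_numbers", "positions", "total_energy"],
    "versions": {
        "full_v1": {"local_path": "/data/my_local_dataset_full_v1.hdf5"},
    },
}


@dataclass(frozen=True)
class DatasetSource:
    name: str
    version: str
    location: str
    is_local: bool
    properties: tuple[str, ...]


def resolve_source(
    config: dict[str, Any], version: str, properties_of_interest: list[str]
) -> DatasetSource:
    """Pick a dataset version and validate the user-supplied properties.

    There are no default properties: the caller must always pass them in.
    """
    if version not in config["versions"]:
        raise KeyError(f"unknown version {version!r} for {config['dataset']}")

    missing = [p for p in properties_of_interest if p not in config["available_properties"]]
    if missing:
        raise ValueError(f"properties not available in {config['dataset']}: {missing}")

    entry = config["versions"][version]
    if "local_path" in entry:
        # Local dataset: same syntax, but the entry points at a file on disk.
        location, is_local = str(Path(entry["local_path"])), True
    else:
        # Built-in dataset: the entry points at a remote archive to download.
        location, is_local = entry["remote"], False

    return DatasetSource(
        config["dataset"], version, location, is_local, tuple(properties_of_interest)
    )


remote = resolve_source(BUILT_IN, "full_v1", ["positions", "total_energy"])
local = resolve_source(LOCAL, "full_v1", ["positions", "total_energy"])
print(remote)
print(local)
```

The point of the design is that adding a new dataset means adding a new configuration file, not a new class.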

This additionally drops support for the older hdf5 datafiles, switching to those generated with the new dataset.curate module (which shifts validation to the time the dataset is created). These datasets provide a bit more information (specifically the property_type, i.e., length, energy, force, etc.) that is useful for performing any unit conversion that is necessary when loading. All supported datasets have been recurated and uploaded to zenodo, and the yaml files have been updated to support the new format.

To this end, the global unit system class has been moved from curate to the main module of modelforge to streamline unit conversion/checking when loading the datafile.
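
To make the role of property_type concrete, here is a minimal sketch of a global unit system that converts a value based on its property type. It uses plain floats and a small hand-written factor table rather than a units library, and the class name, method names, and unit strings are all assumptions made for illustration; the actual class in modelforge may look quite different.

```python
# Conversion factors to one reference unit per property type.
# The hartree factor is approximate (1 hartree is about 2625.4996 kJ/mol).
_TO_REFERENCE = {
    "length": {"nanometer": 1.0, "angstrom": 0.1},
    "energy": {
        "kilojoule_per_mole": 1.0,
        "kilocalorie_per_mole": 4.184,
        "hartree": 2625.4996,
    },
}


class GlobalUnitSystem:
    """Hypothetical stand-in for a global unit system keyed by property type."""

    def __init__(self, length: str = "nanometer", energy: str = "kilojoule_per_mole"):
        self._targets = {"length": length, "energy": energy}

    def target_unit(self, property_type: str) -> str:
        return self._targets[property_type]

    def convert(self, value: float, unit: str, property_type: str) -> float:
        """Convert `value` from `unit` into the global unit for `property_type`."""
        factors = _TO_REFERENCE[property_type]
        if unit not in factors:
            raise ValueError(f"{unit!r} is not a known {property_type} unit")
        target = self._targets[property_type]
        # Go through the reference unit: source -> reference -> target.
        return value * factors[unit] / factors[target]


units = GlobalUnitSystem()

# A user-supplied cutoff in angstrom ends up in whatever the global length unit is.
cutoff = units.convert(5.0, "angstrom", "length")
print(cutoff, units.target_unit("length"))

# Switching the global unit system changes the result without touching call sites.
angstrom_units = GlobalUnitSystem(length="angstrom")
print(angstrom_units.convert(0.5, "nanometer", "length"), angstrom_units.target_unit("length"))
```

Because each stored property carries its property_type, the loader only needs this one lookup to know which target unit applies.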

Key changes

Notable points that this PR has either accomplished or will accomplish.

Set up curate scripts for a few additional datasets (ani1x, spice1 openff, spice2 openff).
Regenerate and upload new versions of all the datasets to zenodo.
Switch all public-facing calls that require units to use the global units (e.g., a cutoff is converted to whatever unit is set in the global unit system; this should allow us to experiment with different unit systems and their impact on convergence).
Add helper functions that read the dataset .toml files (in tests/data/datasets) to define default properties of interest and property associations for testing (previously these defaults were just part of the dataset class). This also means fewer pieces of code need to be updated if we want to test with a different version.
Update all the tests to work with the new scheme.
Add a class to make it easier to generate the version information that goes into a yaml file.

Associated Issue(s)

Revise dataset.py structure #349

Pull Request Checklist

Issue(s) raised/addressed and linked
Includes appropriate unit test(s)
Appropriate docstring(s) added/updated
Appropriate .rst doc file(s) added/updated
PR is ready for review

…t factory as it no longer really serves a purpose (DataModule, required by pytorch lightning, basically functions in the same way). This also include a separate dataset_cache_dir to save datasets, separate from the local_cache_dir. local_cache_dir should be unique for each run.

Copilot

Pull Request Overview

This PR refactors the dataset handling workflow by removing dataset-specific child classes in favor of YAML-configured datasets, while also updating various curation scripts, documentation, and CI workflows.

Integrates a VersionMetadata class for automated metadata creation in the curation scripts
Updates dataset scripts to support local dataset definitions and new dataset versions
Revises documentation and CI configurations to include Python 3.12 and improve instructions
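
For readers unfamiliar with what automated metadata creation might involve, the following is a minimal sketch, assuming the version entry records file names, checksums, and the size of a gzipped archive. The function names and the layout of the returned dictionary are hypothetical; this is not the VersionMetadata class from the PR.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_file(path: Path) -> Path:
    """Write a gzipped copy next to the input file and return its path."""
    gz_path = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def version_entry(version_name: str, hdf5_path: Path, about: str) -> dict:
    """Build a dictionary that could be dumped into a dataset yaml file.

    The key layout here is a guess made for illustration only.
    """
    gz_path = gzip_file(hdf5_path)
    return {
        version_name: {
            "about": about,
            "hdf5_file": {"file_name": hdf5_path.name, "md5": md5_of(hdf5_path)},
            "gzipped_file": {
                "file_name": gz_path.name,
                "md5": md5_of(gz_path),
                "length": gz_path.stat().st_size,
            },
        }
    }


# Usage with a throwaway file; the bytes are a placeholder, not a real hdf5 file.
payload = b"placeholder bytes, not a real hdf5 file"
with tempfile.TemporaryDirectory() as tmp:
    hdf5_path = Path(tmp) / "example_dataset_v0.hdf5"
    hdf5_path.write_bytes(payload)
    entry = version_entry("v0", hdf5_path, about="Toy entry for illustration")
print(entry)
```

Generating this block from the files themselves avoids copying checksums and sizes into the yaml by hand.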

Reviewed Changes

Copilot reviewed 142 out of 142 changed files in this pull request and generated 1 comment.

File	Description
modelforge-curate/curate/datasets/scripts/curate_qm9.py	Updated version numbers and metadata generation for full and subset datasets
modelforge-curate/curate/datasets/scripts/curate_fe_II.py	Renamed version variable; adjusted output directories; updated metadata sections
modelforge-curate/curate/datasets/scripts/curate_ani2x.py	Updated version number and metadata creation; simplified comments
modelforge-curate/curate/datasets/scripts/curate_ani1x.py	New script to curate the ANI-1x dataset with VersionMetadata integration
modelforge-curate/curate/datasets/scripts/curate_PhAlkEthOH.py	Updated output paths and metadata blocks for multiple dataset subsets
modelforge-curate/curate/datasets/qm9_curation.py	Added new energy_of_lumo property parsing for QM9 records
modelforge-curate/curate/datasets/curation_baseclass.py	Fixed module import to new utils.units location
docs/, .github/workflows/, MANIFEST.in	Documentation and CI updates including Python 3.12 support and minor text improvements

Comments suppressed due to low confidence (1)

modelforge-curate/modelforge/curate/datasets/scripts/curate_ani2x.py:117

[nitpick] Consider simplifying the printed message to 'full dataset' to avoid redundancy.

print("full dataset dataset")

chrisiacovella added 30 commits May 1, 2025 15:31

Lots of changes; initial mockup of new yaml file, restructring of HDF…

40c97a6

…5dataset. moved units from curate to modelforge.utils.units

restructuring of from_hdf5.

bcbdeb5

Adding in new tmqm datasets

40e231f

Adding in new new ani2x.yaml

ec7273a

fe_II.yaml updated for new format.

fcc61c5

adding in gzipping function to make automation of yaml file

76b08ef

adding in gzipping function to make automation of yaml file

39eb78d

adding tests for metadata

ce2655a

tests for version metadata creation

409ef61

adding in additional curation scripts now that automatically generate…

22428d3

… the key info for the yaml files.

Merge remote-tracking branch 'origin/main' into ref-dataset

535f459

updating metadata generation class to minimize work by the user.

6cd4eb2

setting up qm9 dataset for new version

001b0c4

updating curation scripts to use metadata generation.

9c41b0c

modifying spice1 and 2 curation scripts

a5fd0d9

adding in PhAlkEthOH.yaml dataset for the new scheme

657349a

adding in spice 1 and 2 openff curation.

c4de076

ani1x curation scripts and yaml updated. spice2.yaml updated. spice1 …

05a5ed7

…openff code updated.

adding in spice1 openff yaml (and updated curation code).

67bb572

spice 2 openff yaml and datasets uploaded; still need to update self …

956e7cf

…energies for openff datasets

remove "unit." prefix in yaml files

3bd4279

adding correct self energies to spice1openff.yaml

bc22f31

adding correct self energies to spice2openff.yaml

6e60ba0

updating descriptions in spice yaml files.

4e7e976

fixing qm9 test .toml for ani tests

8d91407

fixing up toml and yaml files. changing structure of batch helpers to…

099967e

… read data toml files as we require properties of interest and properties assignment now (no longer any defaults).

fixing test_dataset.py to work with new scheme (including new pytest …

2700409

…fixtures)

fixed test_Dataset_Generation

12b2687

fixing test_energy processing

445e69f

chrisiacovella added 14 commits May 28, 2025 17:03

Updating everything to use the global unit system rather than hardcod…

f99d987

…ing in conversions. mostly applies to adding energy units to dataset statistics and converting cutoffs to consistent units. partially updated documentation

updating documentation.

811472d

fixing lightning test...some how missed this.

9ae200d

fixed typo in docs

0bbd3ba

further updating documentation.

282a75d

further updating documentation.

7380311

turning on python 3.12 to see if things pass/fail

96c7e29

turning on python 3.12 to see if things pass/fail

332e9ee

updated test_unzip_file. for some reason resources.files is returning…

20f2f06

… MultiplexedPath instead of path

added in helper function to deal with MultiplexPath objects

04295b3

added in helper function to deal with MultiplexPath objects

9b2baf2

added in helper function to deal with MultiplexPath objects

2ac8020

added in helper function to deal with MultiplexPath objects

8a5dd02

missed one change for getting path

79f2599

chrisiacovella requested review from MarshallYan and Copilot May 30, 2025 07:34

Copilot AI reviewed May 30, 2025

Comment thread modelforge-curate/modelforge/curate/datasets/scripts/curate_fe_II.py Outdated

Update modelforge-curate/modelforge/curate/datasets/scripts/curate_fe…

a05beae

…_II.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

MarshallYan approved these changes May 30, 2025

Comment thread modelforge/custom_types.py

chrisiacovella added 6 commits May 31, 2025 00:47

adding in curation for openff tmqm prelim dataset

2f261de

adding in tests for additional properties.

3b23316

compute history isn't available for some records.

4ff5a8f

fixed typo with spice2 HCNOF subset (it was the full dataset)

bbe7fc6

fixed typo in yaml

6414594

local_yaml file was not being passed to one of the setup_datamodule h…

15269ce

…elper

chrisiacovella merged commit 6d3272c into choderalab:main Jun 3, 2025
17 checks passed

This was referenced Jun 3, 2025

Python 3.12 and 3.13 support #355

Closed

Revise dataset.py structure #349

Closed

chrisiacovella mentioned this pull request Jun 27, 2025

tmQM dataset dipole moment property error #356

Closed

chrisiacovella deleted the ref-dataset branch August 27, 2025 18:27

chrisiacovella merged 61 commits into choderalab:main from chrisiacovella:ref-dataset

4 participants
