Dataset api by chrisiacovella · Pull Request #340 · choderalab/modelforge

chrisiacovella · 2025-02-14T07:04:55Z

Pull Request Summary

This creates an API for dataset curation that relies on pydantic to validate the data extensively at the time of dataset construction.

Note: Even though new curation scripts have been added as part of this PR, I won't remove the old ones (or replace yaml files) in this PR. That will be done in a separate PR.

Key changes

Notable points that this PR has either accomplished or will accomplish.

Creates a submodule, modelforge.curate, with the API and scripts for datasets.
Updates dataset.py to work with either old or new datasets (the new datasets change the format options).
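To illustrate what validation at construction time means, here is a minimal sketch. The PR itself uses pydantic; this sketch deliberately uses only standard-library dataclasses as a stand-in, and the class name `EnergyProperty`, its fields, and the helper `is_valid` are hypothetical, not the actual modelforge.curate API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EnergyProperty:
    """Hypothetical property container; NOT the modelforge.curate class."""

    name: str
    values: List[float]  # assumed layout: one energy per configuration
    units: str
    n_configs: int

    def __post_init__(self) -> None:
        # Reject malformed input when the object is built, not when it is used.
        if self.n_configs <= 0:
            raise ValueError("n_configs must be positive")
        if len(self.values) != self.n_configs:
            raise ValueError(
                f"{self.name}: expected {self.n_configs} values, "
                f"got {len(self.values)}"
            )
        if not all(isinstance(v, (int, float)) for v in self.values):
            raise ValueError(f"{self.name}: values must be numeric")


def is_valid(**fields) -> bool:
    """Return True if EnergyProperty accepts the given fields."""
    try:
        EnergyProperty(**fields)
        return True
    except ValueError:
        return False


ok = is_valid(name="energies", values=[-1.0, -1.1], units="hartree", n_configs=2)
wrong_length = is_valid(name="energies", values=[-1.0], units="hartree", n_configs=2)
```

With pydantic, the same checks would typically be expressed as field types and validators on a model class, so that a bad record fails as soon as it is created rather than later during training.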

To Do:

While there is a Jupyter notebook on usage, the Read the Docs documentation still needs to be generated.
Add initial tmqm-xtb datasets.

Associated Issue(s)

Dataset API #321

Pull Request Checklist

Issue(s) raised/addressed and linked
Includes appropriate unit test(s)
Appropriate docstring(s) added/updated
Appropriate .rst doc file(s) added/updated
PR is ready for review

codecov-commenter · 2025-02-14T07:06:45Z

Codecov Report

Attention: Patch coverage is 91.05012% with 75 lines in your changes missing coverage. Please review.

Project coverage is 81.19%. Comparing base (ac5a77c) to head (da87b49).

…n that takes as input the file to output and all the max number of conformers. the idea is to operate on the entirely processed dataset (which should be faster for the scripts running curation). This also simplifies the code that needs to be written

…des total_records to include, total_conformers, max_conformers_per_record, atomic_species_to_limit, and max_force. These are part of the SourceDataset and can be automatically applied when writing to the hdf5 file from the baseclass. These routines do not need to be written for each dataset.
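A rough sketch of how such limits could be applied in one pass over the records is shown below. Only the parameter names come from the commit message; the function `apply_limits`, the record layout (a dict with `atomic_numbers` and a per-configuration `config_max_forces` list), and the order in which the limits are applied are assumptions made for illustration, not the SourceDataset implementation.

```python
from typing import Dict, List, Optional, Set

Record = Dict[str, object]  # hypothetical record layout, see lead-in


def apply_limits(
    records: List[Record],
    total_records: Optional[int] = None,
    total_conformers: Optional[int] = None,
    max_conformers_per_record: Optional[int] = None,
    atomic_species_to_limit: Optional[Set[int]] = None,
    max_force: Optional[float] = None,
) -> List[Record]:
    """Return a filtered copy of records; None means 'no limit'."""
    selected: List[Record] = []
    n_conformers = 0
    for rec in records:
        if total_records is not None and len(selected) >= total_records:
            break
        if total_conformers is not None and n_conformers >= total_conformers:
            break
        # Skip records that contain any element outside the allowed set.
        if atomic_species_to_limit is not None and not (
            set(rec["atomic_numbers"]) <= atomic_species_to_limit
        ):
            continue
        forces = list(rec["config_max_forces"])
        # Drop configurations whose largest force exceeds the threshold.
        if max_force is not None:
            forces = [f for f in forces if f <= max_force]
        if max_conformers_per_record is not None:
            forces = forces[:max_conformers_per_record]
        if total_conformers is not None:
            forces = forces[: total_conformers - n_conformers]
        if not forces:
            continue
        selected.append({**rec, "config_max_forces": forces})
        n_conformers += len(forces)
    return selected


records = [
    {"name": "a", "atomic_numbers": [1, 6], "config_max_forces": [0.1, 5.0, 0.2]},
    {"name": "b", "atomic_numbers": [1, 26], "config_max_forces": [0.1]},
    {"name": "c", "atomic_numbers": [1, 8], "config_max_forces": [0.3, 0.4]},
]
limited = apply_limits(
    records,
    total_conformers=3,
    max_conformers_per_record=2,
    atomic_species_to_limit={1, 6, 8},
    max_force=1.0,
)
```

Keeping this logic in one shared routine, as the commit describes for the base class, means individual dataset scripts do not each need their own filtering code.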

chrisiacovella · 2025-03-13T23:16:49Z

Note that this also addresses Issue #342, in which the atomic self energy regression only worked when n_configs = 1 for a dataset.
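As an illustration of why the row count matters, here is a minimal sketch of a self-energy fit. It assumes the regression is an ordinary least-squares fit of total energies against per-element atom counts, which this page does not confirm. The function `fit_self_energies` and its input layout are hypothetical. The key point is that every configuration contributes one row to the fit, so a molecule with several configurations contributes several rows.

```python
from typing import Dict, List, Tuple

# (element counts, energies of each configuration) for one molecule
Molecule = Tuple[Dict[int, int], List[float]]


def fit_self_energies(molecules: List[Molecule]) -> Dict[int, float]:
    """Least-squares fit of E ~ sum_z n_z * e_z via the normal equations."""
    elements = sorted({z for counts, _ in molecules for z in counts})
    k = len(elements)
    ata = [[0.0] * k for _ in range(k)]
    atb = [0.0] * k
    for counts, energies in molecules:
        row = [float(counts.get(z, 0)) for z in elements]
        # One row per configuration, not one row per molecule.
        for energy in energies:
            for i in range(k):
                atb[i] += row[i] * energy
                for j in range(k):
                    ata[i][j] += row[i] * row[j]
    # Solve (A^T A) x = A^T b by Gauss-Jordan elimination with pivoting.
    aug = [ata[i] + [atb[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("compositions do not determine all self energies")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return {z: aug[i][k] / aug[i][i] for i, z in enumerate(elements)}


# Synthetic data: a water-like molecule with two configurations and H2 with one.
molecules = [
    ({1: 2, 8: 1}, [-76.0, -76.0]),
    ({1: 2}, [-1.0]),
]
self_energies = fit_self_energies(molecules)
```

For a dataset such as qm9, where each molecule has a single configuration, counting molecules and counting configurations give the same number of rows, which is consistent with the bug going unnoticed there.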

chrisiacovella added 8 commits February 6, 2025 20:08

reorganize curation module into separate files for clarity. Refactor …

e8689f5

…naming. Add in per-atom versions of dipole, quadrupoles and octupoles

updated baseclass and ani2xcuration

da48713

updated phalkethoh curation

bf28890

updated qm9 curation and properties.py

9aedc09

updated curation scripts to auto generate a summary file. All updated…

6eb0b75

… to the changes to the api

updated tmqm to new structure of api.

78c52e8

updated tests to reflect additional changes and additions.

20610e2

added curate CI

a825d3e

chrisiacovella added 21 commits February 13, 2025 23:08

Merge branch 'main' into dataset_api

b2f5790

renamed curate CI yaml and other updates to yaml

daa4def

updating CI yaml files codecov settings

b516831

increased code coverage of tests, primarily related to catching edge …

1e8fdcf

…cases and failures

Added more unit testing; split up unit testing files; added additiona…

554a33c

…l validation for several properties

Added more unit testing; split up unit testing files; added additiona…

c9cb6de

…l validation for several properties

reformatting some value error messages in properties.py

14e4bc7

added in testing for dipole moment computation for baseclass. this co…

1a592ae

…mpares against scf dipole for a molecule from spice2

added routines useful for generating test datasets (i.e., fixed numbe…

e2c8dd1

…r of conformers) to the sourcedataset, so they can be easily and uniformly used in the codes.

Added functionality to limit atomic species in a general way (functio…

90c686c

…nality to check species added to record, and sourcedataset function to return subset of records that match). Also added to functions that limit configurations.

updating curation classes to use new, unified functions/approach. Add…

914ab3e

…ing in routine to record to remove high force configurations.

Revamped SourceDataset to store records in an sqlite database rather …

218adc1

…than python memory.
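A minimal sketch of the idea behind this commit, keeping records in an sqlite database instead of an in-memory Python container, could look like the following. The class `RecordStore`, its table schema, and the JSON serialization are assumptions for illustration only; the actual SourceDataset storage format is not shown on this page.

```python
import json
import sqlite3
from typing import Dict, Iterator, Tuple


class RecordStore:
    """Hypothetical record store; not the actual SourceDataset implementation."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to keep records on disk.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(name TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def add(self, name: str, record: Dict) -> None:
        # The connection context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO records VALUES (?, ?)",
                (name, json.dumps(record)),
            )

    def get(self, name: str) -> Dict:
        row = self._conn.execute(
            "SELECT payload FROM records WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return json.loads(row[0])

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]

    def __iter__(self) -> Iterator[Tuple[str, Dict]]:
        # Records are read back one at a time rather than held in a list.
        query = "SELECT name, payload FROM records ORDER BY name"
        for name, payload in self._conn.execute(query):
            yield name, json.loads(payload)


store = RecordStore()
store.add("mol1", {"atomic_numbers": [1, 1], "energies": [-1.0]})
store.add("mol2", {"atomic_numbers": [8, 1, 1], "energies": [-76.0, -76.1]})
```

Only the record currently being read or written needs to be held in memory, which is the property the commit message points to.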

Updating tests and scripts to handle db bag end. Small revamps to spe…

664080c

…ed up curation

More refactoring for consistent naming. curation baseclass .to_hdf5 n…

ef5f544

…ow accepts the max_force_key in case a different name is used for the forces. tests for this added.

Updating scripts.

a14551a

Fixing back refactoring.

355c979

additional tests for increased coverage

420d97a

additional tests for increased coverage

495091e

Additional testing and validation added. bugs fixed due to naming cha…

b074e1c

…nge.

chrisiacovella added 19 commits March 6, 2025 21:23

Ensure that classification and property_type are not changed.

2284b6a

updated notebooks; added print function to print individual records.

4edf1be

Adding jupyter notebooks to docs.

fc22f89

Fix linting issues.

77b2da0

fixing docs issue

984799c

needed to update black to fix linting complaint.

9e97867

trying to fix readthedocs

4d29003

fix linting

8c65ab6

fixing readthedocs

3954ac1

left out colon in .yaml file

6c52bb6

trying to fix readthedocs

60dc424

trying to fix readthedocs

52da2d2

trying to fix readthedocs

c769c83

trying to fix readthedocs

bd8c89d

trying to fix readthedocs

00c77ec

trying to fix readthedocs

b2daaca

trying to fix readthedocs

52f3a91

adding missing notebooks

d7406c1

fixing minor warnings

44a872d

chrisiacovella requested a review from MarshallYan March 10, 2025 17:22

MarshallYan requested changes Mar 11, 2025


Comment thread .readthedocs.yaml

Comment thread modelforge-curate/modelforge/curate/datasets/qm9_curation.py Outdated

Comment thread modelforge-curate/modelforge/curate/utils.py

chrisiacovella added 5 commits March 12, 2025 19:40

tmqm-xtb curation/datafiles

e5f8195

changing datamodule setup to store torch_dataset locally, rather than…

5d2d2c7

… reading from file if prepare_dataset has been called. Also, explicitly state weights_only=False, as that is now necessary.

fix bug in atomic self energy regression; it used number of molecules…

69612e3

…, not number of configurations (initially was implemented/tested for qm9, where those are the same). added self energies for tmqm xtb

fixing CI tests

feb3c38

added single configuration dataset for tmqm xtb

da87b49

chrisiacovella merged commit 6753333 into choderalab:main Mar 13, 2025
14 checks passed

chrisiacovella deleted the dataset_api branch May 7, 2025 16:53

chrisiacovella merged 53 commits into choderalab:main from chrisiacovella:dataset_api


3 participants
