Dataset api #340

Merged: chrisiacovella merged 53 commits into choderalab:main from chrisiacovella:dataset_api on Mar 13, 2025

chrisiacovella (Member) commented Feb 14, 2025

Pull Request Summary

This PR creates an API for dataset curation that relies on pydantic to validate data extensively at dataset construction time.
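As a rough illustration of validation at construction time, here is a minimal sketch using pydantic. The RecordSketch model, its fields, and the try_build helper are hypothetical names invented for this example; they are not the classes in modelforge.curate.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class RecordSketch(BaseModel):
    """Hypothetical record model; NOT the actual modelforge.curate class."""

    name: str
    # Field constraints are checked when the object is constructed.
    n_configs: int = Field(..., gt=0)
    n_atoms: int = Field(..., gt=0)


def try_build(**kwargs) -> Optional[RecordSketch]:
    """Return a validated record, or None if validation fails."""
    try:
        return RecordSketch(**kwargs)
    except ValidationError as err:
        # pydantic reports which field failed and why.
        print(f"rejected: {err}")
        return None


good = try_build(name="mol1", n_configs=2, n_atoms=5)
bad = try_build(name="mol2", n_configs=0, n_atoms=5)  # n_configs must be > 0
```

The point of the pattern is that a malformed record fails when it is created, rather than later when the dataset is written or loaded.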

Note: Even though new curation scripts have been added as part of this PR, I won't remove the old ones (or replace yaml files) in this PR. That will be done in a separate PR.

Key changes

Notable points that this PR has either accomplished or will accomplish.

  • Creates a submodule modelforge.curate with the API and scripts for datasets.
  • Updates dataset.py to work with either old or new datasets (new datasets change format options).

To Do:

  • A Jupyter notebook covers usage, but the Read the Docs documentation still needs to be generated
  • Add initial tmqm-xtb datasets

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

codecov-commenter commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 91.05012% with 75 lines in your changes missing coverage. Please review.

Project coverage is 81.19%. Comparing base (ac5a77c) to head (da87b49).


…mpares against scf dipole for a molecule from spice2
…r of conformers) to the sourcedataset, so they can be easily and uniformly used in the codes.
…n that takes as input the file to output and all the max number of conformers. the idea is to operate on the entirely processed dataset (which should be faster for the scripts running curation). This also simplifies the code that needs to be written
…nality to check species added to record, and sourcedataset function to return subset of records that match). Also added to functions that limit configurations.
…ing in routine to record to remove high force configurations.
…des total_records to include, total_conformers, max_conformers_per_record, atomic_species_to_limit, and max_force. These are part of the SourceDataset and can be automatically applied when writing to the hdf5 file from the baseclass. These routines do not need to be written for each dataset.
…ow accepts the max_force_key in case a different name is used for the forces. tests for this added.
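The limits named in the commit messages above (total_records, total_conformers, max_conformers_per_record, atomic_species_to_limit, max_force) can be pictured with a small standard-library sketch. The RecordSketch class and the apply_limits function are hypothetical, and the exact semantics are assumptions: here a record is kept only if all of its species are in the allowed set, and conformers whose largest force magnitude exceeds max_force are dropped. The real SourceDataset logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RecordSketch:
    """Hypothetical stand-in for a curated record (not the modelforge class)."""

    name: str
    species: Set[str]                     # element symbols in the molecule
    max_force_per_conformer: List[float]  # largest |force| in each conformer


def apply_limits(
    records: List[RecordSketch],
    total_records: Optional[int] = None,
    total_conformers: Optional[int] = None,
    max_conformers_per_record: Optional[int] = None,
    atomic_species_to_limit: Optional[Set[str]] = None,
    max_force: Optional[float] = None,
) -> List[RecordSketch]:
    """Return a filtered copy of records; None means 'no limit'."""
    kept: List[RecordSketch] = []
    n_conformers = 0
    for rec in records:
        if total_records is not None and len(kept) >= total_records:
            break
        if total_conformers is not None and n_conformers >= total_conformers:
            break
        # Assumed semantics: every species in the record must be allowed.
        if (
            atomic_species_to_limit is not None
            and not rec.species <= atomic_species_to_limit
        ):
            continue
        forces = rec.max_force_per_conformer
        if max_force is not None:
            # Drop high-force conformers.
            forces = [f for f in forces if f <= max_force]
        if max_conformers_per_record is not None:
            forces = forces[:max_conformers_per_record]
        if total_conformers is not None:
            # Do not exceed the overall conformer budget.
            forces = forces[: total_conformers - n_conformers]
        if not forces:
            continue
        kept.append(RecordSketch(rec.name, set(rec.species), list(forces)))
        n_conformers += len(forces)
    return kept
```

Applying the limits in one place, on the already processed dataset, is what lets individual curation scripts avoid reimplementing them.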
… reading from file if prepare_dataset has been called. Also, explicitly state weights_only=False, as that is now necessary.
…, not number of configurations (initially was implemented/tested for qm9, where those are the same). added self energies for tmqm xtb
chrisiacovella (Member, Author) commented

Note that this also addresses Issue #342, in which the atomic self-energy regression only worked when n_configs = 1 for a dataset.

chrisiacovella merged commit 6753333 into choderalab:main on Mar 13, 2025
14 checks passed
chrisiacovella deleted the dataset_api branch on May 7, 2025 16:53