Skip to content

Improved dataset control#327

Merged
chrisiacovella merged 10 commits intochoderalab:mainfrom
chrisiacovella:improved_dataset_control
Dec 5, 2024
Merged

Improved dataset control#327
chrisiacovella merged 10 commits intochoderalab:mainfrom
chrisiacovella:improved_dataset_control

Conversation

@chrisiacovella
Copy link
Copy Markdown
Member

@chrisiacovella chrisiacovella commented Dec 3, 2024

Pull Request Summary

This provides changes aimed at improving the control over datasets. Specifically, the switch to .toml files meant we could not easily toggle which properties to load from the hdf5 files nor which properties to assign to the key internal variables (energy, force, dipole_moment, etc.).

This updates the pydantic dataset model whereby users now additionally define properties of interest (as a list) and then the association of these properties to the internal variable names (i.e used by the Properties class). The pydantic model validates that anything in the Properties association also exists in the properties_of_interest list. We do not validate that the properties_of_interest are actually in the hdf5 file via the pydantic model; this is handled when we assign those parameters while calling the DataModule class. Since this is one of the earliest steps in the process, failure to properly define these should lead to a rapid failure. this saves considerable extra code in the pydantic model, and having to make separate models for each dataset.

Key changes

Notable points that this PR has either accomplished or will accomplish.

  • Allow better control over datasets via .toml files
  • Fix buffer issue when logging to wandb
  • Fix logic whereby using dipole moment in the loss required total charge as well (since dipole moment requires charges. This is fixed such that charges are calculated if dipole moment is requested, even if total charge is not in the loss).
  • bug fix to spice1_openff curation (wrong tag for scf_dipole).
  • change backend for matplotlib to prevent issues when using debugger

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 92.42424% with 5 lines in your changes missing coverage. Please review.

Project coverage is 78.83%. Comparing base (f8c7834) to head (3ac9e6c).

Additional details and impacted files

Comment thread modelforge/dataset/parameters.py
Comment thread modelforge/dataset/parameters.py
Comment thread modelforge/tests/data/config.toml
Comment thread modelforge/tests/data/config.toml
Comment thread modelforge/train/training.py
Comment thread modelforge/train/training.py Outdated
Comment thread modelforge/train/training.py
@chrisiacovella chrisiacovella merged commit f8503c2 into choderalab:main Dec 5, 2024
@chrisiacovella chrisiacovella deleted the improved_dataset_control branch May 7, 2025 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants