Skip to content

epsilon value to norm function in aimnet2#391

Merged
chrisiacovella merged 17 commits intochoderalab:mainfrom
chrisiacovella:dev-helper_scripts
Jan 12, 2026
Merged

epsilon value to norm function in aimnet2#391
chrisiacovella merged 17 commits intochoderalab:mainfrom
chrisiacovella:dev-helper_scripts

Conversation

@chrisiacovella
Copy link
Copy Markdown
Member

@chrisiacovella chrisiacovella commented Jan 7, 2026

Pull Request Summary

As discussed in #390 aimnet2 was failing when training with forces. I tracked this down to be where we take the norm of the vector contributions. Adding in a small epsilon value (1e-8) seems to correct the issue.

This also includes a few additional example scripts, e.g., to calculate the self-energy of a dataset and plotting a dataset.

Note: CI was failing due to an update to pytorch lightning. See #392 . Basically, we now also need to pass weights_only to various pytorch lightning functions we are calling.

Key changes

Notable points that this PR has either accomplished or will accomplish.

  • fix issue causing crashes when training with force

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 77.35849% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.84%. Comparing base (8995841) to head (6719d6d).

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chrisiacovella
Copy link
Copy Markdown
Member Author

chrisiacovella commented Jan 7, 2026

Running any real training tests (with forces) that would help to detect these errors is going to be too expensive for CI and require too much memory. I think I should create a separate test (that will be skipped on CI) to do these checks...they will not necessarily be that expensive to run on a machine with GPU locally. While I was seeing failure during the first epoch with the full dataset, I'm not sure if the n=1000 test set will also fail so quickly (or fail at all). I need to do some additional local tests on the unmodified version to ensure failure with the test set (so that any actual testing with the patched version is actually meaningful).

Edit: The smaller subsets do not seem to fail..right now the failure in the unpatched version is only occurring when using the "full" tmqm openff dataset...I will test all the potentials using this (separately) to see if any others fail for similar reasons (i.e., dividing by zero).

…(as this caused a very old version to be installed to work with the newest version of pint, yet those two are incompatible). Update the trainer model to specify weights_only for the calls to pytorch lightning (may need to pin this to the latest).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a critical bug in AIMNet2 that caused training failures when using forces, by adding numerical stability to the vector norm calculation. Additionally, it includes PyTorch Lightning compatibility updates for the weights_only parameter, updates atomic self-energies in dataset YAML files from hartree to kilojoule_per_mole units, and adds example scripts for dataset visualization and self-energy calculation.

  • Fixed numerical stability issue in AIMNet2's vector contribution calculation by adding epsilon to norm computation
  • Added weights_only=False parameter to PyTorch Lightning's validate and test methods to address compatibility issues
  • Updated atomic self-energy values in dataset YAML files (spice1, spice2, phalkethoh) with unit conversions

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
modelforge/potential/aimnet2.py Added epsilon value to torch.norm for numerical stability and NaN detection
modelforge/train/training.py Added weights_only=False parameter to trainer.validate() and trainer.test() calls
modelforge/train/losses.py Added scale_by_number_of_atoms=True to PerAtomChargeError initialization
modelforge/potential/potential.py Added jit parameter to load_from_wandb method signature
modelforge/potential/featurization.py Added commented debugging code for anomaly detection
modelforge/utils/misc.py Added file path expansion for tilde and environment variables in OpenWithLock
modelforge/dataset/utils.py Fixed typo in comment ("lineare" to "linear")
modelforge/dataset/dataset.py Added histogram plotting for shifted energies (appears to be debug code)
modelforge/dataset/yaml_files/spice1.yaml Updated atomic self-energies from hartree to kilojoule_per_mole units
modelforge/dataset/yaml_files/spice2.yaml Updated atomic self-energies from hartree to kilojoule_per_mole units
modelforge/dataset/yaml_files/spice2_openff.yaml Added trailing comma to Br self-energy entry
modelforge/dataset/yaml_files/phalkethoh.yaml Updated atomic self-energies from hartree to kilojoule_per_mole units
modelforge-curate/modelforge/curate/sourcedataset.py Added iter_records and records_names methods, file locking for HDF5 reads
modelforge-curate/modelforge/curate/record.py Fixed docstring indentation, added calculate_reference_energy function
modelforge-curate/modelforge/curate/datasets/tmqm_openff_curation.py Added fallback logic for molecule name parsing with underscore separator
modelforge-curate/modelforge/curate/datasets/yaml_files/tmqm_openff_curation.yaml Added v0_ext_sm1 version configuration
scripts/self_energy_calculation.py New script for calculating self-energies across multiple datasets
scripts/plot_dataset.py New script for plotting dataset energy distributions
devtools/conda-envs/*.yaml Updated openff-units version requirement to >=0.3.1 across all environments
devtools/conda-envs/env.yaml Renamed environment, added multiple new dependencies for testing and JAX support

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/plot_dataset.py
Comment thread modelforge/dataset/dataset.py Outdated
Comment thread devtools/conda-envs/env.yaml Outdated
Comment thread devtools/conda-envs/env.yaml
Comment thread modelforge/dataset/dataset.py Outdated
Comment thread modelforge/dataset/dataset.py Outdated
Comment thread modelforge/dataset/yaml_files/phalkethoh.yaml Outdated
Comment thread modelforge-curate/modelforge/curate/sourcedataset.py Outdated
Comment thread scripts/plot_dataset.py
Comment thread scripts/plot_dataset.py Outdated
chrisiacovella and others added 14 commits January 8, 2026 08:08
fixing typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
renaming full environment file

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
removing commas from yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
updating file locking for reading in source datasets

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
removing unused imports

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…wever, it might be good to add in a function and/or example script that wraps this functionality (calling the torch dataset) so that it is easy for users to interrogate what happens to a dataset after it has been loaded and preprocessed.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chrisiacovella chrisiacovella merged commit 0f4ed92 into choderalab:main Jan 12, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants