epsilon value to norm function in aimnet2#391
epsilon value to norm function in aimnet2#391chrisiacovella merged 17 commits intochoderalab:mainfrom
Conversation
Codecov Report❌ Patch coverage is 🚀 New features to boost your workflow:
|
|
Running any real training tests (with forces) that would help to detect these errors is going to be too expensive for CI and require too much memory. I think I should create a separate test (that will be skipped on CI) to do these checks...they will not necessarily be that expensive to run on a machine with GPU locally. While I was seeing failure during the first epoch with the full dataset, I'm not sure if the n=1000 test set will also fail so quickly (or fail at all). I need to do some additional local tests on the unmodified version to ensure failure with the test set (so that any actual testing with the patched version is actually meaningful). Edit: The smaller subsets do not seem to fail..right now the failure in the unpatched version is only occurring when using the "full" tmqm openff dataset...I will test all the potentials using this (separately) to see if any others fail for similar reasons (i.e., dividing by zero). |
…(as this caused a very old version to be installed to work with the newest version of pint, yet those two are incompatible). Update the trainer model to specify weights_only for the calls to pytorch lightning (may need to pin this to the latest).
There was a problem hiding this comment.
Pull request overview
This PR addresses a critical bug in AIMNet2 that caused training failures when using forces, by adding numerical stability to the vector norm calculation. Additionally, it includes PyTorch Lightning compatibility updates for the weights_only parameter, updates atomic self-energies in dataset YAML files from hartree to kilojoule_per_mole units, and adds example scripts for dataset visualization and self-energy calculation.
- Fixed numerical stability issue in AIMNet2's vector contribution calculation by adding epsilon to norm computation
- Added
weights_only=Falseparameter to PyTorch Lightning's validate and test methods to address compatibility issues - Updated atomic self-energy values in dataset YAML files (spice1, spice2, phalkethoh) with unit conversions
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| modelforge/potential/aimnet2.py | Added epsilon value to torch.norm for numerical stability and NaN detection |
| modelforge/train/training.py | Added weights_only=False parameter to trainer.validate() and trainer.test() calls |
| modelforge/train/losses.py | Added scale_by_number_of_atoms=True to PerAtomChargeError initialization |
| modelforge/potential/potential.py | Added jit parameter to load_from_wandb method signature |
| modelforge/potential/featurization.py | Added commented debugging code for anomaly detection |
| modelforge/utils/misc.py | Added file path expansion for tilde and environment variables in OpenWithLock |
| modelforge/dataset/utils.py | Fixed typo in comment ("lineare" to "linear") |
| modelforge/dataset/dataset.py | Added histogram plotting for shifted energies (appears to be debug code) |
| modelforge/dataset/yaml_files/spice1.yaml | Updated atomic self-energies from hartree to kilojoule_per_mole units |
| modelforge/dataset/yaml_files/spice2.yaml | Updated atomic self-energies from hartree to kilojoule_per_mole units |
| modelforge/dataset/yaml_files/spice2_openff.yaml | Added trailing comma to Br self-energy entry |
| modelforge/dataset/yaml_files/phalkethoh.yaml | Updated atomic self-energies from hartree to kilojoule_per_mole units |
| modelforge-curate/modelforge/curate/sourcedataset.py | Added iter_records and records_names methods, file locking for HDF5 reads |
| modelforge-curate/modelforge/curate/record.py | Fixed docstring indentation, added calculate_reference_energy function |
| modelforge-curate/modelforge/curate/datasets/tmqm_openff_curation.py | Added fallback logic for molecule name parsing with underscore separator |
| modelforge-curate/modelforge/curate/datasets/yaml_files/tmqm_openff_curation.yaml | Added v0_ext_sm1 version configuration |
| scripts/self_energy_calculation.py | New script for calculating self-energies across multiple datasets |
| scripts/plot_dataset.py | New script for plotting dataset energy distributions |
| devtools/conda-envs/*.yaml | Updated openff-units version requirement to >=0.3.1 across all environments |
| devtools/conda-envs/env.yaml | Renamed environment, added multiple new dependencies for testing and JAX support |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fixing typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
renaming full environment file Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
removing commas from yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
updating file locking for reading in source datasets Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
removing unused imports Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…wever, it might be good to add in a function and/or example script that wraps this functionality (calling the torch dataset) so that it is easy for users to interrogate what happens to a dataset after it has been loaded and preprocessed.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Summary
As discussed in #390 aimnet2 was failing when training with forces. I tracked this down to be where we take the norm of the vector contributions. Adding in a small epsilon value (1e-8) seems to correct the issue.
This also includes a few additional example scripts, e.g., to calculate the self-energy of a dataset and plotting a dataset.
Note: CI was failing due to an update to pytorch lightning. See #392 . Basically, we now also need to pass weights_only to various pytorch lightning functions we are calling.
Key changes
Notable points that this PR has either accomplished or will accomplish.
Associated Issue(s)
Pull Request Checklist