Merged
…urning a bool of the status (make it easier to fix errors in an interactive setting).
…stem in the jupyter notebook
…jupyter notebook .
…ization of a Record. Added tests.
…ystem rather than having to worry about passing around units to every function. This might be slightly non-ideal (I generally do not like global variables), but I think for the case of curation it might make the most sense given the hierarchical nature of construction and the need to validate at the level of properties.
The NPZ file generated from the dataset is somewhat dynamic, so we can't just look at its md5 checksum. To decide whether the code needs to regenerate the npz file or can reuse the one that already exists, we look at a metadata file that is written out as a json file. This includes the md5 checksum of the hdf5 file and the properties of interest that were used to generate the npz file. We also need to store the element filter there and check it: if the element filter has changed, the npz file must be regenerated.
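The check described in this comment could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names (`build_metadata`, `write_metadata`, `npz_needs_regeneration`) and the json keys (`hdf5_md5`, `properties_of_interest`, `element_filter`) are assumptions, and the element filter is treated as any json-serializable value.

```python
import hashlib
import json
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(hdf5_path, properties_of_interest, element_filter) -> dict:
    """Collect everything that determines the npz contents (hypothetical keys)."""
    metadata = {
        "hdf5_md5": md5_of_file(Path(hdf5_path)),
        # Assumption: the order of the requested properties does not matter.
        "properties_of_interest": sorted(properties_of_interest),
        "element_filter": element_filter,
    }
    # Round-trip through json so tuples become lists, matching what is read back.
    return json.loads(json.dumps(metadata))


def write_metadata(metadata_path, hdf5_path, properties_of_interest, element_filter):
    """Write the metadata json next to a freshly generated npz file."""
    metadata = build_metadata(hdf5_path, properties_of_interest, element_filter)
    Path(metadata_path).write_text(json.dumps(metadata, indent=2))


def npz_needs_regeneration(
    hdf5_path, npz_path, metadata_path, properties_of_interest, element_filter
) -> bool:
    """Return True if the cached npz file cannot be reused."""
    npz_path, metadata_path = Path(npz_path), Path(metadata_path)
    if not npz_path.exists() or not metadata_path.exists():
        return True
    try:
        stored = json.loads(metadata_path.read_text())
    except json.JSONDecodeError:
        return True  # unreadable metadata: regenerate to be safe
    expected = build_metadata(hdf5_path, properties_of_interest, element_filter)
    # Any difference (checksum, properties, or element filter) forces a rebuild.
    return stored != expected
```

Comparing the whole metadata dictionary, rather than individual fields, means that any field added later (such as the element filter) automatically takes part in the decision.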
… from partial charges (and rescaling)
…fixed small bugs in hdf5 writer. hdf5 writer now also includes property type, which should make it easier upon reading to know what a property represents and how to convert it to a desired set of units.
…t sufficiently unique due to saturation of rings in some cases.
…ords actually are strings.
Pull Request Summary
This adds in a new module called "curate" that provides an API for dataset curation.
The general hierarchy is that each dataset contains records, and each record contains properties. Properties are defined using pydantic models (e.g., AtomicNumbers, Energy, Forces, PartialCharge, etc.). For each value, the pydantic property models enforce units, property type (e.g., length, energy, force), and classification (per_atom, per_system), and validate that the shape of the input array matches what the classification expects. Records collect the properties and validate that the number of configurations and atoms is consistent across properties. The dataset class provides functions to write to an hdf5 file, converting units to the specified unit system. The dataset can also be initialized in an "append" mode, whereby properties associated with individual configurations can be added separately and the code automatically appends to the internal numpy array (performing any unit conversion necessary).
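To make the property/record validation concrete, here is a minimal sketch of the idea. It is not the curate API: it uses standard-library dataclasses and nested lists in place of pydantic models and numpy arrays, the class names `PropertySketch` and `RecordSketch` are hypothetical, and the array layouts (`(n_configs, n_atoms, n_features)` for per_atom, `(n_configs, n_features)` for per_system) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def nested_shape(values) -> tuple:
    """Shape of a nested list, following the first element (no raggedness check)."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


@dataclass
class PropertySketch:
    name: str
    value: list            # nested lists standing in for a numpy array
    units: str             # e.g. "kilojoule_per_mole"
    property_type: str     # e.g. "length", "energy", "force"
    classification: str    # "per_atom" or "per_system"

    def __post_init__(self):
        shape = nested_shape(self.value)
        # Assumed layouts; the real models may use different conventions.
        expected_ndim = {"per_atom": 3, "per_system": 2}
        if self.classification not in expected_ndim:
            raise ValueError(f"unknown classification: {self.classification}")
        if len(shape) != expected_ndim[self.classification]:
            raise ValueError(
                f"{self.name}: shape {shape} does not match {self.classification}"
            )

    @property
    def n_configs(self) -> int:
        return nested_shape(self.value)[0]

    @property
    def n_atoms(self) -> Optional[int]:
        # Only per_atom properties carry an atom dimension.
        if self.classification == "per_atom":
            return nested_shape(self.value)[1]
        return None


@dataclass
class RecordSketch:
    name: str
    properties: List[PropertySketch] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if properties disagree on configurations or atoms."""
        n_configs = {p.n_configs for p in self.properties}
        n_atoms = {p.n_atoms for p in self.properties if p.n_atoms is not None}
        if len(n_configs) > 1:
            raise ValueError(f"{self.name}: inconsistent n_configs {sorted(n_configs)}")
        if len(n_atoms) > 1:
            raise ValueError(f"{self.name}: inconsistent n_atoms {sorted(n_atoms)}")
```

In this sketch shape errors surface when a property is constructed, while cross-property consistency is checked only when the record is validated, mirroring the split between property-level and record-level validation described above.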
This PR is still a WIP, as it will also require converting the old curation scripts to use the new API and revising the dataset class in modelforge to accept the new format (minor changes to the terminology we use for classification).
Key changes
Notable points that this PR has either accomplished or will accomplish.
Questions
Associated Issue(s)
Pull Request Checklist