Curate: different strategies for generating records with a set maximum number of configurations #375
…r generating different subsets for max records (i.e., start, end, random).
I had initially pinned tad-mctc because a bug was introduced in its newest version, but I hadn't pinned tad-dftd3; the newer version of tad-dftd3 now looks for functions that don't exist in the older pinned version of tad-mctc. It appears that the newest versions of both packages have resolved the initial reason for pinning (#368). I'm going to try removing the pin to see if that also resolves the issues on CI (it fixes them locally).
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Pull Request Overview
This PR enhances the SourceDataset class by adding flexible strategies for selecting configurations when limiting records to a maximum number of configurations. Previously, configurations were always selected from the start of the array. Now users can choose to select from the start, end, or randomly, with optional seeding for reproducibility.
Key Changes
- Added `max_configurations_per_record_order` parameter to support "start", "end", and "random" configuration selection strategies
- Added `seed` parameter for reproducible random configuration selection
- Enhanced test coverage for the new configuration selection strategies
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| modelforge/curate/sourcedataset.py | Added new parameters and logic for flexible configuration selection strategies |
| modelforge/curate/tests/test_curate.py | Added comprehensive tests for all three configuration selection strategies |
| modelforge/curate/examples/record_and_sourcedataset.ipynb | Updated documentation and timestamps |
| modelforge/curate/datasets/tmqm_openff_curation.py | Minor logging change (commented out debug line) |
| modelforge/curate/datasets/scripts/curate_tmqm_openff.py | Updated version and switched processing method |
| devtools/conda-envs/*.yaml | Relaxed tad-mctc version constraint |
tidying up Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…if a bad value is added.
max_configurations_per_record_order
…the test to not actually capture the stdout)
Pull Request Summary
This adds functionality to the SourceDataset class that lets us use different strategies for restricting a record to at most N configurations.
Previously, a record was simply restricted to the first N configurations in the array; this PR adds the ability to take the last N configurations of the array instead, or to select N configurations at random.
Users just need to specify "max_configurations_per_record_order" to change from the default behavior of "start".
To support both reproducibility and the generation of unique datasets, a seed can be passed as well. The seed initializes a numpy random number generator instance (i.e., the global random state is not used).
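
For illustration, here is a minimal sketch of how the three strategies could pick configuration indices. It is not the code in `modelforge/curate/sourcedataset.py`: the function name `select_configuration_indices` and its arguments are hypothetical, and returning the random selection in sorted order is an assumption made for this example only.

```python
import numpy as np


def select_configuration_indices(n_configs, max_configs, order="start", seed=None):
    """Return the indices of at most `max_configs` configurations.

    Hypothetical helper for illustration; not part of modelforge.
    """
    valid_orders = ("start", "end", "random")
    if order not in valid_orders:
        raise ValueError(f"order must be one of {valid_orders}, got {order!r}")

    # Nothing to restrict if the record already has few enough configurations.
    if max_configs >= n_configs:
        return np.arange(n_configs)

    if order == "start":
        # First N configurations (the previous default behavior).
        return np.arange(max_configs)
    if order == "end":
        # Last N configurations.
        return np.arange(n_configs - max_configs, n_configs)

    # "random": use a local Generator so the global numpy state is untouched.
    rng = np.random.default_rng(seed)
    chosen = rng.choice(n_configs, size=max_configs, replace=False)
    # Sorting keeps the original ordering of the selected configurations
    # (an assumption of this sketch).
    return np.sort(chosen)


start_idx = select_configuration_indices(10, 3, order="start")
end_idx = select_configuration_indices(10, 3, order="end")
random_idx = select_configuration_indices(10, 3, order="random", seed=42)
```

Passing the same seed twice yields the same random subset, while different seeds can be used to build distinct datasets from the same source records.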
Key changes
Notable points that this PR has either accomplished or will accomplish.
Associated Issue(s)
Pull Request Checklist