Enable defining a fixed test subset, while still randomizing train/val split#376
Conversation
…n training/val can be randomized. This now accepts a second seed to define the (train+val)/test split; the original seed is then used to split train and val in the second stage.
…etry if the download fails (as is common when downloading from zenodo on CI). Adding in a new test that simply downloads the datasets first, so they are accessible to all.
Several of the tests were failing due to timeouts when downloading from zenodo. As described in #377 , I wrapped download_from_url in a loop that allows us to retry (max 3 times). I've also added a test function that runs first and downloads the datasets up front, making them available to the rest of the runs. If we can't download, the tests will fail very fast so we don't waste time and can just restart the test run (there is nothing necessarily wrong with the tests themselves, just an issue with hammering on zenodo).
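A minimal sketch of this retry pattern (the `download_with_retry` helper and its signature are illustrative; the PR wraps modelforge's own `download_from_url`):

```python
import time

def download_with_retry(download, max_attempts: int = 3, delay: float = 5.0):
    """Call `download()` up to max_attempts times, sleeping between failures.

    Mirrors the retry loop described above: transient zenodo timeouts get
    retried, but a persistently failing download still raises quickly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return download()
        except OSError:
            if attempt == max_attempts:
                raise  # retries exhausted: fail fast so CI can be restarted
            time.sleep(delay)  # back off before hitting zenodo again
```

Passing the download as a callable keeps the retry logic testable without any network access.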
…que folder shared by all workers (not a separate folder for each worker). This should ensure that all workers have access to the datasets and reduce extra downloads.
spelling mistakes in a comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
spelling mistakes in a comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I copy and pasted a spelling mistake. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
spelling mistake Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ction that doesn't depend upon the dataset itself, to make testing easier.
…age splitting indices in the random record splitting function.
Pull Request Overview
This PR enables defining a fixed test subset while still randomizing train/val splits by introducing an optional test_seed parameter to random splitting algorithms. This allows for consistent test sets across different runs while varying the training and validation data distribution.
- Adds `test_seed` parameter to random splitting strategies for two-stage randomization
- Implements two-stage split logic that first separates test data, then randomizes train/val
- Adds comprehensive test coverage and documentation for the new functionality
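The two-stage logic can be sketched as follows (assuming NumPy; the function name, signature, and fraction defaults are illustrative, not modelforge's actual API):

```python
import numpy as np

def two_stage_split(n, test_frac=0.1, val_frac=0.1, seed=42, test_seed=None):
    """Return (train, val, test) index arrays for a dataset of size n.

    If test_seed is given, the test subset is fixed by test_seed alone,
    while seed only reshuffles the remaining indices between train and val.
    """
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    if test_seed is None:
        # single-stage: one generator shuffles everything
        idx = np.random.default_rng(seed).permutation(n)
    else:
        # stage one: test_seed determines which indices land in the test subset
        idx = np.random.default_rng(test_seed).permutation(n)
        # stage two: seed reshuffles only the train+val remainder
        rest = np.random.default_rng(seed).permutation(idx[n_test:])
        idx = np.concatenate([rest, idx[:n_test]])
    # layout of idx: [train | val | test]
    train = idx[: n - n_test - n_val]
    val = idx[n - n_test - n_val : n - n_test]
    test = idx[n - n_test :]
    return train, val, test
```

With a fixed `test_seed`, changing `seed` reshuffles only which molecules fall into train versus val, leaving the test subset untouched.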
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| modelforge/utils/remote.py | Adds retry functionality with configurable delays for download robustness |
| modelforge/train/training.py | Passes test_seed parameter to splitting strategy during datamodule setup |
| modelforge/train/parameters.py | Adds optional test_seed field to SplittingStrategy configuration |
| modelforge/tests/test_dataset.py | Adds comprehensive tests for split calculations and two-stage random splitting |
| modelforge/tests/test_aaa_setup.py | New test file for initial dataset download and setup |
| modelforge/tests/conftest.py | Improves shared temporary directory handling for parallel testing |
| modelforge/dataset/utils.py | Implements core two-stage splitting logic and utility functions |
| docs/training.rst | Updates documentation with new splitting strategy parameters and examples |
| docs/potentials.rst | Minor documentation fix removing ZBL energy from example |
| docs/datasets.rst | Minor documentation improvements and formatting fixes |
Pull Request Summary
In some cases, we may wish to have a fixed subset of test systems, while allowing the remaining molecules to be randomly distributed between training and validation.
This is likely not particularly useful for production level training runs, but could be especially useful for testing purposes.
E.g., this could allow us to get error bars (by changing the split between training/validation) while exploring the impact of hyperparameters, and remove any impact on fitness that could result from having different molecules in the test set.
To enable this, the random splitting algorithms now accept an optional second seed, `test_seed`. If `test_seed` is defined, randomization is done in two steps.
- Step one: randomize the indices using a random generator based on `test_seed`, splitting the dataset into a test subset and a subset containing training+val. Runs that use the same `test_seed` (on the same dataset) will therefore have test subsets with identical systems.
- Step two: randomize the indices in the training+val subset (using a random generator associated with the main `seed`), splitting it into the appropriately sized training and validation subsets. Changing the value of `seed` will randomize the molecules in these two subsets.

Associated Issue(s)
Pull Request Checklist