Skip to content

Enable defining a fixed test subset, while still randomizing train/val split#376

Merged
chrisiacovella merged 13 commits intochoderalab:mainfrom
chrisiacovella:dev-fixed_test
Aug 29, 2025
Merged

Enable defining a fixed test subset, while still randomizing train/val split#376
chrisiacovella merged 13 commits intochoderalab:mainfrom
chrisiacovella:dev-fixed_test

Conversation

@chrisiacovella
Copy link
Copy Markdown
Member

@chrisiacovella chrisiacovella commented Aug 28, 2025

Pull Request Summary

In some cases, we may wish to have a fixed subset of test systems, while allowing the remaining molecules to be able to be randomly distributed between training and validation.

This is likely not particularly useful for production level training runs, but could be especially useful for testing purposes.
E.g., this could be useful for allowing us to get error bars (by changing the split between training/validation) while exploring the impact of hyper parameters, while removing any impacts on fitness that could result form having different molecules in the test set.

To enable this, the random splitting algorithms now accept an optional second seed, test_seed.

If test_seed is defined, randomization will be done in two steps.

-Step one, randomize the indices using a random generator based on test_seed. Here we will split the dataset into a test subset and a subset contains training+val. Runs that use the same test_seed (on the same dataset) will therefore have test subsets with identical systems.

-Step two, randomize the indices in the training+val subset (using a random generator associated with the main seed), splitting this into the appropriately sized training and validation subsets. Changing the value of seed will randomize the molecules in these two subsets.

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

…n training/val can be randomized. This now accepts a second seed to define the (train+val)/test split; the original seed is then used to split train and val in the second stage.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 88.57143% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.82%. Comparing base (102545e) to head (7f29ec5).

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…etry if the download fails (as is common when downloading from zenodo on CI).

Adding in a new test that simply downloads the datasets first, so they are accessible to all.
@chrisiacovella
Copy link
Copy Markdown
Member Author

Several of the tests were failing due to timeout of downloading from zenodo. As described in #377 , I wrapped the download_from_url in a loop to allow us to retry (max 3 times). I've also added in a test function that runs first that downloads the datasets initially (making them available to the rest of the runs)...if we can't download the tests will fail very fast so we don't waste time and can just restart the test (as there is nothing necessarily wrong with test, just an issue hammering on zenodo).

…que folder shared by all workers (not a separate worker for each). This should ensure that all workers have access to the datasets and reduce extra downloads.

This comment was marked as outdated.

chrisiacovella and others added 6 commits August 29, 2025 11:58
spelling mistakes in a comment

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
spelling mistake sin a comment

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I copy and pasted a spelling mistake.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
spelling mistake

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ction that doesn't depend upon the dataset itself, to make testing easier.
…age splitting indices in the random record splitting function.
@chrisiacovella chrisiacovella requested a review from Copilot August 29, 2025 21:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enables defining a fixed test subset while still randomizing train/val splits by introducing an optional test_seed parameter to random splitting algorithms. This allows for consistent test sets across different runs while varying the training and validation data distribution.

  • Adds test_seed parameter to random splitting strategies for two-stage randomization
  • Implements two-stage split logic that first separates test data, then randomizes train/val
  • Adds comprehensive test coverage and documentation for the new functionality

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
modelforge/utils/remote.py Adds retry functionality with configurable delays for download robustness
modelforge/train/training.py Passes test_seed parameter to splitting strategy during datamodule setup
modelforge/train/parameters.py Adds optional test_seed field to SplittingStrategy configuration
modelforge/tests/test_dataset.py Adds comprehensive tests for split calculations and two-stage random splitting
modelforge/tests/test_aaa_setup.py New test file for initial dataset download and setup
modelforge/tests/conftest.py Improves shared temporary directory handling for parallel testing
modelforge/dataset/utils.py Implements core two-stage splitting logic and utility functions
docs/training.rst Updates documentation with new splitting strategy parameters and examples
docs/potentials.rst Minor documentation fix removing ZBL energy from example
docs/datasets.rst Minor documentation improvements and formatting fixes

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread modelforge/utils/remote.py Outdated
Comment thread modelforge/dataset/utils.py Outdated
Comment thread modelforge/dataset/utils.py Outdated
Comment thread docs/training.rst Outdated
Comment thread modelforge/dataset/utils.py
Comment thread modelforge/tests/test_aaa_setup.py
@chrisiacovella chrisiacovella merged commit 4f7c718 into choderalab:main Aug 29, 2025
17 checks passed
@chrisiacovella chrisiacovella deleted the dev-fixed_test branch August 29, 2025 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants