Enable defining a fixed test subset, while still randomizing train/val split#376
Conversation
…n training/val can be randomized. This now accepts a second seed to define the (train+val)/test split; the original seed is then used to split train and val in the second stage.
…etry if the download fails (as is common when downloading from zenodo on CI). Adding in a new test that simply downloads the datasets first, so they are accessible to all.
Several of the tests were failing due to timeouts when downloading from zenodo. As described in #377 , I wrapped download_from_url in a loop that allows us to retry (max 3 times). I've also added a test function that runs first and downloads the datasets up front, making them available to the rest of the runs. If we can't download, the tests will fail very fast so we don't waste time and can just restart the test run (there is nothing necessarily wrong with the tests themselves, just an issue with hammering on zenodo).
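A minimal sketch of this retry pattern (the `download_with_retry` helper and its signature are illustrative; the PR wraps modelforge's own `download_from_url`):

```python
import time

def download_with_retry(download, max_attempts: int = 3, delay: float = 5.0):
    """Call `download()` up to max_attempts times, sleeping between failures.

    Mirrors the retry loop described above: transient zenodo timeouts get
    retried, but a persistently failing download still raises quickly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return download()
        except OSError:
            if attempt == max_attempts:
                raise  # retries exhausted: fail fast so CI can be restarted
            time.sleep(delay)  # back off before hitting zenodo again
```

Passing the download as a callable keeps the retry logic testable without any network access.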
…que folder shared by all workers (not a separate folder for each worker). This should ensure that all workers have access to the datasets and reduce extra downloads.
spelling mistakes in a comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
spelling mistakes in a comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I copy and pasted a spelling mistake. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
spelling mistake Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ction that doesn't depend upon the dataset itself, to make testing easier.
…age splitting indices in the random record splitting function.
Pull Request Overview
This PR enables defining a fixed test subset while still randomizing train/val splits by introducing an optional test_seed parameter to random splitting algorithms. This allows for consistent test sets across different runs while varying the training and validation data distribution.
- Adds `test_seed` parameter to random splitting strategies for two-stage randomization
- Implements two-stage split logic that first separates test data, then randomizes train/val
- Adds comprehensive test coverage and documentation for the new functionality
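The two-stage logic can be sketched as follows (assuming NumPy; the function name, signature, and fraction defaults are illustrative, not modelforge's actual API):

```python
import numpy as np

def two_stage_split(n, test_frac=0.1, val_frac=0.1, seed=42, test_seed=None):
    """Return (train, val, test) index arrays for a dataset of size n.

    If test_seed is given, the test subset is fixed by test_seed alone,
    while seed only reshuffles the remaining indices between train and val.
    """
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    if test_seed is None:
        # single-stage: one generator shuffles everything
        idx = np.random.default_rng(seed).permutation(n)
    else:
        # stage one: test_seed determines which indices land in the test subset
        idx = np.random.default_rng(test_seed).permutation(n)
        # stage two: seed reshuffles only the train+val remainder
        rest = np.random.default_rng(seed).permutation(idx[n_test:])
        idx = np.concatenate([rest, idx[:n_test]])
    # layout of idx: [train | val | test]
    train = idx[: n - n_test - n_val]
    val = idx[n - n_test - n_val : n - n_test]
    test = idx[n - n_test :]
    return train, val, test
```

With a fixed `test_seed`, changing `seed` reshuffles only which molecules fall into train versus val, leaving the test subset untouched.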
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| modelforge/utils/remote.py | Adds retry functionality with configurable delays for download robustness |
| modelforge/train/training.py | Passes test_seed parameter to splitting strategy during datamodule setup |
| modelforge/train/parameters.py | Adds optional test_seed field to SplittingStrategy configuration |
| modelforge/tests/test_dataset.py | Adds comprehensive tests for split calculations and two-stage random splitting |
| modelforge/tests/test_aaa_setup.py | New test file for initial dataset download and setup |
| modelforge/tests/conftest.py | Improves shared temporary directory handling for parallel testing |
| modelforge/dataset/utils.py | Implements core two-stage splitting logic and utility functions |
| docs/training.rst | Updates documentation with new splitting strategy parameters and examples |
| docs/potentials.rst | Minor documentation fix removing ZBL energy from example |
| docs/datasets.rst | Minor documentation improvements and formatting fixes |
Pull Request Summary
In some cases, we may wish to have a fixed subset of test systems, while allowing the remaining molecules to be randomly distributed between training and validation.
This is likely not particularly useful for production level training runs, but could be especially useful for testing purposes.
E.g., this could allow us to get error bars (by changing the split between training/validation) while exploring the impact of hyperparameters, and remove any impact on fitness that could result from having different molecules in the test set.
To enable this, the random splitting algorithms now accept an optional second seed, `test_seed`. If `test_seed` is defined, randomization is done in two steps.
- Step one: randomize the indices using a random generator based on `test_seed`, splitting the dataset into a test subset and a subset containing training+val. Runs that use the same `test_seed` (on the same dataset) will therefore have test subsets with identical systems.
- Step two: randomize the indices in the training+val subset (using a random generator associated with the main `seed`), splitting it into the appropriately sized training and validation subsets. Changing the value of `seed` will randomize the molecules in these two subsets.

Associated Issue(s)
Pull Request Checklist