Curate: different strategies for generating records with a set maximum number of configurations#375

Merged
chrisiacovella merged 11 commits intochoderalab:mainfrom
chrisiacovella:dev-random_slice
Aug 26, 2025

@chrisiacovella
Member

@chrisiacovella chrisiacovella commented Aug 25, 2025

Pull Request Summary

This adds functionality to the SourceDataset class to support different strategies for restricting a record to at most N configurations.
Previously, a record was always restricted to the first N configurations in the array; this PR adds options to take the last N configurations from the end of the array or to select N configurations at random.

Users just need to specify "max_configurations_per_record_order" to change from the default behavior of "start".

To provide reproducibility while still allowing unique datasets to be generated, a seed can also be passed. The seed initializes a dedicated numpy random number generator instance (i.e., the global random state is not used).
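The three strategies can be illustrated with a small standalone sketch. This is not the code from sourcedataset.py: the helper name `select_configuration_indices` and its arguments are hypothetical, and sorting the randomly chosen indices is an assumption made here so the original ordering of configurations is preserved.

```python
import numpy as np


def select_configuration_indices(n_configs, max_configurations, order="start", seed=None):
    """Return the indices of the configurations to keep (hypothetical helper).

    order: "start" keeps the first N, "end" keeps the last N,
    "random" keeps N chosen without replacement.
    """
    if max_configurations < 0:
        raise ValueError("max_configurations must be non-negative")

    # Never ask for more configurations than the record actually holds.
    n_keep = min(n_configs, max_configurations)

    if order == "start":
        return np.arange(n_keep)
    if order == "end":
        return np.arange(n_configs - n_keep, n_configs)
    if order == "random":
        # A local Generator instance; the global numpy random state is untouched.
        rng = np.random.default_rng(seed)
        picked = rng.choice(n_configs, size=n_keep, replace=False)
        # Assumption: sort so the kept configurations stay in their original order.
        return np.sort(picked)

    raise ValueError(f"unknown order: {order!r}")


first = select_configuration_indices(10, 3, order="start")
last = select_configuration_indices(10, 3, order="end")
random_a = select_configuration_indices(10, 3, order="random", seed=42)
random_b = select_configuration_indices(10, 3, order="random", seed=42)
```

With the same seed, `random_a` and `random_b` contain the same indices; passing a different seed (or none) yields a different subset, which is how unique datasets can be produced.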

Key changes

Notable points that this PR has either accomplished or will accomplish.

  • Add in different options to the subsetting routines in a SourceDataset.

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

@chrisiacovella
Member Author

I had initially pinned tad-mctc because a bug was introduced in its newest version, but I hadn't pinned tad-dftd3; the newer version of tad-dftd3 now looks for functions that don't exist in the older, pinned version of tad-mctc. It appears that the newest versions of both packages have resolved the original reason for pinning (#368). I'm going to try removing the pin to see whether that also resolves the CI failures (it fixes them locally).

@codecov-commenter

codecov-commenter commented Aug 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.87%. Comparing base (ef9c8be) to head (cc0f28c).
⚠️ Report is 211 commits behind head on main.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the SourceDataset class by adding flexible strategies for selecting configurations when limiting records to a maximum number of configurations. Previously, configurations were always selected from the start of the array. Now users can choose to select from the start, end, or randomly, with optional seeding for reproducibility.

Key Changes

  • Added max_configurations_per_record_order parameter to support "start", "end", and "random" configuration selection strategies
  • Added seed parameter for reproducible random configuration selection
  • Enhanced test coverage for the new configuration selection strategies
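The reproducibility property behind the seed parameter can be checked in isolation with numpy alone. The snippet below is a minimal sketch, not one of the PR's tests: it shows that two generators created with the same seed make identical selections, and that creating and using them leaves numpy's global random state unchanged. The seed value and array sizes are arbitrary.

```python
import numpy as np

# Put the legacy global generator in a known state and snapshot it.
np.random.seed(0)
state_before = np.random.get_state()

# Two independent Generator instances built from the same seed.
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

# Pick 5 of 100 configuration indices without replacement from each.
picks_a = rng_a.choice(100, size=5, replace=False)
picks_b = rng_b.choice(100, size=5, replace=False)

state_after = np.random.get_state()

# Same seed gives the same selection.
same_picks = bool(np.array_equal(picks_a, picks_b))

# The global state tuple is (name, key array, position, has_gauss, cached_gaussian).
global_untouched = (
    state_before[0] == state_after[0]
    and bool(np.array_equal(state_before[1], state_after[1]))
    and state_before[2:] == state_after[2:]
)
```

Both `same_picks` and `global_untouched` evaluate to `True`, which is the behavior the seed parameter relies on: selections are repeatable without interfering with any other code that uses `np.random` directly.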

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| modelforge/curate/sourcedataset.py | Added new parameters and logic for flexible configuration selection strategies |
| modelforge/curate/tests/test_curate.py | Added comprehensive tests for all three configuration selection strategies |
| modelforge/curate/examples/record_and_sourcedataset.ipynb | Updated documentation and timestamps |
| modelforge/curate/datasets/tmqm_openff_curation.py | Minor logging change (commented out debug line) |
| modelforge/curate/datasets/scripts/curate_tmqm_openff.py | Updated version and switched processing method |
| devtools/conda-envs/*.yaml | Relaxed tad-mctc version constraint |

Comment thread modelforge-curate/modelforge/curate/sourcedataset.py Outdated
Comment thread modelforge-curate/modelforge/curate/sourcedataset.py Outdated
Comment thread modelforge-curate/modelforge/curate/tests/test_curate.py Outdated
Comment thread modelforge-curate/modelforge/curate/sourcedataset.py
@chrisiacovella chrisiacovella merged commit 102545e into choderalab:main Aug 26, 2025
17 checks passed
@chrisiacovella chrisiacovella deleted the dev-random_slice branch August 27, 2025 18:26

4 participants