Fixes to k-fold fingerprint splitting #3038
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Pull Request Template
Description
There are two issues when attempting to k-fold split via fingerprint similarity:
N
does not divide evenly intok
folds (i.eN mod k != 0
) the extra data points are discarded. Astest_size != 0
whenN mod k != 0
, there is an attempted second split for the test indices.k_fold_split
forFingerprintSplitter
only collects training and validation data, and therefore the test indices are lost. The proposed fix by adding checks onfrac_valid
andfrac_test
ensure that the extra data points are merged appropriately into the test and validation sets respectively.k
th fold,split
calls with argumentsfrac_train=1.0
,frac_valid=0.0
andfrac_test=0.0
._split_fingerprints
is then called withsize1=N/k
andsize2=0
, which will fail as the grouping algorithm attempts to divide by zero. A straightforward fix is to return the identity (i.e all of the unsplit indices), because the final fold is already considered split by process of elimination.To test on both these issues, I have also written a unit test that checks for both issues.
Extra line changes were caused by running
yapf
on bothsplitter.py
andtest_splitter.py
.Type of change
Please check the option that is related to your PR.
Checklist
yapf -i <modified file>
and check no errors (yapf version must be 0.32.0)mypy -p deepchem
and check no errorsflake8 <modified file> --count
and check no errorspython -m doctest <modified file>
and check no errors