Add Lo splitter #9

SteshinSS · 2024-02-27T12:56:49Z

Hi, all. @cwognum suggested adding Lo-Hi splitter into Splito. Here is the first part of it.

Changelogs

Add Lo splitter (from Lo-Hi splitter)

If the code seems ok, I'll add documentation and update this PR.

Checklist:

Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added, or an existing one is deleted.
Write concise and explanatory changelogs above.
If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

SteshinSS · 2024-02-27T12:57:59Z

Also, comment please on the code style. I didn't use any formatter yet.

cwognum

Thanks for the work @SteshinSS ! I'm very grateful for the documentation and test cases! 🚀

My main question is whether we can simplify the code by using some of the functionality of Datamol.

For the formatting, we use ruff, which is (mostly) the same as black.

splito/lohi/_utils.py

splito/lohi/_lo.py

SteshinSS · 2024-02-28T14:21:15Z

Thank you for the thorough review. The code is very complex, so any way to simplify it would be helpful. I'll return with the revision.

SteshinSS · 2024-03-02T10:50:45Z

Changes in the new commit:

Added support for duplicates in the SMILES list.
Changed the style of the code and docstrings.
Used a logger instead of print statements.

Please review this version, and if it looks fine, I will include documentation in this PR.

zhu0619

Thank you very much for your contribution @SteshinSS!
I really like the LO splitting method.

I left few comments.

splito/lohi/_lo.py

SteshinSS · 2024-03-24T11:43:03Z

Thanks, guys, for the review! It seems that the PR has mostly converged.
In this case, I'm going to add documentation to this PR.

Please check my old tutorial on the Lo splitter. I'm going to add more details, remove the k-fold examples, and rewrite the code using Splito and datamol datasets. What do you think about that? Is there anything that you would like to see in the documentation? Is there anything that could be clearer?

cwognum · 2024-03-24T15:34:36Z

Hi @SteshinSS, thanks for the nice work so far! As to the documentation, I would focus on the why rather than on the what. For a large part I find the code self-explanatory and won't need comments that explain me what's happening. Instead, I would like to understand why it has been implemented in a certain way and - zooming out - what the reasoning is behind this splitting method to begin with. You can of course always refer to your paper, but I would assume people have not read it.

SteshinSS · 2024-04-06T10:21:48Z

Done :)

cwognum

Wow, @SteshinSS ! You went above and beyond for the docs! Thanks a lot for the great work! 📚🚀

I have some suggestions to simplify the code for the test case and then one final comment: I recently merged #12. Could you integrate the Lo splitter with the changes from this PR? It would be great if people could do:

X_train, X_test, y_train, y_test = train_test_split(X, y, smiles=smiles, method="lo")

One question here: Given that the Lo splitter returns multiple test sets, what would you suggest the value of X_test and y_test to look like? Should it just be lists of arrays?

tests/test_lo.py

cwognum · 2024-04-06T16:03:09Z

@SteshinSS Maybe worth saying that I didn't test the suggested changes above. It's from the top of my head.

SteshinSS · 2024-04-13T06:49:46Z

I just checked, and I'm not sure we want to add LoSplitter into train_test_split().

There are many splitters with a similar interface, accepting test_size and seed as parameters and returning an iterator. This function allows them to work in the same manner, which might be beneficial. However, LoSplitter:

Does not return an iterator (because there are a lot of different clusters in the test set).
Takes neither test_size nor seed.
Accepts other arguments such as threshold, min_cluster_size, max_clusters, and std_threshold.

If we add LoSplitter to the train_test_split(), we will need:

To raise an error if test_size or seed are passed, because users will expect these parameters to work.
To handle LoSplitter as a special case in the return statement (because next(splitter.split(X, y)) won't work).
In this case, the common interface isn't so common, and the utility of this umbrella function isn't clear.

As an alternative, train_test_split() could take **kwargs and pass them into splitters as _ = train_test_split(X, y, method, **kwargs). However, this approach kinda make the train_test_split() pointless because users could simply use splitters directly.

Have you considered a functional approach? For example, splito could offer not only classes but also pure functions for splitting, similar to torch.nn.functional

cwognum · 2024-04-13T14:30:36Z

I like the idea of a functional module! That's indeed maybe the better way to go. Thanks for the elaborate suggestion and considerations!

However, I don't want to scope creep this PR too much, so could you maybe open an issue with your idea and then we can merge this PR! Thanks again 🙏

cwognum · 2024-04-13T14:32:23Z

Please just make sure the code is properly formatted with Ruff!

SteshinSS · 2024-04-19T10:12:38Z

I believe it could be merged :)

Add Lo splitter

7775cee

SteshinSS requested a review from hadim as a code owner February 27, 2024 12:56

zhu0619 requested review from zhu0619 and cwognum and removed request for hadim February 27, 2024 15:22

cwognum reviewed Feb 27, 2024

View reviewed changes

Add support for non-unique smiles to the Lo splitter

3d8677f

zhu0619 reviewed Mar 11, 2024

View reviewed changes

splito/lohi/_lo.py Outdated Show resolved Hide resolved

splito/lohi/_lo.py Outdated Show resolved Hide resolved

splito/lohi/_lo.py Show resolved Hide resolved

Lo splitter: return the centroids more explicitly

b4050eb

Add docs to Lo splitter

8af228c

cwognum reviewed Apr 6, 2024

View reviewed changes

tests/test_lo.py Outdated Show resolved Hide resolved

tests/test_lo.py Outdated Show resolved Hide resolved

cwognum assigned SteshinSS Apr 6, 2024

cwognum added the feature label Apr 6, 2024

Merge remote-tracking branch 'upstream/main'

a31ea1a

Simplify LoSplitter tests

2614030

cwognum approved these changes Apr 13, 2024

View reviewed changes

Remove unused imports

c40f82a

SteshinSS mentioned this pull request Apr 14, 2024

Functional forms of splitters #14

Open

cwognum merged commit 654e427 into datamol-io:main Apr 19, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Lo splitter #9

Add Lo splitter #9

SteshinSS commented Feb 27, 2024 •

edited by cwognum

Loading

SteshinSS commented Feb 27, 2024

cwognum left a comment

SteshinSS commented Feb 28, 2024

SteshinSS commented Mar 2, 2024

zhu0619 left a comment

SteshinSS commented Mar 24, 2024 •

edited

Loading

cwognum commented Mar 24, 2024

SteshinSS commented Apr 6, 2024

cwognum left a comment

cwognum commented Apr 6, 2024

SteshinSS commented Apr 13, 2024

cwognum commented Apr 13, 2024

cwognum commented Apr 13, 2024

SteshinSS commented Apr 19, 2024

Add Lo splitter #9

Add Lo splitter #9

Conversation

SteshinSS commented Feb 27, 2024 • edited by cwognum Loading

Changelogs

SteshinSS commented Feb 27, 2024

cwognum left a comment

Choose a reason for hiding this comment

SteshinSS commented Feb 28, 2024

SteshinSS commented Mar 2, 2024

zhu0619 left a comment

Choose a reason for hiding this comment

SteshinSS commented Mar 24, 2024 • edited Loading

cwognum commented Mar 24, 2024

SteshinSS commented Apr 6, 2024

cwognum left a comment

Choose a reason for hiding this comment

cwognum commented Apr 6, 2024

SteshinSS commented Apr 13, 2024

cwognum commented Apr 13, 2024

cwognum commented Apr 13, 2024

SteshinSS commented Apr 19, 2024

SteshinSS commented Feb 27, 2024 •

edited by cwognum

Loading

SteshinSS commented Mar 24, 2024 •

edited

Loading