-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Clustering module refactor #1864
Conversation
I cant see how it needs deprecating, maybe @chrisholder can comment |
The module was never 'public' so nothing needs to be deprecated - it isnt even included on the sktime docs if you look here: https://www.sktime.org/en/stable/api_reference.html so I don't really see or think anything needs deprecating. The only way of seeing it is search for it. It's also marked with the 'Experimental feature' and therefore any users who had used it will know everything is subject to change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should have a default predict_proba here, as with BaseClassifier
What do you want that to return? For fuzzy clustering I could see the use but for default non fuzzy based clustering what do you want it to return? |
Probabilistic cluster assignment? That would make sense in clusterers which, for instance, fit a full mixture distribution. Then, the concept of conditional likelihood of cluster membership makes sense (which would be a pmf). |
so the default method should call predict, then sign cluster probability of 1 to predicted cluster, 0 to others (see BaseClassifier). I'm assuming the number of clusters is set in predict and stored of course |
actually, on discussion, predict_proba is problematic for some potential clustering algorithms. so we will defer this feature until we actually need it. This is good to go now, last orders @fkiraly, I want to put it in when it passes tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Sorry to block, the docstrings in the extension template seems to be incorrect.
X in _fit
should say that it´s guaranteed to be of X_inner_mtype
- if it´s not numpy, the implementation will break despite assurances by the template.
See the BaseClassifier
for a formulation that people thought wasn´t too obscure (could be obscure if you just talk about mtypes without ref) and also correct.
Oh ok I'll look next week |
@fkiraly I've changed the comments in the extension templates, let me know if they are ok now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I think this is now good enough.
I have some suggestions on how to make the extension templates clearer, but let´s do this in a separate PR.
Further, is _predict
really mandatory for clusterers?
I don´t think it should be, not all clustering algorithms result in a unique cluster assignment!
Example: probabilistic/likelihood based; hierarchical or dendrogram clustering which results in a tree but not a cluster assignment.
Reference Issues/PRs
This pr relies on pr #1847
What does this implement/fix? Explain your changes.
This pr refactors the clustering module base class and removes the use of cython distance metrics.
The base class refactor is aimed to be essentially the same as classification for consistency.
With the base class and distances now properly supporting multivariate problems the two implemented algorithms (k-means and k-medoids) have been refactored to properly support this. This includes removing cython distances and greatly simplifying the implementation of both algorithms.
In addition huge performance benefits from refactoring the distances can be seen.
This pr is only missing DBA as a mean averaging for k-means but this will be a large change and therefore this pr is split up from it (refactor for DBA coming in the next few days).
Does your contribution introduce a new dependency? If yes, which one?
What should a reviewer concentrate their feedback on?
Any other comments?
PR checklist
For all contributions
For new estimators