Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Clustering module refactor #1864

Merged
merged 157 commits into from
Jan 16, 2022
Merged

[ENH] Clustering module refactor #1864

merged 157 commits into from
Jan 16, 2022

Conversation

chrisholder
Copy link
Collaborator

@chrisholder chrisholder commented Jan 11, 2022

Reference Issues/PRs

This pr relies on pr #1847

What does this implement/fix? Explain your changes.

This pr refactors the clustering module base class and removes the use of cython distance metrics.

The base class refactor is aimed to be essentially the same as classification for consistency.

With the base class and distances now properly supporting multivariate problems the two implemented algorithms (k-means and k-medoids) have been refactored to properly support this. This includes removing cython distances and greatly simplifying the implementation of both algorithms.

In addition huge performance benefits from refactoring the distances can be seen.

This pr is only missing DBA as a mean averaging for k-means but this will be a large change and therefore this pr is split up from it (refactor for DBA coming in the next few days).

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@TonyBagnall
Copy link
Contributor

I cant see how it needs deprecating, maybe @chrisholder can comment

@chrisholder
Copy link
Collaborator Author

chrisholder commented Jan 13, 2022

The module was never 'public' so nothing needs to be deprecated - it isnt even included on the sktime docs if you look here: https://www.sktime.org/en/stable/api_reference.html so I don't really see or think anything needs deprecating. The only way of seeing it is search for it. It's also marked with the 'Experimental feature' and therefore any users who had used it will know everything is subject to change.

Copy link
Contributor

@TonyBagnall TonyBagnall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have a default predict_proba here, as with BaseClassifier

@chrisholder
Copy link
Collaborator Author

What do you want that to return? For fuzzy clustering I could see the use but for default non fuzzy based clustering what do you want it to return?

@fkiraly
Copy link
Collaborator

fkiraly commented Jan 14, 2022

we should have a default predict_proba here, as with BaseClassifier
What do you want that to return?

Probabilistic cluster assignment? That would make sense in clusterers which, for instance, fit a full mixture distribution. Then, the concept of conditional likelihood of cluster membership makes sense (which would be a pmf).

TonyBagnall
TonyBagnall previously approved these changes Jan 14, 2022
@TonyBagnall
Copy link
Contributor

so the default method should call predict, then sign cluster probability of 1 to predicted cluster, 0 to others (see BaseClassifier). I'm assuming the number of clusters is set in predict and stored of course

@TonyBagnall
Copy link
Contributor

actually, on discussion, predict_proba is problematic for some potential clustering algorithms. so we will defer this feature until we actually need it. This is good to go now, last orders @fkiraly, I want to put it in when it passes tests

@TonyBagnall TonyBagnall self-requested a review January 14, 2022 13:26
TonyBagnall
TonyBagnall previously approved these changes Jan 14, 2022
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sorry to block, the docstrings in the extension template seems to be incorrect.
X in _fit should say that it´s guaranteed to be of X_inner_mtype - if it´s not numpy, the implementation will break despite assurances by the template.

See the BaseClassifier for a formulation that people thought wasn´t too obscure (could be obscure if you just talk about mtypes without ref) and also correct.

@TonyBagnall
Copy link
Contributor

Oh ok I'll look next week

@TonyBagnall
Copy link
Contributor

@fkiraly I've changed the comments in the extension templates, let me know if they are ok now

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think this is now good enough.
I have some suggestions on how to make the extension templates clearer, but let´s do this in a separate PR.

Further, is _predict really mandatory for clusterers?
I don´t think it should be, not all clustering algorithms result in a unique cluster assignment!
Example: probabilistic/likelihood based; hierarchical or dendrogram clustering which results in a tree but not a cluster assignment.

@TonyBagnall TonyBagnall merged commit 79cc513 into main Jan 16, 2022
@TonyBagnall TonyBagnall deleted the clustering-update branch January 16, 2022 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:clustering clustering module: time series clustering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants