[ENH] Clustering module refactor #1864

chrisholder · 2022-01-11T14:31:36Z

Reference Issues/PRs

This PR relies on PR #1847.

What does this implement/fix? Explain your changes.

This PR refactors the clustering module base class and removes the use of Cython distance metrics.

The refactored base class is intended to be essentially the same as the classification base class, for consistency.

With the base class and distances now properly supporting multivariate problems, the two implemented algorithms (k-means and k-medoids) have been refactored to support them as well. This includes removing the Cython distances and greatly simplifying the implementation of both algorithms.
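As a rough illustration of why a plain distance callable simplifies the algorithms, here is a minimal sketch of k-medoids over multivariate series. It is not the code in this PR: the function names (`euclidean_distance`, `pairwise_distance`, `k_medoids`), the `(n_cases, n_dims, n_timepoints)` array layout, and the simple alternating update scheme are all assumptions made for the example.

```python
import numpy as np


def euclidean_distance(x, y):
    """Euclidean distance between two series of shape (n_dims, n_timepoints)."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


def pairwise_distance(X, distance):
    """Symmetric distance matrix for X of shape (n_cases, n_dims, n_timepoints)."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = distance(X[i], X[j])
    return D


def k_medoids(X, n_clusters, distance=euclidean_distance, max_iter=100, seed=0):
    """Illustrative alternating k-medoids (hypothetical helper, not sktime's)."""
    rng = np.random.default_rng(seed)
    D = pairwise_distance(X, distance)
    # Start from randomly chosen cases as medoids.
    medoids = rng.choice(len(X), size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if members.size:  # keep the old medoid if a cluster is empty
                # Update step: pick the member with the smallest total
                # distance to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids
```

Because the algorithm only ever touches the distance matrix, swapping in another distance (for example an elastic one) would not require changing the clustering loop, and multivariate input needs no special handling beyond what the distance itself does.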

In addition, refactoring the distances brings large performance benefits.

This PR is only missing DBA as an averaging method for k-means. That will be a large change, so it has been split out of this PR (the DBA refactor is coming in the next few days).

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions

I've added myself to the list of contributors.
Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
I've added unit tests and made sure they pass locally.

For new estimators

I've added the estimator to the online documentation.
I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

TonyBagnall · 2022-01-13T21:28:39Z

I can't see how it needs deprecating; maybe @chrisholder can comment.

chrisholder · 2022-01-13T22:11:28Z

The module was never 'public', so I don't think anything needs to be deprecated. It isn't even included in the sktime docs, if you look here: https://www.sktime.org/en/stable/api_reference.html. The only way of seeing it is to search for it. It's also marked as an 'Experimental feature', so any users who have used it will know that everything is subject to change.

TonyBagnall

we should have a default predict_proba here, as with BaseClassifier

chrisholder · 2022-01-13T22:22:55Z

What do you want that to return? I can see the use for fuzzy clustering, but what should it return for default, non-fuzzy clustering?

fkiraly · 2022-01-14T00:38:55Z

> we should have a default predict_proba here, as with BaseClassifier

> What do you want that to return?

Probabilistic cluster assignment? That would make sense in clusterers which, for instance, fit a full mixture distribution. Then, the concept of conditional likelihood of cluster membership makes sense (which would be a pmf).
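To make the mixture case concrete, here is a minimal sketch of the conditional membership pmf, computed with Bayes' rule from per-cluster log-likelihoods and mixing weights. The function name `membership_pmf` and its inputs are hypothetical and are not part of sktime; how a given clusterer would obtain the log-likelihoods is left open.

```python
import numpy as np


def membership_pmf(log_lik, weights):
    """Posterior cluster membership for a fitted mixture (illustrative only).

    log_lik : array of shape (n_cases, n_clusters), log f_k(x_i)
    weights : array of shape (n_clusters,), mixing weights summing to 1
    """
    log_post = np.asarray(log_lik, dtype=float) + np.log(weights)
    # Subtract the row maximum before exponentiating for numerical stability.
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    # Normalise so that each row is a pmf over clusters.
    return post / post.sum(axis=1, keepdims=True)
```

Each row of the result sums to 1, so it can be read as the conditional probability of each cluster given the observed series.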

TonyBagnall · 2022-01-14T10:20:05Z

So the default method should call predict, then assign a cluster probability of 1 to the predicted cluster and 0 to the others (see BaseClassifier). I'm assuming the number of clusters is set in predict and stored, of course.
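A minimal sketch of that default, assuming predict returns integer labels in `0..n_clusters-1` and that the number of clusters is known; the standalone function `default_predict_proba` is hypothetical and only illustrates the one-hot construction, not a method signature in sktime:

```python
import numpy as np


def default_predict_proba(labels, n_clusters):
    """One-hot 'probabilities' from hard cluster labels (illustrative only)."""
    labels = np.asarray(labels, dtype=int)
    proba = np.zeros((labels.shape[0], n_clusters))
    # Probability 1 for the predicted cluster, 0 for all others.
    proba[np.arange(labels.shape[0]), labels] = 1.0
    return proba
```

For example, `default_predict_proba([0, 2, 1], n_clusters=3)` gives one row per case with a single 1 in the column of the predicted cluster.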

TonyBagnall · 2022-01-14T13:26:45Z

Actually, on discussion, predict_proba is problematic for some potential clustering algorithms, so we will defer this feature until we actually need it. This is good to go now. Last orders, @fkiraly: I want to merge it once it passes tests.

fkiraly

Thanks!

Sorry to block, but the docstrings in the extension template seem to be incorrect.
X in _fit should say that it's guaranteed to be of X_inner_mtype. If that's not numpy, the implementation will break despite the assurances given by the template.

See BaseClassifier for a formulation that people thought wasn't too obscure (it could be obscure if you just talk about mtypes without a reference) and is also correct.

TonyBagnall · 2022-01-14T21:49:59Z

Oh ok I'll look next week

TonyBagnall · 2022-01-15T11:26:02Z

@fkiraly I've changed the comments in the extension templates. Let me know if they are OK now.

fkiraly

OK, I think this is now good enough.
I have some suggestions on how to make the extension templates clearer, but let's do this in a separate PR.

Further, is _predict really mandatory for clusterers?
I don't think it should be: not all clustering algorithms result in a unique cluster assignment!
Examples: probabilistic or likelihood-based clustering, and hierarchical or dendrogram clustering, which results in a tree but not a cluster assignment.

chrisholder added 30 commits October 25, 2021 17:40

euclidean distance

70b1a64

euclidean distance

b64b6e3

Euclidean clean up

3a410fe

test utils

239a609

tests written

b7054a3

fixed bug with creation of test distance

f356bbf

added cache

6bf382a

pairwise docstring added

14e177f

update docstring

eb3dec8

init

a1a2b0f

update docstring

ea30687

fixed bug with euclidean and updated tests

a291853

squared distance added and cleaned up

e274d2c

improved testing and added expected results

8bbc68f

rename

08d63db

rename refactor

ca770c9

rename refactor

ba92c22

utils

64edddf

base class defined

ed34fcd

euclidean distance

107d544

main distance functionality

c4a55f4

pairwise distance

d52a710

tests rewritten

efa4d42

added missed type to base

25ab634

updated incorrect expected result

56b0994

added squared distance support

62cb5fb

squared distance

6870cb8

docstring

086f34f

added assertion error messages

f58fed7

added the ability to pass the python method of the distance function …

5f3b3a6

…to distance

Tony Bagnall added 2 commits January 13, 2022 21:59

refinement of extension guidelines for classification and clustering

42f9643

refinement of extension guidelines for classification and clustering

6966c07

TonyBagnall suggested changes Jan 13, 2022

refinement of extension guidelines for classification and clustering

cc2f129

TonyBagnall previously approved these changes Jan 14, 2022

TonyBagnall requested a review from fkiraly January 14, 2022 10:20

chrisholder dismissed TonyBagnall’s stale review via e519647 January 14, 2022 12:36

chrisholder added 2 commits January 14, 2022 12:36

added predict proba and pass distance params to constructor

d0ef545

Merge branch 'clustering-update' of github.com:alan-turing-institute/…

e519647

…sktime into clustering-update

TonyBagnall self-requested a review January 14, 2022 13:26

TonyBagnall previously approved these changes Jan 14, 2022

Merge branch 'main' into clustering-update

33f5615

fkiraly requested changes Jan 14, 2022

new classifier notebook

260e65b

TonyBagnall dismissed their stale review via 260e65b January 15, 2022 10:52

new classifier notebook

c07ab9d

use get_tags rather than _tags

2bf72f9

TonyBagnall requested a review from fkiraly January 15, 2022 13:57

Merge branch 'main' into clustering-update

b3f641d

fkiraly approved these changes Jan 15, 2022

TonyBagnall merged commit 79cc513 into main Jan 16, 2022

TonyBagnall deleted the clustering-update branch January 16, 2022 12:27
