Dask: KMeans #6277

Merged
merged 3 commits into from Jun 28, 2023

Conversation

noahnovsak
Contributor

@noahnovsak noahnovsak commented Jan 5, 2023

Add dask_ml.cluster.KMeans as an alternative to sklearn.cluster.KMeans for dask arrays.

Probably worth mentioning, dask_ml adds the "k-means||" init option but defaults to the sklearn implementation in the case of "k-means++" or "random" initialization. So using "k-means||" is what enables working with larger datasets at all. However, smaller datasets are still processed much faster just using sklearn directly as there seems to be a lot of overhead with dask_ml.
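
A minimal sketch of the selection described above, not the code in this PR: pick the estimator by input type and use "k-means||" only for dask arrays. The helper names `choose_backend` and `make_kmeans` are hypothetical, and the default parameter values are assumptions; the libraries are imported lazily so the module loads even without dask installed.

```python
import importlib


def choose_backend(is_dask):
    """Return (module name, init method) for the given kind of input."""
    if is_dask:
        # Per the description above, "k-means||" is the init that keeps
        # dask_ml from falling back to the sklearn implementation.
        return "dask_ml.cluster", "k-means||"
    # For in-memory data, plain scikit-learn avoids the dask_ml overhead.
    return "sklearn.cluster", "k-means++"


def make_kmeans(X, n_clusters=3, max_iter=300):
    """Build an unfitted KMeans estimator suited to the type of X."""
    try:
        import dask.array as da
        is_dask = isinstance(X, da.Array)
    except ImportError:
        # Without dask installed, X cannot be a dask array.
        is_dask = False
    module_name, init = choose_backend(is_dask)
    KMeans = importlib.import_module(module_name).KMeans
    return KMeans(n_clusters=n_clusters, init=init, max_iter=max_iter)
```

Keeping the backend choice in a separate pure function makes it easy to test without either library present.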

  # Do not needlessly recluster the data if X hasn't changed
- if old_data and self.data and array_equal(self.data.X, old_data.X):
+ if old_data and self.data and array_equal(self.data.X, old_data.X): # could be an issue for dask
Member

This is a good find. I remember you telling me about it, but then I forgot. I do not have a good solution for now, but we would need to ensure (and test) at the minimum that array_equal does not load the whole data set into memory.

Contributor Author

Turns out np.allclose works fine on dask. Comparing an 8GB array against itself took a little over two seconds (this is the worst-case scenario, running in an IPython shell that doesn't play well with dask) and memory usage stayed well within reasonable limits.
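
A rough sketch of how the check could look with np.allclose, relying on the observation above that it works on dask arrays. The function name `data_unchanged` is hypothetical and this is not the code that was merged.

```python
def data_unchanged(old_X, new_X):
    """Return True if both arrays hold numerically the same values."""
    if old_X is None or new_X is None:
        return False
    # Shapes are available without computing anything, so a mismatch
    # is detected before any data is read.
    if old_X.shape != new_X.shape:
        return False
    import numpy as np  # imported late so the cheap checks need no NumPy
    # bool() forces the result; for dask inputs this is where the
    # comparison is actually computed.
    return bool(np.allclose(old_X, new_X))
```

Note that allclose compares within a tolerance and, by default, treats NaN as unequal, so it is not a drop-in equivalent of an exact equality test.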

@markotoplak
Member

Please add dask-ml to requirements (that means tox.ini as the oldest version specification, requirements-?.txt, and meta.yaml).

Next, this does not actually use context settings: they always need a context handler (in this case it should detect whether you have a dask table or not), and then they need openContext and closeContext calls.

Maybe we do not need them here and they were a bad idea I had... But what needs to work is the following:

  • use kmeans with ordinary data, select any option, let's call it O
  • change input to dask data, select anything
  • change back to ordinary data, the option O needs to be selected

From the code I'd guess this does not work at least for some possible O. :)
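
The three steps above can be sketched as a small check. This is only an illustration of the required behaviour, assuming one remembered choice per kind of input; the class `InitSelection` and its defaults are hypothetical and say nothing about how Orange's context settings would implement it.

```python
class InitSelection:
    """Remembers one init choice per kind of input (ordinary vs. dask)."""

    # Assumed defaults, for illustration only.
    DEFAULTS = {"ordinary": "k-means++", "dask": "k-means||"}

    def __init__(self):
        self._chosen = dict(self.DEFAULTS)

    @staticmethod
    def _kind(is_dask):
        return "dask" if is_dask else "ordinary"

    def select(self, is_dask, option):
        """Store the user's choice for the current kind of input."""
        self._chosen[self._kind(is_dask)] = option

    def current(self, is_dask):
        """Return the choice to show for the current kind of input."""
        return self._chosen[self._kind(is_dask)]


sel = InitSelection()
sel.select(is_dask=False, option="random")     # option O on ordinary data
sel.select(is_dask=True, option="k-means||")   # anything on dask data
restored = sel.current(is_dask=False)          # back to ordinary data
```

After the last line, `restored` is the option O chosen in the first step, which is the behaviour asked for.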

And then we have to remember that we still need settings migration.

@janezd
Contributor

janezd commented Jan 25, 2023

  1. In Sample Data we have a checkbox Stratify (if possible). Is this an option here? Combos (I would actually prefer radio buttons, but this may be my current temporary preference) and the one that is not always applicable would indicate what it defaults to. At least in a tooltip.

    Option "O" would remain chosen, but a warning would indicate that it was not used.

Another thing: this is only available for Dask, so it will be off for the majority of users. If so, I would disable it when inapplicable. If chosen, I would keep it chosen (but disabled!), and have a warning etc. So the user may manually choose another option, or keep the disabled one, which won't work anyway.

An alternative to the above would be to hide the option if it is unavailable and not chosen.
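
Both variants (disable the option, or hide it unless chosen) can be expressed as a small state function. A sketch under assumptions: the names `DASK_ONLY`, `FALLBACK`, `option_state` and `effective_init` are made up for illustration, and the fallback to "k-means++" is an assumed default, not something decided in this thread.

```python
DASK_ONLY = {"k-means||"}   # options that only apply to dask data
FALLBACK = "k-means++"      # assumed default when the choice cannot be used


def option_state(option, selected, is_dask, hide_unavailable=False):
    """Return (visible, enabled) for one init option in the GUI."""
    available = is_dask or option not in DASK_ONLY
    if available:
        return True, True
    # Unavailable: always disabled; hidden only if requested and not chosen.
    visible = option == selected or not hide_unavailable
    return visible, False


def effective_init(selected, is_dask):
    """Return (init actually used, warning text or None)."""
    if selected in DASK_ONLY and not is_dask:
        warning = f"'{selected}' is not used for this data; using '{FALLBACK}'"
        return FALLBACK, warning
    return selected, None
```

The stored choice is never overwritten here; only the option that is actually used changes, together with a warning.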

@codecov

codecov bot commented Jan 25, 2023

Codecov Report

Merging #6277 (ace9abb) into dask (2f0bec2) will increase coverage by 0.00%.
The diff coverage is 95.23%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dask    #6277   +/-   ##
=======================================
  Coverage   87.64%   87.65%           
=======================================
  Files         322      322           
  Lines       69601    69637   +36     
=======================================
+ Hits        61002    61040   +38     
+ Misses       8599     8597    -2     

k = Setting(3)
k_from = Setting(2)
k_to = Setting(8)
optimize_k = Setting(False)
max_iterations = Setting(300)
n_init = Setting(10)
smart_init = Setting(0) # KMeans++
Member

This change, if we go with it, also requires settings migration. Just leaving it here as a note.
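
For reference, a migration of this kind might look roughly as follows. This is a sketch only: the version number `NEW_VERSION` and the index remapping in `OLD_TO_NEW` are hypothetical, since the thread does not say how the stored value of smart_init changes. In an Orange widget this logic would live in the migrate_settings classmethod; here it is a plain function on a dict so it can be run on its own.

```python
NEW_VERSION = 3  # hypothetical settings version after this change

# Hypothetical remapping: assume a new option is inserted at index 0,
# so the old indices 0 (KMeans++) and 1 (random) shift up by one.
OLD_TO_NEW = {0: 1, 1: 2}


def migrate_settings(settings, version):
    """Upgrade a stored settings dict in place to NEW_VERSION."""
    if version is None or version < NEW_VERSION:
        old = settings.get("smart_init")
        if old in OLD_TO_NEW:
            settings["smart_init"] = OLD_TO_NEW[old]
    # Settings already saved at NEW_VERSION are left untouched.
    return settings
```

Guarding on the version keeps the remapping from being applied twice to the same saved workflow.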

@markotoplak
Member

Thanks. We need at least some basic tests for this.

@noahnovsak noahnovsak force-pushed the kmeans-dask branch 2 times, most recently from e01318f to b13b1d8 Compare June 22, 2023 12:28
@markotoplak markotoplak added the dask Related (discovered in or needed) to the Dask adaptation label Jun 28, 2023
@markotoplak markotoplak merged commit f25a2a9 into biolab:dask Jun 28, 2023
21 of 27 checks passed
markotoplak added a commit that referenced this pull request Jun 28, 2023
@noahnovsak noahnovsak deleted the kmeans-dask branch July 4, 2023 08:54
markotoplak added a commit that referenced this pull request Jul 12, 2023
markotoplak added a commit that referenced this pull request Jul 14, 2023
markotoplak added a commit that referenced this pull request Jul 20, 2023
markotoplak added a commit to markotoplak/orange3 that referenced this pull request Jul 26, 2023
markotoplak added a commit that referenced this pull request Aug 15, 2023
markotoplak added a commit that referenced this pull request Aug 17, 2023
markotoplak added a commit that referenced this pull request Aug 21, 2023
markotoplak added a commit that referenced this pull request Sep 4, 2023
markotoplak added a commit that referenced this pull request Sep 14, 2023
markotoplak added a commit to markotoplak/orange3 that referenced this pull request Sep 14, 2023
markotoplak added a commit that referenced this pull request Sep 18, 2023
markotoplak added a commit that referenced this pull request Sep 26, 2023
markotoplak added a commit that referenced this pull request Oct 10, 2023
markotoplak added a commit that referenced this pull request Oct 13, 2023
markotoplak added a commit that referenced this pull request Oct 21, 2023
markotoplak added a commit that referenced this pull request Oct 29, 2023
markotoplak added a commit that referenced this pull request Nov 6, 2023
markotoplak added a commit that referenced this pull request Jan 23, 2024
Labels
dask Related (discovered in or needed) to the Dask adaptation
3 participants