k-means Memory usage #39
@mrocklin I don't see the "failed to deserialize" error every time. Adding the following diff to `distributed/protocol/serialize.py`:

```diff
diff --git a/distributed/protocol/serialize.py b/distributed/protocol/serialize.py
index d657dab..b7b492e 100644
--- a/distributed/protocol/serialize.py
+++ b/distributed/protocol/serialize.py
@@ -87,6 +87,8 @@ def typename(typ):
 def _find_lazy_registration(typename):
     toplevel, _, _ = typename.partition('.')
     if toplevel in lazy_registrations:
+        import time
+        time.sleep(.1)
         lazy_registrations.pop(toplevel)()
         return True
     else:
```
Collecting some debugging observations. (Comparison screenshots not captured here: slow on the left, normal on the right.) When slicing with multiple indices (e.g. the slow case), things look different. Not sure if this is meaningful or not; I suspect it is. I assumed the order of operations would be getitem then transfer. But if it's transfer then getitem, that would explain the slowdown and memory increase.
If your getitem/transfer vs transfer/getitem question refers to the above code, then you should be fine. Dask.array definitely does the intelligent thing here.
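To make the getitem/transfer distinction concrete, here is a pure-NumPy sketch (hypothetical helper names, not dask code) contrasting the two orderings: applying getitem on the worker that holds the chunk moves only a few bytes, while transferring the whole chunk first moves the full chunk across the wire.

```python
import numpy as np

chunk = np.arange(1_000_000, dtype=np.float64)  # an ~8 MB "remote" chunk
idx = [3, 17, 42]

def getitem_then_transfer(chunk, idx):
    small = chunk[idx]      # slice where the chunk lives
    return small.copy()     # only ~24 bytes cross the wire

def transfer_then_getitem(chunk, idx):
    moved = chunk.copy()    # the whole ~8 MB chunk crosses the wire
    return moved[idx]       # then slice on the receiving side

# Both orderings produce the same values; only the data movement differs.
assert np.array_equal(getitem_then_transfer(chunk, idx),
                      transfer_then_getitem(chunk, idx))
```

Dask.array builds task graphs in the first shape, which is why the memory growth here was surprising.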
Interestingly, whether or not the indexer is sorted seems to matter. Adding a `sorted`, i.e. `centers = c.compute(X[sorted(idx)])`, outputs:

```
3.84 GB
3.84 GB
3.84 GB
3.84 GB
3.84 GB
```

Trying it out in k_init now.
There is a special fast path within `dask/array/slicing.py` for when the index is sorted. You might search that file for that fast path.
Or just always sort.
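If the caller needs rows back in their original order, always sorting can be wrapped in a small helper. Here is a minimal pure-NumPy sketch (a hypothetical helper, not dask's internals): sort the index, slice once with the sorted index, then invert the permutation.

```python
import numpy as np

def take_sorted(X, idx):
    """Fancy-index X with idx sorted ascending, then restore the
    caller's original order. Slicing with a sorted index lets a
    chunked-array library hit its sorted fast path, visiting each
    chunk once in order instead of shuffling between chunks."""
    idx = np.asarray(idx)
    order = np.argsort(idx, kind="stable")  # positions that sort idx
    taken = X[idx[order]]                   # slice with the sorted index
    out = np.empty_like(taken)
    out[order] = taken                      # undo the sort
    return out

X = np.arange(100).reshape(10, 10)
idx = [7, 2, 9, 2]
assert np.array_equal(take_sorted(X, idx), X[idx])
```

When the downstream computation doesn't care about row order (as in sampling candidate centers), the inverse-permutation step can be dropped and `X[np.sort(idx)]` used directly.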
Just sorting works for me here. I'll still take a look in `slicing.py` to see if anything weird is going on.
Thanks!
Referenced in later commits:

* PERF: Sort indexes before slicing. Seems to help the task scheduler in some situations. Closes #39
* BUG: Pass through random_state to sample points
* First pass at re-adding feature union support. Still needs tests, and could be more efficient.
* Add tests for feature_unions
* Simplify the code a bit
* Relax requirements
Debugging a memory usage issue I'm seeing with k-means initialization. The issue is in this loop. A small example is

(code snippet not captured)

which outputs

(output not captured)

@mrocklin, is that increasing memory usage expected? Inside that for loop, the result is always going to be a small NumPy array.

(Side note: that generates some exceptions I've left out of the output; looking into those now.)
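The loop in question isn't captured above, so as a hypothetical, self-contained NumPy sketch of the pattern being described: a k-means-style init loop where each step slices only a few rows out of a large array, so every intermediate result is small and memory usage would be expected to stay flat.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 5))   # stand-in for the large data array

# Start from one randomly chosen center.
centers = X[rng.choice(len(X), size=1)]

for _ in range(5):
    # Squared distance of every point to its nearest current center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    # Sample a few candidate rows, weighted by squared distance.
    idx = rng.choice(len(X), size=3, p=d2 / d2.sum())
    # Slice with a sorted index; each step's result is a tiny array.
    centers = np.concatenate([centers, X[np.sort(idx)]])

# 1 initial center + 3 per step over 5 steps.
assert centers.shape == (16, 5)
```

With `X` as a dask array, each `X[np.sort(idx)]` would pull only a handful of rows per step, which is why unbounded memory growth across iterations was surprising.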