k-means Memory usage #39 (Closed)

TomAugspurger opened this issue Oct 16, 2017 · 9 comments

Debugging a memory-usage issue I'm seeing with k-means initialization. The issue is in this loop. A small example:

import numpy as np
import dask.array as da
from dask import delayed
from distributed import Client, wait

from sklearn.datasets import make_classification


if __name__ == '__main__':

    c = Client()
    s = c.cluster.scheduler
    N_SAMPLES = 1_000_000
    N_BLOCKS = 24

    def mem():
        # Total bytes currently held in memory across all workers.
        print("{:.2f}".format(sum(s.worker_bytes.values()) / 10**9), "GB")

    def make_block(n_samples):
        # One (n_samples, 20) block of random classification data.
        X, y = make_classification(n_samples=n_samples)
        return X

    blocks = [delayed(make_block)(N_SAMPLES) for _ in range(N_BLOCKS)]

    arrays = [da.from_delayed(block, dtype='f8', shape=(N_SAMPLES, 20))
              for block in blocks]
    stacked = da.vstack(arrays)

    print(stacked.nbytes / 10**9, "GB")
    X = c.persist(stacked)
    wait(X)

    # Repeatedly fancy-index five random rows; each result is a tiny
    # (5, 20) NumPy array.
    for i in range(5):
        idx = np.random.randint(0, len(X), size=5)
        centers = X[idx].compute()
        mem()

This outputs:

3.84 GB
4.32 GB
4.80 GB
5.12 GB
5.60 GB
6.08 GB
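For scale, the returned centers themselves are tiny; a rough check after the loop above (shapes as in the script):

# Each computed slice is a (5, 20) float64 array, i.e. 800 bytes,
# nowhere near the hundreds of MB the cluster grows by per iteration.
print(centers.shape, centers.nbytes, "bytes")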

@mrocklin is that increasing memory usage expected? Inside that for loop, the result is always going to be a small NumPy array.

(Side note: that also generates some exceptions, which I've left out of the output above):

distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/Users/taugspurger/.virtualenvs/dask-dev/lib/python3.6/site-packages/distributed/distributed/protocol/core.py", line 122, in loads
    value = _deserialize(head, fs)
  File "/Users/taugspurger/.virtualenvs/dask-dev/lib/python3.6/site-packages/distributed/distributed/protocol/serialize.py", line 160, in deserialize
    f = deserializers[header.get('type')]
KeyError: 'numpy.ndarray'

Looking into those now.

@TomAugspurger commented Oct 16, 2017

@mrocklin I don't see the "failed to deserialize" error every time. Adding

diff --git a/distributed/protocol/serialize.py b/distributed/protocol/serialize.py
index d657dab..b7b492e 100644
--- a/distributed/protocol/serialize.py
+++ b/distributed/protocol/serialize.py
@@ -87,6 +87,8 @@ def typename(typ):
 def _find_lazy_registration(typename):
     toplevel, _, _ = typename.partition('.')
     if toplevel in lazy_registrations:
+        import time
+        time.sleep(.1)
         lazy_registrations.pop(toplevel)()
         return True
     else:

to serialize.py makes the failure consistent. Perhaps a race condition in the lazy importer? I'll see if I can narrow the example down further.
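If it is a race, one (untested) sketch of a guard, reusing the names from the diff above and assuming lazy_registrations is a plain dict shared between threads:

import threading

_lazy_lock = threading.Lock()

def _find_lazy_registration(typename):
    toplevel, _, _ = typename.partition('.')
    with _lazy_lock:  # only one thread performs a given lazy registration
        if toplevel in lazy_registrations:
            lazy_registrations.pop(toplevel)()
            return True
        return False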

@TomAugspurger

Collecting some debugging observations:

With an indexer of length 1, the getitem normally takes ~0.01 s and there's no increase in memory usage. When things slow down, tasks take ~0.5 s, and the slower tasks include a disk-read-getitem:

[screenshot: task stream comparison (slow on the left, normal on the right)]

When slicing with multiple elements (e.g. size=2), the task graphs sometimes look different.

[task graph: fast case]

[task graph: slow case]

Not sure whether this is meaningful, but I suspect it is. I assumed the order of operations would be

getitem on each worker -> concatenate the small results

but if it's instead

transfer whole blocks to a single worker -> getitem

that would explain both the slowdown and the memory increase.
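One way to check which ordering dask actually generates is to look at the task graph for the slice on a small stand-in array (a rough sketch, not the persisted 3.84 GB array):

import numpy as np
import dask.array as da

# Small stand-in for the persisted array above.
X_small = da.random.random((1_000, 20), chunks=(100, 20))
idx = np.random.randint(0, len(X_small), size=5)

sliced = X_small[idx]
# Per-block getitem tasks followed by a concatenate suggest the first
# ordering; a single getitem depending on many whole blocks moved to
# one worker would suggest the second.
for key in dict(sliced.__dask_graph__()):
    print(key)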

@mrocklin

disk-read-* time blocks are due to getting elements out of worker.data. This could mean that there are many elements in worker.data held in memory, or, more likely, that a few elements have been spilled to disk. This also corresponds to the colored bars in the upper-left memory-use plot on the scheduler page.
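A quick way to check is to count the keys each worker is holding (a rough sketch; c is the Client from the script above):

# Number of keys in each worker's worker.data; a count that keeps growing
# across iterations points at accumulating, and possibly spilled, results.
def n_keys(dask_worker):
    return len(dask_worker.data)

print(c.run(n_keys))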

@mrocklin

If your getitem-then-transfer vs. transfer-then-getitem question refers to the code above:

for i in range(5):
    idx = np.random.randint(0, len(X), size=5)
    centers = X[idx].compute()
    mem()

then you should be fine; dask.array definitely does the intelligent thing here (getitem on each block, then concatenate the small results).

@TomAugspurger

Interestingly, whether or not the indexer is sorted seems to matter. Adding sorted(idx) before indexing:

centers = c.compute(X[sorted(idx)])

outputs:

3.84 GB
3.84 GB
3.84 GB
3.84 GB
3.84 GB

Trying it out in k_init now.
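For reference, the repro loop with the sort applied (same setup as the script above):

# Sorting the index before slicing keeps memory flat at ~3.84 GB.
for i in range(5):
    idx = np.sort(np.random.randint(0, len(X), size=5))
    centers = X[idx].compute()
    mem()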

@mrocklin

There is a special fast path within dask/array/slicing.py for when the index is sorted. You might search that file for issorted to see what the non-fast-path is doing.
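Roughly the kind of check involved (an illustrative sketch, not the actual dask source):

import numpy as np

def is_sorted(index):
    # True if the fancy index is monotonically non-decreasing.
    index = np.asarray(index)
    return bool((index[:-1] <= index[1:]).all())

idx = np.random.randint(0, 1_000_000, size=5)
print(is_sorted(idx), is_sorted(np.sort(idx)))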

@mrocklin

Or just always sort

@TomAugspurger

Just sorting works for me here. I'll still take a look at slicing.py to see if anything weird is going on.

@TomAugspurger

Thanks!

TomAugspurger added a commit to TomAugspurger/dask-ml that referenced this issue Oct 17, 2017:

    Seems to help the task scheduler in some situations
    Closes dask#39

TomAugspurger added a commit that referenced this issue Oct 17, 2017:

    * PERF: Sort indexes before slicing
      Seems to help the task scheduler in some situations
      Closes #39
    * BUG: Pass through random_state to sample points