MemoryError with dask.array.from_tiledb #5915

Open
DPeterK opened this issue Feb 18, 2020 · 11 comments

DPeterK commented Feb 18, 2020

TileDB allows you to create arrays with an effectively unlimited domain. If you try to create a dask array from such a TileDB array, however, you unsurprisingly hit issues. For example, running the code below produces a segmentation fault, which I think is hiding a MemoryError (I've seen that error before with similar code, but unfortunately can't reproduce it here).

Here's some code to demonstrate the problem:

import dask.array as da
import numpy as np
import tiledb

array_name = './tdb/example'
attr_name = 'test'

x_dim = tiledb.Dim(name="x", domain=(0, np.iinfo(np.uint64).max-1), tile=1, dtype=np.uint64)
y_dim = tiledb.Dim(name="y", domain=(0, 100), tile=10, dtype=np.uint64)
domain = tiledb.Domain(x_dim, y_dim)

attr = tiledb.Attr(name=attr_name, dtype=np.int64)
schema = tiledb.ArraySchema(domain=domain, sparse=False, attrs=[attr])
tiledb.Array.create(array_name, schema)

data = np.arange(100).reshape(10, 10)
with tiledb.open(array_name, 'w') as A:
    A[0:10, 0:10] = data
    
with tiledb.open(array_name, 'r') as A:
    print(f'Nonempty domain: {A.nonempty_domain()}')
    
array = da.from_tiledb(array_name, attribute=attr_name)

And here's an example of running the code:

$ python dask_memerror_code.py 
Nonempty domain: ((0, 9), (0, 9))
Segmentation fault (core dumped)

I think dask is unable to create a Python array large enough for the size of the TileDB array, regardless of the TileDB array's nonempty domain.

A possible solution for this would be to add functionality to da.from_tiledb to allow the TileDB array to be subset when it's read into a dask array, such as:

inds = (slice(0, 10, None), slice(0, 10, None))
array = da.from_tiledb(array_name, attribute=attr_name, inds=inds)

which could then be used when creating the dask array to subset the TileDB array:

    return core.from_array(tdb[inds], chunks, name="tiledb-%s" % uri)

I'm not sure, however, whether this would load the indexed data from the TileDB array into local memory, which could cause a different problem (exhausting memory) if the indexed subset were itself large.

@jsignell (Member)

Hi Peter, thanks for raising an issue. I think it would probably be helpful to pull in @ihnorton here since it looks like they did most of the TileDB implementation.

@ihnorton (Contributor)

Thanks for the ping @jsignell, I will take a look.

ihnorton commented Feb 22, 2020

Some quick notes here, will look further on Monday:

  • @DPeterK the first issue is that you are passing attribute='attr', which should instead be attribute='test'. This seems to be failing inside TileDB-Py, I believe in the Cython layer, and ending up as the segfault -- I will dig into this in TileDB-Py to improve the failure mode and error reporting here.
  • after fixing the above, the second issue is a memory error when dask creates the chunk list:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/tiledb_io.py", line 72, in from_tiledb
    return core.from_array(tdb, chunks, name="tiledb-%s" % uri)
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 2725, in from_array
    chunks, x.shape, dtype=x.dtype, previous_chunks=previous_chunks
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 2459, in normalize_chunks
    (),
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 2457, in <genexpr>
    for s, c in zip(shape, chunks)
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 954, in blockdims_from_blockshape
    for d, bd in zip(shape, chunks)
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 954, in <genexpr>
    for d, bd in zip(shape, chunks)
MemoryError

It looks like this happens because dask.array is trying to cover the shape of the tiledb array -- which is the entire requested domain:

> /cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/tiledb_io.py(72)from_tiledb()
-> return core.from_array(tdb, chunks, name="tiledb-%s" % uri)
(Pdb) p tdb.shape
(18446744073709550615, 101)
(Pdb) p chunks
[1000, 10]
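
For a rough sense of scale (back-of-the-envelope numbers only, not part of the traceback above), the chunk tuple dask would have to build along the first dimension here is enormous:

shape0, chunk0 = 18446744073709550615, 1000
n_chunks = shape0 // chunk0
print(n_chunks)                  # 18446744073709550, i.e. roughly 1.8e16 entries
# even at a nominal 8 bytes per entry, that single tuple would need on the
# order of 100 PiB, hence the MemoryError
print(n_chunks * 8 / 1024**5)    # ~131 (PiB)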

Dask supports np.nan as a dimension shape, but then it seems the array can't be sliced on that dimension. I'll have to take a deeper look at whether and how this is handled in Dask for other storage backends.
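
As a small illustration of what I mean (a quick sketch, not exhaustively tested; it only needs dask and pandas installed), a dask array with unknown (NaN) chunk sizes can be built from a dataframe, but integer slicing along the unknown dimension is then expected to fail:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)
arr = ddf.to_dask_array()   # no lengths passed, so chunks along axis 0 are NaN
print(arr.chunks)           # ((nan, nan), (1,))
arr[0:3]                    # expected to raise: chunk sizes along axis 0 are unknown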

Alternatively, we may be able to add an option in TileDB-Py to take a view on a subset of a dimension (as suggested by @DPeterK), which would make the .shape of the TileDB.Array object small enough for dask to cover with chunks. I'll think this through a bit on Monday with @stavrospapadopoulos.

@ihnorton (Contributor)

For the MemoryError relating to the chunk list, I believe the dimension size for non-delayed arrays is inherently limited by the memory available on the main node, because chunk lists are calculated eagerly (there are some open issues about task graph size which might also come into play). Hopefully a dask dev will comment if I've missed something.

The large dimension scenario might be a good fit for a delayed array or delayed dataframe (given that you will be creating arrays with multiple attributes).
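
For what it's worth, here is a minimal sketch of that delayed-read idea (a rough sketch, not an existing dask or TileDB-Py API; it assumes the ./tdb/example array and 'test' attribute created in the first comment, and covers only the known non-empty 10x10 region):

import dask
import dask.array as da
import numpy as np
import tiledb

uri = './tdb/example'   # path and attribute from the example at the top of this issue
attr_name = 'test'

@dask.delayed
def read_block(xs, ys):
    # opening with attr=... restricts the array to a single attribute,
    # so indexing returns a plain ndarray for just this block
    with tiledb.open(uri, 'r', attr=attr_name) as A:
        return A[xs, ys]

# nothing is read from disk until the dask array is computed
block = da.from_delayed(read_block(slice(0, 10), slice(0, 10)),
                        shape=(10, 10), dtype=np.int64)
print(block.sum().compute())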

The first issue, with the attribute name mismatch leading to a crash, is now fixed in the TileDB-Py dev tree and will be included in the next release.

DPeterK commented Feb 28, 2020

Thanks @jsignell and @ihnorton for engaging with this issue, and apologies for the slight mistake in the example code! I've updated the code example so that the attribute name is consistent - but I'm glad that mistake led to some improved error reporting.

I've just tried using the indexing approach that I tentatively suggested above:

tdb = tiledb.open(tdb_data_path, attr='surface_air_pressure')
points = da.from_array(tdb[0:60096, 0:219, 0:286], chunks)

Given that my Jupyter kernel keeps dying, it seems this approach tries to realise all of the indexed TileDB array contents in memory and then write them to a dask array, rather than streaming or proxying the data from the TileDB array - as I was concerned about above. And yes, this is a large volume of data, but that's sort of the point here!

DPeterK commented Mar 2, 2020

I've realised that Iris already has a solution for the problem I described just above, in the form of the data proxy classes. Data proxy classes duck-type to being array-like, but only maintain a reference to the array on disk, rather than actually loading the array into memory. The only time the array is loaded into memory is on an indexing request, when only the subset defined by the index is loaded. I've a functional implementation of this for TileDB at https://github.com/informatics-lab/tiledb_netcdf/blob/master/nctotdb/readers.py#L38.
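
Roughly, the idea is something like this (a much simplified sketch, not the linked code; the path, attribute name, shape and dtype are taken from the example at the top of this issue, and the class name is just illustrative):

import dask.array as da
import numpy as np
import tiledb

class TileDBDataProxy:
    """Array-like proxy that only keeps a reference to the on-disk array."""
    def __init__(self, uri, attr, shape, dtype):
        self.uri = uri
        self.attr = attr
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, keys):
        # data is only read from disk here, and only the requested subset
        with tiledb.open(self.uri, 'r', attr=self.attr) as A:
            return A[keys]

proxy = TileDBDataProxy('./tdb/example', 'test', (10, 10), np.int64)
lazy = da.from_array(proxy, chunks=(10, 10))   # no data loaded until computed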

jsignell commented Mar 2, 2020

Great! So are you all good? If so I'll close the issue.

DPeterK commented Mar 3, 2020

I personally am sorted, but others using dask with TileDB in cases like the one I've described here will still hit memory errors 😉 I feel this might still be worth a fix in dask itself.

jsignell commented Mar 3, 2020

Fair enough. I'll leave it open in case it helps others.

@TomAugspurger (Member)

What's the summary of the current understanding? Is it the case that

tdb = tiledb.open(tdb_data_path, attr='surface_air_pressure')
points = da.from_array(tdb[0:60096, 0:219, 0:286], chunks)

involves materializing tdb in memory? Do we know why that is? With, for example, an h5py.File object there, you wouldn't have any issues, I think.

What's the next step for Dask to take?

ihnorton commented Mar 3, 2020

Hi @TomAugspurger,

  • with regard to Peter's suggestion, slicing tdb[0:60096, 0:219, 0:286] does eagerly materialize the result as a ~28 GB in-memory ndarray. I believe h5py and most other NumPy-like objects would do the same -- e.g. see this test script using h5py.File:
import h5py
import numpy as np

data = np.zeros((60097, 220, 287), dtype=np.uint64)

with h5py.File('/tmp/data.h5', 'w') as hf:
    dset = hf.create_dataset('test', data=data)

a = h5py.File('/tmp/data.h5', 'r')

s = a['test'][0:60096, 0:219, 0:286]
print("type: ", type(s))
print("dtype: ", s.dtype)
print("shape: ", s.shape)
print("nbytes: ", s.nbytes, " MB: ", s.nbytes / 1024**2)

--- output:

type:  <class 'numpy.ndarray'>
dtype:  uint64
shape:  (60096, 219, 286)
nbytes:  30112422912  MB:  28717.4443359375
  • however, note that opening the TileDB.Array tdb object a few lines above does not materialize anything substantial into memory (just a wrapper object)

  • I am looking at adding a subset option to tiledb.array.open which would allow the array to present a smaller .shape than is saved in the underlying schema/data (restricting indexing to that shape, etc.). I will submit a PR to support this in dask, if/when it is available.


Some other notes below:

  • the tdb object is array-like and reports a .shape of (18446744073709550615, 101)
  • Regarding the MemoryError that led to Peter's proposal, based on the reported .shape, dask tries to create a very large chunk list and hits a MemoryError (see this comment). The same thing happens with Zarr or h5py. Here's an example:
import h5py
import dask.array as da

chunks = (1000,100)

shape = (9223372036854775807, 100)

with h5py.File('data.h5', 'w') as h:
    dset = h.create_dataset('/data', shape=shape, chunks=chunks, dtype='f8')

a = h5py.File('data.h5', 'r')
a_data = a['/data']

da.from_array(a_data, chunks=chunks)

Running the above script gives a MemoryError in the same place, because it is trying to create a tuple with ~10^16 elements:

chunks = (1000,100)
shape = (9223372036854775807, 100)

# dask/array/core.py:954
tuple(
        ((bd,) * (d // bd) + ((d % bd,) if d % bd else ()) if d else (0,))
        for d, bd in zip(shape, chunks)
    )

-> also MemoryError.

(note that I am using div(BigInt(typemax(UInt64))-1, 2) == 9223372036854775807 because h5py doesn't like typemax(UInt64)-1 -- I get Unable to create dataset (unable to get the next power of 2))
