MemoryError with dask.array.from_tiledb #5915

Open
DPeterK opened this issue Feb 18, 2020 · 11 comments

DPeterK commented Feb 18, 2020

TileDB allows you to create arrays with an effectively unlimited domain. If you try to create a dask array from such a TileDB array, however, you unsurprisingly hit issues. For example, running the code below produces a segmentation fault, which I think is hiding a MemoryError (I've seen that error before with similar code, but unfortunately can't reproduce it here).

Here's some code to demonstrate the problem:

import dask.array as da
import numpy as np
import tiledb

array_name = './tdb/example'
attr_name = 'test'

x_dim = tiledb.Dim(name="x", domain=(0, np.iinfo(np.uint64).max-1), tile=1, dtype=np.uint64)
y_dim = tiledb.Dim(name="y", domain=(0, 100), tile=10, dtype=np.uint64)
domain = tiledb.Domain(x_dim, y_dim)

attr = tiledb.Attr(name=attr_name, dtype=np.int64)
schema = tiledb.ArraySchema(domain=domain, sparse=False, attrs=[attr])
tiledb.Array.create(array_name, schema)

data = np.arange(100).reshape(10, 10)
with tiledb.open(array_name, 'w') as A:
    A[0:10, 0:10] = data
    
with tiledb.open(array_name, 'r') as A:
    print(f'Nonempty domain: {A.nonempty_domain()}')
    
array = da.from_tiledb(array_name, attribute=attr_name)

And here's an example of running the code:

$ python dask_memerror_code.py 
Nonempty domain: ((0, 9), (0, 9))
Segmentation fault (core dumped)

I think dask is unable to create a Python array large enough for the size of the TileDB array, regardless of the TileDB array's nonempty domain.

A possible solution for this would be to add functionality to da.from_tiledb to allow the TileDB array to be subset when it's read into a dask array, such as:

inds = (slice(0, 10, None), slice(0, 10, None))
array = da.from_tiledb(array_name, attribute=attr_name, inds=inds)

which could then be used when creating the dask array to subset the TileDB array:

    return core.from_array(tdb[inds], chunks, name="tiledb-%s" % uri)

I'm not sure, however, whether this would load the indexed data from the TileDB array into local memory, which could cause a different problem (exhausting memory) if the indexed subset were itself large.

@jsignell (Member)

Hi Peter, thanks for raising an issue. I think it would probably be helpful to pull in @ihnorton here since it looks like they did most of the TileDB implementation.

@ihnorton (Contributor)

Thanks for the ping @jsignell, I will take a look.

ihnorton commented Feb 22, 2020

Some quick notes here, will look further on Monday:

  • @DPeterK the first issue is that you are passing attribute='attr', which should instead be attribute='test'. This seems to be failing inside TileDB-Py, I believe in the Cython layer, and ending up as the segfault -- I will dig into this in TileDB-Py to improve the failure mode and error reporting here.
  • after fixing the above, the second issue is a memory error when dask creates the chunk list:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/tiledb_io.py", line 72, in from_tiledb
    return core.from_array(tdb, chunks, name="tiledb-%s" % uri)
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 2725, in from_array
    chunks, x.shape, dtype=x.dtype, previous_chunks=previous_chunks
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 2459, in normalize_chunks
    (),
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 2457, in <genexpr>
    for s, c in zip(shape, chunks)
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 954, in blockdims_from_blockshape
    for d, bd in zip(shape, chunks)
  File "/cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/core.py", line 954, in <genexpr>
    for d, bd in zip(shape, chunks)
MemoryError

It looks like this happens because dask.array is trying to cover the shape of the tiledb array -- which is the entire requested domain:

> /cmn/condaenvs/tiledb/lib/python3.7/site-packages/dask/array/tiledb_io.py(72)from_tiledb()
-> return core.from_array(tdb, chunks, name="tiledb-%s" % uri)
(Pdb) p tdb.shape
(18446744073709550615, 101)
(Pdb) p chunks
[1000, 10]
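
For a rough sense of scale (back-of-the-envelope numbers only, not part of the traceback above), the chunk tuple dask would have to build along the first dimension here is enormous:

shape0, chunk0 = 18446744073709550615, 1000
n_chunks = shape0 // chunk0
print(n_chunks)                  # 18446744073709550, i.e. roughly 1.8e16 entries
# even at a nominal 8 bytes per entry, that single tuple would need on the
# order of 100 PiB, hence the MemoryError
print(n_chunks * 8 / 1024**5)    # ~131 (PiB)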

Dask supports np.nan as a dimension shape, but then it seems the array can't be sliced on that dimension. I'll have to take a deeper look at whether and how this is handled in Dask for other storage backends.
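
As a small illustration of what I mean (a quick sketch, not exhaustively tested; it only needs dask and pandas installed), a dask array with unknown (NaN) chunk sizes can be built from a dataframe, but integer slicing along the unknown dimension is then expected to fail:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)
arr = ddf.to_dask_array()   # no lengths passed, so chunks along axis 0 are NaN
print(arr.chunks)           # ((nan, nan), (1,))
arr[0:3]                    # expected to raise: chunk sizes along axis 0 are unknown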

Alternatively, we may be able to add an option in TileDB-Py to take a view on a subset of a dimension (as suggested by @DPeterK), which would make the .shape of the TileDB.Array object small enough for dask to cover with chunks. I'll think this through a bit on Monday with @stavrospapadopoulos.

@ihnorton (Contributor)

For the MemoryError relating to the chunk list, I believe the dimension size for non-delayed arrays is inherently limited by the memory available on the main node, because chunk lists are calculated eagerly (there are some open issues about task graph size which might also come into play). Hopefully a dask dev will comment if I've missed something.

The large dimension scenario might be a good fit for a delayed array or delayed dataframe (given that you will be creating arrays with multiple attributes).
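
For what it's worth, here is a minimal sketch of that delayed-read idea (a rough sketch, not an existing dask or TileDB-Py API; it assumes the ./tdb/example array and 'test' attribute created in the first comment, and covers only the known non-empty 10x10 region):

import dask
import dask.array as da
import numpy as np
import tiledb

uri = './tdb/example'   # path and attribute from the example at the top of this issue
attr_name = 'test'

@dask.delayed
def read_block(xs, ys):
    # opening with attr=... restricts the array to a single attribute,
    # so indexing returns a plain ndarray for just this block
    with tiledb.open(uri, 'r', attr=attr_name) as A:
        return A[xs, ys]

# nothing is read from disk until the dask array is computed
block = da.from_delayed(read_block(slice(0, 10), slice(0, 10)),
                        shape=(10, 10), dtype=np.int64)
print(block.sum().compute())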

The first issue, with the attribute name mismatch leading to a crash, is now fixed in the TileDB-Py dev tree and will be included in the next release.

DPeterK commented Feb 28, 2020

Thanks @jsignell and @ihnorton for engaging with this issue, and apologies for the slight mistake in the example code! I've updated the code example so that the attribute name is consistent - but I'm glad that mistake led to some improved error reporting.

I've just tried using the indexing approach that I tentatively suggested above:

tdb = tiledb.open(tdb_data_path, attr='surface_air_pressure')
points = da.from_array(tdb[0:60096, 0:219, 0:286], chunks)

Given that my Jupyter kernel keeps dying, it seems this approach tries to realise all of the indexed TileDB array contents in memory and then write them to a dask array, rather than streaming or proxying the data from the TileDB array - as I was concerned about above. And yes, this is a large volume of data, but that's sort of the point here!

DPeterK commented Mar 2, 2020

I've realised that Iris already has a solution for the problem I described just above, in the form of the data proxy classes. Data proxy classes duck-type to being array-like, but only maintain a reference to the array on disk, rather than actually loading the array into memory. The only time the array is loaded into memory is on an indexing request, when only the subset defined by the index is loaded. I've a functional implementation of this for TileDB at https://github.com/informatics-lab/tiledb_netcdf/blob/master/nctotdb/readers.py#L38.
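
Roughly, the idea is something like this (a much simplified sketch, not the linked code; the path, attribute name, shape and dtype are taken from the example at the top of this issue, and the class name is just illustrative):

import dask.array as da
import numpy as np
import tiledb

class TileDBDataProxy:
    """Array-like proxy that only keeps a reference to the on-disk array."""
    def __init__(self, uri, attr, shape, dtype):
        self.uri = uri
        self.attr = attr
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, keys):
        # data is only read from disk here, and only the requested subset
        with tiledb.open(self.uri, 'r', attr=self.attr) as A:
            return A[keys]

proxy = TileDBDataProxy('./tdb/example', 'test', (10, 10), np.int64)
lazy = da.from_array(proxy, chunks=(10, 10))   # no data loaded until computed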

jsignell commented Mar 2, 2020

Great! So are you all good? If so I'll close the issue.

DPeterK commented Mar 3, 2020

I personally am sorted, but others using dask with TileDB in cases like the one I've described here will still hit memory errors 😉 I feel this might still be worth a fix in dask itself.

jsignell commented Mar 3, 2020

Fair enough. I'll leave it open in case it helps others.

@TomAugspurger (Member)

What's the summary of the current understanding? Is it the case that

tdb = tiledb.open(tdb_data_path, attr='surface_air_pressure')
points = da.from_array(tdb[0:60096, 0:219, 0:286], chunks)

involves materializing tdb in memory? Do we know why that is? With, for example, an h5py.File object there, you wouldn't have any issues, I think.

What's the next step for Dask to take?

ihnorton commented Mar 3, 2020

Hi @TomAugspurger,

  • with regard to Peter's suggestion, slicing tdb[0:60096, 0:219, 0:286] does eagerly materialize the result as a ~28 GB in-memory ndarray. I believe h5py and most other NumPy-like objects would do the same -- e.g. see this test script using h5py.File:
import h5py
import numpy as np

data = np.zeros((60097, 220, 287), dtype=np.uint64)

with h5py.File('/tmp/data.h5', 'w') as hf:
    dset = hf.create_dataset('test', data=data)

a = h5py.File('/tmp/data.h5', 'r')

s = a['test'][0:60096, 0:219, 0:286]
print("type: ", type(s))
print("dtype: ", s.dtype)
print("shape: ", s.shape)
print("nbytes: ", s.nbytes, " MB: ", s.nbytes / 1024**2)

--- output:

type:  <class 'numpy.ndarray'>
dtype:  uint64
shape:  (60096, 219, 286)
nbytes:  30112422912  MB:  28717.4443359375
  • however, note that opening the TileDB.Array tdb object a few lines above does not materialize anything substantial into memory (just a wrapper object)

  • I am looking at adding a subset option to tiledb.array.open which would allow the array to present a smaller .shape than is saved in the underlying schema/data (restricting indexing to that shape, etc.). I will submit a PR to support this in dask, if/when it is available.


Some other notes below:

  • the tdb object is array-like and reports a .shape of (18446744073709550615, 101)
  • Regarding the MemoryError that led to Peter's proposal, based on the reported .shape, dask tries to create a very large chunk list and hits a MemoryError (see this comment). The same thing happens with Zarr or h5py. Here's an example:
import h5py
import dask.array as da

chunks = (1000,100)

shape = (9223372036854775807, 100)

with h5py.File('data.h5', 'w') as h:
    dset = h.create_dataset('/data', shape=shape, chunks=chunks, dtype='f8')

a = h5py.File('data.h5', 'r')
a_data = a['/data']

da.from_array(a_data, chunks=chunks)

Running the above script gives a MemoryError in the same place, because it is trying to create a tuple with ~10^16 elements:

chunks = (1000,100)
shape = (9223372036854775807, 100)

# dask/array/core.py:954
tuple(
        ((bd,) * (d // bd) + ((d % bd,) if d % bd else ()) if d else (0,))
        for d, bd in zip(shape, chunks)
    )

-> also MemoryError.

(note that I am using div(BigInt(typemax(UInt64))-1, 2) == 9223372036854775807 because h5py doesn't like typemax(UInt64)-1 -- I get Unable to create dataset (unable to get the next power of 2))
