Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add TileDB backend #4679

Merged
merged 2 commits into from Apr 15, 2019

Conversation

Projects
None yet
3 participants
@ihnorton
Copy link
Contributor

commented Apr 8, 2019

This is a request-for-comment and WIP PR to propose adding TileDB integration to Dask, as suggested by several dask users in this pangeo thread (cc @rabernat, @martindurant)

Status

  • Tests added / passed
  • Passes flake8 dask
  • I will do a TileDB-Py point release on conda+pip to include anything else needed upon review here.

Proposed API additions

  • dask.array.to_tiledb and dask.array.from_tiledb

Background

TileDB is an open source, multi-dimensional array management system which provides:

  • storage for dense and sparse array data with support for fast updates, compression, and encryption.
  • a storage engine written in C++ for efficient queries and provides native a C/C++ API for portability.
  • open spec on disk format
  • multi-language support: Python, R, Go, and Java bindings
  • Conda packages for the TileDB core and python bindings
  • storage backends for S3 and HDFS
  • the concurrency and consistency model is: consistent reads via timestamped snapshots, and eventually-consistent writes. Supports concurrent multiple readers and multiple writers. Supports multithreaded parallel IO from a single python process.
@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

Thanks, that looks nice and simple!

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

From the codecov, I guess the tests didn't run, and you would have to add installs to the build scripts.

@ihnorton ihnorton force-pushed the ihnorton:add_tiledb branch from b4952a7 to 1d4761f Apr 10, 2019

@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

Thanks, @martindurant. I updated the PR to include installation of tiledb-py from conda-forge.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

Anyone know why Travis and coveralls didn't run?

@ihnorton ihnorton force-pushed the ihnorton:add_tiledb branch from 1d4761f to 7ff182b Apr 10, 2019

@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

I had enabled Travis on a fork, and ran a few builds there (in order to avoid hitting dask main CI). Maybe Travis ignored the push because it saw an existing run against the hash?

I just amended the commit and pushed again -- looks like travis picked it up this time.

@martindurant
Copy link
Member

left a comment

In general, liking this PR a lot, and want to see it merged rapidly, so people can experiment with it.

Note that you should make sure that the new methods are added to the API docs.

Show resolved Hide resolved dask/array/core.py Outdated
Show resolved Hide resolved dask/array/core.py Outdated
Show resolved Hide resolved dask/array/core.py Outdated
@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2019

Thanks for the comments, I will address them shortly.

It looks like the latest build failed on python 3.5, because it is picking up a too-old package from conda-forge. We might have to x-fail 3.5 (for tiledb), if that's possible, because conda-forge is no longer building python 3.5 packages (alternatively, we do have manylinux1 packages on PyPI for 2.7,3.5-3.7, but that's probably not ideal)

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

xfailing your test on py35 is ok.

@ihnorton ihnorton force-pushed the ihnorton:add_tiledb branch from e044f94 to f254195 Apr 11, 2019

@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 12, 2019

xfailing your test on py35 is ok.

I did this for the tests, but ended up changing the examples to not be doctests (>> instead of >>>) because pytest only recently gained support for skipping inside of doctest blocks.

It looks like codecov is happy now. (will squash commits if/when applicable)

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 12, 2019

That would be a shame, as the rendering of the docstrings would be messed up in the online sphynx pages and any IDE that renders docs. You should instead add # doctest: +SKIP to the end of any line in the docstring that shouldn't be executed in the tests.

@ihnorton ihnorton force-pushed the ihnorton:add_tiledb branch from f254195 to 8db2286 Apr 12, 2019

Add TileDB storage format support
adds dask.array.to_tiledb and .from_tiledb functions to allow loading
  data to/from TileDB format/storage manager.

@ihnorton ihnorton force-pushed the ihnorton:add_tiledb branch from 8db2286 to 0133da1 Apr 12, 2019

@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 12, 2019

Updated (and squashed). Thanks! I built the docs locally and the additions here seem to be rendered correctly now.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 12, 2019

(there is generally no need to squash and force here, since we squash on merge anyway)

@martindurant
Copy link
Member

left a comment

Sorry, very small nit-picks on the docstring.
You can push a commit with [skip ci] in the message to avoid retesting.

Show resolved Hide resolved dask/array/tiledb_io.py Outdated
Show resolved Hide resolved dask/array/tiledb_io.py Outdated
@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

Updated, thanks. (happy to address any other issues)

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

The "skip ci" should have been in the first line of the commit message, sorry!
I think all is correct here, and I'll merge it by end of day if there are no further comments, once it goes green again.

@ihnorton

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

The "skip ci" should have been in the first line of the commit message

My mistake -- l didn't realize that Appveyor only supports that directive in the title.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

Looking forward to seeing some example use cases for this, and perhaps the Xarray tiledb backend using this PR, thank you.

@martindurant martindurant merged commit ec0ea09 into dask:master Apr 15, 2019

1 check passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

RFC: Add TileDB backend (dask#4679)
* Add TileDB storage format support

adds dask.array.to_tiledb and .from_tiledb functions to allow loading
  data to/from TileDB format/storage manager.

* Update doc signatures for review

[skip ci]

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

RFC: Add TileDB backend (dask#4679)
* Add TileDB storage format support

adds dask.array.to_tiledb and .from_tiledb functions to allow loading
  data to/from TileDB format/storage manager.

* Update doc signatures for review

[skip ci]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.