Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add TileDB backend #4679

Merged
merged 2 commits into from Apr 15, 2019
Merged

RFC: Add TileDB backend #4679

merged 2 commits into from Apr 15, 2019

Conversation

@ihnorton
Copy link
Contributor

@ihnorton ihnorton commented Apr 8, 2019

This is a request-for-comment and WIP PR to propose adding TileDB integration to Dask, as suggested by several dask users in this pangeo thread (cc @rabernat, @martindurant)

Status

  • Tests added / passed
  • Passes flake8 dask
  • I will do a TileDB-Py point release on conda+pip to include anything else needed upon review here.

Proposed API additions

  • dask.array.to_tiledb and dask.array.from_tiledb

Background

TileDB is an open source, multi-dimensional array management system which provides:

  • storage for dense and sparse array data with support for fast updates, compression, and encryption.
  • a storage engine written in C++ for efficient queries and provides native a C/C++ API for portability.
  • open spec on disk format
  • multi-language support: Python, R, Go, and Java bindings
  • Conda packages for the TileDB core and python bindings
  • storage backends for S3 and HDFS
  • the concurrency and consistency model is: consistent reads via timestamped snapshots, and eventually-consistent writes. Supports concurrent multiple readers and multiple writers. Supports multithreaded parallel IO from a single python process.
@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

Thanks, that looks nice and simple!

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 9, 2019

From the codecov, I guess the tests didn't run, and you would have to add installs to the build scripts.

@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 10, 2019

Thanks, @martindurant. I updated the PR to include installation of tiledb-py from conda-forge.

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 10, 2019

Anyone know why Travis and coveralls didn't run?

@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 10, 2019

I had enabled Travis on a fork, and ran a few builds there (in order to avoid hitting dask main CI). Maybe Travis ignored the push because it saw an existing run against the hash?

I just amended the commit and pushed again -- looks like travis picked it up this time.

Copy link
Member

@martindurant martindurant left a comment

In general, liking this PR a lot, and want to see it merged rapidly, so people can experiment with it.

Note that you should make sure that the new methods are added to the API docs.

dask/array/core.py Outdated Show resolved Hide resolved
dask/array/core.py Outdated Show resolved Hide resolved
dask/array/core.py Outdated Show resolved Hide resolved
@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 10, 2019

Thanks for the comments, I will address them shortly.

It looks like the latest build failed on python 3.5, because it is picking up a too-old package from conda-forge. We might have to x-fail 3.5 (for tiledb), if that's possible, because conda-forge is no longer building python 3.5 packages (alternatively, we do have manylinux1 packages on PyPI for 2.7,3.5-3.7, but that's probably not ideal)

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 10, 2019

xfailing your test on py35 is ok.

@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 12, 2019

xfailing your test on py35 is ok.

I did this for the tests, but ended up changing the examples to not be doctests (>> instead of >>>) because pytest only recently gained support for skipping inside of doctest blocks.

It looks like codecov is happy now. (will squash commits if/when applicable)

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 12, 2019

That would be a shame, as the rendering of the docstrings would be messed up in the online sphynx pages and any IDE that renders docs. You should instead add # doctest: +SKIP to the end of any line in the docstring that shouldn't be executed in the tests.

adds dask.array.to_tiledb and .from_tiledb functions to allow loading
  data to/from TileDB format/storage manager.
@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 12, 2019

Updated (and squashed). Thanks! I built the docs locally and the additions here seem to be rendered correctly now.

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 12, 2019

(there is generally no need to squash and force here, since we squash on merge anyway)

Copy link
Member

@martindurant martindurant left a comment

Sorry, very small nit-picks on the docstring.
You can push a commit with [skip ci] in the message to avoid retesting.

dask/array/tiledb_io.py Outdated Show resolved Hide resolved
dask/array/tiledb_io.py Outdated Show resolved Hide resolved
@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 15, 2019

Updated, thanks. (happy to address any other issues)

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 15, 2019

The "skip ci" should have been in the first line of the commit message, sorry!
I think all is correct here, and I'll merge it by end of day if there are no further comments, once it goes green again.

@ihnorton
Copy link
Contributor Author

@ihnorton ihnorton commented Apr 15, 2019

The "skip ci" should have been in the first line of the commit message

My mistake -- l didn't realize that Appveyor only supports that directive in the title.

@martindurant
Copy link
Member

@martindurant martindurant commented Apr 15, 2019

Looking forward to seeing some example use cases for this, and perhaps the Xarray tiledb backend using this PR, thank you.

@martindurant martindurant merged commit ec0ea09 into dask:master Apr 15, 2019
1 check passed
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
* Add TileDB storage format support

adds dask.array.to_tiledb and .from_tiledb functions to allow loading
  data to/from TileDB format/storage manager.

* Update doc signatures for review

[skip ci]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants