Support using pyarrow for hdfs #3123

Merged
jcrist merged 16 commits into dask:master from jcrist:pyarrow-hdfs__TEST_HDFS__
Feb 1, 2018

Conversation

@jcrist
Member

@jcrist jcrist commented Jan 31, 2018

Adds support for using pyarrow instead of hdfs3 for hdfs integration. By default the first installed library in [hdfs3, pyarrow] is used. To explicitly set which driver to use, users can set hdfs_driver with dask.set_options:

# Use pyarrow for hdfs integration
with dask.set_options(hdfs_driver='pyarrow'):
    df = dd.read_csv('hdfs:///path/to/*.csv')

Since a user is unlikely to want to use both at the same time, this seemed like the cleanest way to configure it.

Summary of changes:

  • Update dask.bytes to support both hdfs driver options
  • Add a hdfs filesystem using pyarrow. Note that this requires pyarrow development version due to several bugs in the latest release.
  • Update the tests to work with both/either driver installed.
  • Add a glob implementation for the pyarrow implementation. This was copied (and heavily modified) from the version in the standard library. As such I've copied over the relevant license, mirroring how this was done in distributed/threadpoolexecutor.py.
  • Update the test infrastructure to support libhdfs and libhdfs3 concurrently.
  • Update the relevant documentation.

Fixes #3046.
Fixes #1880.

@jcrist
Member Author

jcrist commented Jan 31, 2018

cc @mrocklin, @martindurant for review.

    PYARROW_DRIVER = LooseVersion(pyarrow.__version__) >= _MIN_PYARROW_VERSION_SUPPORTED
except ImportError:
    PYARROW_DRIVER = False
    pyarrow = None
Member

I like this approach
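The gate quoted above is the usual optional-dependency pattern: attempt the import, check the version, and fall back cleanly. A minimal sketch of the same idea, where `MIN_VERSION` and `parse_version` are illustrative stand-ins for dask's `_MIN_PYARROW_VERSION_SUPPORTED` and `LooseVersion`:

```python
MIN_VERSION = (0, 8, 1)  # stand-in for _MIN_PYARROW_VERSION_SUPPORTED

def parse_version(version):
    # Compare only the leading numeric components, so a dev release
    # like '0.8.1.dev81' parses as (0, 8, 1).
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)

try:
    import pyarrow
    PYARROW_DRIVER = parse_version(pyarrow.__version__) >= MIN_VERSION
except ImportError:
    # pyarrow not installed: the driver is simply unavailable
    pyarrow = None
    PYARROW_DRIVER = False
```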

hdfs.rm(basedir, recursive=True)
hdfs.mkdir(basedir)

yield HDFSDriver(hdfs, request.param)
Member

Maybe instead do the following?

with dask.set_options(hdfs_driver=request.param):
    yield hdfs

This would centralize the dask.set_options code and remove the need for the HDFSDriver class.

(I may be missing something though)

Member

Generally agree that it would be nice to have this match actual usage as accurately as possible.
Would it work to replace the driver.fs.open calls below with open_files? Then each test can run in one context.

Member Author

I went with Matt's approach, this seemed cleaner and we already test that the set_options bit results in the correct driver.

@mrocklin
Member

mrocklin commented Feb 1, 2018

cc @wesm in case he's interested

@@ -1,2 +1,2 @@
-docker exec -it $CONTAINER_ID conda install -y -q dask hdfs3 pyarrow -c conda-forge
+docker exec -it $CONTAINER_ID conda install -y -q dask hdfs3 pyarrow -c twosigma -c conda-forge
Member

Does twosigma need to be ordered first here? I guess this will go away with time.

Member Author

Yes. We want pyarrow from the nightly builds (for now). Conda channels are prioritized in the order given.

import pyarrow as pa


class HDFS3Wrapper(pa.filesystem.DaskFileSystem):
Member

This is for arrow's libhdfs3?
I know this is just moved from above, but it's not obvious to me what it's doing, while the class below seems to have all the required behaviour.

Member Author

This is for wrapping hdfs3's filesystem to be used inside pyarrow. I added a comment that should clarify this.
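The wrapper's job is plain delegation: present an hdfs3 client through the filesystem interface pyarrow expects (the real class subclasses pa.filesystem.DaskFileSystem, a pyarrow-0.8-era API). A dependency-free sketch of the pattern; the method names and the `info(...)['kind']` protocol are modeled on hdfs3, not copied from dask:

```python
class HDFS3Wrapper:
    """Sketch of the delegation pattern: forward filesystem calls to a
    wrapped hdfs3-style client so pyarrow can consume it. The real
    class subclasses pyarrow's DaskFileSystem."""

    def __init__(self, fs):
        self.fs = fs  # the underlying hdfs3-style client

    def isdir(self, path):
        return self.fs.info(path)["kind"] == "directory"

    def isfile(self, path):
        return self.fs.info(path)["kind"] == "file"

    def open(self, path, mode="rb"):
        # Pass straight through to the wrapped client
        return self.fs.open(path, mode=mode)
```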

        return sorted(_glob(self.fs, path))

    def mkdirs(self, path):
        return self.fs.mkdir(path)
Member

This will fail if the parent directory does not exist, I think - unlike the usual understanding of mkdirs.

Member Author

libhdfs always makes intermediate directories. I added an (ignored) keyword that arrow supports that should better document the intention here.

# Copyright 2001-2018 Python Software Foundation; All Rights Reserved


def _glob(fs, pathname):
Member

Do you think it's useful to export glob for other file-systems?
(I know there are issues on the glob implementation elsewhere)
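A filesystem-generic glob can indeed be factored out: the core idea is matching a wildcard path component against a directory listing with fnmatch. The single-component sketch below illustrates this (the real implementation recurses over nested wildcards); `simple_glob` and the `fs.ls` protocol here are assumptions for illustration, not dask's API:

```python
import fnmatch
import posixpath

def simple_glob(fs, pattern):
    # Match only the final path component against the listing of its
    # parent directory; nested wildcards would require recursing over
    # each component, as the stdlib glob does.
    dirname, basename = posixpath.split(pattern)
    return sorted(
        path
        for path in fs.ls(dirname)
        if fnmatch.fnmatch(posixpath.basename(path), basename)
    )
```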


assert len(ddf2) == 1000 # smoke test on read


def test_pyarrow_glob(pa_hdfs):
Member

Here you show how the globs are different, but we would like them to be the same. Perhaps a test across engines showing that at least simpler globs are the same?

Member Author

I moved the glob implementation out into its own file, and applied it to hdfs3 as well (which had inconsistent behavior). Modified the test to test both drivers.

.. _s3fs: http://s3fs.readthedocs.io/
.. .. _azure-data-lake-store-python: https://github.com/Azure/azure-data-lake-store-python
-.. _gcsfs: https://github.com/martindurant/gcsfs/
+.. _gcsfs: https://github.com/dask/gcsfs/
Member

good catch

- ``ticket_cache``, ``token``: kerberos authentication
- ``pars``: dictionary of further parameters (e.g., for `high availability`_)

The ``hdfs3`` driver also relies on a few environment variables. For
Member

The hdfs3 driver configuration can also be affected by a few environment variables.

.. _gcsfs: https://github.com/martindurant/gcsfs/
.. _gcloud: https://cloud.google.com/sdk/docs/

At the time of writing, ``gcsfs.GCSFileSystem`` instances pickle including the auth token, so sensitive
Member

This paragraph and the next are outdated and can be removed.

#
# These functions are under copyright by the Python Software Foundation
#
# Copyright 2001-2018 Python Software Foundation; All Rights Reserved
Member Author

It's not clear to me if this is needed (added it just to be safe). The functions below started from a copy-paste from the standard library, but they've been simplified (remove behavior/options we don't need) and heavily modified (support generic filesystems, better function names, cleaner implementation logic, remove duplicate branches, ...) such that they're probably unrecognizable compared to the original. I suppose this still is a derivative work though.

@jcrist
Member Author

jcrist commented Feb 1, 2018

Thanks for the review, I believe all comments have been addressed.

@martindurant
Member

On a quick glance, this looks OK.
You reworked the test fixture because of glob? Well, it probably looks simpler now.

Member

@mrocklin mrocklin left a comment

To the extent that I am able to judge, this seems fine.

I left a few small comments.

echo 'deb-src http://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh xenial-cdh5 contrib' >> /etc/apt/sources.list.d/cloudera.list && \
apt-get update && \
-apt-get install -y -q openjdk-7-jre-headless hadoop-conf-pseudo && \
+apt-get install -y -q sudo openjdk-8-jre-headless hadoop-conf-pseudo libhdfs0 && \
Member

Why is sudo listed here?

Member Author

The new ubuntu docker base image doesn't provide sudo by default anymore, while it is needed (afaict) to set up hdfs on docker.

    def from_hdfs3(cls, fs):
        out = object.__new__(cls)
        out.fs = fs
        return out
Member

Why do we have this?

Member Author

Testing purposes. Wrapping an existing hdfs client in the dask wrapper, rather than creating one from scratch.
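The object.__new__ trick in the hunk above sidesteps the normal constructor (which would open a new connection) and simply attaches an existing client. A generic sketch of the pattern; the class and method names here are illustrative, not dask's:

```python
class ClientWrapper:
    def __init__(self, host):
        # Normal construction path: would connect to a live service,
        # which is expensive (and impossible in this sketch).
        raise RuntimeError("no cluster available in this sketch")

    @classmethod
    def from_existing(cls, fs):
        # Bypass __init__ entirely and attach an already-created
        # client, exactly as from_hdfs3 does above.
        out = object.__new__(cls)
        out.fs = fs
        return out
```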

        return self.fs.mkdir(path, create_parents=True)

    def ukey(self, path):
        return tokenize(path, self.fs.info(path)['last_modified'])
Member

Just checking, but does HDFS already offer content hashing? It may.

Member

$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914

Member Author

Afaict it does not. The checksums seem to be for robustness checking (you can turn on/off whether checksums are verified on read), but I don't think it exposes them.

Member

OK, then I retract the comment
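The ukey hunk above captures the conclusion of this exchange: since these clients do not expose HDFS content checksums, the cache key is derived from the path plus modification time instead. A sketch of the idea, with a plain hashlib digest standing in for dask's tokenize:

```python
import hashlib

def ukey(path, last_modified):
    # Hypothetical stand-in for tokenize(path, last_modified): hash the
    # path together with the modification time, so the key changes
    # whenever the file is rewritten.
    return hashlib.md5(f"{path}:{last_modified}".encode()).hexdigest()
```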

f.write('a b\nc d'.encode())

b = db.read_text('hdfs://%s/text.*.txt' % basedir)
result = b.str.strip().str.split().map(len).compute(get=dask.get)
Member

There have been multiprocessing issues with hdfs3 in the past. It might be wise to leave these with the multiprocessing scheduler.

Member Author

Fixed. Looks like there's fork-safety issues with libhdfs (or pyarrow's wrapper of it, not sure). Will file an issue. Testing using the spawn context works fine though, and mimics how things would work with the distributed scheduler.

jcrist added 15 commits February 1, 2018 16:55
- Only support recent Pyarrow version with patches pushed upstream
- Add tests for glob
- Add psf license for glob functionality
Move the glob code out of pyarrow module, and apply it to hdfs3 driver
as well (due to inconsistent behavior between hdfs3 and other glob
implementations). Test that hdfs3 and pyarrow glob matches.
@jcrist jcrist force-pushed the pyarrow-hdfs__TEST_HDFS__ branch from 0780795 to 9f10da8 Compare February 1, 2018 22:58
@mrocklin
Member

mrocklin commented Feb 1, 2018 via email

@jcrist
Member Author

jcrist commented Feb 1, 2018

You might consider doing a test with Client() to simulate normal distributed operation.

Done.

@jcrist
Member Author

jcrist commented Feb 1, 2018

Thanks for the review all. Merging.

@jcrist jcrist merged commit ae5e1d5 into dask:master Feb 1, 2018
@jcrist jcrist deleted the pyarrow-hdfs__TEST_HDFS__ branch February 1, 2018 23:59
@wesm
Contributor

wesm commented Feb 9, 2018

Looking at this late, but thank you for doing this! Some users will appreciate being able to use libhdfs -- it might bear mentioning in the release notes that this allows the official HDFS Java client libraries to be used.

I didn't dig in far enough -- if we used the Arrow Parquet support, would the underlying hdfs client handle (faster) be passed down to pyarrow.parquet.read_parquet or would a wrapper object be passed (slower)?

@jcrist
Member Author

jcrist commented Feb 9, 2018

Some users will appreciate being able to use libhdfs

Heh, I was one of those users. libhdfs3 doesn't support at-rest encryption, so we needed support for libhdfs.

I didn't dig in far enough -- if we used the Arrow Parquet support, would the underlying hdfs client handle (faster) be passed down to pyarrow.parquet.read_parquet or would a wrapper object be passed (slower)?

Correct. The hdfs3 driver will use a wrapper, but the pyarrow driver will pass the pyarrow hdfs filesystem directly.

@alex959595

alex959595 commented Mar 2, 2018

When I run the code mentioned above,

with dask.set_options(hdfs_driver='pyarrow'):
    df = dd.read_csv('hdfs:///path/to/*.csv')

I get this error

RuntimeError: pyarrow version >= '0.8.1.dev81' required for hdfs driver support

The latest release of pyarrow that I was able to find was pyarrow 0.8.0

I was wondering if someone could point me in the right direction of how to get later versions.
Thanks

@jcrist @mrocklin

@wesm
Contributor

wesm commented Mar 2, 2018

You can install nightlies on Linux with conda install pyarrow -c twosigma. The 0.9.0 release will hopefully be out by mid-March

@alex959595

Thanks for the quick response. I tried to install it, but I'm running into an error on the import.

from pyarrow.lib import cpu_count, set_cpu_count
ImportError: /home/awatson/anaconda3/lib/python3.6/site-packages/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5arrow2py17ConvertPySequenceEP7_objectRKSt10shared_ptrINS_8DataTypeEEPNS_10MemoryPoolEPS3_INS_5ArrayEE

@wesm
Contributor

wesm commented Mar 2, 2018

@alex959595 can you show the installed arrow-cpp, parquet-cpp, and pyarrow versions? It looks like there's a problem with the nightlies. cc @cpcloud

@alex959595

@wesm arrow-cpp 0.8.0, parquet 1.4.0.pre, pyarrow 0.8.0+151.nightly

@wesm
Contributor

wesm commented Mar 5, 2018

OK, from the look of https://anaconda.org/twosigma/pyarrow/files the nightly version numbers are messed up after the 0.3.0 JS release tag went out -- @cpcloud can you have a look at what's wrong? Must be related to the issue fixed in apache/arrow@55bdae5

@wesm
Contributor

wesm commented Mar 5, 2018

I think @cpcloud is unavailable today and tomorrow so this may have to wait to get fixed until later in the week

@alex959595

@wesm Thanks for your quick responses, and sounds good!
