Conversation
cc @mrocklin, @martindurant for review.
```python
    PYARROW_DRIVER = LooseVersion(pyarrow.__version__) >= _MIN_PYARROW_VERSION_SUPPORTED
except ImportError:
    PYARROW_DRIVER = False
    pyarrow = None
```
dask/bytes/tests/test_hdfs.py (outdated diff):

```python
hdfs.rm(basedir, recursive=True)
hdfs.mkdir(basedir)

yield HDFSDriver(hdfs, request.param)
```
Maybe instead do the following?

```python
with dask.set_options(hdfs_driver=request.param):
    yield hdfs
```

This would centralize the dask.set_options code and remove the need for the HDFSDriver class.
(I may be missing something though)
Generally agree that it would be nice to have this as accurately match usage as possible.
Would it work to replace the driver.fs.open calls below with open_files? Then each test can run in one context.
I went with Matt's approach, this seemed cleaner and we already test that the set_options bit results in the correct driver.
cc @wesm in case he's interested
```diff
@@ -1,2 +1,2 @@
-docker exec -it $CONTAINER_ID conda install -y -q dask hdfs3 pyarrow -c conda-forge
+docker exec -it $CONTAINER_ID conda install -y -q dask hdfs3 pyarrow -c twosigma -c conda-forge
```
Does twosigma need to be ordered first here? I guess this will go away with time.
Yes. We want pyarrow from the nightly builds (for now). Conda channels are prioritized in the order given.
```python
import pyarrow as pa


class HDFS3Wrapper(pa.filesystem.DaskFileSystem):
```
This is for arrow's libhdfs3?
I know this is just moved from above, but it's not obvious to me what it's doing, while the class below seems to have all the required behaviour.
This is for wrapping hdfs3's filesystem to be used inside pyarrow. I added a comment that should clarify this.
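As a rough illustration of the wrapping, the adapter shape looks like the sketch below. The real class subclasses `pa.filesystem.DaskFileSystem` and implements many more methods; the `FakeHDFS3` client and the method bodies here are illustrative stand-ins only.

```python
class HDFS3Wrapper:
    """Adapt an hdfs3-style client to the interface pyarrow expects.

    Sketch only: the real wrapper subclasses pa.filesystem.DaskFileSystem
    so that pyarrow's readers can call isdir/isfile/open on an hdfs3
    client; this stand-in shows just the delegation shape.
    """
    def __init__(self, fs):
        self.fs = fs

    def isdir(self, path):
        return self.fs.info(path)["kind"] == "directory"

    def isfile(self, path):
        return self.fs.info(path)["kind"] == "file"

    def open(self, path, mode="rb"):
        return self.fs.open(path, mode=mode)

# Exercise the adapter with a tiny fake hdfs3-like client:
class FakeHDFS3:
    def info(self, path):
        # hdfs3's info() returns a dict with a 'kind' entry, among others.
        return {"kind": "directory" if path.endswith("/") else "file"}
    def open(self, path, mode="rb"):
        return ("opened", path, mode)

w = HDFS3Wrapper(FakeHDFS3())
print(w.isdir("/data/"))       # True
print(w.isfile("/data/x.parq"))  # True
```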
dask/bytes/pyarrow.py (outdated diff):

```python
return sorted(_glob(self.fs, path))


def mkdirs(self, path):
    return self.fs.mkdir(path)
```
This will fail if the parent directory does not exist, I think - unlike the usual understanding of mkdirs.
libhdfs always makes intermediate directories. I added an (ignored) keyword that arrow supports that should better document the intention here.
dask/bytes/pyarrow.py (outdated diff):

```python
# Copyright 2001-2018 Python Software Foundation; All Rights Reserved


def _glob(fs, pathname):
```
Do you think it's useful to export glob for other file-systems?
(I know there are issues on the glob implementation elsewhere)
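A generic glob only needs a minimal filesystem interface, which is what makes exporting it to other file-systems plausible. Here is a toy sketch handling single-directory patterns only; `fs_glob` and `DictFS` are illustrative, and the PR's `_glob` does considerably more (recursive walking, escaping).

```python
import fnmatch
import posixpath

def fs_glob(fs, pattern):
    """Expand a glob pattern against any filesystem exposing ls(path).

    Illustrative sketch: handles one pattern component in the basename only.
    """
    dirname, basename = posixpath.split(pattern)
    names = fs.ls(dirname)
    return sorted(n for n in names
                  if fnmatch.fnmatch(posixpath.basename(n), basename))

class DictFS:
    # Toy filesystem backed by a dict of directory listings.
    def __init__(self, listing):
        self.listing = listing
    def ls(self, path):
        return self.listing[path]

fs = DictFS({"/data": ["/data/a.csv", "/data/b.csv", "/data/notes.txt"]})
print(fs_glob(fs, "/data/*.csv"))  # ['/data/a.csv', '/data/b.csv']
```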
dask/bytes/tests/test_hdfs.py (outdated diff):

```python
assert len(ddf2) == 1000  # smoke test on read


def test_pyarrow_glob(pa_hdfs):
```
Here you show how the globs are different, but we would like them to be the same. Perhaps a test across engines showing that at least simpler globs are the same?
I moved the glob implementation out into its own file, and applied it to hdfs3 as well (which had inconsistent behavior). Modified the test to test both drivers.
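The cross-driver check reduces to asserting that two independent glob implementations agree on the same patterns. Below is a self-contained stand-in: the two functions play the roles of the hdfs3 and pyarrow drivers' glob, and the file list is made up for illustration.

```python
import fnmatch

# A fixed listing standing in for files on HDFS.
FILES = ["/tmp/a.csv", "/tmp/b.csv", "/tmp/a.txt"]

def glob_a(pattern):
    # First "driver": straightforward fnmatch filter.
    return sorted(f for f in FILES if fnmatch.fnmatch(f, pattern))

def glob_b(pattern):
    # Second "driver": deliberately written differently, same semantics.
    return sorted(filter(lambda f: fnmatch.fnmatchcase(f, pattern), FILES))

def test_glob_consistency():
    # The shape of the cross-driver test: both implementations must agree
    # on at least the simple patterns.
    for pattern in ["/tmp/*.csv", "/tmp/a.*", "/tmp/*"]:
        assert glob_a(pattern) == glob_b(pattern)

test_glob_consistency()
print("globs agree")
```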
```diff
 .. _s3fs: http://s3fs.readthedocs.io/
 .. .. _azure-data-lake-store-python: https://github.com/Azure/azure-data-lake-store-python
-.. _gcsfs: https://github.com/martindurant/gcsfs/
+.. _gcsfs: https://github.com/dask/gcsfs/
```
docs/source/remote-data-services.rst (outdated diff):

```rst
- ``ticket_cache``, ``token``: kerberos authentication
- ``pars``: dictionary of further parameters (e.g., for `high availability`_)

The ``hdfs3`` driver also relies on a few environment variables. For
```
Suggested wording: "The ``hdfs3`` driver configuration can also be affected by a few environment variables."
docs/source/remote-data-services.rst (outdated diff):

```rst
.. _gcsfs: https://github.com/martindurant/gcsfs/
.. _gcloud: https://cloud.google.com/sdk/docs/

At the time of writing, ``gcsfs.GCSFileSystem`` instances pickle including the auth token, so sensitive
```
This paragraph and the next are outdated and can be removed.
```python
#
# These functions are under copyright by the Python Software Foundation
#
# Copyright 2001-2018 Python Software Foundation; All Rights Reserved
```
It's not clear to me if this is needed (added it just to be safe). The functions below started from a copy-paste from the standard library, but they've been simplified (remove behavior/options we don't need) and heavily modified (support generic filesystems, better function names, cleaner implementation logic, remove duplicate branches, ...) such that they're probably unrecognizable compared to the original. I suppose this still is a derivative work though.
Thanks for the review, I believe all comments have been addressed.

On a quick glance, this looks OK.
mrocklin left a comment:

To the extent that I am able to judge, this seems fine. I left a few small comments.
```diff
 echo 'deb-src http://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh xenial-cdh5 contrib' >> /etc/apt/sources.list.d/cloudera.list && \
 apt-get update && \
-apt-get install -y -q openjdk-7-jre-headless hadoop-conf-pseudo && \
+apt-get install -y -q sudo openjdk-8-jre-headless hadoop-conf-pseudo libhdfs0 && \
```
The new ubuntu docker base image doesn't provide sudo by default anymore, while it is needed (afaict) to set up hdfs on docker.
```python
def from_hdfs3(cls, fs):
    out = object.__new__(cls)
    out.fs = fs
    return out
```
For testing purposes: this wraps an existing hdfs client in the dask wrapper, rather than creating one from scratch.
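The pattern in that snippet is an alternate constructor that bypasses `__init__` via `object.__new__`. A self-contained sketch of the idea, with a hypothetical class shape and `connect` call standing in for the real wrapper:

```python
class HDFSWrapper:
    """Sketch of the alternate-constructor pattern used by from_hdfs3."""

    def __init__(self, host="localhost", port=8020):
        # Hypothetical expensive connection made by the normal constructor.
        self.fs = connect(host, port)  # noqa: F821 (never called below)

    @classmethod
    def from_hdfs3(cls, fs):
        # object.__new__ skips __init__ entirely, so no connection is
        # attempted; tests can inject an already-connected client.
        out = object.__new__(cls)
        out.fs = fs
        return out

class FakeClient:
    pass

w = HDFSWrapper.from_hdfs3(FakeClient())
print(type(w.fs).__name__)  # FakeClient
```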
```python
return self.fs.mkdir(path, create_parents=True)


def ukey(self, path):
    return tokenize(path, self.fs.info(path)['last_modified'])
```
Just checking, but does HDFS already offer content hashing? It may.
```shell
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
```
Afaict it does not. The checksums seem to be for robustness checking (you can turn on/off whether checksums are verified on read), but I don't think it exposes them.
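Since HDFS doesn't expose content hashes to clients, the `ukey` above derives one from metadata instead. A minimal sketch of the same idea, using `hashlib` in place of `dask.base.tokenize`; the `info` dict shape is assumed from hdfs3's `info()`:

```python
import hashlib

def ukey(path, info):
    """Unique key for cache invalidation: hash the path plus mtime.

    Mirrors tokenize(path, info['last_modified']) from the snippet above,
    with hashlib standing in for dask.base.tokenize.
    """
    raw = repr((path, info["last_modified"])).encode()
    return hashlib.md5(raw).hexdigest()

info = {"last_modified": 1517500000}
k1 = ukey("/data/x.csv", info)
k2 = ukey("/data/x.csv", {"last_modified": 1517500001})
print(k1 != k2)  # True: the key changes whenever the file is modified
```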
dask/bytes/tests/test_hdfs.py (outdated diff):

```python
f.write('a b\nc d'.encode())

b = db.read_text('hdfs://%s/text.*.txt' % basedir)
result = b.str.strip().str.split().map(len).compute(get=dask.get)
```
There have been multiprocessing issues with hdfs3 in the past. It might be wise to leave these with the multiprocessing scheduler.
Fixed. Looks like there's fork-safety issues with libhdfs (or pyarrow's wrapper of it, not sure). Will file an issue. Testing using the spawn context works fine though, and mimics how things would work with the distributed scheduler.
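The distinction at issue: `fork` copies the parent's native-library state (such as an open libhdfs/JVM handle) into workers, while `spawn` and `forkserver` start each worker from a fresh interpreter. A minimal sketch of selecting the spawn context, with no real HDFS involved:

```python
import multiprocessing as mp

# "spawn" starts each worker from a fresh interpreter, so native-library
# state in the parent is never inherited through fork(); this mimics how
# worker processes come up under the distributed scheduler.
assert "spawn" in mp.get_all_start_methods()
ctx = mp.get_context("spawn")

# A pool built from this context is used like a normal one, e.g.
#     with ctx.Pool(2) as pool:
#         pool.map(read_one_block, paths)
# where read_one_block is a hypothetical worker function that opens its
# own HDFS connection inside each process.
print(type(ctx).__name__)
```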
Commits:

- Only support recent pyarrow version with patches pushed upstream
- Add tests for glob
- Add PSF license for glob functionality
- Move the glob code out of the pyarrow module, and apply it to the hdfs3 driver as well (due to inconsistent behavior between hdfs3 and other glob implementations). Test that hdfs3 and pyarrow globs match.
To be clear, we use forkserver, which spawns a new process and then forks cleanly from that. You might consider doing a test with Client() to simulate normal distributed operation.
Done.

Thanks for the review all. Merging.
Looking at this late, but thank you for doing this! Some users will appreciate being able to use libhdfs -- it might bear mentioning in the release notes that this allows the official HDFS Java client libraries to be used. I didn't dig in far enough -- if we used the Arrow Parquet support, would the underlying hdfs client handle (faster) be passed down to
Heh, I was one of those users.

Correct. The
When I run the code mentioned above, I get this error
The latest release of pyarrow that I was able to find was pyarrow 0.8.0. I was wondering if someone could point me in the right direction of how to get later versions.
You can install nightlies on Linux with
Thanks for the quick response. I tried to install, but I'm running into an error on the import.
@alex959595 can you show the installed arrow-cpp, parquet-cpp, and pyarrow versions? It looks like there's a problem with the nightlies. cc @cpcloud

@wesm arrow-cpp 0.8.0, parquet 1.4.0.pre, pyarrow 0.8.0+151.nightly

OK, from the look of https://anaconda.org/twosigma/pyarrow/files the nightly version numbers are messed up after the 0.3.0 JS release tag went out -- @cpcloud can you have a look at what's wrong? Must be related to the issue fixed in apache/arrow@55bdae5

I think @cpcloud is unavailable today and tomorrow so this may have to wait to get fixed until later in the week.

@wesm Thanks for your quick responses, and sounds good!
Adds support for using `pyarrow` instead of `hdfs3` for hdfs integration. By default the first installed library in [`hdfs3`, `pyarrow`] is used. To explicitly set which driver to use, users can set `hdfs_driver` with `dask.set_options`. Since a user is unlikely to want to use both at the same time, this seemed like the cleanest way to configure.
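The default-selection rule described above ("first installed of hdfs3, pyarrow") can be sketched as follows. `get_hdfs_driver` is a hypothetical helper, and the final call uses stdlib module names purely to demonstrate the first-installed-wins behaviour:

```python
import importlib

def get_hdfs_driver(preference=("hdfs3", "pyarrow")):
    # Return the first importable library from the preference list.
    for name in preference:
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        return name
    raise RuntimeError("no hdfs driver installed")

# Demonstrate "first installed wins" with stdlib stand-ins: the first name
# does not exist, so the second (json, always present) is selected.
print(get_hdfs_driver(("nonexistent_module", "json")))  # json
```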
Summary of changes:
- Refactored `dask.bytes` to support both hdfs driver options
- Implemented an `hdfs` filesystem using `pyarrow`. Note that this requires a pyarrow development version due to several bugs in the latest release.
- Added a glob implementation for the `pyarrow` implementation. This was copied (and heavily modified) from the version in the standard library. As such I've copied over the relevant license, mirroring how this was done in `distributed/threadpoolexecutor.py`.

Fixes #3046.
Fixes #1880.