This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Accessing HDFS without the need of JVM #171

Closed
DonDebonair opened this issue Nov 23, 2018 · 5 comments

Comments

@DonDebonair

Hi all,

It says in the README:

Pyarrow's JNI hdfs interface is mature and stable. It also has fewer problems with configuration and various security settings, and does not require the complex build process of libhdfs3. Therefore, all users who have trouble with hdfs3 are recommended to try pyarrow.

This means that you're ignoring an important and obvious use case: accessing HDFS without the JVM. I think this was one of the main reasons hdfs3 was created in the first place. PyArrow's hdfs functionality doesn't solve this because it requires the JVM and all Hadoop jars to be present, which is especially inconvenient if you have a Python app that you want to run inside a Docker container, and connect to an HDFS cluster from there. I have no idea how to do that with PyArrow.

What are your thoughts on that?

@martindurant
Member

You are, of course, correct. However, it turned out that libhdfs3 was very difficult to get right with regard to the myriad Hadoop security settings. Since most HDFS users seem to access it from within the cluster, especially when doing parallel work with Dask, it seemed better to let the Java stack handle that side of things.

For access from outside the cluster, I would recommend either sticking with the old hdfs3 (which should still work), or, more likely, the more straightforward webhdfs. The corresponding python libraries don't look particularly complete, but it wouldn't take much effort to build one out, especially given the generic code in fsspec.
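For context, talking to WebHDFS doesn't strictly need any client library at all: it is a plain REST API at `/webhdfs/v1/<path>?op=<OP>` on the NameNode. Here is a minimal standard-library sketch of building such request URLs; the host, user, and default port (9870 on Hadoop 3, 50070 on Hadoop 2) are placeholders, not anything from this thread.

```python
from urllib.parse import urlencode


def webhdfs_url(host, path, op, port=9870, user=None, **params):
    """Build a WebHDFS REST URL for an operation such as OPEN or LISTSTATUS.

    The WebHDFS endpoint lives at http://<namenode>:<port>/webhdfs/v1/<path>,
    with the operation and any extra arguments passed as query parameters.
    """
    query = {"op": op}
    if user:
        # Simple (non-kerberized) auth passes the username in the query string.
        query["user.name"] = user
    query.update(params)
    return f"http://{host}:{port}/webhdfs/v1{path}?{urlencode(query)}"


# Reading a file is then just an HTTP GET (with urllib, requests, etc.):
url = webhdfs_url("namenode.example.com", "/data/file.csv", "OPEN", user="alice")
```

A library built on fsspec would essentially wrap calls like this in the generic filesystem interface.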

@martindurant
Member

Alternatively, maintainers for libhdfs3 would be most welcome! It is a project which has been reborn and abandoned multiple times.

@DonDebonair
Author

Thanks for your quick reply! I will look into the solutions you mentioned and figure out the right approach for my projects. WebHDFS seems the most straightforward, especially since I'm not looking to move massive amounts of data.
Maintaining libhdfs3 is sadly not something I can commit to :(

@martindurant
Member

Note that webHDFS will need to be enabled for your HDFS system, and you may require kerberos authentication (not too likely), which requests-kerberos should handle for you.
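(Enabling webHDFS, if it isn't already on, is typically a one-line change in `hdfs-site.xml` on the cluster side; the property below is the standard Hadoop setting, shown here only as a sketch.)

```xml
<!-- hdfs-site.xml: expose the WebHDFS REST API on the NameNode/DataNodes -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```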

@DonDebonair
Author

We do have a Kerberized cluster, so thanks for the pointer!
