This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Accessing HDFS without the need of JVM #171

Closed
DonDebonair opened this issue Nov 23, 2018 · 5 comments

Comments

@DonDebonair

Hi all,

It says in the README:

Pyarrow's JNI hdfs interface is mature and stable. It also has fewer problems with configuration and various security settings, and does not require the complex build process of libhdfs3. Therefore, all users who have trouble with hdfs3 are recommended to try pyarrow.

This means that you're ignoring an important and obvious use case: accessing HDFS without the JVM. I think this was one of the main reasons hdfs3 was created in the first place. PyArrow's hdfs functionality doesn't solve this because it requires the JVM and all Hadoop jars to be present, which is especially inconvenient if you have a Python app that you want to run inside a Docker container, and connect to an HDFS cluster from there. I have no idea how to do that with PyArrow.

What are your thoughts on that?

@martindurant
Member

You are, of course, correct. However, it turned out that libhdfs3 was very difficult to get right with regard to the myriad Hadoop security settings. Since most HDFS users seem to access it from within the cluster, especially when doing parallel work with Dask, it seemed better to let the Java stack handle that side of things.

For access from outside the cluster, I would recommend either sticking with the old hdfs3 (which should still work), or, more likely, the more straightforward webhdfs. The corresponding python libraries don't look particularly complete, but it wouldn't take much effort to build one out, especially given the generic code in fsspec.
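For context, talking to WebHDFS doesn't strictly need any client library at all: it is a plain REST API at `/webhdfs/v1/<path>?op=<OP>` on the NameNode. Here is a minimal standard-library sketch of building such request URLs; the host, user, and default port (9870 on Hadoop 3, 50070 on Hadoop 2) are placeholders, not anything from this thread.

```python
from urllib.parse import urlencode


def webhdfs_url(host, path, op, port=9870, user=None, **params):
    """Build a WebHDFS REST URL for an operation such as OPEN or LISTSTATUS.

    The WebHDFS endpoint lives at http://<namenode>:<port>/webhdfs/v1/<path>,
    with the operation and any extra arguments passed as query parameters.
    """
    query = {"op": op}
    if user:
        # Simple (non-kerberized) auth passes the username in the query string.
        query["user.name"] = user
    query.update(params)
    return f"http://{host}:{port}/webhdfs/v1{path}?{urlencode(query)}"


# Reading a file is then just an HTTP GET (with urllib, requests, etc.):
url = webhdfs_url("namenode.example.com", "/data/file.csv", "OPEN", user="alice")
```

A library built on fsspec would essentially wrap calls like this in the generic filesystem interface.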

@martindurant
Member

Alternatively, maintainers for libhdfs3 would be most welcome! It is a project which has been reborn and abandoned multiple times.

@DonDebonair
Author

Thanks for your quick reply! I will look into the solutions you mentioned and figure out the right approach for my projects. WebHDFS seems the most straightforward, especially since I'm not looking to move massive amounts of data.
Maintaining libhdfs3 is sadly not something I can commit to :(

@martindurant
Member

Note that webHDFS will need to be enabled for your HDFS system, and you may require kerberos authentication (not too likely), which requests-kerberos should handle for you.
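(Enabling webHDFS, if it isn't already on, is typically a one-line change in `hdfs-site.xml` on the cluster side; the property below is the standard Hadoop setting, shown here only as a sketch.)

```xml
<!-- hdfs-site.xml: expose the WebHDFS REST API on the NameNode/DataNodes -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```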

@DonDebonair
Author

We do have a Kerberized cluster, so thanks for the pointer!
