
[C++] HADOOP_HOME doesn't work to find libhdfs.so #24069

Closed
asfimport opened this issue Feb 12, 2020 · 7 comments

asfimport commented Feb 12, 2020

I have my environment variables set up correctly according to the pyarrow README:

$ ls $HADOOP_HOME/lib/native
libhadoop.a  libhadooppipes.a  libhadoop.so  libhadoop.so.1.0.0  libhadooputils.a  libhdfs.a  libhdfs.so  libhdfs.so.0.0.0 

Use the following script to reproduce:

import pyarrow
pyarrow.hdfs.connect('hdfs://localhost')

With pyarrow version 0.15.1 this works fine.

However, version 0.16.0 gives the following error:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/home/jackwindows/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/home/jackwindows/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
IOError: Unable to load libhdfs: /opt/hadoop/latest/libhdfs.so: cannot open shared object file: No such file or directory 

Reporter: Jack Fan
Assignee: Kouhei Sutou / @kou

Note: This issue was originally created as ARROW-7841. Please see the migration documentation for further details.

Krisztian Szucs / @kszucs:
[~JackWindows] What are the values of the HADOOP_HOME and ARROW_LIBHDFS_DIR environment variables? Arrow tries to load libhdfs.so from $HADOOP_HOME/libhdfs.so and $ARROW_LIBHDFS_DIR/libhdfs.so. You could try setting ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native
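
For illustration, here is a minimal Python sketch of the search order described above; the helper name and probing logic are assumptions made for exposition, not Arrow's actual C++ loader:

import os

def candidate_libhdfs_paths():
    """Yield the locations probed for libhdfs.so, following the search
    order described above (an illustrative sketch, not Arrow's C++ code)."""
    libhdfs_dir = os.environ.get("ARROW_LIBHDFS_DIR")
    if libhdfs_dir:
        yield os.path.join(libhdfs_dir, "libhdfs.so")
    hadoop_home = os.environ.get("HADOOP_HOME")
    if hadoop_home:
        # The behaviour under discussion: $HADOOP_HOME itself is probed,
        # rather than $HADOOP_HOME/lib/native as in 0.15.1.
        yield os.path.join(hadoop_home, "libhdfs.so")

for path in candidate_libhdfs_paths():
    print(path, "->", "found" if os.path.exists(path) else "missing")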

Jack Fan:
@kszucs

What are the values for HADOOP_HOME and ARROW_LIBHDFS_DIR environment variables?

$ echo $HADOOP_HOME
/opt/hadoop/latest
$ echo $ARROW_LIBHDFS_DIR


Arrow tries to load libhdfs.so from $HADOOP_HOME/libhdfs.so and $ARROW_LIBHDFS_DIR/libhdfs.so
Why is there a change of behaviour in version 0.16.0?

According to https://arrow.apache.org/docs/python/filesystems.html, "ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native."

IMHO it doesn't seem to make sense to try loading from $HADOOP_HOME/libhdfs.so.

Krisztian Szucs / @kszucs:
This regression was probably introduced with e12d285#diff-29a7b8eebea6dfdb3246dd2b853ba8dcR145

Could you please try setting ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native to see whether that resolves the issue?
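
A minimal sketch of applying that workaround in-process, assuming the paths reported above; the variable can equally be exported in the shell before starting Python:

import os

# Workaround: point Arrow at the directory that actually contains
# libhdfs.so. The fallback path matches the reporter's layout.
hadoop_home = os.environ.get("HADOOP_HOME", "/opt/hadoop/latest")
os.environ["ARROW_LIBHDFS_DIR"] = os.path.join(hadoop_home, "lib", "native")

import pyarrow
fs = pyarrow.hdfs.connect('hdfs://localhost')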

Jack Fan:
@kszucs  I can confirm that setting ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native solves the issue, but ideally I would like to avoid that.

IMO pyarrow should work with a standard Hadoop setup. ARROW_LIBHDFS_DIR should only be needed when I explicitly want to load libhdfs.so from a different location.

Kouhei Sutou / @kou:
Oh, sorry.
I missed the regression.

I'll fix it.

Krisztian Szucs / @kszucs:
[~JackWindows] Agree, thanks for verifying it!

Wes McKinney / @wesm:
Issue resolved by pull request #6424
