
[Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR] #22762

@asfimport

Description

I can't access HDFS through pyarrow from inside a YARN container created by skein.

This code works in a Jupyter notebook running on the master node, or in an IPython terminal on a worker when the `ARROW_LIBHDFS_DIR` environment variable is set:

```python
import pyarrow; pyarrow.hdfs.connect()
```

However, when running on YARN by submitting the following skein application spec, I get a Java error.

 

```yaml
name: test_conn
queue: default

master:
  env:
    ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
    JAVA_HOME: /etc/alternatives/jre
  resources:
    vcores: 1
    memory: 10 GiB
  files:
    conda_env: /home/hadoop/environment.tar.gz
  script: |
    echo $HADOOP_HOME
    echo $JAVA_HOME
    echo $HADOOP_CLASSPATH
    echo $ARROW_LIBHDFS_DIR
    source conda_env/bin/activate
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
    echo "Hello World!"
```
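For what it's worth, one difference between the interactive environments where this works and the YARN container is the classpath seen by libhdfs: pyarrow's HDFS driver builds its JVM classpath from the `CLASSPATH` environment variable, not from `HADOOP_CLASSPATH`, and the pyarrow docs suggest populating it with `hadoop classpath --glob`. A sketch of the `script` section with that added (assuming the `hadoop` launcher is on PATH inside the container):

```yaml
  script: |
    source conda_env/bin/activate
    # libhdfs reads its jar list from $CLASSPATH; HADOOP_CLASSPATH alone is
    # not consulted, so export the expanded glob explicitly (assumption: the
    # `hadoop` binary is available on PATH in the container)
    export CLASSPATH="$(hadoop classpath --glob)"
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
```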

FYI, I tried with and without all those extra env vars, to no effect. I also tried modifying the EMR cluster's Hadoop configuration with each of the following properties:

 

```
"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
"fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
```

The `fs.AbstractFileSystem.hdfs.impl` setting gave a slightly different error: Hadoop could then resolve which class name to use for the `hdfs://` scheme, namely `org.apache.hadoop.hdfs.DistributedFileSystem`, but could not load that class.
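For context, Hadoop's `FileSystem.getFileSystemClass` resolves a URI scheme in roughly two steps: it first consults the `fs.<scheme>.impl` configuration key, then falls back to implementations registered through Java's `ServiceLoader`; if both come up empty it throws the exact error in the logs. An illustrative Python sketch of that lookup (not Hadoop's actual code — the names here are hypothetical stand-ins):

```python
# Simplified model of Hadoop's FileSystem.getFileSystemClass lookup.
# `conf` stands in for the Hadoop Configuration; `service_registry` stands in
# for implementations discovered via META-INF/services on the classpath.
def get_file_system_class(scheme, conf, service_registry):
    # 1. Explicit configuration wins: fs.<scheme>.impl
    impl = conf.get(f"fs.{scheme}.impl")
    if impl is None:
        # 2. Otherwise fall back to ServiceLoader-registered implementations
        impl = service_registry.get(scheme)
    if impl is None:
        raise IOError(f"No FileSystem for scheme: {scheme}")
    return impl

# With neither source knowing about "hdfs" — i.e. no fs.hdfs.impl set and the
# hadoop-hdfs jar not visible on the classpath — we get the error from the logs:
try:
    get_file_system_class("hdfs", {}, {})
except IOError as e:
    print(e)  # -> No FileSystem for scheme: hdfs
```

This is consistent with the hadoop-hdfs jar (which registers `DistributedFileSystem` via its service file) simply not being on the classpath the container hands to the JVM.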

Logs:

 

```
=========================================================================================
LogType:application.driver.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:2635
Log Contents:
/usr/lib/hadoop
/usr/lib/jvm/java-openjdk
:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/

hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
	at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
	at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
	at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
	at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
	at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
Hello World!
End of LogType:application.driver.log

LogType:application.master.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:1588
Log Contents:
19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log
```

Environment: Hadoop 2.8.5
EMR 5.24.1
Python version: 3.7.4
skein version: 0.8.0
Reporter: Ben Schreck

Note: This issue was originally created as ARROW-6389. Please see the migration documentation for further details.
