
[Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR] #22762

@asfimport

Description

I can't access HDFS through pyarrow from inside a YARN container created by skein.

This code works in a Jupyter notebook running on the master node, or in an IPython terminal on a worker when the `ARROW_LIBHDFS_DIR` environment variable is set:

```python
import pyarrow; pyarrow.hdfs.connect()
```

However, when running on YARN by submitting the following skein application spec, I get a Java error.

 

```yaml
name: test_conn
queue: default

master:
  env:
    ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
    JAVA_HOME: /etc/alternatives/jre
  resources:
    vcores: 1
    memory: 10 GiB
  files:
    conda_env: /home/hadoop/environment.tar.gz
  script: |
    echo $HADOOP_HOME
    echo $JAVA_HOME
    echo $HADOOP_CLASSPATH
    echo $ARROW_LIBHDFS_DIR
    source conda_env/bin/activate
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
    echo "Hello World!"
```
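For what it's worth, one difference between the interactive environments where this works and the YARN container is the classpath seen by libhdfs: pyarrow's HDFS driver builds its JVM classpath from the `CLASSPATH` environment variable, not from `HADOOP_CLASSPATH`, and the pyarrow docs suggest populating it with `hadoop classpath --glob`. A sketch of the `script` section with that added (assuming the `hadoop` launcher is on PATH inside the container):

```yaml
  script: |
    source conda_env/bin/activate
    # libhdfs reads its jar list from $CLASSPATH; HADOOP_CLASSPATH alone is
    # not consulted, so export the expanded glob explicitly (assumption: the
    # `hadoop` binary is available on PATH in the container)
    export CLASSPATH="$(hadoop classpath --glob)"
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
```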

FYI, I tried with and without all those extra env vars, to no effect. I also tried modifying the EMR cluster's Hadoop configuration with each of the following properties:

 

```
"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
"fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
```

The `fs.AbstractFileSystem.hdfs.impl` setting gave a slightly different error: Hadoop could then resolve which class name to use for the `hdfs://` scheme, namely `org.apache.hadoop.hdfs.DistributedFileSystem`, but could not load that class.
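For context, Hadoop's `FileSystem.getFileSystemClass` resolves a URI scheme in roughly two steps: it first consults the `fs.<scheme>.impl` configuration key, then falls back to implementations registered through Java's `ServiceLoader`; if both come up empty it throws the exact error in the logs. An illustrative Python sketch of that lookup (not Hadoop's actual code — the names here are hypothetical stand-ins):

```python
# Simplified model of Hadoop's FileSystem.getFileSystemClass lookup.
# `conf` stands in for the Hadoop Configuration; `service_registry` stands in
# for implementations discovered via META-INF/services on the classpath.
def get_file_system_class(scheme, conf, service_registry):
    # 1. Explicit configuration wins: fs.<scheme>.impl
    impl = conf.get(f"fs.{scheme}.impl")
    if impl is None:
        # 2. Otherwise fall back to ServiceLoader-registered implementations
        impl = service_registry.get(scheme)
    if impl is None:
        raise IOError(f"No FileSystem for scheme: {scheme}")
    return impl

# With neither source knowing about "hdfs" — i.e. no fs.hdfs.impl set and the
# hadoop-hdfs jar not visible on the classpath — we get the error from the logs:
try:
    get_file_system_class("hdfs", {}, {})
except IOError as e:
    print(e)  # -> No FileSystem for scheme: hdfs
```

This is consistent with the hadoop-hdfs jar (which registers `DistributedFileSystem` via its service file) simply not being on the classpath the container hands to the JVM.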

Logs:

 

```
=========================================================================================
LogType:application.driver.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:2635
Log Contents:
/usr/lib/hadoop
/usr/lib/jvm/java-openjdk
:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/

hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
	at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
	at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
	at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
	at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
	at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
Hello World!
End of LogType:application.driver.log

LogType:application.master.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:1588
Log Contents:
19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log
```

Environment: Hadoop 2.8.5
EMR 5.24.1
Python version: 3.7.4
skein version: 0.8.0
Reporter: Ben Schreck

Note: This issue was originally created as ARROW-6389. Please see the migration documentation for further details.
