pyarrow.fs.FileSystem.from_uri does not refer to HDFS core-site.xml config file when resolving namenode for HDFS URL #42050

@wkarwacki

Description

pyarrow==15.0.2

Hey! I'm trying to use FileSystem.from_uri (https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.from_uri) to obtain a FileSystem, like below:

from pyarrow.fs import FileSystem

FileSystem.from_uri("hdfs:///some/path")

However, even though core-site.xml is properly configured, I'm getting:

URISyntaxException: Expected authority at index 7: hdfs://
java.lang.IllegalArgumentException: Expected authority at index 7: hdfs://

I might be mistaken, but it seems that it looks for the hostname between the second and third slash characters of the HDFS URL and does not take the core-site.xml config into account at all.
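To illustrate what "between the second and third slash" means, here is a minimal sketch using only Python's standard urllib.parse. It does not call pyarrow or reproduce its internal parsing; it only shows that the URL from this report has an empty authority component. The host name namenode and port 8020 in the second URL are made up for comparison.

```python
from urllib.parse import urlparse

# URL from the report: nothing between the second and third slash.
no_authority = urlparse("hdfs:///some/path")

# Hypothetical namenode host and port, used only for comparison.
with_authority = urlparse("hdfs://namenode:8020/some/path")

# The authority (netloc) is empty in the first case.
print(repr(no_authority.netloc), no_authority.path)      # '' /some/path
print(repr(with_authority.netloc), with_authority.path)  # 'namenode:8020' /some/path

# "hdfs://" is seven characters long, so index 7 is the position
# where the authority would have to start.
print(len("hdfs://"))  # 7
```

This matches the "Expected authority at index 7" wording in the error: the authority is expected right after hdfs://, and nothing is there.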

I'm able to create the FileSystem successfully when I provide the namenode explicitly with hdfs://{namenode}/{path}. Currently, we work around this issue by handling the two kinds of URL separately:

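from typing import Tuple
from urllib.parse import urlparse

from pyarrow.fs import FileSystem, HadoopFileSystem


# resolve_filesystem is just a name for this snippet, not a pyarrow API
def resolve_filesystem(url: str) -> Tuple[FileSystem, str]:
    parsed = urlparse(url)
    if parsed.scheme == "hdfs":
        # host "default" properly recognizes core-site.xml
        hadoop_file_system = HadoopFileSystem("default")
        return (hadoop_file_system, parsed.path)
    # any other scheme goes through the regular URI resolution
    return FileSystem.from_uri(uri=url)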
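One possible refinement of the workaround, sketched under assumptions: fall back to "default" only when the URL carries no authority, so that an explicitly given namenode is still honored. The helper name split_hdfs_url is hypothetical and not part of pyarrow; it uses only the standard library and leaves the actual connection to the caller.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_hdfs_url(url: str) -> Tuple[str, Optional[int], str]:
    """Split an hdfs:// URL into (host, port, path).

    Hypothetical helper: host falls back to "default" when the URL has
    no authority, and port is None when the URL does not specify one.
    """
    parsed = urlparse(url)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an hdfs URL: {url!r}")
    host = parsed.hostname or "default"
    return host, parsed.port, parsed.path


print(split_hdfs_url("hdfs:///some/path"))
print(split_hdfs_url("hdfs://namenode:8020/some/path"))
```

The resulting host could then be passed to HadoopFileSystem, for example HadoopFileSystem(host) when the port is None and HadoopFileSystem(host, port) otherwise. Whether that covers every cluster setup (HA nameservices, for instance) has not been checked here.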

Component(s)

C++, Python
