Hey! I'm trying to use this function https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.from_uri to obtain a FileSystem, like below:
from pyarrow.fs import FileSystem
FileSystem.from_uri("hdfs:///some/path")
However, even though core-site.xml is properly configured, I'm getting:
URISyntaxException: Expected authority at index 7: hdfs://java.lang.IllegalArgumentException: Expected authority at index 7: hdfs://
I might be mistaken, but it seems that it tries to find the hostname between the second and third slash characters of such an HDFS URL and does not take the core-site.xml config into account at all.
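To illustrate what I mean by the missing hostname, here is a minimal sketch using Python's urllib.parse. It is only a stand-in to show how the URL splits into parts; I'm not claiming this is how Arrow or the Hadoop client parses it, and the helper name describe_hdfs_uri and the namenode:8020 address are made up for the example.

```python
from urllib.parse import urlparse


def describe_hdfs_uri(uri: str) -> dict:
    """Split a URI into the parts relevant to this issue (illustrative helper)."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,
        # The authority (host[:port]) sits between the second and third slash.
        "authority": parsed.netloc,
        "path": parsed.path,
    }


# No namenode given: the authority is empty, which is the case that fails for me.
no_authority = describe_hdfs_uri("hdfs:///some/path")

# Hypothetical namenode address: the authority is populated.
with_authority = describe_hdfs_uri("hdfs://namenode:8020/some/path")

# "hdfs://" is 7 characters long, so index 7 is exactly where an authority would start.
authority_index = len("hdfs://")
```

In the first case the authority is an empty string, which matches the "Expected authority at index 7" message: there is nothing after hdfs:// where a host would be.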
I'm able to successfully create a FileSystem when providing the namenode explicitly, as in hdfs://{namenode}/{path}. Currently, we are working around this issue by using two parsing strategies, depending on the URL scheme:
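from typing import Tuple
from urllib.parse import urlparse

from pyarrow.fs import FileSystem, HadoopFileSystem

def filesystem_from_url(url: str) -> Tuple[FileSystem, str]:  # function name is illustrative
    parsed = urlparse(url)
    if parsed.scheme == "hdfs":
        hadoop_file_system = HadoopFileSystem("default")  # this properly recognizes core-site.xml
        return (hadoop_file_system, parsed.path)
    else:
        # Any other scheme is handled by from_uri, which returns (filesystem, path)
        file_system: Tuple[FileSystem, str] = FileSystem.from_uri(uri=url)
        return file_system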
Component(s)
C++, Python