[Python] pyarrow.fs.HadoopFileSystem cannot access Azure Data Lake (ADLS) #26807
Comments
Joris Van den Bossche / @jorisvandenbossche: It's difficult for me to test whether your suggestion would work (and for other Arrow developers as well, since we often don't have a Hadoop or Azure filesystem at our disposal to test against). But would you be able to try your suggestion yourself, and see if that works for you? A PR would then also be very welcome. cc @kszucs
Steve Loughran:
ABFS URIs take the following form: It looks like the sanitisation that's done as part of the `from_uri` method ends up changing it to: This can be seen in the error returned – it is missing the container name. CC: hdfs.cc (not familiar with this codebase so I may have picked up the wrong codepath). A similar exception can be found using the Java client:
Interestingly, this all appears to happen before a connection to Azure is attempted, so you may not need an ADLSgen2 container to validate this particular issue. If we include a valid authority, the FileSystem is returned:
The wrapper around libhdfs should be modified to retain the container name before the `@`.
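The structure described above can be illustrated with a standalone sketch (using Python's `urllib.parse` and hypothetical container/account names, not the actual Arrow code): in an ABFS URI the container name occupies the userinfo portion of the authority, so any sanitisation that keeps only the hostname drops it.

```python
from urllib.parse import urlparse

# Hypothetical ABFS URI: the container name sits in the userinfo
# ("mycontainer@") part of the authority, the account in the hostname.
uri = "abfs://mycontainer@myaccount.dfs.core.windows.net/path/to/file"
parsed = urlparse(uri)

print(parsed.scheme)    # abfs
print(parsed.username)  # mycontainer  (the container name)
print(parsed.hostname)  # myaccount.dfs.core.windows.net
print(parsed.path)      # /path/to/file

# If a URI is rebuilt from only the hostname, the container is lost,
# which matches the "missing container name" symptom described above:
stripped = f"{parsed.scheme}://{parsed.hostname}{parsed.path}"
print(stripped)         # abfs://myaccount.dfs.core.windows.net/path/to/file
```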
Here's the same example using libhdfs:
As in the previous case, the behaviour is the same regardless of whether the ADLSgen2 storage account actually exists.
It's not possible to open an `abfs://` or `abfss://` URI with `pyarrow.fs.HadoopFileSystem`. Using `HadoopFileSystem.from_uri(path)` does not work, and libhdfs will throw an error saying that the authority is invalid (I checked that this is because the string is empty).
Note that the legacy `pyarrow.hdfs.HadoopFileSystem` interface works by doing, for example:

```python
pyarrow.hdfs.HadoopFileSystem(host="abfs://xxx@xxx.dfs.core.windows.net")
pyarrow.hdfs.connect(host="abfs://xxx@xxx.dfs.core.windows.net")
```
and I believe the new interface should work too by passing the full URI as `host` to the `pyarrow.fs.HadoopFileSystem` constructor. However, the constructor wrongly prepends "hdfs://" at the beginning: arrow/python/pyarrow/_hdfs.pyx, line 64 in 25c736d
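The effect reported above can be sketched in pure Python (a simplified, hypothetical helper, not the actual Cython source in `_hdfs.pyx`): if the constructor unconditionally prefixes the host with "hdfs://", a full `abfs://` URI passed as `host` produces a malformed double-scheme connection string.

```python
# Hypothetical sketch of the reported behaviour, not pyarrow's code:
# the host argument is prefixed with "hdfs://" regardless of whether
# it already carries its own scheme.
def build_connection_uri(host: str, port: int = 8020) -> str:
    return f"hdfs://{host}:{port}"

# Passing a full abfs:// URI as host yields a malformed URI:
print(build_connection_uri("abfs://xxx@xxx.dfs.core.windows.net"))
# hdfs://abfs://xxx@xxx.dfs.core.windows.net:8020
```

A fix along the lines suggested in the comments would check for an existing scheme before prepending, rather than assuming a bare hostname.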
Reporter: Juan Galvez
Note: This issue was originally created as ARROW-10872. Please see the migration documentation for further details.