
[Python] pyarrow.fs.HadoopFileSystem cannot access Azure Data Lake (ADLS) #26807

Open
asfimport opened this issue Dec 10, 2020 · 4 comments

@asfimport

It's not possible to open an abfs:// or abfss:// URI with pyarrow.fs.HadoopFileSystem.

Using HadoopFileSystem.from_uri(path) does not work: libhdfs throws an error saying that the authority is invalid (I checked, and this is because the authority string is empty).

Note that the legacy pyarrow.hdfs.HadoopFileSystem interface does work with these URIs (the original code example did not survive the JIRA migration).

Reporter: Juan Galvez

Note: This issue was originally created as ARROW-10872. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~jjgalvez] thanks a lot for the report!

It's difficult for me to test whether your suggestion would work (and for other Arrow developers as well, since we often don't have a Hadoop or Azure filesystem at our disposal). But would you be able to try your suggestion yourself and see if it works for you? A PR would then also be very welcome.

cc @kszucs

@asfimport

Steve Loughran:
This problem would also surface if file:// were used as the source URL, which may permit reproducing it locally. (Note: MiniDFSCluster, in the hadoop-hdfs test JAR, lets you bring up an in-process HDFS cluster purely for testing.)

@WillDyson

ABFS URIs take the following form:
abfs://<container_name>@<account_name>.dfs.core.windows.net

It looks like the sanitisation that's done as part of the from_uri method ends up changing it to:
abfs://<account_name>.dfs.core.windows.net

This can be seen in the error returned – it is missing the container name.
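The mechanics can be illustrated with Python's standard-library URI parser: in an abfs URI the container name occupies the userinfo slot of the authority, so any sanitisation that rebuilds the URI from the hostname alone silently drops it. (This is a standalone sketch for illustration only; the actual sanitisation happens in Arrow's C++ code, not in Python.)

```python
from urllib.parse import urlsplit

uri = "abfs://data@bogus.dfs.core.windows.net"
parts = urlsplit(uri)

# The container name lands in the userinfo component of the authority
print(parts.username)  # -> data
print(parts.hostname)  # -> bogus.dfs.core.windows.net

# Rebuilding the URI from the hostname alone loses the container,
# which matches the malformed authority seen in the error message
stripped = f"{parts.scheme}://{parts.hostname}"
print(stripped)  # -> abfs://bogus.dfs.core.windows.net
```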

CC: hdfs.cc (I'm not familiar with this codebase, so I may have picked up the wrong codepath.)

A similar exception can be found using the Java client:

scala> FileSystem.get(new URI("abfs://bogus.dfs.core.windows.net"), new Configuration())
23/06/02 14:50:26 WARN fs.FileSystem: Failed to initialize fileystem abfs://bogus.dfs.core.windows.net: abfs://bogus.dfs.core.windows.net has invalid authority.
org.apache.hadoop.fs.azurebfs.contracts.exceptions.InvalidUriAuthorityException: abfs://bogus.dfs.core.windows.net has invalid authority.
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.authorityParts(AzureBlobFileSystemStore.java:334)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:202)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:195)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
  at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:162)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3557)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3504)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:522)
  ... 59 elided

Interestingly, this all appears to happen before a connection to Azure is attempted, so you may not need an ADLSgen2 container to validate this particular issue.

If we include a valid authority, the FileSystem is returned:

scala> FileSystem.get(new URI("abfs://data@bogus.dfs.core.windows.net"), new Configuration())
res0: org.apache.hadoop.fs.FileSystem = AzureBlobFileSystem{uri=abfs://data@bogus.dfs.core.windows.net, user='wdyson', primaryUserGroup='wdyson'[fs.azure.capability.readahead.safe]}

The wrapper around libhdfs should be modified to retain the container name before the @.
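In Python terms, the fix amounts to keeping the whole netloc (which includes the container@ userinfo) rather than just the hostname when the URI is reassembled. A minimal sketch of the idea, using a hypothetical helper name (the real change would live in the C++ libhdfs wrapper, not here):

```python
from urllib.parse import urlsplit

def preserve_authority(uri: str) -> str:
    # netloc retains the userinfo part ("container@"), unlike hostname
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}"

print(preserve_authority("abfs://data@bogus.dfs.core.windows.net"))
# -> abfs://data@bogus.dfs.core.windows.net (container name kept)
```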

@WillDyson

Here's the same example using libhdfs:

#include <stdio.h>
#include "hdfs.h"

int main(void) {
    printf("### Test with container name\n");
    hdfsConnect("abfs://data@bogus.dfs.core.windows.net", 0);
    printf("### Test without container name\n");
    hdfsConnect("abfs://bogus.dfs.core.windows.net", 0);
    return 0;
}

Output:
### Test with container name
23/06/02 15:24:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
### Test without container name
23/06/02 15:24:57 WARN fs.FileSystem: Failed to initialize fileystem abfs://bogus.dfs.core.windows.net: abfs://bogus.dfs.core.windows.net has invalid authority.
hdfsBuilderConnect(forceNewInstance=0, nn=abfs://bogus.dfs.core.windows.net, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
InvalidUriAuthorityException: abfs://bogus.dfs.core.windows.net has invalid authority.abfs://bogus.dfs.core.windows.net has invalid authority.
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.authorityParts(AzureBlobFileSystemStore.java:334)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:202)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:195)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:162)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3557)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3504)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:522)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:260)
        at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:257)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:257)

As in the previous case, the behaviour is the same whether or not the ADLSgen2 storage account actually exists.
