Azure: Add FileIO that supports ADLSv2 storage #8303

Merged: 37 commits into apache:master on Aug 25, 2023

Conversation

@bryanck (Contributor) commented Aug 13, 2023

This PR adds a FileIO implementation for Azure Data Lake Storage Gen2. The URI format was kept consistent with Hadoop's Azure URI format to ease any transition; however, TLS is always used, with either the abfs or abfss scheme. Range reads are also implemented. The new FileIO was added as a delegate type in ResolvingFileIO, and both the prefix and bulk operation mixin interfaces are implemented.

To limit the scope of this PR, authorization was limited to using the default Azure credential chain, SAS token, or connection string. Enhancements can be addressed in a follow-up PR.

The project was added as a dependency to the Spark and Flink runtimes, similar to AWS and GCP. An azure-bundle project was also added for building a jar with the necessary Azure dependencies when running with Spark or Flink, similar to the existing AWS and GCP bundles.

Azurite doesn't yet support ADLSv2 directory operations, so mocks were used for prefix-related tests. Manual testing was performed against a real Azure account.
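
For a concrete sense of how the new FileIO is used, here is a minimal sketch; the class and property names below reflect this PR as I read it, and the account, container, and token values are placeholders:

import java.util.Map;
import org.apache.iceberg.azure.adlsv2.ADLSFileIO;
import org.apache.iceberg.io.InputFile;

public class AdlsFileIOExample {
  public static void main(String[] args) {
    // With no auth properties set, the FileIO falls back to the default Azure
    // credential chain; a SAS token or connection string can be set per account.
    ADLSFileIO io = new ADLSFileIO();
    io.initialize(Map.of("adls.sas-token.mystorageaccount", "<sas-token>"));

    // The URI format mirrors Hadoop's Azure scheme; TLS is used for both abfs and abfss.
    InputFile file =
        io.newInputFile(
            "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/warehouse/db/t/metadata/v1.metadata.json");
    System.out.println(file.exists());
  }
}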

github-actions bot added the INFRA label Aug 13, 2023
@bryanck changed the title from "Azure: Add support for ADLSv2 storage" to "Azure: Add FileIO that supports ADLSv2 storage" Aug 13, 2023
@karlschriek commented:

I am wondering how this differs from #4465, which has been lying dormant (and as far as I can see is just waiting for someone to hit the "merge" button) for over a year now?

@bryanck (Contributor, Author) commented Aug 14, 2023

> I am wondering how this differs from #4465, which has been lying dormant (and as far as I can see is just waiting for someone to hit the "merge" button) for over a year now?

The main difference is this PR uses the data lake client API instead of the blob client API. File operations work the same way but bulk and prefix operations will differ.
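
To make the distinction concrete, here is a rough sketch of the two Azure SDK surfaces; the endpoints and names are placeholders, and this is illustrative rather than code from either PR:

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class ClientComparison {
  public static void main(String[] args) {
    var credential = new DefaultAzureCredentialBuilder().build();

    // Data lake client (this PR): hierarchical namespace, directory-aware listing.
    DataLakeFileSystemClient fileSystem =
        new DataLakeServiceClientBuilder()
            .endpoint("https://mystorageaccount.dfs.core.windows.net")
            .credential(credential)
            .buildClient()
            .getFileSystemClient("mycontainer");
    fileSystem.listPaths().forEach(path -> System.out.println(path.getName()));

    // Blob client (#4465): flat namespace, prefix-based listing.
    BlobContainerClient container =
        new BlobServiceClientBuilder()
            .endpoint("https://mystorageaccount.blob.core.windows.net")
            .credential(credential)
            .buildClient()
            .getBlobContainerClient("mycontainer");
    container.listBlobs().forEach(blob -> System.out.println(blob.getName()));
  }
}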

dependencies {
  implementation platform(libs.azuresdk.bom)
  implementation "com.azure:azure-storage-file-datalake"
  implementation "com.azure:azure-identity"
}
Contributor commented:

For other integrations, we don't bundle the dependencies and only ship the Iceberg side. That keeps our bundle small and doesn't force any particular version on downstream consumers. It also avoids needing to do a lot of license and notice documentation work. Is that possible here? Is there a dependency bundle that we can use at runtime?

@bryanck (Contributor, Author) replied Aug 14, 2023:

The azure-bundle project was set up to bundle all of the necessary Azure dependencies in one shadow jar as an (optional) convenience to users, similar to the aws-bundle and gcp-bundle projects. These bundles include only the necessary runtime libraries, at the same versions used for the Iceberg build, and shade conflicting libraries. A user can opt to include their own Azure dependencies instead and not use the bundle at all. For example, all you need to run with Spark is the Spark runtime plus the Azure/AWS/GCP bundle. Neither Microsoft nor Google provides such a bundle; Amazon has one for AWS, but it is very large, which causes issues with some systems.

This is separate from the azure project build, which declares the Azure dependencies as compileOnly so they are not included with any runtime.
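
A sketch of what that distinction looks like in the two build files, inferred from the description above rather than copied from the actual builds:

// iceberg-azure: dependencies are compileOnly, so no Azure jars ship with any runtime
dependencies {
  compileOnly platform(libs.azuresdk.bom)
  compileOnly "com.azure:azure-storage-file-datalake"
  compileOnly "com.azure:azure-identity"
}

// iceberg-azure-bundle: the same dependencies as implementation, shadowed into one jar
dependencies {
  implementation platform(libs.azuresdk.bom)
  implementation "com.azure:azure-storage-file-datalake"
  implementation "com.azure:azure-identity"
}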

Contributor replied:

Okay, I see. I didn't know about the other bundle projects. Looks like the LICENSE file is updated for those, but not the NOTICE. Did you check whether each bundled project has a NOTICE that we need to include?

@bryanck (Contributor, Author) replied:

I did not; I'll do that now for all three.

@bryanck (Contributor, Author) replied:

I added this. I'll open a separate PR for the AWS and GCP bundles.

@bryanck (Contributor, Author) replied:

The PR for the AWS and GCP bundles is here: #8323

Contributor replied:

Thanks! It's amazing that we can automate this now. It was such a giant pain to do this in the past!

github-actions bot added the hive label Aug 20, 2023
Preconditions.checkState(!closed, "Cannot seek: already closed");
Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos);

// this allows a seek beyond the end of the stream but the next read will fail
Contributor commented:

Why allow seek beyond the end of the stream?

@bryanck (Contributor, Author) replied:

This was done to keep the behavior consistent with S3InputStream.
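
For readers unfamiliar with that convention, here is a minimal sketch of the lazy-seek pattern both streams follow (not the PR's exact code):

import com.google.common.base.Preconditions;

abstract class LazySeekStream {
  private boolean closed = false;
  private long next = 0;

  // seek() only records the target position, so seeking past EOF succeeds here;
  // the failure surfaces on the next read, when the ranged request against the
  // object store returns an error or no data.
  public void seek(long newPos) {
    Preconditions.checkState(!closed, "Cannot seek: already closed");
    Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos);
    this.next = newPos; // intentionally no check against the file length
  }
}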

* Support</a>
*/
class ADLSLocation {
  private static final Pattern URI_PATTERN = Pattern.compile("^abfss?://(.+?)([/?#].*)?$");
@rdblue (Contributor) commented Aug 24, 2023:

Wouldn't it be safer to use [^/?#]+ for the first group instead of using non-greedy matching?

@bryanck (Contributor, Author) replied:

Yes, thanks, I made this change.


String uriPath = matcher.group(2);
uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath;
this.path = uriPath.split("\\?", -1)[0].split("#", -1)[0];
Contributor commented:

If uriPath is null, then path is going to be an empty string? I generally try to avoid empty string as a default.

@bryanck (Contributor, Author) replied:

This was done primarily for the Azure API, which expects an empty string instead of null for the root.
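
Tying the two review threads together, here is a standalone sketch of the parsing behavior with the revised pattern; the URIs are placeholders and the helper method is illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocationParseExample {
  // Revised pattern: [^/?#]+ for the first group instead of non-greedy matching.
  private static final Pattern URI_PATTERN = Pattern.compile("^abfss?://([^/?#]+)([/?#].*)?$");

  public static void main(String[] args) {
    print("abfss://container@account.dfs.core.windows.net/dir/file.parquet");
    print("abfss://container@account.dfs.core.windows.net"); // root -> empty path
    print("abfss://container@account.dfs.core.windows.net/dir/file.parquet?sv=token");
  }

  private static void print(String location) {
    Matcher matcher = URI_PATTERN.matcher(location);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Invalid ADLS URI: " + location);
    }
    String uriPath = matcher.group(2);
    // strip the leading slash; the Azure API expects "" (not null) for the root
    uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath;
    String path = uriPath.split("\\?", -1)[0].split("#", -1)[0];
    System.out.println(matcher.group(1) + " -> '" + path + "'");
  }
}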

@Fokko merged commit 99410a1 into apache:master on Aug 25, 2023
41 checks passed
@Fokko (Contributor) commented Aug 25, 2023

Thanks @bryanck for working on this, and @danielcweeks, @rdblue & @nastra for the review 🙌🏻
