Add FileIO implementation for Azure Blob Storage #4465
Conversation
.baseline/checkstyle/checkstyle.xml
Outdated
@@ -122,7 +122,8 @@
org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.*,
org.apache.spark.sql.functions.*,
org.apache.spark.sql.connector.iceberg.write.RowLevelOperation.Command.*,
org.junit.Assert.*"/>
org.junit.Assert.*,
org.assertj.core.api.Assertions.*"/>
I don't think we need to change this.
Hi Ryan - without the above change, the checkstyleIntegration task fails with the following error:
[ant:checkstyle] [ERROR] /<redacted>/upstream-iceberg/azure/src/integration/java/org/apache/iceberg/azure/blob/TestAzureBlobOutputStream.java:38:46: Using a static member import should be avoided - org.assertj.core.api.Assertions.assertThat. [AvoidStaticImport]
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':iceberg-azure:checkstyleIntegration'.
> Checkstyle rule violations were found. See the report at: file:///<redacted>/upstream-iceberg/azure/build/reports/checkstyle/integration.html
Checkstyle files with violations: 5
Checkstyle violations by severity: [error:12]
Yes, you shouldn't use a static import for those methods.
Ack, will make the necessary modifications.
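For illustration, a test assertion in the qualified form, which the existing AvoidStaticImport rule already allows (the assertion itself is just a placeholder):

import org.assertj.core.api.Assertions;

// qualified call instead of the static import of Assertions.assertThat
Assertions.assertThat(1 + 1).isEqualTo(2);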
build.gradle
Outdated
dependencies {
api project(':iceberg-api')
implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
implementation platform('com.azure:azure-sdk-bom')
I would rather track dependency versions explicitly than rely on some external BOM. BOMs are fine for end users, but libraries should generally not delegate dependency versions to other projects.
Ack.
Will switch to direct dependencies.
Note: the GCP module currently also relies on a BOM. We should also make a change in the GCP module to avoid using a BOM.
Line 375 in 7c2ea01
implementation platform('com.google.cloud:libraries-bom')
If required, I can file another PR to eliminate the usage of the BOM in the GCP module.
@@ -38,6 +42,9 @@ buildscript {

plugins {
id 'nebula.dependency-recommender' version '9.0.2'
// Since 7.x gradle-docker-plugin is compiled using JDK11, thus using the latest version will fail Java8 builds
// https://bmuschko.github.io/gradle-docker-plugin/current/user-guide/#change_log
This comment doesn't make it clear why this plugin is used. Can you explain in more detail?
Ack.
try {
uri = new URI(location);
} catch (URISyntaxException e) {
throw new ValidationException("Invalid Azure URI: %s.", location);
There's no need for end punctuation in logs, and in fact in cases like this it is misleading because the trailing . could be interpreted as part of the URI. Can you remove end punctuation?
Ack
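For example, the message from the quoted line would simply drop the trailing period (keeping the same exception type used in the diff above):

throw new ValidationException("Invalid Azure URI: %s", location);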
Preconditions.checkNotNull(location, "Location cannot be null.");
final URI uri;
try {
uri = new URI(location);
We generally discourage the use of URI because it handles URI encoding in strange ways. I think it is a best practice to ignore the URI and parse manually using split and delimiters.
Ack.
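A minimal sketch of manual parsing with split, assuming a location of the form scheme://container@account-host/path; the exact layout AzureURI accepts may differ, so the names below are illustrative:

import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

String[] schemeSplit = location.split("://", 2);
Preconditions.checkArgument(schemeSplit.length == 2, "Invalid Azure URI, missing scheme: %s", location);
String[] authoritySplit = schemeSplit[1].split("/", 2);
String[] containerSplit = authoritySplit[0].split("@", 2);
Preconditions.checkArgument(containerSplit.length == 2, "Invalid Azure URI, missing storage account: %s", location);
String container = containerSplit[0];
String storageAccountHost = containerSplit[1];
String path = authoritySplit.length == 2 ? authoritySplit[1] : "";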
}
this.location = location;

ValidationException.check(
ValidationException is not a substitute for IllegalArgumentException. ValidationException indicates that while an argument may be valid, it is inconsistent with other arguments or config.
For example, when creating a partition spec, "column" is a valid column reference, but null would result in an IllegalArgumentException. But if "column" is not a defined name in the schema, then a ValidationException is thrown because you can't partition by an unknown column.
Ack.
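A small sketch of the distinction described above, mirroring the partition-spec example (the method and variable names are illustrative):

import org.apache.iceberg.Schema;
import org.apache.iceberg.exceptions.ValidationException;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

void checkPartitionColumn(Schema schema, String column) {
  // a null column name is a caller error -> IllegalArgumentException via checkArgument
  Preconditions.checkArgument(column != null, "Column name cannot be null");
  // a valid name that is not in the schema is a consistency problem -> ValidationException
  ValidationException.check(schema.findField(column) != null,
      "Cannot partition by unknown column: %s", column);
}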
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AzureBlobClientFactory {
Should this be package-private? I don't see a reason why people would need to use it directly.
Ack
Along with this, I've also made BaseAzureBlobFile package-private.
public static BlobClient createBlobClient(AzureURI azureURI, AzureProperties azureProperties) {
final String storageAccount = azureURI.storageAccount();
final BlobClientBuilder builder = new BlobClientBuilder();
In Iceberg, we don't use final for local variables. Recent JVM versions (8+) handle this without problems and it is also not very valuable because it isn't actually translated into bytecode.
Ack, I'll remove it for the local variables.
LOG.debug("Using {} endpoint for {}", endpoint, storageAccount);
builder.endpoint(endpoint);
final AuthType authType = azureProperties.authType(storageAccount);
setAuth(storageAccount, authType, azureProperties, builder);
Why is setAuth not required when using a connection string?
The general form of a connection string is as follows:
DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;
AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;
BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;
QueueEndpoint=http://127.0.0.1:10001/devstoreaccount1;
TableEndpoint=http://127.0.0.1:10002/devstoreaccount1;
Thus the connection string contains all the auth-related information and the storage account endpoints required for establishing a connection, and hence it is not necessary to set the auth explicitly.
https://docs.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string#configure-a-connection-string-for-an-azure-storage-account
Note: the above connection string does not leak any actual keys; it is the default connection string used by the Azurite emulator and is therefore safe to share on public forums.
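For reference, a sketch of how a client could be built straight from such a connection string with the Azure SDK's BlobClientBuilder; the container and blob names are placeholders, and no separate credential or endpoint call is needed because the string already carries them:

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;

BlobClient client = new BlobClientBuilder()
    .connectionString(connectionString) // e.g. the Azurite string shown above
    .containerName("my-container")      // placeholder
    .blobName("path/to/data.parquet")   // placeholder
    .buildClient();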
@Override
public long getLength() {
if (length == null) {
length = blobClient().getProperties().getBlobSize();
What happens if the blob doesn't exist? Could this throw NotFoundException to standardize?
It throws a BlobStorageException: Status code 404 (BlobNotFound) as of now. I can standardize it to throw NotFoundException.
private void openStream(long offset) {
final BlobInputStreamOptions options = new BlobInputStreamOptions().setRange(new BlobRange(offset))
.setBlockSize(azureProperties.readBlockSize(azureURI.storageAccount()));
We usually put each chained method on a separate line.
Ack
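For example, the call above reformatted with each chained method on its own line (and final dropped per the earlier comment):

BlobInputStreamOptions options = new BlobInputStreamOptions()
    .setRange(new BlobRange(offset))
    .setBlockSize(azureProperties.readBlockSize(azureURI.storageAccount()));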
final long bytesToSkip = newPos - pos;
// BlobInputStream#skip only repositions the internal pointers,
// the actual bytes are skipped when BlobInputStream#read is invoked.
final long bytesSkipped = stream.skip(bytesToSkip);
Does this read through or just change the next request?
It does not read through; it only repositions internal pointers to change the next read request.
targetContainerId createContainer.getContainerId()
}

task integrationTest(type: Test) {
Can you make sure that there is a workflow that runs these integration tests?
The java-ci workflow already takes care of this.
The check task depends on the above integrationTest task. The java-ci workflow runs the check task on each module in Iceberg, which would in turn run the integration tests for azure.
iceberg/.github/workflows/java-ci.yml
Line 68 in 7c2ea01
- run: ./gradlew check -DsparkVersions= -DhiveVersions= -DflinkVersions= -Pquick=true -x javadoc
package org.apache.iceberg.azure;

public enum AuthType { |
Does this need to be public?
Yes, this needs to be public.
AuthType is accessed in AzureProperties, AzureBlobClientFactory, and IntegrationTests.
Making it package-private would leave AzureProperties unaffected since both of them reside in the org.apache.iceberg.azure package.
However, AzureBlobClientFactory and IntegrationTests would throw a compilation error since they reside in the org.apache.iceberg.azure.blob package.
public class AzureProperties implements Serializable {

// Start of storage account configuration
public static final String STORAGE_CONNECTION_STRING = "azure.storage.%s.connection-string";
These seem long. What about abfs.%s.uri or abfs.%s.connection-string instead? Does it really need to be "connection string"?
These seem long. What about abfs.%s.uri or abfs.%s.connection-string instead?
I can replace azure.storage with abfs for this and the rest of the configs to make them shorter.
Does it really need to be "connection string"?
It would be a good idea to keep connection-string as is, since it would directly map to the connection-string config available on the Azure portal, thereby reducing the chances of confusing this config with a different one.
Thanks for working on this, @sumeetgajjar! Overall it is looking good.
@rdblue thanks for the review and your comments.
}

@Override
public void seek(long newPos) { |
Does this take care of reverse seek issues, where connections can be terminated and reopened?
Hi @rbalamohan - can you please elaborate more on the reverse seek issues?
For the reverse seek case, we close the current stream and re-open the stream from the earlier offset.
// Seeking backward.
stream.close();
openStream(newPos); // newPos is a position back in the stream.
Connection close/reopens are expensive in cloud stores. https://issues.apache.org/jira/browse/HADOOP-12444 has more details on backward seek. A good set of tickets went in to fix/improve backward seek, especially for columnar formats like ORC and Parquet.
Thanks for the feedback - I can follow a similar approach in the current implementation: simply set newPos to the given value and skip openStream in the seek method, then lazily open a stream in the subsequent read request.
Done.
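A minimal sketch of that lazy-open approach; the field and helper names (pos, next, stream, openStream) mirror this PR, but the exact code is illustrative:

import java.io.IOException;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

@Override
public void seek(long newPos) {
  Preconditions.checkArgument(newPos >= 0, "Invalid position: %s", newPos);
  this.next = newPos; // only remember the target offset; no I/O here
}

@Override
public int read() throws IOException {
  positionStream(); // open or re-open the blob stream only when data is needed
  int value = stream.read();
  if (value != -1) {
    pos += 1;
    next += 1;
  }
  return value;
}

private void positionStream() throws IOException {
  if (stream != null && next == pos) {
    return; // already positioned correctly
  }
  if (stream != null) {
    stream.close(); // avoid leaking the previous connection on a re-seek
  }
  openStream(next);
  this.pos = next;
}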
Does this support Data Lake Storage Gen2?
Hi @electrum - yes, DataLake Storage Gen2 is supported. In fact, we only support Gen2.
@rdblue gentle ping to re-review the PR.
Any plan to merge this PR?
Curious about the state of this PR and if there is a plan to merge?
Doesn't look like there is anything left to do. Will this be merged anytime soon? @rdblue?
Also curious about this one!
Will this be merged anytime soon? I'd love to have support for blob storage instead of only adlfs (as implemented in #8303).
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
What changes were proposed in this pull request?
Currently, HadoopFileIO is used to talk to Azure Blob Storage. This PR introduces AzureFileIO, which uses the native Azure SDK to communicate with Azure Blob Storage.
Does this PR introduce any user-facing change?
Yes, users can now configure the catalog io-impl property to org.apache.iceberg.azure.blob.AzureBlobFileIO to enable this feature.
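For example, a hypothetical catalog configuration enabling the new FileIO (the catalog instance, its name, and the remaining warehouse/auth properties are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;

Map<String, String> properties = new HashMap<>();
properties.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.azure.blob.AzureBlobFileIO");
// ... plus warehouse location and Azure connection/auth properties ...
catalog.initialize("my_catalog", properties);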
How was this patch tested?
Closes #4257