Azure: Add FileIO that supports ADLSv2 storage #8303

Merged: 37 commits into apache:master on Aug 25, 2023

Conversation

@bryanck (Contributor) commented Aug 13, 2023

This PR adds a FileIO implementation for Azure Data Lake Storage Gen2. The URI format was kept consistent with Hadoop's Azure URI format to ease any transition; however, TLS is always used, with either the abfs or abfss scheme. Range reads are also implemented. The new FileIO was added as a delegate type in ResolvingFileIO, and both the prefix and bulk operation mixin interfaces are implemented.

To limit the scope of this PR, authorization was limited to using the default Azure credential chain, SAS token, or connection string. Enhancements can be addressed in a follow-up PR.

The project was added as a dependency to the Spark and Flink runtimes, similar to AWS and GCP. An azure-bundle project was also added for building a jar with the necessary Azure dependencies when running with Spark or Flink, similar to the existing AWS and GCP bundles.

Azurite doesn't yet support ADLSv2 directory operations, so mocks were used for prefix-related tests. Manual testing was performed against a real Azure account.
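
For a concrete sense of how the new FileIO is used, here is a minimal sketch; the class and property names below reflect this PR as I read it, and the account, container, and token values are placeholders:

import java.util.Map;
import org.apache.iceberg.azure.adlsv2.ADLSFileIO;
import org.apache.iceberg.io.InputFile;

public class AdlsFileIOExample {
  public static void main(String[] args) {
    // With no auth properties set, the FileIO falls back to the default Azure
    // credential chain; a SAS token or connection string can be set per account.
    ADLSFileIO io = new ADLSFileIO();
    io.initialize(Map.of("adls.sas-token.mystorageaccount", "<sas-token>"));

    // The URI format mirrors Hadoop's Azure scheme; TLS is used for both abfs and abfss.
    InputFile file =
        io.newInputFile(
            "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/warehouse/db/t/metadata/v1.metadata.json");
    System.out.println(file.exists());
  }
}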

github-actions bot added the INFRA label Aug 13, 2023
@bryanck changed the title from "Azure: Add support for ADLSv2 storage" to "Azure: Add FileIO that supports ADLSv2 storage" Aug 13, 2023
@karlschriek commented:

I am wondering how this differs from #4465, which has been lying dormant (and as far as I can see is just waiting for someone to hit the "merge" button) for over a year now?

@bryanck (Contributor, Author) commented Aug 14, 2023

> I am wondering how this differs from #4465, which has been lying dormant (and as far as I can see is just waiting for someone to hit the "merge" button) for over a year now?

The main difference is this PR uses the data lake client API instead of the blob client API. File operations work the same way but bulk and prefix operations will differ.
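
To make the distinction concrete, here is a rough sketch of the two Azure SDK surfaces; the endpoints and names are placeholders, and this is illustrative rather than code from either PR:

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class ClientComparison {
  public static void main(String[] args) {
    var credential = new DefaultAzureCredentialBuilder().build();

    // Data lake client (this PR): hierarchical namespace, directory-aware listing.
    DataLakeFileSystemClient fileSystem =
        new DataLakeServiceClientBuilder()
            .endpoint("https://mystorageaccount.dfs.core.windows.net")
            .credential(credential)
            .buildClient()
            .getFileSystemClient("mycontainer");
    fileSystem.listPaths().forEach(path -> System.out.println(path.getName()));

    // Blob client (#4465): flat namespace, prefix-based listing.
    BlobContainerClient container =
        new BlobServiceClientBuilder()
            .endpoint("https://mystorageaccount.blob.core.windows.net")
            .credential(credential)
            .buildClient()
            .getBlobContainerClient("mycontainer");
    container.listBlobs().forEach(blob -> System.out.println(blob.getName()));
  }
}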

dependencies {
  implementation platform(libs.azuresdk.bom)
  implementation "com.azure:azure-storage-file-datalake"
  implementation "com.azure:azure-identity"
}
Contributor commented:

For other integrations, we don't bundle the dependencies and only ship the Iceberg side. That keeps our bundle small and doesn't force any particular version on downstream consumers. It also avoids needing to do a lot of license and notice documentation work. Is that possible here? Is there a dependency bundle that we can use at runtime?

@bryanck (Contributor, Author) replied Aug 14, 2023:

The azure-bundle project was set up to bundle all of the necessary Azure dependencies in one shadow jar as an (optional) convenience to users, similar to the aws-bundle and gcp-bundle projects. These bundles include only the necessary runtime libraries, at the same versions used for the Iceberg build, and shade conflicting libraries. A user can opt to include their own Azure dependencies instead and not use the bundle at all. For example, all you need to run with Spark is the Spark runtime plus the Azure/AWS/GCP bundle. Neither Microsoft nor Google provides such a bundle; Amazon has one for AWS, but it is very large, which causes issues with some systems.

This is separate from the azure project build, which declares the Azure dependencies as compileOnly so they are not included with any runtime.
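
A sketch of what that distinction looks like in the two build files, inferred from the description above rather than copied from the actual builds:

// iceberg-azure: dependencies are compileOnly, so no Azure jars ship with any runtime
dependencies {
  compileOnly platform(libs.azuresdk.bom)
  compileOnly "com.azure:azure-storage-file-datalake"
  compileOnly "com.azure:azure-identity"
}

// iceberg-azure-bundle: the same dependencies as implementation, shadowed into one jar
dependencies {
  implementation platform(libs.azuresdk.bom)
  implementation "com.azure:azure-storage-file-datalake"
  implementation "com.azure:azure-identity"
}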

Contributor replied:

Okay, I see. I didn't know about the other bundle projects. Looks like the LICENSE file is updated for those, but not the NOTICE. Did you check whether each bundled project has a NOTICE that we need to include?

@bryanck (Contributor, Author) replied:

I did not; I'll do that now for all three.

@bryanck (Contributor, Author) replied:

I added this. I'll open a separate PR for the AWS and GCP bundles.

@bryanck (Contributor, Author) replied:

The PR for the AWS and GCP bundles is here: #8323

Contributor replied:

Thanks! It's amazing that we can automate this now. It was such a giant pain to do this in the past!

github-actions bot added the hive label Aug 20, 2023
Preconditions.checkState(!closed, "Cannot seek: already closed");
Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos);

// this allows a seek beyond the end of the stream but the next read will fail
Contributor commented:

Why allow seek beyond the end of the stream?

@bryanck (Contributor, Author) replied:

This was done to keep the behavior consistent with S3InputStream.
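
For readers unfamiliar with that convention, here is a minimal sketch of the lazy-seek pattern both streams follow (not the PR's exact code):

import com.google.common.base.Preconditions;

abstract class LazySeekStream {
  private boolean closed = false;
  private long next = 0;

  // seek() only records the target position, so seeking past EOF succeeds here;
  // the failure surfaces on the next read, when the ranged request against the
  // object store returns an error or no data.
  public void seek(long newPos) {
    Preconditions.checkState(!closed, "Cannot seek: already closed");
    Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos);
    this.next = newPos; // intentionally no check against the file length
  }
}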

* Support</a>
*/
class ADLSLocation {
  private static final Pattern URI_PATTERN = Pattern.compile("^abfss?://(.+?)([/?#].*)?$");
@rdblue (Contributor) commented Aug 24, 2023:

Wouldn't it be safer to use [^/?#]+ for the first group instead of using non-greedy matching?

@bryanck (Contributor, Author) replied:

Yes, thanks, I made this change.


String uriPath = matcher.group(2);
uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath;
this.path = uriPath.split("\\?", -1)[0].split("#", -1)[0];
Contributor commented:

If uriPath is null, then path is going to be an empty string? I generally try to avoid empty string as a default.

@bryanck (Contributor, Author) replied:

This was done primarily for the Azure API, which expects an empty string instead of null for the root.
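
Tying the two review threads together, here is a standalone sketch of the parsing behavior with the revised pattern; the URIs are placeholders and the helper method is illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocationParseExample {
  // Revised pattern: [^/?#]+ for the first group instead of non-greedy matching.
  private static final Pattern URI_PATTERN = Pattern.compile("^abfss?://([^/?#]+)([/?#].*)?$");

  public static void main(String[] args) {
    print("abfss://container@account.dfs.core.windows.net/dir/file.parquet");
    print("abfss://container@account.dfs.core.windows.net"); // root -> empty path
    print("abfss://container@account.dfs.core.windows.net/dir/file.parquet?sv=token");
  }

  private static void print(String location) {
    Matcher matcher = URI_PATTERN.matcher(location);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Invalid ADLS URI: " + location);
    }
    String uriPath = matcher.group(2);
    // strip the leading slash; the Azure API expects "" (not null) for the root
    uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath;
    String path = uriPath.split("\\?", -1)[0].split("#", -1)[0];
    System.out.println(matcher.group(1) + " -> '" + path + "'");
  }
}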

@Fokko merged commit 99410a1 into apache:master on Aug 25, 2023
41 checks passed
@Fokko (Contributor) commented Aug 25, 2023

Thanks @bryanck for working on this, and @danielcweeks, @rdblue & @nastra for the review 🙌🏻
