
Azure multi read options #15630

Merged: 92 commits into apache:master on Jan 25, 2024

Conversation

@georgew5656 (Contributor) commented on Jan 5, 2024

Description

Currently, the "azure" input source schema only supports ingesting files stored in the Azure storage account specified in druid.azure.account. To support ingesting data from other storage accounts, this PR adds a new "azureStorage" input source schema with a slightly different spec.
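
As a rough illustration of the new URI shape (this is a sketch, not code from the PR, and the account, container, and file names are hypothetical): the legacy "azure" scheme carries only container and path, while the new "azureStorage" scheme also carries the storage account.

// Sketch only: how the two URI shapes differ.
// Legacy "azure" scheme:      azure://<container>/<path>                        (account fixed by druid.azure.account)
// New "azureStorage" scheme:  azureStorage://<storageAccount>/<container>/<path>
import java.net.URI;

public class AzureStorageUriSketch
{
  public static void main(String[] args)
  {
    // Hypothetical example URI; the account, container, and file names are made up.
    URI uri = URI.create("azureStorage://examplestorageaccount/examplecontainer/path/to/file.json");

    String storageAccount = uri.getHost();                  // "examplestorageaccount"
    String containerAndPath = uri.getPath().substring(1);   // "examplecontainer/path/to/file.json"
    int firstSlash = containerAndPath.indexOf('/');
    String container = containerAndPath.substring(0, firstSlash);   // "examplecontainer"
    String blobPath = containerAndPath.substring(firstSlash + 1);   // "path/to/file.json"

    System.out.println(storageAccount + " | " + container + " | " + blobPath);
  }
}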


  1. Added AzureStorageAccountInputSource
    I would have preferred to keep using the regular AzureInputSource class, but that class assumes CloudObjectLocation.bucket to be the container of the file and CloudObjectLocation.path to be the path to the file within the bucket. I couldn't think of a way to keep the behavior backwards compatible with existing ingestion specs while also supporting multiple storage accounts.

One idea I had was to introduce a "storageAccount" field directly in AzureInputSource (instead of in CloudObjectLocation) and then use that field when creating AzureStorage classes. I think this would work, but it would also mean that the behavior of Azure ingestion differs from other types of ingestion (by having the storage account outside the ingestion URI), which could cause problems down the line.

  2. Create a new AzureStorage instance with an AzureClientFactory, with credentials passed in from AzureStorageAccountInputSource (see the sketch after this list).
    Since the AzureStorage instance is an abstraction over Azure Blob Storage clients, I thought it made sense to create an instance of it per Azure ingestion spec, since we do something similar with S3 (we create an S3 client for each S3 ingestion spec).

It would also have been possible to pass the AzureInputSourceConfig to the relevant functions in AzureStorage and generate different clients there, but I thought this would have been confusing.

  3. Auth methods
    I added support for key, SAS token, and app registration authentication when ingesting from storage accounts. Managed/workload identity auth can also work by not specifying any properties and making sure the identity the cluster is deployed with can access the external storage account.
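
To illustrate points 2 and 3 above, here is a minimal sketch (not the PR's actual AzureClientFactory code) of how a per-ingestion-spec Blob Storage client could be built from whichever credentials are supplied, using the azure-storage-blob and azure-identity SDKs; the parameter names below are hypothetical.

// Sketch only: choose an Azure credential per ingestion spec. Parameter names are hypothetical.
import com.azure.identity.ClientSecretCredentialBuilder;
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.common.StorageSharedKeyCredential;

public class AzureClientSketch
{
  public static BlobServiceClient buildClient(
      String storageAccount,
      String key,           // shared key auth
      String sasToken,      // SAS token auth
      String clientId,      // app registration auth
      String clientSecret,
      String tenantId
  )
  {
    BlobServiceClientBuilder builder = new BlobServiceClientBuilder()
        .endpoint("https://" + storageAccount + ".blob.core.windows.net");

    if (key != null) {
      builder.credential(new StorageSharedKeyCredential(storageAccount, key));
    } else if (sasToken != null) {
      builder.sasToken(sasToken);
    } else if (clientId != null) {
      builder.credential(
          new ClientSecretCredentialBuilder()
              .clientId(clientId)
              .clientSecret(clientSecret)
              .tenantId(tenantId)
              .build()
      );
    } else {
      // No properties supplied: fall back to the identity the cluster is deployed with.
      builder.credential(new DefaultAzureCredentialBuilder().build());
    }
    return builder.buildClient();
  }
}

The design choice mirrors the S3 input source: the client is scoped to the ingestion spec rather than to the cluster-wide druid.azure.account configuration.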

Release note

Support Azure ingestion from multiple Storage Accounts.


Key changed/added classes in this PR
  • AzureStorageAccountInputSource
  • AzureInputSourceConfig
  • AzureClientFactory
  • AzureStorage
  • AzureEntity

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

{
  try {
    AzureStorage azureStorage = new AzureStorage(azureIngestClientFactory, location.getBucket());
    Pair<String, String> locationInfo = getContainerAndPathFromObjectLocation(location);
Contributor:
Ditto, do we need to verify this returned pair in any way?


if (currentUri.getScheme().equals(AzureStorageAccountInputSource.SCHEME)) {
  CloudObjectLocation cloudObjectLocation = new CloudObjectLocation(currentUri);
  Pair<String, String> containerInfo = AzureStorageAccountInputSource.getContainerAndPathFromObjectLocation(cloudObjectLocation);
Contributor:
Ditto about maybe needing to validate the returned pair?

Contributor (Author):
What do you think we should validate about it?

Contributor:
What if either element in the pair comes back empty or null? Would that cause an unexpected or unhandled error elsewhere?

Contributor (Author):
I added a check in the function itself, so it no longer throws an index error on a bad input spec. Basically, what happens now is that the task will eventually fail because it can't read the bad input file, which I think is okay (the error from Azure is pretty clear).
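
For illustration, a minimal sketch of the kind of guard described above, assuming the container and blob path are split on the first '/' of the location's path; the PR's actual check may differ.

// Sketch only: split "container/path/to/blob" into container and blob path without
// throwing an index error when the path has no '/'.
public class ContainerPathSplitSketch
{
  public static String[] split(String path)
  {
    int slash = path.indexOf('/');
    if (slash < 0) {
      // Bad or incomplete location: treat the whole string as the container and leave the
      // blob path empty; the read then fails later with a clear error from Azure.
      return new String[]{path, ""};
    }
    return new String[]{path.substring(0, slash), path.substring(slash + 1)};
  }

  public static void main(String[] args)
  {
    String[] ok = split("examplecontainer/path/to/file.json");
    String[] bad = split("examplecontaineronly");
    System.out.println(ok[0] + " | " + ok[1]);
    System.out.println(bad[0] + " | " + bad[1]);
  }
}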

@317brian (Contributor) left a comment:
Some small nits in the docs pages but nothing major. Thanks!

docs/ingestion/input-sources.md (9 review comments, outdated and resolved)
georgew5656 and others added 13 commits January 24, 2024 10:00
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
@zachjsh (Contributor) left a comment:

🚀

@georgew5656 merged commit 3e51224 into apache:master on Jan 25, 2024
82 of 83 checks passed
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024