Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-10152 Storage client caching in Azure ADLS processors #6158

Closed
wants to merge 6 commits into from

Conversation

nandorsoma
Copy link
Contributor

@nandorsoma nandorsoma commented Jun 27, 2022

Summary

NIFI-10152

In ADLS processors the client is created in the onTrigger method. Since expression language is supported in the properties which are used in the client it is not possible to extract the creation to eg. @OnSchedule. Now it is not just a performance problem, but the underlying reactor-netty library logs a misleading warning on every trigger which spoils the log. Because of that we need use caching to retrieve the client.
reactor-netty issue: reactor/reactor-netty#2300
azure-sdk-for-java-issue: Azure/azure-sdk-for-java#30148

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 8
    • JDK 11
    • JDK 17

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@turcsanyip turcsanyip self-requested a review July 5, 2022 09:56
Copy link
Contributor

@turcsanyip turcsanyip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandorsoma Thank you for adding this valuable enhancement!
I read through the code in this round and added some comments about (potential) issues and also some renaming / code clean-up.

* @param credentialsDetails used for caching because it can contain properties that are results of an expression
* @return DataLakeServiceClient
*/
public DataLakeServiceClient getStorageClient(ProxyOptions proxyOptions, ADLSCredentialsDetails credentialsDetails) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method can return the same DataLakeServiceClient object for multiple processors / processor threads.
Are we sure if DataLakeServiceClient is thread safe and can be used concurrently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the related code in azure-storage-file-datalake-12.7.4.jar and I didn't find anything that would make me think that the library was thread unsafe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrazee Do you have any information whether com.azure.storage.file.datalake.DataLakeServiceClient is thread-safe or not?

Currently, DataLakeServiceClient is instantiated/built in every onTrigger() which leads to some performance loss. It is more relevant when a proxy is used (added in NIFI-9951 recently), because the new client establishes a new connection to the proxy every time.
This PR is about caching the DataLakeServiceClient between onTrigger() calls but it is unclear if that client object can be used in parallel or not.
For the .NET ADLS library the is a clear statement that it is thread-safe:
https://docs.microsoft.com/en-us/dotnet/api/overview/azure/storage.files.datalake-readme
However, we cannot find a similar docs regarding the Java lib.

Could you please check it? Thanks in advance!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to see whether this extends to the proxy scenario, but in the basic case, for recent versions of the SDK, the *ServiceClient can be treated as threadsafe:

https://github.com/Azure/azure-storage-java/blob/master/V12%20Upgrade%20Story.md

v12 [...] Some of the highlights include: [...] Thread-safety and immutability. It is still safe to share clients across threads, and there should still be no confusion about when the client is in sync with the service as there is no state maintained on the client. State may be returned as the result of a method call, but there is no pretense of storing state on a client that could be immediately invalid.

https://techcommunity.microsoft.com/t5/azure-storage-blog/announcing-the-azure-storage-v12-client-libraries/ba-p/1482394

All of our v12 libraries have improved performance and ensure thread safety.

https://azure.microsoft.com/en-us/blog/build-richer-applications-with-the-new-asynchronous-azure-storage-sdk-for-java/

Azure Storage SDK v10 for Java adopts the next-generation Storage SDK design providing thread-safe types [...] Some of the improvements in the new SDK are: [...] Thread-safe interfaces[.]

https://azure.github.io/azure-sdk/java_introduction.html#service-client

DO ensure that all service client classes are immutable and stateless upon instantiation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wasn't even sure that the basic scenario is thread safe. Thank you for the clarification @jfrazee! I think we can assume that the proxy scenario is thread safe as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfrazee Do you have any update on the proxy scenario?
In my opinion, if a library/class declares itself as thread-safe (as the Azure v12 storage client in the referenced docs above), it must be thread-safe in all setup/usage. So I would go forward and merge this PR. Would you have any objections?

Comment on lines 57 to 59
public Cache<ADLSCredentialsDetails, DataLakeServiceClient> getCache() {
return clientCache;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandorsoma I think it is a bit overkill to expose internal fields publicly for integration tests only. Furthermore, the IT does not really test we should check: it asserts the cache size at the end (1 client in the cache) but it is more important to check how many client instance creation have happened (only 1).
It could be tested with a unit test and in that case the public getter would not be needed.

…o eliminate the need of the public getter created for the IT test only. Move proxyOptions parameter to constructor because it is not used for caching and it won't change while the processor is scheduled.
Copy link
Contributor

@turcsanyip turcsanyip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandorsoma Thanks for refactoring the tests! It looks much better.

Could you please add one more test case with the following setup:

        final ADLSCredentialsDetails credentialsOne = createCredentialDetails("accountOne");
        final ADLSCredentialsDetails credentialsTwo = createCredentialDetails("accountTwo");
        final ADLSCredentialsDetails credentialsThree = createCredentialDetails("accountOne");

It would test two things:

  • the factory can return the cached "accountOne" client after another client has been returned (not only the last one is cached)
  • more importantly: the factory returns the same client for a new ADLSCredentialsDetails object having the same accountName

@nandorsoma
Copy link
Contributor Author

@turcsanyip, thanks for the review! Good idea! Tbh this case was in my mind but in the end I forgot about it...

Copy link
Contributor

@turcsanyip turcsanyip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandorsoma Thanks for adding the test case!

+1 LGTM
Merging to main.

@jfrazee Thanks for your help with the client thread-safety question!

@asfgit asfgit closed this in 320aed0 Sep 26, 2022
p-kimberley pushed a commit to p-kimberley/nifi that referenced this pull request Oct 15, 2022
This closes apache#6158.

Signed-off-by: Peter Turcsanyi <turcsanyi@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants