New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-10152 Storage client caching in Azure ADLS processors #6158
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nandorsoma Thank you for adding this valuable enhancement!
I read through the code in this round and added some comments about (potential) issues and also some renaming / code clean-up.
...ssors/src/main/java/org/apache/nifi/processors/azure/storage/utils/StorageClientFactory.java
Outdated
Show resolved
Hide resolved
* @param credentialsDetails used for caching because it can contain properties that are results of an expression | ||
* @return DataLakeServiceClient | ||
*/ | ||
public DataLakeServiceClient getStorageClient(ProxyOptions proxyOptions, ADLSCredentialsDetails credentialsDetails) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method can return the same DataLakeServiceClient
object for multiple processors / processor threads.
Are we sure if DataLakeServiceClient
is thread safe and can be used concurrently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked the related code in azure-storage-file-datalake-12.7.4.jar
and I didn't find anything that would make me think that the library was thread unsafe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jfrazee Do you have any information whether com.azure.storage.file.datalake.DataLakeServiceClient
is thread-safe or not?
Currently, DataLakeServiceClient
is instantiated/built in every onTrigger()
which leads to some performance loss. It is more relevant when a proxy is used (added in NIFI-9951 recently), because the new client establishes a new connection to the proxy every time.
This PR is about caching the DataLakeServiceClient
between onTrigger()
calls but it is unclear if that client object can be used in parallel or not.
For the .NET ADLS library the is a clear statement that it is thread-safe:
https://docs.microsoft.com/en-us/dotnet/api/overview/azure/storage.files.datalake-readme
However, we cannot find a similar docs regarding the Java lib.
Could you please check it? Thanks in advance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to see whether this extends to the proxy scenario, but in the basic case, for recent versions of the SDK, the *ServiceClient can be treated as threadsafe:
https://github.com/Azure/azure-storage-java/blob/master/V12%20Upgrade%20Story.md
v12 [...] Some of the highlights include: [...] Thread-safety and immutability. It is still safe to share clients across threads, and there should still be no confusion about when the client is in sync with the service as there is no state maintained on the client. State may be returned as the result of a method call, but there is no pretense of storing state on a client that could be immediately invalid.
All of our v12 libraries have improved performance and ensure thread safety.
Azure Storage SDK v10 for Java adopts the next-generation Storage SDK design providing thread-safe types [...] Some of the improvements in the new SDK are: [...] Thread-safe interfaces[.]
https://azure.github.io/azure-sdk/java_introduction.html#service-client
DO ensure that all service client classes are immutable and stateless upon instantiation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wasn't even sure that the basic scenario is thread safe. Thank you for the clarification @jfrazee! I think we can assume that the proxy scenario is thread safe as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jfrazee Do you have any update on the proxy scenario?
In my opinion, if a library/class declares itself as thread-safe (as the Azure v12 storage client in the referenced docs above), it must be thread-safe in all setup/usage. So I would go forward and merge this PR. Would you have any objections?
...ssors/src/main/java/org/apache/nifi/processors/azure/storage/utils/StorageClientFactory.java
Outdated
Show resolved
Hide resolved
...ervices-api/src/main/java/org/apache/nifi/services/azure/storage/ADLSCredentialsDetails.java
Outdated
Show resolved
Hide resolved
...ssors/src/main/java/org/apache/nifi/processors/azure/storage/utils/StorageClientFactory.java
Outdated
Show resolved
Hide resolved
b6afb3b
to
aee5908
Compare
public Cache<ADLSCredentialsDetails, DataLakeServiceClient> getCache() { | ||
return clientCache; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nandorsoma I think it is a bit overkill to expose internal fields publicly for integration tests only. Furthermore, the IT does not really test we should check: it asserts the cache size at the end (1 client in the cache) but it is more important to check how many client instance creation have happened (only 1).
It could be tested with a unit test and in that case the public getter would not be needed.
…o eliminate the need of the public getter created for the IT test only. Move proxyOptions parameter to constructor because it is not used for caching and it won't change while the processor is scheduled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nandorsoma Thanks for refactoring the tests! It looks much better.
Could you please add one more test case with the following setup:
final ADLSCredentialsDetails credentialsOne = createCredentialDetails("accountOne");
final ADLSCredentialsDetails credentialsTwo = createCredentialDetails("accountTwo");
final ADLSCredentialsDetails credentialsThree = createCredentialDetails("accountOne");
It would test two things:
- the factory can return the cached "accountOne" client after another client has been returned (not only the last one is cached)
- more importantly: the factory returns the same client for a new ADLSCredentialsDetails object having the same accountName
@turcsanyip, thanks for the review! Good idea! Tbh this case was in my mind but in the end I forgot about it... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nandorsoma Thanks for adding the test case!
+1 LGTM
Merging to main.
@jfrazee Thanks for your help with the client thread-safety question!
This closes apache#6158. Signed-off-by: Peter Turcsanyi <turcsanyi@apache.org>
Summary
NIFI-10152
In ADLS processors the client is created in the onTrigger method. Since expression language is supported in the properties which are used in the client it is not possible to extract the creation to eg.
@OnSchedule
. Now it is not just a performance problem, but the underlying reactor-netty library logs a misleading warning on every trigger which spoils the log. Because of that we need use caching to retrieve the client.reactor-netty issue: reactor/reactor-netty#2300azure-sdk-for-java-issue: Azure/azure-sdk-for-java#30148
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000
NIFI-00000
Pull Request Formatting
main
branchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
mvn clean install -P contrib-check
Licensing
LICENSE
andNOTICE
filesDocumentation