-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] Fix deadlock in ExtensibleLoadManagerImplTest.initializeState
#20130
[fix][broker] Fix deadlock in ExtensibleLoadManagerImplTest.initializeState
#20130
Conversation
Could you share the deadlock thread? |
when metadata store thread is execute other async code path which will be executed on the metadata store an deadlock is occured.
1c83bf2
to
1a17f37
Compare
more information applied, maybe we can only make the call pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java Lines 1020 to 1029 in eedf702
|
ExtensibleLoadManagerImplTest.initializeState
ExtensibleLoadManagerImplTest.initializeState
@nodece @heesung-sn @Demogorgon314 can you take a look ? thank you :- ) |
} else { | ||
// Do not refresh if we're not connected | ||
return CompletableFuture.completedFuture(oldValue); | ||
return CompletableFuture.supplyAsync(() -> oldValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to use the executor
in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @Technoboy- after read the Caffeine asyncReload
method javadoc. i change back to the origin CompletableFuture.completedFuture
so there wont be any thread exchange. and the callback will execute on the origin cache.get thread which wont make deadlock like this cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to use the executor in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you want to switch a thread to call the callback, I also do a similar thing: #13809, and I'm not sure whether this is correct, but I know we can avoid timeout by not using 'CompleteFuture#join' and 'CompletetableFuture#get'.
Let me know your thoughts.
@lifepuzzlefun thanks for the good investigation work. It would be useful to fix this deadlock. @merlimat can you provide advice how to address this? thanks |
copy that. sorry guys, i'm a little busy right now. i'll handle the comments when i am available. |
@nodece hi, i think current metadataStore assume the caller should take the callback thread carefully and know if the logic will need to change thread. the most of origin logic will try to change thread explicitly to other thread. I agree with merlimat's comment #13809 (review) . maybe we can hide all the callback thread change logic inside the metadataStore which will prevent the deadlock happen again. |
I agreed that, and I think we can make a discussion on the ML. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the root cause is that the interface BrokerFilter
introduced by PIP-192, which declares a non-asynchronous API filter
[1], and implementation BrokerIsolationPoliciesFilter
uses a blocking call IsolationPolicyResources.getIsolationDataPolicies
.
@Demogorgon314 Can we make the API BrokerFilter.filter
asynchronous?
@poorbarcode Can you explain why making the API |
The pr had no activity for 30 days, mark with Stale label. |
fix #20157
deadlock stack:
Motivation
this can be seen in local environment sometimes.
why deadlock happen ?
initializeState will call
admin.namespaces().unload("public/default");
so in the broker side one call stack is
|_ org.apache.pulsar.broker.namespace.NamespaceService.isNamespaceBundleOwned
|_ org.apache.pulsar.metadata.api.MetadataStore.exists
|_ AbstractMetadataStore#existsCache -> asyncLoad
|_ AbstractMetadataStore.existsFromStore
|_ ZKMetadataStore.existsFromStore line:351 // we are on the
configuration-metadata-store
internal executor thread now.so we triggered the CompletableFuture here and the following thing is always execute on this thread. ( the stack above is the following thing )
then the code will call on metadata store
IsolationPolicyResources.getIsolationDataPolicies
so only one thread is here to trigger BatchMetadataStore to flush metadata call. but it won't be executed.
an deadlock occurs.
Modifications
force cache loading callback complete on forkjoinPool to avoid other task execute on metadataStore thread by CompletableFuture.
Verifying this change
the unit test should not fail on
ExtensibleLoadManagerImplTest.initializeState
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: