[bug] Infinite Loop Observed for ResourceLockImpl after Zookeeper Session Expired #17759
Comments
It looks like we have two locks on the same path in the same broker.
Thanks for your fix, @eolivelli.
After #17762 was merged, I think this problem is solved, but some risk remains for the future.
Observations
I haven't been able to reproduce this issue, but I see this behavior often in one of my test environments. It seems pretty clear that there is a potential for an infinite loop in the `ResourceLockImpl` class. I specifically see it taking place for the bookie's lock on its znode, `/ledgers/available/bookie-0:3181`. Here are the observed events. The bookie's logs are below.
The log shows `Metadata store session has been re-established. Revalidating all the existing locks.` That log line is associated with revalidating locks, and given the behavior we observe, revalidation seems to find the lock existing in ZooKeeper (otherwise, we wouldn't get this loop). Here is the code that we don't see executed (we don't see that line):
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/coordination/impl/ResourceLockImpl.java
Lines 244 to 250 in 12ca27f
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/coordination/impl/ResourceLockImpl.java
Line 267 in 12ca27f
The `serde.deserialize` implementation for `BookieServiceInfo` deserializes all objects to `null`. (This definitely seems like a bug, and I think fixing it would fix the rack awareness case but not the general case.)
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/BookieServiceInfoSerde.java
Lines 65 to 68 in 12ca27f
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/coordination/impl/ResourceLockImpl.java
Lines 300 to 305 in 12ca27f
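To make the suspected interaction concrete, here is a minimal, self-contained sketch. It is not the Pulsar code: `RevalidateLoopSketch`, `revalidateOnce`, and `passesUntilStable` are hypothetical names, and the model assumes that revalidation compares the deserialized stored value with the holder's own value and, on a mismatch, deletes and re-creates the node, which in turn triggers another revalidation. Under that assumption, a deserializer that always returns `null` can never produce a match.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical, simplified model of the loop; not the Pulsar implementation. */
final class RevalidateLoopSketch {

    /** Outcome of one revalidation pass in this toy model. */
    enum Outcome { KEEP_LOCK, DELETE_AND_RECREATE }

    /** Compares the value stored in the znode with the value the lock holder expects. */
    static Outcome revalidateOnce(byte[] stored, String expected,
                                  Function<byte[], String> deserialize) {
        String existing = deserialize.apply(stored);
        return Objects.equals(existing, expected)
                ? Outcome.KEEP_LOCK
                : Outcome.DELETE_AND_RECREATE;
    }

    /** Counts how many passes run before the lock is kept, capped at maxPasses. */
    static int passesUntilStable(String expected,
                                 Function<byte[], String> deserialize,
                                 int maxPasses) {
        byte[] stored = expected.getBytes(StandardCharsets.UTF_8);
        int passes = 0;
        while (passes < maxPasses) {
            passes++;
            if (revalidateOnce(stored, expected, deserialize) == Outcome.KEEP_LOCK) {
                return passes;
            }
            // Assumption: delete + re-create writes the same bytes again, and the
            // resulting delete notification is what starts the next pass.
            stored = expected.getBytes(StandardCharsets.UTF_8);
        }
        return passes; // cap reached; without a cap this would never terminate
    }

    public static void main(String[] args) {
        Function<byte[], String> working = b -> new String(b, StandardCharsets.UTF_8);
        Function<byte[], String> alwaysNull = b -> null;
        System.out.println(passesUntilStable("bookie-0:3181", working, 5));    // prints 1
        System.out.println(passesUntilStable("bookie-0:3181", alwaysNull, 5)); // prints 5
    }
}
```

With a working deserializer the model settles on the first pass. With the `null`-returning one it only stops because of the artificial cap.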
Potential solutions
1. Use `store.put` to replace the znode (we already confirmed that we own it). This might have bad consequences for other use cases, since this is a generic lock class. However, given this code handling "bad version" updates, it seems like this might be the right path:
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/coordination/impl/ResourceLockImpl.java
Lines 191 to 194 in 12ca27f
2. Track state in `ResourceLockImpl` that indicates the lock is being revalidated in the edge case that requires deletion, so that the node-deleted notification is ignored.
I didn't have time to test these implementations, but after writing this issue, I think one or both of these could be the right direction.
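The second idea could be sketched roughly as follows. This is an untested illustration, not a patch against `ResourceLockImpl`: the class `RevalidatingLockSketch`, the `recreating` flag, and all of its method names are hypothetical, and the toy delivers the delete notification synchronously, whereas a real metadata store delivers it asynchronously.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a "revalidation in progress" flag; names are illustrative. */
final class RevalidatingLockSketch {
    // True while this lock is deleting and re-creating its own node.
    private final AtomicBoolean recreating = new AtomicBoolean(false);
    private final AtomicInteger revalidations = new AtomicInteger();

    /** Called when the store reports that the lock's node was deleted. */
    void onNodeDeleted() {
        if (recreating.get()) {
            return; // we caused this deletion ourselves; do not revalidate again
        }
        revalidate();
    }

    /** Models the edge case: delete our stale node, then create it again. */
    void revalidate() {
        revalidations.incrementAndGet();
        recreating.set(true);
        try {
            deleteNode();
            createNode();
        } finally {
            recreating.set(false);
        }
    }

    // In this toy, the delete notification fires synchronously inside the delete call.
    private void deleteNode() {
        onNodeDeleted();
    }

    private void createNode() {
        // No-op in the sketch; a real implementation would write the node here.
    }

    int revalidationCount() {
        return revalidations.get();
    }
}
```

Without the flag check, `revalidate()` would re-enter itself through `onNodeDeleted()` indefinitely. Because real notifications are asynchronous, an actual fix would likely need to keep the flag set until the self-inflicted notification has been observed, rather than clearing it in a `finally` block.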
Open questions
Why is `Lock on resource /ledgers/available/bookie-0:3181 was invalidated` logged twice at a time? Later in the log (not shared here), I saw it 3 times at a time. This led me to test with ZK to see how persistent watches work, and it seems to me that we shouldn't be getting extra notifications. It could have to do with the way callbacks are handled in `ResourceLockImpl#revalidate`.