Two `TimerSubscription`s of the same timer wheel share the same `timerId`
#8776
Comments
@romansmirnov, @npepinpe and I all looked at this issue. There were two factors that seemed promising to look into as possible causes:
After reading the code carefully, it looks like both factors can't explain that behavior because they use […] So as long as there is no bug in agrona's […]
I also wrote some test scenarios trying to reproduce the problem. That didn't lead anywhere, but we expect this to happen very rarely anyway. We can still improve reliability somewhat by asserting that we don't overwrite existing values when scheduling a new timer here:
Here we throw an exception when a timerId was generated that already exists in the timerJobMap. Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer. This should never happen, but we have one heapdump where it looks like it did: #8776
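The guard described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Zeebe change: `TimerRegistry`, `register`, and `lookup` are hypothetical names, and a plain `java.util.HashMap` stands in for whatever map type `timerJobMap` really is.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the timer bookkeeping; not the real ActorTimerQueue.
final class TimerRegistry<T> {
  // Plays the role of timerJobMap: timerId -> scheduled subscription.
  private final Map<Long, T> timerJobMap = new HashMap<>();

  // Registers a subscription and fails loudly if the id is already in use,
  // instead of silently replacing the earlier entry.
  void register(final long timerId, final T subscription) {
    Objects.requireNonNull(subscription, "subscription");
    final T existing = timerJobMap.putIfAbsent(timerId, subscription);
    if (existing != null) {
      throw new IllegalStateException(
          "Expected timerId " + timerId + " to be unused, but it already maps to " + existing);
    }
  }

  // Returns the subscription registered under the id, or null if there is none.
  T lookup(final long timerId) {
    return timerJobMap.get(timerId);
  }
}
```

With this shape, a second `register` call for the same id throws and the first subscription stays in the map, so a duplicate id surfaces immediately rather than as a timer that never fires.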
In the worst case, timers could go missing, which would lead to, say, no snapshots being taken. In most cases there are conditions under which the system would recover, but quite a bit of time could pass in between. Let's dig a bit more into this, as this is a foundational piece of code for Zeebe, and we need to be able to trust that our timers work correctly.
I looked into this. Unfortunately there are no new insights.
I have no more ideas. I don't think it makes sense to investigate this further. If anyone else would like to give it a try, I would be happy to hand over. @npepinpe FYI Ref:- Test
Let's talk about it Monday then. Ole is medic, so maybe we can briefly touch on it during the handover.
8785: fix: throw instead of silently overwriting timers r=oleschoenburg a=oleschoenburg
## Description
Here we throw an exception when a timerId was generated that already exists in the timerJobMap. Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer. This should never happen, but we have one heapdump where it looks like it did: #8776
## Related issues
relates to #8776
Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Here we throw an exception when a timerId was generated that already exists in the timerJobMap. Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer. This should never happen, but we have one heapdump where it looks like it did: #8776 (cherry picked from commit 842988d)
We've added a failsafe which should provide some information next time this happens. For now, I'll put it back in the backlog, as potentially ditching the actor scheduler will also resolve this issue.
ZDP-Planning: |
Describe the bug
In this heapdump, we can see two `TimerSubscription`s scheduled on the threads that have the same `timerId`.
Because they have the same `timerId`s, `ActorTimerQueue`'s `timerJobMap` contains only one of the two `TimerSubscription`s.
When the timer on the `DeadlineTimerWheel` expires, only one of the `TimerSubscription`s is marked as `isDone` and thus available to be worked on. Effectively, the `TimerSubscription` that was inserted first is dead and will never get activated.
Expected behavior
As far as we know, `DeadlineTimerWheel` hands out unique timer ids and an overflow of available ids seems unlikely (if not impossible). This means that two `TimerSubscription`s scheduled on the same actor thread should never have the same id.
related to SUPPORT-12940
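The failure mode described under "Describe the bug" can be reproduced in miniature. The sketch below is a simplified model, not Zeebe code: `DuplicateTimerIdDemo`, `Subscription`, `schedule`, and `onTimerExpiry` are hypothetical names, and a `java.util.HashMap` stands in for `timerJobMap`. It only shows why an unconditional `put` under a duplicate id leaves the first subscription unreachable.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the bookkeeping; not the real ActorTimerQueue.
final class DuplicateTimerIdDemo {

  // Hypothetical subscription that only tracks whether its timer has fired.
  static final class Subscription {
    final String name;
    boolean done;

    Subscription(final String name) {
      this.name = name;
    }
  }

  // Stand-in for timerJobMap: timerId -> subscription.
  private final Map<Long, Subscription> timerJobMap = new HashMap<>();

  // Unconditional put: a second subscription with the same id replaces the first.
  void schedule(final long timerId, final Subscription subscription) {
    timerJobMap.put(timerId, subscription);
  }

  // Models timer expiry: only the subscription still in the map is marked done.
  void onTimerExpiry(final long timerId) {
    final Subscription subscription = timerJobMap.remove(timerId);
    if (subscription != null) {
      subscription.done = true;
    }
  }

  public static void main(final String[] args) {
    final DuplicateTimerIdDemo queue = new DuplicateTimerIdDemo();
    final Subscription first = new Subscription("first");
    final Subscription second = new Subscription("second");

    queue.schedule(42L, first);
    queue.schedule(42L, second); // same id: silently overwrites "first"
    queue.onTimerExpiry(42L);

    System.out.println(first.name + " done: " + first.done);   // prints "first done: false"
    System.out.println(second.name + " done: " + second.done); // prints "second done: true"
  }
}
```

In this model the first subscription is never marked done, which matches the "dead" subscription seen in the heapdump. How the duplicate id came about in the first place is not explained by the sketch.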