Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two TimerSubscriptions of the same timer wheel share the same timerId #8776

Closed
lenaschoenburg opened this issue Feb 11, 2022 · 6 comments
Closed
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) component/scheduler kind/bug Categorizes an issue or PR as a bug support Marks an issue as related to a customer support request version:1.3.5 Marks an issue as being completely or in parts released in 1.3.5

Comments

@lenaschoenburg
Copy link
Member

lenaschoenburg commented Feb 11, 2022

Describe the bug

In this heapdump, we can see two TimerSubscriptions scheduled on the threads that have the same timerId:

SELECT t.job, t.timerId, t.thread FROM io.camunda.zeebe.util.sched.TimerSubscription t
WHERE ((t.timerId = 68719476736L) AND (t.thread.toString() = "Broker-2-zb-actors-1"))
t.job t.timerId t.thread
io.camunda.zeebe.util.sched.ActorJob [id=0x680b0b658] 68719476736 io.camunda.zeebe.util.sched.ActorThread [id=0x6814aae80]
io.camunda.zeebe.util.sched.ActorJob [id=0x68ef1b7f0] 68719476736 io.camunda.zeebe.util.sched.ActorThread [id=0x6814aae80]

Because they have the same timerIds, ActorTimerQueue's timerJobMap contains only one of the two TimerSubscriptions.
When the timer on the DeadlineTimerWheel expires, only one of the TimerSubscriptions is marked as isDone and thus available to be worked on. Effectively, the TimerSubscription that was inserted first is dead and will never get activated.

Expected behavior

As far as we know, DeadlineTimerWheel hands out unique timer ids and a an overflow of available ids seems unlikely (if not impossible). This means that two TimerSubscriptions scheduled on the same actor thread should never have the same id.


related to SUPPORT-12940

@lenaschoenburg
Copy link
Member Author

@romansmirnov, @npepinpe and I all looked at this issue.

There were two factors that seemed promising to look into as possible causes:

  1. One timer was recurring, the other one not. Maybe recurring timers have a bug where the timer id is not generated properly.
  2. The jobs that are scheduled when a timer expires can be stolen by other threads.

After reading the code carefully, it looks like both factors can't explain that behavior because they use TimerSubscription#submit which is safe: Not calling this from an actor thread would cause a NullPointerException: https://github.com/camunda-cloud/zeebe/blob/1edfe4f762ef700885859bf87e1370963d3448a1/util/src/main/java/io/camunda/zeebe/util/sched/TimerSubscription.java#L79-L82
and calling this from an actor thread is safe because we then use the thread's timer wheel to generate a new timer id:
https://github.com/camunda-cloud/zeebe/blob/1edfe4f762ef700885859bf87e1370963d3448a1/util/src/main/java/io/camunda/zeebe/util/sched/ActorTimerQueue.java#L49-L54

So as long as there is no bug in agrona's DeadlineTimerWheel, it seems impossible that we have conflicting timer ids.

I also wrote some test scenarios trying to reproduce the problem. That didn't lead anywhere but we are expecting this to happen very rarely anyway.

We can still improve reliability somewhat by asserting that we don't overwrite existing values when scheduling a new timer here:
https://github.com/camunda-cloud/zeebe/blob/1edfe4f762ef700885859bf87e1370963d3448a1/util/src/main/java/io/camunda/zeebe/util/sched/ActorTimerQueue.java#L56

lenaschoenburg added a commit that referenced this issue Feb 14, 2022
Here we throw an exception when a timerId was generated that already
exists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting
a recurring timer.
This should never happen, but we have one heapdump where it looks like
it did: #8776
lenaschoenburg added a commit that referenced this issue Feb 14, 2022
Here we throw an exception when a timerId was generated that already
exists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting
a recurring timer.
This should never happen, but we have one heapdump where it looks like
it did: #8776
@npepinpe
Copy link
Member

In the worst case, timers could go missing, which would lead to, say, no snapshots being taken. In most cases there are conditions where the system would recover, but quite a bit of time could happen in between. Let's dig a bit more into this as this is a foundational piece of code for Zeebe, and we need to be able to trust our timers work correctly.

@npepinpe npepinpe added this to Ready in Zeebe Feb 15, 2022
@romansmirnov romansmirnov added the support Marks an issue as related to a customer support request label Feb 17, 2022
@deepthidevaki
Copy link
Contributor

I looked into this. Unfortunately there are no new insights.

  1. I wrote a property test for ActorTimerQueue to verify there are no duplicate timerIds. Duplicate timerIds never occurred in multiple runs of the test. timerId is generated by agrona DeadlineTimerWheel.
  2. Other hypothesis is a race-condition in scheduling timers. But as @oleschoenburg already mentioned, from the code this seems impossible. I couldn't find a way to reproduce race-conditions in-case they exists.

I have no more ideas. I don't think it make sense to investigate this further. If any else would like to give it a try, I would be happy to handover. @npepinpe FYI

Ref:- Test

public class ActorTimerQueueTest {

  private static final int OPERATION_SIZE = 20000;
  ActorTimerQueue actorTimerQueue;
  ControlledActorClock clock;
  Set<Long> scheduledTimers;

  @BeforeTry
  void before() {
    scheduledTimers = new HashSet<>();
    clock = new ControlledActorClock();
    clock.setCurrentTime(1);
    actorTimerQueue = new ActorTimerQueue(clock);
  }

  @Property(tries = 100)
  void actorTimerQueueTest(
      @ForAll("seeds") final long seed,
      @ForAll("operations") final List<Consumer<Long>> operations) {

    final Random random = new Random(seed);

    operations.forEach(operation -> operation.accept(random.nextLong(100, 60000)));

    for (int i = 0; i < OPERATION_SIZE; i++) {
      clock.addTime(Duration.ofMillis(60000));
      actorTimerQueue.processExpiredTimers(clock);
      if (scheduledTimers.isEmpty()) {
        break;
      }
    }

    assertThat(scheduledTimers).isEmpty();
  }

  @Provide
  Arbitrary<List<Consumer<Long>>> operations() {
    final List<Consumer<Long>> operations = new ArrayList<>();
    operations.add(this::scheduleTimer);
    operations.add(l -> clock.addTime(Duration.ofMillis(l / 10)));
    operations.add(l -> actorTimerQueue.processExpiredTimers(clock));

    final var randomOperations = Arbitraries.of(operations);
    return randomOperations.list().ofSize(OPERATION_SIZE);
  }

  private void scheduleTimer(final Long l) {
    final MinimalTimerSubscription timer = new MinimalTimerSubscription(l);
    actorTimerQueue.schedule(timer, clock);
    final var timerId = timer.getTimerId();
    assertThat(scheduledTimers.add(timerId)).isTrue();
  }

  @Provide
  Arbitrary<Long> seeds() {
    return Arbitraries.longs();
  }

  class MinimalTimerSubscription extends TimerSubscription {
    public MinimalTimerSubscription(final long deadline) {
      super(new ActorJob(), deadline, TimeUnit.MILLISECONDS, false);
    }

    @Override
    public void onTimerExpired(final TimeUnit timeUnit, final long now) {
      // do nothing
      scheduledTimers.remove(getTimerId());
    }
  }
}

@npepinpe
Copy link
Member

npepinpe commented Feb 18, 2022

Let's talk about it Monday then. Ole is medic, so maybe we can briefly touch on it during the hand over.

ghost pushed a commit that referenced this issue Feb 22, 2022
8785: fix: throw instead of silently overwriting timers r=oleschoenburg a=oleschoenburg

## Description

Here we throw an exception when a timerId was generated that alreadyexists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer.

This should never happen, but we have one heapdump where it looks like it did: #8776

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #8776



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 22, 2022
8785: fix: throw instead of silently overwriting timers r=oleschoenburg a=oleschoenburg

## Description

Here we throw an exception when a timerId was generated that alreadyexists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer.

This should never happen, but we have one heapdump where it looks like it did: #8776

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #8776



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 22, 2022
8785: fix: throw instead of silently overwriting timers r=oleschoenburg a=oleschoenburg

## Description

Here we throw an exception when a timerId was generated that alreadyexists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer.

This should never happen, but we have one heapdump where it looks like it did: #8776

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #8776



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 22, 2022
8785: fix: throw instead of silently overwriting timers r=oleschoenburg a=oleschoenburg

## Description

Here we throw an exception when a timerId was generated that alreadyexists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting a recurring timer.

This should never happen, but we have one heapdump where it looks like it did: #8776

## Related issues

<!-- Which issues are closed by this PR or are related -->

relates to #8776



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
github-actions bot pushed a commit that referenced this issue Feb 22, 2022
Here we throw an exception when a timerId was generated that already
exists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting
a recurring timer.
This should never happen, but we have one heapdump where it looks like
it did: #8776

(cherry picked from commit 842988d)
github-actions bot pushed a commit that referenced this issue Feb 22, 2022
Here we throw an exception when a timerId was generated that already
exists in the timerJobMap.
Throwing here will fail when scheduling a new timer or when resubmitting
a recurring timer.
This should never happen, but we have one heapdump where it looks like
it did: #8776

(cherry picked from commit 842988d)
ghost pushed a commit that referenced this issue Feb 22, 2022
8827: [Backport stable/1.3] fix: throw instead of silently overwriting timers r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8785 to `stable/1.3`.

relates to #8776 #8776

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Feb 22, 2022
8826: [Backport stable/1.2] fix: throw instead of silently overwriting timers r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8785 to `stable/1.2`.

relates to #8776 #8776

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@Zelldon Zelldon added Release: 1.2.11 version:1.3.5 Marks an issue as being completely or in parts released in 1.3.5 labels Mar 1, 2022
@npepinpe
Copy link
Member

We've added a failsafe which should provide some information next time this happens. For now, I'll put it back in the backlog, as potentially ditching the actor scheduler will also resolve this issue.

@npepinpe npepinpe removed this from Ready in Zeebe Mar 15, 2022
@npepinpe npepinpe added area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) and removed Release: 1.2.11 labels Apr 11, 2022
@megglos
Copy link
Contributor

megglos commented Apr 28, 2023

ZDP-Planning:
never happened again since then, would close it for now.

@megglos megglos closed this as not planned Won't fix, can't repro, duplicate, stale Apr 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) component/scheduler kind/bug Categorizes an issue or PR as a bug support Marks an issue as related to a customer support request version:1.3.5 Marks an issue as being completely or in parts released in 1.3.5
Projects
None yet
Development

No branches or pull requests

8 participants