HDDS-4580. Datanode can be stuck in leader not ready state after restart #1690

lokeshj1703 · 2020-12-11T14:57:11Z

What changes were proposed in this pull request?

On restart, transactions are reapplied for an existing Ratis pipeline. While processing the future, ContainerStateMachine#applyTransaction can throw a NullPointerException, which completes the future exceptionally.

      future.thenApply(r -> {
        if (trx.getServerRole() == RaftPeerRole.LEADER) {
          long startTime = (long) trx.getStateMachineContext();
          metrics.incPipelineLatency(cmdType,
              Time.monotonicNowNanos() - startTime);
        }

In the snippet above, trx.getStateMachineContext() is null during restart, so the cast to long throws a NullPointerException. This fails the future itself without updating applyTransactionCompletionMap. As a result, lastAppliedIndex is not updated for the server, and the server is stuck in the leader not ready state.
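The failure mode can be reproduced outside Ozone with a plain CompletableFuture. The following is a minimal sketch, not the actual patch: the class NullContextSketch, the map `completed`, the counter `latencyNanos`, and the two apply methods are hypothetical stand-ins for the state machine's bookkeeping. It shows that casting a null Object to long inside thenApply fails the future before the bookkeeping line runs, and that a null check is one possible guard.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class NullContextSketch {
  // Hypothetical stand-in for the completion bookkeeping (log index -> term).
  static final Map<Long, Long> completed = new ConcurrentHashMap<>();
  // Hypothetical stand-in for the pipeline latency metric.
  static final AtomicLong latencyNanos = new AtomicLong();

  // Unguarded pattern: the cast unboxes the context, so a null context
  // throws NullPointerException and the bookkeeping line is never reached.
  static CompletableFuture<String> applyUnguarded(Object context, long index) {
    return CompletableFuture.completedFuture("reply").thenApply(r -> {
      long startTime = (long) context;
      latencyNanos.addAndGet(System.nanoTime() - startTime);
      completed.put(index, 1L);
      return r;
    });
  }

  // Guarded pattern (assumed fix): skip the metric when no context is
  // present, so the bookkeeping still happens on replay after restart.
  static CompletableFuture<String> applyGuarded(Object context, long index) {
    return CompletableFuture.completedFuture("reply").thenApply(r -> {
      if (context != null) {
        latencyNanos.addAndGet(System.nanoTime() - (long) context);
      }
      completed.put(index, 1L);
      return r;
    });
  }

  public static void main(String[] args) {
    CompletableFuture<String> bad = applyUnguarded(null, 1L);
    System.out.println("unguarded failed: " + bad.isCompletedExceptionally());
    System.out.println("index 1 recorded: " + completed.containsKey(1L));

    CompletableFuture<String> ok = applyGuarded(null, 2L);
    System.out.println("guarded failed: " + ok.isCompletedExceptionally());
    System.out.println("index 2 recorded: " + completed.containsKey(2L));
  }
}
```

In this sketch the unguarded call leaves index 1 unrecorded, which mirrors how a missing applyTransactionCompletionMap entry would keep lastAppliedIndex from advancing.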

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4580

How was this patch tested?

Existing unit tests.

xiaoyuyao

LGTM, +1.

Commit 8eaf610: HDDS-4580. Datanode can be stuck in leader not ready state after restart.

lokeshj1703 self-assigned this Dec 11, 2020

Commit 589663c: trigger new CI

xiaoyuyao approved these changes Dec 14, 2020

xiaoyuyao merged commit 81b75fd into apache:master Dec 14, 2020

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jan 18, 2021

Commit c183df3: HDDS-4580. Datanode can be stuck in leader not ready state after restart (apache#1690) (cherry picked from commit 81b75fd)
