Skip to content

RATIS-1622. Fix high CPU load when some followers are down#680

Merged
szetszwo merged 7 commits intoapache:masterfrom
drriguz:master
Jul 20, 2022
Merged

RATIS-1622. Fix high CPU load when some followers are down#680
szetszwo merged 7 commits intoapache:masterfrom
drriguz:master

Conversation

@drriguz
Copy link
Contributor

@drriguz drriguz commented Jul 13, 2022

What changes were proposed in this pull request?

Avoid using 0 delay for slow or dead followers when new entries comes.

What is the link to the Apache JIRA

RATIS-1622: High cpu usage of LogAppenderDaemon

Please replace this section with the link to the Apache JIRA)

How was this patch tested?

Tested manually by using the following steps:

  • Start a cluster of 5 nodes
  • shutdown node 5 and 5
  • make some change to get new log entry
  • monitor the CPU usage of the leader node.

The CPU load is significantly reduced from 90% to 10%(in Docker environment, and cpus is limited to 1 for each node).

However, some unit tests fails locally. I'll open this to see if any tests will fail in CI.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drriguz , the change looks good. There is an unused import (see the checkstyle warning) and a typo in the comment.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

Copy link
Contributor

@codings-dan codings-dan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drriguz There are many failed ci tests, see https://github.com/apache/ratis/runs/7333835180?check_suite_focus=true#step:5:1030 Can you take a look, thank you!

@drriguz
Copy link
Contributor Author

drriguz commented Jul 14, 2022

@drriguz There are many failed ci tests, see https://github.com/apache/ratis/runs/7333835180?check_suite_focus=true#step:5:1030 Can you take a look, thank you!

Yeah, I'll try to figure it out!

@drriguz
Copy link
Contributor Author

drriguz commented Jul 15, 2022

@szetszwo Hi, there's failed test:

Error:  testRestartFollower(org.apache.ratis.grpc.TestInstallSnapshotNotificationWithGrpc)  Time elapsed: 62.907 s  <<< ERROR!
java.lang.IllegalArgumentException: s2@group-1E4B33D1632B-SegmentedRaftLog is expected to be opened but it is CLOSED

previously the failed test is:

TestNettyDataStreamStarTopologyWithGrpcCluster>DataStreamAsyncClusterTests.testMultipleStreamsMultipleServersStepDownLeader:78->DataStreamAsyncClusterTests.runTestDataStreamStepDownLeader:82->DataStreamAsyncClusterTests.runTestDataStream:97 » TestTimedOut

But I tried run tests locally several times and none of them failed, either the tests are not stable, or this commit somehow makes it unstable. Do you think it's related to this commit?

@szetszwo szetszwo merged commit e52710a into apache:master Jul 20, 2022
szetszwo pushed a commit that referenced this pull request Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants