HBASE-28037 Replication stuck after switching to new WAL but the queue is empty #5375

Merged
merged 1 commit into from
Sep 28, 2023

sunhelly
Contributor

No description provided.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 43s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2.4 Compile Tests _
+1 💚 mvninstall 2m 39s branch-2.4 passed
+1 💚 compile 2m 22s branch-2.4 passed
+1 💚 checkstyle 0m 35s branch-2.4 passed
+1 💚 spotless 0m 40s branch has no errors when running spotless:check.
+1 💚 spotbugs 1m 23s branch-2.4 passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 19s the patch passed
+1 💚 compile 2m 20s the patch passed
+1 💚 javac 2m 20s the patch passed
+1 💚 checkstyle 0m 33s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 17m 7s Patch does not cause any errors with Hadoop 2.10.2 or 3.1.4 3.2.4 3.3.5.
+1 💚 spotless 0m 39s patch has no errors when running spotless:check.
+1 💚 spotbugs 1m 29s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 9s The patch does not generate ASF License warnings.
34m 50s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5375
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 91fefedbb899 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2.4 / 61250ad
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@sunhelly sunhelly requested a review from Apache9 August 29, 2023 09:46
@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 34s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2.4 Compile Tests _
+1 💚 mvninstall 2m 8s branch-2.4 passed
+1 💚 compile 0m 36s branch-2.4 passed
+1 💚 shadedjars 3m 47s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 24s branch-2.4 passed
_ Patch Compile Tests _
+1 💚 mvninstall 1m 58s the patch passed
+1 💚 compile 0m 35s the patch passed
+1 💚 javac 0m 35s the patch passed
+1 💚 shadedjars 3m 43s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 22s the patch passed
_ Other Tests _
+1 💚 unit 176m 50s hbase-server in the patch passed.
195m 22s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #5375
Optional Tests javac javadoc unit shadedjars compile
uname Linux 65c71698505e 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2.4 / 61250ad
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/testReport/
Max. process+thread count 4338 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 35s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2.4 Compile Tests _
+1 💚 mvninstall 2m 26s branch-2.4 passed
+1 💚 compile 0m 42s branch-2.4 passed
+1 💚 shadedjars 4m 10s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 26s branch-2.4 passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 15s the patch passed
+1 💚 compile 0m 43s the patch passed
+1 💚 javac 0m 43s the patch passed
+1 💚 shadedjars 4m 11s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 24s the patch passed
_ Other Tests _
+1 💚 unit 176m 14s hbase-server in the patch passed.
196m 30s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5375
Optional Tests javac javadoc unit shadedjars compile
uname Linux 899e8800ebc5 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2.4 / 61250ad
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/testReport/
Max. process+thread count 4618 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Sep 13, 2023

IIRC it is impossible for a normal replication source to have an empty queue, since there will always be a WAL file being written. Can you reproduce this problem with a UT? Or at least explain the sequence of events that reproduces it?

Thanks.

@sunhelly
Contributor Author

sunhelly commented Sep 27, 2023

Thanks, @Apache9. Replication gets stuck on some of our production clusters and recovers only after the stuck regionserver is restarted. I dug into the issue and found that the jstack output showed no active source stream readers even though the replication queue was not empty.
I think the replication log is enqueued by WALActionListener#postLogRoll, which means a WAL is enqueued only after its creation has completed. In some circumstances, e.g. hardware failures on datanodes, WAL creation can take several seconds and involve retries. If the source reader is fast enough, or the WALs waiting to be replicated are short enough, the replication queue can then be empty for an extremely short time (especially with WAL groups, where each group has its own replication queue).
Also, once the stream reader has stopped, a new reader is started only when the queue does not exist. The reader's stop condition is therefore inconsistent with the group queue being REMOVED (NOT EXIST); it matches only the queue being EMPTY. If we allow the reader to stop here, the logic for restarting the reader would have to be reworked. Stopping the reader only when it is a recovered queue is a safe and simple fix.
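
The stop condition discussed above can be illustrated with a small, self-contained sketch. This is not the HBase code touched by this PR: the class `ReaderStopSketch` and both method names are hypothetical, and the "old rule" is only an interpretation of the behaviour described in this comment. It shows why tying the stop decision to "recovered and empty" leaves a normal source's reader running through a briefly empty queue.

```java
import java.util.concurrent.PriorityBlockingQueue;

/** Hypothetical model of the reader's stop decision; not the actual HBase classes. */
final class ReaderStopSketch {

  /** Problematic rule as described above: an empty queue alone stops the reader. */
  static boolean stopsOnEmptyQueue(PriorityBlockingQueue<String> walQueue) {
    return walQueue.isEmpty();
  }

  /** Proposed rule: only a recovered queue, which gets no new WALs, may stop when empty. */
  static boolean stopsOnlyIfRecovered(PriorityBlockingQueue<String> walQueue,
      boolean recovered) {
    return recovered && walQueue.isEmpty();
  }

  public static void main(String[] args) {
    // A queue that is momentarily empty, e.g. during a slow WAL roll.
    PriorityBlockingQueue<String> walQueue = new PriorityBlockingQueue<>();
    System.out.println("old rule stops normal source: " + stopsOnEmptyQueue(walQueue));
    System.out.println("new rule stops normal source: "
        + stopsOnlyIfRecovered(walQueue, false));
    System.out.println("new rule stops recovered source: "
        + stopsOnlyIfRecovered(walQueue, true));
  }
}
```

With the second rule, a normal source keeps its reader alive and simply waits for the next WAL to be enqueued, so no restart logic is needed.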

@Apache9
Contributor

Apache9 commented Sep 27, 2023


If the replication queue can be empty for a very short time window, then there could be other serious problems, as we do not expect a non-recovered replication queue to ever be empty...
Then we should try to add synchronization or change the order of operations so that this cannot happen...
I will take a look at the related code later.

Thanks for reporting.

@Apache9
Contributor

Apache9 commented Sep 27, 2023

I checked the code. On branch-2.x, preLogRoll only records the WAL file on zk, so that the WAL is not lost after a restart, but it does not enqueue it. The enqueuing is done in postLogRoll. So it is possible for the replication queue to be empty for a very short time window.

On master and branch-3, preLogRoll is not even implemented any more; the log is enqueued only in postLogRoll.

So this is a problem.

I think we can apply this PR for branch-2.5 and branch-2.4.

I will open an issue to handle this problem on the other branches, as the code has been refactored a lot...
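
The window described in this comment can be sketched with a simplified model. It assumes a plain set standing in for the zk record and an in-memory queue standing in for the replication queue; the class `LogRollWindowSketch` and its method bodies are hypothetical and only mirror the ordering discussed here, not the real listener implementation.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical model of the roll ordering on branch-2.x; not the actual HBase classes. */
final class LogRollWindowSketch {
  /** Stands in for the WAL names recorded on zk. */
  final Set<String> recordedWals = new LinkedHashSet<>();
  /** Stands in for the in-memory replication queue that the reader consumes. */
  final Queue<String> inMemoryQueue = new ArrayDeque<>();

  /** Before the new WAL is created: record it durably, but do not enqueue it. */
  void preLogRoll(String newWal) {
    recordedWals.add(newWal);
  }

  /** After the new WAL is created: make it visible to the reader. */
  void postLogRoll(String newWal) {
    inMemoryQueue.add(newWal);
  }

  /** The reader reaches the end of the oldest WAL and drops it from the queue. */
  void readerFinishesOldestWal() {
    inMemoryQueue.poll();
  }

  public static void main(String[] args) {
    LogRollWindowSketch s = new LogRollWindowSketch();
    s.postLogRoll("wal.1");      // wal.1 is the WAL currently being replicated
    s.preLogRoll("wal.2");       // a roll starts; creation of wal.2 is slow
    s.readerFinishesOldestWal(); // a fast reader drains wal.1 in the meantime
    System.out.println("queue empty during roll: " + s.inMemoryQueue.isEmpty());
    s.postLogRoll("wal.2");      // the roll completes and wal.2 is enqueued
    System.out.println("queue size after roll: " + s.inMemoryQueue.size());
  }
}
```

Between `preLogRoll` and `postLogRoll` the new WAL is already recorded but not yet visible to the reader, which is the short window in which a normal source can observe an empty queue.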

@Apache9
Contributor

Apache9 commented Sep 27, 2023

Oh, could you please change the comments? There is no sync replication on branch-2.x.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 39s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2.4 Compile Tests _
+1 💚 mvninstall 3m 46s branch-2.4 passed
+1 💚 compile 3m 57s branch-2.4 passed
+1 💚 checkstyle 1m 1s branch-2.4 passed
+1 💚 spotless 0m 59s branch has no errors when running spotless:check.
+1 💚 spotbugs 2m 1s branch-2.4 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 45s the patch passed
+1 💚 compile 3m 40s the patch passed
+1 💚 javac 3m 40s the patch passed
+1 💚 checkstyle 0m 52s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 24m 14s Patch does not cause any errors with Hadoop 2.10.2 or 3.1.4 3.2.4 3.3.6.
+1 💚 spotless 1m 4s patch has no errors when running spotless:check.
+1 💚 spotbugs 2m 32s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 13s The patch does not generate ASF License warnings.
50m 26s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5375
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 08bce4ef6f70 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2.4 / ae7dc1d
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 34s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2.4 Compile Tests _
+1 💚 mvninstall 2m 29s branch-2.4 passed
+1 💚 compile 0m 44s branch-2.4 passed
+1 💚 shadedjars 4m 20s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 25s branch-2.4 passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 17s the patch passed
+1 💚 compile 0m 43s the patch passed
+1 💚 javac 0m 43s the patch passed
+1 💚 shadedjars 4m 15s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 24s the patch passed
_ Other Tests _
+1 💚 unit 176m 23s hbase-server in the patch passed.
196m 45s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5375
Optional Tests javac javadoc unit shadedjars compile
uname Linux e6f01658bae8 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2.4 / ae7dc1d
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/testReport/
Max. process+thread count 4557 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 34s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2.4 Compile Tests _
+1 💚 mvninstall 2m 12s branch-2.4 passed
+1 💚 compile 0m 36s branch-2.4 passed
+1 💚 shadedjars 4m 0s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 24s branch-2.4 passed
_ Patch Compile Tests _
+1 💚 mvninstall 1m 59s the patch passed
+1 💚 compile 0m 36s the patch passed
+1 💚 javac 0m 36s the patch passed
+1 💚 shadedjars 3m 55s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 22s the patch passed
_ Other Tests _
+1 💚 unit 179m 18s hbase-server in the patch passed.
198m 3s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #5375
Optional Tests javac javadoc unit shadedjars compile
uname Linux 258d5e9444e2 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2.4 / ae7dc1d
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/testReport/
Max. process+thread count 4219 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5375/2/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@sunhelly sunhelly merged commit 4c3bffe into apache:branch-2.4 Sep 28, 2023
1 check passed
asfgit pushed a commit that referenced this pull request Sep 28, 2023
…e is empty (#5375)

Signed-off-by: Duo Zhang <zhangduo@apache.org>
vinayakphegde pushed a commit to vinayakphegde/hbase that referenced this pull request Apr 4, 2024
…e is empty (apache#5375)

Signed-off-by: Duo Zhang <zhangduo@apache.org>
(cherry picked from commit 4c3bffe)
Change-Id: I4d8b6168ec533c7a5821f4be9d625b1e4f92b21e