HBASE-24438 Don't update TaskMonitor when deserializing ServerCrashProcedure by timoha · Pull Request #1826 · apache/hbase

timoha · 2020-06-01T21:48:07Z

The ServerCrashProcedure could have been completed by previously active
HBase Master which results in a stale ServerCrashProcedure task in TaskMonitor.
The TaskMonitor should only reflect the procedure in case the procedure has actually
been started/resumed which is done when ServerCrashProcedure.executeFromState is called.

…ocedure The ServerCrashProcedure could have been completed by previously active HBase Master which results in a stale ServerCrashProcedure task in TaskMonitor. The TaskMonitor should only reflect the procedure in case the procedure has actually been started/resumed which is done when ServerCrashProcedure.executeFromState is called.

clarax · 2020-06-01T22:27:54Z

Patch LGTM. But need to check with author @jingyuntian for https://issues.apache.org/jira/browse/HBASE-21647 to see if this will break other cases.

Apache-HBase · 2020-06-01T22:33:24Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 13s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	hbaseanti	0m 0s	Patch does not have any anti-patterns.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
		_ master Compile Tests _
+1 💚	mvninstall	4m 12s	master passed
+1 💚	checkstyle	1m 17s	master passed
+1 💚	spotbugs	2m 17s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	3m 47s	the patch passed
+1 💚	checkstyle	1m 12s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	hadoopcheck	12m 19s	Patch does not cause any errors with Hadoop 3.1.2 3.2.1.
+1 💚	spotbugs	2m 58s	the patch passed
		_ Other Tests _
+1 💚	asflicense	0m 15s	The patch does not generate ASF License warnings.
		38m 4s

Subsystem	Report/Notes
Docker	Client=19.03.11 Server=19.03.11 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#1826
Optional Tests	dupname asflicense spotbugs hadoopcheck hbaseanti checkstyle
uname	Linux df9af8824035 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `716702a`
Max. process+thread count	84 (vs. ulimit of 12500)
modules	C: hbase-server U: hbase-server
Console output	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/console
versions	git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) spotbugs=3.1.12
Powered by	Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2020-06-02T00:28:23Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 38s	Docker mode activated.
-0 ⚠️	yetus	0m 3s	Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
		_ Prechecks _
		_ master Compile Tests _
+1 💚	mvninstall	4m 15s	master passed
+1 💚	compile	1m 3s	master passed
+1 💚	shadedjars	5m 48s	branch has no errors when building our shaded downstream artifacts.
-0 ⚠️	javadoc	0m 41s	hbase-server in master failed.
		_ Patch Compile Tests _
+1 💚	mvninstall	4m 1s	the patch passed
+1 💚	compile	1m 4s	the patch passed
+1 💚	javac	1m 4s	the patch passed
+1 💚	shadedjars	5m 42s	patch has no errors when building our shaded downstream artifacts.
-0 ⚠️	javadoc	0m 40s	hbase-server in the patch failed.
		_ Other Tests _
+1 💚	unit	126m 2s	hbase-server in the patch passed.
		153m 4s

Subsystem	Report/Notes
Docker	Client=19.03.11 Server=19.03.11 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR	#1826
Optional Tests	javac javadoc unit shadedjars compile
uname	Linux 8f066f9262d1 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `716702a`
Default Java	2020-01-14
javadoc	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk11-hadoop3-check/output/branch-javadoc-hbase-server.txt
javadoc	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk11-hadoop3-check/output/patch-javadoc-hbase-server.txt
Test Results	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/testReport/
Max. process+thread count	4411 (vs. ulimit of 12500)
modules	C: hbase-server U: hbase-server
Console output	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/console
versions	git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by	Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2020-06-02T01:57:56Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 13s	Docker mode activated.
-0 ⚠️	yetus	0m 3s	Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
		_ Prechecks _
		_ master Compile Tests _
+1 💚	mvninstall	4m 13s	master passed
+1 💚	compile	1m 1s	master passed
+1 💚	shadedjars	6m 9s	branch has no errors when building our shaded downstream artifacts.
+1 💚	javadoc	0m 39s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	3m 49s	the patch passed
+1 💚	compile	0m 59s	the patch passed
+1 💚	javac	0m 59s	the patch passed
+1 💚	shadedjars	6m 0s	patch has no errors when building our shaded downstream artifacts.
+1 💚	javadoc	0m 36s	the patch passed
		_ Other Tests _
+1 💚	unit	216m 6s	hbase-server in the patch passed.
		242m 39s

Subsystem	Report/Notes
Docker	Client=19.03.11 Server=19.03.11 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR	#1826
Optional Tests	javac javadoc unit shadedjars compile
uname	Linux f673620e475f 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `716702a`
Default Java	1.8.0_232
Test Results	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/testReport/
Max. process+thread count	3675 (vs. ulimit of 12500)
modules	C: hbase-server U: hbase-server
Console output	https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/console
versions	git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by	Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

Apache9 · 2020-06-02T01:59:25Z

I think the problem here is in the implementation of updateProgress? We pass updateState as false when calling it in the deserialize method, but the only place to update the currentRunningState is in the updateProgress method, so when calling from deserialize the currentRunningState will always be null.

And the updateProgress method is totally a mess. The order of the execution it really strange...
I think it should be like this

    String msg = "Processing ServerCrashProcedure of " + serverName;
    if (status == null) {
      status = TaskMonitor.get().createStatus(msg);
    }
    if (updateState) {
      currentRunningState = getCurrentState();
    }
    if (currentRunningState == ServerCrashState.SERVER_CRASH_FINISH) {
      status.markComplete(msg + " done");
      return;
    }
    int childrenLatch = getChildrenLatch();
    status.setStatus(msg + " current State " + currentRunningState
        + (childrenLatch > 0 ? "; remaining num of running child procedures = " + childrenLatch
            : ""));

And we should call updateProgress(true) in deserializeStateData. Could you please try if this can fix your problem?

Thanks.

timoha · 2020-06-02T19:01:13Z

Thanks for suggestion @Apache9, but I don't think that's going to work for the case of state being SERVER_CRASH_FINISH. With such code change, executeFromState will end up calling updateProgress(true) at the beginning of the function marking the task as done. Whereas it should be actually marking the task as done when switch case SERVER_CRASH_FINISH is called.

Also, could you please explain a rationale about adding a task during procedure deserialization? If procedure has finished during run of previous, what is the value in displaying it in tasks as "completed" when a new master comes after replaying the logs? In other words, should a simple fact of deserialization have such a side effect on tasks or should actual procedure execution drive the tasks updates instead?

saintstack · 2020-06-03T17:35:47Z

bq. .... should a simple fact of deserialization have such a side effect on tasks...

No. That is wonky. For this reason alone we should apply this patch.

Will wait a while in case comeback.

Apache9 · 2020-06-04T00:22:34Z

Thanks for suggestion @Apache9, but I don't think that's going to work for the case of state being SERVER_CRASH_FINISH. With such code change, executeFromState will end up calling updateProgress(true) at the beginning of the function marking the task as done. Whereas it should be actually marking the task as done when switch case SERVER_CRASH_FINISH is called.

Also, could you please explain a rationale about adding a task during procedure deserialization? If procedure has finished during run of previous, what is the value in displaying it in tasks as "completed" when a new master comes after replaying the logs? In other words, should a simple fact of deserialization have such a side effect on tasks or should actual procedure execution drive the tasks updates instead?

I think unless you call update progress, otherwise the procedure will not show up in the TaskMonitor? The updateProgress call in deserialization is something like an initialization. If there are so many SCPs, maybe it is possible that one of the SCPs can not be scheduled for a long time, then if you do not all updateProgress in deserialization, you can not see its current status for a long tim, until it gets scheduled.

Apache9 · 2020-06-04T00:33:27Z

And I got the reason why there is a currentRunningState field. We will call setNextState after scheduling sub procedures, so when updating progress, if we just use the state machine state by calling getCurrentState, we will set the message in the TaskMonitor to the next state, which is a bit confusing to users.

So I think we could add another flag to updateProgress, to indicate whether the procedure is complete, and when calling after excuting the SERVER_CRASH_FINISH, we set this flag to true to let the method complete the procedure in the TaskMonitor.

And also, in deserialization method, we could check the 'ProcedureState'(not the state of the state machine), if it is SUCCESS or FAILED, then we just skip calling updateProgress, so it will not have stale SCPs.

Thanks.

timoha · 2020-06-04T00:44:46Z

That's a good point @Apache9, I guess it would add sort of "continuity" to show that the task is still ongoing during master failover in case the procedure hasn't actually finished. However, I think there's an unwanted side effect here in case the procedure has actually completed by previous master (was marked done) and SCP not being actually scheduled for a while.

As an operator, if I were to notice that there's SCP, I would freak out and try to see why my regionserver failed and then try to look at the logs (and in this case nothing is logged about it new master), and then try to look at procedure list (also without finding anything there). So, I guess in that case it would be more expected to show SCP in task when it's is actually being executed rather than "pending execution"?

Maybe an alternative would be to improve the message in tasks and explicitly say "noticed an SCP, waiting for processing". Don't know if such verbosity would be useful though?

Apache9 · 2020-06-04T03:03:02Z

Maybe an alternative would be to improve the message in tasks and explicitly say "noticed an SCP, waiting for processing". Don't know if such verbosity would be useful though?

Good. Waiting for your new patch.

timoha · 2020-06-04T17:52:37Z

I'll let others also comment before submitting a patch. I personally prefer for de-serialize to not have side effects all the way to what shows in the UI and also like removing code rather than adding more :)

busbey · 2020-06-25T05:08:25Z

@timoha looks like consensus-by-silence is a new message that says something along the lines of there's an SCP waiting to be processed.

saintstack · 2020-06-25T06:13:03Z

Updating TaskMonitor inside the method that deserializes procedure state data is a total surprise.

I thought TaskMonitor best effort rather than a true view especially as so many systems are delinquent regards keeping up TaskMonitor state. On the current extraordinary effort to keep it in-line, I'd say, nice, but it is causing bigger issues so bypass for this rare condition?

On the flag juggling around resume, lets do that in another issue. Let this bug fix through?

I for one do not look at TaskMonitor figuring state of Procedures. Do others? Thanks.

timoha · 2020-07-06T17:56:55Z

Looks like operator intervention is needed for this issue :)

I for one do not look at TaskMonitor figuring state of Procedures. Do others? Thanks.

Just from my perspective, I find it useful to see the procedure progress as looking plainly at procedure list isn't as helpful. In ideal world, I wouldn't need this information at all, as it would just do its job (and would only show something when it's broken). However, since this task exists, it should not have false positives.

To make it clear, I'm against "improving" this side-effect as I wouldn't find it helpful to me as operator (I just don't care that something is de-serializing), that was just a suggestion that I now regret bringing up. I'm ok with closing this PR if you decide to go that way.

tsuna · 2020-07-07T04:31:51Z

Not sure whether my opinion matters but these kinds of side effects during deserialization are usually code smells. I would also be surprised to find that out.

saintstack · 2020-09-09T21:53:19Z

@timoha closed it because no attention?

timoha · 2020-09-10T00:44:35Z

yup, the approach wasn't suitable enough to address the issue and I wasn't planning to improve it further, so I'll leave it to be properly addressed by hbase team 👍

clarax approved these changes Jun 1, 2020

View reviewed changes

timoha closed this Sep 9, 2020

Conversation

timoha commented Jun 1, 2020

Uh oh!

clarax commented Jun 1, 2020

Uh oh!

Apache-HBase commented Jun 1, 2020

Uh oh!

Apache-HBase commented Jun 2, 2020

Uh oh!

Apache-HBase commented Jun 2, 2020

Uh oh!

Apache9 commented Jun 2, 2020

Uh oh!

timoha commented Jun 2, 2020

Uh oh!

saintstack commented Jun 3, 2020

Uh oh!

Apache9 commented Jun 4, 2020

Uh oh!

Apache9 commented Jun 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timoha commented Jun 4, 2020

Uh oh!

Apache9 commented Jun 4, 2020

Uh oh!

timoha commented Jun 4, 2020

Uh oh!

busbey commented Jun 25, 2020

Uh oh!

saintstack commented Jun 25, 2020

Uh oh!

timoha commented Jul 6, 2020

Uh oh!

tsuna commented Jul 7, 2020

Uh oh!

saintstack commented Sep 9, 2020

Uh oh!

timoha commented Sep 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Apache9 commented Jun 4, 2020 •

edited

Loading