Skip to content

HBASE-24438 Don't update TaskMonitor when deserializing ServerCrashProcedure#1826

Closed
timoha wants to merge 1 commit intoapache:masterfrom
timoha:HBASE-24438
Closed

HBASE-24438 Don't update TaskMonitor when deserializing ServerCrashProcedure#1826
timoha wants to merge 1 commit intoapache:masterfrom
timoha:HBASE-24438

Conversation

@timoha
Copy link
Contributor

@timoha timoha commented Jun 1, 2020

The ServerCrashProcedure could have been completed by previously active
HBase Master which results in a stale ServerCrashProcedure task in TaskMonitor.
The TaskMonitor should only reflect the procedure in case the procedure has actually
been started/resumed which is done when ServerCrashProcedure.executeFromState is called.

…ocedure

The ServerCrashProcedure could have been completed by previously active
HBase Master which results in a stale ServerCrashProcedure task in TaskMonitor.
The TaskMonitor should only reflect the procedure in case the procedure has actually
been started/resumed which is done when ServerCrashProcedure.executeFromState is called.
@clarax
Copy link
Contributor

clarax commented Jun 1, 2020

Patch LGTM. But need to check with author @jingyuntian for https://issues.apache.org/jira/browse/HBASE-21647 to see if this will break other cases.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 13s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 💚 mvninstall 4m 12s master passed
+1 💚 checkstyle 1m 17s master passed
+1 💚 spotbugs 2m 17s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 47s the patch passed
+1 💚 checkstyle 1m 12s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 12m 19s Patch does not cause any errors with Hadoop 3.1.2 3.2.1.
+1 💚 spotbugs 2m 58s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
38m 4s
Subsystem Report/Notes
Docker Client=19.03.11 Server=19.03.11 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #1826
Optional Tests dupname asflicense spotbugs hadoopcheck hbaseanti checkstyle
uname Linux df9af8824035 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 716702a
Max. process+thread count 84 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) spotbugs=3.1.12
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 38s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 4m 15s master passed
+1 💚 compile 1m 3s master passed
+1 💚 shadedjars 5m 48s branch has no errors when building our shaded downstream artifacts.
-0 ⚠️ javadoc 0m 41s hbase-server in master failed.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 1s the patch passed
+1 💚 compile 1m 4s the patch passed
+1 💚 javac 1m 4s the patch passed
+1 💚 shadedjars 5m 42s patch has no errors when building our shaded downstream artifacts.
-0 ⚠️ javadoc 0m 40s hbase-server in the patch failed.
_ Other Tests _
+1 💚 unit 126m 2s hbase-server in the patch passed.
153m 4s
Subsystem Report/Notes
Docker Client=19.03.11 Server=19.03.11 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #1826
Optional Tests javac javadoc unit shadedjars compile
uname Linux 8f066f9262d1 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 716702a
Default Java 2020-01-14
javadoc https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk11-hadoop3-check/output/branch-javadoc-hbase-server.txt
javadoc https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk11-hadoop3-check/output/patch-javadoc-hbase-server.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/testReport/
Max. process+thread count 4411 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 13s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 4m 13s master passed
+1 💚 compile 1m 1s master passed
+1 💚 shadedjars 6m 9s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 39s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 49s the patch passed
+1 💚 compile 0m 59s the patch passed
+1 💚 javac 0m 59s the patch passed
+1 💚 shadedjars 6m 0s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 36s the patch passed
_ Other Tests _
+1 💚 unit 216m 6s hbase-server in the patch passed.
242m 39s
Subsystem Report/Notes
Docker Client=19.03.11 Server=19.03.11 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #1826
Optional Tests javac javadoc unit shadedjars compile
uname Linux f673620e475f 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 716702a
Default Java 1.8.0_232
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/testReport/
Max. process+thread count 3675 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-1826/1/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor

Apache9 commented Jun 2, 2020

I think the problem here is in the implementation of updateProgress? We pass updateState as false when calling it in the deserialize method, but the only place to update the currentRunningState is in the updateProgress method, so when calling from deserialize the currentRunningState will always be null.

And the updateProgress method is totally a mess. The order of the execution it really strange...
I think it should be like this

    String msg = "Processing ServerCrashProcedure of " + serverName;
    if (status == null) {
      status = TaskMonitor.get().createStatus(msg);
    }
    if (updateState) {
      currentRunningState = getCurrentState();
    }
    if (currentRunningState == ServerCrashState.SERVER_CRASH_FINISH) {
      status.markComplete(msg + " done");
      return;
    }
    int childrenLatch = getChildrenLatch();
    status.setStatus(msg + " current State " + currentRunningState
        + (childrenLatch > 0 ? "; remaining num of running child procedures = " + childrenLatch
            : ""));

And we should call updateProgress(true) in deserializeStateData. Could you please try if this can fix your problem?

Thanks.

@timoha
Copy link
Contributor Author

timoha commented Jun 2, 2020

Thanks for suggestion @Apache9, but I don't think that's going to work for the case of state being SERVER_CRASH_FINISH. With such code change, executeFromState will end up calling updateProgress(true) at the beginning of the function marking the task as done. Whereas it should be actually marking the task as done when switch case SERVER_CRASH_FINISH is called.

Also, could you please explain a rationale about adding a task during procedure deserialization? If procedure has finished during run of previous, what is the value in displaying it in tasks as "completed" when a new master comes after replaying the logs? In other words, should a simple fact of deserialization have such a side effect on tasks or should actual procedure execution drive the tasks updates instead?

@saintstack
Copy link
Contributor

bq. .... should a simple fact of deserialization have such a side effect on tasks...

No. That is wonky. For this reason alone we should apply this patch.

Will wait a while in case comeback.

@Apache9
Copy link
Contributor

Apache9 commented Jun 4, 2020

Thanks for suggestion @Apache9, but I don't think that's going to work for the case of state being SERVER_CRASH_FINISH. With such code change, executeFromState will end up calling updateProgress(true) at the beginning of the function marking the task as done. Whereas it should be actually marking the task as done when switch case SERVER_CRASH_FINISH is called.

Also, could you please explain a rationale about adding a task during procedure deserialization? If procedure has finished during run of previous, what is the value in displaying it in tasks as "completed" when a new master comes after replaying the logs? In other words, should a simple fact of deserialization have such a side effect on tasks or should actual procedure execution drive the tasks updates instead?

I think unless you call update progress, otherwise the procedure will not show up in the TaskMonitor? The updateProgress call in deserialization is something like an initialization. If there are so many SCPs, maybe it is possible that one of the SCPs can not be scheduled for a long time, then if you do not all updateProgress in deserialization, you can not see its current status for a long tim, until it gets scheduled.

@Apache9
Copy link
Contributor

Apache9 commented Jun 4, 2020

And I got the reason why there is a currentRunningState field. We will call setNextState after scheduling sub procedures, so when updating progress, if we just use the state machine state by calling getCurrentState, we will set the message in the TaskMonitor to the next state, which is a bit confusing to users.

So I think we could add another flag to updateProgress, to indicate whether the procedure is complete, and when calling after excuting the SERVER_CRASH_FINISH, we set this flag to true to let the method complete the procedure in the TaskMonitor.

And also, in deserialization method, we could check the 'ProcedureState'(not the state of the state machine), if it is SUCCESS or FAILED, then we just skip calling updateProgress, so it will not have stale SCPs.

Thanks.

@timoha
Copy link
Contributor Author

timoha commented Jun 4, 2020

That's a good point @Apache9, I guess it would add sort of "continuity" to show that the task is still ongoing during master failover in case the procedure hasn't actually finished. However, I think there's an unwanted side effect here in case the procedure has actually completed by previous master (was marked done) and SCP not being actually scheduled for a while.

As an operator, if I were to notice that there's SCP, I would freak out and try to see why my regionserver failed and then try to look at the logs (and in this case nothing is logged about it new master), and then try to look at procedure list (also without finding anything there). So, I guess in that case it would be more expected to show SCP in task when it's is actually being executed rather than "pending execution"?

Maybe an alternative would be to improve the message in tasks and explicitly say "noticed an SCP, waiting for processing". Don't know if such verbosity would be useful though?

@Apache9
Copy link
Contributor

Apache9 commented Jun 4, 2020

Maybe an alternative would be to improve the message in tasks and explicitly say "noticed an SCP, waiting for processing". Don't know if such verbosity would be useful though?

Good. Waiting for your new patch.

@timoha
Copy link
Contributor Author

timoha commented Jun 4, 2020

I'll let others also comment before submitting a patch. I personally prefer for de-serialize to not have side effects all the way to what shows in the UI and also like removing code rather than adding more :)

@busbey
Copy link
Contributor

busbey commented Jun 25, 2020

@timoha looks like consensus-by-silence is a new message that says something along the lines of there's an SCP waiting to be processed.

@saintstack
Copy link
Contributor

Updating TaskMonitor inside the method that deserializes procedure state data is a total surprise.

I thought TaskMonitor best effort rather than a true view especially as so many systems are delinquent regards keeping up TaskMonitor state. On the current extraordinary effort to keep it in-line, I'd say, nice, but it is causing bigger issues so bypass for this rare condition?

On the flag juggling around resume, lets do that in another issue. Let this bug fix through?

I for one do not look at TaskMonitor figuring state of Procedures. Do others? Thanks.

@timoha
Copy link
Contributor Author

timoha commented Jul 6, 2020

Looks like operator intervention is needed for this issue :)

I for one do not look at TaskMonitor figuring state of Procedures. Do others? Thanks.

Just from my perspective, I find it useful to see the procedure progress as looking plainly at procedure list isn't as helpful. In ideal world, I wouldn't need this information at all, as it would just do its job (and would only show something when it's broken). However, since this task exists, it should not have false positives.

To make it clear, I'm against "improving" this side-effect as I wouldn't find it helpful to me as operator (I just don't care that something is de-serializing), that was just a suggestion that I now regret bringing up. I'm ok with closing this PR if you decide to go that way.

@tsuna
Copy link

tsuna commented Jul 7, 2020

Not sure whether my opinion matters but these kinds of side effects during deserialization are usually code smells. I would also be surprised to find that out.

@timoha timoha closed this Sep 9, 2020
@saintstack
Copy link
Contributor

@timoha closed it because no attention?

@timoha
Copy link
Contributor Author

timoha commented Sep 10, 2020

yup, the approach wasn't suitable enough to address the issue and I wasn't planning to improve it further, so I'll leave it to be properly addressed by hbase team 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants