HBASE-24438 Don't update TaskMonitor when deserializing ServerCrashProcedure #1826
timoha wants to merge 1 commit into apache:master
…ocedure The ServerCrashProcedure could have been completed by the previously active HBase Master, which results in a stale ServerCrashProcedure task in TaskMonitor. The TaskMonitor should only reflect the procedure if the procedure has actually been started or resumed, which happens when ServerCrashProcedure.executeFromState is called.
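The idea in the description can be illustrated with a toy model. This is only a sketch under stated assumptions: `TaskBoardSketch` and `CrashProcedureSketch` are hypothetical stand-ins, not the HBase `TaskMonitor` or `ServerCrashProcedure` classes, and the method names are invented for illustration. It shows restoring persisted state without publishing anything, and publishing only once execution begins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a task monitor; NOT the HBase TaskMonitor API.
class TaskBoardSketch {
    private final List<String> tasks = new ArrayList<>();

    void publish(String description) {
        tasks.add(description);
    }

    List<String> tasks() {
        return tasks;
    }
}

// Toy procedure: deserialization only restores fields, while the board is
// updated when the procedure is actually executed.
class CrashProcedureSketch {
    private final TaskBoardSketch board;
    private String serverName;

    CrashProcedureSketch(TaskBoardSketch board) {
        this.board = board;
    }

    // Restores persisted state; deliberately has no effect on the board.
    void deserialize(String persistedServerName) {
        this.serverName = persistedServerName;
    }

    // Called by an (imaginary) executor when the procedure starts or resumes.
    void execute() {
        board.publish("Processing crash of " + serverName);
    }
}
```

With this shape, a procedure that was already completed by a previous master and is never executed again leaves no entry on the board.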
|
Patch LGTM. But need to check with author @jingyuntian for https://issues.apache.org/jira/browse/HBASE-21647 to see if this will break other cases. |
|
🎊 +1 overall
This message was automatically generated. |
|
I think the problem here is in the implementation of updateProgress? We pass updateState as false when calling it in the deserialize method, but the only place that updates currentRunningState is the updateProgress method, so when it is called from deserialize, currentRunningState will always be null. The updateProgress method is also a total mess, and its order of execution is really strange... We should call updateProgress(true) in deserializeStateData. Could you please check whether this fixes your problem? Thanks. |
|
Thanks for the suggestion @Apache9, but I don't think that's going to work for the case of state being Also, could you please explain the rationale for adding a task during procedure deserialization? If the procedure finished during the run of the previous master, what is the value in displaying it in tasks as "completed" when a new master comes up after replaying the logs? In other words, should the simple fact of deserialization have such a side effect on tasks, or should actual procedure execution drive the task updates instead? |
|
bq. .... should a simple fact of deserialization have such a side effect on tasks... No. That is wonky. For this reason alone we should apply this patch. Will wait a while in case comeback. |
I think that unless you call updateProgress, the procedure will not show up in the TaskMonitor? The updateProgress call in deserialization is something like an initialization. If there are many SCPs, it is possible that one of them cannot be scheduled for a long time; if you do not call updateProgress in deserialization, you cannot see its current status for a long time, until it gets scheduled. |
|
And I got the reason why there is a currentRunningState field. We call setNextState after scheduling sub procedures, so when updating progress, if we just use the state machine state by calling getCurrentState, we will set the message in the TaskMonitor to the next state, which is a bit confusing to users. So I think we could add another flag to updateProgress to indicate whether the procedure is complete, and when calling it after executing SERVER_CRASH_FINISH, we set this flag to true to let the method complete the procedure in the TaskMonitor. Also, in the deserialization method, we could check the 'ProcedureState' (not the state of the state machine); if it is SUCCESS or FAILED, we just skip calling updateProgress, so there will be no stale SCPs. Thanks. |
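A minimal sketch of the proposal above, under stated assumptions: `MonitorSketch`, `ScpSketch`, and `ProcStatusSketch` are hypothetical names, the two-argument `updateProgress` signature is invented for illustration, and `SOME_EARLIER_STATE` is a placeholder rather than a real state name. None of this is the actual HBase code; it only shows the two suggested rules, a "complete" flag on the progress update and skipping the update during deserialization when the procedure has already ended.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical overall procedure status, distinct from the state machine state.
enum ProcStatusSketch { RUNNABLE, SUCCESS, FAILED }

// Hypothetical stand-in for a task monitor; NOT the HBase TaskMonitor API.
class MonitorSketch {
    private final Map<String, String> statusByTask = new LinkedHashMap<>();

    void setStatus(String task, String status) {
        statusByTask.put(task, status);
    }

    String statusOf(String task) {
        return statusByTask.get(task);
    }

    boolean hasTask(String task) {
        return statusByTask.containsKey(task);
    }
}

class ScpSketch {
    private final MonitorSketch monitor;
    private final String taskName;
    // The state being executed, tracked separately from the "next" state.
    private String currentRunningState;

    ScpSketch(MonitorSketch monitor, String taskName) {
        this.monitor = monitor;
        this.taskName = taskName;
    }

    // 'complete' is the extra flag: true only after the final state has run.
    void updateProgress(String runningState, boolean complete) {
        currentRunningState = runningState;
        monitor.setStatus(taskName,
            complete ? "COMPLETE" : "RUNNING " + currentRunningState);
    }

    // Skip the monitor entirely when the procedure ended under a previous master.
    void deserialize(ProcStatusSketch status, String persistedState) {
        if (status == ProcStatusSketch.SUCCESS || status == ProcStatusSketch.FAILED) {
            return;
        }
        updateProgress(persistedState, false);
    }
}
```

In this model an unfinished procedure still appears right after deserialization, which keeps the "initialization" behaviour argued for earlier in the thread, while a finished one never creates a stale entry.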
|
That's a good point @Apache9, I guess it would add a sort of "continuity" to show that the task is still ongoing during master failover in case the procedure hasn't actually finished. However, I think there's an unwanted side effect here when the procedure was actually completed by the previous master (was marked done) and the SCP is not scheduled for a while. As an operator, if I were to notice that there's an SCP, I would freak out and try to see why my regionserver failed, then look at the logs (and in this case nothing is logged about it on the new master), and then look at the procedure list (also without finding anything there). So, I guess in that case it would be more expected to show the SCP in tasks when it is actually being executed rather than "pending execution"? Maybe an alternative would be to improve the message in tasks and explicitly say "noticed an SCP, waiting for processing". Don't know if such verbosity would be useful though? |
Good. Waiting for your new patch. |
|
I'll let others also comment before submitting a patch. I personally prefer for de-serialize to not have side effects all the way to what shows in the UI and also like removing code rather than adding more :) |
|
@timoha looks like consensus-by-silence is a new message that says something along the lines of there's an SCP waiting to be processed. |
|
Updating TaskMonitor inside the method that deserializes procedure state data is a total surprise. I thought TaskMonitor was best effort rather than a true view, especially as so many systems are delinquent as regards keeping up TaskMonitor state. On the current extraordinary effort to keep it in line, I'd say, nice, but it is causing bigger issues, so bypass for this rare condition? On the flag juggling around resume, let's do that in another issue. Let this bug fix through? I for one do not look at TaskMonitor when figuring out the state of Procedures. Do others? Thanks. |
|
Looks like operator intervention is needed for this issue :)
Just from my perspective, I find it useful to see the procedure progress, as looking plainly at the procedure list isn't as helpful. In an ideal world, I wouldn't need this information at all, as it would just do its job (and would only show something when it's broken). However, since this task exists, it should not have false positives. To make it clear, I'm against "improving" this side effect, as I wouldn't find it helpful as an operator (I just don't care that something is deserializing); that was just a suggestion that I now regret bringing up. I'm ok with closing this PR if you decide to go that way. |
|
Not sure whether my opinion matters but these kinds of side effects during deserialization are usually code smells. I would also be surprised to find that out. |
|
@timoha closed it because no attention? |
|
yup, the approach wasn't suitable enough to address the issue and I wasn't planning to improve it further, so I'll leave it to be properly addressed by hbase team 👍 |