HBASE-23984 [Flakey Tests] TestMasterAbortAndRSGotKilled fails in tea… #1311
saintstack merged 1 commit into apache:branch-2 from
Conversation
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
// opening can not be interrupted by a close request any more.
region = HRegion.openHRegion(regionInfo, htd, rs.getWAL(regionInfo), rs.getConfiguration(),
  rs, null);
rs.postOpenDeployTasks(new PostOpenDeployContext(region, openProcId, masterSystemTime));
Yikes! Yeah, this seems better here. Good.
No...
IIRC, the design here is that postOpenDeployTasks is the PONR: if we arrive here, we cannot revert, and the only way to address the exception is to abort the region server.
The fact is, if we haven't told the master anything, it is fine for us to close the region and tell the master about the failure. But once we have already called the master with the success message, even if the RPC call fails we do not know whether the other side (the master) has received and processed the request already, so the only way is to retry forever, and if that cannot be done, the only way is to abort ourselves...
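The two branches described above can be sketched as a small decision rule. This is a hedged illustration only; the class, enum, and method names are stand-ins, not the actual HBase API.

```java
// A minimal sketch of the PONR ("point of no return") rule described above.
// All names here are illustrative stand-ins, not the actual HBase API.
class PonrSketch {
  enum Outcome { CLOSE_AND_REPORT_FAILURE, RETRY_FOREVER_OR_ABORT }

  // Decide how to handle a failure during region open, given whether we
  // have already reported success to the master.
  static Outcome onOpenFailure(boolean successReportedToMaster) {
    if (!successReportedToMaster) {
      // Master knows nothing yet: safe to close the region locally and
      // report the failure.
      return Outcome.CLOSE_AND_REPORT_FAILURE;
    }
    // Past the PONR: even if the success RPC failed, the master may have
    // received and processed it. State is ambiguous, so the only options
    // are to retry forever or abort the region server.
    return Outcome.RETRY_FOREVER_OR_ABORT;
  }
}
```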
bq. IIRC, the design here is that, postOpenDeployTasks is the PONR, if we arrive here, then we can not revert back, the only way to address the exception is to abort the region server.
Ok. That helps. Let me add the above as a comment, ensure the above happens, and get my fix in.
* <p>Expects that the close has been registered in the hosting RegionServer before
* submitting this Handler; i.e. <code>rss.getRegionsInTransitionInRS().putIfAbsent(
* this.regionInfo.getEncodedNameAsBytes(), Boolean.FALSE);</code> has been called first.
* In here when done, we do the deregister.</p>
  }
}

@Override protected void handleException(Throwable t) {
Maybe it's just because I'm new to the *Handler code, but it's not clear to me why one would handle exceptions locally vs. handle them from this handleException method. I guess it's all hooks for operating within the confines of a Runnable off on a thread pool somewhere.
Yes, inconsistently used. Here trying to keep w/ the herd.
// the master to split our logs in order to recover the data.
server.abort("Unrecoverable exception while closing region " +
  regionInfo.getRegionNameAsString() + ", still finishing close", ioe);
throw new RuntimeException(ioe);
After reading the above comment and seeing you discarded the throwing of this exception, I initially choked. But reading through the actual use of these Handlers in the ExecutorService instances hanging off of HRegionServer and HMaster, I can only conclude that the above throw was only wishful thinking. There's even a comment (emphasis mine):
Start up all services. If any of these threads gets an unhandled exception
then they just die with a logged message. This should be fine because
in general, we do not expect the master to get such unhandled exceptions
as OOMEs; it should be lightly loaded. See what HRegionServer does if
need to install an unexpected exception handler.
The author of the above comment speaks wistfully of what I can only assume is HRegionServer#uncaughtExceptionHandler. However, it doesn't appear that this is threaded down into the executor service, which means this line's throw statement is simply logged and ignored.
So yes, I think removing the throw is the right choice. It removes the false sense of handling this error condition correctly. It's really the abort that protects the content of the memstore.
Also, why is there not a named exception thrown by the memstore when it cannot flush? Seems like a useful point in that data structure's API.
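For readers new to this corner of the JDK, the mechanism being discussed above can be shown in plain Java: an UncaughtExceptionHandler only sees an exception that escapes a pool's Runnable if it is installed on the worker threads via the ThreadFactory, and only for tasks started with execute() (submit() wraps the Throwable in the returned Future instead). This is a generic sketch, not HBase code; all names are illustrative.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: wiring an UncaughtExceptionHandler into an ExecutorService via a
// ThreadFactory. Without this, a throw escaping a Runnable just kills the
// worker thread, and the Throwable is at best logged.
class HandlerPoolSketch {
  static final AtomicReference<Throwable> caught = new AtomicReference<>();
  static final CountDownLatch fired = new CountDownLatch(1);

  static ExecutorService newPoolWithHandler() {
    ThreadFactory tf = r -> {
      Thread t = new Thread(r, "handler-sketch");
      // The handler runs when an exception propagates out of the Runnable
      // and the worker thread dies.
      t.setUncaughtExceptionHandler((thread, e) -> {
        caught.set(e);
        fired.countDown();
      });
      return t;
    };
    return Executors.newSingleThreadExecutor(tf);
  }

  // Runs a throwing task via execute() and waits for the handler to see it.
  static Throwable runAndCatch(Runnable task) {
    ExecutorService pool = newPoolWithHandler();
    pool.execute(task); // note: execute(), not submit()
    try {
      fired.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    pool.shutdown();
    return caught.get();
  }
}
```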
rs.finishRegionProcedure(closeProcId);
LOG.info("Closed {}", regionName);
} finally {
rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
Apache9
left a comment
Let's focus on just making the UT pass here, without changing other code.
I suggest we open a follow-on issue to discuss the abort behavior. To me, the operations in the abort method do not make sense. Maybe we just need to try our best to close the connection to zk to let the master know we are dead, and then just do a System.exit(1). For now we do lots of cleanup work and even want to flush all the regions? This is not an abort, I'd say; it is almost like a graceful shutdown...
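The proposed "real" abort could be sketched roughly as below. This is a hedged illustration of the reviewer's suggestion, not HBase code: `zkCloser` and the exit-code holder are hypothetical stand-ins so the sketch is testable; a real server would close its actual ZooKeeper connection and call System.exit(1).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a fast abort: best-effort close of the zk connection so the
// master notices we are dead quickly, then exit immediately -- no region
// flushing, no graceful cleanup.
class FastAbortSketch {
  static void abort(Runnable zkCloser, AtomicInteger exiter) {
    try {
      zkCloser.run(); // best effort only; a failure here does not matter
    } catch (RuntimeException e) {
      // Ignore: we are aborting anyway.
    }
    exiter.set(1); // stand-in for System.exit(1)
  }
}
```

The key property is that the exit happens unconditionally, whether or not the zk close succeeds.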
// Cache the close region procedure id after report region transition succeed.
rs.finishRegionProcedure(closeProcId);
LOG.info("Closed {}", regionName);
} finally {
So this is the actual fix here?
If you really want to do this to make the test pass, I suggest you add the removal in the handleException method, and add a FIXME or TODO comment saying this is just to make the test pass and should be addressed later.
bq. Let's focus on just making the UT pass here, without changing other code.

It is not just about the unit test.

bq. I suggest we open a follow on issue, to discuss the abort behavior.

You are welcome to. I'm currently just interested in landing a fix for cluster shutdown/RS aborts and concurrent assign/unassigns, which cause flakey test failures and hangs in the wild.

bq. To me, the operations in abort method do not make sense. Maybe we just need to try our best to close the connection to zk to let master know we are dead, and then just do a System.exit(1). For now we will do lots of clean up work and even want to flush all the regions? This is not a abort I'd say, it is almost like a graceful shutdown...

For new issue.
bq. If you really want to do this to let the test pass, I suggest you add the removal in the handleException method, and add a FIXME or TODO comment to say that this is just for making test pass, should be addressed later.

I can move it to handleException, np. I will NOT note that it is a UT fix only. There is an obvious hole here that holds up shutdowns, and shutdowns are not UT-only. These Handlers strike me as arbitrary regarding where stuff goes; no wonder there are holes. Let me put up another patch w/ your suggestions.
New push. Enjoys the benefit of @Apache9 feedback. The main change is restoring these Handlers to how they were (but w/ the PONR comment added) and then, in handleException, removing the entry from the RS RIT map just before the call to abort. Gets me what I want and leaves the rest of the code as it was.
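The shape of that handleException change can be sketched as below. Types and names are simplified stand-ins for illustration, not the real handler classes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the change described above: in handleException, deregister the
// region from the RS regions-in-transition map just before aborting, so a
// leftover entry cannot wedge region server shutdown.
class HandleExceptionSketch {
  final ConcurrentMap<String, Boolean> regionsInTransition = new ConcurrentHashMap<>();
  boolean aborted = false;

  void handleException(String encodedRegionName, Throwable t) {
    // Remove our RIT entry first; without this, the outstanding entry can
    // stop the region server from going down on cluster shutdown.
    regionsInTransition.remove(encodedRegionName);
    aborted = true; // stand-in for server.abort("...", t)
  }
}
```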
🎊 +1 overall
This message was automatically generated.
Apache9
left a comment
+1 for now.
Let's open another issue to address the shutdown issue.
// Done! Region is closed on this RS
this.rsServices.getRegionsInTransitionInRS().
  remove(this.regionInfo.getEncodedNameAsBytes(), Boolean.FALSE);
LOG.debug("Closed " + region.getRegionInfo().getRegionNameAsString());
LOG.debug("Closed {}", region.getRegionInfo().getRegionNameAsString());
💔 -1 overall
This message was automatically generated.
…rdown

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
Change parameter name and add javadoc to make it more clear what the param actually is.

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
Move postOpenDeployTasks so if it fails to talk to the Master -- which can happen on cluster shutdown -- then we will do cleanup of state; without this the RS can get stuck and won't go down.

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
Add handleException so CRH looks more like UnassignRegionHandler and AssignRegionHandler around exception handling. Add a bit of doc on why CRH.

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/UnassignRegionHandler.java
Right shift most of the body of process so we can add a finally that cleans up rs.getRegionsInTransitionInRS on exception (otherwise outstanding entries can stop a RS going down on cluster shutdown)
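The UnassignRegionHandler change in that summary, right shifting the body of process() into a try so a finally can always deregister, can be sketched like this. Types and names are simplified stand-ins, not the real handler.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: wrap the body of process() in try/finally so the
// regions-in-transition entry is always removed, even when the body throws.
class UnassignSketch {
  final ConcurrentMap<String, Boolean> regionsInTransition = new ConcurrentHashMap<>();

  void process(String encodedName, Runnable body) {
    regionsInTransition.putIfAbsent(encodedName, Boolean.FALSE);
    try {
      body.run(); // close the region, report to master, etc.
    } finally {
      // Always deregister; an outstanding entry here can stop the RS
      // from going down on cluster shutdown.
      regionsInTransition.remove(encodedName, Boolean.FALSE);
    }
  }
}
```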
One of the test failures -- a perversion around Region handling in TestRegionObserverInterface -- exposed an issue w/ the CloseRegionHandler refactor trying to make it look like the other handlers around regionsInTransitionInRS handling. Fixed (and fixed the issue @Apache9 noted above). Added region name logging to this journal stuff -- otherwise it's just opaque... that's just a log change.
Thanks for the review @Apache9 . I'd filed HBASE-24015 a few days ago because it seemed plain this issue had opened a can of worms -- and that was before you showed up. You want to go more radical than the scope of HBASE-24015, so I made HBASE-24026 for the shutdown redo. Thanks.
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Test report shows no failures. The test output got dropped because of the below. I've been running these tests locally and they seem fine. Will merge and keep an eye on it.

Post stage
…rdown (#1311)

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
Change parameter name and add javadoc to make it more clear what the param actually is.

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
Move postOpenDeployTasks so if it fails to talk to the Master -- which can happen on cluster shutdown -- then we will do cleanup of state; without this the RS can get stuck and won't go down.

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
Add handleException so CRH looks more like UnassignRegionHandler and AssignRegionHandler around exception handling. Add a bit of doc on why CRH.

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/UnassignRegionHandler.java
Right shift most of the body of process so we can add a finally that cleans up rs.getRegionsInTransitionInRS on exception (otherwise outstanding entries can stop a RS going down on cluster shutdown)

Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>