New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-25523 Region normalizer chore thread is getting killed #2899
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: changes seem applicable to master and all active branch-2.x also. Let's raise PR in master and continue this discussion.
getRegionsLoad().get(hri.getRegionName()); | ||
ServerLoad load = masterServices.getServerManager().getLoad(sn); | ||
if (load == null) { | ||
LOG.debug(hri.getRegionNameAsString() + " was not found on any server"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: was not found on any server while retrieving it's regionSize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saw this issue today.. few comments
Doesn't this mean the server "sn" is offline (not that region is not assigned)?
A tighter check here is isRegionOnline(), first check if the region is online (which ensures the region is not in transition and is assigned), we can even loop a few times to avoid any transient RITs.
Another related question is does it make sense to run normalizer if there are RITs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bharathv you are right about RITs. Default behaviour of balancer is also to not execute when the cluster has RITs, only by providing a force
option does it allow using balancer forcefully in the presence of RITs. Something similar (if not already in place) should be applicable to normalizer too.
Moreover, I just checked, this change should be applicable to master and branch-2's also. I think we can start with master branch PR, continue discussion, finalize changes there and then come back here while backporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this mean the server "sn" is offline (not that region is not assigned)?
True, most likely this would mean that region is in transition because RS hosting this region went offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@virajjasani - Yes these changes will have to be done for branch-2 and master also.
A tighter check here is isRegionOnline(), first check if the region is online (which ensures the region is not in transition and is assigned), we can even loop a few times to avoid any transient RITs.
@bharathv - I am currently looking at the Merge procedure code will get back to you. But my understanding was if a region is in transition than merge will fail so we will catch that as an exception and move ahead to other regions. But i will look again and confirm this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In branch-1 we have the check at
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L685
which is being called by the flow from here
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/normalizer/MergeNormalizationPlan.java#L71
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in branch-2 and master we ahve https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MergeTableRegionsProcedure.java
where in constructor we cal checkRegionsToMerge which eventually calls checkOnline() where the condition is being checked.
@virajjasani @bharathv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mnpoonia Ya, nvm. I was hoping we'd add the same check here to avoid an unnecessary RPC to merge (when we know it will fail) but I think we are still prone to other race conditions (depending on when the master assignment state is updated), so perhaps better to keep the code simple as it is today and defer the check to when the merge is actually run.
True, most likely this would mean that region is in transition because RS hosting this region went offline.
@virajjasani I see what you mean but theoretically it could also be possible that this region is online elsewhere in which case the log is misleading, may be slightly reword to clarify it? (sorry for the nitpick)..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@virajjasani I see what you mean but theoretically it could also be possible that this region is online elsewhere in which case the log is misleading
Hmm, I was thinking just the line above this where we get ServerName from AM, retrieves us regions hosting server, so was wondering in order for region to quickly transition and become online by the time we reach here might not be possible?
ServerName sn = masterServices.getAssignmentManager().getRegionStates().
getRegionServerOfRegion(hri);
One thing to figure out is what does this call return in case region is already in transition when we call this.
Anyways, good to reword accordingly in the log. Agree on it @bharathv .
💔 -1 overall
This message was automatically generated. |
getRegionsLoad().get(hri.getRegionName()); | ||
ServerLoad load = masterServices.getServerManager().getLoad(sn); | ||
if (load == null) { | ||
LOG.debug(hri.getRegionNameAsString() + " was not found on any server"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saw this issue today.. few comments
Doesn't this mean the server "sn" is offline (not that region is not assigned)?
A tighter check here is isRegionOnline(), first check if the region is online (which ensures the region is not in transition and is assigned), we can even loop a few times to avoid any transient RITs.
Another related question is does it make sense to run normalizer if there are RITs.
56df5a2
to
1fbc2f1
Compare
💔 -1 overall
This message was automatically generated. |
1fbc2f1
to
74407ab
Compare
74407ab
to
4ad08af
Compare
@bharathv @virajjasani Have a look in your free cycle. Have reworded log line and added extra null check which is a possibility. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
No description provided.