HBASE-28168 Add option in RegionMover.java to isolate one or more regions on the RegionServer #5470
Conversation
💔 -1 overall
This message was automatically generated.
Thanks for submitting this PR @mihir6692, looks useful. I have left a few comments inline.
An extra thought: should we implement this as a separate "isolate_regions" operation, rather than as an option to "unload"? That sounds more intuitive to me, wdyt?
```java
if (!allRegionOpsSuccessful) {
  // Failed to fetch one of the region's RegionInfo, so we exit from here.
  break;
```
So if we've succeeded in submitting the first few before failing to fetch the meta location, we won't wait for the ack on the ones we submitted? This could leave the RS even more overloaded, no?
> So if we've succeeded in submitting the first few before failing to fetch the meta location, we won't wait for the ack on the ones we submitted?

The tool hasn't submitted any region move task yet; it is only collecting RegionInfo for possible move tasks. At this point, if the flag `allRegionOpsSuccessful` is set to `false`, that means the tool failed to fetch the RegionInfo for one of the regions that needs to be isolated. By running `break`, the tool exits the `while (true)` loop without submitting any region move (i.e. isolation) request; at this point we only have the list of region move tasks in `isolateRegionTaskList`.

If `allRegionOpsSuccessful` is set to `true`, then all RegionInfo were found and `waitMoveTasksToFinish` will submit the region move (i.e. isolation) tasks in Ack mode.
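The collect-then-submit flow described here can be sketched in plain Java. This is a simplified, hypothetical model: `lookupRegionInfo`, the stub tasks, and the returned count stand in for the actual RegionMover internals, and are not the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

class CollectThenSubmit {
  // Stand-in for the meta lookup; returns null when the RegionInfo cannot be found.
  static String lookupRegionInfo(String regionId) {
    return regionId.startsWith("bad") ? null : regionId;
  }

  // Returns the number of move tasks that would be submitted. Mirrors the logic
  // discussed above: tasks are only collected first; if any lookup fails,
  // nothing is ever submitted to the cluster.
  static int collectThenSubmit(List<String> regionIds) {
    List<Callable<Boolean>> isolateRegionTaskList = new ArrayList<>();
    boolean allRegionOpsSuccessful = true;
    for (String id : regionIds) {
      String info = lookupRegionInfo(id);
      if (info == null) {
        allRegionOpsSuccessful = false;
        break; // exit the loop without submitting anything
      }
      isolateRegionTaskList.add(() -> true); // collected, not yet submitted
    }
    if (!allRegionOpsSuccessful) {
      return 0; // no move request ever reaches the cluster
    }
    // Only at this point would the tasks be submitted in Ack mode.
    return isolateRegionTaskList.size();
  }

  public static void main(String[] args) {
    System.out.println(collectThenSubmit(List.of("r1", "r2")));     // all found
    System.out.println(collectThenSubmit(List.of("r1", "bad-r2"))); // one lookup fails
  }
}
```

The point of the pattern is that a lookup failure happens before any side effect, so aborting is safe.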
Aren't you submitting a task for each region to your thread pool on line #562? For example, if you pass 3 regions and the first two are located in meta but the third isn't, you will break out of the while, but the two tasks for the located regions would still be running.
I totally missed that. Will look into it. Thanks.
Fixed it in 911df20
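The shape of such a fix can be sketched as follows. This is a hypothetical, simplified model, not the actual change in 911df20: when tasks are submitted inside the loop, an early break must still wait on the futures already handed to the pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WaitOnFailure {
  // Submits one stub move task per region; on a failed lookup it stops
  // submitting new tasks but still awaits every task already in flight.
  static int isolate(List<String> regionIds) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    List<Future<Boolean>> submitted = new ArrayList<>();
    try {
      for (String id : regionIds) {
        if (id.startsWith("bad")) {
          break; // lookup failed: submit no further tasks
        }
        submitted.add(pool.submit(() -> true)); // stand-in for MoveWithAck
      }
      int acked = 0;
      for (Future<Boolean> f : submitted) {
        try {
          if (f.get()) { // wait for the ack even though we broke out early
            acked++;
          }
        } catch (Exception e) {
          // a real implementation would log and count the failure
        }
      }
      return acked;
    } finally {
      pool.shutdown();
    }
  }
}
```

Without the second loop, the early `break` would leave the already-submitted moves running unobserved, which is exactly the scenario raised in the review.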
```java
    + server.getHostname());
} else {
  Future<Boolean> isolateRegionTask = isolateRegionPool.submit(
    new MoveWithAck(conn, isolateRegionInfo, hRegionLocation.getServerName(), server,
```
So we are moving regions to the RS we actually wanted to unload? Sounds contradictory, no? And since we are doing it before we actually do the unloads, it may even make the situation worse in the case of an overloaded RS.
> So we are moving regions to the RS we actually wanted to unload? Sounds contradictory, no?

Yes, and yes, it does sound contradictory. But region isolation means we are isolating one or more regions on a single RS, and if that RS has existing regions, we have to unload them.

> And since we are doing it before we actually do the unloads, it may even make the situation worse in the case of an overloaded RS.

Region isolation can be done in two ways here:

1. Unload the target RS, move the regions to be isolated onto the target RS, and put the target RS in draining/decommission mode. While we do this, HMaster could move another region onto the target RS before it enters draining/decommission mode. (We could disable/enable the balancer switch in this option, but that seems unnecessary to me.)
2. Move the regions to be isolated onto the target RS, put the target RS in draining/decommission mode, and then unload the rest of the regions from the RS. This ensures that while we unload regions from the target RS, HMaster can't move any new regions onto it.

Option 2 seems cleaner and more efficient to me. Granted, the target RS could be overloaded for some amount of time, but it is easier and cleaner to ensure that the target RS only has the regions intended for isolation.

Hope this makes sense.
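The ordering argument for option 2 can be illustrated with a toy model. Everything below is invented for illustration (it is not HBase API): once the RS is in draining mode, new assignments to it are rejected, so the final unload cannot be undone by the balancer.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DrainingDemo {
  // Toy model of a single RegionServer's assignment state.
  static class FakeRegionServer {
    final Set<String> regions = new HashSet<>();
    boolean draining = false;

    // Master-side assignment: refused once the RS is draining.
    boolean assign(String region) {
      if (draining) {
        return false;
      }
      return regions.add(region);
    }
  }

  // Option 2 ordering: move isolation targets in, drain, then unload the rest.
  static Set<String> isolateOption2(FakeRegionServer rs, List<String> toIsolate) {
    for (String r : toIsolate) {
      rs.assign(r);                       // step 1: move targets onto the RS
    }
    rs.draining = true;                   // step 2: draining/decommission mode
    rs.regions.retainAll(toIsolate);      // step 3: unload everything else
    rs.assign("balancer-moved-region");   // a later master move is now rejected
    return rs.regions;
  }

  public static void main(String[] args) {
    FakeRegionServer rs = new FakeRegionServer();
    rs.regions.add("preexisting-region");
    System.out.println(isolateOption2(rs, List.of("hot-region")));
  }
}
```

With option 1's ordering, the draining step comes last, so the simulated balancer move in between would succeed and leave an extra region on the RS.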
I would rather do #1 with an extra round of unload to address the case where the master moved regions there during the isolating phase. It would reduce the risk of problems due to an overloaded RS.
In approach 1, all the regions would be offloaded first, including the hotspotting region. This can lead to resource contention on the new RS the hotspotting region moves to. Also, the hotspotting region could get stuck in region transition if it is holding a read-write lock.

@wchevreuil That's a good suggestion. Let me look into it. Earlier I planned this feature as a separate class from

Thanks for the review. @wchevreuil

On second thought, it may be a lot easier to implement and more user-friendly with
🎊 +1 overall
This message was automatically generated.
Oh, I thought about implementing this. In hbck2 we do a lot of "composable" operations, some granular and others aggregated.
IMO, it's not only confusing to make this part of unload, it's semantically incorrect.
@wchevreuil Yes, I have implemented it following commit 44c6b43. Although there is one case which won't work with the current implementation: while discussing offline with @virajjasani, I came to realise that it won't work for isolating

Currently I am using

I am working on handling this case right now.
```java
this.addOptWithArg("isolateRegionIds",
  "Comma separated list of Region IDs hash to isolate on a RegionServer and put "
    + " region server in draining mode. This option should only be used with '-o isolate_regions'"
    + " only. By putting region server in decommission/draining mode, master can't assign any"
```
nit: remove the extra "only" at the end of the sentence: "This option should only be used with '-o isolate_regions' only"
done
@wchevreuil Can you review the PR again? I have fixed all the cases and addressed the review comments too.
Test failures are not related to the patch. Not sure why they are failing.
LGTM, +1. UT failure seems unrelated.
Thanks @wchevreuil
Hi @wchevreuil Can you merge this PR? This same patch would work for

For
#5470) Signed-off-by: Wellington Chevreuil <wchevreuil@apache.org>