HBASE-24368 Let HBCKSCP clear 'Unknown Servers', even if RegionStateN… #1709

Closed
@@ -99,12 +99,11 @@ protected Flow executeFromState(MasterProcedureEnv env, GCMergedRegionsState sta
   case GC_MERGED_REGIONS_PREPARE:
     // If GCMultipleMergedRegionsProcedure processing is slower than the CatalogJanitor's scan
     // interval, it will end resubmitting GCMultipleMergedRegionsProcedure for the same
-    // region, we can skip duplicate GCMultipleMergedRegionsProcedure while previous finished
+    // region. We can skip duplicate GCMultipleMergedRegionsProcedure while previous finished
     List<RegionInfo> parents = MetaTableAccessor.getMergeRegions(
       env.getMasterServices().getConnection(), mergedChild.getRegionName());
     if (parents == null || parents.isEmpty()) {
-      LOG.info("Region=" + mergedChild.getShortNameToLog()
-        + " info:merge qualifier has been deleted");
+      LOG.info("{} mergeXXX qualifiers have ALL been deleted", mergedChild.getShortNameToLog());
       return Flow.NO_MORE_STATE;
     }
     setNextState(GCMergedRegionsState.GC_MERGED_REGIONS_PURGE);
@@ -30,6 +30,7 @@
 import org.apache.hadoop.hbase.client.RegionInfo;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.master.RegionState;
+import org.apache.hadoop.hbase.master.assignment.RegionStateNode;
 import org.apache.hadoop.hbase.master.assignment.RegionStateStore;
 import org.apache.yetus.audience.InterfaceAudience;
 import org.slf4j.Logger;
@@ -168,4 +169,16 @@ private List<RegionInfo> getReassigns() {
       return this.reassigns;
     }
   }
+
+  /**
+   * The RegionStateNode will not have a location if a confirm of an OPEN fails. On fail,
+   * the RegionStateNode regionLocation is set to null. This is 'looser' than the test done
+   * in the superclass. The HBCKSCP has been scheduled by an operator via hbck2 probably at the
+   * behest of a report of an 'Unknown Server' in the 'HBCK Report'. Let the operator's operation
+   * succeed even in the case where the region location in the RegionStateNode is null.
+   */
+  @Override
+  protected boolean isMatchingRegionLocation(RegionStateNode rsn) {
+    return super.isMatchingRegionLocation(rsn) || rsn.getRegionLocation() == null;
+  }
 }
@@ -450,6 +450,15 @@ protected boolean shouldWaitClientAck(MasterProcedureEnv env) {
     return false;
   }

+  /**
+   * Moved out here so it can be overridden by the HBCK fix-up SCP to be less strict about what
+   * it will tolerate as a 'match'.
+   * @return True if the region location in <code>rsn</code> matches that of this crashed server.
+   */
+  protected boolean isMatchingRegionLocation(RegionStateNode rsn) {
+    return this.serverName.equals(rsn.getRegionLocation());
+  }
+
   /**
    * Assign the regions on the crashed RS to other Rses.
    * <p/>
@@ -467,14 +476,17 @@ private void assignRegions(MasterProcedureEnv env, List<RegionInfo> regions) thr
       regionNode.lock();
       try {
         // This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
-        // to us and then try to assign the region to a new RS. And before it has updated the region
+        // and then try to assign the region to a new RS. And before it has updated the region
         // location to the new RS, we may have already called the am.getRegionsOnServer so we will
-        // consider the region is still on us. And then before we arrive here, the TRSP could have
-        // updated the region location, or even finished itself, so the region is no longer on us
-        // any more, we should not try to assign it again. Please see HBASE-23594 for more details.
-        if (!serverName.equals(regionNode.getRegionLocation())) {
-          LOG.info("{} found a region {} which is no longer on us {}, give up assigning...", this,
-            regionNode, serverName);
+        // consider the region is still on this crashed server. Then before we arrive here, the
+        // TRSP could have updated the region location, or even finished itself, so the region is
+        // no longer on this crashed server any more. We should not try to assign it again. Please
+        // see HBASE-23594 for more details.
+        // UPDATE: HBCKServerCrashProcedure overrides isMatchingRegionLocation; this check can get
+        // in the way of our clearing out 'Unknown Servers'.
+        if (!isMatchingRegionLocation(regionNode)) {
+          LOG.info("{} found {} whose regionLocation no longer matches {}, skipping assign...",
+            this, regionNode, serverName);
           continue;
         }
         if (regionNode.getProcedure() != null) {
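A minimal, self-contained sketch of the matching change in the two procedure classes, for anyone who wants to try the two predicates outside HBase. `Node`, `StrictScp`, and `LooseScp` are hypothetical stand-ins for RegionStateNode, ServerCrashProcedure, and HBCKServerCrashProcedure; server names are modelled as plain strings and the example names are made up. Only the two `isMatchingRegionLocation` bodies mirror the diff.

```java
public class RegionLocationMatchSketch {

  /** Hypothetical stand-in for RegionStateNode: carries only a (possibly null) location. */
  public static final class Node {
    private final String regionLocation;

    public Node(String regionLocation) {
      this.regionLocation = regionLocation;
    }

    public String getRegionLocation() {
      return regionLocation;
    }
  }

  /** Hypothetical stand-in for ServerCrashProcedure: strict match on the crashed server. */
  public static class StrictScp {
    protected final String serverName;

    public StrictScp(String serverName) {
      this.serverName = serverName;
    }

    // Public here only so the sketch is easy to call; the method in the diff is protected.
    public boolean isMatchingRegionLocation(Node rsn) {
      // equals(null) is false, so a node with no location never matches.
      return this.serverName.equals(rsn.getRegionLocation());
    }
  }

  /** Hypothetical stand-in for HBCKServerCrashProcedure: also accepts a null location. */
  public static class LooseScp extends StrictScp {
    public LooseScp(String serverName) {
      super(serverName);
    }

    @Override
    public boolean isMatchingRegionLocation(Node rsn) {
      return super.isMatchingRegionLocation(rsn) || rsn.getRegionLocation() == null;
    }
  }

  public static void main(String[] args) {
    String crashed = "rs1.example.org,16020,1588000000000"; // made-up server name
    Node onCrashed = new Node(crashed);
    Node noLocation = new Node(null); // e.g. after a failed confirm of an OPEN
    Node elsewhere = new Node("rs2.example.org,16020,1588000000001");

    StrictScp strict = new StrictScp(crashed);
    LooseScp loose = new LooseScp(crashed);

    System.out.println(strict.isMatchingRegionLocation(onCrashed));  // true
    System.out.println(strict.isMatchingRegionLocation(noLocation)); // false
    System.out.println(loose.isMatchingRegionLocation(noLocation));  // true
    System.out.println(loose.isMatchingRegionLocation(elsewhere));   // false
  }
}
```

The loose variant only widens the match for a null location. A region that has already moved to another live server is still skipped, which keeps the HBASE-23594 guard in place for that case.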
11 changes: 11 additions & 0 deletions hbase-server/src/main/resources/hbase-webapps/master/hbck.jsp
@@ -259,6 +259,17 @@
       <h2>Unknown Servers</h2>
     </div>
   </div>
+  <p>
+    <span>Below are servers mentioned in the hbase:meta table that are not known to the cluster either as 'live' or 'dead'.
+    Each such server likely belongs to an older epoch and we no longer have accounting for it. To clear, run
+    'hbck2 scheduleRecoveries UNKNOWN_SERVERNAME' to schedule a ServerCrashProcedure to clear out references
+    and to schedule reassigns of any hosted Regions. But first, be sure the referenced Region is not currently
+    stuck looping trying to open. Does it show as a Region-In-Transition on the Master home page? Is it mentioned
+    in the 'Procedures and Locks' Procedures list? If so, perhaps it is stuck in a loop trying to open but unable to
+    because of a missing reference or file. Read the Master log looking for the most recent mentions of the associated
+    Region name.
+    </span>
+  </p>
   <table class="table table-striped">
     <tr>
       <th>RegionInfo</th>