always sets needs recovery flag in tablet mgmt iterator #4255
keith-turner merged 3 commits into apache:elasticity
This commit fixes apache#4251. The WalSunnyDayIT was failing because the root tablet needed recovery, but the tablet mgmt iterator was not indicating this because the manager was in safe mode. Therefore recovery never happened and the test timed out.

This commit also changes the tablet mgmt iterator to always indicate if volume recovery is needed.

I put some notes on #4251 showing some of the steps taken while debugging this problem.
Exception error = null;
try {
  LOG.trace("Evaluating extent: {}", tm);
  computeTabletManagementActions(tm, actions);
Prior to this change, computeTabletManagementActions would only be called if the Manager state was normal, tservers were online, and there were online tables. We don't need these conditions to be true for the Root table, but I think we do for the Metadata and other tables, as they are hosted by TabletServers. I'm wondering about the consequences of calling this in all cases.
I'm wondering if we should only do this for the Root table.
When the manager state is SAFE_MODE, we want to host the root and metadata tables. If the metadata table needs log recovery and we do not call computeTabletManagementActions, then the log recovery will not happen. Also, if the root or metadata table needs volume replacement and we do not call computeTabletManagementActions, then it will not happen. This change fixes those problems, but I'm not sure if it introduces new problems. I did open #4256 about evaluating this code overall.
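To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the Accumulo code: the class, the simplified enums, the `hasWalogs` flag, and the `NEEDS_RECOVERY` action name are assumptions for illustration. It only models the idea that gating evaluation on the manager state hides a tablet's need for recovery while in SAFE_MODE.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model; none of these types are Accumulo's real classes.
final class SafeModeRecoverySketch {
  enum ManagerState { SAFE_MODE, NORMAL }

  enum Action { NEEDS_LOCATION_UPDATE, NEEDS_RECOVERY }

  // Gated evaluation: the tablet is only inspected when the manager is NORMAL,
  // so a tablet with write-ahead logs is never flagged for recovery in SAFE_MODE.
  static Set<Action> gated(ManagerState state, boolean hasWalogs) {
    Set<Action> actions = EnumSet.noneOf(Action.class);
    if (state != ManagerState.NORMAL) {
      actions.add(Action.NEEDS_LOCATION_UPDATE);
    } else if (hasWalogs) {
      actions.add(Action.NEEDS_RECOVERY);
    }
    return actions;
  }

  // Always-evaluated variant: the recovery check runs regardless of manager state.
  static Set<Action> alwaysEvaluated(ManagerState state, boolean hasWalogs) {
    Set<Action> actions = EnumSet.noneOf(Action.class);
    if (state != ManagerState.NORMAL) {
      actions.add(Action.NEEDS_LOCATION_UPDATE);
    }
    if (hasWalogs) {
      actions.add(Action.NEEDS_RECOVERY);
    }
    return actions;
  }
}
```

In this model, `gated(SAFE_MODE, true)` never contains `NEEDS_RECOVERY`, which mirrors a root or metadata tablet whose log recovery never gets scheduled.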
Not sure how to do it ATM, but would like to minimize the specialized code used by TabletManagementIterator that is not used by TabletGroupWatcher. That was one thing I was wondering about when opening #4256.
Maybe this is sufficient to fix the problem:
if (tm is Root or Metadata Table) {
  LOG.trace("Evaluating extent: {}", tm);
  computeTabletManagementActions(tm, actions);
} else {
  if (tabletMgmtParams.getManagerState() != ManagerState.NORMAL
      || tabletMgmtParams.getOnlineTsevers().isEmpty()
      || tabletMgmtParams.getOnlineTables().isEmpty()) {
    // when manager is in the process of starting up or shutting down return everything.
    actions.add(ManagementAction.NEEDS_LOCATION_UPDATE);
  } else {
    LOG.trace("Evaluating extent: {}", tm);
    computeTabletManagementActions(tm, actions);
  }
}
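The pseudocode above can be tried out as a small runnable sketch. Everything here is a stand-in and an assumption: the `Level` enum replaces the "tm is Root or Metadata Table" check, the integer counts replace the online tserver and table sets, and a single `needsRecovery` flag replaces whatever computeTabletManagementActions would actually compute.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the suggested gate; not the real TabletManagementIterator.
final class LevelGateSketch {
  enum Level { ROOT, METADATA, USER }

  enum ManagerState { SAFE_MODE, NORMAL }

  enum Action { NEEDS_LOCATION_UPDATE, NEEDS_RECOVERY }

  static Set<Action> actionsFor(Level level, ManagerState state, int onlineTservers,
      int onlineTables, boolean needsRecovery) {
    Set<Action> actions = EnumSet.noneOf(Action.class);
    boolean systemTable = level == Level.ROOT || level == Level.METADATA;
    boolean managerReady =
        state == ManagerState.NORMAL && onlineTservers > 0 && onlineTables > 0;

    if (systemTable || managerReady) {
      // Stand-in for computeTabletManagementActions(tm, actions).
      if (needsRecovery) {
        actions.add(Action.NEEDS_RECOVERY);
      }
    } else {
      // Manager is starting up or shutting down: return the tablet without evaluating it.
      actions.add(Action.NEEDS_LOCATION_UPDATE);
    }
    return actions;
  }
}
```

Under these assumptions, root and metadata tablets are always evaluated, while user tablets keep the original startup and shutdown behavior.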