Skip to content

always sets needs recovery flag in tablet mgmt iterator#4255

Merged
keith-turner merged 3 commits intoapache:elasticityfrom
keith-turner:accumulo-4251
Feb 14, 2024
Merged

always sets needs recovery flag in tablet mgmt iterator#4255
keith-turner merged 3 commits intoapache:elasticityfrom
keith-turner:accumulo-4251

Conversation

@keith-turner
Copy link
Copy Markdown
Contributor

This commit fixes #4251. The WalSunnyDayIT was failing because the root tablet needed recovery, but the tablet mgmt iterator was not indicating this becaue the manager was in safe mode. Therefore recovery never happened and the test timed out.

This commit fixes apache#4251.  The WalSunnyDayIT was failing because the root
tablet needed recovery, but the tablet mgmt iterator was not indicating
this becaue the manager was in safe mode.  Therefore recovery never
happened and the test timed out.
@keith-turner
Copy link
Copy Markdown
Contributor Author

This commit also changes the tablet mgmt iterator to always indicate if volume recovery is needed.

@keith-turner
Copy link
Copy Markdown
Contributor Author

I put some notes on #4251 showing some of the steps taken while debugging this problem.

@keith-turner keith-turner changed the title always sets neeeds recovery flag in tablet mgmt iterator always sets needs recovery flag in tablet mgmt iterator Feb 12, 2024
Exception error = null;
try {
LOG.trace("Evaluating extent: {}", tm);
computeTabletManagementActions(tm, actions);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this change computeTabletManagementActions would only be called if the Manager state was normal, tservers were online, and there were online tables. We don't need these conditions to be true for the Root table, but I think we do for the Metadata and other tables as they are hosted by TabletServers. I'm wondering the consequence of calling this in all cases.

I'm wondering if we should only do this for the Root table.

Copy link
Copy Markdown
Contributor Author

@keith-turner keith-turner Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the manager state is SAFE_MODE, want to host the root and metadata table. If the metadata table needs log recovery and we do not call computeTabletManagementActions then the log recovery will not happen. Also if the root or metadata table need volume replacement and we do not call omputeTabletManagementActions then it will not happen. This change fixes those problems, but not sure if it introduces new problems. I did open #4256 about evaluating this code overall.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to do it ATM, but would like to minimize the specialized code used by TabletManagementIterator that is not used by TabletGroupWatcher. That was one thing I was wondering about when opening #4256.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is sufficient to fix the problem:

if (tm is Root or Metadata Table) {
        LOG.trace("Evaluating extent: {}", tm);
        computeTabletManagementActions(tm, actions);
} else {
        if (tabletMgmtParams.getManagerState() != ManagerState.NORMAL
            || tabletMgmtParams.getOnlineTsevers().isEmpty()
            || tabletMgmtParams.getOnlineTables().isEmpty()) {
          // when manager is in the process of starting up or shutting down return everything.
          actions.add(ManagementAction.NEEDS_LOCATION_UPDATE);
        } else {
          LOG.trace("Evaluating extent: {}", tm);
          computeTabletManagementActions(tm, actions);
        }
}        

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is sufficient to fix the problem:

That is a more conservative change. Made it in 43e11b1. Thinking we may be able to remove this code, but that can be done in #4256 after we get all ITs working in elasticity.

@keith-turner keith-turner merged commit 4e1b052 into apache:elasticity Feb 14, 2024
@keith-turner keith-turner deleted the accumulo-4251 branch February 14, 2024 22:13
@keith-turner keith-turner added this to the 4.0.0 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants