adds admin check for dangling fate references #4686

Merged 3 commits on Jun 25, 2024

keith-turner
Contributor

Added the ability to check for tablets that reference
fate operations that are no longer active.  This was
added to the `accumulo admin checkTablets` command.

A unit test was added that validates the algorithm and
the extraction of fate ids from tablet metadata. Manual
testing was done to validate end to end functionality.

For the manual test, the following commands were run in the shell:

```
grant Table.WRITE -t accumulo.metadata -u root
insert 1< srv opid SPLITTING:FATE:USER:dfdb85a6-65a0-47d2-a9e2-4c671b499829
```

and then the following was run

```
$ accumulo admin checkTablets

*** Looking for offline tablets ***

Scanning zookeeper
Scanning accumulo.root
Scanning accumulo.metadata
1<< is UNASSIGNED  #walogs:0

*** Looking for missing files ***

Scanning : accumulo.root (-inf,~ : [] 9223372036854775807 false)
Scan finished, 0 files of 2 missing

Scanning : accumulo.metadata (-inf,~ : [] 9223372036854775807 false)
Scan finished, 0 files of 0 missing

*** Looking for dangling fate operations ***

FATE:USER:dfdb85a6-65a0-47d2-a9e2-4c671b499829 1<<

Found 1 dangling references to fate operations
```
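
As a rough illustration of the check described above, here is a minimal sketch of pulling a fate id out of an `opid` value and flagging tablets whose fate operation is no longer active. It assumes the `opid` layout is `<operation>:<fate id>`, as in the manual test value; the class `DanglingFateSketch`, its methods, the second tablet `2<<`, and the all-zero UUID are hypothetical names and data made up for this example and are not Accumulo's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch, not the code from this PR.
class DanglingFateSketch {

  /**
   * Extracts the fate id from an opid value such as
   * "SPLITTING:FATE:USER:<uuid>", assuming the layout "<operation>:<fate id>".
   */
  static Optional<String> extractFateId(String opid) {
    int sep = opid.indexOf(':');
    if (sep < 0) {
      return Optional.empty();
    }
    String candidate = opid.substring(sep + 1);
    return candidate.startsWith("FATE:") ? Optional.of(candidate) : Optional.empty();
  }

  /**
   * Returns tablet -> fate id for every tablet whose opid references a fate
   * operation that the predicate reports as inactive.
   */
  static Map<String, String> findDangling(Map<String, String> opidsByTablet,
      Predicate<String> isActive) {
    Map<String, String> dangling = new LinkedHashMap<>();
    opidsByTablet.forEach((tablet, opid) -> extractFateId(opid)
        .filter(fateId -> !isActive.test(fateId))
        .ifPresent(fateId -> dangling.put(tablet, fateId)));
    return dangling;
  }

  public static void main(String[] args) {
    Map<String, String> opids = new LinkedHashMap<>();
    // Value taken from the manual test above.
    opids.put("1<<", "SPLITTING:FATE:USER:dfdb85a6-65a0-47d2-a9e2-4c671b499829");
    // Made-up tablet and fate id standing in for an operation that is still active.
    opids.put("2<<", "MERGING:FATE:USER:00000000-0000-0000-0000-000000000001");
    Set<String> active = Set.of("FATE:USER:00000000-0000-0000-0000-000000000001");

    // Prints one line: the inactive fate id followed by the tablet "1<<".
    findDangling(opids, active::contains)
        .forEach((tablet, fateId) -> System.out.println(fateId + " " + tablet));
  }
}
```

Passing the activity check in as a predicate keeps the sketch independent of where the set of active fate operations comes from, which also makes it easy to unit test.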
Contributor

The command description may need to be updated:
`@Parameters(commandDescription = "print tablets that are offline in online tables")`

Contributor Author

Updated in 61447d8. While looking at the command as a whole, I noticed some minor issues with the existing code and fixed those.

Contributor

Should an option be added to the command to do something about these dangling fate ops, similar to the following?
`@Parameter(names = "--fixFiles", description = "Remove dangling file pointers")`

Contributor Author

I did not want to do that now because, for something like a merge or split, the tablet could be in a really bad state, so removing the reference does not make things better. I think it is best for now to find these, if they exist, and try to find the cause. If the cause is a bug in Accumulo, then the bug needs to be fixed and cleanup considered as part of that bug fix.

@@ -93,4 +115,94 @@ public void testCannotQualifySessionId() {
EasyMock.verify(zc);
}

@Test
public void testDanglingFate() {
Contributor

@kevinrr888 Jun 21, 2024

Nice test. It includes a case where all fateIds are inactive, a case with a mix of active and inactive, and a test of the race condition. Could potentially add a test where all are active, but that is not really necessary.
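
For readers unfamiliar with the race being tested: presumably a fate operation can finish and clear its reference between the moment a tablet is read and the moment the operation's status is checked, which would make a healthy tablet look dangling. One way to guard against that, sketched below under that assumption, is to re-read the tablet and report only references that are still present. The class `DanglingFateRaceSketch`, the method `confirmDangling`, and the shortened ids `FATE:USER:a` and `FATE:USER:b` are hypothetical and are not necessarily how this PR handles it.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch, not the code from this PR.
class DanglingFateRaceSketch {

  /**
   * Returns the fate ids that were inactive after the first read AND are still
   * referenced by the tablet when it is read a second time.
   */
  static Set<String> confirmDangling(String tablet, Set<String> firstRead,
      Predicate<String> isActive, Function<String, Set<String>> rereadTablet) {
    Set<String> suspects = new TreeSet<>();
    for (String fateId : firstRead) {
      if (!isActive.test(fateId)) {
        suspects.add(fateId);
      }
    }
    if (!suspects.isEmpty()) {
      // A reference that disappeared in the meantime was cleaned up normally.
      suspects.retainAll(rereadTablet.apply(tablet));
    }
    return suspects;
  }

  public static void main(String[] args) {
    Set<String> firstRead = Set.of("FATE:USER:a", "FATE:USER:b");
    Predicate<String> isActive = "FATE:USER:a"::equals;

    // "b" finished and removed its reference before the re-read: prints []
    System.out.println(confirmDangling("1<<", firstRead, isActive, t -> Set.of()));
    // "b" is inactive and still referenced: prints [FATE:USER:b]
    System.out.println(
        confirmDangling("1<<", firstRead, isActive, t -> Set.of("FATE:USER:b")));
  }
}
```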

Contributor Author

Updated the unit test in 987f699

Contributor

Looks good. Verified that the new test passes, and verified the end-to-end test with no dangling fate ops and with 1 dangling fate op:


*** Looking for offline tablets ***

Scanning zookeeper
Scanning accumulo.root
Scanning accumulo.metadata

*** Looking for missing files ***

Scanning : accumulo.root (-inf,~ : [] 9223372036854775807 false)
Scan finished, 0 files of 2 missing

Scanning : accumulo.metadata (-inf,~ : [] 9223372036854775807 false)
Scan finished, 0 files of 0 missing


*** Looking for dangling fate operations ***


Found 0 dangling references to fate operations

*** Looking for offline tablets ***

Scanning zookeeper
Scanning accumulo.root
Scanning accumulo.metadata

*** Looking for missing files ***

Scanning : accumulo.root (-inf,~ : [] 9223372036854775807 false)
Scan finished, 0 files of 1 missing

Scanning : accumulo.metadata (-inf,~ : [] 9223372036854775807 false)
Scan finished, 0 files of 0 missing


*** Looking for dangling fate operations ***

FATE:USER:dfdb85a6-65a0-47d2-a9e2-4c671b499829 1<<

Found 1 dangling references to fate operations

@keith-turner keith-turner merged commit f477617 into apache:elasticity Jun 25, 2024
8 checks passed
@keith-turner keith-turner deleted the accumulo-3846 branch June 25, 2024 17:56
Successfully merging this pull request may close these issues.

Gracefully handle failures in splits
2 participants