Attempt to prevent half-closed Tablet due to failing minc #3677
Prior to this change Tablet.initiateClose would close the compactable, wait for the current minor compaction to finish, and then kick off another minor compaction. In the case where minor compactions are failing, the compactable will get closed and the call to wait for the current minor compaction to finish will hang indefinitely leaving the Tablet in a half-closed state. Tablet.initiateClose is called when either the TabletServer is closing the Tablet (due to migration or shutdown) or on a Tablet split. Therefore, a failing minor compaction will prevent Tablet migration and split or normal TabletServer shutdown.

This change adds some logic at the start of Tablet.initiateClose to try and detect a currently or previously failing minor compaction. In these cases an exception is thrown before the compactable is closed to prevent the Tablet from being in a half-closed state. This prevents the Tablet from being closed until the cause for the failing minor compaction is corrected. Causes could be a bad iterator applied at minor compaction or a bad table configuration option (like classloader context). When the cause of the failing minor compaction is corrected, then the Tablet should be able to be closed normally allowing normal migration, split, and TabletServer shutdown.

Fixes apache#3674

Co-authored-by: Ed Coleman <edcoleman@apache.org>
Co-authored-by: dtspence <dtspence@users.noreply.github.com>
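The guard described in this PR description can be pictured with a minimal, self-contained sketch. This is not Accumulo's Tablet code: `SketchTablet`, `recordMincFailure`, `recordMincSuccess`, and the `closeStarted` flag are hypothetical names invented for illustration, and the actual detection logic in the PR may well differ. The sketch only shows the ordering idea, assuming failures are tracked with a simple counter: check for failing minor compactions first, and throw before any close state is touched.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a tablet; not Accumulo's Tablet class. */
final class SketchTablet {
  // Assumed bookkeeping: number of consecutive minor compaction failures.
  private final AtomicInteger consecutiveMincFailures = new AtomicInteger();
  // Stands in for "the compactable has been closed".
  private volatile boolean closeStarted = false;

  void recordMincFailure() {
    consecutiveMincFailures.incrementAndGet();
  }

  void recordMincSuccess() {
    consecutiveMincFailures.set(0);
  }

  boolean isCloseStarted() {
    return closeStarted;
  }

  void initiateClose() {
    // Guard first: refuse to close while minor compactions are failing,
    // so that no close-related state is mutated on this path.
    if (consecutiveMincFailures.get() > 0) {
      throw new IllegalStateException(
          "Minor compactions are failing; refusing to start close");
    }
    // Only now begin the close sequence.
    closeStarted = true;
  }
}
```

With this ordering, a caller whose close attempt fails can simply retry once the underlying cause (for example a bad table property) has been corrected, because the tablet never entered a partially closed state.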
Kicked off full IT build.
@dlmarion We believe we found a code path that produces a test timeout. It appeared to be related to a minc thread exiting due to not being able to obtain a mapfile (i.e. volume chooser seeing invalid context). The code-path in the integration test saw an error within the The calling path that causes the thread to end (i.e. context error w/volume-chooser) logged the tablets were unable to unload. The logs report the following (w/minc thread exit path) when attempting to shutdown:
To replicate, the following is needed as base configuration:

    @Override
    public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration coreSite) {
      cfg.setNumTservers(1);
      cfg.setProperty("general.volume.chooser", "org.apache.accumulo.core.spi.fs.DelegatingChooser");
      cfg.setProperty("general.custom.volume.chooser.default",
          "org.apache.accumulo.core.spi.fs.PreferredVolumeChooser");
      cfg.setProperty("general.custom.volume.preferred.default", "file:/home/dtspen2/dev/git/dlmarion/accumulo/test/target/mini-tests/org.apache.accumulo.test.functional.HalfClosedTabletIT_SharedMiniClusterBase/accumulo");
    }

Then:

    tops.setProperty(tableName, Property.TABLE_CLASSLOADER_CONTEXT.getKey(), "invalid");
    tops.flush(tableName);
    Thread.sleep(500);
    // This should fail to split, but not leave the tablets in a state where they can't
    // be unloaded
    assertThrows(AccumuloServerException.class,
        () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
    tops.removeProperty(tableName, Property.TABLE_CLASSLOADER_CONTEXT.getKey());
I looked into this path where the minor compaction thread was dying because an invalid classloader context could not load the volume chooser to create the minor compaction output file. The issue is more serious than what I initially thought, but thankfully the fix is easy.

TLDR - A failed minor compaction thread likely means that subsequent minor compactions for a Tablet will not occur. The fix is to catch

Longer analysis

If the Tablet has no entries in memory, then the tablet metadata is updated with the flushId and the variable

If the Tablet has entries in memory, then MinorCompactionTask when executed will create a new output file and call

If
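A later comment in this thread mentions changing the catch clause in MinorCompactionTask from IOException to Exception. The following is a minimal sketch of why the breadth of a catch clause matters for a background task; it is not the Accumulo code. `CatchScopeSketch`, `FlushStep`, `runNarrow`, and `runBroad` are hypothetical names, and the sketch assumes the failure surfaces as an unchecked exception.

```java
import java.io.IOException;

/** Hypothetical illustration of catch-clause breadth; not Accumulo code. */
final class CatchScopeSketch {

  /** A unit of work that declares only IOException, like much file I/O code. */
  interface FlushStep {
    void run() throws IOException;
  }

  /** Handles only IOException; any unchecked exception escapes to the caller. */
  static boolean runNarrow(FlushStep step) {
    try {
      step.run();
      return true;
    } catch (IOException e) {
      return false; // failure is observed and can be handled or retried
    }
  }

  /** Handles every Exception, so unchecked failures are also observed here. */
  static boolean runBroad(FlushStep step) {
    try {
      step.run();
      return true;
    } catch (Exception e) {
      return false; // failure is observed and can be handled or retried
    }
  }
}
```

In the narrow variant an unchecked exception bypasses the handler entirely, so any cleanup or retry logic placed in that handler never runs and the thread executing the task ends with the exception.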
The approach here looks good to me.
Small style and wording suggestion, but otherwise, seems fine to me.
Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
// This should fail to split, but not leave the tablets in a state where they can't
// be unloaded
assertThrows(AccumuloServerException.class,
    () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));

removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
    tableId);
I tried removing the code in initiateClose and running the test locally. Made the following changes to this test to get it running.
// This should fail to split, but not leave the tablets in a state where they can't
// be unloaded
assertThrows(AccumuloServerException.class,
    () -> tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b")))));
removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
    tableId);
Thread configFixer = new Thread(() -> {
  UtilWaitThread.sleep(3000);
  removeInvalidClassLoaderContextPropertyWithoutValidation(getCluster().getServerContext(),
      tableId);
});
// grab this time before the thread starts and before it sleeps.
long t1 = System.nanoTime();
configFixer.start();
// The split will probably start running w/ bad config that will cause it to get stuck.
// However, once the config is fixed by the background thread it should continue.
tops.addSplits(tableName, Sets.newTreeSet(List.of(new Text("b"))));
long t2 = System.nanoTime();
// expect that split took at least 3 seconds because that is the time it takes to fix the config
assertTrue(TimeUnit.NANOSECONDS.toMillis(t2 - t1) > 3000);
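The timing pattern in this suggestion (start a background thread that repairs the configuration after a delay, then measure how long the blocked operation takes) can be tried in isolation. The sketch below is a standalone simulation under stated assumptions: `DelayedFixSketch` and `timeBlockedOperation` are hypothetical names, and a `CountDownLatch` stands in for both the bad configuration and the split that waits on it. No Accumulo API is involved.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical simulation of the "background fixer" test pattern. */
final class DelayedFixSketch {

  /**
   * Starts a thread that applies the "fix" after delayMillis, blocks until
   * the fix is applied, and returns how long the caller was blocked (ms).
   */
  static long timeBlockedOperation(long delayMillis) {
    CountDownLatch configFixed = new CountDownLatch(1);
    Thread configFixer = new Thread(() -> {
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        configFixed.countDown(); // simulate removing the bad property
      }
    });

    // Take the start time before the fixer thread starts and sleeps.
    long t1 = System.nanoTime();
    configFixer.start();
    try {
      configFixed.await(); // stands in for the operation stuck on bad config
      configFixer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
    long t2 = System.nanoTime();
    return TimeUnit.NANOSECONDS.toMillis(t2 - t1);
  }
}
```

Taking `t1` before `configFixer.start()` matters: if the timestamp were taken afterwards, part of the delay could already have elapsed and the lower bound asserted by the test would no longer be justified.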
My guess is that it works because I changed the catch clause in MinorCompactionTask from IOException to Exception. If you were to revert that single change, then I think this would fail and leave the tablet in a bad state.
I made these changes in 9f84ed4; however, 2/3 of these new tests no longer work. I believe that is because I merged #3678 into 2.1 and into this branch. The failing tests set an invalid context, and when ClassLoaderUtil.getClassLoader is called with that invalid context, the non-context-aware classloader is returned instead of the code returning null or throwing an error.
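The difference between the two behaviors described here (silently falling back to a default classloader versus signalling an error) can be shown with a small sketch. This does not reproduce ClassLoaderUtil or any Accumulo API: `ContextResolverSketch`, `resolveLenient`, and `resolveStrict` are hypothetical names, and the map of known contexts is an assumption made purely for illustration.

```java
import java.util.Map;

/** Hypothetical context-to-classloader resolver; not Accumulo's ClassLoaderUtil. */
final class ContextResolverSketch {
  private final Map<String, ClassLoader> contexts;
  private final ClassLoader fallback;

  ContextResolverSketch(Map<String, ClassLoader> contexts, ClassLoader fallback) {
    this.contexts = contexts;
    this.fallback = fallback;
  }

  /** Lenient: an unknown context silently yields the fallback loader. */
  ClassLoader resolveLenient(String context) {
    return contexts.getOrDefault(context, fallback);
  }

  /** Strict: an unknown context is reported as an error to the caller. */
  ClassLoader resolveStrict(String context) {
    ClassLoader loader = contexts.get(context);
    if (loader == null) {
      throw new IllegalArgumentException("Unknown context: " + context);
    }
    return loader;
  }
}
```

A test that relies on an invalid context to provoke a failure only works against the strict behavior; with the lenient one, the operation quietly succeeds using the fallback loader and the failure path is never exercised.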
@keith-turner - If #3683 is merged, then I think that these tests will work again.
…t/CompactableImpl.java Co-authored-by: Keith Turner <kturner@apache.org>
#3685 moves the first minor compaction in tablet close to happen before any variables are set that begin the close process.
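The ordering idea in that sentence (run the step that can fail before changing any state) is sketched below. This is an illustration only, not the change made in #3685: `CloseOrderingSketch`, its `closing` flag, and the `Runnable` standing in for the first minor compaction are all hypothetical.

```java
/** Hypothetical illustration of "fallible work before state changes". */
final class CloseOrderingSketch {
  // Stands in for the variables that mark the start of the close process.
  private boolean closing = false;

  boolean isClosing() {
    return closing;
  }

  void close(Runnable firstMinorCompaction) {
    // Run the step that may fail first. If it throws, the method exits
    // here and no close-related state has been changed.
    firstMinorCompaction.run();

    // Only after the fallible step succeeds is the close marked as begun.
    closing = true;
  }
}
```

If the flag were set before the fallible step, an exception would leave the object marked as closing with the close never completed, which is the kind of in-between state this PR is trying to avoid.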
I re-ran the tests after #3683 and #3685 were merged into 2.1. Everything appears to be working. However, it's possible that the tests could pass without the minor compactions failing. This could happen if, for example, the minor compaction kicks off before the tablet server gets the property update to set an invalid context. I verified via log inspection that failures were happening and that it recovered. It would be good in the future to add some metrics or something that can be checked in the test to confirm that failures do happen on the minc before the configuration is corrected.
Wondering if a metric that counts failed minor compactions would be useful. Should normally be zero.
Yeah, that's what I'm thinking.
Ok, that is a good idea. Are you going to open an issue?
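As a rough illustration of the metric idea discussed above, a failure counter could look like the sketch below. It makes no claim about how Accumulo's metrics system would expose such a value: `MincFailureCounterSketch`, `runMinc`, and `failedCount` are hypothetical names, and a plain `LongAdder` stands in for a real metrics registry.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counter of failed minor compactions; not an Accumulo metric. */
final class MincFailureCounterSketch {
  private final LongAdder failures = new LongAdder();

  /** Runs the given work and records a failure if it throws. */
  boolean runMinc(Runnable minc) {
    try {
      minc.run();
      return true;
    } catch (RuntimeException e) {
      failures.increment(); // normally expected to stay at zero
      return false;
    }
  }

  /** Total number of failed attempts observed so far. */
  long failedCount() {
    return failures.sum();
  }
}
```

A test could then assert that the count became non-zero while the bad configuration was in place, rather than relying on log inspection to confirm that failures actually happened.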