New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #449 fix two bugs with WAL recovery #458

Merged
merged 4 commits into from May 8, 2018

Conversation

Projects
None yet
4 participants
@keith-turner
Contributor

keith-turner commented May 2, 2018

  • Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
  • Fix bug with out of order logs. Recovery code assumed logs were passed in
    time order. However, since 1.8.0 they have been passed in random order. Rewrote
    recovery code to handle out of order logs. The fix was to read all logs in
    a sorted merged way.
import org.apache.accumulo.tserver.logger.LogFileKey;
import org.apache.accumulo.tserver.logger.LogFileValue;
import org.apache.hadoop.fs.Path;
import org.mortbay.log.Log;

This comment has been minimized.

@ctubbsii

ctubbsii May 2, 2018

Member

mortbay log is probably not the Log you want

private static class MultiReaderIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {
private MultiReader reader;

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

The MultiReader should be renamed. Then the MultiReaderIterator as well.

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

Perhaps this Iterator could be moved into "MultiReader"

lastFinish = newFinish;
}
private int findMaxTabletId(KeyExtent extent, List<Path> recoveryLogs) throws IOException {
int tid = -1;

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

Should use less abstract name than "tid". We use this name for different things throughout the code.

@@ -16,132 +16,79 @@
*/
package org.apache.accumulo.tserver.log;
import static com.google.common.base.Preconditions.checkState;

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

Should just import the class. I found myself looking around for your definition of checkState. Perhaps this isn't a big of deal in an IDE.

}
private void playbackMutations(MultiReader reader, int tid, LastStartToFinish lastStartToFinish,
public void recover(KeyExtent extent, List<Path> recoveryLogs, Set<String> tabletFiles,

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

Personal preference to have public methods and entry points at the top of classes. Unlike binary data, humans read from the top down :)

try (RecoveryLogsIterator rli = new RecoveryLogsIterator(fs, recoveryLogs, COMPACTION_START,
tid)) {
DeduplicatingIterator ddi = new DeduplicatingIterator(rli);

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

Could DeduplicatingIterator be Autoclosable as well?

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

If not, looks like you need to close it.

return recoverySeq;
}
private void playbackMutations(List<Path> recoveryLogs, MutationReceiver mr, int tid,

This comment has been minimized.

@milleruntime

milleruntime May 2, 2018

Contributor

Nice improvement... this method now becomes a lot more clear.

private List<MultiReader> readers;
private UnmodifiableIterator<Entry<LogFileKey,LogFileValue>> iter;
private static class MultiReaderIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {

This comment has been minimized.

@mikewalch

mikewalch May 2, 2018

Member

Default method 'remove' should be overridden

}
}
static class SortCheckIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {

This comment has been minimized.

@mikewalch

mikewalch May 2, 2018

Member

Default method 'remove' should be overridden

import static org.apache.accumulo.tserver.logger.LogEvents.OPEN;
import java.io.IOException;
import java.io.UncheckedIOException;

This comment has been minimized.

@mikewalch

mikewalch May 2, 2018

Member

This is a Java 8 exception

int findLastStartToFinish(MultiReader reader, int fileno, KeyExtent extent,
Set<String> tabletFiles, LastStartToFinish lastStartToFinish)
throws IOException, EmptyMapFileException, UnusedException {
static class DeduplicatingIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {

This comment has been minimized.

@mikewalch

mikewalch May 2, 2018

Member

Default method 'remove' should be overridden

ctubbsii added a commit to ctubbsii/accumulo that referenced this pull request May 2, 2018

@@ -106,12 +106,12 @@ private int findMaxTabletId(KeyExtent extent, List<Path> recoveryLogs) throws IO
checkState(key.event == DEFINE_TABLET); // should only fail if bug elsewhere
if (key.tablet.equals(extent) || key.tablet.equals(alternative)) {
checkState(key.tid >= 0, "Tid %s for %s is negative", key.tid, extent);
checkState(tabletId == -1 || key.tid >= tabletId); // should only fail if bug in
checkState(key.tabletId >= 0, "Tid %s for %s is negative", key.tabletId, extent);

This comment has been minimized.

@ctubbsii

ctubbsii May 3, 2018

Member

Could update the log messages to refer to "tabletId" instead of "Tid", too.

@keith-turner

This comment has been minimized.

Contributor

keith-turner commented May 7, 2018

Using these changes, a ~24hr test of CI w/ agitation completed with no data loss. There were 8 nodes runnning tservers and tservers were killed 279 times during the test.

        org.apache.accumulo.test.continuous.ContinuousVerify$Counts
                REFERENCED=24169811377
                UNREFERENCED=7532412

keith-turner and others added some commits Apr 27, 2018

fixes #449 fix two bugs with WAL recovery (#458)
 * Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
 * Fix bug with out of order logs.  Recovery code assumed logs were passed in
   time order.  However, since 1.8.0 they have been passed in random order. Rewrote
   recovery code to handle out of order logs.  The fix was to read all logs in
   a sorted merged way.

@keith-turner keith-turner merged commit b8f574f into apache:1.9 May 8, 2018

0 of 2 checks passed

Jenkins Jenkins is validating pull request ...
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

keith-turner added a commit that referenced this pull request May 8, 2018

fixes #449 fix two bugs with WAL recovery (#458)
 * Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
 * Fix bug with out of order logs.  Recovery code assumed logs were passed in
   time order.  However, since 1.8.0 they have been passed in random order. Rewrote
   recovery code to handle out of order logs.  The fix was to read all logs in
   a sorted merged way.

keith-turner added a commit that referenced this pull request May 8, 2018

keith-turner added a commit that referenced this pull request May 8, 2018

@keith-turner keith-turner deleted the keith-turner:accumulo-449 branch May 8, 2018

keith-turner added a commit that referenced this pull request May 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment