Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #449 fix two bugs with WAL recovery #458

Merged
merged 4 commits into from
May 8, 2018

Conversation

keith-turner
Copy link
Contributor

  • Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
  • Fix bug with out of order logs. Recovery code assumed logs were passed in
    time order. However, since 1.8.0 they have been passed in random order. Rewrote
    recovery code to handle out of order logs. The fix was to read all logs in
    a sorted merged way.

import org.apache.accumulo.tserver.logger.LogFileKey;
import org.apache.accumulo.tserver.logger.LogFileValue;
import org.apache.hadoop.fs.Path;
import org.mortbay.log.Log;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mortbay log is probably not the Log you want


private static class MultiReaderIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {

private MultiReader reader;
Copy link
Contributor

@milleruntime milleruntime May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MultiReader should be renamed. Then the MultiReaderIterator as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this Iterator could be moved into "MultiReader"

lastFinish = newFinish;
}
private int findMaxTabletId(KeyExtent extent, List<Path> recoveryLogs) throws IOException {
int tid = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use less abstract name than "tid". We use this name for different things throughout the code.

@@ -16,132 +16,79 @@
*/
package org.apache.accumulo.tserver.log;

import static com.google.common.base.Preconditions.checkState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should just import the class. I found myself looking around for your definition of checkState. Perhaps this isn't a big of deal in an IDE.

}

private void playbackMutations(MultiReader reader, int tid, LastStartToFinish lastStartToFinish,
public void recover(KeyExtent extent, List<Path> recoveryLogs, Set<String> tabletFiles,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference to have public methods and entry points at the top of classes. Unlike binary data, humans read from the top down :)

try (RecoveryLogsIterator rli = new RecoveryLogsIterator(fs, recoveryLogs, COMPACTION_START,
tid)) {

DeduplicatingIterator ddi = new DeduplicatingIterator(rli);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could DeduplicatingIterator be Autoclosable as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, looks like you need to close it.

return recoverySeq;
}

private void playbackMutations(List<Path> recoveryLogs, MutationReceiver mr, int tid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement... this method now becomes a lot more clear.

private List<MultiReader> readers;
private UnmodifiableIterator<Entry<LogFileKey,LogFileValue>> iter;

private static class MultiReaderIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default method 'remove' should be overridden

}
}

static class SortCheckIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default method 'remove' should be overridden

import static org.apache.accumulo.tserver.logger.LogEvents.OPEN;

import java.io.IOException;
import java.io.UncheckedIOException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a Java 8 exception

int findLastStartToFinish(MultiReader reader, int fileno, KeyExtent extent,
Set<String> tabletFiles, LastStartToFinish lastStartToFinish)
throws IOException, EmptyMapFileException, UnusedException {
static class DeduplicatingIterator implements Iterator<Entry<LogFileKey,LogFileValue>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default method 'remove' should be overridden

ctubbsii added a commit to ctubbsii/accumulo that referenced this pull request May 2, 2018
@@ -106,12 +106,12 @@ private int findMaxTabletId(KeyExtent extent, List<Path> recoveryLogs) throws IO
checkState(key.event == DEFINE_TABLET); // should only fail if bug elsewhere

if (key.tablet.equals(extent) || key.tablet.equals(alternative)) {
checkState(key.tid >= 0, "Tid %s for %s is negative", key.tid, extent);
checkState(tabletId == -1 || key.tid >= tabletId); // should only fail if bug in
checkState(key.tabletId >= 0, "Tid %s for %s is negative", key.tabletId, extent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could update the log messages to refer to "tabletId" instead of "Tid", too.

Copy link
Member

@mikewalch mikewalch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a test is failing...

https://jenkins.revelc.net/job/Accumulo-PR-Build/73/

@keith-turner
Copy link
Contributor Author

Using these changes, a ~24hr test of CI w/ agitation completed with no data loss. There were 8 nodes runnning tservers and tservers were killed 279 times during the test.

        org.apache.accumulo.test.continuous.ContinuousVerify$Counts
                REFERENCED=24169811377
                UNREFERENCED=7532412

keith-turner and others added 4 commits May 8, 2018 11:31
 * Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
 * Fix bug with out of order logs.  Recovery code assumed logs were passed in
   time order.  However, since 1.8.0 they have been passed in random order. Rewrote
   recovery code to handle out of order logs.  The fix was to read all logs in
   a sorted merged way.
@keith-turner keith-turner merged commit b8f574f into apache:1.9 May 8, 2018
keith-turner added a commit that referenced this pull request May 8, 2018
 * Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
 * Fix bug with out of order logs.  Recovery code assumed logs were passed in
   time order.  However, since 1.8.0 they have been passed in random order. Rewrote
   recovery code to handle out of order logs.  The fix was to read all logs in
   a sorted merged way.
keith-turner pushed a commit that referenced this pull request May 8, 2018
keith-turner added a commit that referenced this pull request May 8, 2018
@keith-turner keith-turner deleted the accumulo-449 branch May 8, 2018 15:36
@keith-turner keith-turner added v2.0.0 bug This issue has been verified to be a bug. labels May 8, 2018
keith-turner added a commit that referenced this pull request May 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue has been verified to be a bug.
Projects
No open projects
1.9.1
  
Done
2.0.0
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants