Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include translog path in error message when translog is corrupted #32251

Merged
merged 16 commits into from Aug 7, 2018

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented Jul 20, 2018

Currently, when TranslogCorruptedException is thrown most of the times it does not contain information about the translog location on the file system. There is the translog recovery tool that accepts the translog path as an argument and users are constantly puzzled where to get the path.
See for example #24929.
This pull request adds "source" information to every TranslogCorruptedException thrown. The source could be local file, remote translog source (used for recovery), assertion (translog entry is constructed to perform some assertion) or translog is constructed inside the test.

Quick and dirty test, by copying corruptTranslog method
@andrershov andrershov requested a review from bleskes July 20, 2018 17:46
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one comment. I think this is better tackled on the translog level. probably with a try catch here: org.elasticsearch.index.translog.BaseTranslogReader#read(org.elasticsearch.index.translog.BufferedChecksumStreamInput)

@@ -401,7 +402,9 @@ private void recoverFromTranslogInternal() throws IOException {
try (Translog.Snapshot snapshot = translog.newSnapshotFromGen(translogGen)) {
opsRecovered = config().getTranslogRecoveryRunner().run(this, snapshot);
} catch (Exception e) {
throw new EngineException(shardId, "failed to recover from translog", e);
throw new EngineException(shardId, "Failed to recover from translog. Translog path " + translog.location() + ". Consider " +
"running elasticsearch-translog tool",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't recommend running the corruption tool. It's a dangerous one and need much more context and explanation. We should leave it to the docs. Also, the exception should coming out of the translog should indicate what it is and if it's a corruption should include a path.

@andrershov andrershov added >enhancement :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Jul 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@andrershov andrershov self-assigned this Jul 23, 2018
@bleskes bleskes added the v6.4.0 label Jul 23, 2018
@bleskes
Copy link
Contributor

bleskes commented Jul 23, 2018

@andrershov I added the 6.4.0 label as well. FYI

@andrershov
Copy link
Contributor Author

@bleskes following your comment, I've realized that it's a good idea to include information about translog path into all TranslogCorruptedException's. However, translog is not always read from the local file (not the case for recovery) and sometimes the path is unknown. After talking to @DaveCTurner, I've decided to add nullable path to BufferedChecksumStreamInput, which is used in static methods that do not have access to the path.
Also I've semi-solved problem of repeated corruptTranslog method - now there is 2 corruptTranslog methods in one file, instead of 3 corruptTranslog methods in 3 files.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is heading the right direction, but I asked for a handful of minor changes. I also asked a question of @dnhatn about wrapping any old Exception as a TranslogCorruptedException, which I'm not sure is correct.

@@ -401,7 +402,7 @@ private void recoverFromTranslogInternal() throws IOException {
try (Translog.Snapshot snapshot = translog.newSnapshotFromGen(translogGen)) {
opsRecovered = config().getTranslogRecoveryRunner().run(this, snapshot);
} catch (Exception e) {
throw new EngineException(shardId, "failed to recover from translog", e);
throw new EngineException(shardId, "Failed to recover from translog.", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we seem to prefer lowercase and no final punctuation in this sort of message - they tend to get put inside other messages etc. and capital letters and so on look strange in that context.

pendingTranslogRecovery.set(true); // just play safe and never allow commits on this see #ensureCanFlush
failEngine("failed to recover from translog", e);
failEngine("Failed to recover from translog", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, no need for capital letters here.

@@ -375,8 +375,9 @@ public InternalEngine recoverFromTranslog() throws IOException {
recoverFromTranslogInternal();
} catch (Exception e) {
try {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this extra blank line, I think.

}

protected Translog.Operation read(BufferedChecksumStreamInput inStream) throws IOException {
final Translog.Operation op = Translog.readOperation(inStream);
if (op.primaryTerm() > getPrimaryTerm() && getPrimaryTerm() != TranslogHeader.UNKNOWN_PRIMARY_TERM) {
throw new TranslogCorruptedException("Operation's term is newer than translog header term; " +
throw new TranslogCorruptedException(path, "Operation's term is newer than translog header term; " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying that we're 100% consistent about those capital letters :)

@@ -35,14 +36,23 @@
private static final int SKIP_BUFFER_SIZE = 1024;
private byte[] skipBuffer;
private final Checksum digest;
private final Path path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this were String source instead then we could reasonably give it values (e.g. assertion, remote-recovery) avoiding the nullability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that nullability is evil, but I think Strings is evil as well. I'd better create source interface with multiple implementations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this field is purely to be logged in a log message for a human to read, so String seems appropriate.

@@ -1713,7 +1711,7 @@ private static Checkpoint readCheckpoint(Path location, String expectedTranslogU
} catch (TranslogCorruptedException ex) {
throw ex; // just bubble up.
} catch (Exception ex) {
throw new TranslogCorruptedException("Translog at [" + location + "] is corrupted", ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this should be a TranslogCorruptedException for the same reasoning as in
#19256 (comment). @dnhatn does anything bad happen if we just propagate the exception as-is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner We need to convert to a TranslogCorruptedException so that the recovery target can detect the error then switch from a sequence-based to a file-based recovery. See https://github.com/elastic/elasticsearch/pull/28587/files#diff-cdd186e2c33a5fa4d4200f9119684443L384

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flow makes sense but I don't follow why this should happen via a TranslogCorruptedException.

}

public TranslogCorruptedException(StreamInput in) throws IOException{
private static String corruptedMessage(Path path, String details) {
return path == null ? "Translog is corrupted." : "Translog at location [" + path + "] is corrupted. " + details;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as above re. caps and punctuation.

/**
* Randomly overwrite some bytes in the translog files
*/
public static void corruptTranslogFilesReliably(Logger logger, Random random, Path directory) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this and just use the other implementation, which looks more reliable than this one. If it is not more reliable, can you look into why?

// note: with the current logic, this will sometimes be a no-op
long pos = RandomNumbers.randomIntBetween(random,0, (int) f.size());

ByteBuffer junk = ByteBuffer.wrap(new byte[]{(byte) random.nextInt()});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this could overwrite a byte with the same byte, which isn't a corruption. But then it overwrites at least 10 bytes in this file, so the chances are it won't do this 10 times. But then overwriting 10-50 random bytes is not a very realistic corruption: I'd expect something small (a single bit flip) or else something enormous (whole sectors are zeroed or contain the wrong data). I prefer the other method.

@@ -656,6 +658,26 @@ public IndexSearcher wrap(IndexSearcher searcher) throws EngineException {
IOUtils.close(store, engine);
}


public void testTranslogRecoveryFailure() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this test makes sense, the bulk of this change is to the Translog so I would expect there to be changes/additions to TranslogTests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner I'll make sure that each test that deals with TranslogCorruptedException will check that it contains correct source in it. Adding more tests for scenarios when TranslogCorruptedException could occur seems to much for this PR, because I don't introduce new throw statements, but instead just add some information to TranslogCorruptedException

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense.

@DaveCTurner
Copy link
Contributor

The CI failure suggests you need to merge master into your PR branch.

@andrershov
Copy link
Contributor Author

@DaveCTurner I've managed to replace 2 existing methods for translog corruption with the single method and have figured out why the previous implementation was failing. Also, I've replaced "checksum test" with "translog corruption test". "checksum test" could previously fail (rarely), because if corruption was done in the file header it won't be noticed by the translog reader.
Regarding changes to corruption - it will corrupt single translog file. Previously 5 to 20 corruptions were introduced. Integration tests, that were using this method, should still pass with single corruption. I'm going to execute them overnight with many iterations to make sure they pass.

@andrershov
Copy link
Contributor Author

@DaveCTurner I've run CorruptedTranslogIT and TruncateTranslogIT with 10K iterations each. And they both have failed after ~6K iterations. But it was not because of some assertation failure, but because of OutOfMemoryError.
[junit4] Event bus dispatch error: java.lang.OutOfMemoryError: Java heap space
I've switched back to baseline and run the same tests and the result was the same, so OOM is not related to my recent changes.
For now, it's hard to say if an error is inside test framework (does anybody run an IT test for 10K iterations in a row?) or inside ES. I propose to open a separate issue to track this problem. @DaveCTurner what do you think?


package org.elasticsearch.index.translog.source;

public interface TranslogSource {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bleskes We're discussing with @DaveCTurner whether we need this interface and implementations or we could pass source as a string when constructing BufferedChecksumStreamInput and TranslogCorruptedException.
My argument for having this interface is because if you have String you can pass pretty much everything and even two methods that are using local file could pass different values and this might complicate debugging.
@DaveCTurner is generally for strong static typing, but he says that in this case introducing type hierarchy complicates reading the code.
What is your opinion?

@andrershov
Copy link
Contributor Author

@DaveCTurner running tests in micro-batches helped to avoid OOM and run for a longer time. During the execution CorruptedTranslogIT failed and it helped me to discover a bug and fix it. Now I can confirm that translog IT test ran successfully for 18 hours in a row. I think this is enough.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a handful of nits about spacing, sorry, as well as

  • one place where the original message should be restored
  • one lost assertion

Aside from that and the question of whether TranslogSource and its subclasses are too heavy, this looks good.


@Override
public String toString() {
return "constructed by test "+testName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spaces around +

public class TranslogTestSource implements TranslogSource {
private String testName;

public TranslogTestSource(String testName){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space before {

corruptionsCaught.incrementAndGet();
try (Translog translog = openTranslog(config, uuid)){
try (Translog.Snapshot snapshot = translog.newSnapshot()) {
for (Location loc: locations){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spacing

}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: blank lines

@@ -655,6 +655,7 @@ public IndexSearcher wrap(IndexSearcher searcher) throws EngineException {
IOUtils.close(store, engine);
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: blank line

@@ -1427,18 +1429,18 @@ static void verifyChecksum(BufferedChecksumStreamInput in) throws IOException {
long expectedChecksum = in.getChecksum();
long readChecksum = Integer.toUnsignedLong(in.readInt());
if (readChecksum != expectedChecksum) {
throw new TranslogCorruptedException("translog stream is corrupted, expected: 0x" +
throw new TranslogCorruptedException(in.getSource(), "checksum verification failed - expected: 0x" +

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

public TruncatedTranslogException(TranslogSource source, String details, Throwable cause) {
super(source, details, cause);
}

This comment was marked as resolved.

}
assertThat("no translog file corrupted", corruptedFiles, not(empty()));
return corruptedFiles;
Path corruptedFile = RandomPicks.randomFrom(random, candidates);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrershov we lost an assertion here, that candidates is nonempty. This throws an IllegalArgumentException if so, but AssertionError is preferable.

@dnhatn pinging you for this particular change - I think it's right that we should be testing the effects of a single corruption, not 5-20 of them, but WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO is the TranslogSource an unneeded abstraction and we can just make do with non-null source string. I also left a random comment.

@@ -52,6 +50,12 @@ public BufferedChecksumStreamInput(StreamInput in, BufferedChecksumStreamInput r
}
}

public BufferedChecksumStreamInput(StreamInput in, TranslogSource source) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pipe this into the other constructor which accepts null for reuse?

@andrershov
Copy link
Contributor Author

andrershov commented Aug 7, 2018

@bleskes thank you, issues fixed. @DaveCTurner could you please make a final review?

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the extra iterations @andrershov.

@andrershov andrershov merged commit 6449d9b into elastic:master Aug 7, 2018
andrershov pushed a commit that referenced this pull request Aug 7, 2018
…2251)

Currently, when TranslogCorruptedException is thrown most of the times it does not contain information about the translog location on the file system. There is the translog recovery tool that accepts the translog path as an argument and users are constantly puzzled where to get the path.
This pull request adds "source" information to every TranslogCorruptedException thrown. The source could be local file, remote translog source (used for recovery), assertion (translog entry is constructed to perform some assertion) or translog constructed inside the test.
Closes #24929
(cherry picked from commit 6449d9b)
@andrershov
Copy link
Contributor Author

Merged and backported to 6.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants