New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include translog path in error message when translog is corrupted #32251
Include translog path in error message when translog is corrupted #32251
Conversation
Quick and dirty test, by copying corruptTranslog method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one comment. I think this is better tackled on the translog level. probably with a try catch here: org.elasticsearch.index.translog.BaseTranslogReader#read(org.elasticsearch.index.translog.BufferedChecksumStreamInput)
@@ -401,7 +402,9 @@ private void recoverFromTranslogInternal() throws IOException { | |||
try (Translog.Snapshot snapshot = translog.newSnapshotFromGen(translogGen)) { | |||
opsRecovered = config().getTranslogRecoveryRunner().run(this, snapshot); | |||
} catch (Exception e) { | |||
throw new EngineException(shardId, "failed to recover from translog", e); | |||
throw new EngineException(shardId, "Failed to recover from translog. Translog path " + translog.location() + ". Consider " + | |||
"running elasticsearch-translog tool", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't recommend running the corruption tool. It's a dangerous one and need much more context and explanation. We should leave it to the docs. Also, the exception should coming out of the translog should indicate what it is and if it's a corruption should include a path.
Pinging @elastic/es-distributed |
@andrershov I added the 6.4.0 label as well. FYI |
@bleskes following your comment, I've realized that it's a good idea to include information about translog path into all TranslogCorruptedException's. However, translog is not always read from the local file (not the case for recovery) and sometimes the path is unknown. After talking to @DaveCTurner, I've decided to add nullable path to BufferedChecksumStreamInput, which is used in static methods that do not have access to the path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is heading the right direction, but I asked for a handful of minor changes. I also asked a question of @dnhatn about wrapping any old Exception
as a TranslogCorruptedException
, which I'm not sure is correct.
@@ -401,7 +402,7 @@ private void recoverFromTranslogInternal() throws IOException { | |||
try (Translog.Snapshot snapshot = translog.newSnapshotFromGen(translogGen)) { | |||
opsRecovered = config().getTranslogRecoveryRunner().run(this, snapshot); | |||
} catch (Exception e) { | |||
throw new EngineException(shardId, "failed to recover from translog", e); | |||
throw new EngineException(shardId, "Failed to recover from translog.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we seem to prefer lowercase and no final punctuation in this sort of message - they tend to get put inside other messages etc. and capital letters and so on look strange in that context.
pendingTranslogRecovery.set(true); // just play safe and never allow commits on this see #ensureCanFlush | ||
failEngine("failed to recover from translog", e); | ||
failEngine("Failed to recover from translog", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, no need for capital letters here.
@@ -375,8 +375,9 @@ public InternalEngine recoverFromTranslog() throws IOException { | |||
recoverFromTranslogInternal(); | |||
} catch (Exception e) { | |||
try { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this extra blank line, I think.
} | ||
|
||
protected Translog.Operation read(BufferedChecksumStreamInput inStream) throws IOException { | ||
final Translog.Operation op = Translog.readOperation(inStream); | ||
if (op.primaryTerm() > getPrimaryTerm() && getPrimaryTerm() != TranslogHeader.UNKNOWN_PRIMARY_TERM) { | ||
throw new TranslogCorruptedException("Operation's term is newer than translog header term; " + | ||
throw new TranslogCorruptedException(path, "Operation's term is newer than translog header term; " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not saying that we're 100% consistent about those capital letters :)
@@ -35,14 +36,23 @@ | |||
private static final int SKIP_BUFFER_SIZE = 1024; | |||
private byte[] skipBuffer; | |||
private final Checksum digest; | |||
private final Path path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this were String source
instead then we could reasonably give it values (e.g. assertion
, remote-recovery
) avoiding the nullability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that nullability is evil, but I think Strings is evil as well. I'd better create source interface with multiple implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this field is purely to be logged in a log message for a human to read, so String
seems appropriate.
@@ -1713,7 +1711,7 @@ private static Checkpoint readCheckpoint(Path location, String expectedTranslogU | |||
} catch (TranslogCorruptedException ex) { | |||
throw ex; // just bubble up. | |||
} catch (Exception ex) { | |||
throw new TranslogCorruptedException("Translog at [" + location + "] is corrupted", ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this should be a TranslogCorruptedException
for the same reasoning as in
#19256 (comment). @dnhatn does anything bad happen if we just propagate the exception as-is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaveCTurner We need to convert to a TranslogCorruptedException
so that the recovery target can detect the error then switch from a sequence-based to a file-based recovery. See https://github.com/elastic/elasticsearch/pull/28587/files#diff-cdd186e2c33a5fa4d4200f9119684443L384
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flow makes sense but I don't follow why this should happen via a TranslogCorruptedException
.
} | ||
|
||
public TranslogCorruptedException(StreamInput in) throws IOException{ | ||
private static String corruptedMessage(Path path, String details) { | ||
return path == null ? "Translog is corrupted." : "Translog at location [" + path + "] is corrupted. " + details; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as above re. caps and punctuation.
/** | ||
* Randomly overwrite some bytes in the translog files | ||
*/ | ||
public static void corruptTranslogFilesReliably(Logger logger, Random random, Path directory) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of this and just use the other implementation, which looks more reliable than this one. If it is not more reliable, can you look into why?
// note: with the current logic, this will sometimes be a no-op | ||
long pos = RandomNumbers.randomIntBetween(random,0, (int) f.size()); | ||
|
||
ByteBuffer junk = ByteBuffer.wrap(new byte[]{(byte) random.nextInt()}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this could overwrite a byte with the same byte, which isn't a corruption. But then it overwrites at least 10 bytes in this file, so the chances are it won't do this 10 times. But then overwriting 10-50 random bytes is not a very realistic corruption: I'd expect something small (a single bit flip) or else something enormous (whole sectors are zeroed or contain the wrong data). I prefer the other method.
@@ -656,6 +658,26 @@ public IndexSearcher wrap(IndexSearcher searcher) throws EngineException { | |||
IOUtils.close(store, engine); | |||
} | |||
|
|||
|
|||
public void testTranslogRecoveryFailure() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this test makes sense, the bulk of this change is to the Translog
so I would expect there to be changes/additions to TranslogTests
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaveCTurner I'll make sure that each test that deals with TranslogCorruptedException will check that it contains correct source in it. Adding more tests for scenarios when TranslogCorruptedException could occur seems to much for this PR, because I don't introduce new throw statements, but instead just add some information to TranslogCorruptedException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense.
The CI failure suggests you need to merge |
@DaveCTurner I've managed to replace 2 existing methods for translog corruption with the single method and have figured out why the previous implementation was failing. Also, I've replaced "checksum test" with "translog corruption test". "checksum test" could previously fail (rarely), because if corruption was done in the file header it won't be noticed by the translog reader. |
@DaveCTurner I've run CorruptedTranslogIT and TruncateTranslogIT with 10K iterations each. And they both have failed after ~6K iterations. But it was not because of some assertation failure, but because of OutOfMemoryError. |
|
||
package org.elasticsearch.index.translog.source; | ||
|
||
public interface TranslogSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes We're discussing with @DaveCTurner whether we need this interface and implementations or we could pass source as a string when constructing BufferedChecksumStreamInput and TranslogCorruptedException.
My argument for having this interface is because if you have String you can pass pretty much everything and even two methods that are using local file could pass different values and this might complicate debugging.
@DaveCTurner is generally for strong static typing, but he says that in this case introducing type hierarchy complicates reading the code.
What is your opinion?
@DaveCTurner running tests in micro-batches helped to avoid OOM and run for a longer time. During the execution CorruptedTranslogIT failed and it helped me to discover a bug and fix it. Now I can confirm that translog IT test ran successfully for 18 hours in a row. I think this is enough. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a handful of nits about spacing, sorry, as well as
- one place where the original message should be restored
- one lost assertion
Aside from that and the question of whether TranslogSource
and its subclasses are too heavy, this looks good.
|
||
@Override | ||
public String toString() { | ||
return "constructed by test "+testName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spaces around +
public class TranslogTestSource implements TranslogSource { | ||
private String testName; | ||
|
||
public TranslogTestSource(String testName){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space before {
corruptionsCaught.incrementAndGet(); | ||
try (Translog translog = openTranslog(config, uuid)){ | ||
try (Translog.Snapshot snapshot = translog.newSnapshot()) { | ||
for (Location loc: locations){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spacing
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: blank lines
@@ -655,6 +655,7 @@ public IndexSearcher wrap(IndexSearcher searcher) throws EngineException { | |||
IOUtils.close(store, engine); | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: blank line
@@ -1427,18 +1429,18 @@ static void verifyChecksum(BufferedChecksumStreamInput in) throws IOException { | |||
long expectedChecksum = in.getChecksum(); | |||
long readChecksum = Integer.toUnsignedLong(in.readInt()); | |||
if (readChecksum != expectedChecksum) { | |||
throw new TranslogCorruptedException("translog stream is corrupted, expected: 0x" + | |||
throw new TranslogCorruptedException(in.getSource(), "checksum verification failed - expected: 0x" + |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
public TruncatedTranslogException(TranslogSource source, String details, Throwable cause) { | ||
super(source, details, cause); | ||
} | ||
|
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
} | ||
assertThat("no translog file corrupted", corruptedFiles, not(empty())); | ||
return corruptedFiles; | ||
Path corruptedFile = RandomPicks.randomFrom(random, candidates); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrershov we lost an assertion here, that candidates
is nonempty. This throws an IllegalArgumentException
if so, but AssertionError
is preferable.
@dnhatn pinging you for this particular change - I think it's right that we should be testing the effects of a single corruption, not 5-20 of them, but WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO is the TranslogSource
an unneeded abstraction and we can just make do with non-null source string. I also left a random comment.
@@ -52,6 +50,12 @@ public BufferedChecksumStreamInput(StreamInput in, BufferedChecksumStreamInput r | |||
} | |||
} | |||
|
|||
public BufferedChecksumStreamInput(StreamInput in, TranslogSource source) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pipe this into the other constructor which accepts null for reuse?
@bleskes thank you, issues fixed. @DaveCTurner could you please make a final review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the extra iterations @andrershov.
…2251) Currently, when TranslogCorruptedException is thrown most of the times it does not contain information about the translog location on the file system. There is the translog recovery tool that accepts the translog path as an argument and users are constantly puzzled where to get the path. This pull request adds "source" information to every TranslogCorruptedException thrown. The source could be local file, remote translog source (used for recovery), assertion (translog entry is constructed to perform some assertion) or translog constructed inside the test. Closes #24929 (cherry picked from commit 6449d9b)
Merged and backported to 6.x |
Currently, when TranslogCorruptedException is thrown most of the times it does not contain information about the translog location on the file system. There is the translog recovery tool that accepts the translog path as an argument and users are constantly puzzled where to get the path.
See for example #24929.
This pull request adds "source" information to every TranslogCorruptedException thrown. The source could be local file, remote translog source (used for recovery), assertion (translog entry is constructed to perform some assertion) or translog is constructed inside the test.