Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Never delete translog-N.tlog file when creation fails #15788

Merged
merged 2 commits into from Jan 6, 2016

Conversation

Projects
None yet
3 participants
@s1monw
Copy link
Contributor

commented Jan 6, 2016

We today delete the translog-N.tlog file if any subsequent operation fails
but we might actually be in a good state if for instance the creation of the writer
failes after we sucessfully baked the new translog generation into the checkpoint. In this situation
we used to delete the translog-N.tlog file and failed on the next recovery of the translog with a
NoSuchFileException | FileNotFoundException just like in https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336

This commit changes the behavior and cleans up that limbo state on recovery if we already have a generation+1 file written but not baked into
the checkpoint we remove that file but only if the previous ckp file has already been renamed otherwise we know we can't be in this state.

Never delete translog-N.tlog file when creation fails
We today delete the translog-N.tlog file if any subsequent operation fails
but we might actually be in a good state if for instance the creation of the writer
failes after we sucessfully baked the new translog generation into the checkpoint. In this situation
we used to delete the translog-N.tlog file and failed on the next recovery of the translog with a
NoSuchFileException | FileNotFoundException just like in https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336

This commit changes the behavior and cleans up that limbo state on recovery if we already have a generation+1 file written but not baked into
the checkpoint we remove that file but only if the previous ckp file has already been renamed otherwise we know we can't be in this state.
assert Files.exists(nextTranslogFile) == false || Files.size(nextTranslogFile) <= TranslogWriter.getHeaderLength(translogUUID);
if (Files.exists(currentCheckpointFile) // current checkpoint is already copied
&& Files.deleteIfExists(nextTranslogFile)) { // delete it and log a warning
logger.warn("Deleted invalid next generation before opening writer {} this is like caused by some previously tragic exception ", nextTranslogFile.getFileName());

This comment has been minimized.

Copy link
@bleskes

bleskes Jan 6, 2016

Member

Can we soften this message? maybe "deleted previously created, but not yet committed, next generation [{}]. This can happen due to a tragic exception when creating a new generation"

//
// For this to happen we must have already copied the translog.ckp file into translog-gen.ckp so we first check if that file exists
// if not we don't even try to clean it up and wait until we fail creating it
assert Files.exists(nextTranslogFile) == false || Files.size(nextTranslogFile) <= TranslogWriter.getHeaderLength(translogUUID);

This comment has been minimized.

Copy link
@bleskes

bleskes Jan 6, 2016

Member

can we add the name of the file to the failure message?

assertNotNull(translogGeneration);
assertFalse(tlog.syncNeeded());
try (Translog.Snapshot snapshot = tlog.newSnapshot()) {
for (int i = 0; i < 1; i++) {

This comment has been minimized.

Copy link
@bleskes

bleskes Jan 6, 2016

Member

I think you mean to write more than one doc? if not - just remove the for loop?

This comment has been minimized.

Copy link
@bleskes

bleskes Jan 6, 2016

Member

oh, I see. It to make this code the same as the follow up loop.

@bleskes

This comment has been minimized.

Copy link
Member

commented Jan 6, 2016

LGTM. Left some minor comments. Great catch.

s1monw added a commit that referenced this pull request Jan 6, 2016

Merge pull request #15788 from s1monw/dont_delete_tlog_file
Never delete translog-N.tlog file when creation fails

@s1monw s1monw merged commit 8a90c80 into elastic:master Jan 6, 2016

1 check passed

CLA Commit author is a member of Elasticsearch
Details

@s1monw s1monw deleted the s1monw:dont_delete_tlog_file branch Jan 6, 2016

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Jan 6, 2016

Merge pull request elastic#15788 from s1monw/dont_delete_tlog_file
Never delete translog-N.tlog file when creation fails

s1monw added a commit that referenced this pull request Jan 6, 2016

Merge pull request #15788 from s1monw/dont_delete_tlog_file
Never delete translog-N.tlog file when creation fails

s1monw added a commit that referenced this pull request Jan 6, 2016

Merge pull request #15788 from s1monw/dont_delete_tlog_file
Never delete translog-N.tlog file when creation fails

s1monw added a commit that referenced this pull request Jan 6, 2016

Merge pull request #15788 from s1monw/dont_delete_tlog_file
Never delete translog-N.tlog file when creation fails

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Jan 7, 2016

[TEST] Test that translog can recover after random IOException
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to elastic#15788

s1monw added a commit that referenced this pull request Jan 7, 2016

[TEST] Test that translog can recover after random IOException
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788

s1monw added a commit that referenced this pull request Jan 7, 2016

[TEST] Test that translog can recover after random IOException
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788

s1monw added a commit that referenced this pull request Jan 7, 2016

[TEST] Test that translog can recover after random IOException
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788

s1monw added a commit that referenced this pull request Jan 7, 2016

[TEST] Test that translog can recover after random IOException
This commit adds a new test that can throw an IOException at any point in time
and ensures that all previously synced documents can be successfully recovered after hitting
an excepiton.

Relates to #15788
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.