[CI] CorruptedFileIT#testReplicaCorruption fails on Windows #28435
Labels
:Distributed/Recovery
Anything around constructing a new shard, either from a local or a remote source.
>test-failure
Triaged test failures from CI
Comments
dnhatn added the >test-failure and :Distributed/Recovery labels on Jan 30, 2018
Another 6.2 instance on Windows: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.2+multijob-windows-compatibility/40/consoleText
Another instance on Windows with the same error. CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/1199/console
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue on Feb 8, 2018

A recovering replica can be broken forever if its translog does not belong to its index. This can happen as follows:
1. A replica executes a file-based recovery.
2. Index files are copied to the replica, but the replica crashes before finishing the recovery.
3. The replica starts recovery again with a sequence-number-based recovery, as the copied commit is safe.
4. The replica fails to open its engine because its translog and Lucene index do not match.
5. The replica is never able to recover from the primary.
This commit makes sure the translog belongs to the index commit before executing a sequence-number-based recovery; otherwise it falls back to a file-based recovery. Closes elastic#28435
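The decision described in the commit message can be sketched as a small toy model. This is illustrative only: the method name `startingSeqNo`, the `translog_uuid` commit user-data key, and the sentinel value are assumptions for the sketch, not the actual Elasticsearch API.

```java
import java.util.Map;

public class RecoveryModeCheck {
    static final long UNASSIGNED_SEQ_NO = -2; // sentinel meaning "file-based recovery"

    // Choose the starting sequence number for peer recovery: use the persisted
    // global checkpoint only when the on-disk translog's UUID matches the UUID
    // recorded in the safe commit's user data; otherwise force a file-based
    // recovery, since replaying a foreign translog over this index is unsafe.
    static long startingSeqNo(Map<String, String> commitUserData,
                              String translogUUID, long globalCheckpoint) {
        String expected = commitUserData.get("translog_uuid");
        if (expected != null && expected.equals(translogUUID)) {
            return globalCheckpoint + 1; // sequence-number-based recovery
        }
        return UNASSIGNED_SEQ_NO; // mismatch: fall back to file-based recovery
    }

    public static void main(String[] args) {
        assert startingSeqNo(Map.of("translog_uuid", "abc"), "abc", 41L) == 42L;
        assert startingSeqNo(Map.of("translog_uuid", "abc"), "xyz", 41L) == UNASSIGNED_SEQ_NO;
        System.out.println("ok");
    }
}
```

The key point is that the check happens before the recovery mode is chosen, so step 4 of the failure sequence above can no longer be reached.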
ywelsch added a commit that referenced this issue on Feb 9, 2018

After copying over the Lucene segments during peer recovery, we call cleanupAndVerify, which removes all other files in the directory and then calls getMetadata to check that the resulting files form a proper index. There are two issues with this:
- The directory is not fsynced after the deletions, so the call to getMetadata, which lists files in the directory, can get a stale view, possibly seeing a deleted corruption marker (which leads to the exception seen in #28435).
- Failing to delete a corruption marker should result in a hard failure, as the shard is otherwise unusable.
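The first issue, deleting files without fsyncing the containing directory, can be sketched as follows. This is a minimal illustration of the technique (delete, then fsync the parent directory before listing it), not the actual cleanupAndVerify code; the helper name is made up, and fsyncing a directory this way is known to fail on Windows, which the real code must tolerate.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncedDelete {
    // Delete a file and then fsync its parent directory, so that a subsequent
    // directory listing cannot return a stale view that still contains the
    // deleted file (the failure mode behind the stale corruption marker).
    static void deleteAndSyncParent(Path file) throws IOException {
        Files.delete(file);
        // Opening a directory read-only and calling force() fsyncs its
        // metadata on Linux; on Windows this throws and must be handled.
        try (FileChannel dir = FileChannel.open(file.getParent(), StandardOpenOption.READ)) {
            dir.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("store");
        Path marker = Files.createFile(dir.resolve("corrupted_marker"));
        deleteAndSyncParent(marker);
        try (var listing = Files.list(dir)) {
            assert listing.count() == 0 : "stale directory listing";
        }
        System.out.println("ok");
    }
}
```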
ywelsch added a commit that referenced this issue on Feb 9, 2018

After copying over the Lucene segments during peer recovery, we call cleanupAndVerify, which removes all other files in the directory and then calls getMetadata to check that the resulting files form a proper index. There are two issues with this:
- The directory is not fsynced after the deletions, so the call to getMetadata, which lists files in the directory, can get a stale view, possibly seeing a deleted corruption marker (which leads to the exception seen in #28435).
- Failing to delete a corruption marker should result in a hard failure, as the shard is otherwise unusable.
Closed by #28604
dnhatn added a commit that referenced this issue on Feb 12, 2018

Today we use the persisted global checkpoint to calculate the starting seqno in peer recovery, but we do not check whether the translog actually belongs to the existing Lucene index when reading the global checkpoint. In rare cases where the translog does not match the Lucene index, the recovering replica cannot complete its recovery. This can happen as follows:
1. A replica executes a file-based recovery.
2. Index files are copied to the replica, but the replica crashes before finishing the recovery.
3. The replica starts recovery again with a sequence-number-based recovery, as the copied commit is safe.
4. The replica fails to open its engine because the translog and Lucene index do not match.
5. The replica is never able to recover from the primary.
This commit enforces the translogUUID requirement when reading the global checkpoint directly from the checkpoint file. Relates #28435
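Enforcing the translogUUID requirement at read time can be modeled as follows. This is a toy sketch: the real translog checkpoint is a binary file, and the method name, the map-based "checkpoint", and its keys are invented here purely to show the guard.

```java
import java.util.Map;

public class CheckpointReader {
    // Toy model of a translog checkpoint: it carries the global checkpoint
    // together with the UUID of the translog it belongs to. Reading the
    // global checkpoint requires proving the translog matches the index.
    static long readGlobalCheckpoint(Map<String, String> checkpoint,
                                     String expectedTranslogUUID) {
        String uuid = checkpoint.get("translog_uuid");
        if (expectedTranslogUUID.equals(uuid) == false) {
            // The checkpoint belongs to a different translog: refuse to use
            // it rather than starting a doomed sequence-number-based recovery.
            throw new IllegalStateException("translog UUID mismatch: " + uuid);
        }
        return Long.parseLong(checkpoint.get("global_checkpoint"));
    }

    public static void main(String[] args) {
        assert readGlobalCheckpoint(
            Map.of("translog_uuid", "u1", "global_checkpoint", "7"), "u1") == 7;
        boolean threw = false;
        try {
            readGlobalCheckpoint(
                Map.of("translog_uuid", "u2", "global_checkpoint", "7"), "u1");
        } catch (IllegalStateException e) {
            threw = true;
        }
        assert threw;
        System.out.println("ok");
    }
}
```

With this guard, a stale or foreign checkpoint surfaces as an immediate hard error instead of silently selecting the wrong recovery mode.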
dnhatn added a commit that referenced this issue on Feb 13, 2018

Today we use the persisted global checkpoint to calculate the starting seqno in peer recovery, but we do not check whether the translog actually belongs to the existing Lucene index when reading the global checkpoint. In rare cases where the translog does not match the Lucene index, the recovering replica cannot complete its recovery. This can happen as follows:
1. A replica executes a file-based recovery.
2. Index files are copied to the replica, but the replica crashes before finishing the recovery.
3. The replica starts recovery again with a sequence-number-based recovery, as the copied commit is safe.
4. The replica fails to open its engine because the translog and Lucene index do not match.
5. The replica is never able to recover from the primary.
This commit enforces the translogUUID requirement when reading the global checkpoint directly from the checkpoint file. Relates #28435
ywelsch added a commit that referenced this issue on Feb 15, 2018

After copying over the Lucene segments during peer recovery, we call cleanupAndVerify, which removes all other files in the directory and then calls getMetadata to check that the resulting files form a proper index. There are two issues with this:
- The directory is not fsynced after the deletions, so the call to getMetadata, which lists files in the directory, can get a stale view, possibly seeing a deleted corruption marker (which leads to the exception seen in #28435).
- Failing to delete a corruption marker should result in a hard failure, as the shard is otherwise unusable.
dnhatn added a commit that referenced this issue on Mar 4, 2018

Today we use the persisted global checkpoint to calculate the starting seqno in peer recovery, but we do not check whether the translog actually belongs to the existing Lucene index when reading the global checkpoint. In rare cases where the translog does not match the Lucene index, the recovering replica cannot complete its recovery. This can happen as follows:
1. A replica executes a file-based recovery.
2. Index files are copied to the replica, but the replica crashes before finishing the recovery.
3. The replica starts recovery again with a sequence-number-based recovery, as the copied commit is safe.
4. The replica fails to open its engine because the translog and Lucene index do not match.
5. The replica is never able to recover from the primary.
This commit enforces the translogUUID requirement when reading the global checkpoint directly from the checkpoint file. Relates #28435
CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/1179/console
Log: testReplicaCorruption.txt
We intentionally corrupt every segment file in #testReplicaCorruption. This causes a file-based recovery to occur, as replicas cannot read the commit snapshot. However, one replica [test][7] failed to clean up its store after the index files were copied: the corruption marker corrupted_WzIGIIeIQvqVgKvEIVtRwA had been deleted but still appeared in the directory listing and was inaccessible (on Windows). Because the target failed to clean up, it retried with another recovery. By that time the corruption marker was gone and the copied commit was safe, so the replica started a sequence-based recovery. Unfortunately, the translog and the index commit on the target did not match, because the Lucene commit had been copied from the primary. As a result, replica [test][7] could not recover from the primary.
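The second fix mentioned above, treating an undeletable corruption marker as a hard failure, can be sketched with a toy cleanup check. The method name and the corrupted_* file-name convention check are illustrative; the real Store cleanup logic is more involved.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class MarkerCleanup {
    // Toy version of the cleanup invariant: every corrupted_* marker must be
    // gone before the store can be reused. If one is still visible (e.g. a
    // deleted-but-still-listed file on Windows), fail the shard hard rather
    // than leaving it in a state where a later recovery silently misbehaves.
    static void requireNoCorruptionMarkers(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().startsWith("corrupted_"))
                 .findAny()
                 .ifPresent(p -> {
                     throw new IllegalStateException(
                         "store unusable, corruption marker still present: " + p);
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("store");
        requireNoCorruptionMarkers(dir); // clean store passes
        Files.createFile(dir.resolve("corrupted_WzIGIIeIQvqVgKvEIVtRwA"));
        boolean threw = false;
        try {
            requireNoCorruptionMarkers(dir);
        } catch (IllegalStateException e) {
            threw = true;
        }
        assert threw;
        System.out.println("ok");
    }
}
```

Failing hard here keeps the failure at the cleanup step, where it is diagnosable, instead of letting the mismatch surface later as the unrecoverable-replica scenario described in this issue.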