[CI] CorruptedFileIT#testReplicaCorruption fails on Windows #28435

Closed
dnhatn opened this issue Jan 30, 2018 · 3 comments

Comments

dnhatn commented Jan 30, 2018

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/1179/console
Log: testReplicaCorruption.txt

We intentionally corrupted every segment file in #testReplicaCorruption. This caused file-based recoveries, as the replicas could not read the commit snapshot. However, one replica, [test][7], failed to clean up its store after the index files were copied: the corruption marker corrupted_WzIGIIeIQvqVgKvEIVtRwA had been deleted but still showed up in the directory listing and was inaccessible (on Windows).

1> Caused by: java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\server\build\testrun\integTest\J1\temp\org.elasticsearch.index.store.CorruptedFileIT_883F8FC2C6E448B5-001\tempDir-002\data\nodes\0\indices\0tscMvZ-Q-21DRHviJsJxw\7\index\corrupted_WzIGIIeIQvqVgKvEIVtRwA
  1> 	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83) ~[?:?]
  1> 	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[?:?]
  1> 	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[?:?]
  1> 	at sun.nio.fs.WindowsFileSystemProvider.newByteChannel(WindowsFileSystemProvider.java:230) ~[?:?]
  1> 	at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at org.apache.lucene.mockfile.FilterFileSystemProvider.newByteChannel(FilterFileSystemProvider.java:212) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at org.apache.lucene.mockfile.HandleTrackingFS.newByteChannel(HandleTrackingFS.java:240) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_144]
  1> 	at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_144]
  1> 	at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:77) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
  1> 	at org.apache.lucene.util.LuceneTestCase.slowFileExists(LuceneTestCase.java:2709) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:44]
  1> 	at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:749) ~[lucene-test-framework-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
  1> 	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:99) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
  1> 	at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:119) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
  1> 	at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:615) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:607) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store.getMetadata(Store.java:281) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store.getMetadata(Store.java:253) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store.cleanupAndVerify(Store.java:683) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.RecoveryTarget.cleanFiles(RecoveryTarget.java:444) [main/:?]
  1> 	... 10 more
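
The AccessDeniedException above is consistent with Windows delete semantics: deleting a file while another handle is still open puts it into a pending-delete state in which a directory listing can still show the name, but reopening it fails. A minimal standalone sketch of that pattern (hypothetical file names, not the test's actual files; exact behaviour depends on the JDK and the filesystem):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class PendingDeleteDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pending-delete-demo");
        Path marker = dir.resolve("corrupted_demo");
        Files.write(marker, new byte[]{1, 2, 3});

        // Keep a handle open, then delete the file. On Windows the delete is
        // deferred until the last handle closes ("pending delete").
        try (FileChannel held = FileChannel.open(marker, StandardOpenOption.READ)) {
            Files.delete(marker);

            // The directory listing may still contain the deleted file ...
            try (DirectoryStream<Path> listing = Files.newDirectoryStream(dir)) {
                listing.forEach(p -> System.out.println("listed: " + p.getFileName()));
            }

            // ... but opening it again fails.
            try (FileChannel reopened = FileChannel.open(marker, StandardOpenOption.READ)) {
                System.out.println("reopened unexpectedly");
            } catch (IOException e) {
                // On Windows this is typically AccessDeniedException because the file is
                // in the pending-delete state; on other platforms the file is already gone
                // and a NoSuchFileException is thrown instead.
                System.out.println("open failed: " + e);
            }
        }
    }
}
```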

As the target failed to clean up its store, it retried the recovery. By then the corruption marker had been removed and the copied commit was safe, so the replica started a sequence-based recovery. Unfortunately, the translog and the index commit on the target did not match, because the Lucene commit had been copied from the primary. As a result, replica [test][7] could not recover from the primary.

  1> Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: expected shard UUID [32 5f 7a 61 42 43 56 4d 53 4f 47 4a 6a 35 30 4f 6c 69 65 43 44 51] but got: [54 62 51 6c 30 6a 49 4f 53 50 53 68 5a 76 58 2d 2d 44 46 70 58 41] this translog file belongs to a different translog. path:C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\server\build\testrun\integTest\J1\temp\org.elasticsearch.index.store.CorruptedFileIT_883F8FC2C6E448B5-001\tempDir-002\data\nodes\0\indices\0tscMvZ-Q-21DRHviJsJxw\7\translog\translog-2.tlog
  1> 	at org.elasticsearch.index.translog.TranslogReader.open(TranslogReader.java:131) ~[main/:?]
  1> 	at org.elasticsearch.index.translog.Translog.openReader(Translog.java:292) ~[main/:?]
  1> 	at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:238) ~[main/:?]
  1> 	at org.elasticsearch.index.translog.Translog.<init>(Translog.java:176) ~[main/:?]
  1> 	at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:491) ~[main/:?]
  1> 	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:180) ~[main/:?]
  1> 	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:151) ~[main/:?]
  1> 	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[main/:?]
  1> 	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2149) ~[main/:?]
  1> 	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2131) ~[main/:?]
  1> 	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1361) ~[main/:?]
  1> 	at org.elasticsearch.index.shard.IndexShard.openIndexAndSkipTranslogRecovery(IndexShard.java:1327) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:371) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:396) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:390) ~[main/:?]
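
The TranslogCorruptedException above comes from the UUID check in the translog header: each translog file records the UUID of the translog it belongs to, and the reader refuses to open a file whose recorded UUID differs from the one the engine expects. A simplified sketch of that kind of check (hypothetical header layout and method names, not Elasticsearch's actual on-disk format):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;

final class SimpleTranslogHeader {

    /**
     * Reads a length-prefixed UUID from the start of the file and fails if it does
     * not match the UUID that the owning translog expects.
     */
    static void verifyTranslogUuid(Path translogFile, String expectedUuid) throws IOException {
        byte[] expected = expectedUuid.getBytes(StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(translogFile)) {
            ByteBuffer lengthBuf = ByteBuffer.allocate(Integer.BYTES);
            readFully(channel, lengthBuf);
            lengthBuf.flip();
            byte[] actual = new byte[lengthBuf.getInt()];
            readFully(channel, ByteBuffer.wrap(actual));
            if (!Arrays.equals(expected, actual)) {
                throw new IOException("expected translog UUID [" + expectedUuid + "] but got ["
                        + new String(actual, StandardCharsets.UTF_8)
                        + "]: this translog file belongs to a different translog: " + translogFile);
            }
        }
    }

    private static void readFully(FileChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new IOException("unexpected end of file while reading the translog header");
            }
        }
    }
}
```
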
@dnhatn dnhatn added >test-failure Triaged test failures from CI :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jan 30, 2018
@dnhatn dnhatn self-assigned this Jan 30, 2018
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 8, 2018
A recovering replica can be broken forever if its translog does not belong to its index. This can happen as follows.

1. A replica executes a file-based recovery
2. Index files are copied to the replica, but it crashes before finishing the recovery
3. The replica starts recovery again with a sequence-based recovery, as the copied commit is safe
4. The replica fails to open its engine because its translog and Lucene index do not match
5. This replica won't be able to recover from the primary

This commit makes sure the translog belongs to the index commit before executing a sequence-based recovery; otherwise it falls back to a file-based recovery.

Closes elastic#28435
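
In other words, the fix gates the choice of recovery mode on whether the local translog actually belongs to the local Lucene commit. A rough sketch of that decision, using hypothetical accessor names rather than the real Store/Translog APIs:

```java
import java.io.IOException;

final class RecoveryModeDecision {

    /** Recovery strategies a replica can use when recovering from the primary. */
    enum Mode { SEQUENCE_NUMBER_BASED, FILE_BASED }

    /** Minimal abstraction over the on-disk state that the decision needs. */
    interface LocalShardState {
        /** Translog UUID recorded in the user data of the local Lucene commit. */
        String commitTranslogUuid() throws IOException;

        /** UUID of the translog that is actually present on disk. */
        String translogUuid() throws IOException;
    }

    /**
     * Use a sequence-number based recovery only when the local translog belongs to the
     * local Lucene commit; otherwise fall back to copying files from the primary.
     */
    static Mode chooseRecoveryMode(LocalShardState shard) {
        try {
            String commitUuid = shard.commitTranslogUuid();
            if (commitUuid != null && commitUuid.equals(shard.translogUuid())) {
                return Mode.SEQUENCE_NUMBER_BASED;
            }
        } catch (IOException e) {
            // If the association cannot be established, be conservative.
        }
        return Mode.FILE_BASED;
    }
}
```
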
ywelsch added a commit that referenced this issue Feb 9, 2018
After copying over the Lucene segments during peer recovery, we call cleanupAndVerify which removes all other files in the directory and which then calls getMetadata to check if the resulting files are a proper index. There are two issues with this:

- the directory is not fsynced after the deletions, so that the call to getMetadata, which lists files in the directory, can get a stale view, possibly seeing a deleted corruption marker (which leads to the exception seen in #28435)
- failing to delete a corruption marker should result in a hard failure, as the shard is otherwise unusable.
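
Sketched against Lucene's Directory API, those two points roughly amount to syncing directory metadata after the deletions and rethrowing, rather than swallowing, a failure to delete a corruption marker. This is only an illustration of the idea, not the actual change to Store#cleanupAndVerify:

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.store.Directory;

final class CleanupSketch {

    private static final String CORRUPTED_MARKER_PREFIX = "corrupted_";

    /**
     * Deletes everything that is not part of the freshly copied commit, then syncs the
     * directory metadata so that a subsequent listing cannot observe a stale view.
     */
    static void cleanup(Directory directory, Set<String> filesToKeep) throws IOException {
        for (String existingFile : directory.listAll()) {
            if (filesToKeep.contains(existingFile)) {
                continue;
            }
            try {
                directory.deleteFile(existingFile);
            } catch (IOException e) {
                if (existingFile.startsWith(CORRUPTED_MARKER_PREFIX)) {
                    // A corruption marker that cannot be removed leaves the shard
                    // unusable, so surface it as a hard failure instead of ignoring it.
                    throw e;
                }
                // Other leftovers may be temporarily undeletable (e.g. still open on
                // Windows) and can be retried by a later cleanup.
            }
        }
        // Make the deletions visible before anything lists the directory again
        // (e.g. the getMetadata call mentioned above).
        directory.syncMetaData();
    }
}
```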
ywelsch added a commit that referenced this issue Feb 9, 2018
dnhatn commented Feb 9, 2018

Closed by #28604

@dnhatn dnhatn closed this as completed Feb 9, 2018
dnhatn added a commit that referenced this issue Feb 12, 2018
Today we use the persisted global checkpoint to calculate the starting
seqno in peer-recovery. However, we do not check whether the translog
actually belongs to the existing Lucene index when reading the global
checkpoint. In some rare cases, if the translog does not match the Lucene
index, the recovering replica won't be able to complete its recovery.
This can happen as follows.

1. The replica executes a file-based recovery
2. Index files are copied to the replica, but it crashes before finishing the recovery
3. The replica starts recovery again with a seq-based recovery, as the copied commit is safe
4. The replica fails to open its engine because the translog and the Lucene index do not match
5. The replica won't be able to recover from the primary

This commit enforces the translogUUID requirement when reading the 
global checkpoint directly from the checkpoint file.

Relates #28435
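
Conceptually, the persisted global checkpoint is only trusted when the checkpoint belongs to the translog referenced by the current Lucene commit. A hedged sketch of that guard, with a simplified stand-in for the real checkpoint file contents:

```java
import java.io.IOException;

final class SafeGlobalCheckpoint {

    /** Minimal stand-in for the data persisted in the translog checkpoint file. */
    static final class Checkpoint {
        final String translogUuid;   // UUID of the translog this checkpoint belongs to
        final long globalCheckpoint; // persisted global checkpoint

        Checkpoint(String translogUuid, long globalCheckpoint) {
            this.translogUuid = translogUuid;
            this.globalCheckpoint = globalCheckpoint;
        }
    }

    /**
     * Returns the persisted global checkpoint only if the checkpoint belongs to the
     * translog referenced by the Lucene commit; otherwise refuses, forcing the caller
     * to fall back to a file-based recovery.
     */
    static long readGlobalCheckpoint(Checkpoint checkpoint, String commitTranslogUuid) throws IOException {
        if (!commitTranslogUuid.equals(checkpoint.translogUuid)) {
            throw new IOException("expected translog UUID [" + commitTranslogUuid
                    + "] but found [" + checkpoint.translogUuid
                    + "]: refusing to trust the persisted global checkpoint");
        }
        return checkpoint.globalCheckpoint;
    }
}
```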
dnhatn added a commit that referenced this issue Feb 13, 2018
ywelsch added a commit that referenced this issue Feb 15, 2018
dnhatn added a commit that referenced this issue Mar 4, 2018