Address some CCR REST test case flakiness #38975

Merged
jasontedor merged 2 commits into elastic:master on Feb 15, 2019

Conversation

jasontedor (Member) commented:

The CCR REST tests that rely on these assertions are flaky, and have been since the introduction of recovery from remote.

The underlying problem is this: these tests make assertions about the number of operations read by the shard following task. With recovery from remote, however, we no longer have guarantees that the assumptions these tests rely on hold. Namely, these tests assumed that the only way a document could land in the follower index is via the shard following task. With recovery from remote, there is another way: via the files that are copied over during the recovery phase. Most of the time this is not a problem, because with the small number of documents indexed in these tests, a flush usually does not occur, so there are no documents in the copied files. However, a flush can occur at any time, at which point all of the indexed documents could end up in a safe commit and be copied over during recovery from remote. This commit modifies these assertions to ones that are not prone to this issue, yet still validate the health of the follower shard.

Relates #38829
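
To illustrate the shape of the change, here is a minimal sketch in the YAML REST test format (not the exact diff from this PR; the index name `bar` and the concrete values are placeholders):

```yaml
# Brittle: assumes every document reaches the follower via the shard
# following task, which recovery from remote no longer guarantees.
- do:
    ccr.follow_stats:
      index: bar
- match: { indices.0.shards.0.operations_read: 3 }

# Robust: validates the health of the follower shard without depending
# on how the documents arrived (following task or copied files).
- do:
    ccr.follow_stats:
      index: bar
- match: { indices.0.shards.0.read_exceptions: [] }
- gte:   { indices.0.shards.0.follower_global_checkpoint: 2 }
```

An assertion on the follower's global checkpoint holds regardless of whether the operations were read by the shard following task or were already present in the files copied during recovery.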

@jasontedor added the >test (Issues or PRs that are addressing/adding tests) and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels, plus the v6.7.0, v7.0.0, v7.2.0, and v8.0.0 version labels, on Feb 15, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@martijnvg (Member) left a comment:

👍

@jasontedor merged commit 7342d5a into elastic:master on Feb 15, 2019
jasontedor added a commit that referenced this pull request Feb 15, 2019
jasontedor added a commit that referenced this pull request Feb 15, 2019
jasontedor added a commit that referenced this pull request Feb 15, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 15, 2019
* master:
  Address some CCR REST test case flakiness (elastic#38975)
  Edits to text in Completion Suggester doc (elastic#38980)
  SQL: doc polishing
  [DOCS] Fixes broken formatting
  SQL: Polish the rest chapter (elastic#38971)
  Remove `nGram` and  `edgeNGram` token filter names (elastic#38911)
  Add an exception throw if waiting on transport port file fails (elastic#37574)
  Improve testcluster distribution artifact handling (elastic#38933)
  Advance max_seq_no before add operation to Lucene (elastic#38879)
  Reduce global checkpoint sync interval in disruption tests (elastic#38931)
  [test] disable packaging tests for suse boxes
  Relax testStressMaybeFlushOrRollTranslogGeneration (elastic#38918)
  [DOCS] Edits warning in put watch API (elastic#38582)
  Fix serialization bug in ShardFollowTask after cutting this class over to extend from ImmutableFollowParameters.
  [DOCS] Updates methods for upgrading machine learning (elastic#38876)