Address some CCR REST test case flakiness #38975

Merged
jasontedor merged 2 commits into elastic:master on Feb 15, 2019

Conversation

jasontedor (Member) commented:

The CCR REST tests that rely on these assertions are flaky, and have been since the introduction of recovery from remote.

The underlying problem is this: these tests make assertions about the number of operations read by the shard following task. With recovery from remote, however, we no longer have guarantees that the assumptions these tests rely on hold. Namely, these tests assumed that the only way a document could land in the follower index is via the shard following task. With recovery from remote, there is another way: via the files that are copied over during the recovery phase. Most of the time this is not a problem, because with the small number of documents indexed in these tests, a flush usually does not occur, so there are no documents in the copied files. However, a flush can occur at any time, at which point all of the indexed documents could end up in a safe commit and be copied over during recovery from remote. This commit modifies these assertions to ones that are not prone to this issue, yet still validate the health of the follower shard.

Relates #38829
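
To illustrate the shape of the change, here is a minimal sketch in the YAML REST test format (not the exact diff from this PR; the index name `bar` and the concrete values are placeholders):

```yaml
# Brittle: assumes every document reaches the follower via the shard
# following task, which recovery from remote no longer guarantees.
- do:
    ccr.follow_stats:
      index: bar
- match: { indices.0.shards.0.operations_read: 3 }

# Robust: validates the health of the follower shard without depending
# on how the documents arrived (following task or copied files).
- do:
    ccr.follow_stats:
      index: bar
- match: { indices.0.shards.0.read_exceptions: [] }
- gte:   { indices.0.shards.0.follower_global_checkpoint: 2 }
```

An assertion on the follower's global checkpoint holds regardless of whether the operations were read by the shard following task or were already present in the files copied during recovery.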

@jasontedor added the >test (Issues or PRs that are addressing/adding tests) and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels, plus the v6.7.0, v7.0.0, v7.2.0, and v8.0.0 version labels, on Feb 15, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@martijnvg (Member) left a comment:

👍

@jasontedor merged commit 7342d5a into elastic:master on Feb 15, 2019
jasontedor added a commit that referenced this pull request Feb 15, 2019
jasontedor added a commit that referenced this pull request Feb 15, 2019
jasontedor added a commit that referenced this pull request Feb 15, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 15, 2019
* master:
  Address some CCR REST test case flakiness (elastic#38975)
  Edits to text in Completion Suggester doc (elastic#38980)
  SQL: doc polishing
  [DOCS] Fixes broken formatting
  SQL: Polish the rest chapter (elastic#38971)
  Remove `nGram` and  `edgeNGram` token filter names (elastic#38911)
  Add an exception throw if waiting on transport port file fails (elastic#37574)
  Improve testcluster distribution artifact handling (elastic#38933)
  Advance max_seq_no before add operation to Lucene (elastic#38879)
  Reduce global checkpoint sync interval in disruption tests (elastic#38931)
  [test] disable packaging tests for suse boxes
  Relax testStressMaybeFlushOrRollTranslogGeneration (elastic#38918)
  [DOCS] Edits warning in put watch API (elastic#38582)
  Fix serialization bug in ShardFollowTask after cutting this class over to extend from ImmutableFollowParameters.
  [DOCS] Updates methods for upgrading machine learning (elastic#38876)