
TEST: Retry synced-flush if ongoing ops on primary #30978

Merged
merged 6 commits into elastic:master from dnhatn:fix-flush-it on Jun 5, 2018

Conversation

dnhatn (Member) commented May 30, 2018

When the last indexing operation is completed, we fire a global
checkpoint sync. Since a global checkpoint sync request is a
replication request, it acquires an index shard permit on the
primary when executing. If this happens at the same time as we are
issuing the synced-flush, the synced-flush request fails because it
thinks there are in-flight operations. We can avoid this situation by
retrying the synced-flush if the current request fails due to ongoing
operations on the primary.

Closes #29392

When the last indexing operation is completed, we fire a global
checkpoint sync. Since a global checkpoint sync request is a
replication request, it acquires an index shard permit on the
primary when executing. If this happens at the same time as we are
issuing the synced-flush, the synced-flush request fails because it
thinks there are in-flight operations. We can avoid this situation by
not issuing the synced-flush until the global checkpoint on the primary
has been propagated to the replicas.
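
As an illustration of that first approach, here is a minimal sketch of how a test could wait for the primary's global checkpoint to propagate to all copies before issuing the synced-flush. This is not the code from this PR; it assumes a single-shard index named "test" inside an ESIntegTestCase-style test, with the usual static imports for assertBusy, assertThat, and equalTo.

    // Hedged sketch, not the merged code: wait until every copy of the (single) shard
    // of index "test" reports the same global checkpoint as the primary.
    assertBusy(() -> {
        ShardStats[] shardStats = client().admin().indices().prepareStats("test").get().getShards();
        long primaryGlobalCheckpoint = Long.MIN_VALUE;
        for (ShardStats copy : shardStats) {
            if (copy.getShardRouting().primary()) {
                primaryGlobalCheckpoint = copy.getSeqNoStats().getGlobalCheckpoint();
            }
        }
        for (ShardStats copy : shardStats) {
            assertThat(copy.getSeqNoStats().getGlobalCheckpoint(), equalTo(primaryGlobalCheckpoint));
        }
    });
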
dnhatn added the >non-issue, >test, v7.0.0, :Distributed/Engine, and v6.4.0 labels on May 30, 2018
dnhatn requested a review from bleskes on May 30, 2018 at 21:01
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

dnhatn (Member, Author) commented May 30, 2018

I traced the list of index shard permit acquirers when issuing the synced-flush.

1> [2018-05-30T04:00:48,248][TRACE][o.e.i.f.SyncedFlushService] [node_s1] in flight operations 1, acquirers [GlobalCheckpointSyncAction.Request{shardId=[test][0], timeout=1m, index='test', waitForActiveShards=1}
1> 	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:229)
1> 	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2178)
1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:968)
1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:98)
1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:318)
1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:293)
1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:280)
1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
1> 	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656)
1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724)
1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
1> 	at java.lang.Thread.run(Thread.java:748)]
1> [2018-05-30T04:00:48,248][INFO ][o.e.i.f.FlushIT          ] Third seal: Total shards: [2], failed: [true], reason: [[1] ongoing operations on primary], detail: []

dnhatn (Member, Author) commented May 31, 2018

run sample packaging tests

jasontedor (Member) left a review comment:

LGTM.

bleskes (Contributor) commented May 31, 2018

I wonder if we still have a race condition - we periodically acquire a permit to do a background global checkpoint sync. Since synced flush is a best effort anyway (other stuff may fail in the future), maybe a better direction is to repeat it until successful?

dnhatn (Member, Author) commented May 31, 2018

@bleskes

maybe a better direction is to repeat it until successful?

Unfortunately, we can't simply repeat until successful, because these tests check for different kinds of results: unchanged, partially successful, and entirely successful.

we periodically acquire a permit to do a background global checkpoint sync

This fires every 30 seconds, which I think is long enough for the synced-flush tests. How about disabling the background global checkpoint sync in the synced-flush tests?
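
For reference, a hedged sketch of what that could look like: the background sync interval is a per-index setting, so the test could push it far out when creating the index. The setting key index.global_checkpoint_sync.interval and the prepareCreate/assertAcked test helpers (plus org.elasticsearch.common.settings.Settings) are assumptions about the codebase at the time, not the exact change.

    // Hedged sketch: create the test index with a very long background global checkpoint
    // sync interval so the periodic sync cannot race with the synced-flush tests.
    assertAcked(prepareCreate("test").setSettings(Settings.builder()
            .put("index.number_of_shards", 1)
            .put("index.number_of_replicas", 1)
            .put("index.global_checkpoint_sync.interval", "24h"))); // effectively disables the periodic sync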

dnhatn (Member, Author) commented May 31, 2018

I talked to Boaz, and we decided to take a slightly different approach for this: we will retry the synced-flush if the current request fails due to ongoing operations on the primary.

@bleskes and @jasontedor Can you please take another look?
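
To make the shape of that retry concrete, here is a minimal sketch (not the exact code merged in this PR) of a helper that retries the synced-flush while it only fails with the transient "ongoing operations on primary" reason. It assumes it lives in an ESTestCase subclass such as FlushIT (so assertBusy is in scope), takes the flush attempt as a java.util.function.Supplier so the SyncedFlushUtil-style call stays out of the sketch, and uses the ShardsSyncedFlushResult#failed and #failureReason accessors quoted in the review comments below.

    // Hedged sketch of the retry approach, not the merged implementation.
    static ShardsSyncedFlushResult syncedFlushWithRetry(Supplier<ShardsSyncedFlushResult> attempt) throws Exception {
        ShardsSyncedFlushResult[] holder = new ShardsSyncedFlushResult[1];
        assertBusy(() -> {
            ShardsSyncedFlushResult result = attempt.get();
            if (result.failed() && result.failureReason() != null
                    && result.failureReason().contains("ongoing operations on primary")) {
                // The permit is held only briefly by the global checkpoint sync;
                // throwing makes assertBusy retry the flush until the permit is released.
                throw new AssertionError(result.failureReason());
            }
            holder[0] = result;
        });
        return holder[0];
    }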

dnhatn changed the title from "TEST: Only issue synced-flush after global checkpoint synced" to "TEST: Retry synced-flush if ongoing ops on primary" on May 31, 2018
dnhatn (Member, Author) commented May 31, 2018

I should have provided the reasons for this change. The previous approach (waiting for the global checkpoint to be propagated) might not cover these cases:

  • The background global checkpoint sync, which should be unlikely to interfere but is not eliminated entirely by that approach.
  • The global checkpoint sync request is async, so it's possible that we check an old global checkpoint value instead of the latest one. We cannot use the max_seqno here because we have an artificial out-of-sync replicas test.

} catch (InterruptedException e) {
Thread.currentThread().interrupt();
if (listener.error != null) {
return; // stop here so that we can preserve the error
Review comment (Contributor):

nit: just fold this into an if else?

bleskes (Contributor) left a review comment:

LGTM

}
if (listener.result.failed()) {
// only retry if request failed due to ongoing operations on primary
assertThat(listener.result.failureReason(), not(containsString("ongoing operations on primary")));
Review comment (Contributor):

I personally favor something like:

if (listener.result != null && listener.result.failureReason() != null &&
        listener.result.failureReason().contains("ongoing operations on primary")) {
    throw new AssertionError(listener.result.failureReason()); // cause the assertBusy to retry
}

instead of these two ifs.

dnhatn (Member, Author) replied:

+1

dnhatn (Member, Author) commented Jun 5, 2018

Thanks @jasontedor and @bleskes for reviewing!

dnhatn merged commit 4b893c1 into elastic:master on Jun 5, 2018
dnhatn deleted the fix-flush-it branch on June 5, 2018 at 13:02
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 5, 2018
* elastic/master:
  [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient
  TEST:  Retry synced-flush if ongoing ops on primary (elastic#30978)
  Fix docs build.
  Only auto-update license signature if all nodes ready (elastic#30859)
  Add BlobContainer.writeBlobAtomic() (elastic#30902)
  Add a doc value format to binary fields. (elastic#30860)
  Take into account the return value of TcpTransport.readMessageLength(...) in Netty4SizeHeaderFrameDecoder
  Move caching of the size of a directory to `StoreDirectory`. (elastic#30581)
  Clarify docs about boolean operator precedence. (elastic#30808)
  Docs: remove notes on sparsity. (elastic#30905)
  Fix MatchPhrasePrefixQueryBuilderTests#testPhraseOnFieldWithNoTerms
  run overflow forecast a 2nd time as regression test for elastic/ml-cpp#110 (elastic#30969)
  Improve documentation of dynamic mappings. (elastic#30952)
  Decouple MultiValueMode. (elastic#31075)
  Docs: Clarify constraints on scripted similarities. (elastic#31076)
dnhatn added a commit that referenced this pull request Jun 5, 2018
* master:
  Removing erroneous repeat
  Adapt bwc versions after backporting #30983 to 6.4
  [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient
  TEST:  Retry synced-flush if ongoing ops on primary (#30978)
  Fix docs build.
  Only auto-update license signature if all nodes ready (#30859)
  Add BlobContainer.writeBlobAtomic() (#30902)
  Add a doc value format to binary fields. (#30860)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 5, 2018
* ccr:
  [DOCS] Creates rest-api folder in docs
  [Rollup] Disallow index patterns that match the rollup index (elastic#30491)
  Replace exact numDocs by soft-del count in SegmentInfo (elastic#31086)
  Upgrade to Lucene-7.4.0-snapshot-0a7c3f462f (elastic#31073)
  Add cors support to NioHttpServerTransport (elastic#30827)
  [DOCS] Fixes security example (elastic#31082)
  Allow terms query in _rollup_search (elastic#30973)
  Removing erroneous repeat
  Adapt bwc versions after backporting elastic#30983 to 6.4
  [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient
  TEST:  Retry synced-flush if ongoing ops on primary (elastic#30978)
  Fix docs build.
  Only auto-update license signature if all nodes ready (elastic#30859)
  Add BlobContainer.writeBlobAtomic() (elastic#30902)
  Add a doc value format to binary fields. (elastic#30860)
dnhatn added a commit that referenced this pull request Jun 10, 2018
When the last indexing operation is completed, we will fire a global
checkpoint sync. Since a global checkpoint sync request is a replication
request, it will acquire an index shard permit on the primary when
executing. If this happens at the same time while we are issuing the
synced-flush, the synced-flush request will fail as it thinks there are
in-flight operations. We can avoid such situation by retrying another
synced-flush if the current request fails due to ongoing operations on
the primary.

Closes #29392
dnhatn added a commit that referenced this pull request Jun 14, 2018
* 6.x:
  SQL: Fix build on Java 10
  [Tests] Mutualize fixtures code in BaseHttpFixture (#31210)
  [TEST] Fix RemoteClusterClientTests#testEnsureWeReconnect
  [ML] Update test thresholds to account for changes to memory control (#31289)
  Reenable Checkstyle's unused import rule (#31270)
  [ML] Check licence when datafeeds use cross cluster search  (#31247)
  Fix non-REST doc snippet
  [DOC] Extend SQL docs
  [DOCS] Shortens ML API intros
  Use quotes in the call invocation (#31249)
  move security ingest processors to a sub ingest directory (#31306)
  SQL: Whitelist SQL utility class for better scripting (#30681)
  Add 5.6.11 version constant.
  Fix version detection.
  [Docs] All Rollup docs experimental, agg limitations, clarify DeleteJob (#31299)
  Add missing release notes.
  Security: fix token bwc with pre 6.0.0-beta2 (#31254)
  Fix compilation error in UpdateSettingsIT (#31304)
  Test: Remove broken yml test feature (#31255)
  Add unreleased version 6.3.1
  [Rollup] Metric config parser must use builder so validation runs (#31159)
  Removes experimental tag from scripted_metric aggregation (#31298)
  [DOCS] Removes coming tag from 6.3.0 release notes
  6.3 release notes.
  Add notion of internal index settings (#31286)
  REST high-level client: add Cluster Health API (#29331)
  Remove leftover usage of deprecated client API
  SyncedFlushResponse to implement ToXContentObject (#31155)
  Add Get Aliases API to the high-level REST client (#28799)
  HLRest: Add get index templates API (#31161)
  Log warnings when cluster state publication failed to some nodes (#31233)
  Fix AntFixture waiting condition (#31272)
  [TEST] Mute RecoveryIT.testHistoryUUIDIsGenerated
  Ignore numeric shard count if waiting for ALL (#31265)
  Update checkstyle to 8.10.1 (#31269)
  Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202)
  Revert upgrade to Netty 4.1.25.Final (#31282)
  Use armored input stream for reading public key (#31229)
  [DOCS] Added 'fail_on_unsupported_field' param to MLT. Closes #28008 (#31160)
  Fix Netty 4 Server Transport tests. Again.
  [DOCS] Fixed typo.
  [DOCS] Added release highlights for 6.3 (#31256)
  [DOCS] Mark SQL feature as experimental
  [DOCS] Updates machine learning custom URL screenshots (#31222)
  Fix naming conventions check for XPackTestCase
  Fix security Netty 4 transport tests
  Fix race in clear scroll (#31259)
  [DOCS] Clarify audit index settings when remote indexing (#30923)
  [ML][TEST] Mute tests using rules (#31204)
  Support RequestedAuthnContext (#31238)
  Validate xContentType in PutWatchRequest. (#31088)
  [INGEST] Interrupt the current thread if evaluation grok expressions take too long (#31024)
  Upgrade to Netty 4.1.25.Final (#31232)
  Suppress extras FS on caching directory tests
  Revert "[DOCS] Added 6.3 info & updated the upgrade table. (#30940)"
  Revert "Fix snippets in upgrade docs"
  Fix snippets in upgrade docs
  [DOCS] Added 6.3 info & updated the upgrade table. (#30940)
  Enable custom credentials for core REST tests (#31235)
  Move ESIndexLevelReplicationTestCase to test framework (#31243)
  Encapsulate Translog in Engine (#31220)
  [DOCS] Adds machine learning 6.3.0 release notes (#31217)
  Remove all unused imports and fix CRLF (#31207)
  [TEST] Fix testRecoveryAfterPrimaryPromotion
  [Docs] Remove mention pattern files in Grok processor (#31170)
  Use stronger write-once semantics for Azure repository (#30437)
  Don't swallow exceptions on replication (#31179)
  Compliant SAML Response destination check (#31175)
  Move java version checker back to its own jar (#30708)
  TEST:  Retry synced-flush if ongoing ops on primary (#30978)
  [test] add fix for rare virtualbox error (#31212)
jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019
Labels
:Distributed/Engine, >non-issue, >test, v6.3.0, v6.4.0, v7.0.0-beta1
6 participants