
Inactive shard flush should wait for ongoing one #89430

Merged

kingherc merged 8 commits into elastic:main from test-failure/87888-flush-on-inactive on Aug 22, 2022

Conversation

kingherc
Contributor

org.elasticsearch.indices.flush.FlushIT#testFlushOnInactive would
sometimes fail in the following case:

  • SHARD_MEMORY_INTERVAL_TIME_SETTING is set very low, e.g., 10ms.
  • Multiple regularly scheduled flushes proceed to
    org.elasticsearch.index.shard.IndexShard#flushOnIdle.
  • There, the first flush handles, e.g., the first document that was
    indexed. The second flush arrives shortly after, before the first
    flush finishes.
  • The second flush finds wasActive = true (due to the indexing of the
    remaining documents) and sets it to false.
  • However, the second flush is not executed, because waitIfOngoing =
    false and the first flush is still ongoing.
  • No further flush is scheduled (any subsequent regularly scheduled
    flush will find wasActive = false), which causes the failure; see
    the sketch below.

Fixes #87888
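
For illustration, here is a minimal sketch of the race in plain Java. The class, fields, and method bodies are simplified stand-ins invented for this description, not the actual IndexShard code; only the flushOnIdle / flush / waitIfOngoing / active names mirror the real ones.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only, not the real Elasticsearch implementation.
class ShardSketch {
    private final AtomicBoolean active = new AtomicBoolean(false);
    private final ReentrantLock flushLock = new ReentrantLock();

    void onIndexOperation() {
        active.set(true); // indexing marks the shard active
    }

    // Invoked by the regularly scheduled check (SHARD_MEMORY_INTERVAL_TIME_SETTING).
    void flushOnIdle() {
        // Only the first idle check after indexing observes "active"...
        if (active.getAndSet(false)) {
            // ...and the flush it issues is silently dropped if another flush
            // is still running; nothing re-schedules it afterwards.
            flush(false /* waitIfOngoing */);
        }
    }

    void flush(boolean waitIfOngoing) {
        if (waitIfOngoing) {
            flushLock.lock();
        } else if (flushLock.tryLock() == false) {
            return; // an earlier flush holds the lock: this request is skipped
        }
        try {
            // ... commit Lucene segments and trim the translog ...
        } finally {
            flushLock.unlock();
        }
    }
}
```

With waitIfOngoing = false the skipped request is simply lost, and since the active flag was already cleared, no later idle check retries it.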

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.5.0 labels Aug 17, 2022
@kingherc kingherc added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI labels Aug 17, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team and removed needs:triage Requires assignment of a team area label labels Aug 17, 2022
@kingherc kingherc added the needs:triage Requires assignment of a team area label label Aug 17, 2022
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Aug 17, 2022
@kingherc kingherc requested a review from fcofdez August 17, 2022 16:15
@kingherc
Contributor Author

Hi @henningandersen and @DaveCTurner. After investigating the test failure, we (with the help of @fcofdez) figured out the situation, which is presented in the PR description. It can happen in some rare situations when SHARD_MEMORY_INTERVAL_TIME_SETTING is set too low and/or a flush takes too long.

There are 3 solutions we discussed so far:

  1. The current PR's solution: change waitIfOngoing to true. Drawback: one of the flush threadpool's threads may wait on the flushLock. But for normal values of SHARD_MEMORY_INTERVAL_TIME_SETTING (the default is 5 seconds), this drawback would only show up if an already-running flush were unusually long-running. This is why the PR takes this approach.
  2. To avoid the drawback of the first solution, we could "chain" a second flush request to the first ongoing request, and make the first request, when finished, execute any chained requests. That way, we do not have a thread waiting on the flushLock. Drawback: complexity vs. the other approaches (a rough sketch is included at the end of this comment).
  3. We remove the wasActive optimization. Drawback: every SHARD_MEMORY_INTERVAL_TIME_SETTING ms, a possibly no-op flush request would be scheduled.

Feel free to tell us your opinion on the selected approach and any other thoughts you may have.
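
To make option 2 concrete, here is a rough sketch of what the chaining could look like. It is purely illustrative; the class name and the AtomicBoolean-based bookkeeping are invented for this comment, not an actual implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the "chaining" idea: a flush request that arrives
// while another flush is running is recorded instead of waiting on the lock,
// and the lock holder runs one more flush before returning.
class ChainingFlushSketch {
    private final ReentrantLock flushLock = new ReentrantLock();
    private final AtomicBoolean pending = new AtomicBoolean(false);

    void requestFlush() {
        pending.set(true);
        while (pending.get()) {
            if (flushLock.tryLock() == false) {
                return; // the current lock holder will see "pending" and flush again
            }
            try {
                if (pending.getAndSet(false)) {
                    // ... perform the actual flush ...
                }
            } finally {
                flushLock.unlock();
            }
        }
    }
}
```

That avoids blocking a flush-threadpool thread, at the cost of the extra bookkeeping, which is the complexity drawback mentioned above.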

@henningandersen
Contributor

henningandersen commented Aug 17, 2022

It does look like the old synced-flush would use waitIfOngoing=true, see here (and this PR). However, I wonder if we could let flush return a boolean indicating whether it attempted the flush or not, and if not, set active back to true?

A more straightforward solution could also be to move the setting of active to after indexing is done rather than before? This also affects scheduledRefresh, but I think it is fine not to consider an ongoing indexing request as activity there either. I think I'd prefer that option, unless I am overlooking something.

@fcofdez
Contributor

fcofdez commented Aug 18, 2022

A more straightforward solution could also be to move the setting of active to after indexing is done rather than before? This also affects scheduledRefresh, but I think it is fine not to consider an ongoing indexing request as activity there either. I think I'd prefer that option, unless I am overlooking something.

I'm not sure if this solves the issue? Theoretically we could still miss the last flush if the first flush is still running after the remaining documents have been indexed. But maybe I'm missing something here.

@henningandersen
Contributor

I'm not sure if this solves the issue? Theoretically we could still miss the last flush if the first flush is still running after the remaining documents have been indexed. But maybe I'm missing something here.

You might be right 🙂. The idea would be that by marking the shard active after the indexing has occurred, the next round of flushOnIdle would know that it needs to flush?

@kingherc
Contributor Author

Hi! Thanks for the awesome conversation. I think I agree with @fcofdez. If we follow that approach, @henningandersen, I do see a rare situation where:

  • There's an ongoing flush of docs [1,N].
  • A new doc N+1 is indexed and sets active = true.
  • The next flushOnIdle() is called, which sets active = false and tries
    another flush request, which fails because of the ongoing flush.
  • The ongoing flush finishes, but has missed document N+1.
  • Any next scheduled flushOnIdle() will not try flush requests since
    active = false.

Right?

However, I wonder if we could let flush return a boolean indicating whether it attempted the flush or not, and if not, set active back to true?

I think this is also a nice solution. I can try that if you agree.

@fcofdez
Contributor

fcofdez commented Aug 18, 2022

Just to add more context here, this only affects cases where we stop indexing after the latest flush is skipped (a rare edge case).

I think this is also a nice solution. I can try that if you agree.

👍 let's try that.

@henningandersen
Contributor

Makes sense, the active flag would need to be cleared differently too (inside flush) if we were to pursue the solution where we set active after indexing.

@kingherc kingherc force-pushed the test-failure/87888-flush-on-inactive branch from 460eb61 to e95404b on August 18, 2022 12:43
(The force-pushed commit message repeats the PR description above, and adds the following.)

Solution: if a flush request does not happen, revert active flag,
so that a next flush request can happen.

Fixes elastic#87888
@kingherc kingherc force-pushed the test-failure/87888-flush-on-inactive branch from e95404b to c5509c4 on August 18, 2022 12:45
@kingherc kingherc self-assigned this Aug 18, 2022
@kingherc
Contributor Author

Hi @fcofdez, @henningandersen. Thanks for the conversation. The approach where flush returns false if it does not wait for the ongoing flush, and we then set the active flag back to true, works. I did that -- feel free to review the PR.
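
For reference, a minimal sketch of that idea, reusing the simplified stand-in names from the sketch in the description (again, illustrative only, not the merged change):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the fix: flush reports whether it actually ran, and flushOnIdle
// restores the active flag when the request was skipped, so a later
// regularly scheduled idle check retries the flush.
class ShardSketchFixed {
    private final AtomicBoolean active = new AtomicBoolean(false);
    private final ReentrantLock flushLock = new ReentrantLock();

    void onIndexOperation() {
        active.set(true);
    }

    void flushOnIdle() {
        if (active.getAndSet(false)) {
            boolean flushed = flush(false /* waitIfOngoing */);
            if (flushed == false) {
                // A flush was already running: mark the shard active again so
                // the next scheduled idle check issues a new flush request.
                active.set(true);
            }
        }
    }

    boolean flush(boolean waitIfOngoing) {
        if (waitIfOngoing) {
            flushLock.lock();
        } else if (flushLock.tryLock() == false) {
            return false; // skipped because of an ongoing flush
        }
        try {
            // ... commit Lucene segments and trim the translog ...
            return true;
        } finally {
            flushLock.unlock();
        }
    }
}
```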

@fcofdez
Contributor

fcofdez commented Aug 18, 2022

It looks like there are some related test failures. Additionally, we usually avoid force-pushing, since GitHub gets confused and hides/removes some review comments.

@kingherc
Contributor Author

It looks like there are some related test failures. Additionally, we usually avoid force-pushing, since GitHub gets confused and hides/removes some review comments.

Fixed the test.

Oh, about the force pushes, I will avoid them from now on. Either way, the commits get squashed when merging the PR, so indeed I do not see a reason why I did it :)
Did any of your review comments get lost? If so, please mention them again.

@fcofdez
Contributor

fcofdez commented Aug 18, 2022

Oh, about the force pushes, I will avoid them from now on. Either way, the commits get squashed when merging the PR, so indeed I do not see a reason why I did it :)

No worries, it's just a trade-off: sometimes it's easier to review a set of clean commits. But I'm not sure GitHub will ever fix the force-push issue 🤔

Contributor

@henningandersen henningandersen left a comment

This direction looks good to me. I have a few comments that I'd like to see addressed, though.

Contributor

@fcofdez fcofdez left a comment

I left a small comment; the direction looks good.

And fix some PR review feedback
Contributor

@henningandersen henningandersen left a comment

I think there is a problem with the new test; otherwise this looks good.

kingherc and others added 2 commits August 22, 2022 11:07
Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
Contributor

@henningandersen henningandersen left a comment

LGTM.

Contributor

@fcofdez fcofdez left a comment

LGTM 👍

Fix some javadoc
@kingherc
Contributor Author

@elasticmachine run elasticsearch-ci/part-1 please

@kingherc kingherc merged commit 824bfd0 into elastic:main Aug 22, 2022
@kingherc kingherc deleted the test-failure/87888-flush-on-inactive branch August 22, 2022 15:12
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Meta label for distributed team >test-failure Triaged test failures from CI v8.5.0
Development

Successfully merging this pull request may close these issues.

[CI] FlushIT testFlushOnInactive failing
4 participants