
Force Refresh Listeners when Acquiring all Operation Permits #36835

Merged

Conversation

@original-brownbear (Member) commented Dec 19, 2018

  • Fixes the issue reproduced in the added tests:
    • When there are open index requests on a shard that are waiting for a refresh, relocating that
      shard blocks until that refresh happens (which may be never, as in the test scenario).
  • Fixed by:
    • Before trying to acquire all permits for relocation, refresh if there are outstanding operations

PS: I ran the added tests for a few thousand runs without trouble.
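
For context, the hang can be reproduced roughly as follows (a sketch in the style of an ESIntegTestCase, not the added test verbatim; the index name, node names, and single-shard setup are illustrative assumptions):

    // Disable automatic refreshes, so a parked refresh=wait_for listener can
    // only ever be released by an explicit refresh.
    client().admin().indices().prepareCreate("test")
        .setSettings(Settings.builder()
            .put("index.number_of_shards", 1)
            .put("index.refresh_interval", -1))
        .get();

    // This request registers a refresh listener and will not complete until
    // the shard refreshes, which in this setup never happens on its own.
    ActionFuture<IndexResponse> pendingIndex = client().prepareIndex("test", "_doc")
        .setSource("field", "value")
        .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL)
        .execute();

    // Relocating the shard requires acquiring all operation permits, but the
    // pending index request still holds one; before this fix the relocation
    // therefore blocked indefinitely.
    client().admin().cluster().prepareReroute()
        .add(new MoveAllocationCommand("test", 0, "node1", "node2"))
        .get();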

@original-brownbear original-brownbear added >bug, :Distributed/Distributed, v7.0.0, v6.6.0 labels Dec 19, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

     * @throws InterruptedException if calling thread is interrupted
     * @throws TimeoutException if timed out waiting for in-flight operations to finish
     * @throws IndexShardClosedException if operation permit has been closed
     */
    <E extends Exception> void blockOperations(
            final long timeout,
            final TimeUnit timeUnit,
            final CheckedRunnable<E> onActiveOperations,
@original-brownbear (Member, Author) commented on this diff:

It seems the only production use case for this method is in relocation, so I admit it's a little noisy to add this kind of general callback here, but it still seems like the smallest possible change to get a hook to run the refresh conditionally (after preventing new operations from piling on more waits concurrently).
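
For illustration, the call site would then look roughly like this (a sketch; the hook is assumed to run once new operations are blocked but in-flight ones still hold permits, and runRelocationHandoff() is a hypothetical stand-in for the existing onBlocked logic):

    indexShardOperationPermits.blockOperations(30, TimeUnit.MINUTES,
        () -> {
            // onActiveOperations: operations parked on refresh=wait_for are
            // released here, so they can complete and drain their permits.
            if (refreshListeners.refreshNeeded()) {
                refresh("relocated");
            }
        },
        () -> {
            // onBlocked: runs once all permits have been acquired.
            runRelocationHandoff();
        });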

@ywelsch (Contributor) left a comment:

This still has a race, I think. The issue is that there's no guarantee that the refresh will happen after all pending requests have registered a refresh listener. To ensure this, we need multiple steps: first, ensure that no new listeners are registered (this can be achieved by effectively setting the maximum number of refresh listeners, getMaxRefreshListeners, to 0), and then do a manual refresh to free all existing listeners. There is no need, I think, to inline all this into blockOperations; it can be done before calling this method.
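
That is, roughly the following sequence (a sketch; disallowAdd()/allowAdd() stand for whatever mechanism caps the listener count at 0 and later restores it):

    // Step 1: refuse any new refresh listeners, so no request can park itself
    // between our manual refresh and the permit acquisition below.
    refreshListeners.disallowAdd();
    try {
        // Step 2: a manual refresh releases every listener already parked.
        if (refreshListeners.refreshNeeded()) {
            refresh("relocated");
        }
        // Only now is it safe to block all operation permits.
        indexShardOperationPermits.blockOperations(timeout, timeUnit, onBlocked);
    } finally {
        refreshListeners.allowAdd();
    }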

@original-brownbear (Member, Author) commented:

@ywelsch alright, yours is a much better plan :) => reverted my approach and implemented it in f40c6a6 (sorry for the accidental rebase)

@ywelsch (Contributor) left a comment:

Thanks @original-brownbear. The concurrency looks better. I think we need to extend this to all actions that can possibly acquire all operation permits. In particular, I think this might also cause problems on replicas, e.g. when a replica learns of a new primary and tries to bump its term (see IndexShard#bumpPrimaryTerm). If it then has a refresh=wait_for op waiting (from the old primary), it will run into the same issue, and indefinitely stop accepting any writes from the new primary.

        try {
            if (refreshListeners.refreshNeeded()) {
                refresh("relocated");
@ywelsch (Contributor) commented on this diff:

As we always want to do the refresh after calling disallowAdd, I wonder if we should combine both into one method

@original-brownbear (Member, Author) replied:

I couldn't find a neat way of doing that, since we have both the async case and the blocking case of acquiring all the permits, and we now want to enforce try-finally semantics for allowing the listeners again in both. I'm not sure it actually makes things more readable if we hide the handling of exceptions from refresh(...) in some other method. I can try finding a nice way though :)
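
One shape that keeps try-finally semantics for both paths is to have the combined method hand back a Releasable, roughly (a sketch of the idea, not necessarily what daa9fc7 ended up doing):

    // Combine disallowAdd() and the forced refresh into one method that
    // returns a Releasable which re-enables listener registration; both the
    // blocking and the async permit-acquisition paths can then wrap it in
    // try/finally (or try-with-resources).
    public Releasable forceRefreshes() {
        disallowAdd();                 // no new refresh listeners may register
        if (refreshNeeded()) {
            try {
                forceRefresh.run();    // release every listener that is parked
            } catch (Exception e) {
                allowAdd();            // never leave registration disabled on failure
                throw e;
            }
        }
        return this::allowAdd;         // caller re-enables when done
    }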

@original-brownbear (Member, Author) commented:

@ywelsch all points addressed I think => should be good for another review.

> I think we need to extend this to all actions that can possibly acquire all operation permits.

Done, I wrapped all the cases of acquiring all permits that I could find. It seems, though, that IndexShard#bumpPrimaryTerm was the only production-code use case; the other two methods where I added the logic seem to only be called from tests.
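
The wrapping follows the same pattern in each case; sketched below for the async variant (listener plumbing simplified, not the exact merged code):

    private void asyncBlockOperations(ActionListener<Releasable> onPermitAcquired,
                                      long timeout, TimeUnit timeUnit) {
        // Force out pending refresh listeners before blocking all permits,
        // and re-enable registration once acquisition finishes, whether it
        // succeeds or fails.
        final Releasable forceRefreshes = refreshListeners.forceRefreshes();
        final ActionListener<Releasable> wrapped = ActionListener.wrap(
            releasable -> {
                forceRefreshes.close();   // listeners may register again
                onPermitAcquired.onResponse(releasable);
            },
            e -> {
                forceRefreshes.close();
                onPermitAcquired.onFailure(e);
            });
        try {
            indexShardOperationPermits.asyncBlockOperations(wrapped, timeout, timeUnit);
        } catch (Exception e) {
            forceRefreshes.close();       // acquisition never started
            throw e;
        }
    }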

@ywelsch (Contributor) left a comment:

I've pushed daa9fc7, which simplifies the code imho. We will also need unit tests for the RefreshListeners class.

@original-brownbear (Member, Author) commented:

Ok thanks, I'll add some tests :)
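
For a rough idea of the shape such a test could take (a hedged sketch; helper names like listeners, index(...), and DummyRefreshListener mirror the style of the existing RefreshListenersTests and are assumptions here, not the actual test added):

    // Sketch: while forceRefreshes() is held, no new refresh listener may be
    // parked; addOrNotify is expected to fire the listener immediately and
    // return true instead of registering it to wait for a later refresh.
    public void testNoParkedListenersWhileForcingRefreshes() throws Exception {
        try (Releasable ignored = listeners.forceRefreshes()) {
            DummyRefreshListener listener = new DummyRefreshListener();
            assertTrue(listeners.addOrNotify(index("1").getTranslogLocation(), listener));
        }
    }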

@original-brownbear original-brownbear requested review from ywelsch and removed request for ywelsch December 28, 2018 12:22
@original-brownbear (Member, Author) commented:

@ywelsch ok, fixed :) Added the test in 71bef56. Should be good for review now.

@ywelsch (Contributor) left a comment:

LGTM

                throw e;
            }
        }
        return () -> runOnce.run();
@ywelsch (Contributor) commented on this diff:

maybe we could assert refreshListeners == null before this line?
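
That is, something along these lines (sketch):

    // After the forced refresh, every parked listener should have fired, so
    // the internal listener list is expected to be empty (null) at this point.
    assert refreshListeners == null;
    return () -> runOnce.run();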

@ywelsch (Contributor) commented Dec 28, 2018

Please adapt PR title to make it clear this is not only for relocations. For example: "Force refresh listeners when acquiring all operation permits"

@original-brownbear original-brownbear changed the title RELOCATION:Fix Indef. Block when Wait on Refresh Force Refresh Listeners when Acquiring all Operation Permits Dec 28, 2018
@original-brownbear original-brownbear merged commit 4ac8fc6 into elastic:master Dec 28, 2018
@original-brownbear original-brownbear deleted the relocation-refresh-fix branch December 28, 2018 15:42
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 28, 2018
…#36835)

original-brownbear added a commit that referenced this pull request Dec 28, 2018
* Force Refresh Listeners when Acquiring all Operation Permits (#36835)

@tlrx tlrx mentioned this pull request Jan 3, 2019
@tlrx (Member) commented Jan 7, 2019

Thanks for fixing this @original-brownbear !

@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019