
Allow force-merges to run in parallel on a node #69416

Merged
merged 7 commits (Feb 25, 2021)

Conversation

@ywelsch ywelsch commented Feb 23, 2021

Increasing the number of threads used for force-merging does not automatically give you any parallelism, even if
you have many shards per node: force-merge requests are split into node-level subrequests (see
TransportBroadcastByNodeAction, the superclass of TransportForceMergeAction), one per node, and each
subrequest then executes sequentially over all the shards on that node.

Closes #69327
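The fix is essentially to fan each node-level subrequest out across the force-merge thread pool instead of looping over the node's shards on one thread. A minimal standalone sketch of that pattern (hypothetical names such as `mergeAllShards` and `shardOperation`; this is not the actual TransportBroadcastByNodeAction code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelShardForceMerge {

    // Hypothetical per-shard operation: in the real action this would call
    // IndexShard#forceMerge; here it just returns the shard id.
    static Integer shardOperation(int shardId) {
        return shardId;
    }

    // Submit one task per shard onto a bounded "force-merge" pool instead of
    // running the shards sequentially, so a node with many shards can merge
    // up to poolSize of them concurrently.
    static List<Integer> mergeAllShards(List<Integer> shardIds, int poolSize) throws Exception {
        ExecutorService forceMergePool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int shardId : shardIds) {
                futures.add(forceMergePool.submit(() -> shardOperation(shardId)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // collect per-shard results, propagating failures
            }
            return results;
        } finally {
            forceMergePool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(mergeAllShards(List.of(0, 1, 2, 3), 2));
    }
}
```

The pool size bounds concurrency the same way the FORCE_MERGE thread pool does: submitting more shards than threads queues the excess rather than oversubscribing the node.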

@ywelsch ywelsch added >enhancement :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v8.0.0 v7.13.0 >bug and removed >enhancement labels Feb 23, 2021
@ywelsch ywelsch marked this pull request as ready for review February 23, 2021 14:59
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Feb 23, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner left a comment

Looks good, I left a few small comments.

                           ActionListener<TransportBroadcastByNodeAction.EmptyResult> listener) {
    threadPool.executor(ThreadPool.Names.FORCE_MERGE).execute(ActionRunnable.run(listener,
        () -> {
            if (task instanceof CancellableTask && ((CancellableTask) task).isCancelled()) {
Contributor:
task is never a CancellableTask here is it?

Contributor Author:
Not right now. But the base class already had a general integration of cancellation, and my change here could introduce a pitfall for cancelling force-merges if we ever activate cancellation at the BroadcastRequest level (which is my plan), so I decided to keep the check.

Contributor:
I think I'd prefer assert (task instanceof CancellableTask) == false for now, otherwise we'll forget to remove this if when a force-merge shard task becomes cancellable and then we'll wonder why it's here some years later.
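As a standalone illustration of the suggested pattern (hypothetical types, not the actual Elasticsearch classes): asserting the currently-impossible case makes it fail loudly in tests (run with `-ea`) the day force-merge shard tasks become cancellable, instead of being silently handled.

```java
public class CancellableGuard {

    interface Task {}

    interface CancellableTask extends Task {
        boolean isCancelled();
    }

    // The shard-level operation entry point. Today no CancellableTask can
    // reach it, so assert that assumption rather than handling a dead branch.
    static void performOperation(Task task, Runnable shardOperation) {
        assert (task instanceof CancellableTask) == false
            : "shard task became cancellable; replace this assertion with real cancellation handling";
        shardOperation.run();
    }

    public static void main(String[] args) {
        performOperation(new Task() {}, () -> System.out.println("merged shard"));
    }
}
```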

List<BroadcastShardOperationFailedException> accumulatedExceptions = new ArrayList<>();
List<ShardOperationResult> results = new ArrayList<>();
for (int i = 0; i < totalShards; i++) {
    if (shardResultOrExceptions[i] instanceof BroadcastShardOperationFailedException) {
        accumulatedExceptions.add((BroadcastShardOperationFailedException) shardResultOrExceptions[i]);
        if (shardResultOrExceptions[i] instanceof TaskCancelledException) {
Contributor:
Why not just check the task for cancellation?

Contributor Author:
++, changed in b0cc8f3

logger.trace("[{}] executing operation for shard [{}]", actionName, shardRouting.shortSummary());
}
final Consumer<Exception> failureHandler = e -> {
    if (e instanceof TaskCancelledException) {
Contributor:
If we just checked the task for cancellation in finishHim, I think we wouldn't need this special case handling: we could just treat it like any other shard-level exception.
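The simplification being suggested can be sketched like this (hypothetical standalone types, not the actual Elasticsearch code): check the task once when all shard results are in, instead of special-casing a cancellation exception on every per-shard failure path.

```java
import java.util.List;

public class FinishHimSketch {

    interface CancellableTask {
        boolean isCancelled();
    }

    static class TaskCancelledException extends RuntimeException {
        TaskCancelledException(String message) {
            super(message);
        }
    }

    // A single cancellation check at response-building time replaces the
    // per-shard TaskCancelledException handling; shard-level failures are
    // otherwise accumulated like any other exception.
    static List<Object> finishHim(CancellableTask task, List<Object> shardResultOrExceptions) {
        if (task != null && task.isCancelled()) {
            throw new TaskCancelledException("parent task was cancelled");
        }
        return shardResultOrExceptions; // normal per-shard accounting continues
    }

    public static void main(String[] args) {
        System.out.println(finishHim(() -> false, List.of("shard-0-result")));
    }
}
```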

Contributor Author:
++, fixed in b0cc8f3

@DaveCTurner DaveCTurner left a comment

LGTM, I left one optional request

                           ActionListener<TransportBroadcastByNodeAction.EmptyResult> listener) {
    threadPool.executor(ThreadPool.Names.FORCE_MERGE).execute(ActionRunnable.run(listener,
        () -> {
            if (task instanceof CancellableTask && ((CancellableTask) task).isCancelled()) {
Contributor:
I think I'd prefer assert (task instanceof CancellableTask) == false for now, otherwise we'll forget to remove this if when a force-merge shard task becomes cancellable and then we'll wonder why it's here some years later.

@ywelsch ywelsch merged commit e768605 into elastic:master Feb 25, 2021
ywelsch added a commit that referenced this pull request Feb 25, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 1, 2021
DaveCTurner added a commit that referenced this pull request Mar 1, 2021
DaveCTurner added a commit that referenced this pull request Mar 1, 2021
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed Meta label for distributed team v7.13.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Force-merges do not parallelize well at the node level
4 participants