Speed up ReplicationTracker Logic on Data Nodes a Little #79837

Open · original-brownbear wants to merge 19 commits into main

Conversation

@original-brownbear (Member)

This stuff shows up a little in profiling when the data-node holds a lot of shards.
Mainly it's about avoiding extra loops in calculateReplicationGroup (the biggest cost in profiling), streams allocated just to iterate two steps, and some other indirection that starts to hurt when executed over thousands of shards.
Also, we were submitting a lot of empty tasks to the generic pool at times, which temporarily showed up in profiling.
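(For illustration only, a hypothetical sketch of the "empty task" pattern described above, not the actual diff; the pool, the task list, and all names below are made up.)

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: check for pending work on the caller thread instead of
// unconditionally forking a task that may find nothing to do.
public class SkipEmptyTasks {
    public static void main(String[] args) {
        ExecutorService genericPool = Executors.newCachedThreadPool();
        List<Runnable> pendingActions = List.of(); // frequently empty in practice

        if (pendingActions.isEmpty() == false) {   // only pay for the task submission when needed
            genericPool.execute(() -> pendingActions.forEach(Runnable::run));
        }
        genericPool.shutdown();
    }
}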

I think we can make bigger improvements here and run this logic more selectively; it seems a non-trivial number of no-ops are executed here.

relates #77466

@original-brownbear added the >non-issue, :Distributed/Recovery, and v8.0.0 labels on Oct 26, 2021
@elasticmachine added the Team:Distributed label on Oct 26, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen (Contributor) left a comment

Can you elaborate on the potential benefit of these changes? I think most of this is on what I would normally consider the "slow-is-ok" path, i.e., a place where code clarity normally trumps micro-optimizations.

The bit in PendingReplicationActions is clearly correct and just as readable, so that part is good to go separately, I think.

this.routingTable = routingTable;
this.inSyncAllocationIds = inSyncAllocationIds;
this.trackedAllocationIds = trackedAllocationIds;
this.version = version;

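// in-sync allocation IDs that currently have no assigned shard copy in the routing table: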
this.unavailableInSyncShards = Sets.difference(inSyncAllocationIds, routingTable.getAllAllocationIds());
@henningandersen (Contributor)

I wonder if the overhead here is really noticeable, since the inSyncAllocationIds would typically be just 2-3 entries?

@original-brownbear (Member, Author)

This is the profiling for a data node where nothing really changes in terms of shards at the point of taking the profile (though there were lots of CS updates at the time of creating the profile):

[profiling screenshot]

The code adjusted here is, in aggregate, about as expensive as building RoutingNodes, which surprised me given that building routing nodes is itself quite expensive in a cluster with close to 100k shards. I highlighted the relative importance of this particular line.
I think the issue here is mainly how badly all this high-level stream machinery inlines. It's also particularly unfortunate for 2-3 entry sets, because setting up the stream etc. takes more time than just doing the iteration/hashing/equals checks, and it also wastes CPU cycles on the GC side of things.
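(To make that point concrete, here is a hypothetical, simplified comparison, not the PR's code: both methods compute the same answer, but for a two- or three-entry set the stream version pays for pipeline setup and lambda allocation before it does any of the actual hashing/equals work.)

import java.util.Set;

// Hypothetical illustration: the same membership check written with a stream and
// with a plain loop. For tiny sets the fixed cost of building the stream pipeline
// dominates the comparisons the check actually needs, and the extra allocations
// feed the GC.
public class TinySetCheck {

    static boolean allTrackedStream(Set<String> tracked, Set<String> inSync) {
        return tracked.stream().allMatch(inSync::contains);
    }

    static boolean allTrackedLoop(Set<String> tracked, Set<String> inSync) {
        for (String id : tracked) {              // no stream or lambda objects allocated
            if (inSync.contains(id) == false) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> tracked = Set.of("a1", "a2");
        Set<String> inSync = Set.of("a1", "a2", "a3");
        System.out.println(allTrackedStream(tracked, inSync)); // true
        System.out.println(allTrackedLoop(tracked, inSync));   // true
    }
}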

@henningandersen (Contributor)

About the specific line, perhaps we could optimize Sets.difference by not using streams?

Other than that, a big contribution here seems to be the synchronized block (which may admittedly improve once some of the inner workings are reduced). Can you share the full flame graph too?

> I think we can make bigger improvements here and run this logic more selectively; it seems a non-trivial number of no-ops are executed here.

I would be interested in this; I think you mean to shortcut this logic at a higher level? If we do that, I think this PR becomes obsolete, so perhaps it is worth looking into that first?

@original-brownbear (Member, Author)

The full profile is this:

[full flame graph screenshot]

(sorry, I couldn't find a way to upload the HTML here for the life of me; GH doesn't even allow a zip of it. Happy to provide it on another channel.)

This is not a node under heavy load, but it's quite jittery in terms of how long each individual CS application takes, which I blame at least in part on this code. Since the sync blocks also have some contention with other functionality, I think we should be careful about the performance here, because the side-effects of slowness aren't entirely trivial to predict.

> I would be interested in this; I think you mean to shortcut this logic at a higher level? If we do that, I think this PR becomes obsolete, so perhaps it is worth looking into that first?

This I could code up in 20 minutes, and it at least eliminates a little of the data-node jitter in CS application :) (note that the code here is about 25% of the total CS applier time). Fixing this at a higher level will be trickier; at least it seemed that way when I tried. I agree that it makes fixes here less relevant, though I still think it's nice not to have any obviously inefficient code in sync blocks that the applier thread has to enter.

> About the specific line, perhaps we could optimize Sets.difference by not using streams?

We could do that. This is the only hot user of that code that I could find in what we are currently benchmarking, though, and this use case is very specific due to the tiny sets involved that make the current implementation so unexpectedly inefficient.
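(For reference, a stream-free variant of the kind being discussed could look like the sketch below. This is hypothetical, not the actual Sets.difference implementation; it just shows the shape a plain-loop version would take.)

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a loop-based set difference (elements of left that are
// not in right), avoiding the stream pipeline entirely.
public class LoopSetDifference {

    static <T> Set<T> difference(Set<T> left, Set<T> right) {
        Set<T> result = new HashSet<>();
        for (T item : left) {                    // single pass, no intermediate stream objects
            if (right.contains(item) == false) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> inSyncAllocationIds = Set.of("a1", "a2", "a3");
        Set<String> assignedAllocationIds = Set.of("a1", "a3");
        // in-sync IDs with no assigned copy -> [a2]
        System.out.println(difference(inSyncAllocationIds, assignedAllocationIds));
    }
}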

@original-brownbear (Member, Author)

Jenkins run elasticsearch-ci/part-2 (known issue)

@original-brownbear (Member, Author)

@elasticmachine update branch

@original-brownbear (Member, Author)

@elasticmachine update branch

@original-brownbear (Member, Author)

@elasticmachine update branch

@original-brownbear (Member, Author)

Jenkins run elasticsearch-ci/part-2 (known issue)

@arteam added v8.1.0 and removed v8.0.0 labels on Jan 12, 2022
@mark-vieira added v8.2.0 and removed v8.1.0 labels on Feb 2, 2022
@original-brownbear (Member, Author)

@elasticmachine update branch

@elasticsearchmachine changed the base branch from master to main on July 22, 2022, 23:10
@mark-vieira added v8.5.0 and removed v8.4.0 labels on Jul 27, 2022
@tlrx (Member) commented on Aug 22, 2022

@original-brownbear do you think this is worth pursuing? This pull request has been open since October.

@original-brownbear (Member, Author)

@tlrx yeah, I'd still love a review from @henningandersen here ;) This one would still save tons of cycles on large data nodes.

Labels
:Distributed/Recovery, >non-issue, Team:Distributed, v8.15.0