Speed up ReplicationTracker Logic on Data Nodes a Little #79837
Conversation
This stuff shows up a little in profiling when the data node holds a lot of shards.
Pinging @elastic/es-distributed (Team:Distributed)
Can you elaborate on the potential benefit of these changes? I think most of this is on what I would normally consider the "slow-is-ok" path, i.e., a place where code clarity normally trumps micro-optimizations.
The bits in PendingReplicationActions are easily correct and just as readable, so those are good to go separately I think.
```java
this.routingTable = routingTable;
this.inSyncAllocationIds = inSyncAllocationIds;
this.trackedAllocationIds = trackedAllocationIds;
this.version = version;

this.unavailableInSyncShards = Sets.difference(inSyncAllocationIds, routingTable.getAllAllocationIds());
```
I wonder if the overhead here is really noticeable, since inSyncAllocationIds would typically be just 2-3 entries?
This is the profiling for a data node where nothing really changes in terms of shards at the point of taking the profile (though there were lots of CS updates at the time of creating the profile):
The code adjusted here is in aggregate as expensive as building RoutingNodes, which I found surprising given that building RoutingNodes is itself quite expensive in a cluster with close to 100k shards. I highlighted the relative importance of this particular line.
I think the issue here is mainly how badly all this high-level stream stuff inlines. It's also particularly unfortunate for 2-3 entry sets, because setting up the stream takes more time than just doing the iteration/hashing/equals checks, and it wastes additional CPU cycles on the GC end of things.
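To make that cost concrete, here is a minimal sketch of a stream-based set difference, assuming the utility under discussion is implemented along these lines (an assumption for illustration, not the actual org.elasticsearch.common.util.set.Sets code):

```java
import java.util.Set;
import java.util.stream.Collectors;

final class StreamDifferenceSketch {
    // Stream-based set difference. Each call allocates the Stream, the
    // filter pipeline, and a Collector before a single element is examined;
    // for a 2-3 entry set that setup work dominates the actual contains()
    // checks and feeds the GC on every invocation.
    static <T> Set<T> difference(Set<T> left, Set<T> right) {
        return left.stream().filter(e -> right.contains(e) == false).collect(Collectors.toSet());
    }
}
```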
About the specific line, perhaps we could optimize Sets.difference by not using streams?
Other than that, a big contribution here seems to be the synchronized block (which may admittedly improve once some of the inner workings are slimmed down). Can you share the full flame graph too?
I think we can make bigger improvements here and run this logic more selectively; it seems a non-trivial number of noops are executed here.
I would be interested in this; I think you mean to shortcut this logic at a higher level? If we do that, I think this PR becomes obsolete, and perhaps it is worth looking into this first?
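For reference, such a higher-level shortcut might look roughly like the following minimal sketch; the method name mirrors ReplicationTracker.updateFromMaster, but the fields, types, and body here are assumptions for illustration, not the actual code:

```java
import java.util.Set;

final class TrackerShortcutSketch {
    private Object routingTable; // stands in for IndexShardRoutingTable
    private Set<String> inSyncAllocationIds;
    private Object replicationGroup;

    // Hypothetical shape of the "shortcut at a higher level" idea: bail out
    // before recomputing the replication group when neither input changed,
    // turning a no-op cluster-state update into a cheap early return.
    synchronized void updateFromMaster(Object newRoutingTable, Set<String> newInSyncIds) {
        if (newRoutingTable.equals(routingTable) && newInSyncIds.equals(inSyncAllocationIds)) {
            return; // nothing changed, skip the whole recomputation
        }
        routingTable = newRoutingTable;
        inSyncAllocationIds = newInSyncIds;
        replicationGroup = calculateReplicationGroup();
    }

    private Object calculateReplicationGroup() {
        return new Object(); // placeholder for the real (expensive) computation
    }
}
```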
The full profile is this:
(sorry, I couldn't find a way to upload the HTML here for the life of me; GH doesn't even allow a zip of it. Happy to provide it on another channel.)
This is not a node under heavy load, but it's quite jittery in terms of how long each individual CS application takes, which I blame at least in part on this code. Since the sync blocks also have some contention with other functionality, I think we should be careful about performance here, because the side effects of slowness aren't entirely trivial to predict.
I would be interested in this; I think you mean to shortcut this logic at a higher level? If we do that, I think this PR becomes obsolete, and perhaps it is worth looking into this first?
This I could code up in 20 minutes, and it at least eliminates a little of the data node jitter in CS application :) (note that the code here is about 25% of the total CS applier time). Fixing this at a higher level will be trickier; at least it seemed that way when I tried. I agree that it makes fixes here less relevant, though I still think it's nice to not have any obviously inefficient code in sync blocks that the applier thread has to enter.
About the specific line, perhaps we could optimize Sets.difference by not using streams?
We could do that. This is the only hot user of that code I could find in what we are currently benchmarking, though, and this use case is very specific: the tiny sets involved are what make the current implementation so unexpectedly inefficient.
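A loop-based variant of the same operation, as a minimal sketch of the suggested direction (again an assumption, not the actual Sets implementation):

```java
import java.util.HashSet;
import java.util.Set;

final class LoopDifferenceSketch {
    // Plain-loop set difference: only the result set is allocated, and the
    // loop body inlines well even for tiny inputs, so the per-call setup
    // cost of the stream pipeline disappears entirely.
    static <T> Set<T> difference(Set<T> left, Set<T> right) {
        final Set<T> result = new HashSet<>(left.size());
        for (T e : left) {
            if (right.contains(e) == false) {
                result.add(e);
            }
        }
        return result;
    }
}
```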
Jenkins run elasticsearch-ci/part-2 (known issue)
@elasticmachine update branch
@elasticmachine update branch
@elasticmachine update branch
Jenkins run elasticsearch-ci/part-2 (known issue)
@elasticmachine update branch
@original-brownbear do you think this is worth pursuing? This pull request has been open since October.
@tlrx yea, I'd still love a review from @henningandersen here ;) this one would still save tons of cycles on large data nodes.
This stuff shows up a little in profiling when the data node holds a lot of shards.
Mainly just avoiding extra loops in calculateReplicationGroup (the biggest cost in profiling), allocating streams to iterate two steps, and some other indirection that starts to hurt when executed over thousands of shards. Also, we were submitting a lot of empty tasks to the generic pool at times, which temporarily showed up in profiling.
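As an illustration of the empty-task point, a guard along these lines avoids scheduling no-op work; the names here are hypothetical, not the actual PendingReplicationActions code:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;

final class CancelSketch {
    // Only fork to the generic pool when there is actually work to run;
    // previously a no-op task could still be scheduled, wasting a pool
    // slot and showing up in profiling when repeated across many shards.
    static void cancelAll(ExecutorService genericPool, List<Runnable> onCancelCallbacks) {
        if (onCancelCallbacks.isEmpty()) {
            return; // skip scheduling an empty task
        }
        genericPool.execute(() -> onCancelCallbacks.forEach(Runnable::run));
    }
}
```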
I think we can make bigger improvements here and run this logic more selectively; it seems a non-trivial number of noops are executed here.
relates #77466