SOLR-17076: Optimize `OrderedNodePlacementPlugin#getAllReplicasOnNode` #2076

patsonluk · 2023-11-15T00:46:13Z

https://issues.apache.org/jira/browse/SOLR-17076

Description

OrderedNodePlacementPlugin#getAllReplicasOnNode can be slow in a cluster with a lot of replicas. The effect can compound when creating a new collection with many shards.

Solution

Introduced a new field, allReplicas, which keeps track of all the replicas added to or removed from the plugin. getAllReplicasOnNode now returns a copy of that set instead of recomputing it on every call.

Also added a new convenience method, getAllReplicaCount, for callers that only need the count.
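For readers who want to see the idea in isolation, here is a minimal sketch of the bookkeeping described above. It is not the Solr class: `NodeReplicaTracker`, the `String` identifiers, and the `collection/shard/replica` key format are all assumptions made for illustration. Only the names `allReplicas`, `getAllReplicasOnNode`, and `getAllReplicaCount` come from this PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the plugin's per-node state; not the actual Solr class.
final class NodeReplicaTracker {
  // collection -> shard -> replica names: the nested structure that was flattened on every call.
  private final Map<String, Map<String, Set<String>>> replicas = new HashMap<>();
  // Flat set kept in sync with the nested map, so no traversal is needed to list or count replicas.
  private final Set<String> allReplicas = new HashSet<>();

  // Assumed key format, used only so that equal replica names in different shards stay distinct.
  private static String key(String collection, String shard, String replica) {
    return collection + "/" + shard + "/" + replica;
  }

  void addReplica(String collection, String shard, String replica) {
    boolean added =
        replicas
            .computeIfAbsent(collection, c -> new HashMap<>())
            .computeIfAbsent(shard, s -> new HashSet<>())
            .add(replica);
    if (added) {
      allReplicas.add(key(collection, shard, replica));
    }
  }

  void removeReplica(String collection, String shard, String replica) {
    Map<String, Set<String>> shards = replicas.get(collection);
    Set<String> reps = shards == null ? null : shards.get(shard);
    if (reps != null && reps.remove(replica)) {
      allReplicas.remove(key(collection, shard, replica));
    }
  }

  // Returns a copy, so callers keep the same "independent set" behavior as before.
  Set<String> getAllReplicasOnNode() {
    return new HashSet<>(allReplicas);
  }

  // No copy at all when only the size is needed.
  int getAllReplicaCount() {
    return allReplicas.size();
  }
}
```

The trade-off is a small amount of extra work on every add and remove in exchange for avoiding a full traversal of the nested map on every read.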

Tests

A minor change; relies on the existing unit test cases.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

patsonluk · 2023-11-15T17:43:34Z

Got this while running ./gradlew check locally

java.lang.AssertionError: Doc Counts do not add up expected:<67> but was:<66>
   >         at __randomizedtesting.SeedInfo.seed([5A95C1706F927E37:D2C1FEAAC16E13CF]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.failNotEquals(Assert.java:835)
   >         at org.junit.Assert.assertEquals(Assert.java:647)
   >         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.assertDocCounts(AbstractFullDistribZkTestBase.java:1851)
   >         at org.apache.solr.cloud.AbstractBasicDistributedZk2TestBase.test(AbstractBasicDistributedZk2TestBase.java:109)

...

ERROR: The following test(s) have failed:
  - org.apache.solr.cloud.BasicDistributedZk2Test.test (:solr:core)
    Test output: /Users/patson/src/cowpath-solr/solr/core/build/test-results/test/outputs/OUTPUT-org.apache.solr.cloud.BasicDistributedZk2Test.txt
    Reproduce with: gradlew :solr:core:test --tests "org.apache.solr.cloud.BasicDistributedZk2Test.test" -Ptests.jvms=5 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=5A95C1706F927E37 -Ptests.file.encoding=ISO-8859-1

However, re-running ./gradlew check with -Ptests.seed=5A95C1706F927E37 no longer reproduces the failure.

magibney

Suggested a couple of changes, but overall this looks good to me!

magibney · 2023-11-15T20:48:11Z

solr/core/src/java/org/apache/solr/cluster/placement/plugins/OrderedNodePlacementPlugin.java

@@ -515,6 +521,7 @@ public final void removeReplica(Replica replica) {
          });
      if (hasReplica.get()) {
        removeProjectedReplicaWeights(replica);
+        allReplicas.remove(replica);


Now that we're separately tracking replicas in a top-level map (and assuming that this is all accessed by a single thread), I think this method could be simplified to:

if (allReplicas.remove(replica)) { // [the nested removal logic] }

Good catch @magibney! I'm wondering if these 2 statements should just be placed in

solr/solr/core/src/java/org/apache/solr/cluster/placement/plugins/OrderedNodePlacementPlugin.java

Line 516 in d890246

hasReplica.set(true);

directly?

Perhaps I have overlooked some edge cases in which using the AtomicBoolean hasReplica was necessary?

After I read the code again, I'm not 100% sure about the original purpose of the hasReplica flag.

However, it does make the logic a bit more isolated and easier to follow: iterate the list first, figure out if the replica exists, and if it does, deal with it AFTER finishing the traversal. At a logical level, it's pretty clean and avoids any concurrent modification exception (though the map/list being iterated on is a private field anyway).

Hm... I have simply moved the allReplicas.remove call to within the if (reps.remove(replica)) block now, because I do think it's best if the operations (add/remove) on the 2 collections stay close to each other.

Anyway, I think these are very minor concerns? 😊 Either way is fine.
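The two removal styles discussed in this thread can be compared side by side in a small sketch. The shape of the original method (a `computeIfPresent` lambda that sets an `AtomicBoolean`) is an assumption reconstructed from the diff context above. `RemovalSketch`, the `String` identifiers, and the `weightUpdates` list (standing in for `removeProjectedReplicaWeights`) are hypothetical. The sketch also assumes single-threaded access and replica ids that are unique per node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the removal logic; not the Solr implementation.
final class RemovalSketch {
  final Map<String, Set<String>> repsByShard = new HashMap<>();
  final Set<String> allReplicas = new HashSet<>();
  // Stand-in for removeProjectedReplicaWeights: records which replicas triggered follow-up work.
  final List<String> weightUpdates = new ArrayList<>();

  void add(String shard, String replica) {
    if (repsByShard.computeIfAbsent(shard, s -> new HashSet<>()).add(replica)) {
      allReplicas.add(replica);
    }
  }

  // Variant A: both collections are updated together inside the lambda,
  // and the follow-up work runs after the traversal, driven by the flag.
  boolean removeWithFlag(String shard, String replica) {
    AtomicBoolean hasReplica = new AtomicBoolean(false);
    repsByShard.computeIfPresent(
        shard,
        (s, reps) -> {
          if (reps.remove(replica)) {
            allReplicas.remove(replica);
            hasReplica.set(true);
          }
          return reps.isEmpty() ? null : reps; // returning null drops the empty shard entry
        });
    if (hasReplica.get()) {
      weightUpdates.add(replica);
    }
    return hasReplica.get();
  }

  // Variant B: the flat set acts as the guard, so no flag is needed.
  boolean removeWithGuard(String shard, String replica) {
    if (!allReplicas.remove(replica)) {
      return false;
    }
    repsByShard.computeIfPresent(
        shard,
        (s, reps) -> {
          reps.remove(replica);
          return reps.isEmpty() ? null : reps;
        });
    weightUpdates.add(replica);
    return true;
  }
}
```

Variant B is shorter, while Variant A keeps the add/remove operations on the two collections next to each other, which is the property the reply above prefers.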

magibney · 2023-11-15T20:53:30Z

solr/core/src/java/org/apache/solr/cluster/placement/plugins/OrderedNodePlacementPlugin.java

-          .flatMap(shard -> shard.values().stream())
-          .flatMap(Collection::stream)
-          .collect(Collectors.toSet());
+      return new HashSet<>(allReplicas);


Based on how this is being used currently, there's no need to wrap this in a new HashSet. The method is public, so you never know who might call it and what their assumptions might be; but you are replacing Collectors.toSet(), which specifically makes no guarantees about mutability/immutability, so I'd be inclined to just do Collections.unmodifiableSet(allReplicas), taking this opportunity to make the API more conservative, without breaking any backcompat.

Also, if the motivating concern is "many replicas", the cost of instantiating a new HashSet that we don't really need is not worth it.

As discussed, the current fix uses the most defensive approach, which does not modify the behavior at all. The minor concern with unmodifiableSet was that the "view" could still mutate after this method returns, and this could break certain expectations of the caller.

Otherwise I don't have any objection to using unmodifiableSet. Perhaps @HoustonPutman can share some thoughts? 😊
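The difference between the two options in this thread can be shown with plain JDK collections. In the sketch below, `CopyVersusView` and its method names are hypothetical. `new HashSet<>(…)` and `Collections.unmodifiableSet(…)` are the standard `java.util` calls being compared.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder used only to contrast a defensive copy with a read-only view.
final class CopyVersusView {
  private final Set<String> allReplicas = new HashSet<>();

  void add(String replica) {
    allReplicas.add(replica);
  }

  // O(n) copy: independent of later changes, and the caller may freely mutate it.
  Set<String> snapshot() {
    return new HashSet<>(allReplicas);
  }

  // O(1) wrapper: the caller cannot mutate it, but it reflects later changes to the backing set.
  Set<String> readOnlyView() {
    return Collections.unmodifiableSet(allReplicas);
  }
}
```

A caller that holds on to `readOnlyView()` while replicas are added or removed will see the set change underneath it, and iterating during such a change can throw `ConcurrentModificationException`. A caller that holds `snapshot()` pays for the copy but is isolated from later updates.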

#2076) * 1. Changed getAllReplicasOnNode to just return a copy of `allReplicas` field, which keeps track of all replicas. Instead of computing a new list every time. 2. Added a getAllReplicaCount method to avoid creating new list of replicas if only the count is required * ./gradlew tidy * minor refactoring

dsmiley · 2024-05-01T14:06:14Z

Just a little reminder... when squash-merging, remember to edit the commit message so it's cleaned up to reflect the final result in totality; it's almost always necessary. Should not refer to running "tidy", for example.

justinrsweeney · 2024-05-01T14:20:51Z

Thanks for the reminder, will make sure next time!

patsonluk added 2 commits November 14, 2023 16:28

1. Changed getAllReplicasOnNode to just return a copy of `allReplicas` field, which keeps track of all replicas. Instead of computing a new list every time. 2. Added a getAllReplicaCount method to avoid creating new list of replicas if only the count is required

d067c9b

./gradlew tidy

d890246

patsonluk changed the title ~~Patsonluk/solr 17076/optimize placement factory get all replica~~ SOLR-17076: Optimize placement factory get all replica Nov 15, 2023

patsonluk changed the title ~~SOLR-17076: Optimize placement factory get all replica~~ SOLR-17076: Optimize OrderedNodePlacementPlugin#getAllReplicasOnNode Nov 15, 2023

patsonluk marked this pull request as draft November 15, 2023 17:43

patsonluk marked this pull request as ready for review November 15, 2023 19:27

magibney requested changes Nov 15, 2023

minor refactoring

d0599c8

justinrsweeney merged commit 26c286a into apache:main Nov 21, 2023
1 of 2 checks passed
