SOLR-17076: Optimize OrderedNodePlacementPlugin#getAllReplicasOnNode #2076


patsonluk
Contributor

@patsonluk patsonluk commented Nov 15, 2023

https://issues.apache.org/jira/browse/SOLR-17076

Description

OrderedNodePlacementPlugin#getAllReplicasOnNode can be slow in a cluster with a lot of replicas. The effect can compound when creating a new collection with many shards.

Solution

Introduced a new field, allReplicas, which keeps track of all the replicas added to or removed from the plugin. getAllReplicasOnNode now returns a copy of that set instead of recomputing it on every call.

Also added a new convenience method, getAllReplicaCount, for callers that only need the count.
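
A minimal sketch of the bookkeeping described above, not the actual Solr code. The class name `NodeReplicaTracker`, the shape of the nested map, and the use of plain strings in place of `Replica` objects are all assumptions made for illustration (replica names are also assumed to be unique on the node). It shows a flat `allReplicas` set kept in step with the nested structure, so that listing or counting replicas no longer needs a traversal.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the plugin's per-node bookkeeping; not the real Solr class. */
class NodeReplicaTracker {
  // Nested structure (assumed shape): collection name -> shard name -> replica names.
  private final Map<String, Map<String, Set<String>>> replicasByShard = new HashMap<>();
  // Flat set maintained alongside the nested map.
  private final Set<String> allReplicas = new HashSet<>();

  void addReplica(String collection, String shard, String replica) {
    boolean added =
        replicasByShard
            .computeIfAbsent(collection, c -> new HashMap<>())
            .computeIfAbsent(shard, s -> new HashSet<>())
            .add(replica);
    if (added) {
      allReplicas.add(replica);
    }
  }

  void removeReplica(String collection, String shard, String replica) {
    Map<String, Set<String>> shards = replicasByShard.get(collection);
    if (shards == null) {
      return;
    }
    Set<String> reps = shards.get(shard);
    if (reps != null && reps.remove(replica)) {
      allReplicas.remove(replica);
    }
  }

  /** Returns a defensive copy, so later changes to the tracker do not affect the caller. */
  Set<String> getAllReplicasOnNode() {
    return new HashSet<>(allReplicas);
  }

  /** Avoids allocating a new set when only the size is needed. */
  int getAllReplicaCount() {
    return allReplicas.size();
  }

  public static void main(String[] args) {
    NodeReplicaTracker tracker = new NodeReplicaTracker();
    tracker.addReplica("c1", "shard1", "r1");
    tracker.addReplica("c1", "shard2", "r2");
    System.out.println(tracker.getAllReplicaCount()); // 2
  }
}
```

The trade-off is a little extra work on every add and remove in exchange for a cheap read path.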

Tests

This is a minor change; it relies on the existing unit test cases.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

…` field, which keeps track of all replicas. Instead of computing a new list every time.

2. Added a getAllReplicaCount method to avoid creating new list of replicas if only the count is required
@patsonluk patsonluk changed the title from "Patsonluk/solr 17076/optimize placement factory get all replica" to "SOLR-17076: Optimize placement factory get all replica" Nov 15, 2023
@patsonluk patsonluk changed the title from "SOLR-17076: Optimize placement factory get all replica" to "SOLR-17076: Optimize OrderedNodePlacementPlugin#getAllReplicasOnNode" Nov 15, 2023
@patsonluk patsonluk marked this pull request as draft November 15, 2023 17:43
@patsonluk
Contributor Author

patsonluk commented Nov 15, 2023

Got this while running ./gradlew check locally

java.lang.AssertionError: Doc Counts do not add up expected:<67> but was:<66>
   >         at __randomizedtesting.SeedInfo.seed([5A95C1706F927E37:D2C1FEAAC16E13CF]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.failNotEquals(Assert.java:835)
   >         at org.junit.Assert.assertEquals(Assert.java:647)
   >         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.assertDocCounts(AbstractFullDistribZkTestBase.java:1851)
   >         at org.apache.solr.cloud.AbstractBasicDistributedZk2TestBase.test(AbstractBasicDistributedZk2TestBase.java:109)

...

ERROR: The following test(s) have failed:
  - org.apache.solr.cloud.BasicDistributedZk2Test.test (:solr:core)
    Test output: /Users/patson/src/cowpath-solr/solr/core/build/test-results/test/outputs/OUTPUT-org.apache.solr.cloud.BasicDistributedZk2Test.txt
    Reproduce with: gradlew :solr:core:test --tests "org.apache.solr.cloud.BasicDistributedZk2Test.test" -Ptests.jvms=5 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=5A95C1706F927E37 -Ptests.file.encoding=ISO-8859-1

However, re-running ./gradlew check with -Ptests.seed=5A95C1706F927E37 no longer triggers the same issue

@patsonluk patsonluk marked this pull request as ready for review November 15, 2023 19:27
Contributor

@magibney magibney left a comment

Suggested a couple of changes, but overall this looks good to me!

@@ -515,6 +521,7 @@ public final void removeReplica(Replica replica) {
         });
     if (hasReplica.get()) {
       removeProjectedReplicaWeights(replica);
+      allReplicas.remove(replica);
Contributor

Now that we're separately tracking replicas in a top-level map (and assuming that this is all accessed by a single thread), I think this method could be simplified to:

if (allReplicas.remove(replica)) {
  // [the nested removal logic]
}
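
The simplification works because `Set.remove` returns `true` only when the element was actually present, so the call doubles as the existence check. A small sketch under assumptions: `RemoveGuardDemo`, its parameters, and the string-typed replicas are hypothetical, and the second `remove` merely stands in for the nested removal logic.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of using Set.remove's return value as the guard. */
class RemoveGuardDemo {
  static boolean removeReplica(Set<String> allReplicas, Set<String> shardReplicas, String replica) {
    // remove() reports whether the set changed, so no separate contains() or flag is needed.
    if (allReplicas.remove(replica)) {
      shardReplicas.remove(replica); // placeholder for the nested removal logic
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    Set<String> all = new HashSet<>(Set.of("r1", "r2"));
    Set<String> shard = new HashSet<>(Set.of("r1"));
    System.out.println(removeReplica(all, shard, "r1")); // true
    System.out.println(removeReplica(all, shard, "r1")); // false, already removed
  }
}
```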

Contributor Author

@patsonluk patsonluk Nov 15, 2023

Good catch @magibney ! I'm wondering if these 2 statements should just be placed in

directly?

Perhaps I have overlooked some edge cases in which using the AtomicBoolean hasReplica was necessary?

Contributor Author

@patsonluk patsonluk Nov 20, 2023

After reading the code again, I'm not 100% sure about the original purpose of the hasReplica flag.

However, it does make the logic a bit more isolated and easier to follow: iterate the list first, figure out whether the replica exists, and if it does, deal with it AFTER finishing the traversal. At a logical level, it's pretty clean and avoids any concurrent modification exception (though the map/list being iterated on is a private field anyway).
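
A sketch of the "flag first, act after the traversal" pattern described here. Everything in it is an assumption for illustration (`DeferredRemovalDemo`, the map of shard names to replica-name sets, and the flat set); it is not the plugin's code. One plausible reason for an `AtomicBoolean` in this shape of code is that a lambda can only capture effectively final locals, so a plain `boolean` could not be assigned from inside `forEach`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration of deferring follow-up work until a traversal has finished. */
class DeferredRemovalDemo {
  static boolean removeReplica(
      Map<String, Set<String>> shards, Set<String> allReplicas, String replica) {
    // A mutable holder is needed because the lambda cannot assign a local variable.
    AtomicBoolean hasReplica = new AtomicBoolean(false);
    shards.forEach(
        (shard, reps) -> {
          // Only the inner sets change here; the map being traversed is not modified.
          if (reps.remove(replica)) {
            hasReplica.set(true);
          }
        });
    // Follow-up bookkeeping happens once, after the traversal is complete.
    if (hasReplica.get()) {
      allReplicas.remove(replica);
    }
    return hasReplica.get();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> shards = new HashMap<>();
    shards.put("shard1", new HashSet<>(Set.of("r1", "r2")));
    Set<String> allReplicas = new HashSet<>(Set.of("r1", "r2"));
    System.out.println(removeReplica(shards, allReplicas, "r1")); // true
    System.out.println(allReplicas); // [r2]
  }
}
```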

Hm... I've simply moved the allReplicas.remove call inside the if (reps.remove(replica)) block now, because I do think it's best if the operations (add/remove) on the 2 collections stay close to each other.

Anyway, I think these are very minor concerns? 😊 Either way is fine.

-        .flatMap(shard -> shard.values().stream())
-        .flatMap(Collection::stream)
-        .collect(Collectors.toSet());
+    return new HashSet<>(allReplicas);
Contributor

Based on how this is being used currently, there's no need to wrap this in a new HashSet. The method is public, so you never know who might call it and what their assumptions might be; but you are replacing Collectors.toSet(), which specifically makes no guarantees about mutability/immutability, so I'd be inclined to just do Collections.unmodifiableSet(allReplicas), taking this opportunity to make the API more conservative, without breaking any backcompat.

Also, if the motivating concern is "many replicas", the cost of instantiating a new HashSet that we don't really need is not worth it.

Contributor Author

@patsonluk patsonluk Nov 15, 2023

As discussed, the current fix uses the most defensive approach, which does not modify the behavior at all. The minor concern with unmodifiableSet was that the "view" could still change after this method returns, and this could break certain expectations of the caller.
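
The difference between the two options can be seen in a small, self-contained sketch. The class and method names (`ViewVersusCopyDemo`, `viewOf`, `copyOf`) are hypothetical; only `Collections.unmodifiableSet` and the `HashSet` copy constructor are standard JDK APIs. The view rejects writes from the caller but still reflects later changes to the backing set, while the copy is a snapshot.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of a read-only view versus a defensive copy. */
class ViewVersusCopyDemo {
  // Cheap: no allocation proportional to the set size, but the result tracks the backing set.
  static Set<String> viewOf(Set<String> backing) {
    return Collections.unmodifiableSet(backing);
  }

  // Costs O(n) time and memory, but the result is independent of the backing set.
  static Set<String> copyOf(Set<String> backing) {
    return new HashSet<>(backing);
  }

  public static void main(String[] args) {
    Set<String> allReplicas = new HashSet<>(Set.of("r1", "r2"));
    Set<String> view = viewOf(allReplicas);
    Set<String> copy = copyOf(allReplicas);

    allReplicas.add("r3"); // the owner mutates its set after handing out results

    System.out.println(view.size()); // 3, the view sees the new element
    System.out.println(copy.size()); // 2, the copy is a snapshot

    try {
      view.add("r4");
    } catch (UnsupportedOperationException e) {
      System.out.println("view is read-only for the caller");
    }
  }
}
```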

Otherwise I don't have any objection to using unmodifiableSet. Perhaps @HoustonPutman can share some thoughts? 😊

@justinrsweeney justinrsweeney merged commit 26c286a into apache:main Nov 21, 2023
1 of 2 checks passed
justinrsweeney pushed a commit that referenced this pull request Nov 21, 2023
#2076)

* 1. Changed getAllReplicasOnNode to just return a copy of `allReplicas` field, which keeps track of all replicas. Instead of computing a new list every time.
2. Added a getAllReplicaCount method to avoid creating new list of replicas if only the count is required

* ./gradlew tidy

* minor refactoring
patsonluk added a commit to cowpaths/fullstory-solr that referenced this pull request Nov 24, 2023
justinrsweeney pushed a commit to cowpaths/fullstory-solr that referenced this pull request Apr 26, 2024
@dsmiley
Contributor

dsmiley commented May 1, 2024

Just a little reminder... when squash-merging, remember to edit the commit message so it's cleaned up to reflect the final result in totality; it's almost always necessary. Should not refer to running "tidy", for example.

@justinrsweeney
Contributor

justinrsweeney commented May 1, 2024 via email
