SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState #2363

aparnasuresh85 · 2024-03-21T14:22:10Z

…ted cluster state

https://issues.apache.org/jira/browse/SOLR-17153

Description

At present, when CloudSolrClient is instructed to send a request to a collection it deems non-existent due to its outdated local ClusterState view, it fails locally, which is undesired. This failure should be addressed. Moreover, many SolrCloud tests should remove their waitForState calls following collection creation. Matters concerning other stale states are not addressed here.

Similar changes will be required on the server side to handle "Collection Not Found" errors resulting from outdated cluster state.

Solution

If neither CloudSolrClient nor HttpSolrCall (server) can locate a collection, trigger a forced fetch of the collection from ZooKeeper and update the local cluster state accordingly.

Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.
Change existing tests to eliminate waitForState() and waitForActiveCollection() calls that were previously present after collection creation. These calls are unnecessary due to the enhancements introduced in this PR. All modified tests have been thoroughly verified to ensure their successful execution. These changes will be part of a separate commit to follow.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
[] I have added tests for my changes.
[] I have added documentation for the Reference Guide

…ted cluster state

dsmiley

I'm glad to see the logic that was in V2HttpCall made more common (move to base class) as it should not have been specific to V2. Perhaps my oversight 6 years ago.

Updating ZkClientClusterStateProvider is only one of the two ClusterStateProviders. I suspect the other Http one can be tested to show that this problem existed (and still exists now!) whereas probably too hard to bother with ZkClientClusterStateProvider; unclear how to trigger a pause in ZK watching without being too invasive.

Assuming we address this for both providers, the javadocs for this method ought to communicate the guarantee on not returning null if the collection exists.

solr/core/src/java/org/apache/solr/servlet/HttpSolrCall.java

dsmiley · 2024-03-21T14:40:43Z

...solrj-zookeeper/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java

-    if (clusterState != null) {
-      return clusterState.getCollectionRef(collection);
-    } else {
+    if (clusterState == null) {


When would this happen? If it's a startup/initialization scenario, maybe getClusterState() should wait until it exists? Okay to defer this question to another PR as this concern pre-existed albeit maybe it could be another reason why we aren't resolving the collection.

aparnasuresh85 · 2024-03-21T19:39:17Z

I'm glad to see the logic that was in V2HttpCall made more common (move to base class) as it should not have been specific to V2. Perhaps my oversight 6 years ago.

Updating ZkClientClusterStateProvider is only one of the two ClusterStateProviders. I suspect the other Http one can be tested to show that this problem existed (and still exists now!) whereas probably too hard to bother with ZkClientClusterStateProvider; unclear how to trigger a pause in ZK watching without being too invasive.

Assuming we address this for both providers, the javadocs for this method ought to communicate the guarantee on not returning null if the collection exists.

The reason I did not make the logic consistent across both providers is because BaseHttpClusterStateProvider, where the logic for getState() is implemented for the http version, checks all livenodes for presence of the collection/collectionRef via a SolrQuery. I felt that was comprehensive enough.

I can update the javadoc for the method as per your suggestion.

aparnasuresh85 · 2024-03-21T23:17:17Z

I'm glad to see the logic that was in V2HttpCall made more common (move to base class) as it should not have been specific to V2. Perhaps my oversight 6 years ago.

Updating ZkClientClusterStateProvider is only one of the two ClusterStateProviders. I suspect the other Http one can be tested to show that this problem existed (and still exists now!) whereas probably too hard to bother with ZkClientClusterStateProvider; unclear how to trigger a pause in ZK watching without being too invasive.

Assuming we address this for both providers, the javadocs for this method ought to communicate the guarantee on not returning null if the collection exists.

I also noticed javadoc for ClusterStateProvider.getState() does mention null means collection does not exit. I refined the wording a bit.

dsmiley

The reason I did not make the logic consistent across both providers is because BaseHttpClusterStateProvider, where the logic for getState() is implemented for the http version, checks all livenodes for presence of the collection/collectionRef via a SolrQuery. I felt that was comprehensive enough.

OMG I didn't look; so glad to see the correct implementation there already :-)

Please add a CHANGES.txt for the next 9.x version under Improvements. I suppose this may be ready to merge.

In a separate PR, I'd like to see most waitForState calls that follow collection creation to be removed.

solr/solrj/src/java/org/apache/solr/client/solrj/impl/ClusterStateProvider.java

aparnasuresh85 · 2024-03-22T15:31:18Z

The reason I did not make the logic consistent across both providers is because BaseHttpClusterStateProvider, where the logic for getState() is implemented for the http version, checks all livenodes for presence of the collection/collectionRef via a SolrQuery. I felt that was comprehensive enough.

OMG I didn't look; so glad to see the correct implementation there already :-)

Please add a CHANGES.txt for the next 9.x version under Improvements. I suppose this may be ready to merge.

In a separate PR, I'd like to see most waitForState calls that follow collection creation to be removed.

I pushed a commit to exclude waitForState() calls after collection creation. Will do another pass to identify other hidden calls.

dsmiley · 2024-03-22T19:33:11Z

solr/core/src/test/org/apache/solr/cloud/AliasIntegrationTest.java

-    cluster.waitForActiveCollection("collection1meta", 2, 2);
-    cluster.waitForActiveCollection("collection2meta", 1, 1);
-
-    waitForState(
-        "Expected collection1 to be created with 2 shards and 1 replica",
-        "collection1meta",
-        clusterShape(2, 2));
-    waitForState(
-        "Expected collection2 to be created with 1 shard and 1 replica",
-        "collection2meta",
-        clusterShape(1, 1));


The point of this issue/PR is to be able to use a CloudSolrClient (such as cluster.getSolrClient) after collection creation to make requests to the collection (index/query) without explicitly waiting for its existence. Nothing more. This will enable some tests to remove waitForState calls that follow collection creation as needless. Maybe a comment would be nice on waitForState to elaborate on when it might not be needed.

Here where I comment, I don't see that we can remove these calls. If you've thought about this one and think otherwise, let me know. At this time I don't plan to review the remainder of the tests; you may need to revert some. And to reduce test flapping risk, I would prefer that these waitForState removal changes are decoupled to another PR that is merged separately. We might intentionally be delay the 9x backport to make changes here more apparent. This would also be good for code archeology to see the substance of the change without distraction of touching 30+ other files.

Got it and I agree - will revert the last commit.

dsmiley · 2024-03-23T05:45:56Z

Proposed CHANGES.txt under 9.6 Bug Fix:

CloudSolrClient and Solr could in rare circumstances report that a collection does not exist right after it was actually just created (an eventually consistent race issue). Now it double-checks before yielding such an error.

It's tempting to call it an improvement but this ought to have been an explicit guarantee all along for such a trivial create-then-index (as seen in tests we have on a high scale cluster).

I plan to merge late Tuesday, or maybe sooner if I get another +1.

markrmiller · 2024-03-25T09:47:09Z

I'd be careful removing all the waits in tests without some pretty thorough testing in some slow environments - for one, it's a bandaid rather than a fix, there is still a race, it's just harder to hit, and for two, the bug does not just apply to the collection level. Collections are created piecemeal, and the collection can exist but, for example, a necessary shard may not yet.

dsmiley · 2024-03-26T08:28:26Z

I was aware we may just be moving to a new problem, albeit a situation that is even more rare than what we have today. @markrmiller can you recommend a specific test or a way to make a test that has an incomplete DocCollection and tries to use it client side to route? As I write that; maybe it's not too hard to create a cluster state provider in a SolrJ test that has the state we want. The behavior we want to see client side is that it gets routed to Solr; don't give up client-side. In practice this gives even more time for the server to get the subsequent request and for this DocCollection to become complete. But if it's incomplete, I suspect an error. I've not seen this happen in practice, at least to my knowledge.

markrmiller · 2024-03-26T08:51:38Z

Not off hand, but you can likely just create a collection, create a DocCollection with the state you want, convert it to json with the code that does that, and then overwrite cluster state straight to ZK.

But my point is, at least when I glanced at the change, it looked like it retried if the collection wasn't there. If that is the case, even if that worked 100%,it would only cover a fraction of the fails the waitFor calls handle in tests. They wait for every shard and every replica that a collection api call will create (it's not just collection create calls).

Without ensuring every shard and replica is there, rare, hard to debug timing fails will return (less rare on CI runs or in resource constrained envs), as a test will often do something that expects all parts of what they just said to create to exist once that call has returned.

dsmiley · 2024-04-01T16:52:03Z

Maybe in this PR or another we'll add a test like you (Mark) describe to ensure that CloudSolrClient directs requests to Solr in all circumstances (giving up on smart routing to a core if this can't be determined) so long as we have confirmed the collection in fact exists (which again requires talking to the backend). My concern is that, as we saw here before the changes, CloudSolrClient would yield an error prematurely.

Hmmm; I wonder what the state.json looks like at each incremental step of a trivial one-shard 2-replica collection. Something to investigate. If it's a bit incremental; it'd be cool if Solr might be improved to get the collection to exist with at least a core in one-shot, thus eliminating the possibility a Solr node having a collection's state that is so partial that it can't even route a request to it.

It'd be nice to remove so many waitForState's but maybe we can't in good conscience.

…ith an out-dated ClusterState (#2363) ZkClientClusterStateProvider needed to double-check a collection truly doesn't exist. HttpClusterStateProvider is already correct. HttpSolrCall on the server side was hardened similarly. This could happen for a highly burdened cluster / zookeeper. (cherry picked from commit 5c399dd)

…found" with an out-dated ClusterState (apache#2363)" This reverts commit 5c399dd.

CloudSolrClient should not throw "Collection not found" with an outda…

a24b486

…ted cluster state

github-actions bot added jetty-server client:solrj cat:api labels Mar 21, 2024

aparnasuresh85 changed the title ~~CloudSolrClient should not throw "Collection not found" with an outda…~~ SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState Mar 21, 2024

dsmiley reviewed Mar 21, 2024

View reviewed changes

minor refactoring

68333ab

minor refactoring

112e3f4

dsmiley reviewed Mar 22, 2024

View reviewed changes

solr/solrj/src/java/org/apache/solr/client/solrj/impl/ClusterStateProvider.java Show resolved Hide resolved

dsmiley requested a review from markrmiller March 22, 2024 13:41

github-actions bot added test-framework tests cat:search cat:cloud cat:index cat:cli cat:security cat:packagemanager labels Mar 22, 2024

dsmiley reviewed Mar 22, 2024

View reviewed changes

aparnasuresh85 force-pushed the solr-17153 branch 3 times, most recently from 5c66049 to c10d83e Compare March 23, 2024 04:52

github-actions bot removed test-framework tests cat:search cat:cloud cat:index labels Mar 23, 2024

github-actions bot removed cat:cli cat:security cat:packagemanager labels Mar 23, 2024

aparnasuresh85 force-pushed the solr-17153 branch from cf4df4a to 112e3f4 Compare March 23, 2024 05:07

aparnasuresh85 added 2 commits April 2, 2024 15:22

CHANGES.txt

465eb97

Commit after tidy

e9c2cec

dsmiley merged commit 5c399dd into apache:main Apr 3, 2024
4 checks passed

gerlowskija added a commit to gerlowskija/solr that referenced this pull request Apr 5, 2024

Revert "SOLR-17153: CloudSolrClient should not throw "Collection not …

1fb376b

…found" with an out-dated ClusterState (apache#2363)" This reverts commit 5c399dd.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState #2363

SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState #2363

aparnasuresh85 commented Mar 21, 2024 •

edited

dsmiley left a comment

dsmiley Mar 21, 2024

aparnasuresh85 commented Mar 21, 2024

aparnasuresh85 commented Mar 21, 2024

dsmiley left a comment

aparnasuresh85 commented Mar 22, 2024

dsmiley Mar 22, 2024

aparnasuresh85 Mar 22, 2024

dsmiley commented Mar 23, 2024

markrmiller commented Mar 25, 2024

dsmiley commented Mar 26, 2024

markrmiller commented Mar 26, 2024

dsmiley commented Apr 1, 2024

SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState #2363

SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState #2363

Conversation

aparnasuresh85 commented Mar 21, 2024 • edited

Description

Solution

Tests

Checklist

dsmiley left a comment

Choose a reason for hiding this comment

dsmiley Mar 21, 2024

Choose a reason for hiding this comment

aparnasuresh85 commented Mar 21, 2024

aparnasuresh85 commented Mar 21, 2024

dsmiley left a comment

Choose a reason for hiding this comment

aparnasuresh85 commented Mar 22, 2024

dsmiley Mar 22, 2024

Choose a reason for hiding this comment

aparnasuresh85 Mar 22, 2024

Choose a reason for hiding this comment

dsmiley commented Mar 23, 2024

markrmiller commented Mar 25, 2024

dsmiley commented Mar 26, 2024

markrmiller commented Mar 26, 2024

dsmiley commented Apr 1, 2024

aparnasuresh85 commented Mar 21, 2024 •

edited