Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState #2363

Merged
merged 5 commits into from Apr 3, 2024

Conversation

aparnasuresh85
Copy link
Contributor

@aparnasuresh85 aparnasuresh85 commented Mar 21, 2024

…ted cluster state

https://issues.apache.org/jira/browse/SOLR-17153

Description

At present, when CloudSolrClient is instructed to send a request to a collection it deems non-existent due to its outdated local ClusterState view, it fails locally, which is undesired. This failure should be addressed. Moreover, many SolrCloud tests should remove their waitForState calls following collection creation. Matters concerning other stale states are not addressed here.

Similar changes will be required on the server side to handle "Collection Not Found" errors resulting from outdated cluster state.

Solution

If neither CloudSolrClient nor HttpSolrCall (server) can locate a collection, trigger a forced fetch of the collection from ZooKeeper and update the local cluster state accordingly.

Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.
Change existing tests to eliminate waitForState() and waitForActiveCollection() calls that were previously present after collection creation. These calls are unnecessary due to the enhancements introduced in this PR. All modified tests have been thoroughly verified to ensure their successful execution. These changes will be part of a separate commit to follow.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • [] I have added tests for my changes.
  • [] I have added documentation for the Reference Guide

@aparnasuresh85 aparnasuresh85 changed the title CloudSolrClient should not throw "Collection not found" with an outda… SOLR-17153: CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState Mar 21, 2024
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see the logic that was in V2HttpCall made more common (move to base class) as it should not have been specific to V2. Perhaps my oversight 6 years ago.

Updating ZkClientClusterStateProvider is only one of the two ClusterStateProviders. I suspect the other Http one can be tested to show that this problem existed (and still exists now!) whereas probably too hard to bother with ZkClientClusterStateProvider; unclear how to trigger a pause in ZK watching without being too invasive.

Assuming we address this for both providers, the javadocs for this method ought to communicate the guarantee on not returning null if the collection exists.

if (clusterState != null) {
return clusterState.getCollectionRef(collection);
} else {
if (clusterState == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this happen? If it's a startup/initialization scenario, maybe getClusterState() should wait until it exists? Okay to defer this question to another PR as this concern pre-existed albeit maybe it could be another reason why we aren't resolving the collection.

@aparnasuresh85
Copy link
Contributor Author

I'm glad to see the logic that was in V2HttpCall made more common (move to base class) as it should not have been specific to V2. Perhaps my oversight 6 years ago.

Updating ZkClientClusterStateProvider is only one of the two ClusterStateProviders. I suspect the other Http one can be tested to show that this problem existed (and still exists now!) whereas probably too hard to bother with ZkClientClusterStateProvider; unclear how to trigger a pause in ZK watching without being too invasive.

Assuming we address this for both providers, the javadocs for this method ought to communicate the guarantee on not returning null if the collection exists.

The reason I did not make the logic consistent across both providers is because BaseHttpClusterStateProvider, where the logic for getState() is implemented for the http version, checks all livenodes for presence of the collection/collectionRef via a SolrQuery. I felt that was comprehensive enough.

I can update the javadoc for the method as per your suggestion.

@aparnasuresh85
Copy link
Contributor Author

I'm glad to see the logic that was in V2HttpCall made more common (move to base class) as it should not have been specific to V2. Perhaps my oversight 6 years ago.

Updating ZkClientClusterStateProvider is only one of the two ClusterStateProviders. I suspect the other Http one can be tested to show that this problem existed (and still exists now!) whereas probably too hard to bother with ZkClientClusterStateProvider; unclear how to trigger a pause in ZK watching without being too invasive.

Assuming we address this for both providers, the javadocs for this method ought to communicate the guarantee on not returning null if the collection exists.

I also noticed javadoc for ClusterStateProvider.getState() does mention null means collection does not exit. I refined the wording a bit.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I did not make the logic consistent across both providers is because BaseHttpClusterStateProvider, where the logic for getState() is implemented for the http version, checks all livenodes for presence of the collection/collectionRef via a SolrQuery. I felt that was comprehensive enough.

OMG I didn't look; so glad to see the correct implementation there already :-)

Please add a CHANGES.txt for the next 9.x version under Improvements. I suppose this may be ready to merge.

In a separate PR, I'd like to see most waitForState calls that follow collection creation to be removed.

@aparnasuresh85
Copy link
Contributor Author

The reason I did not make the logic consistent across both providers is because BaseHttpClusterStateProvider, where the logic for getState() is implemented for the http version, checks all livenodes for presence of the collection/collectionRef via a SolrQuery. I felt that was comprehensive enough.

OMG I didn't look; so glad to see the correct implementation there already :-)

Please add a CHANGES.txt for the next 9.x version under Improvements. I suppose this may be ready to merge.

In a separate PR, I'd like to see most waitForState calls that follow collection creation to be removed.

I pushed a commit to exclude waitForState() calls after collection creation. Will do another pass to identify other hidden calls.

Comment on lines 102 to 112
cluster.waitForActiveCollection("collection1meta", 2, 2);
cluster.waitForActiveCollection("collection2meta", 1, 1);

waitForState(
"Expected collection1 to be created with 2 shards and 1 replica",
"collection1meta",
clusterShape(2, 2));
waitForState(
"Expected collection2 to be created with 1 shard and 1 replica",
"collection2meta",
clusterShape(1, 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this issue/PR is to be able to use a CloudSolrClient (such as cluster.getSolrClient) after collection creation to make requests to the collection (index/query) without explicitly waiting for its existence. Nothing more. This will enable some tests to remove waitForState calls that follow collection creation as needless. Maybe a comment would be nice on waitForState to elaborate on when it might not be needed.

Here where I comment, I don't see that we can remove these calls. If you've thought about this one and think otherwise, let me know. At this time I don't plan to review the remainder of the tests; you may need to revert some. And to reduce test flapping risk, I would prefer that these waitForState removal changes are decoupled to another PR that is merged separately. We might intentionally be delay the 9x backport to make changes here more apparent. This would also be good for code archeology to see the substance of the change without distraction of touching 30+ other files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it and I agree - will revert the last commit.

@dsmiley
Copy link
Contributor

dsmiley commented Mar 23, 2024

Proposed CHANGES.txt under 9.6 Bug Fix:

CloudSolrClient and Solr could in rare circumstances report that a collection does not exist right after it was actually just created (an eventually consistent race issue). Now it double-checks before yielding such an error.

It's tempting to call it an improvement but this ought to have been an explicit guarantee all along for such a trivial create-then-index (as seen in tests we have on a high scale cluster).

I plan to merge late Tuesday, or maybe sooner if I get another +1.

@markrmiller
Copy link
Member

I'd be careful removing all the waits in tests without some pretty thorough testing in some slow environments - for one, it's a bandaid rather than a fix, there is still a race, it's just harder to hit, and for two, the bug does not just apply to the collection level. Collections are created piecemeal, and the collection can exist but, for example, a necessary shard may not yet.

@dsmiley
Copy link
Contributor

dsmiley commented Mar 26, 2024

I was aware we may just be moving to a new problem, albeit a situation that is even more rare than what we have today. @markrmiller can you recommend a specific test or a way to make a test that has an incomplete DocCollection and tries to use it client side to route? As I write that; maybe it's not too hard to create a cluster state provider in a SolrJ test that has the state we want. The behavior we want to see client side is that it gets routed to Solr; don't give up client-side. In practice this gives even more time for the server to get the subsequent request and for this DocCollection to become complete. But if it's incomplete, I suspect an error. I've not seen this happen in practice, at least to my knowledge.

@markrmiller
Copy link
Member

Not off hand, but you can likely just create a collection, create a DocCollection with the state you want, convert it to json with the code that does that, and then overwrite cluster state straight to ZK.

But my point is, at least when I glanced at the change, it looked like it retried if the collection wasn't there. If that is the case, even if that worked 100%,it would only cover a fraction of the fails the waitFor calls handle in tests. They wait for every shard and every replica that a collection api call will create (it's not just collection create calls).

Without ensuring every shard and replica is there, rare, hard to debug timing fails will return (less rare on CI runs or in resource constrained envs), as a test will often do something that expects all parts of what they just said to create to exist once that call has returned.

@dsmiley
Copy link
Contributor

dsmiley commented Apr 1, 2024

Maybe in this PR or another we'll add a test like you (Mark) describe to ensure that CloudSolrClient directs requests to Solr in all circumstances (giving up on smart routing to a core if this can't be determined) so long as we have confirmed the collection in fact exists (which again requires talking to the backend). My concern is that, as we saw here before the changes, CloudSolrClient would yield an error prematurely.

Hmmm; I wonder what the state.json looks like at each incremental step of a trivial one-shard 2-replica collection. Something to investigate. If it's a bit incremental; it'd be cool if Solr might be improved to get the collection to exist with at least a core in one-shot, thus eliminating the possibility a Solr node having a collection's state that is so partial that it can't even route a request to it.

It'd be nice to remove so many waitForState's but maybe we can't in good conscience.

@dsmiley dsmiley merged commit 5c399dd into apache:main Apr 3, 2024
4 checks passed
dsmiley pushed a commit that referenced this pull request Apr 3, 2024
…ith an out-dated ClusterState (#2363)

ZkClientClusterStateProvider needed to double-check a collection truly doesn't exist.  HttpClusterStateProvider is already correct.
HttpSolrCall on the server side was hardened similarly.
This could happen for a highly burdened cluster / zookeeper.

(cherry picked from commit 5c399dd)
gerlowskija added a commit to gerlowskija/solr that referenced this pull request Apr 5, 2024
…found" with an out-dated ClusterState (apache#2363)"

This reverts commit 5c399dd.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants