NO-JIRA: Flaky test fix testDistributedQueryDoesNotReadFromZk #4429

Open
mlbiscoc wants to merge 1 commit into apache:main from mlbiscoc:flaky-distib-zk-query-test


@mlbiscoc
Contributor

testDistributedQueryDoesNotReadFromZk is flaky on Crave and Jenkins. I have not been able to reproduce the failure locally, but the error is:

2> 38735 INFO  (TEST-DistributedQueryComponentOptimizationTest.testDistributedQueryDoesNotReadFromZk-seed#[2DA40BF573442949]) [] o.a.s.SolrTestCaseJ4 ###Ending testDistributedQueryDoesNotReadFromZk
   >     org.apache.solr.client.solrj.RemoteSolrException: Error from server at http://127.0.0.1:46591/solr/optimize/select?q=*%3A*&collection=optimize%2CsecondColl&wt=javabin: org.apache.solr.common.SolrException: no active servers hosting shard: secondColl_shard1

My suspicion is that a race condition makes the test flaky. The test waited for the collection to become ready using the client's zkStateReader, which is separate from the state cached by the Jetty node that actually serves the query. The client could therefore see the collection as ready before the Jetty node had seen the updated state. So instead, we wait on that Jetty node's ready state using its own zkStateReader, not the client's.
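To illustrate the suspected race, here is a toy model in plain Java. It is only a sketch and does not use any Solr API: `SharedState`, `CachedView`, and `waitFor` are hypothetical names made up for this example. `SharedState` stands in for ZooKeeper, and the two `CachedView` instances stand in for the client's and the node's separately cached state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Toy model only: every class and method here is hypothetical, not a Solr API.
public class StaleViewSketch {

    // Stands in for the authoritative cluster state (ZooKeeper in the test).
    static final class SharedState {
        private final Map<String, Boolean> active = new HashMap<>();

        void markActive(String collection) {
            active.put(collection, true);
        }

        boolean isActive(String collection) {
            return active.getOrDefault(collection, false);
        }
    }

    // Stands in for one reader's locally cached copy of that state.
    static final class CachedView {
        private final SharedState source;
        private final Map<String, Boolean> cache = new HashMap<>();

        CachedView(SharedState source) {
            this.source = source;
        }

        // Copies the current authoritative value into this view's cache.
        void refresh(String collection) {
            cache.put(collection, source.isActive(collection));
        }

        boolean seesActive(String collection) {
            return cache.getOrDefault(collection, false);
        }
    }

    // Checks the condition up to maxAttempts times, refreshing after each miss.
    static boolean waitFor(BooleanSupplier condition, Runnable refresh, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            refresh.run();
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        SharedState shared = new SharedState();
        CachedView clientView = new CachedView(shared);
        CachedView nodeView = new CachedView(shared);

        shared.markActive("secondColl");
        clientView.refresh("secondColl"); // only the client's cache is updated

        // Waiting on the client's view succeeds while the node's view is stale.
        System.out.println("client sees active: " + clientView.seesActive("secondColl"));
        System.out.println("node sees active:   " + nodeView.seesActive("secondColl"));

        // Waiting on the node's own view only returns once that view has caught up.
        boolean ready = waitFor(
            () -> nodeView.seesActive("secondColl"),
            () -> nodeView.refresh("secondColl"),
            5);
        System.out.println("node ready after wait: " + ready);
    }
}
```

In this model, a check against `clientView` passes while `nodeView` is still stale, which corresponds to the "no active servers hosting shard" window described above. Waiting on `nodeView` closes that window because the condition is evaluated against the same cache the query would use.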

Contributor

@dsmiley dsmiley left a comment


Given that node 1 doesn't even have secondColl, I'm surprised we can nonetheless do a waitForState on this collection since I don't expect it'd be notified that it even exists. Am I wrong?

I added some other trivial comments; ignore them or apply them as you like.

public void testDistributedQueryDoesNotReadFromZk() throws Exception {
final String secondColl = "secondColl";

// Create a collection on only 1 node so the other node uses LazyCollectionRef for state

Suggested change
// Create a collection on only node 0; node 1 will use LazyCollectionRef for state

Clarifying the node numbers rather than generally referring to a single/other node. The original definitely wasn't wrong, but now it jibes with get(0) vs get(1).

cluster

// Wait on node 1's ZkStateReader (not the cluster client's) to check for ready state
JettySolrRunner nodeWithoutSecondColl = jettys.get(1);

I love this var name. Maybe node 0 should have a similar var name we can use like "nodeWithBothColls". Just an idea.
