Support disruption free rolling restart #529

janhoy · 2023-03-07T11:36:57Z

As discussed in Slack: https://apachesolr.slack.com/archives/C022UMAPZ0V/p1676970790552379

When the operator restarts the cluster, e.g. during a version upgrade, there is no guarantee that a Solr pod is marked as not ready before `solr stop` is called. Clients may therefore experience connection errors during the restart.

@HoustonPutman suggests we can implement a custom readiness gate https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate to control this better.
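To illustrate the readiness-gate idea, here is a minimal sketch of the rule Kubernetes documents for pod readiness: a pod counts as ready only when its containers are ready and every condition listed in its readiness gates has status "True" (a missing condition counts as "False"). The `Pod` class, the `GATE` condition type, and `mark_not_ready_then_stop` are hypothetical names used for illustration only; they are not the operator's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical condition type, chosen for this sketch only.
GATE = "example.org/is-not-stopped"


@dataclass
class Pod:
    containers_ready: bool
    readiness_gates: List[str] = field(default_factory=list)
    # Maps condition type to its status string, "True" or "False".
    conditions: Dict[str, str] = field(default_factory=dict)


def is_ready(pod: Pod) -> bool:
    """Ready only if containers are ready and every gate condition is "True".

    A gate whose condition is absent is treated as "False".
    """
    gates_ok = all(pod.conditions.get(g) == "True" for g in pod.readiness_gates)
    return pod.containers_ready and gates_ok


def mark_not_ready_then_stop(pod: Pod, stop: Callable[[Pod], None]) -> None:
    """Flip the gate first so the pod leaves rotation, then stop it."""
    pod.conditions[GATE] = "False"
    assert not is_ready(pod)  # traffic should no longer be routed here
    stop(pod)
```

The point of the ordering in `mark_not_ready_then_stop` is that the controller, not the container's own probe, decides when the pod stops receiving traffic.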


HoustonPutman · 2023-03-28T15:35:14Z

@janhoy we should also create a Solr JIRA issue for this, to fix Cloud-aware clients and internal shard requests.

More info: We can fix this for simple use cases, where every collection in the cloud is single-sharded and has a replica on every node. That way, Solr has no need to send the request to another node internally.
If a collection is multi-sharded, or a replica of the collection does not exist on all nodes, then Solr might have to forward requests throughout the cluster. Solr is not aware of the podConditions we are using to solve this in Kubernetes, so we need to think of another solution to fix this inside of Solr.
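The distinction above can be sketched as a small check over a cluster layout. This is only an illustration: the `Layout` structure (collection name to shard name to the set of nodes hosting a replica) and the function name are assumptions made for this example, not a Solr or operator API.

```python
from typing import Dict, List, Set

# Hypothetical layout: collection -> shard -> nodes that host a replica.
Layout = Dict[str, Dict[str, Set[str]]]


def collections_needing_forwarding(layout: Layout, nodes: Set[str]) -> List[str]:
    """Return collections for which some node may have to forward requests.

    A collection is safe only if it has a single shard and that shard
    has a replica on every node in the cluster.
    """
    result = []
    for name, shards in layout.items():
        multi_sharded = len(shards) > 1
        # True if any shard lacks a replica on at least one node.
        missing_replica = any(not nodes <= hosts for hosts in shards.values())
        if multi_sharded or missing_replica:
            result.append(name)
    return result
```

For collections returned by this check, pod readiness alone is not enough, because Solr itself may route an internal request to a node that is about to stop.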

In the meantime #530 is a great start.

janhoy · 2023-03-28T17:57:12Z

> @janhoy we should also create a Solr JIRA issue for this, to fix Cloud-aware clients and internal shard requests.

Sure, I can create one. Do you have a clear idea of how it would work? Currently, SolrJ considers the collection state combined with live_nodes to decide which replicas to query. Would we need some new per-node-state znode in ZooKeeper to flag a node as "draining", and then let SolrJ act on that?
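A rough sketch of what "act on that" could mean for replica selection, assuming a client that already knows the live nodes and could additionally read a set of draining nodes. The `Replica` class, the `draining_nodes` parameter, and the function name are hypothetical; no such per-node state exists in Solr or SolrJ as far as this thread states.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Replica:
    name: str
    node: str  # name of the node hosting this replica


def eligible_replicas(
    replicas: Iterable[Replica],
    live_nodes: Set[str],
    draining_nodes: Set[str],  # hypothetical state, e.g. read from a new znode
) -> List[Replica]:
    """Keep replicas on nodes that are live and not flagged as draining."""
    return [
        r for r in replicas
        if r.node in live_nodes and r.node not in draining_nodes
    ]
```

With an empty `draining_nodes` set this reduces to the existing live_nodes check, so the extra state would only narrow the candidate list while a node is being stopped.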

HoustonPutman · 2023-03-28T19:09:48Z

Not a clear idea yet.

> Would we need some new per-node-state znode in Zookeeper to flag a node as "draining", and then let SolrJ act on that?

That would work, but I'm not sure we'd want to restrict it to just "draining". We might want to send requests elsewhere for other reasons too.

janhoy · 2023-03-28T20:28:59Z

https://issues.apache.org/jira/browse/SOLR-16722

janhoy added the enhancement New feature or request label Mar 7, 2023

HoustonPutman mentioned this issue Mar 9, 2023

Add readinessCondition to stop traffic to pods that will be stopped #530

Merged


HoustonPutman closed this as completed in #530 Apr 3, 2023

HoustonPutman added this to the v0.7.0 milestone Apr 24, 2023
