
Explore concurrency schemes to compensate for blocking ES calls #1161

Closed
sebgl opened this issue Jun 27, 2019 · 4 comments
Labels
>enhancement Enhancement of existing functionality

Comments

@sebgl (Contributor) commented Jun 27, 2019

There are several places where we make synchronous calls in the reconciliation loop. Sometimes these calls may not be necessary, but we need to make sure we correctly handle a stale state cache.

An example: https://github.com/elastic/cloud-on-k8s/pull/1160/files#r298192964

This is an optimization, not something absolutely necessary to fix.

sebgl added the >enhancement Enhancement of existing functionality label on Jun 27, 2019
@pebrc (Collaborator) commented Jun 27, 2019

Related issue #1016 suggested decoupling the main reconciliation loop from any ES interactions; it outlines abstracting the notion of an Elasticsearch cluster via a potentially asynchronous API (draft PR #1116).

In that context we discussed, as a potential low-level implementation of that idea, decoupling the main reconciliation loop from Elasticsearch requests using a task worker concept. Instead of calling Elasticsearch directly you would run a task, something like:

type LogicalTime int

// Run triggers execution of the given task at logical time t with params ps.
func (ts Tasks) Run(task Task, t LogicalTime, ps ...Param)

// ConsumeResult reads the result generated after the asOf time with params ps; returns nil if not ready.
func (ts Tasks) ConsumeResult(task Task, asOf LogicalTime, ps ...Param) interface{} // not sure if we can do better?

// Peek is a non-destructive read of the result generated as of time t.
func (ts Tasks) Peek(task Task, t LogicalTime) interface{}

@pebrc (Collaborator) commented Oct 16, 2019

In a recent discussion we arrived at the conclusion that removing blocking calls in favour of an async task API, as suggested above, is not necessary after all, for two reasons:

  • with the move to StatefulSets, a lot of orchestration logic has been delegated to the k8s StatefulSet controller, so blocking in the Elasticsearch controller has become less of an issue
  • we could therefore just do blocking HTTP calls in the reconciliation loop and compensate for the loss of throughput by increasing MaxConcurrentReconciles in controller.Options (see the sketch below)

The resulting code should be conceptually much simpler than any asynchronous indirection of HTTP requests as suggested above, and the overall capacity of the controller should still be acceptable.
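
A minimal sketch of what that could look like with controller-runtime; the controller name, the reconciler wiring, and the value 10 are assumptions for illustration, not a recommendation:

import (
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Allow several clusters to be reconciled in parallel so that a blocking HTTP
// call for one cluster does not stall all the others.
func addController(mgr manager.Manager, r reconcile.Reconciler) error {
	_, err := controller.New("elasticsearch-controller", mgr, controller.Options{
		Reconciler:              r,
		MaxConcurrentReconciles: 10, // assumed value; tune to the number of managed clusters
	})
	return err // setting up watches on the returned controller is omitted here
}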

That being said, it is worth pointing out that the current (and in many ways problematic) observer implementation, as an example of asynchronous ES request handling, has one advantage over the suggested approach: it dynamically adjusts to the number of ES clusters being handled. We start a goroutine for every cluster we manage in order to observe its state/health. In a similar fashion, the suggested task API would also dynamically adjust to the number of clusters being managed by spawning a goroutine for every asynchronous task. By comparison, improving overall throughput by setting MaxConcurrentReconciles is a static setting that, as far as I know, we cannot adjust as the controller starts managing more and more clusters.
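
For comparison, a rough sketch of that per-cluster observer pattern; the ObserverManager type, the pollHealth method, and the stop-channel wiring are assumptions used to illustrate the idea:

import "time"

// One goroutine per managed cluster: concurrency grows and shrinks with the
// number of clusters, unlike a fixed MaxConcurrentReconciles setting.
func (m *ObserverManager) Observe(cluster string, interval time.Duration) chan struct{} {
	stop := make(chan struct{})
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				m.pollHealth(cluster) // blocking ES call, isolated in its own goroutine
			}
		}
	}()
	return stop
}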

david-kow changed the title from "Avoid blocking ES calls in the reconciliation loops" to "Explore concurrency schemes to compensate for blocking ES calls" on Oct 18, 2019
@charith-elastic (Contributor)

The observers seem to be the main bottleneck when managing a large number of Elasticsearch clusters. TLS negotiation and blocking network calls compound the effect as more clusters are added (see #357 (comment)).

An interesting avenue to explore might be the delegation of observation work to sidecars so that the operator only needs a single connection to each managed cluster (pull model) or a handful of connections to an event queue (push model).

@pebrc (Collaborator) commented Feb 24, 2020

Closing in favour of an issue to use parallel reconciliation
