Diagnostics about Elasticsearch client sockets #134362

Closed
rudolf opened this issue Jun 14, 2022 · 8 comments
Labels
Feature:elasticsearch, performance, Supportability, Team:Core

rudolf commented Jun 14, 2022

The Elasticsearch-js client is currently configured with maxSockets: Infinity, which means connections aren't being reused, so every outgoing request has to establish a new TCP connection + TLS handshake. We know we need to reduce this value, but it's really hard to choose an appropriate number.

In order to tune this value we need to expose the number of actual sockets used by the http agent that the Elasticsearch-js client is using. That way we could log a warning when Kibana hits the limit. If, at the limit, the event loop is still healthy, performance could be improved by increasing the number of sockets.

In addition to a warning when we're at the limit, it would be useful to expose the number of sockets as part of our monitoring data as well as through the ops.metrics logger.
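
For illustration, here is a minimal sketch (not existing Kibana code) of how socket usage could be read from the Node.js http.Agent behind the client, using the agent's standard sockets / freeSockets / requests maps; the logger shape and the saturation check are placeholders:

```ts
import http from 'http';

// Count entries across the agent's per-origin buckets (origin -> array).
function countByOrigin(bucket: { readonly [origin: string]: unknown[] | undefined }): number {
  return Object.values(bucket).reduce((sum, list) => sum + (list?.length ?? 0), 0);
}

export function getAgentSocketStats(agent: http.Agent) {
  return {
    active: countByOrigin(agent.sockets),    // sockets currently serving a request
    idle: countByOrigin(agent.freeSockets),  // keep-alive sockets waiting for reuse
    queued: countByOrigin(agent.requests),   // requests waiting for a free socket
  };
}

// Hypothetical usage: warn when the agent is saturated.
export function warnIfSaturated(agent: http.Agent, log: { warn: (msg: string) => void }) {
  const { active, idle, queued } = getAgentSocketStats(agent);
  // Requests only queue up once the agent has hit its socket limits,
  // so a non-empty queue is a reasonable proxy for "we are at maxSockets".
  if (queued > 0) {
    log.warn(`ES client agent saturated: ${active} active / ${idle} idle sockets, ${queued} queued requests`);
  }
}
```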

Context:
#112756 (comment)

rudolf added the Team:Core and performance labels Jun 14, 2022
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

rudolf added the Supportability label Jun 15, 2022
@gsoldevila

After experimenting a bit with the elasticsearch-js client and with Node's HTTP Agents, the concepts are much clearer in my head now. Just to make sure we are all on the same page:

ATM connections will be reused if they are idle*.

The maxSockets parameter determines how many concurrent sockets the agent can have open per origin.

The maxTotalSockets is a more global limit that applies to all origins of the connections managed by the agent.
Note that agents have one pool of sockets for each origin.

When calling Node's http.request(), the corresponding agent first checks the pool matching the desired origin to see if there are any idle sockets available. If there aren't any, it will check maxSockets and maxTotalSockets to see if it can open a new one. Finally, idle connections are closed / removed from the pool after the keepAliveMsecs timeout.

Let's say we have an Agent that connects to ES nodes A, B and C, and we define maxSockets: 5; maxTotalSockets: 10. In that scenario, we will have at most 5 concurrent connections to each origin (A, B and C), but not more than 10 in total.


*socket still open thanks to the keepAlive, but no request or response travelling through.
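
A minimal plain-Node sketch of the behaviour described above (not the elasticsearch-js client itself); host names and limits are illustrative:

```ts
import http from 'http';

const agent = new http.Agent({
  keepAlive: true,       // keep idle sockets open so they can be reused
  keepAliveMsecs: 1000,  // initial delay for TCP keep-alive probes on kept-alive sockets
  maxSockets: 5,         // at most 5 concurrent sockets per origin (host:port)
  maxTotalSockets: 10,   // at most 10 concurrent sockets across all origins
});

// With ES nodes A, B and C, the agent keeps one socket pool per origin,
// each capped at 5, while the total across the three never exceeds 10.
for (const host of ['es-node-a', 'es-node-b', 'es-node-c']) {
  http.get({ host, port: 9200, path: '/_cluster/health', agent }, (res) => {
    res.resume(); // drain the response so the socket can return to the free pool
  });
}
```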

gsoldevila commented Aug 31, 2022

Currently, we are creating multiple elasticsearch-js Client instances through Kibana's ClusterClient class.

Core's elasticsearch service contract exposes a createClient(...) method that accepts a type parameter and creates a new ClusterClient() instance behind the scenes.
Each ClusterClient creates 2 elasticsearch-js Client instances (root vs scoped, which use different credentials).

Each elasticsearch-js Client instance creates a connection pool with one connection per ES node, and each connection uses its own independent Agent instance.

Thus, with the current implementation, each agent is targeting only a single origin, and it will manage a single pool of sockets:

  • We are not exploiting the agent's "multi origin" capabilities.
  • We can't benefit from the maxTotalSockets parameter.

As a result, in our deployments we have a bunch of elasticsearch-js Client instances, each one using multiple agents (one per ES node). This makes it quite difficult to monitor (let alone limit) the number of open connections.
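
As a rough illustration of what aggregated monitoring could look like, here is a hypothetical sketch that assumes Core keeps a reference to every Agent it hands to a client (e.g. via a custom agent factory); none of these helpers exist today:

```ts
import http from 'http';

const trackedAgents = new Set<http.Agent>();

// Hypothetical factory: if Core created every agent through something like
// this, it could enumerate them later regardless of which Client owns them.
export function createTrackedAgent(options?: http.AgentOptions): http.Agent {
  const agent = new http.Agent({ keepAlive: true, ...options });
  trackedAgents.add(agent);
  return agent;
}

// Aggregate open sockets across all known agents (and thus across all Client
// instances and all ES nodes), e.g. for ops.metrics or monitoring documents.
export function totalOpenSockets(): { active: number; idle: number } {
  let active = 0;
  let idle = 0;
  for (const agent of trackedAgents) {
    for (const sockets of Object.values(agent.sockets)) active += sockets?.length ?? 0;
    for (const sockets of Object.values(agent.freeSockets)) idle += sockets?.length ?? 0;
  }
  return { active, idle };
}
```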


gsoldevila commented Aug 31, 2022

With that in mind, we can consider multiple initiatives:

  • Monitoring and limiting the number of open connections (this issue).
    • We must add some telemetry in order to find out where we stand.
    • We might need to define an appropriate limit for the number of open connections, in order to protect the event loop and maximise performance.
  • Reducing the number of ES Client instances to minimise memory consumption and improve performance: Reduce the number of Http Agent instances #139809.

@pgayvallet

Reducing the number of ES Client instances

Would having all the instantiated clients share a common parent to reuse the same ConnectionPool work here?

I guess not, given the ConnectionPool option passed to the client constructor is a class, and not an instance, right?

@gsoldevila

@pgayvallet yes, that works. When calling child() we inherit the connection pool.
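
For reference, a minimal sketch of what pool sharing via child() looks like with the elasticsearch-js API; the node URL, credentials and headers are placeholders:

```ts
import { Client } from '@elastic/elasticsearch';

// 'Root' client: owns the connection pool (and therefore the underlying sockets).
const rootClient = new Client({
  node: 'https://localhost:9200',
  auth: { username: 'kibana_system', password: 'changeme' },
});

// Child client: shares the parent's connection pool, so no additional sockets
// are opened; only per-request options such as headers differ.
const scopedClient = rootClient.child({
  headers: { authorization: 'Bearer <user-token>' },
});
```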

@pgayvallet

Hum, good to know. So in theory, we could have a way for all ES client instances to use the same connection pool (so connection, and therefore agent, if I followed correctly) by having all the clients created within ClusterClient inherit from the same 'root' client?

We would need to check that all the options we're using (in parseClientOptions and configureClient) work properly / can be used when calling parent.child() instead of creating a new instance.


gsoldevila commented Aug 31, 2022

Wait, not so fast. It works in the sense that it allows sharing the Agent + pools across the instances.
But it does not cover all of Kibana's cases, because some es-js Clients are created with different, incompatible configurations, which makes it impossible to have a unique Agent instance.

Plus there is the problem that the ClusterClient has a few methods (update(), empty(), ...) that impact the underlying pool and connections, and having this pool shared across instances might cause problems if/when these methods are called. Now that I think about it, instances created using child() are exposed to that problem ATM.
