Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompletionStats need only be recomputed on a refresh #51915

Closed
DaveCTurner opened this issue Feb 5, 2020 · 8 comments · Fixed by #51991
Closed

CompletionStats need only be recomputed on a refresh #51915

DaveCTurner opened this issue Feb 5, 2020 · 8 comments · Fixed by #51991
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement

Comments

@DaveCTurner
Copy link
Contributor

Computing the completion stats involves walking every field of every segment of every relevant shard, looking for completion fields. By default the seemingly-innocuous GET _stats API does this for every shard in the cluster. I've seen more than a few cases where an external monitoring system is hitting an overly-broad stats API hard enough that the cluster can't keep up. The consequence is that these stats requests pile up in the management threadpool and interfere with the other users of that threadpool.

As far as I can tell, these stats only change on a refresh. In most cases this means they do not change much at all, so I think we can improve the situation by caching these stats between refreshes.

I also note that in #33847 we changed the source of these stats from the external searcher to the internal one. I'm not sure why - external seems more appropriate to me, and would help with the caching since external refreshes may be very infrequent indeed.

Relates:

@DaveCTurner DaveCTurner added >enhancement :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. team-discuss labels Feb 5, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Engine)

@rtkgjacobs
Copy link

rtkgjacobs commented Feb 5, 2020

There are number of external projects (just listing a few we've found with our audits so far) that hit this pinch point by default as an FYI

Having the ability to have a default _stats call be far less abusive / greedy r/e above would be ideal. Just tossing in my vote (otherwise for our own deployment, we may be forced to fork various assets and maintain private builds now)

Im sure other tools / fixtures out there can pinch others as it has our own ramping of a large scale set of clusters.

@dnhatn
Copy link
Member

dnhatn commented Feb 5, 2020

+1 to cache the completion stats (especially when we have many fields). Another optimization is to not compute the completion stats if we don't have any suggest field in the mapping.

@rtkgjacobs
Copy link

rtkgjacobs commented Feb 5, 2020

+1 to cache the completion stats (especially when we have many fields). Another optimization is to not compute the completion stats if we don't have any suggest field in the mapping.
We have turned off all things hitting variants of /stats API's and are still finding high management threading / queue backlogs. We confirmed our mappings do not call out or enable suggested fields.

Is ES7 set to always generate completion stats during ingestion ? Can we turn this off? It is possibly a blocker for us to upgrade to ES7 from prior (old) versions in production for us if so... We are finding with our latest tests removing the above fixtures, we are still seeing it stand out in our hot threads. Does x-pack monitoring hit or stimulate the gathering of completion stats?

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Feb 5, 2020

Is ES7 set to always generate completion stats during ingestion ?

It's not really anything to do with ingestion, nor is it specific to version 7. By default the completion stats are computed for each stats call and this seems to be true as far back as version 1.7 (I haven't checked further back).

Can we turn this off?

Indeed you can; if you don't need to monitor completion stats so frequently then you should exclude them from these stats requests. I can't comment on how you might configure the third-party products linked above to do this, sorry, but Elasticsearch has had support for selecting specific stats for a long long time.

@rtkkroland
Copy link

rtkkroland commented Feb 5, 2020

but Elasticsearch has had support for selecting specific stats for a long long time.

Is this true for /_cluster/stats as well? It doesn't mention it in the Guide https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-stats.html and it returns the full set of completion stats.

@DaveCTurner
Copy link
Contributor Author

@rtkkroland oh dear, yes, I didn't realise we also compute completion stats for the cluster-level API but you're right I think we do:

private static final CommonStatsFlags SHARD_STATS_FLAGS = new CommonStatsFlags(CommonStatsFlags.Flag.Docs, CommonStatsFlags.Flag.Store,
CommonStatsFlags.Flag.FieldData, CommonStatsFlags.Flag.QueryCache,
CommonStatsFlags.Flag.Completion, CommonStatsFlags.Flag.Segments);

It doesn't look like there are selectors for this API 😕

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 6, 2020
Computing the stats for completion fields may involve a significant amount of
work since it walks every field of every segment looking for completion fields.
Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for
every shard in the cluster. This repeated work is unnecessary since these stats
do not change between refreshes; in many indices they remain constant for a
long time.

This commit introduces a cache for these stats which is invalidated on a
refresh, allowing most stats calls to bypass the work needed to compute them on
most shards.

Closes elastic#51915
@rtkgjacobs
Copy link

Some additional follow up / findings on our side. We had Netdata installed on all our cluster nodes and its default configuration hits _cluster/stats - and its default refresh rate seems too frequent. Removing that vastly improved our performance (several orders of magnitude). We have slightly raised the maximum threads of type:management to match our core counts to offset what 30s prometheus scrapes we run. We'll be watching closely balancing these API's use, but passing on a warning for others that might fall into our trap.

Strongly encourage the cost or risks of API's like _cluster/stats and _all/stats etc are derisked where possible. A few of us on this project have more grey hairs now.

DaveCTurner added a commit that referenced this issue Feb 27, 2020
Computing the stats for completion fields may involve a significant amount of
work since it walks every field of every segment looking for completion fields.
Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for
every shard in the cluster. This repeated work is unnecessary since these stats
do not change between refreshes; in many indices they remain constant for a
long time.

This commit introduces a cache for these stats which is invalidated on a
refresh, allowing most stats calls to bypass the work needed to compute them on
most shards.

Closes #51915
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 27, 2020
Computing the stats for completion fields may involve a significant amount of
work since it walks every field of every segment looking for completion fields.
Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for
every shard in the cluster. This repeated work is unnecessary since these stats
do not change between refreshes; in many indices they remain constant for a
long time.

This commit introduces a cache for these stats which is invalidated on a
refresh, allowing most stats calls to bypass the work needed to compute them on
most shards.

Closes elastic#51915
Backport of elastic#51991
DaveCTurner added a commit that referenced this issue Feb 27, 2020
Computing the stats for completion fields may involve a significant amount of
work since it walks every field of every segment looking for completion fields.
Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for
every shard in the cluster. This repeated work is unnecessary since these stats
do not change between refreshes; in many indices they remain constant for a
long time.

This commit introduces a cache for these stats which is invalidated on a
refresh, allowing most stats calls to bypass the work needed to compute them on
most shards.

Closes #51915
Backport of #51991
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants