The real memory usage of LRUQueryCache is 40 times larger than estimated value in _nodes/stats #89715

Open
boicehuang opened this issue Aug 30, 2022 · 8 comments
Labels: >bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)

Comments

@boicehuang
Contributor

boicehuang commented Aug 30, 2022

In one of our production clusters, the real memory usage of LRUQueryCache could reach 10GB, almost 40 times larger than the estimated value (247MB) reported in _nodes/stats.
I have run into this problem a few times. It is easy to reproduce when the index is large enough (data size of 1TB or more). With term queries that match a large number of docs in the index, LRUQueryCache accumulates entries and consumes much more memory than the value estimated in _nodes/stats.
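For context on why these two numbers can diverge this much: the query cache bounds itself by an estimated size, and Lucene's LRUQueryCache charges only a small flat default (roughly a kilobyte) for a cached query object that does not report its own size, so an entry that transitively retains a large object graph is still counted as about 1 KB. The toy sketch below (plain Java, not Lucene or Elasticsearch code; the constants are illustrative) shows how an estimate-based byte budget fails to bound real heap usage when per-entry estimates are far too low.

import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch only: a cache whose byte budget is enforced against a flat
// per-entry *estimate*, while each entry actually retains a much larger
// object. The estimate (what stats would report) stays tiny while real
// heap usage (what a heap dump would show) keeps growing.
public class EstimatedSizeCacheSketch {
    static final long MAX_ESTIMATED_BYTES = 256L * 1024 * 1024; // budget checked against the estimate
    static final long FLAT_ESTIMATE_PER_ENTRY = 1024;           // flat charge per cached key (illustrative)
    static final int ACTUAL_BYTES_PER_ENTRY = 256 * 1024;       // what each entry really retains (illustrative)

    public static void main(String[] args) {
        Deque<byte[]> cache = new ArrayDeque<>();
        long estimatedBytes = 0;
        // Stop the demo once "real" usage reaches 64 MB; the estimate-based budget never triggers.
        while ((long) cache.size() * ACTUAL_BYTES_PER_ENTRY < 64L * 1024 * 1024
                && estimatedBytes + FLAT_ESTIMATE_PER_ENTRY <= MAX_ESTIMATED_BYTES) {
            cache.addLast(new byte[ACTUAL_BYTES_PER_ENTRY]);
            estimatedBytes += FLAT_ESTIMATE_PER_ENTRY;
        }
        System.out.printf("estimated: %d KB of a %d MB budget, actually retained: %d MB%n",
                estimatedBytes / 1024,
                MAX_ESTIMATED_BYTES / (1024 * 1024),
                (long) cache.size() * ACTUAL_BYTES_PER_ENTRY / (1024 * 1024));
    }
}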

Elasticsearch Version

7.14

Installed Plugins

none

Java Version

bundled

OS Version

CentOS 6.6 x86_64

Problem Description

Below are the stats of one ES node.

  1. indices stats
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1          ulkMje4qRI2K-WCS2vdydA  64   0 6767968410   2574351645      1.7tb          1.7tb
  2. indices.query_cache in _nodes/stats is 247MB

[screenshot]

  3. heap dump analysis result from the Memory Analyzer Tool (MAT), showing the query cache at 9.4GB

[screenshot]

  4. dominator tree of the heap dump analysis results

[screenshot]

Steps to Reproduce

  1. indices stats of the cluster
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1          ulkMje4qRI2K-WCS2vdydA  64   0 6767968410   2574351645      1.7tb          1.7tb
  2. continuous query
POST index1/_search
{"size":20,"query":{"constant_score":{"filter":{"bool":{"must":[{"term":{"isDeleted":{"value":false,"boost":1.0}}},{"term":{"goodsVid":{"value":6001607752679,"boost":1.0}}},{"terms":{"goodsVid":[611001607752679],"boost":1.0}},{"terms":{"goodsId":[133322937580118],"boost":1.0}},{"terms":{"activityIdUnique":["3-21006095901112"],"boost":1.0}},{"range":{"bizSource":{"from":"101","to":"101","include_lower":true,"include_upper":true,"boost":1.0}}}],"must_not":[{"nested":{"query":{"terms":{"goodsTags.tagCode":[1007,1004],"boost":1.0}},"path":"goodsTags","ignore_unmapped":false,"score_mode":"min","boost":1.0}},{"terms":{"soldType":[2],"boost":1.0}},{"terms":{"abilityCodeList":["1002","1001"],"boost":1.0}},{"terms":{"subGoodsType":[202,203,204],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"boost":1.0}},"_source":{"includes":["vid","goodsId","templateId","activityId","activityIdUnique","createTime","goodsVid","activitySort"],"excludes":[]},"sort":[{"esCreateTime":{"order":"desc"}},{"bizSource":{"order":"desc"}},{"activityId":{"order":"desc"}},{"goodsId":{"order":"desc"}},{"goodsVid":{"order":"desc"}}]}

LRUQueryCache will slowly accumulate entries, and heap usage will keep rising to 80%. As described above, this is easy to reproduce when the index is large enough (1TB of data or more) and term queries match a large number of docs in the index; LRUQueryCache then consumes much more memory than the value estimated in _nodes/stats.
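For reference, a reproduction driver along these lines could look like the following minimal Java sketch. It is illustrative only: it assumes an unsecured cluster reachable at http://localhost:9200 and an index named index1 with a numeric goodsVid field, and it uses a simplified constant_score term filter rather than the full query above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative reproduction driver, not part of the original report.
// It repeatedly issues constant_score term filters, cycling over a bounded
// set of values so that each filter is seen often enough to be cached.
// Run until stopped, then compare indices.query_cache in _nodes/stats
// with actual heap usage.
public class QueryCacheRepro {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (long i = 0; ; i++) {
            long goodsVid = 6001607752679L + (i % 1_000); // hypothetical field values
            String body = "{\"size\":0,\"query\":{\"constant_score\":{\"filter\":"
                    + "{\"term\":{\"goodsVid\":" + goodsVid + "}}}}}";
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/index1/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }
    }
}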

Additional Notes (if relevant)

I can provide a memory dump if you need it.

@boicehuang boicehuang added >bug needs:triage Requires assignment of a team area label labels Aug 30, 2022
@boicehuang boicehuang changed the title The real memory usage of LRUQueryCache is 50 times larger than estimated value in _nodes/stats The real memory usage of LRUQueryCache is 40 times larger than estimated value in _nodes/stats Aug 30, 2022
@nik9000 nik9000 added the :Core/Infra/Core Core issues without another label label Aug 30, 2022
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed needs:triage Requires assignment of a team area label labels Aug 30, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@nik9000
Member

nik9000 commented Aug 30, 2022

I can provide a memory dump if you need it.

We'll try to reproduce it locally, but a memory dump would be useful to make sure we reproduce it in the same way. OTOH, if you can reproduce it by running a bash script on an empty cluster, that'd be best, because it'd be smaller and easier to post publicly. But if you can't, it's all good. I've just sent you an email with a place to upload the heap dump if you'd like to do that.

@boicehuang
Contributor Author

boicehuang commented Sep 2, 2022

@nik9000 I have uploaded the heap dump.

@nik9000
Member

nik9000 commented Sep 8, 2022

OK! I've cracked open the heap dump. This has something to do with the low-level cancellation infrastructure. It may have been fixed in later versions. I'm investigating.

@nik9000
Member

nik9000 commented Sep 8, 2022

The heap dump looks to come from Elasticsearch 7.10.1. There's been quite a bit of memory work in this area since - #61788 comes to mind, though it doesn't look quite right. I think there is another one I'm missing. One moment.

@nik9000
Member

nik9000 commented Sep 8, 2022

Also #61788 is in 7.10.1 so it can't be that!

@nik9000
Member

nik9000 commented Sep 8, 2022

Most of the space seems to be going to something that looks like this:

Class Name                                                                                                      | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------------------------------------------------------
queryCancellationContext org.elasticsearch.search.internal.SearchContext$QueryCancellationContext @ 0x1003fa5db8|           32 |    70,427,416
-----------------------------------------------------------------------------------------------------------------------------------------------

But I don't see a QueryCancellationContext in 7.10 or in our more current branches. Is it yours by any chance?

@nik9000
Member

nik9000 commented Sep 8, 2022

It looks like there's a member on DefaultSearchContext called queryCancellationContext which closes over the DefaultSearchContext that contains it. That's fine, but the queries seem to have a reference to that thing. I'm not really used to reading MAT so I'm probably getting something wrong, but I'm a bit confused about what's up.
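To make the suspected reference chain concrete, here is an illustrative sketch. The class and field names are hypothetical, mirroring what the heap dump suggests rather than any upstream Elasticsearch or Lucene code: a cancellation check defined as a lambda on the search context captures that context, and if a query object that ends up as a cache key holds on to the check, every cached entry transitively retains the whole context, none of which shows up in a flat per-key size estimate.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative only: hypothetical names standing in for the objects seen in the heap dump.
public class CapturedContextSketch {

    static class SearchContextLike {
        final byte[] perRequestState = new byte[1024 * 1024]; // stands in for everything a search context retains
        // The cancellation check is a lambda over `this`, so it keeps the whole context reachable.
        final BooleanSupplier cancellationCheck = () -> perRequestState.length == 0;
    }

    // A query-like cache key that hangs on to the per-request cancellation check.
    static class CachedQueryLike {
        final BooleanSupplier cancellationCheck;

        CachedQueryLike(BooleanSupplier cancellationCheck) {
            this.cancellationCheck = cancellationCheck;
        }
    }

    public static void main(String[] args) {
        List<CachedQueryLike> cache = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            SearchContextLike context = new SearchContextLike();       // per-request, should die with the request
            cache.add(new CachedQueryLike(context.cancellationCheck)); // the cached key keeps the lambda ...
        }                                                              // ... and the lambda keeps the context alive
        // Each cached key now transitively retains ~1 MB of "search context" that a
        // flat per-key size estimate would never account for.
        System.out.println("cached keys: " + cache.size());
    }
}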
