The real memory usage of LRUQueryCache is 40 times larger than estimated value in _nodes/stats #89715

Open
boicehuang opened this issue Aug 30, 2022 · 8 comments
Labels: >bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)

Comments

@boicehuang
Contributor

boicehuang commented Aug 30, 2022

In one of our production clusters, the real memory usage of LRUQueryCache could reach 10GB, almost 40 times larger than the estimated value (247MB) reported in _nodes/stats.
I have run into this problem a few times. It is easy to reproduce when the index is large enough (data size of 1TB or more). With term queries that match a large number of docs in the index, LRUQueryCache accumulates entries and consumes much more memory than the value estimated in _nodes/stats.
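For context on why these two numbers can diverge this much: the query cache bounds itself by an estimated size, and Lucene's LRUQueryCache charges only a small flat default (roughly a kilobyte) for a cached query object that does not report its own size, so an entry that transitively retains a large object graph is still counted as about 1 KB. The toy sketch below (plain Java, not Lucene or Elasticsearch code; the constants are illustrative) shows how an estimate-based byte budget fails to bound real heap usage when per-entry estimates are far too low.

import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch only: a cache whose byte budget is enforced against a flat
// per-entry *estimate*, while each entry actually retains a much larger
// object. The estimate (what stats would report) stays tiny while real
// heap usage (what a heap dump would show) keeps growing.
public class EstimatedSizeCacheSketch {
    static final long MAX_ESTIMATED_BYTES = 256L * 1024 * 1024; // budget checked against the estimate
    static final long FLAT_ESTIMATE_PER_ENTRY = 1024;           // flat charge per cached key (illustrative)
    static final int ACTUAL_BYTES_PER_ENTRY = 256 * 1024;       // what each entry really retains (illustrative)

    public static void main(String[] args) {
        Deque<byte[]> cache = new ArrayDeque<>();
        long estimatedBytes = 0;
        // Stop the demo once "real" usage reaches 64 MB; the estimate-based budget never triggers.
        while ((long) cache.size() * ACTUAL_BYTES_PER_ENTRY < 64L * 1024 * 1024
                && estimatedBytes + FLAT_ESTIMATE_PER_ENTRY <= MAX_ESTIMATED_BYTES) {
            cache.addLast(new byte[ACTUAL_BYTES_PER_ENTRY]);
            estimatedBytes += FLAT_ESTIMATE_PER_ENTRY;
        }
        System.out.printf("estimated: %d KB of a %d MB budget, actually retained: %d MB%n",
                estimatedBytes / 1024,
                MAX_ESTIMATED_BYTES / (1024 * 1024),
                (long) cache.size() * ACTUAL_BYTES_PER_ENTRY / (1024 * 1024));
    }
}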

Elasticsearch Version

7.14

Installed Plugins

none

Java Version

bundled

OS Version

CentOS 6.6 x86_64

Problem Description

Below are the stats of one ES node.

  1. indices stats
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1          ulkMje4qRI2K-WCS2vdydA  64   0 6767968410   2574351645      1.7tb          1.7tb
  2. indices.query_cache in _nodes/stats is 247MB

[screenshot]

  3. heap dump analysis result from the Memory Analyzer Tool (MAT), showing the query cache at 9.4GB

[screenshot]

  4. dominator tree of the heap dump analysis results

[screenshot]

Steps to Reproduce

  1. indices stats of the cluster
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1          ulkMje4qRI2K-WCS2vdydA  64   0 6767968410   2574351645      1.7tb          1.7tb
  2. continuous query
POST index1/_search
{"size":20,"query":{"constant_score":{"filter":{"bool":{"must":[{"term":{"isDeleted":{"value":false,"boost":1.0}}},{"term":{"goodsVid":{"value":6001607752679,"boost":1.0}}},{"terms":{"goodsVid":[611001607752679],"boost":1.0}},{"terms":{"goodsId":[133322937580118],"boost":1.0}},{"terms":{"activityIdUnique":["3-21006095901112"],"boost":1.0}},{"range":{"bizSource":{"from":"101","to":"101","include_lower":true,"include_upper":true,"boost":1.0}}}],"must_not":[{"nested":{"query":{"terms":{"goodsTags.tagCode":[1007,1004],"boost":1.0}},"path":"goodsTags","ignore_unmapped":false,"score_mode":"min","boost":1.0}},{"terms":{"soldType":[2],"boost":1.0}},{"terms":{"abilityCodeList":["1002","1001"],"boost":1.0}},{"terms":{"subGoodsType":[202,203,204],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"boost":1.0}},"_source":{"includes":["vid","goodsId","templateId","activityId","activityIdUnique","createTime","goodsVid","activitySort"],"excludes":[]},"sort":[{"esCreateTime":{"order":"desc"}},{"bizSource":{"order":"desc"}},{"activityId":{"order":"desc"}},{"goodsId":{"order":"desc"}},{"goodsVid":{"order":"desc"}}]}

LRUQueryCache will slowly accumulate entries, and heap usage will keep rising to 80%. As described above, this is easy to reproduce when the index is large enough (1TB of data or more) and term queries match a large number of docs in the index; LRUQueryCache then consumes much more memory than the value estimated in _nodes/stats.
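For reference, a reproduction driver along these lines could look like the following minimal Java sketch. It is illustrative only: it assumes an unsecured cluster reachable at http://localhost:9200 and an index named index1 with a numeric goodsVid field, and it uses a simplified constant_score term filter rather than the full query above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative reproduction driver, not part of the original report.
// It repeatedly issues constant_score term filters, cycling over a bounded
// set of values so that each filter is seen often enough to be cached.
// Run until stopped, then compare indices.query_cache in _nodes/stats
// with actual heap usage.
public class QueryCacheRepro {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (long i = 0; ; i++) {
            long goodsVid = 6001607752679L + (i % 1_000); // hypothetical field values
            String body = "{\"size\":0,\"query\":{\"constant_score\":{\"filter\":"
                    + "{\"term\":{\"goodsVid\":" + goodsVid + "}}}}}";
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/index1/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }
    }
}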

Additional Notes (if relevant)

I can provide a memory dump if you need it.

@boicehuang boicehuang added >bug needs:triage Requires assignment of a team area label labels Aug 30, 2022
@boicehuang boicehuang changed the title The real memory usage of LRUQueryCache is 50 times larger than estimated value in _nodes/stats The real memory usage of LRUQueryCache is 40 times larger than estimated value in _nodes/stats Aug 30, 2022
@nik9000 nik9000 added the :Core/Infra/Core Core issues without another label label Aug 30, 2022
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed needs:triage Requires assignment of a team area label labels Aug 30, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@nik9000
Member

nik9000 commented Aug 30, 2022

I can provide a memory dump if you need it.

We'll try to reproduce it locally, but a memory dump would be useful to make sure we reproduce it in the same way. OTOH, if you can reproduce it by running a bash script on an empty cluster, that'd be best, because it'd be smaller and easier to post publicly. But if you can't, it's all good. I've just sent you an email with a place to upload the heap dump if you'd like to do that.

@boicehuang
Contributor Author

boicehuang commented Sep 2, 2022

@nik9000 I have uploaded the heap dump.

@nik9000
Member

nik9000 commented Sep 8, 2022

OK! I've cracked open the heap dump. This has something to do with the low-level cancellation infrastructure. It may have been fixed in later versions. I'm investigating.

@nik9000
Member

nik9000 commented Sep 8, 2022

The heap dump looks to come from Elasticsearch 7.10.1. There's been quite a bit of memory work in this area since - #61788 comes to mind, though it doesn't look quite right. I think there is another one I'm missing. One moment.

@nik9000
Member

nik9000 commented Sep 8, 2022

Also #61788 is in 7.10.1 so it can't be that!

@nik9000
Member

nik9000 commented Sep 8, 2022

Most of the space seems to be going to something that looks like this:

Class Name                                                                                                      | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------------------------------------------------------
queryCancellationContext org.elasticsearch.search.internal.SearchContext$QueryCancellationContext @ 0x1003fa5db8|           32 |    70,427,416
-----------------------------------------------------------------------------------------------------------------------------------------------

But I don't see a QueryCancellationContext in 7.10 or in our more current branches. Is it yours by any chance?

@nik9000
Member

nik9000 commented Sep 8, 2022

It looks like there's a member on DefaultSearchContext called queryCancellationContext which closes over the DefaultSearchContext that contains it. That's fine, but the queries seem to have a reference to that thing. I'm not really used to reading MAT so I'm probably getting something wrong, but I'm a bit confused about what's up.
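To make the suspected reference chain concrete, here is an illustrative sketch. The class and field names are hypothetical, mirroring what the heap dump suggests rather than any upstream Elasticsearch or Lucene code: a cancellation check defined as a lambda on the search context captures that context, and if a query object that ends up as a cache key holds on to the check, every cached entry transitively retains the whole context, none of which shows up in a flat per-key size estimate.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative only: hypothetical names standing in for the objects seen in the heap dump.
public class CapturedContextSketch {

    static class SearchContextLike {
        final byte[] perRequestState = new byte[1024 * 1024]; // stands in for everything a search context retains
        // The cancellation check is a lambda over `this`, so it keeps the whole context reachable.
        final BooleanSupplier cancellationCheck = () -> perRequestState.length == 0;
    }

    // A query-like cache key that hangs on to the per-request cancellation check.
    static class CachedQueryLike {
        final BooleanSupplier cancellationCheck;

        CachedQueryLike(BooleanSupplier cancellationCheck) {
            this.cancellationCheck = cancellationCheck;
        }
    }

    public static void main(String[] args) {
        List<CachedQueryLike> cache = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            SearchContextLike context = new SearchContextLike();       // per-request, should die with the request
            cache.add(new CachedQueryLike(context.cancellationCheck)); // the cached key keeps the lambda ...
        }                                                              // ... and the lambda keeps the context alive
        // Each cached key now transitively retains ~1 MB of "search context" that a
        // flat per-key size estimate would never account for.
        System.out.println("cached keys: " + cache.size());
    }
}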
