
Heap usage reduction in Elasticsearch #31479

Open
aesgithub opened this issue Jun 20, 2018 · 16 comments

@aesgithub

This is a proposal to reduce heap allocations in Elasticsearch through object reuse. Elasticsearch allocates a number of per-document heap objects during indexing, such as Field (and derived Lucene) objects for metadata and data fields. We implement object reuse for Field and ParseContext objects across documents during bulk indexing. The changes reduce the ES heap allocation rate by 30%, heap garbage by 30%, and the promotion rate by 25% during indexing, while maintaining the indexing rate. The maximum GC pause time drops by 98%, from 13s to 0.3s, and API tail latencies drop significantly: by about 60% at the 100th percentile and by 50% at the 99.9th. All benchmarks were run with Rally's nyc_taxis dataset against a single-node i3.16xlarge cluster with a 128GB heap and the parallel GC (we see similar improvements with CMS; the code does not degrade throughput on small instance types).

The patch URL: https://gist.github.com/aesgithub/cc5b54fc3cf5a3a13f1f5ad3139dfd00
The patch is on top of commit hash 7376c35 (dated Jun 1, 2018). It uses multiple maps to cache Field objects. We did not implement cache eviction policies for the prototype (the indexing, however, is fully functional). The caching is also optimized for NUMA instances (i.e., caches local to a thread), since we found that on multi-socket instance types NUMA contention limits indexing rates.
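
For illustration, a minimal sketch of the general idea follows. This is not the actual patch: the LongPointCache name and getOrCreate helper are made up here, and it only covers single-valued long fields, but it shows the shape of a per-thread map of reusable Field instances keyed by field name.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.LongPoint;

final class LongPointCache {
    // One map per indexing thread, so there is no synchronization and no
    // cross-socket (NUMA) contention on the cache itself.
    private static final ThreadLocal<Map<String, LongPoint>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    static LongPoint getOrCreate(String fieldName, long value) {
        Map<String, LongPoint> cache = CACHE.get();
        LongPoint field = cache.get(fieldName);
        if (field == null) {
            // First time this thread sees the field: allocate once and keep it.
            field = new LongPoint(fieldName, value);
            cache.put(fieldName, field);
        } else {
            // Subsequent documents reuse the same instance with a new value.
            field.setLongValue(value);
        }
        return field;
    }
}
```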

We want to start this as a discussion on improving some of the heap allocations in Elasticsearch. Making this patch production-ready will take some investment and we want to ensure that it is done after we get feedback.

@colings86 colings86 added the :Search/Mapping Index mappings, including merging and defining field types label Jun 21, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@colings86
Contributor

@jpountz could you take a look at this one?

@aesgithub
Author

Pinging back... any thoughts on the proposal?

@redlus

redlus commented Jul 15, 2018

One of the biggest limitations of Elasticsearch in terms of data per node is the 1:48 to 1:96 ratio of assigned heap to indices size. This optimization could drive major improvements in that respect, and we'd really like to hear what the Elastic team has to say.

@jpountz
Contributor

jpountz commented Jul 16, 2018

Sorry for the lack of response, I had missed this issue. I agree we should look into reusing Field instances, which is recommended by Lucene. I see you were careful with multi-valued fields, which are the main source of complexity here, since it is fine to reuse field instances across documents but not within the same document. I also like that you didn't try to optimize fields for which this is less likely to help, such as binary or percolator fields, which are less common than keywords and numerics.
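
To spell that constraint out, here is a tiny illustration (not from the patch): every value of a multi-valued field must still be alive in the Document when it is indexed, so only the first value can come from a reused instance.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;

class MultiValuedExample {
    // Reuse across documents is fine, but within one document each value of a
    // multi-valued field needs its own Field instance.
    static void addDurations(Document doc) {
        LongPoint duration = new LongPoint("duration", 10L);
        doc.add(duration);
        // Calling duration.setLongValue(20L) here would overwrite the first
        // value before the document reaches the index writer, so a fresh
        // instance is required for the second value:
        doc.add(new LongPoint("duration", 20L));
    }
}
```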

This patch improves memory reuse but also adds a bit of CPU overhead because of the hashmap lookup for the cached field. Hashmap lookups are fast, but this parsing code is called in very tight loops. I'd be curious to know whether we can get rid of this lookup somehow to make this change more likely to be a net win for every user.

@aesgithub
Author

Thanks for the support! Adrien, you are right that there is an expected CPU tradeoff here, since managed allocations have relatively low overhead. Based on our benchmarks, even with the beefy i3 instance types on AWS, we have not seen any increase in CPU utilization. On the other hand, significantly cutting down young-gen allocations (and the allocation rate) did cut down promotions (and hence pauses) and the associated GC cycles.

We initially started with a single global map shared across all ingestion threads. This turned out to have two limitations: heavy contention on the data structure, and NUMA contention on multi-socket machines (both when operating on the map and in GC-related stall cycles). We moved to thread-local maps, which alleviated both problems (with O(threads) heap instead of O(documents) heap). The code has a few more optimizations around caching. The number of map operations remains proportional to the number of fields parsed, but the CPU cycles spent there are insignificant relative to the cycles spent in bulk ingestion. Given that the field objects depend on the field names, I'm not sure we could eliminate the lookup operation. Open to ideas!

@aesgithub
Author

Hi Adrien, do you see a path forward? Are there any PRs relevant to this patch?

@jpountz
Contributor

jpountz commented Aug 22, 2018

I have some ideas, but I don't like them much because of the complexity they would introduce... Looking at your patch again, I'm wondering whether a simpler improvement would be to stop keying the cache on the field name. Put differently, keep a cache of an arbitrary number of SortedNumericDocValuesField, LongPoint, etc. instances and don't enforce that a given field name always reuses the same instance. This won't work for fields that need a configured FieldType, but it might be good enough?

@aesgithub
Author

Hi Adrien, sorry for the delay. Any reduction in O(documents * fields) heap allocations is better than the current state! I'm not sure I understand your point yet, though. Given that field objects have to be created with the name of the field (which depends on the current data), how would we create the array of field objects as the cache?

@muralikpbhat

Hi Adrien, if we reuse a field based on its type rather than its field name, won't that contradict the point from your previous comment that "it is fine to reuse field instances across documents, but not within the same document"?

@atris

atris commented Jan 24, 2019

@jpountz @muralikpbhat One thing that could be done is to create a whitelist (maybe just a bit vector with one bit per cache) and have a document use all caches except its own. Or maybe just tag cached objects with their corresponding documents in the cache itself. So you look at all Field objects except the ones in your own document when searching the cache for reusing an instance.

Thoughts?

@jpountz
Contributor

jpountz commented Jan 24, 2019

Given that field objects have to be created with the name of the field (which depends on the current data), how would we create the array of field objects as the cache?

Oops good catch, this doesn't work.

if we reuse a field based on its type rather than its field name, won't that contradict the point from your previous comment that "it is fine to reuse field instances across documents, but not within the same document"?

I was thinking of only doing it for the first value of a field.

have a document use all caches except its own

Sorry I don't get the idea.

So you look at all Field objects except the ones in your own document when searching the cache for reusing an instance.

One thing that I want to avoid is having to search a cache. Indexing a field doesn't use much CPU; for instance, when Lucene processes a LongPoint, it mostly appends the value to the end of a buffer. Saving an object allocation by introducing a hash lookup adds complexity and doesn't sound like a net win performance-wise to me.

One approach I had in mind was to replace FieldMapper#parse with something like FieldMapper#getPerThreadParser which would return an object that would be responsible for parsing but may only be called from a single thread. This object could cache some state locally to reuse fields. Then we would have to propagate it up to DocumentParser which would maintain ThreadLocal references to trees of per-thread parsers. This way we would still have only one threadlocal object, but no hash lookups anymore. The thing that I would watch if we explored that route would be the complexity that it introduces.
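
To make the idea concrete, here is a hedged sketch. The PerThreadLongParser and parseValue names are hypothetical and this is not the actual FieldMapper API; the point is that the reusable state lives directly on a parser instance owned by one thread, so no map lookup happens per value.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;

// Hypothetical per-thread parser: because the document parser would hand each
// indexing thread its own instance, the reusable Field can live in a plain
// member and no hash lookup is needed per value.
final class PerThreadLongParser {
    private final String fieldName;
    private LongPoint reusable; // reused across documents, never across threads

    PerThreadLongParser(String fieldName) {
        this.fieldName = fieldName;
    }

    void parseValue(long value, Document doc) {
        // Only safe for the first value of the field within a document; later
        // values of a multi-valued field would still need fresh instances.
        if (reusable == null) {
            reusable = new LongPoint(fieldName, value);
        } else {
            reusable.setLongValue(value);
        }
        doc.add(reusable);
    }
}
```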

@atris

atris commented Jan 25, 2019

A couple more points:

  1. The patch lacks cache eviction
  2. I am not sure if we should be creating a cache class for each type. It looks a bit weird and hard to maintain.

I believe @jpountz's idea of creating a field-name-independent cache should be a reasonable way of achieving a sizable improvement without too intrusive a change.

@aesgithub
Author

@jpountz we have tried the ThreadLocal idea, propagating it as far up the call stack as possible. It's been a while, so I don't remember the implementation complexity offhand, but we did incur a performance hit (~10-20% drop in bulk throughput, IIRC) due to ThreadLocal, especially under multi-socket NUMA contention.

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@itiyamas

One approach I had in mind was to replace FieldMapper#parse with something like FieldMapper#getPerThreadParser which would return an object that would be responsible for parsing but may only be called from a single thread. This object could cache some state locally to reuse fields. Then we would have to propagate it up to DocumentParser which would maintain ThreadLocal references to trees of per-thread parsers. This way we would still have only one threadlocal object, but no hash lookups anymore. The thing that I would watch if we explored that route would be the complexity that it introduces.

@jpountz Isn't what you suggested already being done by the current PR (it could be refactored to look like your suggestion)? The object you are referring to is FieldObjectCache, which is a per-thread object that maintains field-level caches. I didn't get the part about no hash lookups being needed if we maintain per-thread parsers: how would the cache state be accessed for a given field in the same thread? We need a lookup anyway, unless we know upfront that the mappings are fixed. Field objects are created by name, which is an immutable field of the Field object; all you can change are the values. Am I missing anything here?

@rjernst rjernst added Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@gwbrown gwbrown removed the needs:triage Requires assignment of a team area label label Dec 16, 2020