Heap usage reduction in Elasticsearch #31479
Comments
Pinging @elastic/es-search-aggs
@jpountz could you take a look at this one?
Pinging back... any thoughts on the proposal?
One of the biggest limitations of Elasticsearch in terms of data per node is the 1/48 to 1/96 ratio of assigned heap to indexed data size. This optimization may drive major improvements in this respect, and we'd really like to hear what the Elastic team has to say.
Sorry for the lack of response, I had missed this issue. I agree we should look into reuse of Field instances, which is recommended by Lucene. I see you were careful with multi-valued fields, which are the main source of complexity here since it is fine to reuse field instances across documents, but not within the same document. I also liked that you didn't try to optimize fields for which this is less likely to help, like binary fields or percolator fields, which are less common than keywords and numerics. This patch improves memory reuse but also adds a bit of CPU overhead because of the hashmap lookup for the cached field. Hashmap lookups are fast, but this parsing code is called in very tight loops. I'd be curious to know whether we can get rid of this lookup somehow, to make this change more likely to be a net win for every user.
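The constraint above — reuse across documents, never within one — can be sketched as follows. This is a hypothetical illustration with stand-in classes, not code from the patch:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a Lucene Field: a named holder whose value can be reset.
class ReusableField {
    final String name;
    long value;
    ReusableField(String name) { this.name = name; }
}

class PerFieldCache {
    private final Map<String, ReusableField> cache = new HashMap<>();

    // First value of a field in the current document: the previous document
    // has already been handed to the indexer, so mutating the cached
    // instance is safe.
    ReusableField firstValue(String name, long v) {
        ReusableField f = cache.computeIfAbsent(name, ReusableField::new);
        f.value = v;
        return f;
    }

    // Second and later values of a multi-valued field: the cached instance
    // is still live in this document's field list, so allocate a new one.
    ReusableField extraValue(String name, long v) {
        ReusableField f = new ReusableField(name);
        f.value = v;
        return f;
    }
}
```

The cached instance amortizes allocation over the document stream; only the (rarer) extra values of multi-valued fields still allocate.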
Thanks for the support! Adrien, you are right that there is an expected CPU tradeoff here, since managed allocations have relatively low overhead. Based on our benchmarks, even with the beefy i3 instance types on AWS, we have not seen any increase in CPU utilization. On the other hand, cutting down young-gen allocations (and the allocation rate) significantly did cut down promotions (and hence pauses) and the associated GC cycles. We initially started with a global map shared across all ingestion threads. This turned out to have two limitations: heavy data-structure operation throughput, and NUMA contention on multi-socket machines (both to operate on the map and in GC-related stall cycles). We moved to thread-local maps, which mitigated both problems (with O(threads) heap instead of O(docs) heap). The code has a few more optimizations around caching. The number of map operations remains proportional to the number of fields parsed, but the CPU cycles used are insignificant relative to the cycles spent in bulk ingestion. Given that the field objects depend on the field names, I'm not sure we could eliminate a lookup operation. Open to ideas!
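The thread-local map layout described above might look roughly like this (a sketch with hypothetical names; a simple one-slot array stands in for the cached Field object):

```java
import java.util.HashMap;
import java.util.Map;

// Each indexing thread owns its own map, so lookups need no synchronization
// and there is no cross-socket contention on a shared structure. Heap stays
// O(threads * fields) rather than O(documents * fields).
class ThreadLocalFieldCache {
    private static final ThreadLocal<Map<String, long[]>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    // Returns this thread's reusable buffer for the field, standing in
    // for a cached Field object.
    static long[] reusable(String fieldName) {
        return CACHE.get().computeIfAbsent(fieldName, k -> new long[1]);
    }
}
```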
Hi Adrien, do you see a path forward? Are there any PRs relevant to this patch?
I have some ideas, but I don't like them much due to the complexity they would introduce... Looking at your patch again, I'm wondering whether a simple improvement would be to stop caching on the field name. Said otherwise, have a cache of an arbitrary number of Field objects.
Hi Adrien, sorry for the delay. Any reduction in O(documents * fields) heap allocations is better than the status quo! I'm not sure I understand your point yet. Given that field objects have to be created with the name of the field (which depends on the current data), how would we create the array of field objects as the cache?
Hi Adrien, if we reuse a field based on its type but independent of its name, won't that contradict the point from your previous comment that "it is fine to reuse field instances across documents, but not within the same document"?
@jpountz @muralikpbhat One thing that can be done is to create a whitelist (maybe just a bit vector with one bit per cache) and have a document use all caches except its own. Or maybe tag cached objects with their corresponding documents in the cache itself, so that when searching the cache for an instance to reuse, you consider all Field objects except the ones from your own document. Thoughts?
Oops good catch, this doesn't work.
I was thinking of only doing it for the first value of a field.
Sorry, I don't get the idea.
One thing that I want to avoid is having to search a cache. Indexing a field doesn't use much CPU. For instance, when Lucene processes a LongPoint, it mostly appends the value to the end of a buffer. Saving an object allocation by introducing a hash lookup adds complexity and doesn't sound like a net win performance-wise to me. One approach I had in mind was to replace the shared parsing machinery with per-thread parsers that hold their own reusable Field instances, so that no lookup is needed.
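One lookup-free shape, read together with the later comments about per-thread parsers, could be a parser that resolves its reusable field holders once at construction. This is a hypothetical sketch (all names invented), and it assumes the mapped fields are known up front — an assumption the rest of the thread questions:

```java
// Stand-in for a long-lived Lucene Field; setLongValue mirrors Lucene's
// recommended pattern of resetting a value instead of allocating.
class LongFieldHolder {
    final String name;
    long value;
    LongFieldHolder(String name) { this.name = name; }
    void setLongValue(long v) { this.value = v; }
}

class PerThreadParser {
    // Resolved once per thread: the per-document hot path mutates this
    // holder directly, with no allocation and no hash lookup.
    private final LongFieldHolder fare = new LongFieldHolder("fare_amount");

    LongFieldHolder parseFare(long v) {
        fare.setLongValue(v);  // reuse: nothing allocated in the hot loop
        return fare;
    }
}
```

The tradeoff is that the set of fields must be fixed at parser construction, whereas a map-based cache adapts to dynamic mappings.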
A couple more points:
I believe @jpountz's idea of creating a field-name-independent cache should be a reasonable way of achieving a sizable improvement without too intrusive a change.
@jpountz we have tried the ThreadLocal idea, propagating it as far up the call stack as possible. It's been a while, so I don't remember the implementation complexity offhand, but we incurred a performance hit (a ~10-20% drop in bulk throughput, IIRC) due to ThreadLocal, especially under multi-socket NUMA contention.
Pinging @elastic/es-core-infra
@jpountz Isn't what you suggested already done by the current PR (it can be refactored to look like your suggestion)? The object you are referring to is FieldObjectCache, which is a per-thread object that maintains field-level caches. I did not get the part about no hash lookups being needed if we maintain per-thread parsers. How would the cache state be accessed for any field in the same thread? We need a lookup anyway, unless we know upfront that the mappings are fixed. Field objects are created by name, which is an immutable field in the Field object; all you can change are the values. Am I missing anything here?
This is a proposal to reduce heap allocations in Elasticsearch through object reuse. Elasticsearch allocates a number of per-document heap objects, such as Field (and derived Lucene) objects for metadata and data fields during indexing. We implement object reuse for Field and ParseContext objects across documents during bulk indexing. The changes improve the ES heap allocation rate by 30%, heap garbage by 30%, and the promotion rate by 25% during indexing, while keeping the indexing rate unchanged. Max GC pause time drops 98%, from 13s to 0.3s, and API tail latencies drop significantly: by about 60% at the 100th and by 50% at the 99.9th percentile. All benchmarks were done using Rally's nyc_taxis dataset against an i3.16xl single-node cluster with a 128GB heap and the parallel GC (we see similar improvements with CMS; the code does not degrade throughput on small instance types).
The patch URL: https://gist.github.com/aesgithub/cc5b54fc3cf5a3a13f1f5ad3139dfd00
The patch applies on top of commit hash 7376c35 (dated Jun 1, 2018). It uses multiple maps to cache Field objects. We did not implement cache eviction policies for the prototype (indexing, however, is fully functional). The caching is also optimized for NUMA instances (i.e., caches are local to a thread), since we found that on multi-socket instance types NUMA contention limits indexing rates.
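The "multiple maps" layout described above — the thread calls the per-thread object FieldObjectCache — might be sketched roughly as follows. Member names and value types are placeholders, not from the patch:

```java
import java.util.HashMap;
import java.util.Map;

// One map per field type, keyed by field name, owned by a single indexing
// thread. There is no eviction, so each map is bounded by the number of
// distinct field names that thread indexes.
class FieldObjectCache {
    final Map<String, long[]> numericFields = new HashMap<>();
    final Map<String, byte[][]> keywordFields = new HashMap<>();

    long[] numeric(String name) {
        return numericFields.computeIfAbsent(name, k -> new long[1]);
    }

    byte[][] keyword(String name) {
        return keywordFields.computeIfAbsent(name, k -> new byte[1][]);
    }
}
```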
We want to start this as a discussion on improving some of the heap allocations in Elasticsearch. Making this patch production-ready will take some investment and we want to ensure that it is done after we get feedback.