facets OutOfMemoryError #1531
Running a term facet on a large string field runs into an OutOfMemoryError (even with -Xmx50g -Xms50g).
It would be great if the field cache could spill to disk, or if a facet query loaded only the terms matched by the query into the cache rather than all the terms of the field.
Opened this issue as discussed on IRC.
@kimchy Faceting on a field with millions of distinct terms and finding only the top-N is of great interest to me. I have done some preliminary research on how this can be done within fixed memory constraints; basically I've spent several hours reading research papers.
The best solution I've found is the Count-Min Sketch (http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf; there is also a blog post for non-academic readers at http://lkozma.net/blog/sketching-data-structures/, and http://www.ece.uc.edu/~mazlack/dbm.w2010/Charikar.02.pdf is a pretty good paper explaining a similar approach).
These algorithms give approximate counts (with parametrizable error bounds), but that is something most users can live with IMO.
Despite the academic jargon, proofs, and equations everywhere, this is actually pretty straightforward to implement. Any chance that this will land in ES?
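To make it concrete, here is a minimal, self-contained Java sketch of the idea (class name, parameters, and hash scheme are mine, purely illustrative; a production version would need a pairwise-independent hash family and, for top-N facets, a small heap of candidate terms alongside the sketch):

```java
import java.util.Random;

/**
 * Toy Count-Min Sketch: approximate term counts in fixed memory.
 * depth controls the failure probability, width the error magnitude.
 */
public class CountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] counts; // depth rows of width counters
    private final int[] seeds;     // one hash seed per row

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42);
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextInt();
    }

    // Toy per-row hash; a real implementation would use something
    // like murmur with per-row seeds.
    private int bucket(String term, int row) {
        int h = term.hashCode() ^ seeds[row];
        h ^= h >>> 16;                   // spread the bits a little
        return (h & 0x7fffffff) % width; // non-negative index
    }

    /** Record one occurrence of a term: bump one counter per row. */
    public void add(String term) {
        for (int row = 0; row < depth; row++)
            counts[row][bucket(term, row)]++;
    }

    /** Estimate: min over rows; never under-counts, may over-count on collisions. */
    public long estimate(String term) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++)
            min = Math.min(min, counts[row][bucket(term, row)]);
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1 << 16);
        for (int i = 0; i < 1000; i++) cms.add("common-term");
        cms.add("rare-term");
        System.out.println(cms.estimate("common-term")); // ~1000
        System.out.println(cms.estimate("rare-term"));   // ~1
    }
}
```

The point is that memory is fixed at depth × width counters no matter how many distinct terms the field has, which is exactly the property missing from the current field cache.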
This has been biting us as well. Tracked it down to
Would you accept patches that use a different storage scheme, or have you already planned out how you will address it in 0.20?
@drewr, others: in case it's useful, I solved a similar problem (having lists of terms per document, where the list could be huge for a small fraction of documents) by moving those pathological documents into a separate index.
E.g. I reduced my memory usage from ~14GB down to ~2GB by moving the "worst" ~0.1% of documents into the new index (some more details are available in issue #1683 referenced by Shay above, and in the original ES forum thread that spawned the issue).
Note that if you are only faceting/searching on those fields, then making the objects nested will also solve the memory problem (at the expense of some performance, though it wasn't noticeable for me). I couldn't do this for one of my fields (geo-tags) because I needed it in custom scripts, hence the secondary workaround.
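For reference, making a field nested is just a mapping change, something like this (type and field names are made up, syntax from memory of the mapping docs, so double-check against your version):

```js
{
  "mappings": {
    "my_type": {
      "properties": {
        "geo_tags": {
          "type": "nested",
          "properties": {
            "name":  { "type": "string" },
            "count": { "type": "integer" }
          }
        }
      }
    }
  }
}
```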
Also, despite Shay's warnings, I have had success using the "soft" field cache (in my case people sometimes facet on big fields once or twice and then never use them again; obviously if you have a single facet, or a set of common facets, that exceeds memory then this is a terrible idea!)
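If anyone wants to try that, the setting I used is the field cache type in elasticsearch.yml (from memory, so double-check the docs for your version):

```yaml
# make field cache entries softly referenced so the JVM can evict
# them under memory pressure instead of throwing OutOfMemoryError
index.cache.field.type: soft
```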
@Shay: my suggestion in #1683 was hybrid storage, where most documents are stored as at present, but "large" documents (with a user-tunable threshold?) are stored "the other way round" (i.e. effectively an internal version of what I did with separate indexes).
(FWIW I think it's possible that "good" containers could let you do the whole thing "the other way round", but every time I sketched out some thoughts I ended up needing to write it in C++!)
@Alex-Ikanow yea, agreed, other modes of storage/usage would be great. The first thing I want to do is create the relevant abstraction (or improve the current one) in Elasticsearch, and make it possible to configure different models in the mappings (or have them dynamically detected and used). This will allow for the better pluggability that is missing today, and hopefully will help foster other implementations.
Closing this one for now. 0.90 brought great improvements with regard to high-cardinality fields and memory consumption when faceting on them. In addition, there are ideas around the different modes of storage/usage Shay mentioned above; I guess we should create separate issues for those rather than reusing this one.