facets OutOfMemoryError #1531

Closed
locojay opened this Issue Dec 9, 2011 · 11 comments

locojay commented Dec 9, 2011

Running a terms facet on a large string field runs into an OutOfMemoryError (-Xmx50g -Xms50g).

It would be great if the field cache could spill to disk, or if a facet query did not load all the terms of a field into the cache but only the terms matching the query. The second option would allow running multiple facet term queries in different forms and aggregating the results at the end.

Opened issue as discussed in irc.
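
For reference, this is the kind of request that triggers the full field-data load: a minimal 0.x-era terms facet. The index and field names here are hypothetical; the point is that faceting on a high-cardinality string field loads every term into the cache.

```json
{
  "query": { "match_all": {} },
  "facets": {
    "top_terms": {
      "terms": { "field": "body", "size": 10 }
    }
  }
}
```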

karussell commented Dec 14, 2011

Link to the discussion?

gustavobmaia commented Jan 7, 2012

I have the same problem.

I think it builds a cache of the fields you want to facet on, so if you have many documents you will need a lot of memory. Imagine I have a billion documents; I'll never have enough memory to compute facets.

kimchy commented Jan 8, 2012

It depends on the number of terms you have for that field. The terms facet is not aimed at faceting on a "body of text"; it's more aimed at faceting on categories, tags, and the like.

gustavobmaia commented Jan 8, 2012

Today I have a million documents and 4 million tags; when I facet across all documents, I hit an OutOfMemoryError. I started the JVM with 4GB.
Is it normal that I have OutOfMemory problems here?

kimchy commented Jan 8, 2012

@gustavobmaia it depends on what you facet on. All the values need to be loaded into memory for fast access.

jsuchal commented Feb 3, 2012

@kimchy Having a facet on a field with millions of entries and finding only the top-N is of great interest to me. I have done some preliminary research into how this can be done under fixed memory constraints. Basically, I've spent several hours reading research papers.

The best solution I've found is the Count-Min Sketch (http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf); there is also a blog post for non-academic readers (http://lkozma.net/blog/sketching-data-structures/), and this is a pretty good paper explaining a similar approach: http://www.ece.uc.edu/~mazlack/dbm.w2010/Charikar.02.pdf

These algorithms give approximate counts (with parametrizable error bounds), but that is something most users can live with, IMO.

Despite the academic jargon, proofs, and equations everywhere, this is actually pretty straightforward to implement. Any chance this will land in ES?
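
For illustration, here is a minimal Count-Min Sketch along the lines of the papers above. This is a sketch under my own naming and parameter choices, not ES code or the papers' reference implementation: it keeps depth x width counters regardless of how many distinct terms it sees, and its estimates never undercount the true frequency.

```java
import java.util.Random;

// Fixed-memory approximate frequency counter (Count-Min Sketch).
// Estimates are exact or overestimates; error shrinks as width/depth grow.
class CountMinSketch {
    private final int width;
    private final int depth;
    private final long[][] table;
    private final int[] seeds;

    CountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random r = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
    }

    // One hash function per row, derived from the item's hashCode and a row seed.
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    void add(String item) {
        for (int i = 0; i < depth; i++) table[i][bucket(item, i)]++;
    }

    // Minimum over the rows: never below the true count, possibly above it.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, table[i][bucket(item, i)]);
        }
        return min;
    }
}
```

A top-N facet could then keep a small heap of candidate terms alongside the sketch, instead of one counter per unique term in the field.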

drewr commented Feb 22, 2012

This has been biting us as well. I tracked it down to the ordinals List in FieldDataLoader.load(). With few segments and a query on a field with a large set of unique terms, it can grow to a huge size even with a modest data size. The fewer the segments, the larger maxDocs() becomes (it seems to max out at 1k in our setup), resulting in a duplication of that 1k*sizeof(int) footprint as it adds a row to ordinals for every doc where a single term is present.

Would you accept patches that use a different storage scheme, or have you already planned how you will address this in 0.20?
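
As a rough back-of-envelope for why this blows up (my illustrative numbers, not measurements from FieldDataLoader): an int per document per ordinal row multiplies out quickly.

```java
// Hypothetical arithmetic only: one int per doc per ordinal row.
// The 1k rows figure echoes the worst case described in the comment above.
public class OrdinalsFootprint {
    public static void main(String[] args) {
        long docs = 1_000_000L;      // documents in a large merged segment
        long ordinalRows = 1_000L;   // rows in the ordinals structure (worst case)
        long bytesPerEntry = 4L;     // sizeof(int)
        long totalBytes = docs * ordinalRows * bytesPerEntry;
        System.out.printf("~%d MB of heap just for ordinals%n", totalBytes >> 20);
    }
}
```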

kimchy commented Feb 22, 2012

Which patch were you thinking about? I just commented on another issue, #1683, where I explained why the structure is the way it is.

Alex-Ikanow commented Feb 22, 2012

@drewr, others: in case it's useful: I solved a similar problem (related to having lists of terms, where the list could be "large" but was "small" on average) by keeping a separate index for documents with many list elements.

E.g., I reduced my memory usage from ~14GB down to ~2GB by moving the "worst" ~0.1% of documents into the new index (some more details are available in issue #1683 ref'd by Shay above, and in the original thread in the ES forum that spawned that issue).

Note that if you are only faceting/searching on fields, then making the objects nested will also solve the memory problem (at the expense of some performance, though it wasn't noticeable for me). I couldn't do this for one of my fields (geo-tags) because I needed it for custom scripts, hence the secondary workaround.

Also, despite Shay's warnings, I have had success using the "soft" cache (in my case people sometimes create facets on big fields once or twice and then don't use them again; obviously if you have a single facet, or a set of common facets, that exceeds memory, then this is a terrible idea!).

@Shay: my suggestion in #1683 was to have hybrid storage, where most documents are stored as at present, but "large" (user-tunable parameter?) documents are stored "the other way round" (i.e. effectively an internal version of what I did with separate indexes).

(FWIW I think it's possible that "good" containers could let you do the whole thing "the other way round", but every time I sketched down some thoughts I ended up needing to write it in C++!)
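
For anyone trying the nested-objects workaround mentioned above, a minimal 0.x-style mapping might look like this (type and field names are hypothetical; the relevant part is `"type": "nested"` on the multi-valued field):

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "tags": {
          "type": "nested",
          "properties": {
            "name": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
```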

kimchy commented Feb 22, 2012

@Alex-Ikanow yeah, agreed, other modes of storage/usage would be great. The first thing I want to do is create the relevant abstraction (or improve on the current one) in elasticsearch, and make it possible to configure different models in the mappings (or have them dynamically detected and used). This will allow for better pluggability, which is missing today, and hopefully will help foster other implementations.

spinscale commented Jun 26, 2013

Closing this one for now. 0.90 brought great improvements with regard to high-cardinality fields and memory consumption when faceting on them. In addition, there are ideas about the different modes of storage/usage Shay mentioned above. I think we should create separate issues for those rather than reuse this one.

@spinscale spinscale closed this Jun 26, 2013
