Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions #10418
Today we check every regular expression eagerly against every possible term. This can be very slow if you have lots of unique terms, and can even be the bottleneck if your query is selective.
This commit switches to Lucene regular expressions instead of Java (not exactly the same syntax, yet most existing regular expressions should keep working) and uses the same logic as RegExpQuery to intersect the regular expression with the terms dictionary.
I added some comments. In general I am a little confused as to what is going on in all cases.
Doing this kind of filtering seems costly, since it will intersect against the entire terms enum, potentially dereferencing many global ordinals to bytes to do the matching, in order to finally get the bitset.
Can we use a more efficient bitset in some cases? We will be populating it in ordinal order...
Do we cache these bitsets anywhere in case the same filtering is repeated over and over?
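For context, here is a rough sketch of the eager per-term matching being described; it is illustrative only (the class and method names and the ordinal counter are assumptions, not the code under review):

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LongBitSet;

final class EagerIncludeExcludeSketch {
    /** Walk every term of the field and run a Java regex against each one. */
    static LongBitSet eagerAccepted(Terms terms, String javaRegex, long valueCount) throws IOException {
        Pattern pattern = Pattern.compile(javaRegex);
        LongBitSet accepted = new LongBitSet(valueCount);
        TermsEnum all = terms.iterator();
        long ord = 0; // assumes enumeration order matches ordinal order
        for (BytesRef term = all.next(); term != null; term = all.next()) {
            // every ordinal is dereferenced to bytes just to decide whether to keep it
            if (pattern.matcher(term.utf8ToString()).matches()) {
                accepted.set(ord);
            }
            ord++;
        }
        return accepted;
    }
}
```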
I thought this over more and brainstormed with @jpountz. I think we should always build one automaton based on includes, excludes, includeValues, excludeValues, whatever; this can basically be Operations.minus(includes, excludes).
Then we can always simply use terms.intersect to build the bitset instead of enumerating all terms and doing ord-to-BytesRef resolution. Making it completely optimal for this case is interesting; it's different from the techniques we would use for scoring documents, depending on the regex.
For dense cases and prefix cases (…) there might be even more specialized options.
But intersect() is probably already much better than what we do today.
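A minimal sketch of that idea, assuming Lucene's automaton APIs and a hypothetical term-to-ordinal lookup (this is not the code in this PR):

```java
import java.io.IOException;

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LongBitSet;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

final class AutomatonIncludeExcludeSketch {

    /** Build a single automaton that accepts (include MINUS exclude). */
    static CompiledAutomaton compile(String include, String exclude) {
        Automaton inc = new RegExp(include).toAutomaton();
        Automaton exc = new RegExp(exclude).toAutomaton();
        // 10_000 is an arbitrary determinization work limit for this sketch.
        return new CompiledAutomaton(Operations.minus(inc, exc, 10_000));
    }

    /**
     * Let Terms.intersect drive the enumeration so only matching terms are visited.
     * `ordLookup` stands in for whatever maps a term to its global ordinal
     * (e.g. SortedSetDocValues.lookupTerm); it is an assumption of this sketch.
     */
    static LongBitSet acceptedOrds(Terms terms, CompiledAutomaton compiled,
                                   OrdLookup ordLookup, long valueCount) throws IOException {
        LongBitSet accepted = new LongBitSet(valueCount);
        // assumes a NORMAL compiled automaton; degenerate types (match-all, single term)
        // would need a separate code path
        TermsEnum matching = terms.intersect(compiled, null);
        for (BytesRef term = matching.next(); term != null; term = matching.next()) {
            long ord = ordLookup.lookup(term);
            if (ord >= 0) {
                accepted.set(ord); // bits are set in increasing ordinal order
            }
        }
        return accepted;
    }

    interface OrdLookup {
        long lookup(BytesRef term) throws IOException;
    }
}
```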
Speed up include/exclude in terms aggregations with regexps. Today we check every regular expression eagerly against every possible term. This can be very slow if you have lots of unique terms, and can even be the bottleneck if your query is selective. This commit switches to Lucene regular expressions instead of Java (not exactly the same syntax, yet most existing regular expressions should keep working) and uses the same logic as RegExpQuery to intersect the regular expression with the terms dictionary. I wrote a quick benchmark (in the PR) to make sure it made things faster, and the same request that took 750ms on master now takes 74ms with this change. Close #7526
@jpountz I am wondering if it would be possible to support both Lucene and Java regexps.
On the one hand I greatly appreciate the performance improvements achieved by using Lucene regexps, but on the other hand I need the power and flexibility of Java regexps.
In a perfect world we would get both, speed AND flexibility, but in the meantime it would be great if the user could choose which regexp engine to use.
It is currently not possible to include/exclude terms on a case-insensitive basis.
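For illustration, this is roughly the kind of request that expressed a case-insensitive include directly on Elasticsearch 1.x via Java regex flags (the field name and pattern here are made up for this sketch, not taken from the thread):

```json
{
  "aggs": {
    "tags": {
      "terms": {
        "field": "tag",
        "include": {
          "pattern": ".*foo.*",
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}
```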
I can think of some possible workarounds, but they are not as simple and straightforward as this implementation.
I don't know how this would help me in this particular issue. I already have a multi-field mapping to preserve the original field value, so it is already possible to build the aggregation over the lowercased field. But in that case the returned field values would also be lowercase, and I am interested in the exact original value.
I know that the decision to only use Lucene's regexps was made carefully and for various reasons.
But I am not sure that goal is reached in the end if users have to implement more complex mappings or client-side logic to achieve the same result. As I said, this use case could be implemented perfectly fine with just the snippet above on Elasticsearch 1.7.