
Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions #10418

Merged

Conversation

@jpountz (Contributor) commented Apr 3, 2015

Today we check every regular expression eagerly against every possible term. This can be very slow if you have lots of unique terms, and can even become the bottleneck if your query is selective.

This commit switches to Lucene regular expressions instead of Java ones (the syntax is not exactly the same, but most existing regular expressions should keep working) and uses the same logic as RegExpQuery to intersect the regular expression with the terms dictionary. I wrote a quick benchmark (in the PR) to make sure it made things faster: the same request that took 750ms on master now takes 74ms with this change.

Close #7526
Close #9848
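
For illustration, here is a minimal sketch of that approach (not the actual patch), assuming Lucene's automaton APIs; terms stands in for the field's terms dictionary and the pattern is a placeholder:

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.RegExp;

// Compile the regular expression once, then let the terms dictionary drive
// the matching: the intersected enum only visits terms the automaton can
// accept, instead of checking the pattern eagerly against every unique term.
CompiledAutomaton compiled = new CompiledAutomaton(new RegExp("ab.*").toAutomaton());
TermsEnum matching = compiled.getTermsEnum(terms); // terms: the field's Terms
for (BytesRef term = matching.next(); term != null; term = matching.next()) {
    // every term returned here already matches the regexp
}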

@jpountz (Contributor, Author) commented Apr 3, 2015

Note to self: @rashidkpc asked me to create an issue in Kibana if this change makes it in, so that Kibana can be changed too.

@jpountz (Contributor, Author) commented Apr 3, 2015

@rmuir @mikemccand @s1monw I am not familiar with the Automaton/RegExp/CompiledAutomaton/ByteRunAutomaton APIs, so if one of you could check that I did not misuse them in this change, that would be great!

@@ -89,12 +93,20 @@ private void addReject(long val) {
      * @param excludeValues The terms to be excluded
      * (may only be {@code null} if one of the other arguments is non-null).
      */
-    public IncludeExclude(Pattern include, Pattern exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
+    public IncludeExclude(RegExp include, RegExp exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
         assert includeValues != null || include != null ||

@rmuir (Contributor) commented Apr 3, 2015

What are these includeValues/excludeValues for? Should they just build the same matcher too (maybe requiring a SortedSet)?

@rmuir (Contributor) commented Apr 3, 2015

Another option for these is to reverse-lookup their ordinals against the dv (doc values) instance and just exclude those ordinals. This should make the option much faster, as it wouldn't need to traverse the Terms at all and do contains(BytesRef).

if (include != null) {
// determinize first so that we don't determinize twice because of
// the ByteRunAutomaton and the CompiledAutomaton
final Automaton a = Operations.determinize(include.toAutomaton(), Operations.DEFAULT_MAX_DETERMINIZED_STATES);

@rmuir (Contributor) commented Apr 3, 2015

Why do we need both? Is it because there are so many things going on in this file? I don't understand why we wouldn't just need the CompiledAutomaton for the terms.intersect operation; why do we need a ByteRunAutomaton too?

Having both seems silly anyway, but if we must do it, try to assign the ByteRunAutomaton from the CompiledAutomaton. The majority of the time it will be non-null:

/** 
 * Matcher for quickly determining if a byte[] is accepted.
 * only valid for {@link AUTOMATON_TYPE#NORMAL}.
 */
public final ByteRunAutomaton runAutomaton;
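
A small sketch of that suggestion, with hypothetical automaton and term variables:

// Build the CompiledAutomaton once and reuse its matcher, rather than
// determinizing the same automaton a second time for a ByteRunAutomaton.
CompiledAutomaton compiled = new CompiledAutomaton(automaton);
ByteRunAutomaton byteMatcher = compiled.runAutomaton; // non-null for NORMAL type
if (byteMatcher != null && byteMatcher.run(term.bytes, term.offset, term.length)) {
    // the term is accepted by the regular expression
}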
@rmuir (Contributor) commented Apr 3, 2015

I added some comments. In general I am a little confused as to what is going on in all cases.

Doing this kind of filtering seems costly, since it will intersect against the entire TermsEnum, potentially dereferencing many global ordinals to byte[] to do the matching, in order to finally get the bitset.

Can we use a more efficient bitset in some cases? We will be populating it in ordinal order...

Do we cache these bitsets anywhere in case the same filtering is repeated over and over?

@rmuir (Contributor) commented Apr 3, 2015

I thought this over more and brainstormed with @jpountz. I think we should always build one 'automaton' based on includes, excludes, includeValues, excludeValues, whatever. This can basically be Operations.minus(includes, excludes).

Then, we can always simply use terms.intersect to build the bitset rather than enumerating terms and doing ord-to-BytesRef resolution. Making it completely optimal for this case is interesting; it's different from the techniques we would use for scoring documents, depending on the regex.

For dense cases, prefix cases (ab*), or similar regexps, it could be a win to be tricky. If someone does ab*, we could just do two reverse lookups (ab and ac) and fill the bitset range in between, instead of reading all the ordinals and dereferencing them to terms in ab*.

But intersect() is probably already much better than what we do today.
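
A minimal sketch of the single-automaton idea described above, assuming Lucene's Operations API; the two patterns are hypothetical placeholders:

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

// Fold include and exclude rules into one automaton: accept exactly the
// terms matched by the include pattern and not by the exclude pattern.
Automaton include = new RegExp("user_.*").toAutomaton();
Automaton exclude = new RegExp("user_internal_.*").toAutomaton();
Automaton combined = Operations.minus(include, exclude,
        Operations.DEFAULT_MAX_DETERMINIZED_STATES);

The combined automaton can then be compiled once and intersected with the terms dictionary in a single pass.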

@jpountz jpountz force-pushed the jpountz:enhancement/speed_up_aggs_include_exclude branch from a7dea5f to 3c067fd Apr 8, 2015
@jpountz (Contributor, Author) commented Apr 8, 2015

I pushed a new commit to fold the exclusions into the automaton. It is still far from great (for instance, we should reuse what we build when several aggregators use the same include/exclude rules), but better than what we have today?

-    private Set<BytesRef> includeValues;
-    private Set<BytesRef> excludeValues;
-    private final boolean hasRegexTest;
+    public static class StringFilter {

@rmuir (Contributor) commented Apr 8, 2015

When is this used? Only in the non-global ordinals case?

@jpountz (Author, Contributor) commented Apr 8, 2015

Yes, only for scripts.

LongBitSet acceptedGlobalOrdinals = new LongBitSet(globalOrdinals.getValueCount());
TermsEnum globalTermsEnum;
Terms globalTerms = new DocValuesTerms(globalOrdinals);
globalTermsEnum = compiled.getTermsEnum(globalTerms);

@rmuir (Contributor) commented Apr 8, 2015

Can you add a TODO here? Something like:

// TODO: specialize based on compiled.type: for ALL and prefixes (sinkState >= 0) we can avoid i/o and just set bits.

I can look into it later.
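
For context, a hypothetical sketch of the specialization that TODO describes, building on the snippet above (globalOrdinals, acceptedGlobalOrdinals) and assuming the prefix pattern ab*:

import org.apache.lucene.util.BytesRef;

// For a prefix like ab*, two reverse lookups bound the matching ordinal
// range, and the bits can be set in bulk with no per-term i/o.
long start = globalOrdinals.lookupTerm(new BytesRef("ab"));
if (start < 0) {
    start = -start - 1; // not present: first ordinal >= "ab"
}
long end = globalOrdinals.lookupTerm(new BytesRef("ac"));
if (end < 0) {
    end = -end - 1; // first ordinal >= "ac": the exclusive upper bound
}
if (start < end) {
    acceptedGlobalOrdinals.set(start, end); // LongBitSet.set(from, to) is [from, to)
}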

@rmuir (Contributor) commented Apr 8, 2015

Looks great to me. I added comments, but I'm only asking for code comments.

@jpountz jpountz force-pushed the jpountz:enhancement/speed_up_aggs_include_exclude branch from 3c067fd to aecd9ac Apr 9, 2015
jpountz added a commit that referenced this pull request Apr 9, 2015
…ude_exclude

Aggregations: Speed up include/exclude in terms aggregations with regexps.

Close #10418
@jpountz jpountz merged commit e25db22 into elastic:master Apr 9, 2015
1 check passed: CLA (Commit author is a member of Elasticsearch)
@jpountz jpountz deleted the jpountz:enhancement/speed_up_aggs_include_exclude branch Apr 9, 2015
@clintongormley clintongormley changed the title Aggregations: Speed up include/exclude in terms aggregations with regexps. Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions Jun 6, 2015
@clintongormley clintongormley removed the review label Aug 7, 2015
@dawi commented Dec 24, 2015

@jpountz I am wondering if it would be possible to support both Lucene and Java regexps.

On the one hand, I greatly appreciate the performance improvements achieved by using Lucene regexps, but on the other hand, I need the power and flexibility of Java regexps.

In a perfect world we would get both speed AND flexibility, but in the meantime it would be great if the user could choose which regexp engine to use.

Background information:

It is currently not possible to include/exclude terms on a case-insensitive basis.

I can think of some possible workarounds, but they are not as simple and straightforward as this implementation:

.addAggregation(
    terms("fieldAgg")
        .field(field + ".original")
        .include(regex, Pattern.CASE_INSENSITIVE) // supported by elasticsearch 1.7
        .size(maxResults)
        .order(Terms.Order.count(false))
);
@rmuir (Contributor) commented Dec 24, 2015

Just index your content correctly. If you want case-insensitive matching, that means indexing with a lowercase filter. It's a search engine!

@dawi commented Dec 24, 2015

I don't know how that would help in this particular case. I already have a multi-field mapping to preserve the original field value, so it is already possible to build the aggregation over the lowercased field. But then the returned field values would also be lowercase, and I am interested in the exact original value.

I know that the decision to only use Lucene's regexps was carefully made for various reasons.

But I am not sure that goal is reached in the end if users have to implement more complex mappings or client-side logic to achieve the same result. As I said, this use case could be implemented perfectly well with just the snippet above using Elasticsearch 1.7.
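
One possible middle ground until then (a hypothetical sketch, assuming the terms builder accepts a Lucene regexp string): emulate case-insensitivity for a fixed keyword inside the Lucene regexp syntax itself.

.addAggregation(
    terms("fieldAgg")
        .field(field + ".original")
        // hypothetical pattern: enumerates case variants via character
        // classes; verbose, but stays on the fast terms-dictionary path
        .include("[eE][rR][rR][oO][rR].*")
        .size(maxResults)
        .order(Terms.Order.count(false))
);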
