Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggs enhancement - allow Include/Exclude clauses to use array of terms #7529

Closed
wants to merge 1 commit into from
Closed

Conversation

markharwood
Copy link
Contributor

This is provided as an alternative means to filtering with a regex string.
Using this approach clients no longer need to escape terms provided in the query DSL.

Closes #6782

}
--------------------------------------------------


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to also have an example for exclude?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Exclude French cars? :P

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be useful to have an example showing how this supports numbers, e.g.,
"exclude": [1, 22, 33] since this is something you can't do with the related regex feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exclude French cars?

Haha :)

@nezda This feature makes sense, but it is not implemeted by this pull request yet, which only works on string fields.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markharwood can you add support for numbers too? the string functionality can be done with regexes now, but numeric field exclusions cannot

@jpountz
Copy link
Contributor

jpountz commented Sep 1, 2014

The change looks good to me overall. I think you just missed to add this new API to TermsBuilder? Let's also add basic tests for this functionality?

@jpountz jpountz removed the review label Sep 1, 2014
@markharwood
Copy link
Contributor Author

Thanks for the review. Will make suggested changes

@markharwood
Copy link
Contributor Author

@jpountz Any chance of another review on this? I've got a big performance improvement blocked on this feature (150ms vs 74 seconds for a query)

@markharwood
Copy link
Contributor Author

Bundled the optimisation for non-regex terms referenced in #7526 as part of this PR

// 3) Traversing the term enum for all terms and running past regexes
// Option 3 is known to be very slow in the case of high-cardinality fields and
// should be avoided if possible.
if (includeValues != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if both includeValues and excludeValues are specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if both includeValues and excludeValues are specified?

The excludeValues take precedence over the include. It's a weird scenario because the user is saying "I need this but also don't want this". An example scenario is a user might throw us a bag of top words to analyse but might use a small "exclude" list of stopwords. Arguably they could remove them from the include list but I'm choosing not to penalise users for non-sensical queries here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I was missing that piece of reasoning. This global ord lookup logic looks good!

@markharwood
Copy link
Contributor Author

Rebased - good to go on this? Fixes a nasty performance issue.

return false;

if(hasRegexTest){
//We need to perform UTF8 to UTF16 conversion for use in the regex matching
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a space after the if and the //?

@jpountz
Copy link
Contributor

jpountz commented Sep 12, 2014

LGTM

@jpountz
Copy link
Contributor

jpountz commented Sep 12, 2014

In the future I'm wondering that it might be cleaner to switch this to Lucene's logic to intersect terms enums with automatons (there is already existing logic to translate a regexp to an automaton).

@markharwood
Copy link
Contributor Author

Thanks for the review, Adrien! Will patch and push to master and 1.x.

I'm wondering that it might be cleaner to switch this to Lucene's logic to intersect terms enums with automatons.

Agreed there is scope for future work here. IncludeExclude feels weird having both regex and straight-match logic in a single class right now but introducing pluggable term filtering impls feels like a bigger change. There's still a big cost in using regexes on high-cardinality fields (it iterates ALL terms) and this is not caught by existing timeout logic (but is by #4586 which is also something we need to decide on).

@jpountz
Copy link
Contributor

jpountz commented Sep 12, 2014

There's still a big cost in using regexes on high-cardinality fields (it iterates ALL terms)

Agreed, that is where I hope Lucene's automaton-based intersection would help since it would allow to not evaluate the regexp on every term (it takes advantage of the structure of the automaton and of the fact that the terms are sorted to rapidly skip over non-matching terms).

@markharwood
Copy link
Contributor Author

Committed as 3c8f8cc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggs: filtering values using array of values
4 participants