Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add intervals query #36135

Merged
merged 37 commits into from
Dec 14, 2018
Merged

Add intervals query #36135

merged 37 commits into from
Dec 14, 2018

Conversation

romseygeek
Copy link
Contributor

@romseygeek romseygeek commented Dec 1, 2018

This commit exposes the lucene intervals query in elasticsearch, as a replacement for the
Span query family. Instead of building query structures up from individual terms, the intervals query
uses a tree of sources, the leaves of which are formed of match-type queries that are passed
through analysis. These can then be combined using or, combine or relate sources to test
for proximity or position.

Replaces #32406.

Closes #29636

@romseygeek romseygeek added >feature :Search/Search Search-related issues that do not fall into other categories v7.0.0 labels Dec 1, 2018
@romseygeek romseygeek self-assigned this Dec 1, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good overall @romseygeek , I left some comments.

@@ -0,0 +1,58 @@
setup:
- skip:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this working ? I was not aware that the skip section can be set in the setup part but if it's working... ;)

if (posAtt.getPositionIncrement() == 1) {
if (synonyms.size() == 1) {
terms.add(synonyms.get(0));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add an else to desambiguate ?

import java.util.Objects;

import static org.elasticsearch.common.xcontent.ConstructingObjectParser.constructorArg;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadocs to explain the kind of queries that this builder handles ?


import static org.elasticsearch.common.xcontent.ConstructingObjectParser.constructorArg;
import static org.elasticsearch.common.xcontent.ConstructingObjectParser.optionalConstructorArg;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this is the user API so I'd expect some explanations regarding the different options ?

@romseygeek
Copy link
Contributor Author

I've added some better YAML tests and some basic documentation in the query DSL reference. I'm still a bit unsure of the DSL interface and some of the terms I'm using (eg source for something that generates intervals) but I think it's a good start.

cc @clintongormley

rest Outdated

./gradlew :distribution:archives:integ-test-zip:integTest \
-Dtests.class="org.elasticsearch.test.rest.*Yaml*IT" \
-Dtests.method="test {p0=$1}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what is the role of this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes running rest tests easier, but I didn't mean to commit it :) Will remove

@romseygeek
Copy link
Contributor Author

After conferring with @clintongormley I've reworked the API somewhat. My one reservation is the name of the query, which is still intervals, which I think is still too based on the internal implementation of these queries. Something like proximity_match might be better? It links it to the other match_* queries. I'd also like to rename the internal intervals field names to something like rules.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very clean overall. I'm curious why you decided to support a filter element in every intervals source instead of eg. having a filter intervals source?

Regarding the name, I don't dislike "intervals" as this is their name in the paper that introduced them. If we give them a more general name, I'm afraid that users would be even more surprised eg. by the behavior of any_of as it only returns minimal intervals?

favourite food is porridge` would not match, because the interval matching
`cold porridge` starts before the interval matching `my favourite food`.

[[intervals-match]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge +1 to exposing a match and no term

@Override
public IntervalsSource intervals(String text, int maxGaps, boolean ordered, NamedAnalyzer analyzer) throws IOException {
if (indexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) < 0) {
throw new IllegalArgumentException("Cannot create source against field [" + name() + "] with no positions indexed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe say "intervals source" instead of just source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

/**
* Constructs an IntervalsSource based on analyzed text
*/
public class IntervalBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it pkg-private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called from TextFieldMapper which is in a different package, so unfortunately we can't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move it to the same package to keep visibility to a minimum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also called from IntervalsSourceProvider, which doesn't really fit in the mapper package. I think it has to stay public...

default:
provider = IntervalsSourceProvider.fromXContent(parser);

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that this won't fail if multiple providers are provided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated

MappedFieldType fieldType = context.fieldMapper(field);
if (fieldType == null) {
throw new IllegalArgumentException("Cannot create IntervalQuery over non-existent field [" + field + "]");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general we are lenient with unmapped fields because of cross-index search, should we be lenient here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated

The `or` rule will match any of its nested sub-rules.

[horizontal]
`intervals`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to add documentation here about the fact that this only returns minimal intervals and consequences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a section on minimization at the end of the doc, with some examples of queries that can produce surprising results, and how to deal with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome

@romseygeek
Copy link
Contributor Author

I've pushed some changes addressing your comments @jpountz. For the filtering, it seemed to make more sense to add it as an option on to each rule, because the intervals produced by a filter all belong to the rule being filtered, and Clint and I thought that this was the best way of making that obvious.

@romseygeek
Copy link
Contributor Author

@elasticmachine retest this please

1 similar comment
@romseygeek
Copy link
Contributor Author

@elasticmachine retest this please

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @romseygeek. I thought having some sort of filtering intervals would be more consistent with how we deal with queries but I don't feel too strongly about it either.

LGTM

/**
* Constructs an IntervalsSource based on analyzed text
*/
public class IntervalBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move it to the same package to keep visibility to a minimum?

The `or` rule will match any of its nested sub-rules.

[horizontal]
`intervals`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome

@romseygeek romseygeek merged commit 09bf93d into elastic:master Dec 14, 2018
@romseygeek romseygeek deleted the interval-query branch December 14, 2018 15:14
@clintongormley
Copy link

Thanks @romseygeek. I thought having some sort of filtering intervals would be more consistent with how we deal with queries but I don't feel too strongly about it either.

@jpountz I didn't have strong reasons for adding filtering the way it is here. Happy to discuss if you think it makes sense doing it in a different way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants