Add intervals query #36135

Merged
merged 37 commits into from
Dec 14, 2018
Commits
6b7d175
Add IntervalQueryBuilder with support for match and combine intervals
romseygeek Jul 26, 2018
7d9b9ef
Add relative intervals
romseygeek Jul 26, 2018
1197fdf
Merge branch 'master' into interval-query
romseygeek Aug 30, 2018
b0439c3
feedback
romseygeek Aug 31, 2018
6cb7fe8
YAML test - broekn
romseygeek Sep 5, 2018
df4d329
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Sep 30, 2018
b0d28aa
yaml test; begin to add block source
romseygeek Oct 1, 2018
a8806e2
Add block; make disjunction its own source
romseygeek Oct 2, 2018
a4cecc9
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Nov 7, 2018
8489e86
WIP
romseygeek Nov 12, 2018
c8212f1
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Nov 30, 2018
2a2244d
Extract IntervalBuilder and add tests for it
romseygeek Dec 1, 2018
6e5339d
Fix eq/hashcode in Disjunction
romseygeek Dec 1, 2018
52bcf1f
New yaml test
romseygeek Dec 1, 2018
872f913
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Dec 1, 2018
6f2c73c
checkstyle
romseygeek Dec 2, 2018
f044495
license headers
romseygeek Dec 2, 2018
1377bcc
test fix
romseygeek Dec 2, 2018
0368133
YAML format
romseygeek Dec 2, 2018
9c2f035
YAML formatting again
romseygeek Dec 2, 2018
7cde116
yaml tests; javadoc
romseygeek Dec 3, 2018
dabdd77
Add OR test -> requires fix from LUCENE-8586
romseygeek Dec 3, 2018
ba979e5
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Dec 5, 2018
122f192
Add docs
romseygeek Dec 5, 2018
22f99b4
Re-do API
romseygeek Dec 11, 2018
6de587d
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Dec 11, 2018
3146c47
Clint's API
romseygeek Dec 11, 2018
3bf1b0d
Delete bash script
romseygeek Dec 11, 2018
2d2df63
doc fixes
romseygeek Dec 11, 2018
67bc11a
imports
romseygeek Dec 11, 2018
abf75bd
docs
romseygeek Dec 11, 2018
0b14af3
test fix
romseygeek Dec 12, 2018
45bf499
feedback
romseygeek Dec 13, 2018
6780a57
Merge remote-tracking branch 'origin/master' into interval-query
romseygeek Dec 13, 2018
a33d816
comma
romseygeek Dec 13, 2018
9834a06
docs fixes
romseygeek Dec 13, 2018
a754165
Tidy up doc references to old rule
romseygeek Dec 14, 2018
7 changes: 7 additions & 0 deletions docs/reference/query-dsl/full-text-queries.asciidoc
@@ -40,6 +40,11 @@ The queries in this group are:
A simpler, more robust version of the `query_string` syntax suitable
for exposing directly to users.

<<query-dsl-intervals-query,`intervals` query>>::

A full text query that allows fine-grained control of the ordering and
proximity of matching terms.

include::match-query.asciidoc[]

include::match-phrase-query.asciidoc[]
@@ -53,3 +58,5 @@ include::common-terms-query.asciidoc[]
include::query-string-query.asciidoc[]

include::simple-query-string-query.asciidoc[]

include::intervals-query.asciidoc[]
260 changes: 260 additions & 0 deletions docs/reference/query-dsl/intervals-query.asciidoc
@@ -0,0 +1,260 @@
[[query-dsl-intervals-query]]
=== Intervals query

An `intervals` query allows fine-grained control over the order and proximity of
matching terms. Matching rules are constructed from a small set of definitions,
and the rules are then applied to terms from a particular `field`.

The definitions produce sequences of minimal intervals that span terms in a
body of text. These intervals can be further combined and filtered by
parent rules.

The example below searches the field `my_text` for the phrase
`my favourite food` followed by either the terms `hot` and `water` or the
terms `cold` and `porridge`, with each pair of terms in any order:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "all_of" : {
          "ordered" : true,
          "intervals" : [
            {
              "match" : {
                "query" : "my favourite food",
                "max_gaps" : 0,
                "ordered" : true
              }
            },
            {
              "any_of" : {
                "intervals" : [
                  { "match" : { "query" : "hot water" } },
                  { "match" : { "query" : "cold porridge" } }
                ]
              }
            }
          ]
        },
        "boost" : 2.0,
        "_name" : "favourite_food"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

In the above example, the text `my favourite food is cold porridge` would
match because the two intervals matching `my favourite food` and `cold
porridge` appear in the correct order, but the text `when it's cold my
favourite food is porridge` would not match, because the interval matching
`cold porridge` starts before the interval matching `my favourite food`.

[[intervals-match]]
Review comment (Contributor): Huge +1 to exposing a match and no term

==== `match`

The `match` rule matches analyzed text, and takes the following parameters:

[horizontal]
`query`::
The text to match.
`max_gaps`::
Specify a maximum number of gaps between the terms in the text. Terms that
appear further apart than this will not match. If unspecified, or set to -1,
then there is no width restriction on the match. If set to 0 then the terms
must appear next to each other.
`ordered`::
Whether or not the terms must appear in their specified order. Defaults to
`false`.
`analyzer`::
Which analyzer should be used to analyze terms in the `query`. By
default, the search analyzer of the top-level field will be used.
`filter`::
An optional <<interval_filter,interval filter>>
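
As a rough sketch of how these parameters combine, the request below looks for
the terms `hot` and `porridge`, in that order, with at most two positions
between them. The field name `my_text` is reused from the earlier example, and
the gap size and the explicit `standard` analyzer are illustrative choices
only:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "match" : {
          "query" : "hot porridge",
          "max_gaps" : 2,
          "ordered" : true,
          "analyzer" : "standard"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

With these settings, a text such as `porridge served hot` would not be expected
to match, because `ordered` requires `hot` to appear before `porridge`.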

[[intervals-all_of]]
==== `all_of`

`all_of` returns matches that span a combination of other rules.

[horizontal]
`intervals`::
An array of rules to combine. All rules must produce a match in a
document for the overall rule to match.
`max_gaps`::
Specify a maximum number of gaps between the rules. Combinations that match
across a distance greater than this will not match. If set to -1 or
unspecified, there is no restriction on this distance. If set to 0, then the
matches produced by the rules must all appear immediately next to each other.
`ordered`::
Whether the intervals produced by the rules should appear in the order in
which they are specified. Defaults to `false`.
`filter`::
An optional <<interval_filter,interval filter>>
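
For illustration, a minimal `all_of` sketch that uses `max_gaps` at the
combining level might look like the following. It asks for the exact phrases
`hot water` and `cold porridge` in either order (`ordered` is left at its
default), with no more than five positions between the two phrases. The field
name and the gap value are assumptions made for the example:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "all_of" : {
          "max_gaps" : 5,
          "intervals" : [
            { "match" : {
                "query" : "hot water",
                "max_gaps" : 0,
                "ordered" : true } },
            { "match" : {
                "query" : "cold porridge",
                "max_gaps" : 0,
                "ordered" : true } }
          ]
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Note that the inner `max_gaps` values constrain each phrase, while the outer
`max_gaps` constrains the distance between the intervals the two rules produce.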

[[intervals-any_of]]
==== `any_of`

The `any_of` rule emits intervals produced by any of its sub-rules.

[horizontal]
`intervals`::
Review comment (Contributor): Maybe we need to add documentation here about the fact that this only returns minimal intervals, and the consequences?

Reply (Contributor Author): I've added a section on minimization at the end of the doc, with some examples of queries that can produce surprising results, and how to deal with that.

Reply (Contributor): awesome

An array of rules to match
`filter`::
An optional <<interval_filter,interval filter>>
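
A small sketch of `any_of` used as the top-level rule, together with its
optional `filter`, is shown below. It accepts either `hot porridge` or
`cold porridge` (terms in any order, at most three positions apart), and drops
any matching interval that contains the word `salty`. The field name, terms,
and gap values are illustrative assumptions:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "any_of" : {
          "intervals" : [
            { "match" : { "query" : "hot porridge", "max_gaps" : 3 } },
            { "match" : { "query" : "cold porridge", "max_gaps" : 3 } }
          ],
          "filter" : {
            "not_containing" : {
              "match" : { "query" : "salty" }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE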

[[interval_filter]]
==== filters

You can filter intervals produced by any rules by their relation to the
intervals produced by another rule. The following example will return
documents that have the words `hot` and `porridge` within 10 positions
of each other, without the word `salty` in between:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "match" : {
          "query" : "hot porridge",
          "max_gaps" : 10,
          "filter" : {
            "not_containing" : {
              "match" : {
                "query" : "salty"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

The following filters are available:
[horizontal]
`containing`::
Produces intervals that contain an interval from the filter rule
`contained_by`::
Produces intervals that are contained by an interval from the filter rule
`not_containing`::
Produces intervals that do not contain an interval from the filter rule
`not_contained_by`::
Produces intervals that are not contained by an interval from the filter rule
`not_overlapping`::
Produces intervals that do not overlap with an interval from the filter rule
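
As a sketch of one of the positive filters, the request below inverts the
`not_containing` example: it keeps only those `hot porridge` intervals that do
contain the word `salty`. The field name and values are carried over from the
example above purely for illustration:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "match" : {
          "query" : "hot porridge",
          "max_gaps" : 10,
          "filter" : {
            "containing" : {
              "match" : {
                "query" : "salty"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

The other filters take a rule in the same position, so swapping `containing`
for, say, `not_overlapping` changes only the relation being tested.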

[[interval-minimization]]
==== Minimization

The intervals query always minimizes intervals, to ensure that queries can
run in linear time. This can sometimes cause surprising results, particularly
when using `max_gaps` restrictions or filters. For example, take the
following query, searching for `salty` contained within the phrase `hot
porridge`:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "match" : {
          "query" : "salty",
          "filter" : {
            "contained_by" : {
              "match" : {
                "query" : "hot porridge"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

This query will *not* match a document containing the phrase `hot porridge is
salty porridge`, because the intervals returned by the match query for `hot
porridge` only cover the initial two terms in this document, and these do not
overlap the intervals covering `salty`.
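
To make the reasoning concrete, here is a position trace for that document. It
assumes an analyzer that emits one token per word with positions starting at 0,
which is an assumption made for illustration:

....
position:   0     1          2    3       4
token:      hot   porridge   is   salty   porridge

salty                     -> [3,3]
hot porridge (unordered)  -> [0,1]   minimal
                             [0,4]   discarded: it contains [0,1]
....

Only `[0,4]` would contain `[3,3]`, and minimization has already removed it, so
the `contained_by` filter has nothing left to keep.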

Another restriction to be aware of is the case of `any_of` rules that contain
sub-rules which overlap. In particular, if one of the rules is a strict
prefix of the other, then the longer rule will never be matched, which can
cause surprises when used in combination with `max_gaps`. Consider the
following query, searching for `the` immediately followed by `big` or `big bad`,
immediately followed by `wolf`:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "all_of" : {
          "intervals" : [
            { "match" : { "query" : "the" } },
            { "any_of" : {
                "intervals" : [
                  { "match" : { "query" : "big" } },
                  { "match" : { "query" : "big bad" } }
                ] } },
            { "match" : { "query" : "wolf" } }
          ],
          "max_gaps" : 0,
          "ordered" : true
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Counter-intuitively, this query *will not* match the document `the big bad
wolf`, because the `any_of` rule in the middle will only produce intervals
for `big`. The intervals for `big bad` start at the same position as those
for `big` but are longer, so they are minimized away. In these cases,
it's better to rewrite the query so that all of the options are explicitly
laid out at the top level:

[source,js]
--------------------------------------------------
POST _search
{
  "query": {
    "intervals" : {
      "my_text" : {
        "any_of" : {
          "intervals" : [
            { "match" : {
                "query" : "the big bad wolf",
                "ordered" : true,
                "max_gaps" : 0 } },
            { "match" : {
                "query" : "the big wolf",
                "ordered" : true,
                "max_gaps" : 0 } }
          ]
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
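
A position trace of the earlier nested `any_of` query against the document
`the big bad wolf` shows where the match is lost. Positions assume one token
per word, starting at 0, which is an illustrative assumption:

....
position:   0     1     2     3
token:      the   big   bad   wolf

the                -> [0,0]
any_of: big        -> [1,1]
        big bad    -> [1,2]   discarded: it contains [1,1]
wolf               -> [3,3]

all_of, ordered, max_gaps 0:
  [0,0] [1,1] [3,3] leaves position 2 uncovered -> no match
....

In the rewritten query each alternative is a single `match` rule, so the
document yields the interval `[0,3]` for `the big bad wolf` with no gaps.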