Add intervals query #36135
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,260 @@ | ||
[[query-dsl-intervals-query]] | ||
=== Intervals query | ||
|
||
An `intervals` query allows fine-grained control over the order and proximity of | ||
matching terms. Matching rules are constructed from a small set of definitions, | ||
and the rules are then applied to terms from a particular `field`. | ||
|
||
The definitions produce sequences of minimal intervals that span terms in a | ||
body of text. These intervals can be further combined and filtered by | ||
parent sources. | ||
|
||
The example below will search for the phrase `my favourite food` appearing | ||
before the terms `hot` and `water` or `cold` and `porridge` in any order, in | ||
the field `my_text`: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST _search | ||
{ | ||
  "query": { | ||
    "intervals" : { | ||
      "my_text" : { | ||
        "all_of" : { | ||
          "ordered" : true, | ||
          "intervals" : [ | ||
            { | ||
              "match" : { | ||
                "query" : "my favourite food", | ||
                "max_gaps" : 0, | ||
                "ordered" : true | ||
              } | ||
            }, | ||
            { | ||
              "any_of" : { | ||
                "intervals" : [ | ||
                  { "match" : { "query" : "hot water" } }, | ||
                  { "match" : { "query" : "cold porridge" } } | ||
                ] | ||
              } | ||
            } | ||
          ] | ||
        }, | ||
        "boost" : 2.0, | ||
        "_name" : "favourite_food" | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
In the above example, the text `my favourite food is cold porridge` would | ||
match because the two intervals matching `my favourite food` and `cold | ||
porridge` appear in the correct order, but the text `when it's cold my | ||
favourite food is porridge` would not match, because the interval matching | ||
`cold porridge` starts before the interval matching `my favourite food`. | ||
|
||
[[intervals-match]] | ||
==== `match` | ||
|
||
The `match` rule matches analyzed text, and takes the following parameters: | ||
|
||
[horizontal] | ||
`query`:: | ||
The text to match. | ||
`max_gaps`:: | ||
Specify a maximum number of gaps between the terms in the text. Terms that | ||
appear further apart than this will not match. If unspecified, or set to -1, | ||
then there is no width restriction on the match. If set to 0 then the terms | ||
must appear next to each other. | ||
`ordered`:: | ||
Whether or not the terms must appear in their specified order. Defaults to | ||
`false`. | ||
`analyzer`:: | ||
Which analyzer should be used to analyze terms in the `query`. By | ||
default, the search analyzer of the top-level field will be used. | ||
`filter`:: | ||
An optional <<interval_filter,interval filter>>. | ||
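The effect of `max_gaps` can be illustrated with a toy model. The sketch below is not how Lucene or Elasticsearch implement the check; it assumes each query term occurs once, at a distinct token position, and the helper names `gap_count` and `satisfies_max_gaps` are hypothetical. A gap is counted as a position inside the matched span that is not occupied by a query term.

```python
def gap_count(positions):
    """Count positions inside the span that are not occupied by a query term.

    Assumes every query term occurs exactly once, at a distinct position.
    """
    start, end = min(positions), max(positions)
    return (end - start + 1) - len(positions)


def satisfies_max_gaps(positions, max_gaps=-1):
    """Mirror the documented rule: -1 (or unspecified) means no restriction."""
    return max_gaps < 0 or gap_count(positions) <= max_gaps


# Toy document and query; whitespace splitting stands in for real analysis.
tokens = "my very favourite food".split()
positions = [tokens.index(term) for term in ("my", "favourite", "food")]

print(gap_count(positions))               # 1: "very" sits inside the span
print(satisfies_max_gaps(positions, 0))   # False
print(satisfies_max_gaps(positions))      # True: no restriction
```

Under this model, `max_gaps: 0` accepts only terms that sit next to each other, which matches the description of the parameter above.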
|
||
[[intervals-all_of]] | ||
==== `all_of` | ||
|
||
The `all_of` rule returns matches that span a combination of other rules. | ||
|
||
[horizontal] | ||
`intervals`:: | ||
An array of rules to combine. All rules must produce a match in a | ||
document for the overall rule to match. | ||
`max_gaps`:: | ||
Specify a maximum number of gaps between the rules. Combinations that match | ||
across a distance greater than this will not match. If set to -1 or | ||
unspecified, there is no restriction on this distance. If set to 0, then the | ||
matches produced by the rules must all appear immediately next to each other. | ||
`ordered`:: | ||
Whether the intervals produced by the rules should appear in the order in | ||
which they are specified. Defaults to `false`. | ||
`filter`:: | ||
An optional <<interval_filter,interval filter>>. | ||
|
||
[[intervals-any_of]] | ||
==== `any_of` | ||
|
||
The `any_of` rule emits intervals produced by any of its sub-rules. | ||
|
||
[horizontal] | ||
`intervals`:: | ||
An array of rules to match. | ||
`filter`:: | ||
An optional <<interval_filter,interval filter>>. | ||
|
Review comment: Maybe we need to add documentation here about the fact that this only returns minimal intervals, and the consequences? |
Reply: I've added a section on minimization at the end of the doc, with some examples of queries that can produce surprising results, and how to deal with that. |
Reply: awesome |
|
||
[[interval_filter]] | ||
==== Filters | ||
|
||
You can filter intervals produced by any rules by their relation to the | ||
intervals produced by another rule. The following example will return | ||
documents that have the words `hot` and `porridge` within 10 positions | ||
of each other, without the word `salty` in between: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST _search | ||
{ | ||
  "query": { | ||
    "intervals" : { | ||
      "my_text" : { | ||
        "match" : { | ||
          "query" : "hot porridge", | ||
          "max_gaps" : 10, | ||
          "filter" : { | ||
            "not_containing" : { | ||
              "match" : { | ||
                "query" : "salty" | ||
              } | ||
            } | ||
          } | ||
        } | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
The following filters are available: | ||
[horizontal] | ||
`containing`:: | ||
Produces intervals that contain an interval from the filter rule. | ||
`contained_by`:: | ||
Produces intervals that are contained by an interval from the filter rule. | ||
`not_containing`:: | ||
Produces intervals that do not contain an interval from the filter rule. | ||
`not_contained_by`:: | ||
Produces intervals that are not contained by an interval from the filter rule. | ||
`not_overlapping`:: | ||
Produces intervals that do not overlap with an interval from the filter rule. | ||
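The five filter relations listed above can be sketched on plain position ranges. This is an illustrative model only, not the Lucene implementation: intervals are assumed to be inclusive `(start, end)` tuples of token positions, and the names `contains`, `overlaps`, `FILTERS` and `apply_filter` are hypothetical.

```python
def contains(outer, inner):
    """True if `inner` lies entirely within `outer` (inclusive bounds)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(a, b):
    """True if the two intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# Each predicate takes a candidate interval and the filter rule's intervals.
FILTERS = {
    "containing": lambda iv, fs: any(contains(iv, f) for f in fs),
    "contained_by": lambda iv, fs: any(contains(f, iv) for f in fs),
    "not_containing": lambda iv, fs: not any(contains(iv, f) for f in fs),
    "not_contained_by": lambda iv, fs: not any(contains(f, iv) for f in fs),
    "not_overlapping": lambda iv, fs: not any(overlaps(iv, f) for f in fs),
}


def apply_filter(name, intervals, filter_intervals):
    """Keep only the intervals that satisfy the named relation."""
    return [iv for iv in intervals if FILTERS[name](iv, filter_intervals)]


# "hot salty porridge": hot=0, salty=1, porridge=2
hot_porridge = [(0, 2)]
salty = [(1, 1)]

print(apply_filter("not_containing", hot_porridge, salty))  # []
print(apply_filter("not_containing", hot_porridge, []))     # [(0, 2)]
print(apply_filter("contained_by", salty, hot_porridge))    # [(1, 1)]
```

In this model the `not_containing` example from this section rejects `hot salty porridge` and accepts any text where no `salty` interval falls inside the `hot porridge` interval.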
|
||
[[interval-minimization]] | ||
==== Minimization | ||
|
||
The intervals query always minimizes intervals, to ensure that queries can | ||
run in linear time. This can sometimes cause surprising results, particularly | ||
when using `max_gaps` restrictions or filters. For example, take the | ||
following query, searching for `salty` contained within the phrase `hot | ||
porridge`: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST _search | ||
{ | ||
  "query": { | ||
    "intervals" : { | ||
      "my_text" : { | ||
        "match" : { | ||
          "query" : "salty", | ||
          "filter" : { | ||
            "contained_by" : { | ||
              "match" : { | ||
                "query" : "hot porridge" | ||
              } | ||
            } | ||
          } | ||
        } | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
This query will *not* match a document containing the phrase `hot porridge is | ||
salty porridge`, because the intervals returned by the match query for `hot | ||
porridge` only cover the initial two terms in this document, and these do not | ||
overlap the intervals covering `salty`. | ||
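The behaviour described above can be reproduced with a brute-force sketch of minimization. It is a toy model under stated assumptions, not the linear-time algorithm the query actually uses: tokens come from whitespace splitting, intervals are inclusive `(start, end)` position tuples, and `minimal_intervals` is a hypothetical helper that enumerates every span covering all terms and discards any span that contains another one.

```python
from itertools import product


def minimal_intervals(tokens, terms):
    """Return the minimal spans (inclusive positions) that cover every term."""
    positions = [
        [i for i, tok in enumerate(tokens) if tok == term] for term in terms
    ]
    # One candidate span per way of picking a position for each term.
    candidates = {(min(combo), max(combo)) for combo in product(*positions)}

    def contains_another(iv):
        return any(
            other != iv and iv[0] <= other[0] and other[1] <= iv[1]
            for other in candidates
        )

    return sorted(iv for iv in candidates if not contains_another(iv))


tokens = "hot porridge is salty porridge".split()
hot_porridge = minimal_intervals(tokens, ["hot", "porridge"])
salty = minimal_intervals(tokens, ["salty"])

print(hot_porridge)  # [(0, 1)]: the wider span (0, 4) is minimized away
print(salty)         # [(3, 3)]

# contained_by: is any `salty` interval inside a `hot porridge` interval?
matches = [
    s for s in salty
    if any(h[0] <= s[0] and s[1] <= h[1] for h in hot_porridge)
]
print(matches)       # []
```

The span from `hot` to the second `porridge` would contain `salty`, but it is discarded because it contains the shorter span over the first two terms.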
|
||
Another restriction to be aware of is the case of `any_of` rules that contain | ||
sub-rules which overlap. In particular, if one of the rules is a strict | ||
prefix of the other, then the longer rule will never be matched, which can | ||
cause surprises when used in combination with `max_gaps`. Consider the | ||
following query, searching for `the` immediately followed by `big` or `big bad`, | ||
immediately followed by `wolf`: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST _search | ||
{ | ||
  "query": { | ||
    "intervals" : { | ||
      "my_text" : { | ||
        "all_of" : { | ||
          "intervals" : [ | ||
            { "match" : { "query" : "the" } }, | ||
            { "any_of" : { | ||
                "intervals" : [ | ||
                  { "match" : { "query" : "big" } }, | ||
                  { "match" : { "query" : "big bad" } } | ||
                ] } }, | ||
            { "match" : { "query" : "wolf" } } | ||
          ], | ||
          "max_gaps" : 0, | ||
          "ordered" : true | ||
        } | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
Counter-intuitively, this query *will not* match the document `the big bad | ||
wolf`, because the `any_of` rule in the middle will only produce intervals | ||
for `big`. The intervals for `big bad` are longer than those for `big` while | ||
starting at the same position, and so are minimized away. In these cases, | ||
it's better to rewrite the query so that all of the options are explicitly | ||
laid out at the top level: | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST _search | ||
{ | ||
  "query": { | ||
    "intervals" : { | ||
      "my_text" : { | ||
        "any_of" : { | ||
          "intervals" : [ | ||
            { "match" : { | ||
                "query" : "the big bad wolf", | ||
                "ordered" : true, | ||
                "max_gaps" : 0 } }, | ||
            { "match" : { | ||
                "query" : "the big wolf", | ||
                "ordered" : true, | ||
                "max_gaps" : 0 } } | ||
          ] | ||
        } | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE |
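Why the nested `any_of` form fails on `the big bad wolf` can be traced by hand with a small model. This is a sketch under assumptions rather than the real implementation: intervals are inclusive `(start, end)` position tuples, and `minimize` and `ordered_gaps` are hypothetical helpers standing in for the minimization step and the `ordered` plus `max_gaps` check.

```python
def minimize(intervals):
    """Drop every interval that contains another, different interval."""
    unique = set(intervals)
    return sorted(
        iv for iv in unique
        if not any(
            other != iv and iv[0] <= other[0] and other[1] <= iv[1]
            for other in unique
        )
    )


def ordered_gaps(parts):
    """Gaps between consecutive sub-intervals, or None if they are out of order."""
    for left, right in zip(parts, parts[1:]):
        if left[1] >= right[0]:
            return None
    width = parts[-1][1] - parts[0][0] + 1
    covered = sum(end - start + 1 for start, end in parts)
    return width - covered


# "the big bad wolf": the=0, big=1, bad=2, wolf=3
the, big, big_bad, wolf = (0, 0), (1, 1), (1, 2), (3, 3)

middle = minimize([big, big_bad])
print(middle)                                   # [(1, 1)]: `big bad` is dropped
print(ordered_gaps([the, middle[0], wolf]))     # 1, so max_gaps 0 rejects it
print(ordered_gaps([the, big_bad, wolf]))       # 0, what the rewrite recovers
```

Laying the alternatives out at the top level avoids the problem because each full phrase is matched on its own, before any minimization across the alternatives takes place.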
Huge +1 to exposing a match and no term