Terms Lookup by Query/Filter (aka. Join Filter) #3278

Closed
wants to merge 6 commits into from
227 changes: 213 additions & 14 deletions docs/reference/query-dsl/filters/terms-filter.asciidoc
@@ -30,30 +30,30 @@ them in the more compact model that terms filter provides.
The `execution` option now accepts the following values:

[horizontal]
`plain`::
The default. Works as today. Iterates over all the terms,
building a bit set matching it, and filtering. The total filter is
cached.

`fielddata`::
Generates a terms filter that uses the fielddata cache to
compare terms. This execution mode is great to use when filtering
on a field that is already loaded into the fielddata cache from
faceting, sorting, or index warmers. When filtering on
a large number of terms, this execution can be considerably faster
than the other modes. The total filter is not cached unless
explicitly configured to do so.

`bool`::
Generates a term filter (which is cached) for each term, and
wraps those in a bool filter. The bool filter itself is not cached as it
can operate very quickly on the cached term filters.

`and`::
Generates a term filter (which is cached) for each term, and
wraps those in an and filter. The and filter itself is not cached.

`or`::
Generates a term filter (which is cached) for each term, and
wraps those in an or filter. The or filter itself is not cached.
Generally, the `bool` execution mode should be preferred.
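
The `plain` mode described above can be pictured with a small, self-contained sketch. It assumes a hypothetical in-memory inverted index (`postings`, mapping each term to the ids of the documents containing it). The class and method names are illustrative only and are not part of Elasticsearch or Lucene.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the `plain` execution mode: OR the postings of every term into one bit set. */
final class PlainTermsFilterSketch {

    /**
     * @param postings hypothetical inverted index for one field: term -> ids of documents containing it
     * @param terms    the terms listed in the terms filter
     * @return a bit set with one bit set per matching document id
     */
    static BitSet matchingDocs(Map<String, int[]> postings, List<String> terms) {
        BitSet result = new BitSet();
        for (String term : terms) {
            int[] docIds = postings.get(term);
            if (docIds == null) {
                continue; // the term does not occur in this toy index
            }
            for (int docId : docIds) {
                result.set(docId);
            }
        }
        return result;
    }

    /** Small fixed index (made-up data) for trying the sketch out. */
    static Map<String, int[]> sampleIndex() {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("kimchy", new int[] {0, 3});
        postings.put("elasticsearch", new int[] {1, 3, 4});
        postings.put("lucene", new int[] {2});
        return postings;
    }
}
```

In this model the resulting bit set is the thing that would be cached as "the total filter". The `bool`, `and`, and `or` modes would instead keep one such bit set per term and combine them at query time.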
@@ -102,25 +102,25 @@ lookup mechanism.
The terms lookup mechanism supports the following options:

[horizontal]
`index`::
The index to fetch the term values from. Defaults to the
current index.

`type`::
The type to fetch the term values from.

`id`::
The id of the document to fetch the term values from.

`path`::
The field specified as path to fetch the actual values for the
`terms` filter.

`routing`::
A custom routing value to be used when retrieving the
external terms doc.

`cache`::
Whether to cache the filter built from the retrieved document
(`true` - default) or whether to fetch and rebuild the filter on every
request (`false`). See "<<query-dsl-terms-filter-lookup-caching,Terms lookup caching>>" below
@@ -136,20 +136,126 @@ across all nodes if the "reference" terms data is not large. The lookup
terms filter will prefer to execute the get request on a local node if
possible, reducing the need for networking.

[float]
==== Terms lookup by query mechanism
Review comment (Member): We should mark this as experimental


added[1.2.0]

.Experimental!
[IMPORTANT]
=====
This feature is marked as experimental, and may be subject to change in the
future. If you use this feature, please let us know your experience with it!
=====

The terms filter by query feature allows the lookup of terms from
documents matching a query/filter. This functionality is similar to
the "<<query-dsl-has-child-filter,HasChild>>" and "<<query-dsl-has-parent-filter,HasParent>>"
functionality without their limitations: the lookup query can be executed
over multiple indices, shards, and types.

The terms lookup by query mechanism supports the following options:

[horizontal]
`indices` or `index`::
One or more indices to execute the lookup query against. Defaults
to all indices if not specified.

`types` or `type`::
One or more types to execute against. Defaults to all types of
the configured indices.

`path`::
The field from which to fetch the values in documents matching
the lookup query.

`filter`::
The filter that documents must match for their terms to be collected
as part of the query lookup.

`max_terms_per_shard`::
The maximum number of terms to collect from each shard. Defaults to
all terms.

`routing`::
A custom routing value to be used when executing the lookup query.

`cache`::
Whether to cache the filter built from the retrieved terms
(`true` - default) or whether to fetch and rebuild the filter on every
request (`false`). See "<<query-dsl-terms-filter-lookup-caching,Terms lookup caching>>"
below.

`bloom_filter`::
Configures the query lookup to gather terms within a bloom filter for
more compact term representation at the cost of lookup precision. See
the "<<query-dsl-terms-filter-lookup-bloom,Bloom Filter>>" section below.

The values for the `terms` filter will be fetched from the `path` field of
documents matching the lookup filter. If `max_terms_per_shard` is set, then
the number of terms gathered will be at most `max_terms_per_shard` * NUM_SHARDS.
If the source field is a text-based field, gathering a large number
of terms can be expensive because the terms must be transferred over the
network. In this situation it is best to use numeric fields or configure
the lookup to use a
"<<query-dsl-terms-filter-lookup-bloom,Bloom Filter>>".

The terms lookup currently only uses the `fielddata` execution mode, so proper "<<indices-warmers,Warming>>" is
recommended. Since the filter uses the fielddata cache by default, the resulting filter is not cached unless
configured to do so.
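
The `max_terms_per_shard` bound can be illustrated with a minimal sketch. It assumes each shard is modelled as a plain list of numeric `path` values from documents that already matched the lookup filter. `TermsByQuerySketch` and its methods are hypothetical names for illustration, not the classes added by this change.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of gathering lookup terms per shard and merging them on the coordinating node. */
final class TermsByQuerySketch {

    /**
     * Collects distinct terms from one shard.
     * A maxTerms value of 0 or less is treated here as "no limit" (an assumption of this sketch).
     */
    static Set<Long> collectFromShard(List<Long> shardValues, int maxTerms) {
        Set<Long> terms = new LinkedHashSet<>();
        for (Long value : shardValues) {
            if (maxTerms > 0 && terms.size() >= maxTerms) {
                break; // this shard has reached its cap
            }
            terms.add(value);
        }
        return terms;
    }

    /** Merges the per-shard results; the size is at most maxTermsPerShard * shards.size(). */
    static Set<Long> collect(List<List<Long>> shards, int maxTermsPerShard) {
        Set<Long> merged = new LinkedHashSet<>();
        for (List<Long> shard : shards) {
            merged.addAll(collectFromShard(shard, maxTermsPerShard));
        }
        return merged;
    }
}
```

With three shards and a cap of 2, at most 6 terms are gathered. Fewer are returned when shards contribute overlapping terms or hold fewer matching values than the cap.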

["float",id="query-dsl-terms-filter-lookup-bloom"]
==== Bloom filter support

When performing a lookup query against a set of text-based terms with a high
cardinality, the cost of transferring these terms over the network can be
quite expensive, resulting in slow response times. If precision is not
critical, performance can be considerably better by using a
http://en.wikipedia.org/wiki/Bloom_filter[Bloom Filter]
for the lookup terms. To enable the use of the bloom filter, set any of
the following configuration options on the `bloom_filter` option:

[horizontal]
`expected_insertions`::
The expected number of terms to be inserted into the bloom filter. This
value must be greater than 0 and is REQUIRED.

`fpp`::
The false positive probability. This is the acceptable percentage of
terms that can potentially be considered a valid lookup term even though
it was not found in any documents matching the lookup query. This value
must be between 0 and 1 and defaults to 0.03 (3%).

`hash_functions`::
The number of times a value should be hashed before being inserted into the
bloom filter. This value must be between 1 and 255 and by default has an
optimal value calculated based on the `expected_insertions` and `fpp`.
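
How `expected_insertions` and `fpp` relate to the size of a bloom filter can be sketched with the textbook sizing formulas. Whether this change computes its defaults exactly this way is an assumption. `BloomSizingSketch` is a hypothetical helper, and the 0 and 1 bounds on `fpp` are treated as exclusive here.

```java
/** Textbook bloom filter sizing: bits m = -n * ln(p) / (ln 2)^2, hash functions k = (m / n) * ln 2. */
final class BloomSizingSketch {

    /** Number of bits needed for the given number of insertions and false positive probability. */
    static long optimalBits(long expectedInsertions, double fpp) {
        if (expectedInsertions <= 0) {
            throw new IllegalArgumentException("expected_insertions must be greater than 0");
        }
        if (fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("fpp must be between 0 and 1");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2));
    }

    /** Number of hash functions for the given sizing, clamped to the documented 1..255 range. */
    static int optimalHashFunctions(long expectedInsertions, long bits) {
        long k = Math.round((double) bits / expectedInsertions * Math.log(2));
        return (int) Math.max(1, Math.min(255, k));
    }
}
```

Under these formulas, 10000 expected insertions at the default `fpp` of 0.03 need roughly 73,000 bits and 5 hash functions. Lowering `fpp` increases the number of bits.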

The optimal bloom filter configuration is very dependent on the number of terms
gathered from matching documents and the number of terms the filter will actually
compare against the bloom filter. For a higher precision (fewer false positives)
you can increase the number of `expected_insertions`, lower the `fpp`, or increase
the number of `hash_functions`. As you get a higher precision your response times
will get slower due to the resulting bloom filter getting larger and/or using more CPU
to calculate the hashes. Increasing the `fpp` value is typically the only thing
required to get faster response times.

The bloom filter support is an advanced feature and will require some trial and error
to find optimal values.
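
The precision trade-off can be seen in a toy bloom filter over numeric terms. This is a sketch under stated assumptions, not the implementation used by this change: `ToyBloomFilter` is a hypothetical class, and the hashing (a 64-bit mixing function combined by double hashing) was picked for brevity.

```java
import java.util.BitSet;

/** Toy bloom filter: may report false positives, never false negatives. */
final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Sets one bit per hash function for the given term. */
    void add(long term) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(term, i));
        }
    }

    /** Returns false only if the term was definitely never added. */
    boolean mightContain(long term) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(term, i))) {
                return false;
            }
        }
        return true;
    }

    /** Double hashing: the i-th index is derived from two base hashes of the term. */
    private int index(long term, int i) {
        long h1 = mix(term);
        long h2 = mix(term ^ 0x9E3779B97F4A7C15L) | 1L; // forced odd so successive indices differ
        return (int) Math.floorMod(h1 + i * h2, (long) numBits);
    }

    /** 64-bit mixing step in the style of the SplitMix64 finalizer. */
    private static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}
```

Every added term is always reported as present, while a term that was never added is occasionally reported as present too. In lookup terms, that means a few documents may pass the filter even though their value was not found by the lookup query.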

["float",id="query-dsl-terms-filter-lookup-caching"]
==== Terms lookup caching

There is an additional cache involved, which caches the mapping from the
lookup document to the actual terms. This lookup cache is an LRU cache.
This cache has the following options:

`indices.cache.filter.terms.size`::
The size of the lookup cache. The default is `10mb`.

`indices.cache.filter.terms.expire_after_access`::
The time after the last read an entry should expire. Disabled by default.

`indices.cache.filter.terms.expire_after_write`::
The time after the last write an entry should expire. Disabled by default.
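
For readers unfamiliar with LRU eviction, the following minimal sketch shows the behaviour using the JDK's `LinkedHashMap`. It is an illustration only: it bounds the cache by entry count, whereas the real setting is a memory size such as `10mb`, and `LruLookupCacheSketch` is a hypothetical name rather than the cache class Elasticsearch uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy LRU cache: the least recently accessed entry is evicted once the cache is full. */
final class LruLookupCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruLookupCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: reads move an entry to the "most recent" end
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```

With room for two entries, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `b` is the entry that was used least recently.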

All options for the lookup of the documents cache can only be configured
@@ -225,3 +331,96 @@ curl -XPUT localhost:9200/users/user/2 -d '{
--------------------------------------------------

In which case, the lookup path will be `followers.id`.

[float]
==== Terms lookup by query example

In the following example we are replicating the
"<<query-dsl-has-child-filter,HasChild Filter>>" by looking up the
"pid" values from child documents with the tag "something" and
then keeping only parent documents that have an "id" matching one
of the children's "pid" values.

In this example, parents and children are stored in their own indices.

[source,js]
--------------------------------------------------
curl -XPOST 'http://localhost:9200/parentIndex/_search' -d '{
    "query": {
        "constant_score": {
            "filter": {
                "terms": {
                    "id": {
                        "index": "childIndex",
                        "type": "childType",
                        "path": "pid",
                        "filter": {
                            "term": {
                                "tag": "something"
                            }
                        }
                    }
                }
            }
        }
    }
}'
--------------------------------------------------
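
The two steps performed by the request above (gather "pid" values from matching children, then keep parents whose "id" is among them) can be sketched in plain Java. This assumes in-memory lists in place of the two indices. `LookupJoinSketch` and its nested classes are hypothetical and only mirror the field names used in the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the terms lookup by query "join" between child and parent documents. */
final class LookupJoinSketch {

    /** Hypothetical child document with only the two fields used in the example. */
    static final class Child {
        final long pid;
        final String tag;

        Child(long pid, String tag) {
            this.pid = pid;
            this.tag = tag;
        }
    }

    /** Hypothetical parent document. */
    static final class Parent {
        final long id;

        Parent(long id) {
            this.id = id;
        }
    }

    /** Step 1: the lookup query. Collect the distinct "pid" values of children carrying the tag. */
    static Set<Long> lookupPids(List<Child> children, String tag) {
        Set<Long> pids = new HashSet<>();
        for (Child child : children) {
            if (tag.equals(child.tag)) {
                pids.add(child.pid);
            }
        }
        return pids;
    }

    /** Step 2: the terms filter. Keep parents whose "id" is one of the collected values. */
    static List<Parent> filterParents(List<Parent> parents, Set<Long> pids) {
        List<Parent> matches = new ArrayList<>();
        for (Parent parent : parents) {
            if (pids.contains(parent.id)) {
                matches.add(parent);
            }
        }
        return matches;
    }
}
```

Because the collected values form a set, several children pointing at the same parent contribute a single term.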

Using the "<<query-dsl-terms-filter-lookup-bloom,Bloom Filter>>" support:

[source,js]
--------------------------------------------------
curl -XPOST 'http://localhost:9200/parentIndex/_search' -d '{
    "query": {
        "constant_score": {
            "filter": {
                "terms": {
                    "id": {
                        "index": "childIndex",
                        "type": "childType",
                        "path": "pid",
                        "filter": {
                            "term": {
                                "tag": "something"
                            }
                        },
                        "bloom_filter": {
                            "expected_insertions": 10000
                        }
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

Here is another example where we are searching for products or services
mentioning "elasticsearch". Products, Services, and Companies are all stored
in their own index and contain a numeric "company_id" field. Both products
and services have a "description" field.

[source,js]
--------------------------------------------------
curl -XPOST 'http://localhost:9200/companies/_search' -d '{
    "query": {
        "constant_score": {
            "filter": {
                "terms": {
                    "company_id": {
                        "indices": ["products", "services"],
                        "path": "company_id",
                        "filter": {
                            "term": {
                                "description": "elasticsearch"
                            }
                        }
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

4 changes: 4 additions & 0 deletions src/main/java/org/elasticsearch/action/ActionModule.java
@@ -19,6 +19,9 @@

package org.elasticsearch.action;

import org.elasticsearch.action.terms.TermsByQueryAction;
import org.elasticsearch.action.terms.TransportTermsByQueryAction;

import com.google.common.collect.Maps;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthAction;
import org.elasticsearch.action.admin.cluster.health.TransportClusterHealthAction;
@@ -258,6 +261,7 @@ protected void configure() {
registerAction(DeleteAction.INSTANCE, TransportDeleteAction.class,
TransportIndexDeleteAction.class, TransportShardDeleteAction.class);
registerAction(CountAction.INSTANCE, TransportCountAction.class);
registerAction(TermsByQueryAction.INSTANCE, TransportTermsByQueryAction.class);
registerAction(SuggestAction.INSTANCE, TransportSuggestAction.class);
registerAction(UpdateAction.INSTANCE, TransportUpdateAction.class);
registerAction(MultiGetAction.INSTANCE, TransportMultiGetAction.class,