Distributed percolator engine #3173
Redesigning the percolate engine is targeted for version 1.0. The main reason the rewrite is necessary is that the current percolate engine doesn't scale. The idea is that percolating a document should be executed in the same manner as a distributed search request.
In the current approach, queries are stored in an index with a single primary shard that is automatically replicated to each data node. This allows the percolation to happen locally. When a large number of queries is indexed into this index, every node has to hold and evaluate all of them, which is exactly what doesn't scale.
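The distributed idea can be sketched with a toy in-memory model (everything below is an illustration, not the actual implementation: shards are plain objects, queries are predicates, and routing is a simple hash of the query id). Queries are partitioned across shards, the document is scattered to every shard, and the matching query ids are gathered, just like a distributed search request:

```python
from typing import Callable, Dict, List

# A "query" here is just a predicate over a document.
Query = Callable[[dict], bool]

class Shard:
    """Toy shard holding a partition of the registered percolate queries."""

    def __init__(self) -> None:
        self.queries: Dict[str, Query] = {}

    def register(self, query_id: str, query: Query) -> None:
        self.queries[query_id] = query

    def percolate(self, doc: dict) -> List[str]:
        # Run the document against only the queries stored on this shard.
        return [qid for qid, q in self.queries.items() if q(doc)]

def distributed_percolate(shards: List[Shard], doc: dict) -> List[str]:
    # Scatter the document to every shard, gather and merge the matches.
    matches: List[str] = []
    for shard in shards:
        matches.extend(shard.percolate(doc))
    return sorted(matches)

# Queries are routed to a shard (here by a hash of their id), so no single
# node needs to hold every registered query in memory.
shards = [Shard(), Shard()]
shards[hash("q1") % 2].register("q1", lambda d: "elasticsearch" in d.get("message", ""))
shards[hash("q2") % 2].register("q2", lambda d: d.get("user") == "kimchy")

print(distributed_percolate(shards, {"user": "kimchy", "message": "elasticsearch rocks"}))
```

The key property is that adding shards spreads the query set out instead of duplicating it on every node.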
Because percolation will become a distributed request, the percolate option in the index api is scheduled to be removed. The main reason is that we can't block and wait in the index api for a distributed percolate request to complete. The percolate request may take longer to complete than the actual index request (we currently percolate during replication) and would thus slow down the actual index request.
To substitute the percolate-while-indexing option, one just needs to run the percolate api directly after the index api has returned. The percolate api will remain a realtime api.
The percolator index type approach stores the percolate queries in a special reserved type inside the index they apply to, instead of in a separate dedicated index.
Store a query in the twitter index:
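Something along these lines (a sketch against a local cluster; the exact type name for registered queries is still open, so `.percolator` below is an assumption):

```sh
# Register a percolate query in the twitter index.
# ".percolator" is an assumed reserved type name; the final name may differ.
curl -XPUT 'localhost:9200/twitter/.percolator/1' -d '{
    "query" : {
        "term" : { "message" : "elasticsearch" }
    }
}'
```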
Percolating a document uses the same rest endpoint:
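For example (again a sketch requiring a running cluster; with the redesign this request fans out to the shards like a search request):

```sh
# Percolate a document against the queries registered in the twitter index.
curl -XGET 'localhost:9200/twitter/_percolate' -d '{
    "doc" : {
        "message" : "a new elasticsearch release"
    }
}'
```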
The response initially doesn't change. The rest endpoint will also support a routing query string parameter, to allow documents to be percolated only against the queries in specific shards.
During regular searches, we will automatically filter out documents of the reserved percolator type, so that registered queries don't show up in normal search results.
The plan is not to keep backwards compatibility with the current percolate implementation. Percolate queries indexed via the old infrastructure will need to be migrated to the new planned infrastructure. The 'old' infrastructure will then be removed.
After the redesign has been implemented, adding more features to the percolator is next. One of them is highlighting which parts of the query matched the document.
The idea is to have different response modes.
Here are a few thoughts on follow-up features for the percolator:
Following up on:
What I really miss is a way to somehow attach the result of a percolation to the document itself. It doesn't necessarily have to be in the document itself; a child document would be useful as well. I often use the percolator to categorize incoming data. This data comes from external services and is messy, though fixable by a simple "search and clean". We use the percolator to register (sometimes user-created) queries that map those entries to our internal values.
Currently, this works like this: we percolate each document, translate the matching query ids into our internal values on the client, and then index the updated document.
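A minimal sketch of this client-side loop, under stated assumptions: `percolate` is a stand-in for the percolate api call (here a naive keyword matcher), `CATEGORY_BY_QUERY` is a hypothetical mapping from query id to internal category, and a plain list stands in for the index api:

```python
# Hypothetical mapping from registered query id to our internal category.
CATEGORY_BY_QUERY = {
    "query-sports": "sports",
    "query-politics": "politics",
}

def percolate(doc: dict) -> list:
    # Stand-in for the percolate api call; matches on naive keywords.
    matches = []
    if "match" in doc["text"]:
        matches.append("query-sports")
    if "election" in doc["text"]:
        matches.append("query-politics")
    return matches

def categorize_and_index(doc: dict, index: list) -> dict:
    # 1. percolate, 2. map query ids to internal values, 3. index the result.
    matches = percolate(doc)
    doc["categories"] = sorted(CATEGORY_BY_QUERY[m] for m in matches)
    index.append(doc)  # stand-in for the index api call
    return doc

index = []
doc = categorize_and_index({"text": "a football match report"}, index)
print(doc["categories"])
```

Each document makes two round trips (percolate, then index), which is the network overhead mentioned above.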
Allowing this within the percolation step itself would vastly reduce our network overhead in this case and (if bulk percolation happens) would also allow us to do bulk actions in one step.
So it is like a percolate post-write operation? That would update a specific part of the percolated documents based on the percolate matches and then index the updated documents.
There are no plans for this kind of feature. You can create an issue for it if you want, so the idea doesn't get lost.