Percolating not as part of a cluster for high-rate operations #3606

synhershko · 2013-09-03T11:56:06Z

We run a large data shop, with ElasticSearch 0.90 serving as our main search engine (+ many plugins). We also use the percolator functionality at a very high rate (tens of thousands per minute, against many queries).

Previously we were using the old percolator implementation, and after several attempts we found that separating the percolation operations from the actual search cluster makes much more sense for us. This way we can have servers dedicated to percolation and make sure we handle everything we need in a timely fashion; it also means we can just add more servers during peaks and take them down later without having to make sure no data is replicated to them.

First we tried doing percolation on a separate cluster, but since each percolating node is practically autonomous - it just gets a document, percolates and registers the result with our backend - we ended up having a self-contained jar that does all that. We also had issues with discovery over S3, which are no longer an issue when we run a 1-node cluster, as stupid as it may sound.

We are also using a custom PercolatorExecutor implementation - it implements highlighting using the ParsedDocument as the data source and also has some smarts with regard to query filtering. In order to do this, I had to copy-paste code, create a new Percolator Module and use a hack to catch the singleton PE implementation. It is a very ugly hack and quite hard to keep up to date with updates.

In case you'd ask, the reason why we still use ES's percolation and not a custom built MemoryIndex implementation is the compatibility we need with the ES index structure we have, and analyzer and QueryParsing conventions.

TL;DR

If there could be a way to run a stand-alone, cluster-less PercolatorExecutor instance, enjoying all the benefits of the Mappers and QueryParsers but without the intense memory requirements and startup times, that would be really great.

Once this PE instance is initialized, it would be great if there was a way to actually get it using some Java API call - even raw access to the IoC container could help. Alternatively, access to the MapperService, QP instances etc.

Additionally, going forward it would really help to have the PercolatorExecutor itself extensible - in particular the ability to amend the set of registered queries using the Java API (for example for registering, updating and removing queries using a mechanism other than the ES API), and to have selectors on those queries. I'm aware the percolate request can accept a query along with the doc to percolate, but I actually couldn't get it to work for some reason (and got no answer on the mailing list).

Will be happy to discuss the small details as well

Cheers

kimchy · 2013-09-03T12:25:48Z

Regarding being able to run a standalone percolator service, you can start a local node in your Java program, and just use that for embedded percolations. I would stress, though, that by going embedded you won't be able to use the built-in distributed aspects of percolation that are coming in 1.0.

Regarding being able to extend the percolation, we have made some progress there in 1.0, as part of the work of distributed percolation (that now also supports highlighting). I will say though that pluggability of percolator is less of a priority for us, as we typically prefer to add features that can then be used for all the userbase of ES without the need for extensions (like highlighting, which we added in 1.0).

The selectors part works. We won't add pluggability to things that are features in ES.

Btw, if your needs end up being really custom, you can always implement your own "percolator service", ES is open for you to do so.

martijnvg · 2013-09-03T12:32:09Z

@synhershko Have you seen the recent changes that have been made in master regarding the percolator (#3173, #3574, #3506, #3488)? Like @kimchy said, making a standalone percolator doesn't make much sense now since the percolator is now distributed.

synhershko · 2013-09-03T14:47:46Z

Re standalone - this is exactly what we did (went embedded and implemented our own service). My problem is with the amount of init work and memory this still requires, the network and FS operations we don't really need, etc. I haven't previously set local to true - this could help, but still, all we really need is what the PercolatorExecutor instance requires to work correctly.

The concept of percolation within a cluster obviously isn't targeted at a high rate of percolation operations - otherwise I can't really understand how it is at all useful, even when it is done distributed. At scale, just the serialization and network traffic involved can be considered a waste, and you'd be better off dedicating nodes to percolation alone.

Is there a way to selectively init stuff? Say, if I only wanted the MapperService and QueryParserService and not the whole package?

Or maybe a way to get to the IoC container and ask it for an instance? I couldn't find a way to do that.

We don't mind about losing distributed features - as I said each node is autonomous and I would rather have it as light and concentrated at percolation as possible.

@martijnvg I have seen all of those - like I said, this issue is about asking for a way to work with an instance of PercolatorExecutor, or to easily initialize and use a custom one.

kimchy · 2013-09-03T14:53:14Z

setting local to true will help remove any networking (you might need to also disable http). I want proof that you really see memory overhead then, I don't buy it.
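For readers following along, the two switches mentioned here can be sketched as a settings fragment. This is a minimal sketch assuming the 0.90-era setting keys `node.local` and `http.enabled`; whether the embedded node picks up a config file or receives the same keys programmatically depends on how it is started, and that part is not specified in this thread.

```yaml
# Assumed 0.90-era setting keys; verify against the version in use.
node.local: true      # JVM-local discovery and transport, no network between nodes
http.enabled: false   # do not start the HTTP/REST layer
```

Heap size is still governed by the JVM flags of the host application, so any memory comparison should be made with the same `-Xmx` before and after.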

In any case, my view is that focusing on allowing those services to run on their own is not within the scope of elasticsearch. Too esoteric. It can be done, though you will need to play with it, and I really think running an embedded node is more than enough.

synhershko · 2013-09-03T15:02:05Z

I hear you. Let me see how big it gets with the lowest configs.

kimchy · 2013-09-03T15:03:16Z

also, I completely disagree regarding your note about high rate percolation. Your embedded case will only work up to a certain number of registered percolation queries, at which point you will need to partition, and you will end up implementing yourself what we do in distributed percolation.
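To make the partitioning point concrete, here is a minimal, self-contained sketch of the scatter-gather idea: registered queries are split across partitions, a document is run against every partition, and the matches are merged. It is an illustration only, not how Elasticsearch implements distributed percolation. The class `PartitionedPercolator`, the hash-based routing and the use of `Predicate<String>` as a stand-in for a parsed query are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical sketch: queries split across partitions, matched scatter-gather style. */
public class PartitionedPercolator {
    // One queryId -> query map per partition; a partition stands in for a node or shard.
    private final List<Map<String, Predicate<String>>> partitions = new ArrayList<>();

    public PartitionedPercolator(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new LinkedHashMap<>());
        }
    }

    /** Routes a query to a partition by hashing its id (an assumed routing scheme). */
    public void register(String queryId, Predicate<String> query) {
        int p = Math.floorMod(queryId.hashCode(), partitions.size());
        partitions.get(p).put(queryId, query);
    }

    /** Runs the document against every partition and merges the matching query ids. */
    public TreeSet<String> percolate(String document) {
        TreeSet<String> matches = new TreeSet<>();
        for (Map<String, Predicate<String>> partition : partitions) {
            for (Map.Entry<String, Predicate<String>> e : partition.entrySet()) {
                if (e.getValue().test(document)) {
                    matches.add(e.getKey());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PartitionedPercolator p = new PartitionedPercolator(3);
        p.register("q-error", doc -> doc.contains("error"));
        p.register("q-disk", doc -> doc.contains("disk"));
        p.register("q-login", doc -> doc.contains("login"));
        System.out.println(p.percolate("disk error on node 7")); // prints [q-disk, q-error]
    }
}
```

In a real deployment each partition would live on a different machine, which is where the request fan-out, response serialization and failure handling raised in this comment come in.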

also, you will end up building that embedded node as a remote service, and you will need anyhow to serialize the response from it. And what happens with HA, .... I don't really need answers for those questions, you can go ahead and have fun with an embedded node and do local percolations (in 0.90 or master), I do think that you might find yourself needing all the features ES gives you.

synhershko · 2013-09-03T15:14:16Z

Hey man, no offense intended. I really think the percolation concept is great and beautifully implemented, but what we saw when we tried running it on our search cluster is a huge bottleneck - with high rate percolation, high rate indexing and heavy searches all combined on the same cluster. This is why we moved to dedicated, individual, percolator nodes, and are now using some sort of profiling on queries to eliminate irrelevant ones in advance as an optimization.

kimchy · 2013-09-03T15:31:02Z

understood, I think the new distributed percolation will help a lot, and I agree that sometimes it makes sense to separate to 2 clusters, or use dedicated nodes for percolation, and dedicated ones for search in the same cluster.

I am simply sharing my concern that I think you will end up needing to implement a few things we have in ES. God knows we did :), as evidenced by the rewrite to properly have distributed percolation in master...

synhershko · 2013-09-25T10:18:07Z

The resources used when moving to a local JVM node and disabling HTTP seem moderate. I'll keep open the option of going back to sharded percolation operations, as you suggested. Closing this.

synhershko closed this as completed Sep 25, 2013
