
Solr redundant-write solution #184

Closed

wants to merge 4 commits into from

Conversation

thiagocamposviana
Contributor

This patch allows eZ Find to update a list of Solr instances at once, so it is basically a distributed-write-to-Solr modification.

This is a key part of a solution that removes Solr as a single point of failure. Specifically, it allows eZ Publish to write to each instance in a pool, and read from them via a VIP or load balancer. This avoids the issues Solr has with peer-to-peer replication and, more importantly, provides support for enterprise clients that are using older versions of Solr.
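To make the mechanism concrete, here is a minimal sketch of the idea (not the actual patch code; the endpoint list, function name and error handling are illustrative), assuming the same update XML is posted to every configured instance:

```php
<?php
// Illustrative sketch of redundant writes -- not the code in this patch.
// Assumes the curl extension and a configured list of Solr write endpoints.
$solrWriteEndpoints = array(
    'http://solr1.example.com:8983/solr',
    'http://solr2.example.com:8983/solr',
);

// Post the same update XML to every instance and report which ones failed;
// what to do with the failures is exactly the open error-handling question.
function postUpdateToAll( array $endpoints, $updateXml )
{
    $failed = array();
    foreach ( $endpoints as $base )
    {
        $ch = curl_init( $base . '/update?commit=true' );
        curl_setopt( $ch, CURLOPT_POST, true );
        curl_setopt( $ch, CURLOPT_POSTFIELDS, $updateXml );
        curl_setopt( $ch, CURLOPT_HTTPHEADER, array( 'Content-Type: text/xml' ) );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        $response = curl_exec( $ch );
        $status = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
        curl_close( $ch );
        if ( $response === false || $status != 200 )
        {
            // This instance missed the update; its index will drift.
            $failed[] = $base;
        }
    }
    return $failed;
}
```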

@ernestob

👍

@dougplant

Thanks @ernestob for the vote

@bdunogier
Member

Nice feature. On the other hand, I'm not completely sold on the error handling. Do you guys think we need something here? I'd say we do, but I'm not familiar with this setup and its constraints.

Furthermore, we quite often receive bug reports about out-of-sync indexes, and this particular feature sounds a bit dangerous in that respect.

@paulborgermans
Contributor

-2

Sorry, I don't see the point in doing this at all. You refer to older (pre-Solr-4.x) installations; why not branch out a fork for those older eZ Find versions if you really think this is better than what Solr has built in? For the current eZ Find master and the Solr versions it incorporates, SolrCloud is orders of magnitude more robust.

And even for older versions, master and hybrid master + multiple master/slave setups work very well and reliably, provided they are installed properly (RAM, process limits, JVM parameters...).

As for "this avoids the issues Solr has with peer-to-peer replication": did you actually explore this?

The risks are higher than any benefits in this (too simple) approach, which even lacks an important usage pattern... but I am leaving it to you to discover which one...

Sorry again if this sounds rude, but this should not be done or supported in any official eZ Find distribution.

@thiagocamposviana
Contributor Author

The problem we were trying to solve with this approach is the following: using the master/slave approach, after a failover the slave indexes some content, but because it is still replicating from the master, it notices it is out of sync and performs the replication, effectively un-indexing the content it has just indexed.

So the idea here was to use multiple master instances instead, and this change would mitigate the previous problem, at least for older versions.
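For context, this is a minimal sketch of the standard Solr 3 pull-replication config that produces that behaviour (hostname and interval are illustrative): the slave polls a fixed masterUrl from its solrconfig.xml, so unless failover rewrites or disables this handler, a promoted slave keeps syncing from, and being overwritten by, whatever that URL serves.

```xml
<!-- solrconfig.xml on the slave: it polls the configured master and
     replaces its local index with the master's version on mismatch -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master.example.com:8983/solr/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>
```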

@shalinmangar

I'm one of the committers on the Lucene/Solr project.

If, after failover, the slave is still replicating from the master, then there is a bug in the failover strategy which should be identified and fixed. If you switch to indexing directly against multiple Solr instances, you will run into multiple inconsistency issues such as:

  1. indexing fails on some of the instances because they were down
  2. indexing fails on some of the instances because of temporary network partitions
  3. batch indexing fails at different points on different boxes for whatever reason
  4. as different replicas become inconsistent, people see different results while paging or even when reloading the same page

There are more, but I think you get the idea. This is just a can of worms. I urge you to either stick with master-slave replication and fix the failover scripts, or upgrade to SolrCloud, where we have solved (and continue to solve) hundreds of such problems, instead of hand-rolling a distributed indexing solution yourself.

I'd be happy to answer any questions and/or help with anything on the Solr side which makes things easier for ezfind.

@thiagocamposviana
Contributor Author

Thanks for the feedback, @shalinmangar. I will investigate this further tomorrow.

@thiagocamposviana
Contributor Author

I decided to just close the pull request; as pointed out, it seems this approach is not recommended.

@gggeek
Collaborator

gggeek commented Nov 30, 2014

What I would like is a good sample failover script for Solr (3), to make it easier to answer customers' requests :-)

In my own limited experience, implementing some "make sure the master is really dead when it is supposed to be, and prevent automatic failback" logic in hand-developed scripts is not too hard, at least if you have ever come across STONITH when looking at cluster implementations.

The main problems I faced were:

  1. since failback is a manual event (while failover is automatic), how to minimize the window during which the system is exposed by not having both servers available
  2. how to make sure that no indexation events get lost during the failover period, or at least that they are traced in a reliable way for replay, since starting a full reindexation just to make sure no events were lost can be quite expensive; I have had customers introducing proxies, log parsers and lots of hackish code to achieve this (a crude sketch follows)
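A crude sketch of the tracing idea from point 2, assuming a hypothetical shared log path (the replay step is left out; the point is just to persist enough to avoid a full reindexation):

```php
<?php
// Illustrative only: append each object ID that needs (re)indexing to a
// durable log so the events can be replayed after failover, instead of
// running a full reindexation. Path and format are assumptions.
function traceIndexEvent( $contentObjectId )
{
    $line = time() . "\t" . $contentObjectId . "\n";
    // FILE_APPEND + LOCK_EX keeps concurrent writers from interleaving lines.
    file_put_contents( '/var/log/ezfind/index-events.log', $line, FILE_APPEND | LOCK_EX );
}
```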

PS: Sorry to post this here but my account on share.ez.no is still not back from the dead

@andrerom
Contributor

andrerom commented Dec 2, 2014

@thiagocamposviana Did you manage to solve the underlying issue using a newer version of Solr and eZ Find? Which version was this intended for?

For some background: as much as possible, we try to take advantage of the more advanced sharding/replication/... features of the underlying technology. We will most likely do the same for databases, potentially supporting the different cluster solutions available, as opposed to implementing master/slave logic in PHP, for instance.

Is there anything we should provide out of the box for this, or a better out-of-the-box config?

@dougplant

Thanks Andre!

We have not solved this problem.

A client has requested a solution that eliminates Solr as a single point of failure. Unfortunately, they are staying at eZ Publish version 4.6, and at this version there does not appear to be any supported solution that provides this functionality.

The basic requirement is that the site should not suck if one of the Solr nodes goes down. This implies a few things:

  • there has to be automatic failover for read operations
  • the Solr index is not critical (e.g. not at all like a MySQL master-slave setup: a few missed updates to the Solr index are a much smaller problem)
  • manual recovery might take a couple of days, so the failed-over solution has to be stable, but getting the system off the 'backup' node is not urgent (if all the Solr nodes are down or go down, there is a much larger problem, and the current scope is not the place to address it)
  • scale is not an issue

The missing piece of the solution is that the read requests all go through a load balancer/virtual IP (sketched below). Arguably, there are further improvements to be made, but we thought we would start with the basics so as to keep the idea clear.
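For illustration, the read/write split could look roughly like this in solr.ini; SearchServerURI is the standard eZ Find setting, while the write-list setting name here is hypothetical:

```ini
[SolrBase]
# Reads go through the load balancer / virtual IP
SearchServerURI=http://solr-vip.example.com:8983/solr

# Hypothetical setting for this patch: every instance that receives writes
SearchServerWriteURIList[]=http://solr1.example.com:8983/solr
SearchServerWriteURIList[]=http://solr2.example.com:8983/solr
```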

I won't try to pitch the idea aside from:

  • the solution accomplishes the requirements
  • the requirements are generally useful, at least for older versions
  • the patch is a pure augmentation

@andrerom
Contributor

andrerom commented Dec 2, 2014

@paulborgermans Any suggestions for a 4.6 install? (besides upgrading, of course)

@gggeek
Collaborator

gggeek commented Dec 2, 2014

@dougplant I'd say that the "standard" Solr 3 master-slave solution with a proxy in front that does load balancing and failover would meet all of the above requirements.
Did you check out http://share.ez.no/blogs/gaetano-giunta/load-balancing-ez-find-for-fun-and-profit and http://share.ez.no/blogs/gaetano-giunta/load-balancing-ez-find-for-fun-and-profit-part-ii ?

@dougplant

@gggeek We did read those, and thanks, by the way. That model is an option; however, the need for STONITH, as you say, adds extra complexity and risk. Also, a recommended, or better yet supported, solution would have additional merit.

@paulborgermans
Contributor

As I emailed privately to @dougplant last week, a (relatively) simple Apache-based proxy together with a master + slave/sleeping master (or multiple of these), plus perhaps a few extra slaves for additional robustness, does the job as required here for older (pre-4.x) Solr/eZ Find releases. The blog posts by @gggeek also go in this direction (you need to add the health checks, though). And there is no need for STONITH either.
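Roughly, a minimal mod_proxy_balancer sketch of that setup (hostnames are illustrative; the sleeping master is marked as a hot standby, which Apache only sends traffic to when the primary member is marked failed):

```apache
# Requires mod_proxy, mod_proxy_http and mod_proxy_balancer
<Proxy balancer://solrpool>
    BalancerMember http://solr-master.internal:8983/solr retry=30
    # Hot standby: only used while the primary is marked as failed
    BalancerMember http://solr-standby.internal:8983/solr status=+H
</Proxy>
ProxyPass /solr balancer://solrpool
ProxyPassReverse /solr balancer://solrpool
```

Note this only gives you passive failure detection on proxy errors; the real health checks still have to be added around it.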

hth
Paul
