
Solr redundant-write solution #184

Closed

wants to merge 4 commits into from

Conversation

thiagocamposviana
Contributor

This patch allows eZ Find to update a list of Solr instances at once, so it is basically a distributed-write-to-Solr modification.

This is a key part of a solution that removes Solr as a single point of failure. Specifically, it allows eZ Publish to write to each instance in a pool, and read from them via a VIP or load balancer. This avoids the issues Solr has with peer-to-peer replication and, more importantly, provides support for enterprise clients that are using older versions of Solr.
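To make the mechanism concrete, here is a minimal sketch of the idea (not the actual patch code; the endpoint list, function name and error handling are illustrative), assuming the same update XML is posted to every configured instance:

```php
<?php
// Illustrative sketch of redundant writes -- not the code in this patch.
// Assumes the curl extension and a configured list of Solr write endpoints.
$solrWriteEndpoints = array(
    'http://solr1.example.com:8983/solr',
    'http://solr2.example.com:8983/solr',
);

// Post the same update XML to every instance and report which ones failed;
// what to do with the failures is exactly the open error-handling question.
function postUpdateToAll( array $endpoints, $updateXml )
{
    $failed = array();
    foreach ( $endpoints as $base )
    {
        $ch = curl_init( $base . '/update?commit=true' );
        curl_setopt( $ch, CURLOPT_POST, true );
        curl_setopt( $ch, CURLOPT_POSTFIELDS, $updateXml );
        curl_setopt( $ch, CURLOPT_HTTPHEADER, array( 'Content-Type: text/xml' ) );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        $response = curl_exec( $ch );
        $status = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
        curl_close( $ch );
        if ( $response === false || $status != 200 )
        {
            // This instance missed the update; its index will drift.
            $failed[] = $base;
        }
    }
    return $failed;
}
```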

@ernestob

👍

@dougplant

Thanks @ernestob for the vote

@bdunogier
Member

Nice feature. On the other hand, I'm not completely sold on the error handling. Do you guys think we need something here? I'd say we do, but I'm not familiar with this setup and its constraints.

Furthermore, we quite often receive bug reports about out-of-sync indexes, and this particular feature sounds a bit dangerous in that respect.

@paulborgermans
Contributor

-2

Sorry, I don't see the point in doing this at all. You refer to older (pre-Solr-4.x) installations; why not branch out a fork for those older eZ Find versions if you really think this is better than what Solr has built in? For the current eZ Find master and the Solr versions it incorporates, SolrCloud is orders of magnitude more robust.

And even for older versions, master and hybrid master + multiple master/slave setups work very well and reliably, provided they are installed properly (RAM, process limits, JVM parameters...).

As for "this avoids the issues Solr has with peer-to-peer replication": did you actually explore this?

The risks are higher than any benefits in this (too simple) approach, which even lacks an important usage pattern... but I am leaving it to you to discover which one...

Sorry again if this sounds rude, but this should not be done or supported in any official eZ Find distribution.

@thiagocamposviana
Contributor Author

The problem we were trying to solve with this approach is the following: using the master/slave approach, after a failover the slave indexes some content, but because it is still replicating from the master, it notices it is out of sync and performs the replication, effectively un-indexing the content it has just indexed.

So the idea here was to use multiple master instances instead, and this change would mitigate the previous problem, at least for older versions.
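For context, this is a minimal sketch of the standard Solr 3 pull-replication config that produces that behaviour (hostname and interval are illustrative): the slave polls a fixed masterUrl from its solrconfig.xml, so unless failover rewrites or disables this handler, a promoted slave keeps syncing from, and being overwritten by, whatever that URL serves.

```xml
<!-- solrconfig.xml on the slave: it polls the configured master and
     replaces its local index with the master's version on mismatch -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master.example.com:8983/solr/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>
```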

@shalinmangar

I'm one of the committers on the Lucene/Solr project.

If, after failover, the slave is still replicating from the master, then there is a bug in the failover strategy which should be identified and fixed. If you switch to indexing directly against multiple Solr instances, you will run into multiple inconsistency issues such as:

  1. indexing fails on some of the instances because they were down
  2. indexing fails on some of the instances because of temporary network partitions
  3. batch indexing fails at different points on different boxes for whatever reason
  4. as different replicas become inconsistent, people see different results while paging or even when reloading the same page

There are more, but I think you get the idea. This is just a can of worms. I urge you to either stick with master-slave replication and fix the failover scripts, or upgrade to SolrCloud, where we have solved (and continue to solve) hundreds of such problems, instead of hand-rolling a distributed indexing solution yourself.

I'd be happy to answer any questions and/or help with anything on the Solr side which makes things easier for ezfind.

@thiagocamposviana
Contributor Author

Thanks for the feedback, @shalinmangar. I will investigate this further tomorrow.

@thiagocamposviana
Contributor Author

I decided to just close the pull request; as pointed out, it seems this approach is not recommended.

@gggeek
Collaborator

gggeek commented Nov 30, 2014

What I would like is a good sample failover script for Solr (3), to make it easier to answer customers' requests :-)

In my own limited experience, implementing some "make sure the master is really dead when it is supposed to be, and prevent automatic failback" logic in hand-developed scripts is not too hard, at least if you have ever come across STONITH when looking at cluster implementations.

The main problems I faced were:

  1. since failback is a manual event (while failover is automatic), how to minimize the window during which the system is exposed by not having both servers available
  2. how to make sure that no indexation events get lost during the failover period, or at least that they are traced in a reliable way for replay, since starting a full reindexation just to make sure no events were lost can be quite expensive; I have had customers introducing proxies, log parsers and lots of hackish code to achieve this (a crude sketch follows)
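A crude sketch of the tracing idea from point 2, assuming a hypothetical shared log path (the replay step is left out; the point is just to persist enough to avoid a full reindexation):

```php
<?php
// Illustrative only: append each object ID that needs (re)indexing to a
// durable log so the events can be replayed after failover, instead of
// running a full reindexation. Path and format are assumptions.
function traceIndexEvent( $contentObjectId )
{
    $line = time() . "\t" . $contentObjectId . "\n";
    // FILE_APPEND + LOCK_EX keeps concurrent writers from interleaving lines.
    file_put_contents( '/var/log/ezfind/index-events.log', $line, FILE_APPEND | LOCK_EX );
}
```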

PS: Sorry to post this here but my account on share.ez.no is still not back from the dead

@andrerom
Contributor

andrerom commented Dec 2, 2014

@thiagocamposviana Did you manage to solve the underlying issue using a newer version of Solr and eZ Find? Which version was this intended for?

For some background: as much as possible, we try to take advantage of the more advanced sharding/replication/... features of the underlying technology. We will most likely do the same for databases, potentially supporting the different cluster solutions available, as opposed to implementing master/slave logic in PHP, for instance.

Is there anything we should provide out of the box for this, or a better out-of-the-box config?

@dougplant

Thanks Andre!

We have not solved this problem.

A client has requested a solution that eliminates Solr as a single point of failure. Unfortunately, they are staying at eZ Publish version 4.6, and at this version there does not appear to be any supported solution that provides this functionality.

The basic requirement is that the site should not suck if one of the Solr nodes goes down. This implies a few things:

  • there has to be automatic failover for read operations
  • the Solr index is not critical (e.g. not at all like a MySQL master-slave setup: a few missed updates to the Solr index are a much smaller problem)
  • manual recovery might take a couple of days, so the failed-over solution has to be stable, but getting the system off the 'backup' node is not urgent (if all the Solr nodes are down or go down, there is a much larger problem, and the current scope is not the place to address it)
  • scale is not an issue

The missing piece of the solution is that the read requests all go through a load balancer/virtual IP (sketched below). Arguably, there are further improvements to be made, but we thought we would start with the basics so as to keep the idea clear.
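For illustration, the read/write split could look roughly like this in solr.ini; SearchServerURI is the standard eZ Find setting, while the write-list setting name here is hypothetical:

```ini
[SolrBase]
# Reads go through the load balancer / virtual IP
SearchServerURI=http://solr-vip.example.com:8983/solr

# Hypothetical setting for this patch: every instance that receives writes
SearchServerWriteURIList[]=http://solr1.example.com:8983/solr
SearchServerWriteURIList[]=http://solr2.example.com:8983/solr
```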

I won't try to pitch the idea aside from:

  • the solution accomplishes the requirements
  • the requirements are generally useful, at least for older versions
  • the patch is a pure augmentation

@andrerom
Contributor

andrerom commented Dec 2, 2014

@paulborgermans Any suggestions for a 4.6 install? (besides upgrading, of course)

@gggeek
Collaborator

gggeek commented Dec 2, 2014

@dougplant I'd say that the "standard" Solr 3 master-slave solution with a proxy in front that does load balancing and failover would meet all of the above requirements.
Did you check out http://share.ez.no/blogs/gaetano-giunta/load-balancing-ez-find-for-fun-and-profit and http://share.ez.no/blogs/gaetano-giunta/load-balancing-ez-find-for-fun-and-profit-part-ii ?

@dougplant

@gggeek We did read those, and thanks, by the way. That model is an option; however, the need for STONITH, as you say, adds extra complexity and risk. Also, a recommended, or better yet supported, solution would have additional merit.

@paulborgermans
Contributor

As I emailed privately to @dougplant last week, a (relatively) simple Apache-based proxy together with a master + slave/sleeping master (or multiple of these), plus perhaps a few extra slaves for additional robustness, does the job as required here for older (pre-4.x) Solr/eZ Find releases. The blog posts by @gggeek also go in this direction (you need to add the health checks, though). And there is no need for STONITH either.
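Roughly, a minimal mod_proxy_balancer sketch of that setup (hostnames are illustrative; the sleeping master is marked as a hot standby, which Apache only sends traffic to when the primary member is marked failed):

```apache
# Requires mod_proxy, mod_proxy_http and mod_proxy_balancer
<Proxy balancer://solrpool>
    BalancerMember http://solr-master.internal:8983/solr retry=30
    # Hot standby: only used while the primary is marked as failed
    BalancerMember http://solr-standby.internal:8983/solr status=+H
</Proxy>
ProxyPass /solr balancer://solrpool
ProxyPassReverse /solr balancer://solrpool
```

Note this only gives you passive failure detection on proxy errors; the real health checks still have to be added around it.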

hth
Paul
