Is there a guide to setting up load balancing? #3093

Open
Unifex opened this Issue Dec 7, 2016 · 4 comments


@Unifex
Unifex commented Dec 7, 2016 edited

tl;dr:
We're seeing weirdness in our load-balanced setup: different pads for the same PadID, depending on which load balancer responds. Are there official or reliable docs on load balancing Etherpad that I'm just failing to find?

The long version:
The reason I ask is that we have a system that uses a load balancer for a number of services, and the Etherpad domain backs onto a single Etherpad server. The load balancer is also the single termination point for HTTPS.

The problem we are seeing is that when we tested from mobile devices, they appear to have been directed to one load balancer while desktops were directed through the other. I don't know the specifics of why this was the case. The issue, though, is that each load balancer served a different document despite the PadID being the same. Checking the logs also showed that only one pad had been created that day.

We turned off the secondary and set it up as a failover instead. All devices then showed the same document, the one from the desktop machines. When we turned the primary off, forcing the secondary to act as the failover, all devices saw the other document, the one the mobile devices had initially seen.

The weirdness here is that they share the same PadID.

I've been searching for things related to this and I've not come across this specific issue yet. I've found a number of conversations around load testing and a few around load balancing but the load balancing ones never seemed to reach any conclusion.

For reference, I'm involved with a project trying to deploy this to a production environment for collaboration in emergency situations. I'm not a node.js dev so please take that into account when asking techy questions around this. I don't mind being asked questions that may seem simple or obvious to a node.js dev, in fact it's likely needed. I'm also part of a team and I have very limited access to the production servers. I am however in close contact with the sysadmins that do take care of that side of things.

@lpagliari
Contributor

Hi, @Unifex, it looks like each instance of Etherpad has its own database. Are you guys using a remote DB to store the pads? Do you have access to the settings.json in the Etherpad root folder? Could you tell us the DB config in that file?

I am not aware of any doc about load balancing for Etherpad and I don't know if it is prepared to replicate its changes through different DBs for redundancy, so maybe having one DB for each instance is what is making you have this behavior.
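To make the "one DB per instance" check concrete, here is a minimal sketch that compares where two instances point. It assumes the `dbType` and `dbSettings` keys from Etherpad's settings.json; the helper name `db_target`, the host `db1`, and both sample configs are made up for illustration. A real settings.json may contain comments, which would need stripping before `json.loads` accepts it.

```python
import json

def db_target(settings: dict) -> tuple:
    """Return (dbType, host, database) from a parsed settings.json."""
    db_settings = settings.get("dbSettings", {})
    return (
        settings.get("dbType"),
        db_settings.get("host"),
        db_settings.get("database"),
    )

# Hypothetical configs: one instance on postgres, one left on the default dirty DB.
node_a = json.loads(
    '{"dbType": "postgres", "dbSettings": {"host": "db1", "database": "etherpad"}}'
)
node_b = json.loads(
    '{"dbType": "dirty", "dbSettings": {"filename": "var/dirty.db"}}'
)

# If the tuples differ, the two instances are not sharing one store.
same_store = db_target(node_a) == db_target(node_b)
print(db_target(node_a), db_target(node_b), same_store)
```

If the two tuples differ, each instance is writing to its own store, which on its own would explain two documents under one PadID.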

@Unifex
Unifex commented Dec 8, 2016

Hello @lpagliari,

Thanks for the reply. Just a heads up: my less than prompt response is due to living in the future. Most of the world is sleeping while we're up here in New Zealand.

I believe that we only have one DB. We definitely only have one DB server for it. I've put in a request with the sysadmins to verify that this is the case. I'm hoping to find that one is pointing at the DB server and the other is misconfigured (default dirtyDB still in place perhaps...)

Should have a response within half an hour.

Thanks again for the reply.

Regards,
Gold

@Unifex
Unifex commented Dec 8, 2016

Okay, I have just heard back, and both of the load-balanced servers are indeed pointing at the same database on the same database server. We verified it on the postgres server and can see both servers with a connection to the DB. I asked the sysadmin to actually eyeball the settings.json file too, and they're the same. He went one step further and checked the hash of the file on each server; they're identical.
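For anyone wanting to repeat the hash check, a rough sketch of the idea follows. The thread doesn't say which tool or hash the sysadmin used, so SHA-256 and the helper name `file_sha256` are assumptions, and the two temporary files merely stand in for each server's settings.json.

```python
import hashlib
import os
import tempfile

def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Throwaway files standing in for the settings.json copied from each server.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "settings_a.json")
    path_b = os.path.join(tmp, "settings_b.json")
    for path in (path_a, path_b):
        with open(path, "w") as handle:
            handle.write('{"dbType": "postgres"}')
    identical = file_sha256(path_a) == file_sha256(path_b)

print("settings identical:", identical)
```

Matching digests only show the config files are byte-identical; they say nothing about state each running process holds.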

What we have noticed, though, is that APIKEY.txt is the same on both servers, but SESSIONKEY.txt differs.

Looking at that now.

@Unifex
Unifex commented Dec 9, 2016

SESSIONKEY.txt didn't appear to play any significant role.

We did notice that the "head" parameter had different values for each server and that it is used in building the key for the key:value store.

e.g.
`eth1

pad:58483aee8a6dd9.65340416 | {"atext":{"text":"Test\n\nUntest?\n\n\n","attribs":"4+45|4+b|1+1"},"pool":{"numToAttrib":{"0":["author","a.kFcgC1Xr1xVyK6VP"],"1":["author","a.KokfEaKFHuLOrj8Q"],"2":["author","a.XqjPbBqDZM7WEIv6"],"3":["author","a.UTXpIyYgqxjhQBJN"],"4":["author","a.6adq2JkT7Yv40gBK"],"5":["author","a.vPuSKAf3Gx2voZlH"]},"nextNum":6},"head":95,"chatHead":-1,"publicStatus":false,"passwordHash":null,"savedRevisions":[]}

eth2

pad:58483aee8a6dd9.65340416 | {"atext":{"text":"Test\n\nTest\n\nWTH\n\n...\n\n\n","attribs":"4+45|8+i|1+1"},"pool":{"numToAttrib":{"0":["author","a.kFcgC1Xr1xVyK6VP"],"1":["author","a.KokfEaKFHuLOrj8Q"],"2":["author","a.XqjPbBqDZM7WEIv6"],"3":["author","a.UTXpIyYgqxjhQBJN"],"4":["author","a.6adq2JkT7Yv40gBK"],"5":["author","a.VImJzfCalZs11fia"]},"nextNum":6},"head":105,"chatHead":0,"publicStatus":false,"passwordHash":null,"savedRevisions":[]}
`
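To see exactly which fields disagree between the two views, something like the sketch below works. The two dicts are trimmed copies of the records above (the `pool` and a few unchanged fields are left out for brevity), and `diff_fields` is just a helper name I made up.

```python
def diff_fields(a: dict, b: dict) -> list:
    """Return the sorted top-level keys whose values differ between a and b."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# Trimmed copies of the two records for pad:58483aee8a6dd9.65340416.
eth1 = {
    "atext": {"text": "Test\n\nUntest?\n\n\n", "attribs": "4+45|4+b|1+1"},
    "head": 95,
    "chatHead": -1,
    "publicStatus": False,
}
eth2 = {
    "atext": {"text": "Test\n\nTest\n\nWTH\n\n...\n\n\n", "attribs": "4+45|8+i|1+1"},
    "head": 105,
    "chatHead": 0,
    "publicStatus": False,
}

differing = diff_fields(eth1, eth2)
print(differing)  # ['atext', 'chatHead', 'head']
```

So the same key carries a different text, revision head, and chat head depending on which server reports it. One possible reading, not confirmed by anyone in this thread, is that each instance is holding its own copy of the pad state rather than re-reading the shared database.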

We're at the point of giving up on load balancing this and going for a failover approach. High availability is the primary goal for us.

It would be good to have an official statement from the project leads about this though. It will save others a lot of time hunting out the answer.
