Failover and recovery #3

Hi guys!
I'm looking for information about failover and recovery of nodes in a Postgres cluster with this solution. Sorry if it already exists and I missed it; could you please advise?

Comments
It's something we are definitely going to add. The containers we use under the hood support a watch/failover capability; we will leverage some of that in what the operator will allow...
Assuming that the master went down: isn't failover supposed to be automatic, based on which replica has the most of the master's data, promoting that replica as the new master? My assumption was that the operator is supposed to watch over the instances and take care of master selection (on bootstrap and on master failure) and of coordination among the participating pods. Please correct me in case I am missing anything.
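
A minimal sketch of that selection step, assuming the "most data" criterion means the furthest WAL replay position. This is not the operator's actual logic; the connection strings are hypothetical, and the query relies on `pg_wal_lsn_diff` and `pg_last_wal_replay_lsn`, which exist under those names in PostgreSQL 10+ (older versions spell them `pg_xlog_location_diff` and `pg_last_xlog_replay_location`):

```go
// lsnpick.go: ask each replica how much WAL it has replayed and pick
// the one furthest ahead as the promotion candidate. A sketch only;
// DSNs are hypothetical and error handling is minimal.
// Driver: github.com/lib/pq.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// replayedBytes returns the replica's replay position as a byte
// offset from the start of the WAL stream.
func replayedBytes(dsn string) (int64, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()
	var n int64
	err = db.QueryRow(
		"SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0')").Scan(&n)
	return n, err
}

func main() {
	replicas := []string{ // hypothetical connection strings
		"host=replica-0 user=postgres dbname=postgres sslmode=disable",
		"host=replica-1 user=postgres dbname=postgres sslmode=disable",
	}
	var best string
	var bestPos int64 = -1
	for _, dsn := range replicas {
		pos, err := replayedBytes(dsn)
		if err != nil {
			log.Printf("skipping %s: %v", dsn, err)
			continue
		}
		if pos > bestPos {
			best, bestPos = dsn, pos
		}
	}
	fmt.Printf("promotion candidate: %s (replayed %d bytes)\n", best, bestPos)
}
```

Comparing byte offsets rather than the textual `X/Y` LSN form sidesteps the pitfalls of lexical string comparison.
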
Well, @jmccormick2001, I was asking because I'm looking for an alternative solution and want to understand its benefits compared to mine. Please let me know whenever you have something ready... I'm all into this topic! Thanks :)
This is a good topic for sure and one I'm thinking about... some design ideas in my head include:
Usually you don't want a cluster without failover logic.
Switchover is the correct term here :P as we are not failing... and yeah, it's something I have not developed.
That might be solved by replica priority, and quorum election of course to avoid split-brain... so priority should not mean much in case of network issues, but it will in case of bad master health (see the sketch after this comment).
Any practical use for this one?
For that one I use pgpool ;) It does its job, excluding nodes from the list of backends based on different conditions and health checks against the Postgres servers.
That one I don't understand at all :( Sorry for going through the list of your ideas one by one, but it looks like I was in that state half a year ago, and I can help you with some of them if you need help, of course :)
PS
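
A toy sketch of the replica-priority and quorum idea mentioned above, under assumptions invented for illustration: several observers report whether they can reach the master, a promotion proceeds only when a strict majority agree it is down, and ties are broken by configured priority. None of this is real operator code:

```go
// quorum.go: toy illustration of priority-aware failover gated by a
// quorum check to avoid split-brain. Types and thresholds here are
// assumptions for the sketch, not any project's actual design.
package main

import "fmt"

type Replica struct {
	Name     string
	Priority int  // higher wins the election
	Healthy  bool // from health checks
}

// masterDownByQuorum returns true only if a strict majority of
// observers report the master unreachable, so a network partition
// seen by a minority can never trigger a promotion.
func masterDownByQuorum(reports []bool) bool {
	down := 0
	for _, unreachable := range reports {
		if unreachable {
			down++
		}
	}
	return down*2 > len(reports)
}

// electReplica picks the healthy replica with the highest priority.
func electReplica(replicas []Replica) (Replica, bool) {
	var best Replica
	found := false
	for _, r := range replicas {
		if r.Healthy && (!found || r.Priority > best.Priority) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	reports := []bool{true, true, false} // 2 of 3 observers see the master as down
	replicas := []Replica{
		{"replica-0", 10, true},
		{"replica-1", 50, true},
	}
	if masterDownByQuorum(reports) {
		if winner, ok := electReplica(replicas); ok {
			fmt.Println("promote:", winner.Name)
		}
	}
}
```

Requiring a strict majority is what makes priority irrelevant during a partition, which is exactly the behavior described in the comment above.
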
Hi everyone! Keep up the good work!
I was also looking at an alternative using stolon / etcd:
Thanks. Currently the master runs in a Deployment, which 'should' get rescheduled by Kube if the Kube node dies; Kube 'should' also restart the pod if it dies, to keep the Deployment consistent. However, the gotcha here is: what if the master's data is somehow corrupted? Kube in that case will just restart the bad database over and over. What I'm considering is a more formal way to specify "I want to fail over to this replica"; I have some ideas on how this will work but want to give it some extra thought before I do the implementation. I also want a means of specifying a 'sync' replica that a user could specifically target as the failover target. This is definitely a high priority for an upcoming release, so stay tuned. There is also the case where a user might want an 'automated failover'; I'm thinking about that use case as well.
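
For illustration only, one way the "fail over to this replica" step can be carried out by hand today (this is not the operator's planned mechanism; the pod name and data directory are hypothetical): promote the chosen standby with `pg_ctl`. On setups of this era the equivalent is touching the `trigger_file` named in `recovery.conf`:

```go
// promote.go: sketch of a manually targeted failover step: promote
// one explicitly chosen replica. Assumes the process can reach
// pg_ctl inside the replica's container via `kubectl exec`; the pod
// name and /pgdata path are hypothetical.
package main

import (
	"log"
	"os/exec"
)

func promote(pod string) error {
	cmd := exec.Command("kubectl", "exec", pod, "--",
		"pg_ctl", "promote", "-D", "/pgdata")
	out, err := cmd.CombinedOutput()
	log.Printf("%s: %s", pod, out)
	return err
}

func main() {
	// The user names the failover target explicitly, as described above.
	if err := promote("mycluster-replica-1"); err != nil {
		log.Fatal(err)
	}
}
```
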
Thanks for the immediate reply!
That's not AWS-specific though... just in general: if the database is the critical point of failure, which it is for most applications in some way, and you are aiming towards 100% uptime... then in case of a sudden node death or downtime, the 1-2 minutes it takes k8s to free the PVC, mount it on another node, and restart the pod somewhere else is quite a long downtime, when there are replicas available that could take over...
Understood. There is definitely the case where users will want to orchestrate a failover onto a specific replica regardless of what Kube might do.
Many thanks for your input! I don't want to rush you or anything, but do you have a rough timescale or idea of when some sort of automatic failover may get realised? May I help in any way?
No worries. The current roadmap/schedule is to release the 'policy' mechanism in the next week or so; this is a new feature that lets you apply SQL policies against clusters. Right after that is the failover work, so I'm shooting to have some form of failover feature in about the 4-5 week time frame, hopefully earlier. Once I have some early work done on this, I'll reach out and see if you could do a sanity check on it.
👍 Will try to keep track, happy to help with a sanity check.
Right now there is only a full backup (using pg_basebackup). The other forms of backup and a scheduling capability are in the works too, but they will likely lag behind the failover and policy releases.
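
For reference, a full base backup of the kind mentioned here is taken with the standard pg_basebackup tool; a minimal wrapper might look like the sketch below. The host, role, and target directory are hypothetical, while the flags are standard pg_basebackup options:

```go
// backup.go: sketch of invoking the full-backup form mentioned above.
// Host, role, and target directory are hypothetical.
package main

import (
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("pg_basebackup",
		"-h", "mycluster", // primary to back up (hypothetical host)
		"-U", "replicator", // role with REPLICATION privilege
		"-D", "/backups/full", // empty target directory
		"-X", "stream", // stream the WAL needed for a consistent restore
		"-P", // show progress
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("pg_basebackup failed: %v\n%s", err, out)
	}
	log.Println("full backup complete")
}
```
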
That all sounds highly promising 👏!
Manual failover is coded and will land in the upcoming 2.6 release.
Auto failover (first cut) will land in operator 3.1, soon to be released.