Kill etcd process exactly before taking a final snapshot #478
Conversation
Overall the PR looks good, just a few doubts:
Testing it currently.
Yep, that is ok. The copy operation will just ignore it and copy whatever is currently available inside the backup directory after a 30-minute (by default) timeout. I think the only problem will be if
Don't need to kill it after taking the final snapshot if we've already killed it once before that.
No, this is not what I meant... I'm saying: what about if we first take the full final snapshot and then kill the etcd process (irrespective of whether the final full snapshot is successful or not)?
We must always kill it before, to make sure that the connection from the
For testing, I forgot to mention that you have to make a custom
I see, thanks. I was trying out with master and receiving these errors from the kube-apiserver:
But I wasn't sure whether this was because of the new secrets management and the fact that my shoot was pretty old. I'll try out with the versions you mentioned.
LGTM!!
[cherry-pick of #478] Kill etcd process exactly before taking a final snapshot
What this PR does / why we need it:
This PR slightly changes when the `etcd` process is killed if the owner check fails, so that the connection from the `kube-apiserver` to `etcd` is broken. Previously a corner case could occur that would leave `etcd` running, so the shoot's control plane would not get properly shut down in the source seed during control plane migration, leading to a split-brain scenario. This corner case can happen in the following scenario (but is probably not limited only to it):
- `backup-restore` will fail to create a snapshot (`snapshotter`)
- `backup-restore` detects that the owner has changed too early in this code:
  - etcd-backup-restore/pkg/server/backuprestoreserver.go, Lines 285 to 324 in eb6c516
  - etcd-backup-restore/pkg/server/backuprestoreserver.go, Lines 442 to 456 in eb6c516
- ... the `etcd` process.

This PR adds a new boolean variable `killEtcdBeforeTakingFinalSnapshot`. The idea is to set it to true once the owner check has succeeded at least once, since `etcd` only needs to be restarted in this case: the connection from the `kube-apiserver` is established only in this case. If the owner check has never succeeded, then `backup-restore` never becomes ready and there should be no need to restart `etcd`.
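The flag logic described above could be sketched roughly as follows. This is a minimal illustration in Go (the project's language), not the actual etcd-backup-restore code: the `ownerChecker` type, the `handleOwnerCheckResult` method, and the `killEtcd`/`takeFinalSnapshot` callbacks are hypothetical names invented for the example.

```go
package main

import "fmt"

// ownerChecker is a hypothetical holder for the flag described in the PR.
type ownerChecker struct {
	// Set to true once the owner check has succeeded at least once.
	// Only after a successful check can the kube-apiserver have
	// established a connection to etcd, so only then does etcd need
	// to be killed before the final snapshot.
	killEtcdBeforeTakingFinalSnapshot bool
}

// handleOwnerCheckResult reacts to one owner-check result. On failure it
// kills etcd first (if required) and only then takes the final snapshot.
func (c *ownerChecker) handleOwnerCheckResult(ownerCheckOK bool, killEtcd, takeFinalSnapshot func() error) error {
	if ownerCheckOK {
		c.killEtcdBeforeTakingFinalSnapshot = true
		return nil
	}
	if c.killEtcdBeforeTakingFinalSnapshot {
		// Break the kube-apiserver connection so the final snapshot
		// captures a state that can no longer be mutated.
		if err := killEtcd(); err != nil {
			// Do not take the final snapshot while etcd may still be running.
			return err
		}
	}
	return takeFinalSnapshot()
}

func main() {
	c := &ownerChecker{}
	kill := func() error { fmt.Println("killing etcd"); return nil }
	snapshot := func() error { fmt.Println("taking final snapshot"); return nil }

	c.handleOwnerCheckResult(true, kill, snapshot)  // owner check succeeded once
	c.handleOwnerCheckResult(false, kill, snapshot) // owner changed: kill first, then snapshot
}
```

Note how a process that never passed the owner check skips the kill step entirely, while one that passed it at least once refuses to take the final snapshot unless the kill succeeds.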
Additionally, I have made it so that the final snapshot does not get taken unless the `etcd` process is killed. But I have some doubts about that, @ishan16696 wdyt?

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
I have tested this by locally running `backup-restore` and `etcd`, and manually changing the owner DNS entry, and it seems to work fine.

Release note: