
Kill etcd process exactly before taking a final snapshot #478

Merged

Conversation

plkokanov (Contributor)

What this PR does / why we need it:
This PR slightly changes when the etcd process is killed after the owner check fails, so that the connection from the kube-apiserver to etcd is reliably broken.
Previously, a corner case could leave etcd running, so the shoot's control plane would not be properly shut down in the source seed during control plane migration, leading to a split-brain scenario.
This corner case can occur in the following scenario (but is probably not limited to it):

  1. If the backup bucket is not reachable for any reason, backup-restore fails to create a snapshot.
  2. backup-restore constantly loops, trying to start the snapshotter.
  3. Control plane migration is triggered and the owner is eventually changed during the restore phase.
  4. However, backup-restore detects that the owner has changed too early, in this code:

     ```go
     if b.ownerChecker != nil {
         // Check if the actual owner ID matches the expected one
         // If the check returns false, take a final full snapshot if needed.
         // If the check returns an error continue with normal operation
         b.logger.Debugf("Checking owner before starting snapshotter...")
         result, err := b.ownerChecker.Check(ctx)
         if err != nil {
             b.logger.Errorf("ownerChecker check fails: %v", err)
         } else if !result {
             handler.SetStatus(http.StatusServiceUnavailable)
             // If the previous full snapshot doesn't exist or is not marked as final, take a final full snapshot
             if ssr.PrevFullSnapshot == nil || !ssr.PrevFullSnapshot.IsFinal {
                 b.logger.Infof("Taking final full snapshot...")
                 var snapshot *brtypes.Snapshot
                 if snapshot, err = ssr.TakeFullSnapshotAndResetTimer(true); err != nil {
                     b.logger.Errorf("Could not take final full snapshot: %v", err)
                     continue
                 }
                 if b.config.HealthConfig.SnapshotLeaseRenewalEnabled {
                     leaseUpdatectx, cancel := context.WithTimeout(ctx, brtypes.LeaseUpdateTimeoutDuration)
                     defer cancel()
                     if err = heartbeat.FullSnapshotCaseLeaseUpdate(leaseUpdatectx, b.logger, snapshot, ssr.K8sClientset, b.config.HealthConfig.FullSnapshotLeaseName, b.config.HealthConfig.DeltaSnapshotLeaseName); err != nil {
                         b.logger.Warnf("Snapshot lease update failed : %v", err)
                     }
                 }
             }
             // Wait for the configured interval before making another attempt
             b.logger.Infof("Waiting for %s...", b.config.OwnerCheckConfig.OwnerCheckInterval.Duration)
             select {
             case <-ctx.Done():
                 b.logger.Info("Shutting down...")
                 return
             case <-time.After(b.config.OwnerCheckConfig.OwnerCheckInterval.Duration):
             }
             continue
         }
     }
     ```

     instead of here:

     ```go
     if b.ownerChecker != nil {
         // Stop owner check watchdog
         ownerCheckWatchdog.Stop()
         // If the owner check fails or returns false, kill the etcd process
         // to ensure that any open connections from kube-apiserver are terminated
         result, err := b.ownerChecker.Check(ctx)
         if err != nil {
             b.logger.Errorf("ownerChecker check fails: %v", err)
         } else if !result {
             if _, err := b.etcdProcessKiller.Kill(ctx); err != nil {
                 b.logger.Errorf("Could not kill etcd process: %v", err)
             }
         }
     }
     ```

     so it gets permanently stuck in lines L259-L289 and never shuts down the etcd process.

This PR adds a new boolean variable `killEtcdBeforeTakingFinalSnapshot`. The idea is to set it to true once the owner check has succeeded at least once, since only then does etcd need to be restarted: the connection from the kube-apiserver is only established after a successful owner check. If the owner check has never succeeded, backup-restore never becomes ready, so there should be no need to restart etcd.

Additionally, I have made it so that the final snapshot is not taken unless the etcd process is successfully killed. But I have some doubts about that; @ishan16696 wdyt? See the sketch below for the intended flow.
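For illustration, here is a minimal, self-contained sketch of the gating idea. Only the flag name `killEtcdBeforeTakingFinalSnapshot` comes from this PR; the interfaces, function names, and loop shape below are simplified assumptions, not the actual diff:

```go
package ownercheck

import (
	"context"
	"log"
	"time"
)

// Simplified stand-ins for the real backup-restore components; these interfaces
// are assumptions for illustration, not the actual types in this repository.
type ownerChecker interface {
	Check(ctx context.Context) (bool, error)
}

type etcdProcessKiller interface {
	Kill(ctx context.Context) (bool, error)
}

type finalSnapshotter interface {
	TakeFinalFullSnapshot(ctx context.Context) error
}

// runOwnerCheckLoop sketches the gating logic: etcd is killed right before the
// final snapshot, and only if the owner check has succeeded at least once,
// because only then can kube-apiserver hold an established connection to etcd.
func runOwnerCheckLoop(ctx context.Context, oc ownerChecker, killer etcdProcessKiller, snap finalSnapshotter, interval time.Duration) {
	// Set once the owner check succeeds; mirrors the PR's killEtcdBeforeTakingFinalSnapshot flag.
	killEtcdBeforeTakingFinalSnapshot := false

	for {
		result, err := oc.Check(ctx)
		switch {
		case err != nil:
			log.Printf("owner check failed: %v", err)
		case result:
			// Owner matches: etcd serves traffic, so a later owner change must first break that connection.
			killEtcdBeforeTakingFinalSnapshot = true
			// ... normal snapshotter operation would run here ...
		default:
			// Owner changed: kill etcd first (if it was ever serving), then take the final snapshot,
			// so that kube-apiserver cannot keep writing while the snapshot is being taken.
			if killEtcdBeforeTakingFinalSnapshot {
				if _, err := killer.Kill(ctx); err != nil {
					log.Printf("could not kill etcd process: %v", err)
					break // skip the final snapshot until etcd is confirmed down; retry next interval
				}
				killEtcdBeforeTakingFinalSnapshot = false
			}
			if err := snap.TakeFinalFullSnapshot(ctx); err != nil {
				log.Printf("could not take final full snapshot: %v", err)
			}
		}

		// Wait for the configured interval before making another attempt.
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}
```

The actual change in this PR may structure this differently; the point is only the ordering (kill etcd, then snapshot) and the gating on a previously successful owner check.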

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
I have tested this by running backup-restore and etcd locally and manually changing the owner DNS entry, and it seems to work fine.

Release note:

When the owner check fails, `etcd-backup-restore` will restart the `etcd` process right before attempting to take a final snapshot, if the owner check was previously successful.

@plkokanov plkokanov requested a review from a team as a code owner May 11, 2022 13:22
@gardener-robot gardener-robot added needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels May 11, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 11, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 11, 2022
@ishan16696 (Member) commented May 11, 2022

Overall the PR looks good, just a few doubts:

  1. Have you tested that seed/shoot migration scenario with this custom image of backup-restore?

  2. > If the backup bucket is not reachable for any reason, backup-restore fails to create a snapshot.

     backup-restore won't be able to take/upload the final full snapshot. Hope that's fine.

  3. What about killing the etcd process after taking the final full snapshot?

@plkokanov (Contributor, Author) commented May 11, 2022

> Have you tested that seed/shoot migration scenario with this custom image of backup-restore?

Testing it currently.

> backup-restore won't be able to take/upload the final full snapshot. Hope that's fine.

Yep, that is ok. The copy operation will just ignore it and copy whatever is currently available inside the backup directory after a 30 minute (by default) timeout. I think the only problem will be if etcd is never successfully killed - but I'm not sure if there is any way around that.

> What about killing the etcd process after taking the final full snapshot?

We don't need to kill it after taking the final snapshot if we've already killed it once before that - the kube-apiserver connection should already be interrupted and the etcd-main-0 pod will be in a not ready state, so the connection is never established again.

@ishan16696 (Member)

> We don't need to kill it after taking the final snapshot if we've already killed it once before that - the kube-apiserver connection should already be interrupted and the etcd-main-0 pod will be in a not ready state, so the connection is never established again.

No, this is not what I meant... I'm saying: what about if we first take the final full snapshot and then kill the etcd process (irrespective of whether the final full snapshot was successful or not)?
It has one advantage: backup-restore doesn't have to wait for etcd to come back up to take the final full snapshot.

@plkokanov (Contributor, Author)

We must always kill it beforehand to make sure that the connection from the kube-apiserver is interrupted and no new information is written by the kube-apiserver to etcd while the final snapshot is being taken.

@ishan16696 (Member)

For testing, I forgot to mention that you have to build a custom backup-restore image with these changes on top of v0.15.3 and use it with druid v0.8.4.
If you use an image with these changes on top of backup-restore master, it might not work, as master contains some multi-node changes and can only work with druid v0.9.0.

@plkokanov (Contributor, Author) commented May 11, 2022

> For testing, I forgot to mention that you have to build a custom backup-restore image with these changes on top of v0.15.3 and use it with druid v0.8.4.
> If you use an image with these changes on top of backup-restore master, it might not work, as master contains some multi-node changes and can only work with druid v0.9.0.

I see, thanks. I was trying it out with master and receiving these errors from the kube-apiserver:

```
0511 17:07:16.720588       1 client.go:360] parsed scheme: "endpoint"
I0511 17:07:16.720621       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd-main-client:2379  <nil> 0 <nil>}]
W0511 17:07:16.722590       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:16.729505       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:17.724649       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:18.569912       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:19.543593       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:21.431953       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp 10.223.131.175:60344->10.223.24.12:2379: read: connection reset by peer". Reconnecting...
W0511 17:07:22.601672       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:25.492959       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:26.279029       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:31.455453       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:33.297165       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
Error: context deadline exceeded
```

But I wasn't sure whether this was because of the new secrets manager and the fact that my shoot was pretty old. I'll try it out with the versions you mentioned.

@plkokanov plkokanov force-pushed the cp-migraiton/restart-etcd-on-owner-loss branch from 0ed78cd to f09c3b0 Compare May 12, 2022 06:53
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 12, 2022
@ishan16696 (Member) left a comment


LGTM!!

@ishan16696 ishan16696 merged commit acd860a into gardener:master May 12, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label May 12, 2022
ishan16696 added a commit that referenced this pull request May 12, 2022
[cherry-pick of #478] Kill etcd process exactly before taking a final snapshot