
Kill etcd process exactly before taking a final snapshot #478

Merged

Conversation

plkokanov (Contributor)

What this PR does / why we need it:
This PR slightly changes when the etcd process is killed after the owner check fails, so that the connection from the kube-apiserver to etcd is reliably broken.
Previously, a corner case could leave etcd running, so the shoot's control plane would not be properly shut down in the source seed during control plane migration, leading to a split-brain scenario.
This corner case can occur in the following scenario (but is probably not limited to it):

  1. If the backup bucket is not reachable for any reason, backup-restore fails to create a snapshot.
  2. backup-restore constantly loops, trying to start the snapshotter.
  3. Control plane migration is triggered and the owner is eventually changed during the restore phase.
  4. However, backup-restore detects that the owner has changed too early, in this code:

     ```go
     if b.ownerChecker != nil {
         // Check if the actual owner ID matches the expected one
         // If the check returns false, take a final full snapshot if needed.
         // If the check returns an error continue with normal operation
         b.logger.Debugf("Checking owner before starting snapshotter...")
         result, err := b.ownerChecker.Check(ctx)
         if err != nil {
             b.logger.Errorf("ownerChecker check fails: %v", err)
         } else if !result {
             handler.SetStatus(http.StatusServiceUnavailable)
             // If the previous full snapshot doesn't exist or is not marked as final, take a final full snapshot
             if ssr.PrevFullSnapshot == nil || !ssr.PrevFullSnapshot.IsFinal {
                 b.logger.Infof("Taking final full snapshot...")
                 var snapshot *brtypes.Snapshot
                 if snapshot, err = ssr.TakeFullSnapshotAndResetTimer(true); err != nil {
                     b.logger.Errorf("Could not take final full snapshot: %v", err)
                     continue
                 }
                 if b.config.HealthConfig.SnapshotLeaseRenewalEnabled {
                     leaseUpdatectx, cancel := context.WithTimeout(ctx, brtypes.LeaseUpdateTimeoutDuration)
                     defer cancel()
                     if err = heartbeat.FullSnapshotCaseLeaseUpdate(leaseUpdatectx, b.logger, snapshot, ssr.K8sClientset, b.config.HealthConfig.FullSnapshotLeaseName, b.config.HealthConfig.DeltaSnapshotLeaseName); err != nil {
                         b.logger.Warnf("Snapshot lease update failed : %v", err)
                     }
                 }
             }
             // Wait for the configured interval before making another attempt
             b.logger.Infof("Waiting for %s...", b.config.OwnerCheckConfig.OwnerCheckInterval.Duration)
             select {
             case <-ctx.Done():
                 b.logger.Info("Shutting down...")
                 return
             case <-time.After(b.config.OwnerCheckConfig.OwnerCheckInterval.Duration):
             }
             continue
         }
     }
     ```

     instead of here:

     ```go
     if b.ownerChecker != nil {
         // Stop owner check watchdog
         ownerCheckWatchdog.Stop()
         // If the owner check fails or returns false, kill the etcd process
         // to ensure that any open connections from kube-apiserver are terminated
         result, err := b.ownerChecker.Check(ctx)
         if err != nil {
             b.logger.Errorf("ownerChecker check fails: %v", err)
         } else if !result {
             if _, err := b.etcdProcessKiller.Kill(ctx); err != nil {
                 b.logger.Errorf("Could not kill etcd process: %v", err)
             }
         }
     }
     ```

     so it gets permanently stuck in lines L259-L289 and never shuts down the etcd process.

This PR adds a new boolean variable `killEtcdBeforeTakingFinalSnapshot`. The idea is to set it to true once the owner check has succeeded at least once, since only then does etcd need to be restarted: the connection from the kube-apiserver is only established after a successful owner check. If the owner check has never succeeded, backup-restore never becomes ready, so there should be no need to restart etcd.

Additionally, I have made it so that the final snapshot is not taken unless the etcd process is successfully killed. But I have some doubts about that; @ishan16696 wdyt? See the sketch below for the intended flow.
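For illustration, here is a minimal, self-contained sketch of the gating idea. Only the flag name `killEtcdBeforeTakingFinalSnapshot` comes from this PR; the interfaces, function names, and loop shape below are simplified assumptions, not the actual diff:

```go
package ownercheck

import (
	"context"
	"log"
	"time"
)

// Simplified stand-ins for the real backup-restore components; these interfaces
// are assumptions for illustration, not the actual types in this repository.
type ownerChecker interface {
	Check(ctx context.Context) (bool, error)
}

type etcdProcessKiller interface {
	Kill(ctx context.Context) (bool, error)
}

type finalSnapshotter interface {
	TakeFinalFullSnapshot(ctx context.Context) error
}

// runOwnerCheckLoop sketches the gating logic: etcd is killed right before the
// final snapshot, and only if the owner check has succeeded at least once,
// because only then can kube-apiserver hold an established connection to etcd.
func runOwnerCheckLoop(ctx context.Context, oc ownerChecker, killer etcdProcessKiller, snap finalSnapshotter, interval time.Duration) {
	// Set once the owner check succeeds; mirrors the PR's killEtcdBeforeTakingFinalSnapshot flag.
	killEtcdBeforeTakingFinalSnapshot := false

	for {
		result, err := oc.Check(ctx)
		switch {
		case err != nil:
			log.Printf("owner check failed: %v", err)
		case result:
			// Owner matches: etcd serves traffic, so a later owner change must first break that connection.
			killEtcdBeforeTakingFinalSnapshot = true
			// ... normal snapshotter operation would run here ...
		default:
			// Owner changed: kill etcd first (if it was ever serving), then take the final snapshot,
			// so that kube-apiserver cannot keep writing while the snapshot is being taken.
			if killEtcdBeforeTakingFinalSnapshot {
				if _, err := killer.Kill(ctx); err != nil {
					log.Printf("could not kill etcd process: %v", err)
					break // skip the final snapshot until etcd is confirmed down; retry next interval
				}
				killEtcdBeforeTakingFinalSnapshot = false
			}
			if err := snap.TakeFinalFullSnapshot(ctx); err != nil {
				log.Printf("could not take final full snapshot: %v", err)
			}
		}

		// Wait for the configured interval before making another attempt.
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}
```

The actual change in this PR may structure this differently; the point is only the ordering (kill etcd, then snapshot) and the gating on a previously successful owner check.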

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
I have tested this by running backup-restore and etcd locally and manually changing the owner DNS entry, and it seems to work fine.

Release note:

When the owner check fails, `etcd-backup-restore` will restart the `etcd` process right before attempting to take a final snapshot, if the owner check was previously successful.

@plkokanov plkokanov requested a review from a team as a code owner May 11, 2022 13:22
@gardener-robot gardener-robot added needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels May 11, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 11, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 11, 2022
@ishan16696 (Member) commented May 11, 2022

Overall the PR looks good, just a few doubts:

  1. Have you tested that seed/shoot migration scenario with this custom image of backup-restore?

  2. > If the backup bucket is not reachable for any reason, backup-restore fails to create a snapshot.

     backup-restore won't be able to take/upload the final full snapshot. Hope that's fine.

  3. What about killing the etcd process after taking the final full snapshot?

@plkokanov (Contributor, Author) commented May 11, 2022

> Have you tested that seed/shoot migration scenario with this custom image of backup-restore?

Testing it currently.

> backup-restore won't be able to take/upload the final full snapshot. Hope that's fine.

Yep, that is ok. The copy operation will just ignore it and copy whatever is currently available inside the backup directory after a 30 minute (by default) timeout. I think the only problem will be if etcd is never successfully killed - but I'm not sure if there is any way around that.

> What about killing the etcd process after taking the final full snapshot?

We don't need to kill it after taking the final snapshot if we've already killed it once before that - the kube-apiserver connection should already be interrupted and the etcd-main-0 pod will be in a not ready state, so the connection is never established again.

@ishan16696 (Member)

> We don't need to kill it after taking the final snapshot if we've already killed it once before that - the kube-apiserver connection should already be interrupted and the etcd-main-0 pod will be in a not ready state, so the connection is never established again.

No, this is not what I meant... I'm saying: what about if we first take the final full snapshot and then kill the etcd process (irrespective of whether the final full snapshot was successful or not)?
It has one advantage: backup-restore doesn't have to wait for etcd to come back up to take the final full snapshot.

@plkokanov (Contributor, Author)

We must always kill it beforehand to make sure that the connection from the kube-apiserver is interrupted and no new information is written by the kube-apiserver to etcd while the final snapshot is being taken.

@ishan16696 (Member)

For testing, I forgot to mention that you have to build a custom backup-restore image with these changes on top of v0.15.3 and use it with druid v0.8.4.
If you use an image with these changes on top of backup-restore master, it might not work, as master contains some multi-node changes and can only work with druid v0.9.0.

@plkokanov (Contributor, Author) commented May 11, 2022

> For testing, I forgot to mention that you have to build a custom backup-restore image with these changes on top of v0.15.3 and use it with druid v0.8.4.
> If you use an image with these changes on top of backup-restore master, it might not work, as master contains some multi-node changes and can only work with druid v0.9.0.

I see, thanks. I was trying it out with master and receiving these errors from the kube-apiserver:

```
0511 17:07:16.720588       1 client.go:360] parsed scheme: "endpoint"
I0511 17:07:16.720621       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd-main-client:2379  <nil> 0 <nil>}]
W0511 17:07:16.722590       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:16.729505       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:17.724649       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:18.569912       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:19.543593       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:21.431953       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp 10.223.131.175:60344->10.223.24.12:2379: read: connection reset by peer". Reconnecting...
W0511 17:07:22.601672       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:25.492959       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:26.279029       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:31.455453       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
W0511 17:07:33.297165       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://etcd-main-client:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
Error: context deadline exceeded
```

But I wasn't sure whether this was because of the new secrets manager and the fact that my shoot was pretty old. I'll try it out with the versions you mentioned.

@plkokanov plkokanov force-pushed the cp-migraiton/restart-etcd-on-owner-loss branch from 0ed78cd to f09c3b0 Compare May 12, 2022 06:53
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 12, 2022
@ishan16696 (Member) left a comment


LGTM!!

@ishan16696 ishan16696 merged commit acd860a into gardener:master May 12, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label May 12, 2022
ishan16696 added a commit that referenced this pull request May 12, 2022
[cherry-pick of #478] Kill etcd process exactly before taking a final snapshot