[Feature] To discuss the impact of PR #383(Add owner checks and taking of final snapshots) on Multi-Node Etcd #242

Closed
Tracked by #107
ishan16696 opened this issue Oct 18, 2021 · 26 comments
Labels
kind/discussion Discussion (engaging others in deciding about multiple options) lifecycle/stale Nobody worked on this for 6 months (will further age) release/ga Planned for GA (General Availability) release of the Feature status/closed Issue is closed (either delivered or triaged)

Comments

@ishan16696
Member

ishan16696 commented Oct 18, 2021

Feature (What you would like to be added):

  1. Right now the backup-leader has the responsibility to take snapshots (full and delta) and to trigger defragmentation. Do we also want the backup-leader to have the responsibility to kill/disconnect the etcd cluster members?
  2. What changes do we have to make in the workflow of the backup-leader?
  3. Right now, if quorum is lost, etcd-druid takes care of the non-quorate cluster and tries to bring back the quorum.
    How will etcd-druid react to this new change in the workflow of the backup-leader?

Motivation (Why is this needed?):
PR #383 (Add owner checks and taking of final snapshots) wants to disconnect the kube-apiserver from etcd; for that it kills the etcd process and fails the readiness probe. Now, in multi-node etcd, do we want to kill/disconnect all etcd members?
If we go with a change where the backup-leader kills/disconnects the api-server and the etcd cluster loses its quorum, then we also have to take care of etcd-druid's reaction to the quorum loss.

cc @stoyanr
Approach/Hint to the implement solution (optional):

@ishan16696 ishan16696 added the kind/enhancement Enhancement, improvement, extension label Oct 18, 2021
@timuthy timuthy mentioned this issue Oct 19, 2021
34 tasks
@timuthy
Member

timuthy commented Nov 8, 2021

In a test with @ishan16696, @abdasgupta, @aaronfern and @stoyanr we simulated the impact of doing owner checks on every member. This will be necessary in order to cut off active client connections to each member by killing its etcd process once.

In a multi-node setup killing the etcd process (process is started again afterwards) will have the following impact:

  • Leader is killed: A new leader is elected, the member will continue running as a follower.
  • Follower is killed: Member will continue as a follower.
  • A final full snapshot is only taken when a member becomes a leader && owner check is failing.
  • Read/write requests are accepted as long as the Kubernetes service has one Ready pod (no matter if leader or follower).

--> Ideally, the final full snapshot is only taken after the owner check of the last member fails.

Unfortunately, this is not guaranteed at all times.

Example:

T1
------------------
member1:Leader
member2:Follower
member3:Follower
------------------


T2 (etcd process of member1 is killed)
------------------
member1:Leader    (owner check fails)
member2:Follower  (owner check succeeds) 
member3:Follower  (owner check succeeds) 
------------------


T3 (member2 becomes leader, member1 runs as follower)
------------------
member1:Follower  (owner check fails)     
member2:Leader    (owner check succeeds) 
member3:Follower  (owner check succeeds) 
------------------


T4 (etcd process of member2 is killed)
------------------
member1:Follower  (owner check fails)     
member2:Leader    (owner check fails)
member3:Follower  (owner check succeeds) 
------------------


T5 (member1 becomes leader + final full snapshot is taken, member2 runs as follower)
------------------
member1:Leader    (owner check fails + final full snapshot)
member2:Follower  (owner check fails) 
member3:Follower  (owner check succeeds) --> etcd cluster is still accessible through this member!
------------------


T6 (etcd cluster is not accessible anymore)
------------------
member1:Leader    (owner check fails)   
member2:Follower  (owner check fails)
member3:Follower  (owner check fails) 
------------------

The given example results in a loss of any data that is written between T5 and T6.

@ishan16696
Member Author

The given example results in a loss of any data that is written between T5 and T6.

Yeah, but I think the upper bound on data loss between T5 and T6 will be around ~5 mins, and I think that's acceptable, as we currently have the deltaSnapshotPeriod scheduled for every 5 mins, so in the worst case we can lose ~5 mins of data there as well.

@timuthy
Member

timuthy commented Nov 8, 2021

The given example results in a loss of any data that is written between T5 and T6.

Yeah, but I think the upper bound on data loss between T5 and T6 will be around ~5 mins, and I think that's acceptable, as we currently have the deltaSnapshotPeriod scheduled for every 5 mins, so in the worst case we can lose ~5 mins of data there as well.

Thanks for bringing this up. IIUC, the Snapshotter is stopped as long as the owner check is failing, i.e. there will be no delta snapshots taken. Can you double check this?

@ishan16696
Member Author

Yes, I have double-checked ... this block of code will not let the snapshotter loop start if the owner check fails.
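For illustration, here is a minimal sketch of such a gate (hypothetical names and interval, not the actual etcd-backup-restore code): the snapshotter loop is only (re)started while the owner check passes, so no full or delta snapshots are taken once it fails.

package snapshotter

import (
	"context"
	"time"
)

// Snapshotter is a stand-in for the real snapshotter; only Run matters here.
type Snapshotter interface {
	Run(ctx context.Context) // blocks until it is stopped
}

// runWithOwnerCheck only (re)starts the snapshotter loop while the owner
// check succeeds, so no full/delta snapshots are taken once it fails.
func runWithOwnerCheck(ctx context.Context, ownerCheck func(context.Context) bool, s Snapshotter) {
	const recheckInterval = 30 * time.Second // hypothetical re-check interval
	for ctx.Err() == nil {
		if !ownerCheck(ctx) {
			time.Sleep(recheckInterval)
			continue
		}
		s.Run(ctx) // returns e.g. when a later owner-check failure stops it
	}
}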

The only thing we need to discuss is how to cut off the client traffic in such a way that etcd peer communication isn't affected, so that there is no quorum loss.

@ishan16696
Member Author

ishan16696 commented Nov 8, 2021

And AFAIK configuring a ReadinessProbe in a master-slave architecture in Kubernetes doesn't seem to be a good idea, as we will provision multi-node etcd as a StatefulSet.

@timuthy
Member

timuthy commented Nov 9, 2021

The only thing we need to discuss is how to cut off the client traffic in such a way that etcd peer communication isn't affected, so that there is no quorum loss.

So your proposal is to cut-off traffic as soon as one member has a failing owner check and thus there will be no changes any more?

And AFAIK configuring a ReadinessProbe in a master-slave architecture in Kubernetes doesn't seem to be a good idea, as we will provision multi-node etcd as a StatefulSet.

If you want to drop the ReadinessProbe, how are we supposed to place a service in front of the etcd cluster?

@timuthy timuthy added kind/discussion Discussion (engaging others in deciding about multiple options) and removed kind/enhancement Enhancement, improvement, extension labels Nov 9, 2021
@ishan16696
Member Author

So your proposal is to cut-off traffic as soon as one member has a failing owner check and thus there will be no changes any more?

Yes, I think it would suffice. Let's take this scenario under some assumptions.
Assumptions:

  1. Each etcd member is also running an OwnerCheck watchdog and is able to kill its corresponding etcd process if it detects that the owner check has failed.
  2. etcd peer communication is possible even after any etcd cluster member has cut off the client traffic.
  3. Snapshots can be uploaded even after any etcd cluster member has cut off the client ingress traffic, as we need to take and upload the final full snapshot when the owner check fails.
T1
------------------
etcd-0: Leader
etcd-1: Follower
etcd-2: Follower
------------------


T2  [etcd process of etcd-1 is killed ---> it will cut off client ingress traffic]
------------------
etcd-0: Leader
etcd-1: Follower  (Owner check fails)
etcd-2: Follower
------------------


T3 [etcd process of etcd-0 is killed --> leads to leader election]
------------------
etcd-0: Leader    (Owner check fails)
etcd-1: Follower  (Owner check fails)
etcd-2: Follower
------------------


T4(1) [etcd-1 becomes leader --> takes final full snapshot]
------------------
etcd-0: Follower  (Owner check fails)
etcd-1: Leader    (Owner check fails)
etcd-2: Follower 
------------------

T4(2) [etcd-2 becomes leader --> eventually its owner check will fail --> etcd process of etcd-2 will be killed --> some other member will become leader --> takes the final full snapshot]
------------------
etcd-0: Follower  (Owner check fails)
etcd-1: Follower  (Owner check fails)
etcd-2: Leader 
------------------

@ishan16696
Member Author

ishan16696 commented Nov 9, 2021

There is also another way to do it: we can use the MoveLeader API call to transfer the leadership to the etcd member which first detects that the owner check has failed. The advantage of this method is that we don't have to worry about any scenarios like the above.
So it will work like this:

if (owner check fails && current etcd == Follower) {
      etcd will be killed and restarted
      then transfer the leadership using the MoveLeader API call
      take the final full snapshot
      cut off the client traffic
} else if (owner check fails && current etcd == Leader) {
      etcd will be killed and restarted --> leads to leader election
      then transfer the leadership using the MoveLeader API call
      take the final full snapshot
      cut off the client traffic
}
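For illustration, a minimal sketch of the leadership transfer with the etcd Go client (the endpoint and the member ID of the detecting member are placeholders; taking the snapshot and cutting off traffic are left out):

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-main-client:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Hypothetical: ID of the member that first detected the owner-check failure.
	var detectorID uint64 = 0x1234

	// MoveLeader asks the current leader to hand leadership over to detectorID.
	// The request must reach the leader; otherwise etcd returns an error.
	if _, err := cli.MoveLeader(ctx, detectorID); err != nil {
		log.Fatalf("leadership transfer failed: %v", err)
	}
	// The detecting member is now the leader and can take the final full
	// snapshot and cut off the client traffic, as described above.
}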

@ishan16696
Member Author

ishan16696 commented Nov 10, 2021

I tried to capture the above proposal of using the MoveLeader API call, described here in #242 (comment), in a picture for better understanding:
[image: OwnerCheckProposal]

@timuthy
Member

timuthy commented Nov 10, 2021

Thanks for the proposal @ishan16696.

Let me address two doubts:

  1. The entire client connection is cut-off as soon as one member has a failing owner check, i.e. we cannot handle temporary network issues of a single zone. I'm not sure if that's what we want.
  2. If 1. was negligible, can we really cut-off the client by solely changing the pod selector? What about existing connections that might still exist to other members which did not kill their etcd process yet?

@ishan16696
Member Author

can we really cut-off the client by solely changing the pod selector?

IMO yes, it is possible ... the thing is, when we change the pod selector in the etcd service, we will be able to cut off the ingress client traffic (refer here) ... I had also already tried to simulate this behaviour (although I used a k8s LoadBalancer service to cut off the client requests).
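For illustration, a minimal sketch with client-go of how such a selector change could look (namespace, service name and label key are hypothetical; strategic-merge-patching an extra selector key that no pod carries makes the Service select nothing, so its endpoints are removed):

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cutOffClientTraffic patches the etcd client Service so that its selector
// additionally requires a label no pod carries; the Service then selects no
// pods and ingress client traffic is cut off (assuming peer communication
// goes through a different service and stays unaffected).
func cutOffClientTraffic(ctx context.Context, cs kubernetes.Interface, namespace, serviceName string) error {
	patch := []byte(`{"spec":{"selector":{"gardener.cloud/owner-check":"failed"}}}`) // hypothetical label
	_, err := cs.CoreV1().Services(namespace).Patch(ctx, serviceName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Hypothetical namespace and client service name.
	if err := cutOffClientTraffic(context.Background(), cs, "shoot--foo--bar", "etcd-main-client"); err != nil {
		log.Fatal(err)
	}
}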

What about existing connections that might still exist to other members which did not kill their etcd process yet?

They will eventually kill their corresponding etcd member when they detect that the owner check fails, as the OwnerCheck watchdog is running on each backup follower and leader.

@timuthy
Member

timuthy commented Nov 10, 2021

  1. The entire client connection is cut-off as soon as one member has a failing owner check, i.e. we cannot handle temporary network issues of a single zone. I'm not sure if that's what we want.

⬆️ What about this point? Doing multi-node etcd is one of the reasons to survive problems in one AZ.

@ishan16696
Member Author

What about this point? Doing multi-node etcd is one of the reasons to survive problems in one AZ.

I think a similar question was also raised by @abdasgupta ... I forgot what the counter-argument was, @abdasgupta do you remember?

@ishan16696
Member Author

What about this point? Doing multi-node etcd is one of the reasons to survive problems in one AZ.

AFAIK an owner check failure is not a very frequent scenario, it will happen rarely, so it is not like we completely lose the advantage of having a multi-node etcd ... it's more that we lose this advantage in some scenarios, i.e. when the owner check fails.

cc @stoyanr

@stoyanr
Contributor

stoyanr commented Nov 17, 2021

We have the following 2 cases:

  1. Owner check fails because the owner DNS record actually changed (or was deleted), which should only happen during the "bad case" scenario of a control plane migration.
  2. Owner check fails because DNS resolution failed - either something is wrong with DNS itself, or there is some other network issue preventing DNS resolution.

Note that from the perspective of the source cluster, it's not possible to make a clear distinction between the 2 cases. The DNS resolution can fail and the control plane migration can happen at the same time, so the source cluster must be disabled when the owner check fails, no matter the actual error. This was discussed a few times and is clearly pointed out in GEP-17 as well. I don't think it would be possible to fall back from this point without compromising the entire "bad case" design idea (to which we agreed after several phases of PoC and a lot of discussions).

From the discussion so far I believe the first case is covered. In the second case, the interesting question is what the behavior should be if the DNS resolution only fails in a single zone. If it fails in all (or more than one) zones then the cluster should be disabled, and it's probably not really usable in this case anyway.

Ideally, if the DNS resolution fails in a single zone, we should be able to contain the failure to that zone only. Let's discuss in a meeting if this would be possible and how to achieve it.

@vlerenc
Member

vlerenc commented Nov 17, 2021

Do we also want backup-leader to have a responsibility kill/disconnect the etcd cluster members?

@ishan16696 I wouldn't think so. The leaders can change/aren't stable. Why aren't we letting the druid do it?

Is that some "shortcut", so that the final snapshot is taken? But that can be instrumented by the druid also or remain implicit, but instead of performing an owner check, be based on checking whether traffic was cut off, i.e. druid cuts of traffic, backup sidecar notices that via the changed service and takes a full snapshot, because it is obvious/in a sense logical to now cease operations and close shop, which equals to one last final full snapshot.

Yeah, but I think the upper bound on data loss between T5 and T6 will be around ~5 mins, and I think that's acceptable, as we currently have the deltaSnapshotPeriod scheduled for every 5 mins, so in the worst case we can lose ~5 mins of data there as well.

Uh, what? Why do we consider data loss acceptable (even 5 minutes) without proving it is unavoidable?

In other words, unless there is no technical solution or no practical one with acceptable effort (considering the severity of "data loss", we will usually go to greater lengths than for any other feature in Gardener), data loss is not acceptable.

We first need to be certain that it cannot be helped, but just like that (without a very detailed explanation), data loss is not acceptable, I would think.

So your proposal is to cut-off traffic as soon as one member has a failing owner check and thus there will be no changes any more?

Terminology question: When you say "the check fails" you do not mean that the check itself fails, but that the check shows that this ETCD cluster is no longer responsible for that cluster, right? Because if the check merely fails, this should not have consequences and should not over-eagerly cut off traffic.

But the main point is, that I do not see why the individual ETCD instances should have all their own checks and not the druid be in control. The druid is the master orchestrator here, including cutting off traffic, I would think?

The entire client connection is cut-off as soon as one member has a failing owner check, i.e. we cannot handle temporary network issues of a single zone. I'm not sure if that's what we want.

Yes @timuthy. See above. Absolutely not. If a zone is segregated and the check fails (not: check shows this ETCD lost responsibility for that shoot), whether the ETCD instance can then reach the control plane or not, it shall NEVER cut off traffic.

I understand that there may be cases where DNS is malfunctioning (though calls will probably time out or deliver stale records, which is likely and then even more worrying to cut off traffic, see e.g. our own CoreDNS issues in the past and present) AND the ETCD cluster has lost ownership, but that is a corner case of a corner case of a corner case, and the risk is much, much smaller to NOT cut off client traffic than to cut it off too often and render control planes broken, as we have seen in the past when we over-eagerly (e.g. because backups failed) shut down our ETCD.

In that sense, https://github.com/gardener/gardener/blob/master/docs/proposals/17-shoot-control-plane-migration-bad-case.md#handling-inability-to-resolve-the-owner-dns-record is pretty strong and I thought (but again didn't read the GEP in detail) we agreed to ONLY cut off traffic if it is clear that ownership is lost (not failed) @stoyanr? If druid happens to be in the "broken" zone, it will/should probably fail its own readiness check as well and then come up (eventually) on another ready node in a healthy zone, and if the ETCD cluster has then indeed (for real) lost ownership, it can cut off the traffic.

@ishan16696
Member Author

ishan16696 commented Nov 17, 2021

To still retain the advantage of multi-node etcd of surviving a failure in one zone or any other DNS malfunction:
As discussed in a meeting, we can use the quorum in such a way that only when there is a consensus that the owner check has failed do we completely cut off the client traffic.

Let's take this scenario under some assumptions.
Assumptions:

  1. Each etcd member is also running an OwnerCheck watchdog and is able to kill its corresponding etcd process if it detects that the owner check has failed.
  2. etcd peer communication is possible even after any etcd cluster member has cut off the client traffic.
T1
------------------
etcd-0: Leader
etcd-1: Follower
etcd-2: Follower
------------------


T2  [etcd process of etcd-1 is killed ---> it will let the other members know that `it has detected that the owner check has failed`]
------------------
etcd-0: Leader
etcd-1: Follower  (Owner check fails)
etcd-2: Follower
------------------


Consensus: the owner check hasn't failed yet, as the other 2 etcd members haven't detected an owner check failure.
etcd is still able to serve the incoming traffic.

@timuthy
Member

timuthy commented Nov 17, 2021

To complement the previous comment about how a consensus for owner check results can be implemented (a rough sketch follows after the list):

  1. Each member maintains the result of the owner check in the etcd cluster, e.g.:
etcdctl put /meta/owner-check-control/{Member_Name}/last-successful-check "2021-11-16T07:43:19Z"

(we can also work with etcd leases for further simplification)

  2. Having (n/2) + 1 members with an outdated last-successful-check timestamp means that all members consent to a failing owner check.

  3. With (2.) being true, members start to return failures for the /healthz endpoint and kill the etcd process to cut off client traffic.

  4. Each member reports the success of step (3.) by adding a key-value pair:

etcdctl put /meta/owner-check-control/{Member_Name}/disconnect true

  5. Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.
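For illustration, a minimal sketch of how such a quorum evaluation could look with the etcd Go client, assuming the key layout above (endpoint, member names, cluster size and staleness threshold are placeholders, not the actual etcd-backup-restore implementation):

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const checkPrefix = "/meta/owner-check-control/"

// reportOwnerCheck records the last successful owner check for this member.
func reportOwnerCheck(ctx context.Context, cli *clientv3.Client, member string, t time.Time) error {
	_, err := cli.Put(ctx, checkPrefix+member+"/last-successful-check", t.UTC().Format(time.RFC3339))
	return err
}

// ownerCheckFailedByQuorum returns true once at least (n/2)+1 members have a
// missing or outdated last-successful-check timestamp (older than maxAge).
func ownerCheckFailedByQuorum(ctx context.Context, cli *clientv3.Client, clusterSize int, maxAge time.Duration) (bool, error) {
	resp, err := cli.Get(ctx, checkPrefix, clientv3.WithPrefix())
	if err != nil {
		return false, err
	}
	fresh := 0
	for _, kv := range resp.Kvs {
		if !strings.HasSuffix(string(kv.Key), "/last-successful-check") {
			continue // ignore e.g. the disconnect keys
		}
		if ts, err := time.Parse(time.RFC3339, string(kv.Value)); err == nil && time.Since(ts) < maxAge {
			fresh++
		}
	}
	outdated := clusterSize - fresh
	return outdated >= clusterSize/2+1, nil
}

func main() {
	// Hypothetical wiring; the real values would come from the configuration.
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"http://etcd-main-client:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	_ = reportOwnerCheck(ctx, cli, "etcd-main-0", time.Now())
	failed, _ := ownerCheckFailedByQuorum(ctx, cli, 3, 5*time.Minute)
	fmt.Println("owner check failed by quorum:", failed)
}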

@ishan16696
Member Author

etcdctl put /meta/owner-check-control/{Member_Name}/last-successful-check "2021-11-16T07:43:19Z"

What about using the memberID instead of the member name?

  1. Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.

I have one concern regarding checking that all members report disconnect: suppose one etcd follower gets disconnected from the cluster due to a network partition, then we end up waiting forever for the final full snapshot.
I prefer that here we also check the quorum for disconnect: true. WDYT?

members start to return failures for the /healthz endpoint and kill the etcd process to cut off client traffic.

I'm under the impression that we are not going to use the /healthz endpoint as the ReadinessProbe in multi-node etcd. How do we cut off the client traffic for each etcd member separately?

@timuthy
Member

timuthy commented Nov 17, 2021

etcdctl put /meta/owner-check-control/{Member_Name}/last-successful-check "2021-11-16T07:43:19Z"

What about using the memberID instead of the member name?

No clear preference, this was only an example. Please keep in mind that garbage collection will be more complex with memberID.

  1. Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.

I have one concern regarding checking that all members report disconnect: suppose one etcd follower gets disconnected from the cluster due to a network partition, then we end up waiting forever for the final full snapshot. I prefer that here we also check the quorum for disconnect: true. WDYT?

I'd rather use a combination of quorum && timeouts because a sole quorum can result in taking the final snapshot prematurely.
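To make the "quorum && timeouts" idea concrete, a minimal sketch (hypothetical member names, cluster size and grace period): the final snapshot is only triggered once a quorum of members has reported disconnect AND either all members have reported or a grace period for stragglers has elapsed.

package main

import (
	"fmt"
	"time"
)

// finalSnapshotDue combines the two safeguards: a quorum of disconnect
// reports and a grace period for stragglers (or all members reported).
// reports maps member name -> time it reported disconnect: true.
func finalSnapshotDue(reports map[string]time.Time, clusterSize int, grace time.Duration, now time.Time) bool {
	if len(reports) >= clusterSize { // everyone disconnected: safe to snapshot
		return true
	}
	if len(reports) < clusterSize/2+1 { // no quorum yet: too early
		return false
	}
	// Quorum reached: wait out the grace period, measured from the first
	// report, before giving up on the remaining members.
	first := now
	for _, t := range reports {
		if t.Before(first) {
			first = t
		}
	}
	return now.Sub(first) >= grace
}

func main() {
	now := time.Now()
	reports := map[string]time.Time{ // hypothetical member names
		"etcd-main-0": now.Add(-10 * time.Minute),
		"etcd-main-1": now.Add(-9 * time.Minute),
	}
	fmt.Println(finalSnapshotDue(reports, 3, 5*time.Minute, now)) // true: quorum + grace elapsed
}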

members start to return failures for the /healthz endpoint and kill the etcd process to cut off client traffic.

I'm under the impression that we are not going to use the /healthz endpoint as the ReadinessProbe in multi-node etcd. How do we cut off the client traffic for each etcd member separately?

At the moment I don't see any other proposed solution, so I used what is already there, which is /healthz. If we agree not to use /healthz but something else instead, then we also have to decide how to cut off the client traffic with that alternative.

@vlerenc
Member

vlerenc commented Nov 17, 2021

Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.

@timuthy That is what I would rather leave to the quorum/majority than to all, as discussed in the meeting and as @ishan16696 also pointed out.

I'd rather use a combination of quorum && timeouts because a sole quorum can result in taking the final snapshot prematurely.

@timuthy Maybe you meant it that way, but what @ishan16696 and I wondered about is the "all", whereas above you write "quorum". Quorum sounds good; timeouts may help even more (as discussed in the call, but now in combination with the quorum to safeguard it even more).

The general T5/T6 problem from above is already averted by the quorum trick. So there is now an even smaller risk (must think). The original T5 case was anyway a bit "constructed", because if 2/3 pods see lost or failed ownership, it is unlikely that 1/3 pods will still see ownership. That would mean that 2/3 pods failed to get the DNS record, but that it was still unchanged and the ownership remained with the seed (or maybe a TTL issue, which is a special form of failure, so quite "constructed" I would think).

A detail question while trying to play through the different cases: I think @stoyanr explained that in the case of ownership loss/failure, the readiness probe returns 503 and then the etcd process is terminated to terminate all existing connections. Since the Kubelet takes time to detect the failed readiness probe and report it back, and then KCM to update the endpoints, how is it ensured that the endpoint is first removed and doesn't come back? If you restart the etcd process too fast, the endpoint is still up. While you can make some assumptions about the Kubelet, KCM is even more unpredictable. Removing the endpoint yourself runs the danger of racing with KCM. What is the trick here?

@vlerenc
Member

vlerenc commented Nov 17, 2021

@stoyanr replied out-of-band that the delays add up to the time the kubelet/KCM will usually take to remove the endpoint:

  • We kill the etcd process
  • After at most periodSeconds (5 seconds), k8s detects that the container liveness probe fails, so it restarts the etcd container after trying failureThreshold times (defaults to 3), so the total delay is between 10 and 15 seconds
  • This causes the etcd process to effectively get started again, which also takes a few seconds before it's actually able to receive connections
  • Meanwhile ~20 seconds have elapsed so the endpoint is already gone

And...

We can probably play with this configuration a bit, e.g. increase the failureThreshold to 5, to make it even more likely to avoid having the etcd process start before the endpoint is gone. Beyond that, I am not sure what could be done to ensure that this is always the case.

Maybe that can be further safe-guarded, but considering the delays above, it’s probably not worth it. E.g., the moment we will let the readiness probe fail, we know the Kubelet will report it whenever it sees it next time (there is still a tiny uncertainty if it invoked the readiness probe that succeeded and has not yet reported it back). Knowing that the Kubelet will report false next time, one can “already” pro-actively update the pod status ready condition to false and also the endpoint as well. That’s it. If the Kubelet reports false, it’s already false and if KCM checks, the endpoint is already gone. As said, there is this time window of uncertainty where we do not know for sure whether the Kubelet will race with us if it has just called the readiness probe successfully and has not yet reported it back.
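As a back-of-the-envelope illustration of the timing above, using the values mentioned (periodSeconds=5, failureThreshold=3, optionally raised to 5; these are tunable, not prescribed):

package main

import "fmt"

func main() {
	// The kubelet needs up to failureThreshold consecutive failed probes,
	// each periodSeconds apart, before it restarts the container and the
	// pod is reported not ready.
	periodSeconds := 5
	failureThreshold := 3
	fmt.Printf("restart/endpoint removal after roughly %d-%d seconds\n",
		(failureThreshold-1)*periodSeconds, failureThreshold*periodSeconds) // 10-15s

	// Increasing failureThreshold to 5 widens that window and makes it more
	// likely the endpoint is gone before the etcd process is back up.
	failureThreshold = 5
	fmt.Printf("with failureThreshold=5: roughly %d-%d seconds\n",
		(failureThreshold-1)*periodSeconds, failureThreshold*periodSeconds) // 20-25s
}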

@vlerenc
Member

vlerenc commented Nov 17, 2021

A few cases (time progresses vertically), e.g.: Zone outage/network partitioned -> leader election

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Outage resp.   |  |                            |                            |
|  | Network Partitioned  |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Heartbeats not sent        |                            |                            |
| Reporting staleness        | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Restored resp. |  |                            |                            |
|  |   Network Rejoined   |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Rejoin as Follower         |                            |                            |

Case: Loss of ownership -> traffic cut-off and final full snapshot backup

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|                     /------------------------------------------\                     |
|                     | Loss of Ownership / DNS Record Rewritten |                     |
|                     \------------------------------------------/                     |
|                            |                            |                            |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   | Ownership change detected  |                            |
| Read peer state (noquorum) | Write "I saw owner loss"   | Ownership change detected  |
| No action taken            | Read peer state (quorum)   | Write "I saw owner loss"   |
|                            | Terminate ETCD process     | Read peer state (quorum)   |
|                            | Fail readiness probe       | Terminate ETCD process     |
|                            | Rejoin as Follower         | Fail readiness probe       |
|                            | No Leader -> No backup     | Rejoin as Follower         |
|                            |                            | No Leader -> No backup     |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   |                            |                            |
| Read peer state (quorum)   |                            |                            |
| Terminate ETCD process     |                            |                            |
| Heartbeats not sent        |                            |                            |
|                            | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
| Fail readiness probe       | Leader -> Full backup      | No Leader -> No backup     |
| Rejoin as Follower         |                            |                            |
| No Leader -> No backup     |                            |                            |

Case: Zone outage/network partitioned -> leader election &
Loss of ownership -> traffic cut-off and final full snapshot backup

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Outage resp.   |  |                            |                            |
|  | Network Partitioned  |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Heartbeats not sent        |                            |                            |
| Reporting staleness        | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
|                            |                            |                            |
|                     /------------------------------------------\                     |
|                     | Loss of Ownership / DNS Record Rewritten |                     |
|                     \------------------------------------------/                     |
|                            |                            |                            |
| Ownership change failed    |                            |                            |
| Write "I saw..." failed    | Ownership change detected  |                            |
| No action taken            | Write "I saw owner loss"   | Ownership change detected  |
|                            | Read peer state (noquorum) | Write "I saw owner loss"   |
|                            | No action taken            | Read peer state (quorum)   |
|                            |                            | Terminate ETCD process     |
|                            |                            | Heartbeats not sent        |
|                            | Leader Election            |                            |
|                            | Leader Election            |                            |
|                            | Leader Election            | Fail readiness probe       |
|                            | Leader Election            | Leader Election            |
|                            | Staying Leader             | Staying Follower           |
|                            | Ownership change detected  | No Leader -> No backup     |
|                            | Write "I saw owner loss"   |                            |
|                            | Read peer state (quorum)   |                            |
|                            | Terminate ETCD process     |                            |
|                            | Heartbeats not sent        |                            |
|                            |                            | Leader Election            |
|                            |                            | Leader Election            |
|                            | Fail readiness probe       | Leader Election            |
|                            | Leader Election            | Leader Election            |
|                            | Staying Leader             | Staying Follower           |
|                            | Leader -> Full backup      | No Leader -> No backup     |
...or...
|                            | Leader Election            | Leader Election            |
|                            | Becoming Follower          | Becoming Leader            |
|                            | No Leader -> No backup     | Leader -> Full backup      |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Restored resp. |  |                            |                            |
|  |   Network Rejoined   |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Rejoin as Follower         |                            |                            |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   |                            |                            |
| Read peer state (quorum)   |                            |                            |
| Terminate ETCD process     |                            |                            |
| Fail readiness probe       |                            |                            |
| Rejoin as Follower         |                            |                            |
| No Leader -> No backup     |                            |                            |

There are many more (corner cases), but when I think of some (similar to the T5/T6 case that should no longer be possible?), e.g. the leader loses leadership while in full backup, a follower that gets elected to become the next leader should check the owner loss quorum before opening up for traffic (pass or fail the readiness probe from then on depending on the result) and that would prevent intermediate etcd updates, right?

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|                     /------------------------------------------\                     |
|                     | Loss of Ownership / DNS Record Rewritten |                     |
|                     \------------------------------------------/                     |
|                            |                            |                            |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   | Ownership change detected  |                            |
| Read peer state (noquorum) | Write "I saw owner loss"   | Ownership change detected  |
| No action taken            | Read peer state (quorum)   | Write "I saw owner loss"   |
|                            | Terminate ETCD process     | Read peer state (quorum)   |
|                            | Fail readiness probe       | Terminate ETCD process     |
|                            | Rejoin as Follower         | Fail readiness probe       |
|                            | No Leader -> No backup     | Rejoin as Follower         |
|                            |                            | No Leader -> No backup     |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   |                            |                            |
| Read peer state (quorum)   |                            |                            |
| Terminate ETCD process     |                            |                            |
| Heartbeats not sent        |                            |                            |
|                            | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
| Fail readiness probe       | Leader -> Full backup      | No Leader -> No backup     |
| Rejoin as Follower         |                            |                            |
| No Leader -> No backup     |                            |                            |
|                            |                            |                            |
|                            |  /----------------------\  |                            |
|                            |  |  Zone Outage resp.   |  |                            |
|                            |  | Network Partitioned  |  |                            |
|                            |  \----------------------/  |                            |
|                            |                            |                            |
|                            | Full backup interrupted    |                            |
|                            | Heartbeats not sent        |                            |
| Leader Election            | Reporting staleness        | Leader Election            |
| Staying Follower           |                            | Becoming Leader            |
| No Leader -> No backup     |                            | Leader -> Full backup      |
...or...
| Leader Election            | Reporting staleness        | Leader Election            |
| Becoming Leader            |                            | Staying Follower           |
| Leader -> Full backup      |                            | No Leader -> No backup     |

Something like that?

@timuthy
Member

timuthy commented Nov 18, 2021

The general T5/T6 problem from above is already averted by the quorum trick. So there is now an even smaller risk (must think). The original T5 case was anyway a bit "constructed", because if 2/3 pods see lost or failed ownership, it is unlikely that 1/3 pods will see ownership.

The constructed case was less about whether 1/3 will keep seeing ownership, but rather about when it will start seeing a lost ownership, because of the deviating owner check intervals and involved TTLs. @stoyanr then suggested to have a second safeguard, which the /meta/owner-check-control/{Member_Name}/disconnect keys are used for in my example.

Do I get it correctly, that with your cases above you don't see the necessity to have this second safeguard?

@vlerenc
Member

vlerenc commented Nov 18, 2021

The constructed case was less about whether 1/3 will keep seeing ownership, but rather about when it will start seeing a lost ownership, because of the deviating owner check intervals and involved TTLs.

Yes, I understood, but if two pods already see lost ownership, the third one, checking even later, should see it as well. The chance that it doesn't is very small, is it not? Maybe a TTL issue because the record was fetched right before it was switched, something like that. Anyway, I didn't say it is impossible. And, with the quorum, it doesn't matter anymore.

Do I get it correctly, that with your cases above you don't see the necessity to have this second safeguard?

No, that shouldn't imply it. I don't know how exactly it is implemented, the termination, the readiness probe, etc. So, I was vague because of lack of knowledge, but maybe we can fill in the details together and make it safer?

The above has helped me see some things more clearly (like termination will of course cause loss of leadership or leader election), but when that happens, the sections where I write "Terminate ETCD process"/"Fail readiness probe", etc. are too vague for me. It really depends now on the details, but as said yesterday (not helpful as that's not a concrete statement), I don't think it's far now anymore (like also Stoyan said). It's basically a quorum-based "distributed transaction". The pattern is known, the details must be clarified now. What happens chronologically when, storing data, terminating ETCD, checking that data after restart and before ever passing (or not passing) a readiness probe, etc. This information I didn't have to make it more concrete.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 18, 2022
@ashwani2k ashwani2k added the release/ga Planned for GA(General Availability) release of the Feature label Jul 6, 2022
@ishan16696
Member Author

ishan16696 commented Aug 10, 2022

As discussed, we have decided to turn off the owner check in the case of multi-node etcd, as owner checks introduce complexities which are difficult to manage in a multi-node etcd, and with HA control planes the "bad case" control plane migration would be triggered very rarely.
I'm closing this issue in favour of this PR: gardener/gardener#6412, which disables the owner checks for multi-node etcd.
For more details on the "bad case" control plane migration, please refer to the umbrella issue ☂️ gardener/gardener#6302
/cc @plkokanov @timuthy

/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 10, 2022