[Feature] To discuss the impact of PR #383(Add owner checks and taking of final snapshots) on Multi-Node Etcd #242

Closed
Tracked by #107
ishan16696 opened this issue Oct 18, 2021 · 26 comments
Labels
kind/discussion Discussion (engaging others in deciding about multiple options) lifecycle/stale Nobody worked on this for 6 months (will further age) release/ga Planned for GA (General Availability) release of the Feature status/closed Issue is closed (either delivered or triaged)

Comments

@ishan16696
Member

ishan16696 commented Oct 18, 2021

Feature (What you would like to be added):

  1. Right now the backup-leader has the responsibility to take snapshots (full and delta) and to trigger defragmentation. Do we also want the backup-leader to have the responsibility to kill/disconnect the etcd cluster members?
  2. What changes do we have to make in the workflow of the backup-leader?
  3. Right now, if quorum is lost, etcd-druid takes care of the non-quorate cluster and tries to bring back the quorum.
    How will etcd-druid react to this new change in the workflow of the backup-leader?

Motivation (Why is this needed?):
PR #383 (Add owner checks and taking of final snapshots) wants to disconnect the kube-apiserver from etcd; for that it kills the etcd process and fails the readiness probe. Now, in multi-node etcd, do we want to kill/disconnect all etcd members?
If we go with a change where the backup-leader kills/disconnects the api-server and the etcd cluster loses its quorum, then we also have to take care of etcd-druid's reaction to the quorum loss.

cc @stoyanr
Approach/Hint to the implement solution (optional):

@ishan16696 ishan16696 added the kind/enhancement Enhancement, improvement, extension label Oct 18, 2021
@timuthy timuthy mentioned this issue Oct 19, 2021
34 tasks
@timuthy
Member

timuthy commented Nov 8, 2021

In a test with @ishan16696, @abdasgupta, @aaronfern and @stoyanr we simulated the impact of doing owner checks on every member. This will be necessary in order to cut off active client connections to each member by killing its etcd process once.

In a multi-node setup killing the etcd process (process is started again afterwards) will have the following impact:

  • Leader is killed: A new leader is elected, the member will continue running as a follower.
  • Follower is killed: Member will continue as a follower.
  • A final full snapshot is only taken when a member becomes a leader && owner check is failing.
  • Read/write requests are accepted as long as the Kubernetes service has one Ready pod (no matter if leader or follower).

--> Ideally, the final full snapshot is only taken after the owner check of the last member fails.

Unfortunately, this is not guaranteed at all times.

Example:

T1
------------------
member1:Leader
member2:Follower
member3:Follower
------------------


T2 (etcd process of member1 is killed)
------------------
member1:Leader    (owner check fails)
member2:Follower  (owner check succeeds) 
member3:Follower  (owner check succeeds) 
------------------


T3 (member2 becomes leader, member1 runs as follower)
------------------
member1:Follower  (owner check fails)     
member2:Leader    (owner check succeeds) 
member3:Follower  (owner check succeeds) 
------------------


T4 (etcd process of member2 is killed)
------------------
member1:Follower  (owner check fails)     
member2:Leader    (owner check fails)
member3:Follower  (owner check succeeds) 
------------------


T5 (member1 becomes leader + final full snapshot is taken, member2 runs as follower)
------------------
member1:Leader    (owner check fails + final full snapshot)
member2:Follower  (owner check fails) 
member3:Follower  (owner check succeeds) --> etcd cluster is still accessible through this member!
------------------


T6 (etcd cluster is not accessible anymore)
------------------
member1:Leader    (owner check fails)   
member2:Follower  (owner check fails)
member3:Follower  (owner check fails) 
------------------

The given example results in a loss of any data that is written between T5 and T6.

@ishan16696
Member Author

The given example results in a loss of any data that is written between T5 and T6.

Yeah, but I think the upper bound on data loss between T5 and T6 will be around ~5 mins, and I think that's acceptable, as we currently have the deltaSnapshotPeriod scheduled for every 5 mins, so in the worst case we can lose ~5 mins of data there as well.

@timuthy
Member

timuthy commented Nov 8, 2021

The given example results in a loss of any data that is written between T5 and T6.

Yeah, but I think the upper bound on data loss between T5 and T6 will be around ~5 mins, and I think that's acceptable, as we currently have the deltaSnapshotPeriod scheduled for every 5 mins, so in the worst case we can lose ~5 mins of data there as well.

Thanks for bringing this up. IIUC, the Snapshotter is stopped as long as the owner check is failing, i.e. there will be no delta snapshots taken. Can you double check this?

@ishan16696
Member Author

Yes, I have double-checked ... this block of code will not let the snapshotter loop start if the owner check fails.
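For illustration, here is a minimal sketch of such a gate (hypothetical names and interval, not the actual etcd-backup-restore code): the snapshotter loop is only (re)started while the owner check passes, so no full or delta snapshots are taken once it fails.

package snapshotter

import (
	"context"
	"time"
)

// Snapshotter is a stand-in for the real snapshotter; only Run matters here.
type Snapshotter interface {
	Run(ctx context.Context) // blocks until it is stopped
}

// runWithOwnerCheck only (re)starts the snapshotter loop while the owner
// check succeeds, so no full/delta snapshots are taken once it fails.
func runWithOwnerCheck(ctx context.Context, ownerCheck func(context.Context) bool, s Snapshotter) {
	const recheckInterval = 30 * time.Second // hypothetical re-check interval
	for ctx.Err() == nil {
		if !ownerCheck(ctx) {
			time.Sleep(recheckInterval)
			continue
		}
		s.Run(ctx) // returns e.g. when a later owner-check failure stops it
	}
}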

The only thing we need to discuss is how to cut off the client traffic in such a way that etcd peer communication isn't affected, so that there is no quorum loss.

@ishan16696
Member Author

ishan16696 commented Nov 8, 2021

And AFAIK configuring a ReadinessProbe in a master-slave architecture in Kubernetes doesn't seem to be a good idea, as we will provision multi-node etcd as a StatefulSet.

@timuthy
Member

timuthy commented Nov 9, 2021

The only thing we need to discuss is how to cut off the client traffic in such a way that etcd peer communication isn't affected, so that there is no quorum loss.

So your proposal is to cut-off traffic as soon as one member has a failing owner check and thus there will be no changes any more?

And AFAIK configuring a ReadinessProbe in a master-slave architecture in Kubernetes doesn't seem to be a good idea, as we will provision multi-node etcd as a StatefulSet.

If you want to drop the ReadinessProbe, how are we supposed to place a service in front of the etcd cluster?

@timuthy timuthy added kind/discussion Discussion (engaging others in deciding about multiple options) and removed kind/enhancement Enhancement, improvement, extension labels Nov 9, 2021
@ishan16696
Member Author

So your proposal is to cut-off traffic as soon as one member has a failing owner check and thus there will be no changes any more?

Yes, I think it would suffice. Let's take this scenario under some assumptions.
Assumptions:

  1. Each etcd member is also running an OwnerCheck watchdog and is able to kill its corresponding etcd process if it detects that the owner check has failed.
  2. etcd peer communication is possible even after any etcd cluster member has cut off the client traffic.
  3. Snapshots can be uploaded even after any etcd cluster member has cut off the client ingress traffic, as we need to take and upload the final full snapshot when the owner check fails.
T1
------------------
etcd-0: Leader
etcd-1: Follower
etcd-2: Follower
------------------


T2  [etcd process of etcd-1 is killed ---> it will cut off client ingress traffic]
------------------
etcd-0: Leader
etcd-1: Follower  (Owner check fails)
etcd-2: Follower
------------------


T3 [etcd process of etcd-0 is killed --> leads to leader election]
------------------
etcd-0: Leader    (Owner check fails)
etcd-1: Follower  (Owner check fails)
etcd-2: Follower
------------------


T4(1) [etcd-1 becomes leader --> takes final full snapshot]
------------------
etcd-0: Follower  (Owner check fails)
etcd-1: Leader    (Owner check fails)
etcd-2: Follower 
------------------

T4(2) [etcd-2 becomes leader --> eventually its owner check will fail --> etcd process of etcd-2 will be killed --> some other member will become leader --> takes the final full snapshot]
------------------
etcd-0: Follower  (Owner check fails)
etcd-1: Follower  (Owner check fails)
etcd-2: Leader 
------------------

@ishan16696
Member Author

ishan16696 commented Nov 9, 2021

There is also another way to do it: we can use the MoveLeader API call to transfer the leadership to the etcd member which first detects that the owner check has failed. The advantage of this method is that we don't have to worry about any scenarios like the above.
So it will work like this:

if (owner check fails && current etcd == Follower) {
      etcd will be killed and restarted
      then transfer the leadership using the MoveLeader API call
      take the final full snapshot
      cut off the client traffic
} else if (owner check fails && current etcd == Leader) {
      etcd will be killed and restarted --> leads to leader election
      then transfer the leadership using the MoveLeader API call
      take the final full snapshot
      cut off the client traffic
}
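For illustration, a minimal sketch of the leadership transfer with the etcd Go client (the endpoint and the member ID of the detecting member are placeholders; taking the snapshot and cutting off traffic are left out):

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-main-client:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Hypothetical: ID of the member that first detected the owner-check failure.
	var detectorID uint64 = 0x1234

	// MoveLeader asks the current leader to hand leadership over to detectorID.
	// The request must reach the leader; otherwise etcd returns an error.
	if _, err := cli.MoveLeader(ctx, detectorID); err != nil {
		log.Fatalf("leadership transfer failed: %v", err)
	}
	// The detecting member is now the leader and can take the final full
	// snapshot and cut off the client traffic, as described above.
}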

@ishan16696
Member Author

ishan16696 commented Nov 10, 2021

I tried to capture the above proposal of using the MoveLeader API call, described here in #242 (comment), in a picture for better understanding:
[image: OwnerCheckProposal]

@timuthy
Member

timuthy commented Nov 10, 2021

Thanks for the proposal @ishan16696.

Let me address two doubts:

  1. The entire client connection is cut-off as soon as one member has a failing owner check, i.e. we cannot handle temporary network issues of a single zone. I'm not sure if that's what we want.
  2. If 1. was negligible, can we really cut-off the client by solely changing the pod selector? What about existing connections that might still exist to other members which did not kill their etcd process yet?

@ishan16696
Member Author

can we really cut-off the client by solely changing the pod selector?

IMO yes, it is possible ... the thing is, when we change the pod selector in the etcd service, we will be able to cut off the ingress client traffic (refer here) ... I had also already tried to simulate this behaviour (although I used a k8s LoadBalancer service to cut off the client requests).
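For illustration, a minimal sketch with client-go of how such a selector change could look (namespace, service name and label key are hypothetical; strategic-merge-patching an extra selector key that no pod carries makes the Service select nothing, so its endpoints are removed):

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cutOffClientTraffic patches the etcd client Service so that its selector
// additionally requires a label no pod carries; the Service then selects no
// pods and ingress client traffic is cut off (assuming peer communication
// goes through a different service and stays unaffected).
func cutOffClientTraffic(ctx context.Context, cs kubernetes.Interface, namespace, serviceName string) error {
	patch := []byte(`{"spec":{"selector":{"gardener.cloud/owner-check":"failed"}}}`) // hypothetical label
	_, err := cs.CoreV1().Services(namespace).Patch(ctx, serviceName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Hypothetical namespace and client service name.
	if err := cutOffClientTraffic(context.Background(), cs, "shoot--foo--bar", "etcd-main-client"); err != nil {
		log.Fatal(err)
	}
}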

What about existing connections that might still exist to other members which did not kill their etcd process yet?

They will eventually kill their corresponding etcd member when they detect that the owner check fails, as the OwnerCheck watchdog is running on each backup follower and leader.

@timuthy
Member

timuthy commented Nov 10, 2021

  1. The entire client connection is cut-off as soon as one member has a failing owner check, i.e. we cannot handle temporary network issues of a single zone. I'm not sure if that's what we want.

⬆️ What about this point? Doing multi-node etcd is one of the reasons to survive problems in one AZ.

@ishan16696
Member Author

What about this point? Doing multi-node etcd is one of the reasons to survive problems in one AZ.

I think a similar question was also raised by @abdasgupta ... I forgot what the counter-argument was, @abdasgupta do you remember?

@ishan16696
Member Author

What about this point? Doing multi-node etcd is one of the reasons to survive problems in one AZ.

AFAIK an owner check failure is not a very frequent scenario, it will happen rarely, so it is not like we completely lose the advantage of having a multi-node etcd ... it's more that we lose this advantage in some scenarios, i.e. when the owner check fails.

cc @stoyanr

@stoyanr
Contributor

stoyanr commented Nov 17, 2021

We have the following 2 cases:

  1. Owner check fails because the owner DNS record actually changed (or was deleted), which should only happen during the "bad case" scenario of a control plane migration.
  2. Owner check fails because DNS resolution failed - either something is wrong with DNS itself, or there is some other network issue preventing DNS resolution.

Note that from the perspective of the source cluster, it's not possible to make a clear distinction between the 2 cases. The DNS resolution can fail and the control plane migration can happen at the same time, so the source cluster must be disabled when the owner check fails, no matter the actual error. This was discussed a few times and is clearly pointed out in GEP-17 as well. I don't think it would be possible to fall back from this point without compromising the entire "bad case" design idea (to which we agreed after several phases of PoC and a lot of discussions).

From the discussion so far I believe the first case is covered. In the second case, the interesting question is what the behavior should be if the DNS resolution only fails in a single zone. If it fails in all (or more than one) zones then the cluster should be disabled, and it's probably not really usable in this case anyway.

Ideally, if the DNS resolution fails in a single zone, we should be able to contain the failure to that zone only. Let's discuss in a meeting if this would be possible and how to achieve it.

@vlerenc
Member

vlerenc commented Nov 17, 2021

Do we also want backup-leader to have a responsibility kill/disconnect the etcd cluster members?

@ishan16696 I wouldn't think so. The leaders can change/aren't stable. Why aren't we letting the druid do it?

Is that some "shortcut", so that the final snapshot is taken? But that can be instrumented by the druid also or remain implicit, but instead of performing an owner check, be based on checking whether traffic was cut off, i.e. druid cuts of traffic, backup sidecar notices that via the changed service and takes a full snapshot, because it is obvious/in a sense logical to now cease operations and close shop, which equals to one last final full snapshot.

Yeah, but I think the upper bound on data loss between T5 and T6 will be around ~5 mins, and I think that's acceptable, as we currently have the deltaSnapshotPeriod scheduled for every 5 mins, so in the worst case we can lose ~5 mins of data there as well.

Uh, what? Why do we consider data loss acceptable (even 5 minutes) without proving it is unavoidable?

In other words, unless there is no technical solution or no practical one with acceptable effort (considering the severity of "data loss", we will usually go to greater lengths than for any other feature in Gardener), data loss is not acceptable.

We first need to be certain that it cannot be helped, but just like that (without a very detailed explanation), data loss is not acceptable, I would think.

So your proposal is to cut-off traffic as soon as one member has a failing owner check and thus there will be no changes any more?

Terminology question: When you say "the check fails" you do not mean that the check itself fails, but that the check shows that this ETCD cluster is no longer responsible for that cluster, right? Because if the check merely fails, this should not have consequences and should not over-eagerly cut off traffic.

But the main point is, that I do not see why the individual ETCD instances should have all their own checks and not the druid be in control. The druid is the master orchestrator here, including cutting off traffic, I would think?

The entire client connection is cut-off as soon as one member has a failing owner check, i.e. we cannot handle temporary network issues of a single zone. I'm not sure if that's what we want.

Yes @timuthy. See above. Absolutely not. If a zone is segregated and the check fails (not: check shows this ETCD lost responsibility for that shoot), whether the ETCD instance can then reach the control plane or not, it shall NEVER cut off traffic.

I understand that there may be cases where DNS is malfunctioning (though calls will probably time out or deliver stale records, which is likely and then even more worrying to cut off traffic, see e.g. our own CoreDNS issues in the past and present) AND the ETCD cluster has lost ownership, but that is a corner case of a corner case of a corner case, and the risk is much, much smaller to NOT cut off client traffic than to cut it off too often and render control planes broken, as we have seen in the past when we over-eagerly (e.g. because backups failed) shut down our ETCD.

In that sense, https://github.com/gardener/gardener/blob/master/docs/proposals/17-shoot-control-plane-migration-bad-case.md#handling-inability-to-resolve-the-owner-dns-record is pretty strong and I thought (but again didn't read the GEP in detail) we agreed to ONLY cut off traffic if it is clear that ownership is lost (not failed) @stoyanr? If druid happens to be in the "broken" zone, it will/should probably fail its own readiness check as well and then come up (eventually) on another ready node in a healthy zone, and if the ETCD cluster has then indeed (for real) lost ownership, it can cut off the traffic.

@ishan16696
Member Author

ishan16696 commented Nov 17, 2021

To still retain the advantage of multi-node etcd of surviving a failure in one zone or any other DNS malfunction:
As discussed in a meeting, we can use the quorum in such a way that only when there is a consensus that the owner check has failed do we completely cut off the client traffic.

Let's take this scenario under some assumptions.
Assumptions:

  1. Each etcd member is also running an OwnerCheck watchdog and is able to kill its corresponding etcd process if it detects that the owner check has failed.
  2. etcd peer communication is possible even after any etcd cluster member has cut off the client traffic.
T1
------------------
etcd-0: Leader
etcd-1: Follower
etcd-2: Follower
------------------


T2  [etcd process of etcd-1 is killed ---> it will let the other members know that `it has detected that the owner check has failed`]
------------------
etcd-0: Leader
etcd-1: Follower  (Owner check fails)
etcd-2: Follower
------------------


Consensus: the owner check hasn't failed yet, as the other 2 etcd members haven't detected an owner check failure.
etcd is still able to serve the incoming traffic.

@timuthy
Member

timuthy commented Nov 17, 2021

To complement the previous comment about how a consensus for owner check results can be implemented (a rough sketch follows after the list):

  1. Each member maintains the result of the owner check in the etcd cluster, e.g.:
etcdctl put /meta/owner-check-control/{Member_Name}/last-successful-check "2021-11-16T07:43:19Z"

(we can also work with etcd leases for further simplification)

  2. Having (n/2) + 1 members with an outdated last-successful-check timestamp means that all members consent to a failing owner check.

  3. With (2.) being true, members start to return failures for the /healthz endpoint and kill the etcd process to cut off client traffic.

  4. Each member reports the success of step (3.) by adding a key-value pair:

etcdctl put /meta/owner-check-control/{Member_Name}/disconnect true

  5. Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.
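For illustration, a minimal sketch of how such a quorum evaluation could look with the etcd Go client, assuming the key layout above (endpoint, member names, cluster size and staleness threshold are placeholders, not the actual etcd-backup-restore implementation):

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const checkPrefix = "/meta/owner-check-control/"

// reportOwnerCheck records the last successful owner check for this member.
func reportOwnerCheck(ctx context.Context, cli *clientv3.Client, member string, t time.Time) error {
	_, err := cli.Put(ctx, checkPrefix+member+"/last-successful-check", t.UTC().Format(time.RFC3339))
	return err
}

// ownerCheckFailedByQuorum returns true once at least (n/2)+1 members have a
// missing or outdated last-successful-check timestamp (older than maxAge).
func ownerCheckFailedByQuorum(ctx context.Context, cli *clientv3.Client, clusterSize int, maxAge time.Duration) (bool, error) {
	resp, err := cli.Get(ctx, checkPrefix, clientv3.WithPrefix())
	if err != nil {
		return false, err
	}
	fresh := 0
	for _, kv := range resp.Kvs {
		if !strings.HasSuffix(string(kv.Key), "/last-successful-check") {
			continue // ignore e.g. the disconnect keys
		}
		if ts, err := time.Parse(time.RFC3339, string(kv.Value)); err == nil && time.Since(ts) < maxAge {
			fresh++
		}
	}
	outdated := clusterSize - fresh
	return outdated >= clusterSize/2+1, nil
}

func main() {
	// Hypothetical wiring; the real values would come from the configuration.
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"http://etcd-main-client:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	_ = reportOwnerCheck(ctx, cli, "etcd-main-0", time.Now())
	failed, _ := ownerCheckFailedByQuorum(ctx, cli, 3, 5*time.Minute)
	fmt.Println("owner check failed by quorum:", failed)
}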

@ishan16696
Member Author

etcdctl put /meta/owner-check-control/{Member_Name}/last-successful-check "2021-11-16T07:43:19Z"

What about using the memberID instead of the member name?

  1. Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.

I have one concern regarding checking that all members report disconnect: suppose one etcd follower gets disconnected from the cluster due to a network partition, then we end up waiting forever for the final full snapshot.
I prefer that here we also check the quorum for disconnect: true. WDYT?

members start to return failures for the /healthz endpoint and kill the etcd process to cut off client traffic.

I'm under the impression that we are not going to use the /healthz endpoint as the ReadinessProbe in multi-node etcd. How do we cut off the client traffic for each etcd member separately?

@timuthy
Member

timuthy commented Nov 17, 2021

etcdctl put /meta/owner-check-control/{Member_Name}/last-successful-check "2021-11-16T07:43:19Z"

What about using the memberID instead of the member name?

No clear preference, this was only an example. Please keep in mind that garbage collection will be more complex with memberID.

  1. Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.

I have one concern regarding checking that all members report disconnect: suppose one etcd follower gets disconnected from the cluster due to a network partition, then we end up waiting forever for the final full snapshot. I prefer that here we also check the quorum for disconnect: true. WDYT?

I'd rather use a combination of quorum && timeouts because a sole quorum can result in taking the final snapshot prematurely.
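To make the "quorum && timeouts" idea concrete, a minimal sketch (hypothetical member names, cluster size and grace period): the final snapshot is only triggered once a quorum of members has reported disconnect AND either all members have reported or a grace period for stragglers has elapsed.

package main

import (
	"fmt"
	"time"
)

// finalSnapshotDue combines the two safeguards: a quorum of disconnect
// reports and a grace period for stragglers (or all members reported).
// reports maps member name -> time it reported disconnect: true.
func finalSnapshotDue(reports map[string]time.Time, clusterSize int, grace time.Duration, now time.Time) bool {
	if len(reports) >= clusterSize { // everyone disconnected: safe to snapshot
		return true
	}
	if len(reports) < clusterSize/2+1 { // no quorum yet: too early
		return false
	}
	// Quorum reached: wait out the grace period, measured from the first
	// report, before giving up on the remaining members.
	first := now
	for _, t := range reports {
		if t.Before(first) {
			first = t
		}
	}
	return now.Sub(first) >= grace
}

func main() {
	now := time.Now()
	reports := map[string]time.Time{ // hypothetical member names
		"etcd-main-0": now.Add(-10 * time.Minute),
		"etcd-main-1": now.Add(-9 * time.Minute),
	}
	fmt.Println(finalSnapshotDue(reports, 3, 5*time.Minute, now)) // true: quorum + grace elapsed
}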

members start to return failures for the /healthz endpoint and kill the etcd process to cut off client traffic.

I'm under the impression that we are not going to use the /healthz endpoint as the ReadinessProbe in multi-node etcd. How do we cut off the client traffic for each etcd member separately?

At the moment I don't see any other proposed solution, so I used what is already there, which is /healthz. If we agree not to use /healthz but something else instead, then we also have to decide how to cut off the client traffic with that alternative.

@vlerenc
Member

vlerenc commented Nov 17, 2021

Leader checks if all members in the cluster reported disconnect: true and eventually triggers a final full snapshot.

@timuthy That is what I would rather leave to the quorum/majority than to all, as discussed in the meeting and as @ishan16696 also pointed out.

I'd rather use a combination of quorum && timeouts because a sole quorum can result in taking the final snapshot prematurely.

@timuthy Maybe you meant it that way, but what @ishan16696 and I wondered about is the "all", whereas above you write "quorum". Quorum sounds good; timeouts may help even more (as discussed in the call, but now in combination with the quorum to safeguard it even more).

The general T5/T6 problem from above is already averted by the quorum trick. So there is now an even smaller risk (must think). The original T5 case was anyway a bit "constructed", because if 2/3 pods see lost or failed ownership, it is unlikely that 1/3 pods will still see ownership. That would mean that 2/3 pods failed to get the DNS record, but that it was still unchanged and the ownership remained with the seed (or maybe a TTL issue, which is a special form of failure, so quite "constructed" I would think).

A detail question while trying to play through the different cases: I think @stoyanr explained that in the case of ownership loss/failure, the readiness probe returns 503 and then the etcd process is terminated to terminate all existing connections. Since the Kubelet takes time to detect the failed readiness probe and report it back, and then KCM to update the endpoints, how is it ensured that the endpoint is first removed and doesn't come back? If you restart the etcd process too fast, the endpoint is still up. While you can make some assumptions about the Kubelet, KCM is even more unpredictable. Removing the endpoint yourself runs the danger of racing with KCM. What is the trick here?

@vlerenc
Member

vlerenc commented Nov 17, 2021

@stoyanr replied out-of-band that the delays add up to the time the kubelet/KCM will usually take to remove the endpoint:

  • We kill the etcd process
  • After at most periodSeconds (5 seconds), k8s detects that the container liveness probe fails, so it restarts the etcd container after trying failureThreshold times (defaults to 3), so the total delay is between 10 and 15 seconds
  • This causes the etcd process to effectively get started again, which also takes a few seconds before it's actually able to receive connections
  • Meanwhile ~20 seconds have elapsed so the endpoint is already gone

And...

We can probably play with this configuration a bit, e.g. increase the failureThreshold to 5, to make it even more likely to avoid having the etcd process start before the endpoint is gone. Beyond that, I am not sure what could be done to ensure that this is always the case.

Maybe that can be further safe-guarded, but considering the delays above, it’s probably not worth it. E.g., the moment we will let the readiness probe fail, we know the Kubelet will report it whenever it sees it next time (there is still a tiny uncertainty if it invoked the readiness probe that succeeded and has not yet reported it back). Knowing that the Kubelet will report false next time, one can “already” pro-actively update the pod status ready condition to false and also the endpoint as well. That’s it. If the Kubelet reports false, it’s already false and if KCM checks, the endpoint is already gone. As said, there is this time window of uncertainty where we do not know for sure whether the Kubelet will race with us if it has just called the readiness probe successfully and has not yet reported it back.
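As a back-of-the-envelope illustration of the timing above, using the values mentioned (periodSeconds=5, failureThreshold=3, optionally raised to 5; these are tunable, not prescribed):

package main

import "fmt"

func main() {
	// The kubelet needs up to failureThreshold consecutive failed probes,
	// each periodSeconds apart, before it restarts the container and the
	// pod is reported not ready.
	periodSeconds := 5
	failureThreshold := 3
	fmt.Printf("restart/endpoint removal after roughly %d-%d seconds\n",
		(failureThreshold-1)*periodSeconds, failureThreshold*periodSeconds) // 10-15s

	// Increasing failureThreshold to 5 widens that window and makes it more
	// likely the endpoint is gone before the etcd process is back up.
	failureThreshold = 5
	fmt.Printf("with failureThreshold=5: roughly %d-%d seconds\n",
		(failureThreshold-1)*periodSeconds, failureThreshold*periodSeconds) // 20-25s
}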

@vlerenc
Member

vlerenc commented Nov 17, 2021

A few cases (time progresses vertically), e.g.: Zone outage/network partitioned -> leader election

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Outage resp.   |  |                            |                            |
|  | Network Partitioned  |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Heartbeats not sent        |                            |                            |
| Reporting staleness        | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Restored resp. |  |                            |                            |
|  |   Network Rejoined   |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Rejoin as Follower         |                            |                            |

Case: Loss of ownership -> traffic cut-off and final full snapshot backup

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|                     /------------------------------------------\                     |
|                     | Loss of Ownership / DNS Record Rewritten |                     |
|                     \------------------------------------------/                     |
|                            |                            |                            |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   | Ownership change detected  |                            |
| Read peer state (noquorum) | Write "I saw owner loss"   | Ownership change detected  |
| No action taken            | Read peer state (quorum)   | Write "I saw owner loss"   |
|                            | Terminate ETCD process     | Read peer state (quorum)   |
|                            | Fail readiness probe       | Terminate ETCD process     |
|                            | Rejoin as Follower         | Fail readiness probe       |
|                            | No Leader -> No backup     | Rejoin as Follower         |
|                            |                            | No Leader -> No backup     |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   |                            |                            |
| Read peer state (quorum)   |                            |                            |
| Terminate ETCD process     |                            |                            |
| Heartbeats not sent        |                            |                            |
|                            | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
| Fail readiness probe       | Leader -> Full backup      | No Leader -> No backup     |
| Rejoin as Follower         |                            |                            |
| No Leader -> No backup     |                            |                            |

Case: Zone outage/network partitioned -> leader election &
Loss of ownership -> traffic cut-off and final full snapshot backup

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Outage resp.   |  |                            |                            |
|  | Network Partitioned  |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Heartbeats not sent        |                            |                            |
| Reporting staleness        | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
|                            |                            |                            |
|                     /------------------------------------------\                     |
|                     | Loss of Ownership / DNS Record Rewritten |                     |
|                     \------------------------------------------/                     |
|                            |                            |                            |
| Ownership change failed    |                            |                            |
| Write "I saw..." failed    | Ownership change detected  |                            |
| No action taken            | Write "I saw owner loss"   | Ownership change detected  |
|                            | Read peer state (noquorum) | Write "I saw owner loss"   |
|                            | No action taken            | Read peer state (quorum)   |
|                            |                            | Terminate ETCD process     |
|                            |                            | Heartbeats not sent        |
|                            | Leader Election            |                            |
|                            | Leader Election            |                            |
|                            | Leader Election            | Fail readiness probe       |
|                            | Leader Election            | Leader Election            |
|                            | Staying Leader             | Staying Follower           |
|                            | Ownership change detected  | No Leader -> No backup     |
|                            | Write "I saw owner loss"   |                            |
|                            | Read peer state (quorum)   |                            |
|                            | Terminate ETCD process     |                            |
|                            | Heartbeats not sent        |                            |
|                            |                            | Leader Election            |
|                            |                            | Leader Election            |
|                            | Fail readiness probe       | Leader Election            |
|                            | Leader Election            | Leader Election            |
|                            | Staying Leader             | Staying Follower           |
|                            | Leader -> Full backup      | No Leader -> No backup     |
...or...
|                            | Leader Election            | Leader Election            |
|                            | Becoming Follower          | Becoming Leader            |
|                            | No Leader -> No backup     | Leader -> Full backup      |
|                            |                            |                            |
|  /----------------------\  |                            |                            |
|  |  Zone Restored resp. |  |                            |                            |
|  |   Network Rejoined   |  |                            |                            |
|  \----------------------/  |                            |                            |
|                            |                            |                            |
| Rejoin as Follower         |                            |                            |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   |                            |                            |
| Read peer state (quorum)   |                            |                            |
| Terminate ETCD process     |                            |                            |
| Fail readiness probe       |                            |                            |
| Rejoin as Follower         |                            |                            |
| No Leader -> No backup     |                            |                            |

There are many more (corner cases), but when I think of some (similar to the T5/T6 case that should no longer be possible?), e.g. the leader loses leadership while in full backup, a follower that gets elected to become the next leader should check the owner loss quorum before opening up for traffic (pass or fail the readiness probe from then on depending on the result) and that would prevent intermediate etcd updates, right?

|           Zone 1           |           Zone 2           |           Zone 3           |
|----------------------------|----------------------------|----------------------------|
| Becoming Leader            | Becoming Follower          | Becoming Follower          |
|                            |                            |                            |
|                     /------------------------------------------\                     |
|                     | Loss of Ownership / DNS Record Rewritten |                     |
|                     \------------------------------------------/                     |
|                            |                            |                            |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   | Ownership change detected  |                            |
| Read peer state (noquorum) | Write "I saw owner loss"   | Ownership change detected  |
| No action taken            | Read peer state (quorum)   | Write "I saw owner loss"   |
|                            | Terminate ETCD process     | Read peer state (quorum)   |
|                            | Fail readiness probe       | Terminate ETCD process     |
|                            | Rejoin as Follower         | Fail readiness probe       |
|                            | No Leader -> No backup     | Rejoin as Follower         |
|                            |                            | No Leader -> No backup     |
| Ownership change detected  |                            |                            |
| Write "I saw owner loss"   |                            |                            |
| Read peer state (quorum)   |                            |                            |
| Terminate ETCD process     |                            |                            |
| Heartbeats not sent        |                            |                            |
|                            | Leader Election            | Leader Election            |
|                            | Becoming Leader            | Staying Follower           |
| Fail readiness probe       | Leader -> Full backup      | No Leader -> No backup     |
| Rejoin as Follower         |                            |                            |
| No Leader -> No backup     |                            |                            |
|                            |                            |                            |
|                            |  /----------------------\  |                            |
|                            |  |  Zone Outage resp.   |  |                            |
|                            |  | Network Partitioned  |  |                            |
|                            |  \----------------------/  |                            |
|                            |                            |                            |
|                            | Full backup interrupted    |                            |
|                            | Heartbeats not sent        |                            |
| Leader Election            | Reporting staleness        | Leader Election            |
| Staying Follower           |                            | Becoming Leader            |
| No Leader -> No backup     |                            | Leader -> Full backup      |
...or...
| Leader Election            | Reporting staleness        | Leader Election            |
| Becoming Leader            |                            | Staying Follower           |
| Leader -> Full backup      |                            | No Leader -> No backup     |

Something like that?

@timuthy
Member

timuthy commented Nov 18, 2021

The general T5/T6 problem from above is already averted by the quorum trick. So there is now an even smaller risk (must think). The original T5 case was anyway a bit "constructed", because if 2/3 pods see lost or failed ownership, it is unlikely that 1/3 pods will see ownership.

The constructed case was less about whether 1/3 will keep seeing ownership, but rather about when it will start seeing a lost ownership, because of the deviating owner check intervals and involved TTLs. @stoyanr then suggested to have a second safeguard, which the /meta/owner-check-control/{Member_Name}/disconnect keys are used for in my example.

Do I get it correctly, that with your cases above you don't see the necessity to have this second safeguard?

@vlerenc
Member

vlerenc commented Nov 18, 2021

The constructed case was less about whether 1/3 will keep seeing ownership, but rather about when it will start seeing a lost ownership, because of the deviating owner check intervals and involved TTLs.

Yes, I understood, but if two pods already see lost ownership, the third one, checking even later, should see it as well. The chance that it doesn't is very small, is it not? Maybe a TTL issue because the record was fetched right before it was switched, something like that. Anyway, I didn't say it is impossible. And, with the quorum, it doesn't matter anymore.

Do I get it correctly, that with your cases above you don't see the necessity to have this second safeguard?

No, that shouldn't imply it. I don't know how exactly it is implemented, the termination, the readiness probe, etc. So, I was vague because of lack of knowledge, but maybe we can fill in the details together and make it safer?

The above has helped me see some things more clearly (like termination will of course cause loss of leadership or leader election), but when that happens, the sections where I write "Terminate ETCD process"/"Fail readiness probe", etc. are too vague for me. It really depends now on the details, but as said yesterday (not helpful as that's not a concrete statement), I don't think it's far now anymore (like also Stoyan said). It's basically a quorum-based "distributed transaction". The pattern is known, the details must be clarified now. What happens chronologically when, storing data, terminating ETCD, checking that data after restart and before ever passing (or not passing) a readiness probe, etc. This information I didn't have to make it more concrete.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label May 18, 2022
@ashwani2k ashwani2k added the release/ga Planned for GA(General Availability) release of the Feature label Jul 6, 2022
@ishan16696
Member Author

ishan16696 commented Aug 10, 2022

As discussed, we have decided to turn off the owner check in the case of multi-node etcd, as owner checks introduce complexities which are difficult to manage in a multi-node etcd, and with HA control planes the "bad case" control plane migration would be triggered very rarely.
I'm closing this issue in favour of this PR: gardener/gardener#6412, which disables the owner checks for multi-node etcd.
For more details on the "bad case" control plane migration, please refer to the umbrella issue ☂️ gardener/gardener#6302
/cc @plkokanov @timuthy

/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 10, 2022