[Feature] Make etcd-backup-restore Kubernetes-aware to keep member and backup status up-to-date #333

Closed
Tracked by #107
amshuman-kr opened this issue May 17, 2021 · 5 comments · Fixed by gardener/etcd-druid#233 or #382
Labels: kind/enhancement (Enhancement, improvement, extension), release/beta (Planned for Beta release of the Feature)

amshuman-kr (Collaborator) commented on May 17, 2021:

Feature (What you would like to be added):

Make etcd-backup-restore Kubernetes-aware so that it keeps the corresponding member status and the BackupReady condition up-to-date in the Etcd resource status.

Motivation (Why is this needed?):

etcd-backup-restore is closest to the source of this information, so making it responsible for updating these status sections solves the problem at the lowest viable level and avoids etcd-druid becoming a bottleneck.

Approach/Hint to implement the solution (optional):

  1. Member status can be obtained via a heartbeat on the corresponding etcd member's endpoint status (see the sketch after this list). The same heartbeat can be used to decide backup leader election in "[Feature] Add leader-election to the backup-restore sidecar according to the multi-node ETCD proposal" #321.
  2. The BackupReady condition could be determined based on whether the latest backup upload (full or incremental) was successful.
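
A minimal sketch of such a heartbeat probe using the etcd clientv3 API follows; the helper name probeMember, the endpoint handling, and the timeouts are illustrative assumptions, not part of this proposal:

	import (
		"context"
		"time"

		clientv3 "go.etcd.io/etcd/client/v3"
	)

	// probeMember checks the endpoint status of this member's etcd process.
	// A successful response can drive the member-status heartbeat, and the
	// same call can feed backup leader election: e.g. the backup leader is
	// the member whose endpoint reports itself as the current etcd leader.
	func probeMember(ctx context.Context, endpoint string) (isLeader bool, err error) {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{endpoint},
			DialTimeout: 5 * time.Second, // illustrative timeout
		})
		if err != nil {
			return false, err
		}
		defer cli.Close()

		ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
		defer cancel()

		// StatusResponse carries the member's own ID and the current leader ID,
		// so a single call answers both "am I healthy?" and "am I the leader?".
		resp, err := cli.Status(ctx, endpoint)
		if err != nil {
			return false, err
		}
		return resp.Header.MemberId == resp.Leader, nil
	}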

Also, it would be preferable to use StatusWriter.Patch() to avoid race conditions.

timuthy (Member) commented on May 20, 2021:

> Also, it would be preferable to use StatusWriter.Patch() to avoid race conditions.

Please see gardener/gardener#4083. As soon as all seeds run Kubernetes 1.18+ and Etcd-Druid supports it, we can use Server-Side Apply to let each etcd-backup-restore actor update only the relevant portion of the Etcd resource, namely status.members[id=<cluster-member-id>].

This will spare us potential conflicts and save time and network costs.

Here is a simple example of how to use SSA:

	import (
		"context"
		"time"

		// import path may vary by etcd-druid version
		druidv1alpha1 "github.com/gardener/etcd-druid/api/v1alpha1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"sigs.k8s.io/controller-runtime/pkg/client"
	)

	// TypeMeta must be set explicitly because client.Apply serializes the
	// object as-is for the Server-Side Apply patch.
	etcd := &druidv1alpha1.Etcd{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "druid.gardener.cloud/v1alpha1",
			Kind:       "Etcd",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      <name>,      // the Etcd resource's name
			Namespace: <namespace>, // the Etcd resource's namespace
		},
		// Only the fields this actor owns are set; SSA merges them into the
		// existing status without touching other members' entries.
		Status: druidv1alpha1.EtcdStatus{
			Members: []druidv1alpha1.EtcdMemberStatus{
				{
					ID:                 "1",
					Name:               "member1",
					Role:               druidv1alpha1.EtcdRoleMember,
					Status:             druidv1alpha1.EtcdMemberStatusReady,
					Reason:             "up and running",
					LastUpdateTime:     metav1.NewTime(time.Now()),
					LastTransitionTime: metav1.NewTime(time.Now()),
				},
			},
		},
	}

	// ec is the controller-runtime client. The field owner identifies this
	// sidecar in managedFields; ForceOwnership resolves conflicts in its favour.
	if err := ec.Status().Patch(ctx, etcd, client.Apply, client.FieldOwner("etcdbr1"), client.ForceOwnership); err != nil {
		return err
	}

shreyas-s-rao (Collaborator) commented on Jul 20, 2021:

The backup sidecar will now have to adhere to gardener/etcd-druid#206 by creating and maintaining the EtcdMember resource, either if the corresponding CRD exists or if a flag passed to the etcdbrctl command tells the backup sidecar to work in a Kubernetes-aware environment. It will still maintain the BackupReady condition on the Etcd status according to the original proposal.
@amshuman-kr who is responsible for the deletion of the EtcdMember resource? Ideally, upon a termination request, the backup sidecar should delete its own EtcdMember before terminating, but there are various situations in which that might not be possible. In that case, would etcd-druid wait for a failed heartbeat and then decide whether to set status: Unready or delete the resource itself? I'm a bit unclear about this.

amshuman-kr (Collaborator, Author) commented:

> who is responsible for the deletion of the EtcdMember resource? Ideally, upon a termination request, the backup sidecar should delete its own EtcdMember before terminating, but there are various situations in which that might not be possible. In that case, would etcd-druid wait for a failed heartbeat and then decide whether to set status: Unready or delete the resource itself?

The same component that was responsible for removing the member entry from the status in the earlier SSA-based design: basically, the custodian controller.

There is still the case of superfluous member entries in the etcd cluster itself, which are to be cleaned up by the leading sidecar.
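
A hedged sketch of what such a cleanup by the leading sidecar could look like, using the etcd clientv3 cluster API; the helper name removeSuperfluousMembers and the desiredMembers set are assumptions for illustration:

	import (
		"context"

		clientv3 "go.etcd.io/etcd/client/v3"
	)

	// removeSuperfluousMembers prunes etcd members that no longer correspond
	// to a desired replica of the cluster. Only the leading sidecar should
	// run this, to avoid concurrent removals.
	func removeSuperfluousMembers(ctx context.Context, cli *clientv3.Client, desiredMembers map[string]struct{}) error {
		resp, err := cli.MemberList(ctx)
		if err != nil {
			return err
		}
		for _, m := range resp.Members {
			if _, ok := desiredMembers[m.Name]; !ok {
				// MemberRemove evicts the superfluous member from the etcd cluster.
				if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
					return err
				}
			}
		}
		return nil
	}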

amshuman-kr (Collaborator, Author) commented:

Given the limitations of Server-Side Apply (please see gardener/etcd-druid#179), gardener/etcd-druid#207 describes the alternative of etcd-backup-restore continually renewing the corresponding Lease object as a way to inform etcd-druid about the health/status of the corresponding member.
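
A minimal sketch of such a lease renewal, assuming a controller-runtime client and a coordination.k8s.io/v1 Lease named after the member; the helper name renewMemberLease, the namespace/memberName parameters, and the MergeFrom patch strategy are illustrative:

	import (
		"context"
		"time"

		coordinationv1 "k8s.io/api/coordination/v1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"sigs.k8s.io/controller-runtime/pkg/client"
	)

	// renewMemberLease bumps spec.renewTime on the member's Lease so that
	// etcd-druid can treat a recently renewed Lease as a healthy heartbeat
	// and a stale one as a failed member.
	func renewMemberLease(ctx context.Context, c client.Client, namespace, memberName string) error {
		lease := &coordinationv1.Lease{}
		if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: memberName}, lease); err != nil {
			return err
		}
		patch := client.MergeFrom(lease.DeepCopy())
		now := metav1.NewMicroTime(time.Now())
		lease.Spec.RenewTime = &now
		return c.Patch(ctx, lease, patch)
	}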

amshuman-kr (Collaborator, Author) commented:

🎉 ❤️

ashwani2k added the release/beta label (Planned for Beta release of the Feature) on Jul 6, 2022