Enhance backup ready conditions #619

Closed


shreyas-s-rao
Contributor

@shreyas-s-rao shreyas-s-rao commented Jun 19, 2023

How to categorize this PR?

/area backup
/area usability
/kind enhancement

What this PR does / why we need it:

  • Enhance backup ready conditions to provide more fine-grained condition status messages and reasons based on different states of snapshot leases
    • Setting the BackupReady condition reason to BackupFailed no longer depends on the previous condition. Previously, this depended on whether the previous condition was failed or unknown, so if the previous condition was succeeded, the condition would never be set to failed unless one of the leases was recreated. This behavior is now fixed and fully deterministic.
    • Fine-grained condition messages for operators to infer whether full, delta or both snapshot leases have problems with renewal
    • Improve readability of handling of different cases when setting BackupReady condition
    • Remove hardcoding of 24h for calculating staleness of full snapshot lease, by computing the "schedule duration" (duration between two activations of the cron, assuming activations are equal durations apart) from full snapshot cron schedule

Which issue(s) this PR fixes:
Fixes #618

Special notes for your reviewer:

Release note:

Enhance `BackupReady` conditions to allow for more fine-grained condition states, messages and reasons.

@shreyas-s-rao shreyas-s-rao requested a review from a team as a code owner June 19, 2023 07:52
@gardener-robot gardener-robot added area/backup Backup related area/usability Usability related kind/enhancement Enhancement, improvement, extension needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Jun 19, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 19, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 19, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 19, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 19, 2023
@shreyas-s-rao
Contributor Author

/invite @unmarshall

@gardener-robot

@unmarshall You have pull request review open invite, please check

pkg/health/condition/check_backup_ready.go (outdated, resolved)
Makefile (resolved)
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jun 27, 2023
@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Jun 28, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 28, 2023
@shreyas-s-rao
Contributor Author

@seshachalam-yv thanks for your review. I've addressed your comment, in a slightly different way than you suggested. But overall, readability has improved. PTAL.

Contributor

@seshachalam-yv seshachalam-yv left a comment

Apart from one small nitpick, overall, this PR looks great to me. I appreciate the time and effort you've put into addressing all the comments and suggestions. I am impressed with the changes you've made in this PR. It not only enhances the code's readability but also provides a clear and logical flow. Excellent job! 😍

@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 30, 2023
@shreyas-s-rao
Contributor Author

@seshachalam-yv I've addressed your follow-up suggestion as well. Thanks for the detailed suggestions!

@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 30, 2023
@shreyas-s-rao
Contributor Author

/test pull-etcd-druid-e2e-kind

@shreyas-s-rao shreyas-s-rao added this to the v0.19.0 milestone Jul 3, 2023
pkg/health/condition/builder.go (outdated, resolved)
pkg/health/condition/check_backup_ready.go (outdated, resolved)
// Fetch snapshot leases
fullSnapshotLease, err := a.fetchLease(ctx, etcd.GetFullSnapshotLeaseName(), etcd.Namespace)
if err != nil {
	return createBackupConditionResult(
Contributor

There is no real need for the createBackupConditionResult function. You do not save on the number of lines of code (in fact, you have more lines of code :) ), and there is no readability improvement over just creating an instance of a struct.

Contributor Author

conType: druidv1alpha1.ConditionTypeBackupReady is common to all BackupReady condition results, so it made sense to pull it out into a separate function, to avoid adding the conType every time a result is returned. The previous approach was to create a default result at the beginning of the function and change its values when returning, but @seshachalam-yv pointed out that this was not the most readable, since one has to check both the default result and the changed values to figure out the final result being returned.
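
To make the trade-off in this thread concrete, here is a small sketch comparing the two styles side by side. The `result` type and its string fields are simplified, hypothetical stand-ins (the real type lives in etcd-druid and is not reproduced here); only the helper name `createBackupConditionResult` comes from the diff.

```go
package main

import "fmt"

// result is a simplified, hypothetical stand-in for the health check's result type.
type result struct {
	conType string
	status  string
	reason  string
	message string
}

// conditionTypeBackupReady stands in for druidv1alpha1.ConditionTypeBackupReady.
const conditionTypeBackupReady = "BackupReady"

// createBackupConditionResult fixes conType so callers pass only what varies.
func createBackupConditionResult(status, reason, message string) *result {
	return &result{
		conType: conditionTypeBackupReady,
		status:  status,
		reason:  reason,
		message: message,
	}
}

func main() {
	// Helper style: conType is implied.
	viaHelper := createBackupConditionResult("Unknown", "Unknown", "Unable to fetch delta snap lease.")

	// Struct literal style: every field is spelled out at the return site.
	viaLiteral := &result{
		conType: conditionTypeBackupReady,
		status:  "Unknown",
		reason:  "Unknown",
		message: "Unable to fetch delta snap lease.",
	}

	// Both styles build the same value.
	fmt.Println(*viaHelper == *viaLiteral)
}
```

Both forms yield identical values, so the choice is about readability at the return site: the helper hides one constant field, while the literal shows all fields at the cost of repetition.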

if err != nil {
	return createBackupConditionResult(
		druidv1alpha1.ConditionUnknown, Unknown,
		fmt.Sprintf("Unable to fetch delta snap lease. %s", err.Error()),
Contributor

Do you wish to include err.Error() as part of the message, or just log it using logger.Error? How large is err.Error()?

Contributor

Reason Unknown is quite vague. The current set of conditions that I see on a typical etcd resource is as follows:

 conditions:
  - lastTransitionTime: "2023-07-05T03:55:06Z"
    lastUpdateTime: "2023-07-05T05:02:52Z"
    message: All members are ready
    reason: AllMembersReady
    status: "True"
    type: AllMembersReady
  - lastTransitionTime: "2023-07-05T04:00:02Z"
    lastUpdateTime: "2023-07-05T05:02:52Z"
    message: Snapshot backup succeeded
    reason: BackupSucceeded
    status: "True"
    type: BackupReady
  - lastTransitionTime: "2023-07-05T03:55:06Z"
    lastUpdateTime: "2023-07-05T05:02:52Z"
    message: The majority of ETCD members is ready
    reason: Quorate
    status: "True"
    type: Ready

As you can see, each reason clearly indicates what the condition is for. A reason of Unknown is unqualified, and therefore very hard to reason about, disambiguate, or even process later. In your proposal, one has to read the message to learn more about the condition, which is a departure from the existing set of conditions.

Contributor

Check method returns a single Result. If there are problems fetching both the leases then you will always return the condition with message for delta, thereby masking full-snapshot lease condition message. Would it make sense to have different conditions - one for delta and another for full snapshot?

This also allows you to separately capture a message when you are unable to compute the full snapshot duration. This will then not affect the condition for delta snapshot.

Contributor

What happens when lease itself is NotFound? Should you not have a different message indicating that the lease itself is missing?

Contributor Author

> do you wish to include err.Error() as part of the message or just log it using logger.Error? How large is err.Error()

It's safer to add the error to the condition, just so that it's visible to an operator without having to sift through logs. Even gardener shoot conditions for instance store the error message in the condition, which are printed to the dashboard as well. It's quite helpful for operators and users alike.

Contributor Author

> If you see the reason clearly indicates what that condition is for. So having a reason as Unknown would be un-qualified and therefore very hard to reason or disambiguate or even process later. In your proposal one has to look at the message to learn more about the condition which is a departure from the existing set of conditions.

I'll try to add more meaningful reason strings then.

Contributor Author

> Check method returns a single Result. If there are problems fetching both the leases then you will always return the condition with message for delta, thereby masking full-snapshot lease condition message. Would it make sense to have different conditions - one for delta and another for full snapshot?

I've handled such cases specifically so that the full snapshot lease error does not get masked by delta snapshot lease error. If both leases have errors, both are captured in the condition, such as this and this.
Looks like only the case of failing to fetch the leases needs to be handled more robustly. I'll handle this then, thanks

Contributor Author

> What happens when lease itself is NotFound? Should you not have a different message indicating that the lease itself is missing?

Right now, it's a blanket message of "Unable to fetch full/delta snap lease: <error-string>", which is still technically correct. The error string holds the reason why the fetch failed, and it will specify that the lease was not found. If you want, I can separate out the lease-not-found case and use a separate Reason string for it, like LeaseNotFound. WDYT?

pkg/utils/miscellaneous.go (outdated, resolved)
pkg/health/condition/check_backup_ready.go (resolved)
isFullSnapshotLeaseStale := isLeaseStale(fullSnapshotLeaseRenewTime, fullSnapshotLeaseRenewalGracePeriod)
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

if isFullSnapshotLeaseStale && !isDeltaSnapshotLeaseStale {
Contributor

If you create separate conditions for full and delta snapshots, these checks become simpler and you can capture the two conditions independently. BackupsReady can then be a derived condition that looks at the delta and full snapshot conditions, if you require a single condition for all backups at all.
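
The derived-condition suggestion can be sketched roughly as below. The types are simplified stand-ins rather than the druid API types, the condition type names `FullBackupReady` and `DeltaBackupReady` are hypothetical, and the aggregation rule (any False wins, then any Unknown, otherwise True) is an assumption made for illustration; the thread does not specify one.

```go
package main

import "fmt"

type status string

const (
	statusTrue    status = "True"
	statusFalse   status = "False"
	statusUnknown status = "Unknown"
)

// condition is a simplified, hypothetical stand-in for the etcd status condition type.
type condition struct {
	Type   string
	Status status
	Reason string
}

// deriveBackupReady folds the two per-snapshot conditions into one.
// Assumed rule: any False wins, then any Unknown, otherwise True.
func deriveBackupReady(full, delta condition) condition {
	out := condition{Type: "BackupReady"}
	switch {
	case full.Status == statusFalse || delta.Status == statusFalse:
		out.Status, out.Reason = statusFalse, "BackupFailed"
	case full.Status == statusUnknown || delta.Status == statusUnknown:
		out.Status, out.Reason = statusUnknown, "Unknown"
	default:
		out.Status, out.Reason = statusTrue, "BackupSucceeded"
	}
	return out
}

func main() {
	full := condition{Type: "FullBackupReady", Status: statusTrue}
	delta := condition{Type: "DeltaBackupReady", Status: statusFalse}
	fmt.Printf("%+v\n", deriveBackupReady(full, delta))
}
```

With this split, each per-snapshot check only has to reason about its own lease, and the combined condition no longer needs a branch per pair of lease states.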

pkg/health/condition/check_backup_ready.go (outdated, resolved)
pkg/health/condition/check_backup_ready.go (outdated, resolved)
isDeltaSnapshotLeaseStale := isLeaseStale(deltaSnapshotLeaseRenewTime, deltaSnapshotLeaseRenewalGracePeriod)

// Delta snapshot lease is stale, while staleness of full snapshot lease cannot be determined yet
if isDeltaSnapshotLeaseStale && wasFullSnapshotLeaseCreatedRecently {
Contributor

This has become quite complicated. Can be simplified by just having 2 conditions. Then we just require 2 functions overall - one to properly update full snapshot lease status and another to update delta snapshot lease status

Comment on lines 182 to 183
// even though the full snapshot may have succeeded within the required time, we must still wait
// for delta snapshotting to begin to consider the backups as healthy, to maintain the given RPO.
Contributor

I'm not too sure I agree with this
I think that if a full snapshot has been taken and that is within the deltaSnapshotRenewalGracePeriod then backup status should be BackupSucceeded as this still maintains our RPO for that instant and makes semantic sense
wdyt?

Contributor

I generally have an issue with having a single condition for delta + full snapshot backup. It will become a LOT easier if we have separate conditions.

Contributor Author

That's the reason we pass the deltaSnapshotRenewalGracePeriod as 5*etcd.Spec.Backup.DeltaSnapshotPeriod.Duration, to allow backup sidecar that much time to start delta snapshotting. It depends on how we define RPO, and right now RPO loosely means the delta snapshot period (1x). I've removed the mention of RPO in the comment to avoid any ambiguity, since we still don't define an official RPO for etcds managed by druid, so it doesn't make sense to bake that into the code now, until we have more clarity.

@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 6, 2023
@shreyas-s-rao
Contributor Author

/hold
To be re-looked at from the perspective of having two separate conditions for FullBackupReady and DeltaBackupReady as suggested by @unmarshall .
/milestone v0.20.0

@gardener-robot gardener-robot modified the milestones: v0.19.0, v0.20.0 Jul 6, 2023
@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Jul 6, 2023
@gardener-prow

gardener-prow bot commented Jul 31, 2023

@shreyas-s-rao: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-etcd-druid-e2e-kind | 9645733 | link | true | /test pull-etcd-druid-e2e-kind |
| pull-etcd-druid-e2e-kind-alpha-features | 9645733 | link | true | /test pull-etcd-druid-e2e-kind-alpha-features |


@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Aug 3, 2023
@gardener-robot

@shreyas-s-rao You need to rebase this pull request onto the latest master branch. Please check.

@shreyas-s-rao shreyas-s-rao removed this from the v0.20.0 milestone Aug 11, 2023
@shreyas-s-rao
Contributor Author

After an out-of-band discussion among @unmarshall, @seshachalam-yv, @aaronfern and me, we concluded that it is not simple to handle all cases of successful, failed, skipped and missed snapshots by etcd-backup-restore, as well as missed renewals of the snapshot lease. Instead, we will solve this holistically as part of #702, where the EtcdMember status.snapshots.last[Full|Delta] will include an additional field, state, with possible values Succeeded, Failed or Skipped, to correctly reflect the state of the snapshot.

This PR will be closed in favour of #729 , which focuses on fixing the problem of hardcoded value of 24h for checking full snapshot staleness.
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 29, 2023
@shreyas-s-rao shreyas-s-rao deleted the enhance/backup-ready-conds branch November 29, 2023 13:15
Development

Successfully merging this pull request may close these issues.

[Enhancement] Improve BackupReady conditions