Description
Problem
When running scheduled backups with retention policies, we observe transient errors:
{"level":"error","msg":"Error while updating the recovery window in the ObjectStore status stanza. Skipping.","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"error","msg":"Retention policy enforcement failed","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}
Root Cause Analysis
After investigating the plugin source code, we identified that the updateRecoveryWindow function in internal/cnpgi/instance/recovery_window.go performs a direct status update without retry logic:
// recovery_window.go:40
func updateRecoveryWindow(...) error {
    // ... builds status ...
    return c.Status().Update(ctx, objectStore) // No retry on conflict
}
This function is called from two places that can run concurrently:
- backup.go:169 - After a backup completes successfully
- retention.go:66 - During periodic retention policy enforcement (default every 5 minutes)
When both operations happen close together, Kubernetes optimistic concurrency control rejects one update because the resourceVersion changed between read and write.
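For context, this rejection is the standard Kubernetes 409 Conflict returned when an Update carries a stale resourceVersion. A minimal sketch (ours, not plugin code) showing that the logged error is exactly that class and can be recognized with the stock apimachinery helper:
package main

import (
    "errors"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
    // Build the same kind of error the API server returns when the stored
    // resourceVersion no longer matches the one carried by the update.
    err := apierrors.NewConflict(
        schema.GroupResource{Group: "barmancloud.cnpg.io", Resource: "objectstores"},
        "cluster-name-backup",
        errors.New("the object has been modified; please apply your changes to the latest version and try again"),
    )

    // retry.RetryOnConflict retries precisely on this predicate.
    fmt.Println(apierrors.IsConflict(err)) // true
}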
Evidence
The same file already has a function that correctly handles this scenario:
// recovery_window.go:65 - setLastFailedBackupTime
func setLastFailedBackupTime(...) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        var objectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, objectStoreKey, &objectStore); err != nil {
            return err
        }
        // ... update status ...
        return c.Status().Update(ctx, &objectStore)
    })
}
The setLastFailedBackupTime function uses retry.RetryOnConflict, which:
- Gets a fresh copy of the resource before updating
- Retries on conflict with exponential backoff
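For reference, retry.DefaultBackoff is defined in k8s.io/client-go/util/retry roughly as follows (values as of recent client-go releases; worth confirming against the version vendored by the plugin), so a transient conflict costs at most a handful of quick retries:
var DefaultBackoff = wait.Backoff{
    Steps:    4,                     // up to four attempts
    Duration: 10 * time.Millisecond, // initial delay
    Factor:   5.0,                   // delay multiplier per step
    Jitter:   0.1,                   // ±10% randomization
}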
Impact
- Severity: Low - backups complete successfully, status eventually updates
- User experience: Confusing error messages in logs
- Frequency: Depends on backup/retention timing overlap (we see ~2 errors per 24h)
Proposed Fix
Apply the same retry pattern to updateRecoveryWindow:
func updateRecoveryWindow(
    ctx context.Context,
    c client.Client,
    backupList *catalog.Catalog,
    objectStore *barmancloudv1.ObjectStore,
    serverName string,
) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        // Get fresh copy
        var freshObjectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, client.ObjectKeyFromObject(objectStore), &freshObjectStore); err != nil {
            return err
        }
        // Build recovery window
        convertTime := func(t *time.Time) *metav1.Time {
            if t == nil {
                return nil
            }
            return ptr.To(metav1.NewTime(*t))
        }
        if freshObjectStore.Status.ServerRecoveryWindow == nil {
            freshObjectStore.Status.ServerRecoveryWindow = make(map[string]barmancloudv1.RecoveryWindow)
        }
        recoveryWindow := freshObjectStore.Status.ServerRecoveryWindow[serverName]
        recoveryWindow.FirstRecoverabilityPoint = convertTime(backupList.GetFirstRecoverabilityPoint())
        recoveryWindow.LastSuccessfulBackupTime = convertTime(backupList.GetLastSuccessfulBackupTime())
        freshObjectStore.Status.ServerRecoveryWindow[serverName] = recoveryWindow
        return c.Status().Update(ctx, &freshObjectStore)
    })
}
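A note on the proposed code: the objectStore argument is only used to derive the object key. Every retry re-reads the ObjectStore and rebuilds the recovery window on top of that fresh copy, so the caller's possibly stale in-memory object is never written back to the API server.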
Environment
- Plugin version: 0.10.0
- CNPG Operator: 1.26+
- Kubernetes: 1.29+
- Object storage: AWS S3
We're happy to submit a PR if this approach looks correct.
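If it helps review, here is a quick self-contained check (package and test name are placeholders, not taken from the existing suite) showing that retry.RetryOnConflict absorbs a single occurrence of exactly the error class seen in the logs; a fuller regression test in the PR would instead drive updateRecoveryWindow against a fake client that rejects the first status update:
package instance

import (
    "errors"
    "testing"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/util/retry"
)

func TestConflictIsRetried(t *testing.T) {
    attempts := 0
    err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        attempts++
        if attempts == 1 {
            // First attempt fails the same way the API server rejects a stale update.
            return apierrors.NewConflict(
                schema.GroupResource{Group: "barmancloud.cnpg.io", Resource: "objectstores"},
                "cluster-name-backup",
                errors.New("the object has been modified; please apply your changes to the latest version and try again"),
            )
        }
        return nil
    })
    if err != nil {
        t.Fatalf("expected the conflict to be retried away, got: %v", err)
    }
    if attempts != 2 {
        t.Fatalf("expected 2 attempts, got %d", attempts)
    }
}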