fix(storage): fix purge infinite loop and garbled errors on object-locked buckets#832

Merged
natalie-o-perret merged 4 commits into master from fix/storage-purge-infinite-loop
May 6, 2026

natalie-o-perret (Contributor) commented May 5, 2026

Problem

exo storage purge hangs forever on buckets with object lock enabled.

3 bugs in DeleteObjectVersions (pkg/storage/sos/object.go):

  1. Infinite loop: when DeleteObjects returns HTTP 200 with per-object errors (objects are locked), the Deleted slice is empty. No progress check meant the loop kept re-listing and re-submitting the same objects forever.

  2. Garbled errors: types.Error has *string fields. Formatting with %v printed raw pointer addresses instead of the actual message:

    Error happened: delete error: {<nil> 0xc332882c390 0xc332882c3b0 0xc332882c3a0}
    
  3. Skipped pages: ListObjectVersions never forwarded KeyMarker/VersionIdMarker, so only the first page of versions was processed.
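Bug 2 can be reproduced in isolation. The sketch below is illustrative only: `deleteError` is a hypothetical stand-in for the SDK's `types.Error`, `toString` mimics what `aws.ToString` does (dereference, or return an empty string for nil), and the error code and message are made-up sample values.

```go
package main

import "fmt"

// deleteError is a hypothetical stand-in for types.Error:
// every field is a *string, as described in the PR.
type deleteError struct {
	Code      *string
	Key       *string
	Message   *string
	VersionId *string
}

// toString mimics aws.ToString: it dereferences p, or returns "" for nil.
func toString(p *string) string {
	if p == nil {
		return ""
	}
	return *p
}

// formatDeleteError renders the fields a user actually needs to read.
func formatDeleteError(e deleteError) string {
	return fmt.Sprintf("key=%q version=%q code=%s message=%s",
		toString(e.Key), toString(e.VersionId), toString(e.Code), toString(e.Message))
}

// strPtr is a small helper for building sample values.
func strPtr(s string) *string { return &s }

func main() {
	// Sample values, invented for this sketch.
	e := deleteError{
		Code:    strPtr("AccessDenied"),
		Key:     strPtr("test/hello.txt"),
		Message: strPtr("object is locked"),
	}
	// %v on a struct prints nested *string fields as addresses (or <nil>).
	fmt.Printf("garbled:  %v\n", e)
	// Dereferencing each field yields the human-readable details.
	fmt.Println("readable:", formatDeleteError(e))
}
```

The first line of output contains addresses that change between runs; the second contains the key, code and message.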

Fix

  • Exit early when no objects were successfully deleted (all locked).
  • Paginate ListObjectVersions with KeyMarker/VersionIdMarker.
  • Format delete errors with aws.ToString() on each *string field.
  • Fix a select-race in the purge command handler (cmd/storage_purge.go).

Before

=== RUN   TestDeleteObjectVersions_HappyPath
--- PASS: TestDeleteObjectVersions_HappyPath (0.00s)
=== RUN   TestDeleteObjectVersions_ComplianceLock
panic: test timed out after 15s
        running tests:
                TestDeleteObjectVersions_ComplianceLock (15s)

goroutine 10 [runnable]:
github.com/exoscale/cli/pkg/storage/sos_test.drainDeleteObjectVersions(...)
        pkg/storage/sos/object_test.go:620 +0x110
github.com/exoscale/cli/pkg/storage/sos_test.TestDeleteObjectVersions_ComplianceLock(...)
        pkg/storage/sos/object_test.go:707 +0x197

FAIL    github.com/exoscale/cli/pkg/storage/sos 15.104s

After

=== RUN   TestDeleteObjectVersions_HappyPath
--- PASS: TestDeleteObjectVersions_HappyPath (0.00s)
=== RUN   TestDeleteObjectVersions_ComplianceLock
--- PASS: TestDeleteObjectVersions_ComplianceLock (0.00s)
=== RUN   TestDeleteObjectVersions_Pagination
--- PASS: TestDeleteObjectVersions_Pagination (0.00s)
PASS    github.com/exoscale/cli/pkg/storage/sos 0.005s

Tests

  • Unit tests for the happy path, compliance-lock exit, and pagination (object_test.go).
  • Integration test against a live SOS bucket with GOVERNANCE object lock (tests/integ/with-api/storage_purge_object_lock_test.go).

End-to-end repro script (boto3, no aws CLI required)

#!/usr/bin/env python3
# Requirements: pip install boto3
# Env: EXOSCALE_API_KEY, EXOSCALE_API_SECRET, EXOSCALE_DEFAULT_ZONE
# Binary: make build  (or set EXO_BIN=path/to/exo)

import os
import random
import re
import subprocess
import time

import boto3
from botocore.config import Config

PURGE_TIMEOUT = 10

api_key    = os.environ["EXOSCALE_API_KEY"]
api_secret = os.environ["EXOSCALE_API_SECRET"]
zone       = os.environ["EXOSCALE_DEFAULT_ZONE"]
endpoint   = os.environ.get("EXOSCALE_SOS_ENDPOINT", f"https://sos-{zone}.exo.io")
exo_bin    = os.environ.get("EXO_BIN", "bin/exo")

s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id=api_key,
    aws_secret_access_key=api_secret,
    region_name=zone,
    config=Config(signature_version="s3v4"),
)

bucket = f"repro-purge-lock-{random.randint(0, 999999):06d}"
print(f"bucket: {bucket}")

s3.create_bucket(Bucket=bucket, ObjectLockEnabledForBucket=True)

def cleanup():
    print("\ncleaning up...")
    try:
        resp = s3.list_object_versions(Bucket=bucket)
        for v in resp.get("Versions", []) + resp.get("DeleteMarkers", []):
            s3.delete_object(
                Bucket=bucket,
                Key=v["Key"],
                VersionId=v["VersionId"],
                BypassGovernanceRetention=True,
            )
    except Exception as e:
        print(f"  warning: {e}")
    try:
        s3.delete_bucket(Bucket=bucket)
        print(f"  bucket {bucket} deleted")
    except Exception as e:
        print(f"  warning: {e}")

s3.put_object_lock_configuration(
    Bucket=bucket,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 1}},
    },
)

for name, body in [("hello.txt", b"hello world\n"), ("world.txt", b"world\n")]:
    s3.put_object(Bucket=bucket, Key=f"test/{name}", Body=body)
    print(f"uploaded: test/{name}")

print(f"\nrunning: {exo_bin} storage purge sos://{bucket}/ (timeout={PURGE_TIMEOUT}s)")

start = time.monotonic()
try:
    result = subprocess.run(
        [exo_bin, "-z", zone, "storage", "purge", "--force", f"sos://{bucket}/"],
        capture_output=True,
        text=True,
        timeout=PURGE_TIMEOUT,
    )
    elapsed = time.monotonic() - start
    output = result.stdout + result.stderr
    print(f"exited after {elapsed:.1f}s (rc={result.returncode})")
    print(output or "  (empty)")
    if re.search(r"0x[0-9a-fA-F]{8,}", output):
        print("FAIL: raw pointer addresses in output (bug #2)")
    else:
        print("OK: no pointer addresses")
except subprocess.TimeoutExpired:
    print(f"FAIL: did not exit after {time.monotonic() - start:.0f}s -- infinite loop (bug #1)")
finally:
    cleanup()

Note

GitHub Copilot was used to:

  • Sketch the Python script used for (integration) testing purposes
  • Draft Go (unit) tests
  • Rephrase this PR description

- stop looping when no objects are deleted (e.g. compliance retention)
- fix missing pagination in ListObjectVersions (KeyMarker/VersionIdMarker)
- fix select race on closed deletedChan using nil-channel pattern

- ComplianceLock: verify goroutine exits after one failed batch instead
  of looping forever re-listing the same locked objects
- Pagination: verify KeyMarker/VersionIdMarker are forwarded correctly
  on truncated responses
- HappyPath: verify normal two-version delete works end-to-end

types.Error fields are all *string pointers; printing them with %v produces
pointer addresses instead of the actual error details. Format them with
aws.ToString to get the key, version, message and code.

Reproduces the bug where exo storage purge looped forever on a bucket
with object lock enabled. Sets up a real bucket with GOVERNANCE retention
via the aws CLI, uploads objects, then runs exo storage purge with a 10s
deadline to catch the loop. Also asserts errors are human-readable (not
raw pointer addresses).

Cleanup bypasses GOVERNANCE retention so the bucket is removed after
the test regardless of outcome.

natalie-o-perret force-pushed the fix/storage-purge-infinite-loop branch 2 times, most recently from 9c970d4 to 464ee76, May 5, 2026 16:33
natalie-o-perret (Contributor, Author) commented:

[SC-176386]

natalie-o-perret marked this pull request as ready for review May 5, 2026 16:55
natalie-o-perret requested a review from a team May 6, 2026 12:56
natalie-o-perret merged commit 1c3e89b into master May 6, 2026
11 of 12 checks passed
natalie-o-perret deleted the fix/storage-purge-infinite-loop branch May 6, 2026 14:12