
Standardize CSI recover loop logging for clearer observability #5663

@mrhapile

The CSI recover loop logs its recovery behavior, but log severity is inconsistent and the messages lack the context needed to follow recovery state transitions.

Specifically:

  • Recoverable failures (e.g. mount / unmount failures) are logged as errors
  • Some logs lack key context such as mount path, source path, or mount count
  • Operators cannot easily tell:
    • When recovery starts
    • When recovery is skipped
    • When recovery succeeds
    • Why retries continue to happen

This makes diagnosing CSI recovery behavior difficult in production clusters.

Why this matters

  • CSI recovery runs continuously as a DaemonSet
  • Logs are the primary debugging signal for operators
  • Incorrect log levels add noise and obscure real failures
  • Clear observability improves maintainability without changing behavior

Proposed Solution

Improve observability of the CSI recover loop by:

  • Standardizing log levels:
    • Info for normal flow and state transitions
    • Warning for recoverable mount / unmount failures
    • Error for unexpected or API-level failures

  • Adding structured context to logs (mountPath, source, count, thresholds)

  • Logging recovery start, skip, cleanup, and success paths consistently

  • Keeping the change log-only (no behavior changes)
