Incident Post‐Mortem Template
- Incident ID: [Unique Incident ID, e.g., INC-YYYY-MM-DD-XXX]
- Date of Incident: [YYYY-MM-DD]
- Time of Incident (Start - End): [HH:MM UTC/PDT] - [HH:MM UTC/PDT]
- Affected Systems/Services: Redis Cluster on OpenShift, Klamm application
- Incident Lead/Responder(s): Klamm team
- Severity Level: Critical
Brief Description: On July 9, 2025, the Klamm Cache Redis cluster suffered a critical outage. A large spike in I/O operations from a Klamm Laravel worker drove a sharp increase in Redis cache requests. The resulting writes rapidly filled the Redis master's persistent volume, producing the error "Fail to fsync the AOF file: No space left on device".
Because Redis could not complete its writes to the full disk, the Append-Only File (AOF) was left containing an "invalid byte sequence". When the Redis container restarted and tried to load the corrupted AOF, it could not parse the invalid data, and the pod entered a persistent crash loop. With the Redis cache down, queue workers timed out, severely impacting dependent services.
Business/User Impact: Klamm production was completely unavailable for about 30 minutes, outside working hours. Data stored on the Redis volume was lost when the PVC was deleted and recreated.
Provide a chronological list of events with precise timestamps.
- [YYYY-MM-DD HH:MM UTC/PDT] - [Event Description, e.g., Automated alert triggered, etc.]
What Happened?
The incident began when the Redis master pod could not write to its Append-Only File (AOF) because the persistent volume assigned to it had run out of space. The Redis log line "# Fail to fsync the AOF file: No space left on device" shows that the underlying storage was full, so Redis could not flush buffered write operations to disk. Redis relies on fsync to persist data; when the call failed, Redis entered an unrecoverable state and its process terminated, which in OpenShift crashed the Redis container.
On each restart attempt, Redis tried to load the existing AOF to recover its dataset. Because the disk had filled mid-write, the AOF had not been written cleanly and was corrupted, which produced an "invalid byte sequence" error. Redis could not parse the file, so the container crashed repeatedly and the pod never became ready. With the Redis pod down, the cache was unavailable and our queue workers timed out because they could not reach Redis.
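The failure mode described above, a write cut off partway through a command, can be illustrated with a small sketch. It assumes the AOF consists only of plain RESP-encoded commands (no RDB preamble or multi-part manifest), which may not match the configuration in this incident. The helper names `encode_command` and `last_complete_command_offset` are hypothetical and written for illustration; Redis ships its own `redis-check-aof` tool for inspecting real files, and this sketch is not a substitute for it.

```python
from typing import Tuple


def encode_command(*args: bytes) -> bytes:
    """Encode one command the way a plain-RESP AOF stores it."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        out += b"$%d\r\n%s\r\n" % (len(arg), arg)
    return out


def _read_line(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Return the line starting at pos (without CRLF) and the next offset."""
    end = buf.find(b"\r\n", pos)
    if end == -1:
        raise ValueError("line is not terminated")
    return buf[pos:end], end + 2


def _skip_command(buf: bytes, pos: int) -> int:
    """Return the offset just past one complete command, or raise ValueError."""
    header, pos = _read_line(buf, pos)
    if not header.startswith(b"*"):
        raise ValueError("expected an array header")
    argc = int(header[1:])
    for _ in range(argc):
        size_line, pos = _read_line(buf, pos)
        if not size_line.startswith(b"$"):
            raise ValueError("expected a bulk string header")
        size = int(size_line[1:])
        if size < 0:
            raise ValueError("negative bulk string length")
        end = pos + size
        # The payload must be followed by CRLF; a torn write fails here.
        if buf[end:end + 2] != b"\r\n":
            raise ValueError("bulk string is truncated or malformed")
        pos = end + 2
    return pos


def last_complete_command_offset(aof: bytes) -> int:
    """Return the offset just past the last fully written command."""
    pos = 0
    good = 0
    while pos < len(aof):
        try:
            pos = _skip_command(aof, pos)
        except ValueError:
            break  # everything after `good` is incomplete or unparseable
        good = pos
    return good


if __name__ == "__main__":
    complete = encode_command(b"SET", b"k1", b"v1")
    # Simulate the disk filling up partway through the next command.
    torn = complete + encode_command(b"SET", b"k2", b"value2")[:-7]
    print(len(torn), last_complete_command_offset(torn))
```

Everything past the returned offset is the part a loader cannot parse. Whether that tail can be safely discarded depends on the data and is a judgment call outside the scope of this sketch.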
Why Did it Happen?
- Unexpected Workload Surge: A large, unforeseen spike in I/O operations from a Klamm Laravel worker led to an excessive volume of Redis cache requests.
- Insufficient Persistent Volume Capacity: The persistent volume allocated to the Redis master was not adequately sized to handle the rapid AOF growth triggered by the workload spike, resulting in disk space exhaustion.
- AOF Corruption due to Disk Full Condition: The AOF file became corrupted as a direct consequence of Redis being unable to perform complete and reliable fsync operations to a full disk. This left the AOF in an inconsistent state, leading to the "invalid byte sequence" error on subsequent startup attempts.
Contributing Factors:
- The Klamm Redis cache volume was under-provisioned: its storage request was only 50Mi.
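The sizing concern can be made concrete with a little arithmetic. The sketch below estimates how long a volume lasts under sustained AOF growth. The 50Mi capacity comes from this report; the amount already used and the growth rate are hypothetical placeholders, since no measured write rate was recorded for the incident, and the function name `seconds_until_full` is illustrative.

```python
MIB = 1024 * 1024


def seconds_until_full(capacity_bytes: int, used_bytes: int,
                       growth_bytes_per_sec: float) -> float:
    """Estimate seconds until the volume is full at a constant growth rate."""
    if growth_bytes_per_sec <= 0:
        return float("inf")  # not growing, so it never fills
    free_bytes = max(capacity_bytes - used_bytes, 0)
    return free_bytes / growth_bytes_per_sec


if __name__ == "__main__":
    # Hypothetical inputs: 10 MiB already used, AOF growing at 1 MiB/s.
    used = 10 * MIB
    rate = 1 * MIB
    for capacity_mib in (50, 100):
        remaining = seconds_until_full(capacity_mib * MIB, used, rate)
        print(f"{capacity_mib}Mi volume: about {remaining:.0f} s until full")
```

Substituting an observed growth rate for the placeholder gives a rough check on whether a proposed volume size leaves enough time to react to a surge.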
Actions Taken During Incident (for immediate resolution):
- Scaled down the StatefulSet and Deployment.
- Deleted the PVC holding the corrupted AOF and recreated it.
- Restarted the StatefulSet and Deployment.
Future Actions / Preventative Measures:
- Task: Increase Klamm Redis cache volume to 100Mi. Owner: [Team/Individual]. Due Date: [YYYY-MM-DD]. Status: Open
- Task: Integrate Sysdig monitoring on namespace. Owner: [Team/Individual]. Due Date: [YYYY-MM-DD]. Status: Open
- Task: Enable replica pods on Redis cluster. Owner: [Team/Individual]. Due Date: [YYYY-MM-DD]. Status: Open
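As an illustration of the kind of signal the monitoring task is meant to surface, the sketch below classifies a volume's usage against warning and critical thresholds. It is independent of Sysdig and does not use its API. The thresholds (80% and 95%) and the function names are assumptions chosen for the example, not values agreed by the team.

```python
import shutil


def classify_usage(used_bytes: int, total_bytes: int,
                   warn_at: float = 0.80, crit_at: float = 0.95) -> str:
    """Map a used/total ratio to 'ok', 'warning', or 'critical'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= crit_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"


def check_volume(path: str) -> str:
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # "." is a stand-in; the Redis data mount path is not given in this report.
    print(check_volume("."))
```

An alert raised at the warning level would have fired before fsync began to fail, leaving time to intervene before the disk filled.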
What Could Be Improved?
TBA
Knowledge Gaps Identified
TBA
Relevant Logs:
- Kibana query
kubernetes.pod_name:"klamm-cache-redis-master-0" AND kubernetes.namespace_name:"ed84ea-prod" AND kubernetes.container_name.raw:"redis"
Monitoring Dashboards:
- See Kibana query.
OpenShift Events:
- OCP events not retained.
Configuration Files:
- Helm chart updates
Related Tickets/Issues:
- N/A
Communication:
- Teams
- OCP Notifications