[Bug]: Primary PostgreSQL Instance Not Gracefully Shutting Down and Missing WAL File #3680
Comments
From what I can see, it looks like an issue with the underlying file system (data corruption of that file). You need to recreate the PVC of that instance.
This has happened two days in a row. We are using EC2 Spot instances, and we managed to solve it by deleting the problematic PVC (gp3), but manually recreating the PVC every day is probably not the desired behavior.
This might be related to #3698. Give us some time to investigate.
I face the same issue in multiple CNPG clusters within the same Kubernetes cluster. I think it occurs when a network error happens. The only way to resolve it is to delete the PVC of the failing pod and remove the pod; the operator then recreates the failing replica. Logs:
We are facing the same problem on Google GKE clusters. After each node upgrade, random instances have problems recovering. The only workaround we found is removing the PVC of one of the failing instances.
Just encountered the same issue in GKE.
I keep running into this every other day.
Also discovered this now. Is the CNPG team aware of this? Whom should I ping about this issue? The only working solution was to delete the failing Pod and PVC so that a new instance is created to join the cluster.
I am facing a similar issue. Is there any fix available for it? I tried deleting the pod, but it did not help.
@gbartolini is it possible to modify the health check behavior of the pods managed by CNPG? |
@gbartolini Any updates on this would be greatly appreciated 🙏🏿 |
The only way I've been able to recover from this is to delete both the PVC and the Pod.
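Several commenters converge on the same manual workaround: delete the failing instance's PVC and Pod so the operator re-provisions the replica from the current primary. A minimal sketch of those steps, assuming the failing instance is `postgres-1` in namespace `default` (both names are assumptions; CNPG names each instance's PVC after its pod). The commands are echoed here rather than executed; drop the `echo` to actually run them.

```shell
# Hypothetical names: adjust NAMESPACE and FAILING to your cluster.
NAMESPACE=default
FAILING=postgres-1   # failing instance; its PVC shares the same name

# Delete the PVC first (it is only removed once the pod releases it),
# then the pod; the operator then creates a fresh PVC and rejoins the
# instance as a replica.
CMD_PVC="kubectl delete pvc $FAILING -n $NAMESPACE"
CMD_POD="kubectl delete pod $FAILING -n $NAMESPACE"
echo "$CMD_PVC"
echo "$CMD_POD"
```

Note that this discards the instance's local data entirely, which is acceptable only because the replica is rebuilt from the surviving primary.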
Is there an existing issue already for this bug?
I have read the troubleshooting guide
I am running a supported version of CloudNativePG
Contact Details
oren@aperio.ai
Version
1.22.0
What version of Kubernetes are you using?
1.27
What is your Kubernetes environment?
Cloud: Amazon EKS
How did you install the operator?
Helm
What happened?
We deployed the operator using a Helm chart and the PostgreSQL cluster resource using a separate Helm chart. Three instances of PostgreSQL were deployed on EC2 Spot instances. After a few hours, it appears that the primary instance (postgres-1) did not gracefully shut down, leading to the absence of the relevant Write-Ahead Logging (WAL) file.
Subsequently, attempts to use pg_rewind have failed because the process encounters an empty file. The file "pg_xact/0000" is also empty, which prevents recovery and leaves the old primary instance stuck.
Cluster resource
Relevant log output
Code of Conduct