[Bug]: Primary PostgreSQL Instance Not Gracefully Shutting Down and Missing WAL File #3680
Comments
From what I can see, it looks like an issue with the underlying file system (data corruption of that file). You need to recreate the PVC of that instance.
It has happened two days in a row. We are using EC2 spot instances, and we managed to resolve it by deleting the problematic PVC (gp3), but manually recreating the PVC every day is probably not the desired behavior.
This might be related to #3698. Give us some time to investigate.
I am facing the same issue in multiple clusters within the same k8s cluster. I think it occurs when a network error happens. The only way to resolve it is to delete the PVC of the failing pod and then remove the pod; the operator then recreates the failing replica. Logs:
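For reference, a minimal sketch of this delete-and-recreate workaround, assuming a namespace `my-namespace` and a failing instance `pg-cluster-2` (both names are hypothetical placeholders; CNPG normally names the data PVC after the instance pod):

```shell
# Identify the failing instance's PVC (CNPG names the data PVC after the pod).
kubectl get pvc -n my-namespace

# Delete the PVC first, then the pod. The PVC stays in Terminating state
# until the pod that mounts it is gone (pvc-protection finalizer), so
# --wait=false lets the first command return immediately.
kubectl delete pvc pg-cluster-2 -n my-namespace --wait=false
kubectl delete pod pg-cluster-2 -n my-namespace

# With both gone, the operator re-provisions storage and re-clones the
# replica from the current primary.
```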
We are facing the same problem on Google GKE clusters. After each node upgrade, random instances have problems recovering. The only workaround we found is removing the PVC of one of the failing instances.
Just encountered the same issue in GKE.
I keep running into this every other day.
Also just discovered this. Is the CNPG team aware of it? Whom should we ping about this issue? The only working solution was to delete the failing Pod and PVC so that a new instance is created to join the cluster.
I am facing a similar issue. Is any fix available for the issue below? I tried deleting the pod, but it did not help.
@gbartolini is it possible to modify the health check behavior of the pods managed by CNPG?
@gbartolini Any updates on this would be greatly appreciated 🙏🏿
The only way I've been able to recover from this is to delete both the PVC and the Pod.
Same issue here. Luckily, our production cluster survived because one of the three replicas was still working. Deleting both the PVC and the Pod solved the issue, but it is not a permanent solution. We use GKE.
I'm facing the same issue using the CNPG community addon for MicroK8s, i.e. CNPG v1.22.0.
Seems to be gone with the latest v1.23. Same for you?
Indeed, it seems good for my cluster too.
I updated the operator to the latest version yesterday and restored the database; today it's the same again. It makes me nervous and angry. I expanded the PVC and everything is fixed now, but there are no related logs!
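For anyone who wants to try the same expansion workaround: a hedged sketch of growing the volume through the Cluster resource rather than editing the PVC directly (cluster name, namespace, and target size are placeholders; this only works if the storage class has `allowVolumeExpansion: true`):

```shell
# Increase spec.storage.size on the CNPG Cluster; the operator propagates
# the new size to the instance PVCs when the storage class supports
# online volume expansion.
kubectl patch clusters.postgresql.cnpg.io pg-cluster -n my-namespace \
  --type merge \
  -p '{"spec":{"storage":{"size":"20Gi"}}}'

# Watch the PVCs until the new capacity is reflected.
kubectl get pvc -n my-namespace -w
```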
I think I am facing a similar issue. I am running CNPG on AWS Spot Instances provisioned through the Karpenter cluster autoscaler. Currently only one of my three instances is working; the other two don't come up. I am using CNPG Operator v1.23.1. I am seeing error messages like this on the instance that won't come up:

```
{"level":"info","ts":"2024-08-06T10:12:20Z","logger":"postgres","msg":"record","logging_pod":"pg-cluster-3","record":{"log_time":"2024-08-06 10:12:20.737 UTC","process_id":"18434","session_id":"66b1f704.4802","session_line_num":"2","session_start_time":"2024-08-06 10:12:20 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"XX000","message":"could not receive data from WAL stream: ERROR: requested WAL segment 0000000A000001BB00000044 has already been removed","backend_type":"walreceiver"}}
{"level":"info","ts":"2024-08-06T10:12:20Z","logger":"postgres","msg":"record","logging_pod":"pg-cluster-3","record":{"log_time":"2024-08-06 10:12:20.913 UTC","user_name":"postgres","database_name":"postgres","process_id":"18441","connection_from":"[local]","session_id":"66b1f704.4809","session_line_num":"1","session_start_time":"2024-08-06 10:12:20 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend"}}
```

and

```
{"level":"info","ts":"2024-08-06T10:12:21Z","msg":"DB not available, will retry","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-cluster","namespace":"xxx-development"},"namespace":"xxx-development","name":"pg-cluster","reconcileID":"f6812cba-7a90-41fe-a115-01cc8edac2d5","logging_pod":"pg-cluster-3","err":"failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"}
```

Update: I managed to get the cluster back into a healthy state. While the CRD events were indicating a communication issue (saying something like nodes not reporting a healthy status), it turned out the problem was actually corrupted storage. After I deleted the PVCs before deleting the pods, new cluster instances were created, which after a short while showed up as healthy in CNPG's Grafana dashboard.
Is there an existing issue already for this bug?
- [x] I have read the troubleshooting guide
- [x] I am running a supported version of CloudNativePG
Contact Details
oren@aperio.ai
Version
1.22.0
What version of Kubernetes are you using?
1.27
What is your Kubernetes environment?
Cloud: Amazon EKS
How did you install the operator?
Helm
What happened?
We deployed the operator using a Helm chart and the PostgreSQL cluster resource using a separate Helm chart. Three instances of PostgreSQL were deployed on EC2 Spot instances. After a few hours, it appears that the primary instance (postgres-1) did not gracefully shut down, leading to the absence of the relevant Write-Ahead Logging (WAL) file.
Subsequently, attempts to use pg_rewind failed, with the process encountering an empty file. The file "pg_xact/0000" is also empty, contributing to the inability to resolve the issue and leaving the old primary instance stuck.
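To illustrate the state described above, a sketch of how the suspect files can be inspected from inside the stuck instance's pod (the PGDATA path is CNPG's usual default and the namespace is a placeholder; adjust both as needed):

```shell
# CNPG's default data directory inside an instance pod (assumption).
PGDATA=/var/lib/postgresql/data/pgdata

# Check the size of the commit-status file pg_rewind trips over;
# on the stuck former primary it reports 0 bytes.
kubectl exec postgres-1 -n my-namespace -- ls -l "$PGDATA/pg_xact/0000"

# List the WAL directory to see which segments survived the shutdown.
kubectl exec postgres-1 -n my-namespace -- ls -l "$PGDATA/pg_wal"
```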
Cluster resource
Relevant log output
Code of Conduct