[disaster recovery] Cluster unable to recover after crash during intensive writing operations #5836
Comments
This is good to know. We recently disabled probes on Alpha since the recovery time is extremely long when a pod bounces, exceeding any reasonable
After more tests, we can now conclude that probes do not help with cluster recovery. Once a disaster happens and all zero and alpha replicas collapse suddenly, we have been completely unable to recover the database. This discovery is extremely worrying for us, as the cluster won't recover from a general outage of resources (e.g. a datacenter failure, a Kubernetes failure, a panic introduced by an update). We have not been able to bring a single cluster back online after a non-graceful shutdown.
Hey @christian-roggia, I've been running into a problem with similar symptoms, as outlined here: https://discuss.dgraph.io/t/active-mutations-stuck-forever/7140. Do you see your mutations stuck like this? If I overload a node with writes for long enough, I get into this state and am unable to recover even if I cut all load. I'm not running on preemptible nodes, but I am running with the official Helm chart on GKE.
@christian-roggia has this been resolved in master yet? I think I've experienced something like this as well locally (I wasn't running Kubernetes, just the alpha and zero binaries themselves).
@seanlaff we are setting up monitoring with Datadog right now to better understand where the underlying issue might be coming from.

@dmitryyankowski AFAIK there is no fix yet; I also believe it is still unknown where in the code (or in which component) this issue is located.

We confirmed that this issue does not appear only with disasters where all nodes experience an outage: it is enough for a single alpha node to go suddenly offline during write-intensive operations, and the cluster won't recover from this state. It might be an issue related to resource starvation or too heavy a load for the workers, but in general I would not expect the database to collapse and never recover.

We switched to non-preemptible machines where dgraph had around 4 CPUs and 30 GB of RAM for each alpha replica (for a total of 12 CPUs and 90 GB of RAM). One alpha node restarted due to a change in the StatefulSet configuration; this happened in the middle of a full import of our dataset (some hundreds of millions of nodes). The database collapsed and did not recover, with the same symptoms described above. We had to wipe dgraph again; this time we restarted the cluster with HA disabled (single node) and ran the full import successfully.

The current situation is not ideal, and I am still worried about large and heavy operations once we go to production and HA is enabled.
Thanks for all that info @christian-roggia. Please keep us updated in this thread if you run into more issues. Definitely want to keep an eye on this!
@christian-roggia I don't think scaling back the cluster will work. I am not sure exactly what would happen, but we can try it on our end. Restarting nodes with clean storage makes things worse, because those nodes then need to catch up.

The logs don't seem like an issue. Skipping a snapshot just means there are not enough entries in the write-ahead log to guarantee making one.

Also, what are the options you are passing to

Also, can you be more specific about the symptoms? Do requests to the cluster fail immediately, or do they hang? If they hang, I suspect the cluster is recovering, but it's doing the recovery while receiving new mutations, so it's still catching up. Can you try this with no load on the cluster? Just start a new cluster, add some data, and wait for it to crash, but stop sending load to the cluster before the crash. If the issue is indeed related to having to replay a lot of entries in the write-ahead log, then the cluster should recover fine when there's no load.

We'll look into running Dgraph on preemptible nodes on our end.
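The snapshot-skipping behavior described above comes down to a simple threshold check: a Raft snapshot is only worth taking once enough write-ahead-log entries have accumulated since the last one. A minimal Go sketch of that kind of check (the function name and threshold are illustrative, not Dgraph's actual implementation):

```go
package main

import "fmt"

// shouldSnapshot reports whether enough write-ahead-log entries have
// accumulated since the last snapshot to justify taking a new one.
// Below the threshold, the node logs that it is "skipping" the snapshot.
func shouldSnapshot(entriesSinceSnapshot, threshold int) bool {
	return entriesSinceSnapshot >= threshold
}

func main() {
	for _, n := range []int{100, 10000} {
		if shouldSnapshot(n, 1000) {
			fmt.Printf("%d entries since last snapshot: taking snapshot\n", n)
		} else {
			fmt.Printf("%d entries since last snapshot: skipping, not enough entries\n", n)
		}
	}
}
```

Under this reading, the "skipping snapshot" log lines are benign in themselves; the concern is only the absence of any other output.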
@christian-roggia @seanlaff were you running dgraph in ludicrous mode?
@jarifibrahim No ludicrous mode for me. However, I have been using

My reproduction was: aim a very high write workload at dgraph, more than the hardware can handle. Eventually a node will OOM. At that point I cut all load, but even after waiting 24 hours the mutations never progressed.
Another data point here:
badger_v2_disk_writes_total is slowly incrementing according to the Prometheus stats, even running master@156bc23bca4ef941ed5b1e84638961764bd59f27.
@martinmr we use the standard Helm chart for deployment on Kubernetes; the following args are passed to alpha:
The alpha nodes are hanging and no longer responding to any query or mutation, and the CPU and RAM are no longer under intensive utilization. We tried letting the cluster run for a few hours to check whether it would recover, but it didn't. We are now planning to set up Datadog so that we can properly monitor whether the cluster is doing anything at all.

Also, once an alpha node goes down there is no load on the cluster anymore, since it rejects all mutations with a timeout (context deadline exceeded).

I would suggest that, if the cluster is indeed recovering, it would be a nice addition to have a goroutine ticking every 1-5 minutes that reports the recovery progress in the logs, so that the cluster doesn't look completely unresponsive when it is in fact performing operations in the background.
We reported this log as a symptom of the disaster because it is the only log available after a crash. There are no other logs of any kind, just Alpha notifying us that it is skipping snapshots. This makes it even more confusing to understand what is really going on.
Thank you, hopefully it will also help the Dgraph team simulate real disasters. @jarifibrahim we are not using ludicrous mode.
GitHub issues have been deprecated.
What version of Dgraph are you using?
This issue has been consistently observed in all the following versions:
Have you tried reproducing the issue with the latest release?
Yes.
What is the hardware spec (RAM, OS)?
Steps to reproduce the issue (command/config used to run Dgraph).
Expected behavior and actual result.
The expected result is that Dgraph is able to recover after a sudden restart.
It is also expected that some partial data loss may be observed, as Dgraph was not shut down gracefully.
NOTE: Preemptible machines are not intended to host databases, but they are a very good simulation of a disaster (e.g. an outage at a cloud provider, a network failure, an unexpected crash of Dgraph).
The actual behavior of Dgraph is instead a complete failure of the cluster, which is unable to recover and enters a state where it is no longer reachable.
What is important to notice here is
and
Attempted solutions to the issue
We tried the following steps to force the recovery of the database:
After scaling the entire cluster down to 0 replicas and back up to 3 replicas for a clean restart, the following logs appear:
Additional notes
We suspect that the issue is caused by dgraph-alpha, which accepts and continues to process new entries even when the cluster is not ready (i.e. the majority of replicas are still offline). The error might also be caused by dgraph-zero not being available, or by the sudden crash of the entire cluster, which is comparable to a disaster.
We just enabled liveness and readiness probes on our deployments. We urge the Dgraph team to enable such probes by default, as shipping a Helm chart with probes disabled is unexpected. We also suspect that having probes disabled contributes significantly to this issue, as the cluster resumes write operations too early, before it is actually ready.
This behavior was also observed when the crash was only partial, with the majority of replicas (2 out of 3) going offline and restarting.
We strongly encourage the Dgraph team to simulate disasters with preemptible machines, as this is a valuable experiment for verifying Dgraph's disaster-recovery capabilities. If a scenario like the one described here were to happen in a production environment, where the cluster experiences a sudden disruption of services and is unable to recover, it would likely have a tremendous impact on normal operations as well as business operations, with long downtime for any service that relies on Dgraph.