backupccl: opaque "context cancelled" error when some nodes can't access backup destination #32392
Comments
This issue is also happening on an on-prem cluster; it started occurring today with the same error.
The question is whether this cluster is perhaps already in a bad state.
@tbg, https://drive.google.com/open?id=1650EzNBA-PZtge6-i-_AcnfUdDKaXNG2 here's a link to the debug zip. Let me know if you need me to look at anything specific.
Is there anything else you need from me or the client, @tbg? They are asking for an update on this issue. Thanks
Sorry for not taking a look earlier.
This typically indicates a serious problem (it's a range that is stuck, i.e. a command was proposed but didn't return in a timely manner).
One known way to get into this situation is a backed-up Raft snapshot queue. Unfortunately, due to #32647, I can only see that n8 has an empty Raft snapshot queue (which is good); other nodes may be completely backed up. Please ask the user to check the "Raft snapshot queue" chart on the "Queues" page. I went ahead and tried to grep out ranges that had pending commands (you'd expect two based on the alert), but none came back. I also found that there was a recent restart due to clock uncertainty. That restart is possibly bogus (0 of 1 is a tight margin), but it's not clear.
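If the Admin UI chart isn't handy, roughly the same information can be pulled over SQL. This is only a sketch, assuming the crdb_internal.node_metrics virtual table and the queue.raftsnapshot.* metrics are exposed on the version in question; it only reflects the node the session is connected to, so it would need to be run against each node.

```sql
-- Sketch: per-store Raft snapshot queue metrics for the current node.
-- Assumes crdb_internal.node_metrics and the queue.raftsnapshot.* metrics
-- exist on this version; a non-zero, growing queue.raftsnapshot.pending
-- would match the backed-up-queue theory above.
SELECT store_id, name, value
  FROM crdb_internal.node_metrics
 WHERE name LIKE 'queue.raftsnapshot%'
 ORDER BY store_id, name;
```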
Not sure what exactly is happening on the cluster. The whole cluster was restarted at 15:12:58.689112. The most recent message indicating a stuck range is from 15:35:40.394714 on r3004 (see attachment); the debug zip was taken closer to I181107 15:43:43.291977. I don't see a
(@nvanbenschoten, any idea what this combination means?) The leader's Raft state shows that one replica believes itself to be the leader of a past term, while the other two are leader and follower in the new term. As you'd expect, the replica with the stuck commands in its command queue is the outdated leader. It's unclear how this comes to be; it could be caused by a network partition. Could you ask the user for a range status report for the affected range? PS: also ask whether direct access to this cluster is within the realm of possibility. The problem here isn't obvious to me.
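For reference, a rough way to pull basic status for the stuck range over SQL is sketched below. It assumes the crdb_internal.ranges virtual table is available on this version (column names vary between releases); the Raft-level detail discussed above still comes from the range debug report rather than SQL.

```sql
-- Sketch: basic status for the stuck range (r3004 in this report).
-- Assumes crdb_internal.ranges exists on this version.
SELECT *
  FROM crdb_internal.ranges
 WHERE range_id = 3004;
```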
No, I don't. I looked through a number of possibilities but I couldn't find a single request that would result in that combination of a local write and multiple local and global reads. It doesn't look like any variant of a commit trigger. It's possible that there are two different requests stuck on the Range.
Hi, we have since scratched the 2.5 TB environment where we first had the problem, but then the problem occurred on a much smaller database (same schema) on-prem. I believe you have access to that cluster, and I also uploaded the debug zip from it. It's sort of weird, as backups work fine on other databases but this database/schema errors out. I'm wondering if it is related to a database object.
@tbg, looked at and
@roncrdb if we indeed have access to that cluster, I'd be happy to take a look. Could you let me know in private how to access it?
This was solved via Slack; closing the issue out.
If I remember correctly, this was about some nodes not being able to connect to the backup destination due to a firewall misconfiguration. The error message needs to be improved.
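Since BACKUP work is distributed, every node has to be able to reach the destination, not just the gateway. As a minimal sanity check (a sketch, assuming SHOW BACKUP is available on the version in question; the bucket URL and credential parameters are placeholders), connecting to each node in turn and reading the destination back would surface this kind of firewall problem as an explicit access error on the blocked nodes:

```sql
-- Sketch: verify the backup destination is readable from the node this
-- session is connected to; repeat against each node in the cluster.
-- The bucket URL and credential parameters below are placeholders.
SHOW BACKUP 's3://bucket-url?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>';
```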
Describe the problem
The user was trying to back up a 2.5 TB database on a 12-node cluster on AWS to an S3 bucket.
The backup failed with an opaque "context cancelled" error.
There are no reports of this in the logs for the day the backup was run (2018-11-07), around 13:38.
The logs do show the following info message:
remote declined preemptive snapshot
The command they ran (anonymized) is:
BACKUP DATABASE <database> TO 's3://bucket-url' AS OF SYSTEM TIME '2018-11-07 10:00:00' WITH revision_history;
Jira issue: CRDB-4751