backupccl: opaque "context cancelled" error when some nodes can't access backup destination #32392

Open
roncrdb opened this issue Nov 15, 2018 · 13 comments
Labels
A-disaster-recovery · C-investigation (Further steps needed to qualify. C-label will change.) · S-3 (Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)) · T-disaster-recovery

Comments

roncrdb commented Nov 15, 2018

Describe the problem

The user was trying to back up a 2.5 TB database on a 12-node cluster on AWS to an S3 bucket.
The backup failed with the error:

pq: context canceled 
Error: pq: context canceled 
Failed running "sql"

There are no reports of this error in the logs for the day the backup was run (20181107), around 13:38.

The logs do show the following info message:

remote declined preemptive snapshot

The command they ran (anonymized) is:

BACKUP DATABASE <database> TO 's3://bucket-url' AS OF SYSTEM TIME '2018-11-07 10:00:00' with revision_history
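
The Failed running "sql" line above suggests the statement was issued through the cockroach sql shell. A hypothetical reconstruction of the invocation (host, certs directory, and database name are placeholders, not taken from the report):

cockroach sql --certs-dir=certs --host=<node-address> \
  -e "BACKUP DATABASE <database> TO 's3://bucket-url' AS OF SYSTEM TIME '2018-11-07 10:00:00' with revision_history"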

Jira issue: CRDB-4751

@roncrdb roncrdb added the C-investigation (Further steps needed to qualify. C-label will change.) label Nov 15, 2018
@tim-o tim-o added the S-3 (Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)) and A-disaster-recovery labels Nov 16, 2018

roncrdb commented Nov 21, 2018

This issue is also happening on an on-prem cluster; it started occurring today with the same "context canceled" error. Waiting on a debug zip to review the logs.

@knz knz changed the title Backup failed with "context canceled" error. backupccl: Backup failed with "context canceled" error. Nov 26, 2018
@knz knz added this to Triage in Bulk IO via automation Nov 26, 2018

knz commented Nov 26, 2018

cc @dt @tbg

tbg commented Nov 26, 2018

The question is whether this cluster is perhaps already in a bad state. A debug zip would be useful (assuming the issue still persists at the time it is taken).

roncrdb commented Nov 26, 2018

@tbg,

Here's a link to the debug zip: https://drive.google.com/open?id=1650EzNBA-PZtge6-i-_AcnfUdDKaXNG2. Let me know if you need me to look at anything specific.

roncrdb commented Nov 27, 2018

Is there anything else you need from me or the client, @tbg? They are asking for an update on this issue.

Thanks

tbg commented Nov 27, 2018

Sorry for not taking a look earlier.

In alerts, I see

  node_id | store_id | category |    description     | value
+---------+----------+----------+--------------------+-------+
        3 |        3 | metrics  | requests.slow.raft |     2

This typically indicates a serious problem (it's a range that is stuck, i.e. a command was proposed but didn't return in a timely manner).

grep -c 'have been waiting' nodes/*/logs/cockroach.*.log | wc -l
     155

We know that one way to get into this situation is a backed-up Raft snapshot queue. Unfortunately, due to #32647, I'm only able to see that n8 has an empty Raft snapshot queue (which is good), but some other nodes may be totally backed up.

Please ask the user to check the "Raft snapshot queue" chart on the "Queues" page.
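
Besides the Admin UI chart, the pending count should also show up in the Prometheus-format metrics endpoint; a rough sketch of a per-node check (assuming the metric is exported as queue_raftsnapshot_pending under /_status/vars and that 8080 is the HTTP port):

# Hypothetical node list; a persistently non-zero value on any store would
# point at a backed-up Raft snapshot queue on that node.
for n in <node1> <node2> <node3>; do
  echo "== $n"
  curl -sk "https://$n:8080/_status/vars" | grep '^queue_raftsnapshot_pending'
done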

I went ahead and tried to grep out some ranges that had pending commands (you'd expect two based on the alert), but none came back.

I also found that there was a recent restart due to clock uncertainty. The restart is possibly bogus (0 of 1 is a tight margin), but it's not clear.

181107 15:12:39.449372 163 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
goroutine 163 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204f6c00, 0xc4204f6cc0, 0x3eb3600, 0x10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:997 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x47b12e0, 0xc400000004, 0x3eb361c, 0x10, 0xf8, 0xc420a56120, 0x88)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:864 +0x8fe
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2ff8ae0, 0xc42097bc20, 0x4, 0x2, 0x0, 0x0, 0xc422e9dc58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2ff8ae0, 0xc42097bc20, 0x1, 0x4, 0x0, 0x0, 0xc422e9dc58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2ff8ae0, 0xc42097bc20, 0xc422e9dc58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/server.NewServer.func1()
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:248 +0xb0
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc420121040, 0xc420445a90, 0xc4204cef40, 0x12, 0xc4200f3380, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:790 +0x667
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1(0x2ff8ae0, 0xc420a4d770)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:668 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc420550f10, 0xc4209ae000, 0xc420a4d740)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad


****************************************************************************
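
The fatal error above points at OS-level clock synchronization rather than anything backup-specific. A generic check that could be run on each host (standard Linux tooling, nothing CockroachDB-specific; assumes either chrony or ntpd is in use):

# Confirm the system clock is reported as synchronized and inspect the offset.
timedatectl status | grep -i synchron
chronyc tracking 2>/dev/null || ntpq -p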

Not sure what exactly is happening on the cluster. The whole cluster was restarted at 15:12:58.689112. The most recent message indicating a stuck range is from 15:35:40.394714 on r3004 (see attachment), while the debug zip was taken closer to I181107 15:43:43.291977.

I don't see a num_pending entry on r3004, which is typically there in this scenario, but the command queue shows something:

  "cmd_q_local": {
    "write_commands": 1,
    "read_commands": 2,
    "tree_size": 2
  },
  "cmd_q_global": {
    "read_commands": 1
  },

(@nvanbenschoten any idea what this combination means?)

The leader's Raft state shows that one replica believes itself to be the leader of a past term, while the other two are leader and follower in the new term. As you'd expect, the replica with the entries in the command queue is the outdated leader.

It's unclear how this comes to be. This could be caused by a network partition.

Could you ask the user for the range status report at /#/reports/range/3004? I'm pretty sure whatever the problem was has cleared up by now, but if not, it'll be good to take further steps.

PS: also ask whether direct access to this cluster is within the realm of possibility. The problem here isn't obvious to me.

r3004.txt

@nvanbenschoten

(@nvanbenschoten any idea what this combination means?)

No, I don't. I looked through a number of possibilities but I couldn't find a single request that would result in that combination of a local write and multiple local and global reads. It doesn't look like any variant of a commit trigger. It's possible that there are two different requests stuck on the Range.

@slateraus

Hi, we have since scratched the 2.5 TB environment where we first had the problem. But then the problem occurred on a much smaller DB (same schema) on-prem. I believe you guys have access to that cluster. I also uploaded the debug zip from that cluster.

It's sort of weird, as I have backups working fine on other databases, but this DB/schema errors. I am wondering if it is related to a database object.


roncrdb commented Nov 27, 2018

@tbg ,

I looked at the alerts for the second debug zip that @slateraus is referring to; nothing shows up there.

And grep -c 'have been waiting' nodes/*/logs/cockroach.*.log | wc -l returns 80.


tbg commented Nov 28, 2018

@roncrdb if we indeed have access to that cluster, I'd be happy to take a look. Could you let me know privately how to access it?


roncrdb commented Jan 29, 2019

This was solved via Slack; closing the issue out.

@roncrdb roncrdb closed this as completed Jan 29, 2019
@tbg tbg reopened this Jan 29, 2019

tbg commented Jan 29, 2019

If I remember correctly, this was about some nodes not being able to connect to the backup destination due to a firewall misconfiguration. The error message needs to be improved.
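
A quick way to confirm that diagnosis on a similar cluster would be to test reachability of the bucket endpoint from every node. A hypothetical sketch (host list, bucket name, and region are placeholders; a 403 from S3 still proves network reachability, whereas a timeout points at a firewall problem):

for host in <node1> <node2> <node3>; do
  echo "== $host"
  ssh "$host" 'curl -sS -m 10 -o /dev/null -w "%{http_code}\n" https://<bucket-name>.s3.<region>.amazonaws.com/'
done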

@tbg tbg changed the title backupccl: Backup failed with "context canceled" error. backupccl: opaque "context cancelled" error when some nodes can't access backup destination Jan 29, 2019
@rolandcrosby rolandcrosby removed this from Triage in Bulk IO Jul 8, 2019
@rolandcrosby rolandcrosby added this to Triage in Disaster Recovery Backlog via automation Jul 8, 2019
@dt dt moved this from Triage to Polish/New Features in Disaster Recovery Backlog Jul 11, 2019
@dt dt moved this from Features and Polish to sort into Backlogs to Backlog: Backup/Restore in Disaster Recovery Backlog Jul 16, 2019
@mwang1026 mwang1026 moved this from Backup/Restore/Dump to errors messages in Disaster Recovery Backlog Aug 27, 2020
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@dt dt moved this from Polish: Errors Messages to Backlog in Disaster Recovery Backlog Feb 13, 2024