backupccl: opaque "context cancelled" error when some nodes can't access backup destination #32392

Open
roncrdb opened this issue Nov 15, 2018 · 13 comments
Labels
A-disaster-recovery · C-investigation (Further steps needed to qualify. C-label will change.) · S-3 (Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)) · T-disaster-recovery

Comments

roncrdb commented Nov 15, 2018

Describe the problem

The user was trying to back up a 2.5 TB database on a 12-node cluster on AWS to an S3 bucket.
The backup failed with the error:

pq: context canceled 
Error: pq: context canceled 
Failed running "sql"

There are no reports of this error in the logs for the day the backup was run (20181107), around 13:38.

The logs do show the following info message:

remote declined preemptive snapshot

The command they ran (anonymized) is:

BACKUP DATABASE <database> TO 's3://bucket-url' AS OF SYSTEM TIME '2018-11-07 10:00:00' with revision_history
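
The Failed running "sql" line above suggests the statement was issued through the cockroach sql shell. A hypothetical reconstruction of the invocation (host, certs directory, and database name are placeholders, not taken from the report):

cockroach sql --certs-dir=certs --host=<node-address> \
  -e "BACKUP DATABASE <database> TO 's3://bucket-url' AS OF SYSTEM TIME '2018-11-07 10:00:00' with revision_history"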

Jira issue: CRDB-4751

@roncrdb roncrdb added the C-investigation (Further steps needed to qualify. C-label will change.) label Nov 15, 2018
@tim-o tim-o added the S-3 (Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)) and A-disaster-recovery labels Nov 16, 2018

roncrdb commented Nov 21, 2018

This issue is also happening on an on-prem cluster; it started occurring today with the same "context canceled" error. Waiting on a debug zip to review the logs.

@knz knz changed the title Backup failed with "context canceled" error. backupccl: Backup failed with "context canceled" error. Nov 26, 2018
@knz knz added this to Triage in Bulk IO via automation Nov 26, 2018

knz commented Nov 26, 2018

cc @dt @tbg

tbg commented Nov 26, 2018

The question is whether this cluster is perhaps already in a bad state. A debug zip would be useful (assuming the issue still persists at the time it is taken).

roncrdb commented Nov 26, 2018

@tbg,

Here's a link to the debug zip: https://drive.google.com/open?id=1650EzNBA-PZtge6-i-_AcnfUdDKaXNG2. Let me know if you need me to look at anything specific.

roncrdb commented Nov 27, 2018

Is there anything else you need from me or the client, @tbg? They are asking for an update on this issue.

Thanks

tbg commented Nov 27, 2018

Sorry for not taking a look earlier.

In alerts, I see

  node_id | store_id | category |    description     | value
+---------+----------+----------+--------------------+-------+
        3 |        3 | metrics  | requests.slow.raft |     2

This typically indicates a serious problem (it's a range that is stuck, i.e. a command was proposed but didn't return in a timely manner).

grep -c 'have been waiting' nodes/*/logs/cockroach.*.log | wc -l
     155

We know that one way to get into this situation is a backed-up Raft snapshot queue. Unfortunately, due to #32647, I'm only able to see that n8 has an empty Raft snapshot queue (which is good), but some other nodes may be totally backed up.

Please ask the user to check the "Raft snapshot queue" chart on the "Queues" page.
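
Besides the Admin UI chart, the pending count should also show up in the Prometheus-format metrics endpoint; a rough sketch of a per-node check (assuming the metric is exported as queue_raftsnapshot_pending under /_status/vars and that 8080 is the HTTP port):

# Hypothetical node list; a persistently non-zero value on any store would
# point at a backed-up Raft snapshot queue on that node.
for n in <node1> <node2> <node3>; do
  echo "== $n"
  curl -sk "https://$n:8080/_status/vars" | grep '^queue_raftsnapshot_pending'
done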

I went ahead and tried to grep out some ranges that had pending commands (you'd expect two based on the alert), but none came back.

I also found that there was a recent restart due to clock uncertainty. The restart is possibly bogus (0 of 1 is a tight margin), but it's not clear.

181107 15:12:39.449372 163 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
goroutine 163 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204f6c00, 0xc4204f6cc0, 0x3eb3600, 0x10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:997 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x47b12e0, 0xc400000004, 0x3eb361c, 0x10, 0xf8, 0xc420a56120, 0x88)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:864 +0x8fe
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2ff8ae0, 0xc42097bc20, 0x4, 0x2, 0x0, 0x0, 0xc422e9dc58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2ff8ae0, 0xc42097bc20, 0x1, 0x4, 0x0, 0x0, 0xc422e9dc58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2ff8ae0, 0xc42097bc20, 0xc422e9dc58, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/server.NewServer.func1()
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:248 +0xb0
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).runHeartbeat(0xc420121040, 0xc420445a90, 0xc4204cef40, 0x12, 0xc4200f3380, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:790 +0x667
github.com/cockroachdb/cockroach/pkg/rpc.(*Context).GRPCDial.func1.1.1(0x2ff8ae0, 0xc420a4d770)
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:668 +0x74
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc420550f10, 0xc4209ae000, 0xc420a4d740)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad


****************************************************************************
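
The fatal error above points at OS-level clock synchronization rather than anything backup-specific. A generic check that could be run on each host (standard Linux tooling, nothing CockroachDB-specific; assumes either chrony or ntpd is in use):

# Confirm the system clock is reported as synchronized and inspect the offset.
timedatectl status | grep -i synchron
chronyc tracking 2>/dev/null || ntpq -p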

Not sure what exactly is happening on the cluster. The whole cluster was restarted at 15:12:58.689112. The most recent message indicating a stuck range is from 15:35:40.394714 on r3004 (see attachment), while the debug zip was taken closer to I181107 15:43:43.291977.

I don't see a num_pending entry on r3004, which is typically there in this scenario, but the command queue shows something:

  "cmd_q_local": {
    "write_commands": 1,
    "read_commands": 2,
    "tree_size": 2
  },
  "cmd_q_global": {
    "read_commands": 1
  },

(@nvanbenschoten any idea what this combination means?)

The leader's Raft state shows that one replica believes itself to be the leader of a past term, while the other two are leader and follower in the new term. As you'd expect, the replica with the entries in the command queue is the outdated leader.

It's unclear how this comes to be. This could be caused by a network partition.

Could you ask the user for the range status report at /#/reports/range/3004? I'm pretty sure whatever the problem was has cleared up by now, but if not, it'll be good to take further steps.

PS: also ask whether direct access to this cluster is within the realm of possibility. The problem here isn't obvious to me.

r3004.txt

@nvanbenschoten

(@nvanbenschoten any idea what this combination means?)

No, I don't. I looked through a number of possibilities but I couldn't find a single request that would result in that combination of a local write and multiple local and global reads. It doesn't look like any variant of a commit trigger. It's possible that there are two different requests stuck on the Range.

@slateraus

Hi, we have since scratched the 2.5 TB environment where we first had the problem. But then the problem occurred on a much smaller DB (same schema) on-prem. I believe you guys have access to that cluster. I also uploaded the debug zip from that cluster.

It's sort of weird, as I have backups working fine on other databases, but this DB/schema errors. I am wondering if it is related to a database object.


roncrdb commented Nov 27, 2018

@tbg ,

I looked at the alerts for the second debug zip that @slateraus is referring to; nothing shows up there.

And grep -c 'have been waiting' nodes/*/logs/cockroach.*.log | wc -l returns 80.


tbg commented Nov 28, 2018

@roncrdb if we indeed have access to that cluster, I'd be happy to take a look. Could you let me know privately how to access it?


roncrdb commented Jan 29, 2019

This was solved via Slack; closing the issue out.

@roncrdb roncrdb closed this as completed Jan 29, 2019
@tbg tbg reopened this Jan 29, 2019

tbg commented Jan 29, 2019

If I remember correctly, this was about some nodes not being able to connect to the backup destination due to a firewall misconfiguration. The error message needs to be improved.
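
A quick way to confirm that diagnosis on a similar cluster would be to test reachability of the bucket endpoint from every node. A hypothetical sketch (host list, bucket name, and region are placeholders; a 403 from S3 still proves network reachability, whereas a timeout points at a firewall problem):

for host in <node1> <node2> <node3>; do
  echo "== $host"
  ssh "$host" 'curl -sS -m 10 -o /dev/null -w "%{http_code}\n" https://<bucket-name>.s3.<region>.amazonaws.com/'
done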

@tbg tbg changed the title backupccl: Backup failed with "context canceled" error. backupccl: opaque "context cancelled" error when some nodes can't access backup destination Jan 29, 2019
@rolandcrosby rolandcrosby removed this from Triage in Bulk IO Jul 8, 2019
@rolandcrosby rolandcrosby added this to Triage in Disaster Recovery Backlog via automation Jul 8, 2019
@dt dt moved this from Triage to Polish/New Features in Disaster Recovery Backlog Jul 11, 2019
@dt dt moved this from Features and Polish to sort into Backlogs to Backlog: Backup/Restore in Disaster Recovery Backlog Jul 16, 2019
@mwang1026 mwang1026 moved this from Backup/Restore/Dump to errors messages in Disaster Recovery Backlog Aug 27, 2020
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@dt dt moved this from Polish: Errors Messages to Backlog in Disaster Recovery Backlog Feb 13, 2024