
roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed [stalled rangefeed] #119333

Closed
cockroach-teamcity opened this issue Feb 17, 2024 · 15 comments · Fixed by #119512
Assignees: erikgrinaker
Labels
  • A-kv-rangefeed: Rangefeed infrastructure, server+client
  • branch-release-23.2.1-rc
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-0: Issues/test failures with a fix SLA of 2 weeks
  • release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
  • T-kv: KV Team
Milestone: 23.2

Comments

@cockroach-teamcity (Member) commented Feb 17, 2024

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on release-23.2.1-rc @ 898cd6a363fd47bb92a03bac216f9bed0f64bc08:

(latency_verifier.go:199).assertValid: max latency was more than allowed: 17m13.946417082s vs 2m0s
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-36155

cockroach-teamcity added the branch-release-23.2.1-rc, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels on Feb 17, 2024
cockroach-teamcity added this to the 23.2 milestone on Feb 17, 2024
dt added the P-1 label (fix SLA of 1 month) and removed the release-blocker label on Feb 20, 2024
@msbutler (Collaborator) commented Feb 21, 2024

Node 3 got stuck, and the replanner did not kick in because the stream had made initial progress:

5.unredacted/cockroach.log:I240217 11:40:16.310203 44966 ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:571 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=944042679120232450›,stream-ingest-distsql,f‹22f5b763›,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-944042679120232450›] 907  detected a lagging node: node 3 is 10.66 minutes behind the next node. Try replanning: node frontier too far behind other nodes. Don't forward error because replicated time at last check 0,0 is less than current replicated time 1708169362.163534125,0

This happened again an hour later, on node 4:

5.unredacted/cockroach.log:I240217 12:40:16.877005 44966 ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:571 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=944042679120232450›,stream-ingest-distsql,f‹22f5b763›,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-944042679120232450›] 1430  detected a lagging node: node 4 is 17.00 minutes behind the next node. Try replanning: node frontier too far behind other nodes. Don't forward error because replicated time at last check 0,0 is less than current replicated time 1708172582.161500341,0
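To make the guard in these log lines a bit more concrete, here is a minimal sketch (hypothetical names, not the actual frontier-processor code): once a lagging node is detected, the replan error is only forwarded if the replicated time has not advanced since the last check, so any initial progress suppresses replanning.

package main

import (
    "errors"
    "fmt"
)

var errNodeLagging = errors.New("node frontier too far behind other nodes")

// maybeReplan is a hypothetical, simplified version of the guard described in
// the log message above. It assumes a lagging node was already detected.
func maybeReplan(lastCheckReplicatedTime, currentReplicatedTime float64) error {
    if lastCheckReplicatedTime >= currentReplicatedTime {
        // No progress since the last check: forward the replan error.
        return errNodeLagging
    }
    // Progress was made since the last check: don't forward the error.
    return nil
}

func main() {
    // Matches the logged case: replicated time at last check (0) is less than
    // the current replicated time, so the error is not forwarded.
    fmt.Println(maybeReplan(0, 1708169362.16))
}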

@msbutler (Collaborator) commented Feb 21, 2024

Looks like a rangefeed on source node 3 stalled for 17 minutes with this fun new error. It was logged continuously from 12:23 to 12:40, which lines up with the 17-minute delay reported by the lag checker:

~/Downloads/artifacts_slow_c2c/logs/3.unredacted
❯ grep "kv/kvserver/rangefeed/task.go:245" cockroach.log | head -n 1
E240217 12:23:06.192206 11736095 kv/kvserver/rangefeed/task.go:245 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-8…}›,rangefeed] 1784  pushing old intents failed: range barrier failed, range split: key range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹284›/‹"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"› outside of bounds of range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹273›/‹"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"›; suggested ranges: [‹desc: r229:/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"} [(n3,s3):4, (n1,s1):5, (n4,s4):6, next=7, gen=31, sticky=1708172411.570988803,0], lease: <empty>, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING›]

~/Downloads/artifacts_slow_c2c/logs/3.unredacted
❯ grep "kv/kvserver/rangefeed/task.go:245" cockroach.log | tail -n 1
E240217 12:40:23.185617 15466593 kv/kvserver/rangefeed/task.go:245 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-8…}›,rangefeed] 3063  pushing old intents failed: range barrier failed, range split: key range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹284›/‹"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"› outside of bounds of range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹273›/‹"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"›; suggested ranges: [‹desc: r229:/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"} [(n3,s3):4, (n1,s1):5, (n4,s4):6, next=7, gen=31, sticky=1708172411.570988803,0], lease: <empty>, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING›]

msbutler added the P-0 label (fix SLA of 2 weeks) and removed the P-1 label (fix SLA of 1 month) on Feb 21, 2024
@msbutler (Collaborator) commented:

Looks like this error was recently introduced in #118413 by @erikgrinaker. Since this looks like a rangefeed bug, I imagine it could also cause a changefeed to stall. Adding a release blocker.

msbutler added the release-blocker and T-kv labels and removed the T-disaster-recovery label on Feb 21, 2024
blathers-crl bot added this to Incoming in KV on Feb 21, 2024
@msbutler (Collaborator) commented Feb 22, 2024

Hrm. While I'm pretty sure this range barrier error caused the c2c stream to stall, given the timing, one wrinkle I should note: the c2c logging indicated that destination node 4 was lagging behind, but according to the stream_ingestion_dist logging, dest node 4 was paired with src node 2:

5.unredacted/cockroach.log:I240217 11:29:35.463107 41017 ccl/streamingccl/streamingest/stream_ingestion_dist.go:744 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=944042679120232450›] 434  physical replication src-dst pair candidate: ‹2› (locality ‹cloud=gce,region=us-east1,zone=us-east1-b›) - 4 (locality ‹cloud=gce,region=us-east1,zone=us-east1-b›)

I suspect the rangefeed served by the KV server on node 3 was then routed to dest node 4 via src node 2, but I'm not totally sure.

stevendanna changed the title to "roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed [stalled rangefeed]" on Feb 22, 2024
@erikgrinaker (Contributor) commented:

I think this indicates that the range was split, but the split wasn't applied to the follower. I'll have a closer look when I have a chance.

erikgrinaker self-assigned this on Feb 22, 2024
@erikgrinaker (Contributor) commented:

This happened after r231 merged into r229:

I240217 12:23:05.160686 17579550 kv/kvserver/replica_command.go:718 ⋮ [T1,Vsystem,n1,merge,s1,r229/5:‹/Tenant/2/Table/109/1/2{6…-7…}›] 2788
  initiating a merge of r231:‹/Tenant/2/Table/109/1/2{73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"-84/"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"}› [(n1,s1):1, (n3,s3):4, (n4,s4):5, next=6, gen=28, sticky=1708172411.570988803,0]
  into this range (‹lhs+rhs size (32 MiB+32 MiB=63 MiB) below threshold (128 MiB) lhs+rhs cpu (4ms+7ms=11ms) below threshold (250ms)›)
I240217 12:23:05.172789 332 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 1783  removing replica r231/4
E240217 12:23:06.192206 11736095 kv/kvserver/rangefeed/task.go:245 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-8…}›,rangefeed] 1784
  pushing old intents failed: range barrier failed, range split:
  key range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹284›/‹"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"› outside of bounds of range
  /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹273›/‹"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"›;
  suggested ranges: [‹desc: r229:/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"} [(n3,s3):4, (n1,s1):5, (n4,s4):6, next=7, gen=31, sticky=1708172411.570988803,0], lease: <empty>, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING›]

The barrier is sent with the correct, post-merge bounds, but the server sees the old, pre-merge bounds and rejects the request. Why?

@erikgrinaker (Contributor) commented:

The barrier request doesn't bypass the lease check or anything, so it should be evaluated on the leaseholder, which should have applied the merge already:

cockroach/pkg/kv/kvpb/api.go, lines 1776 to 1782 at e494351:

func (r *BarrierRequest) flags() flag {
    flags := isWrite | isRange | isAlone
    if r.WithLeaseAppliedIndex {
        flags |= isUnsplittable // the LAI is only valid for a single range
    }
    return flags
}

@erikgrinaker (Contributor) commented Feb 22, 2024

Could this be a tenant thing or something, where the bounds aren't propagated to the tenant yet? 🤔 Wouldn't immediately think so, since this runs in KV.

Haven't seen this happen in the wild, just checked CC.

@erikgrinaker (Contributor) commented:

All three nodes applied the merge immediately:

I240217 12:23:05.172392 394 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n1,s1,r229/5:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 2789  removing replica r231/1
I240217 12:23:05.172789 332 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 1783  removing replica r231/4
I240217 12:23:05.172585 310 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n4,s4,r229/6:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 1485  removing replica r231/5

@erikgrinaker (Contributor) commented Feb 22, 2024

This resolved at 12:40:23 when the stream was stopped. r229 was merged away shortly after, which is unfortunate since that also removed the range state.

I240217 12:42:05.640932 15144319 kv/kvserver/replica_command.go:718 ⋮ [T1,Vsystem,n2,merge,s2,r228/3:‹/Tenant/2/Table/109/1/2{5…-6…}›] 2005
  initiating a merge of r229:‹/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-84/"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"}› [(n2,s2):7, (n1,s1):5, (n4,s4):6, next=8, gen=36, sticky=1708172411.570988803,0]
  into this range (‹lhs+rhs size (32 MiB+64 MiB=95 MiB) below threshold (128 MiB) lhs+rhs cpu (14ms+12ms=26ms) below threshold (250ms)›)

@erikgrinaker (Contributor) commented Feb 22, 2024

Oh no, it looks like isUnsplittable requests don't refresh the DistSender range cache. So if the DistSender has a stale view of the range descriptor, it'll just keep failing until some other request ends up refreshing the cache.

// The batch spans ranges (according to our cached range descriptors).
// Verify that this is ok.
// TODO(tschottdorf): we should have a mechanism for discovering range
// merges (descriptor staleness will mostly go unnoticed), or we'll be
// turning single-range queries into multi-range queries for no good
// reason.
if ba.IsUnsplittable() {
    mismatch := kvpb.NewRangeKeyMismatchErrorWithCTPolicy(ctx,
        rs.Key.AsRawKey(),
        rs.EndKey.AsRawKey(),
        ri.Desc(),
        nil, /* lease */
        ri.ClosedTimestampPolicy(),
    )
    return nil, kvpb.NewError(mismatch)
}
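To illustrate the failure mode (a hand-wavy, self-contained sketch with hypothetical types, not the actual DistSender code): an error path that reports the mismatch without evicting the cached descriptor keeps hitting the same stale entry on every attempt, whereas the normal path evicts the entry and refetches the descriptor before retrying.

package main

import (
    "errors"
    "fmt"
)

type rangeDesc struct{ startKey, endKey string }

type rangeCache struct{ cached *rangeDesc }

// lookup returns the cached descriptor if present, otherwise fetches the
// authoritative one and caches it.
func (c *rangeCache) lookup(authoritative rangeDesc) rangeDesc {
    if c.cached == nil {
        c.cached = &authoritative
    }
    return *c.cached
}

func (c *rangeCache) evict() { c.cached = nil }

var errMismatch = errors.New("range key mismatch")

// send simulates dispatching a request spanning [start, end). Unsplittable
// requests fail on a mismatch without evicting the cache, so they keep
// failing until something else refreshes the entry; splittable requests
// evict and retry with the fresh descriptor.
func send(c *rangeCache, authoritative rangeDesc, start, end string, unsplittable bool) error {
    for {
        desc := c.lookup(authoritative)
        if start >= desc.startKey && end <= desc.endKey {
            return nil // request fits within the (believed) range bounds
        }
        if unsplittable {
            return errMismatch // cache entry is NOT evicted here
        }
        c.evict() // normal path: evict the stale entry and retry
    }
}

func main() {
    // The cache still holds the pre-merge descriptor; the authoritative
    // post-merge descriptor covers the full span.
    cache := &rangeCache{cached: &rangeDesc{startKey: "a", endKey: "m"}}
    postMerge := rangeDesc{startKey: "a", endKey: "z"}

    fmt.Println(send(cache, postMerge, "a", "z", true))  // range key mismatch, every time
    fmt.Println(send(cache, postMerge, "a", "z", false)) // <nil> after evicting and refetching
}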

@erikgrinaker (Contributor) commented Feb 22, 2024

Secondary question: why do we keep trying to push this transaction? Are we not seeing the intent resolution replicated, which would remove it from tracking? Maybe because we don't emit any of the events here when the barrier fails or something. Will look into it after I fix the primary problem.

@erikgrinaker (Contributor) commented:

I have no fewer than three prototype fixes; we'll need to gauge backport safety.

Secondary question: why do we keep trying to push this transaction? Are we not seeing the intent resolution replicated, which would remove it from tracking?

I think that's exactly the problem: this probably was an aborted transaction, whose txn record was GCed, but there is nothing actually causing the intents to get GCed. We would do so after sending the barrier, but we error out before that.

I'll submit a PR to attempt to resolve intents even if the barrier fails, which should be a bit more robust.
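Roughly the shape I have in mind, as a sketch only (function names are hypothetical stand-ins, not from the actual PR): always attempt intent resolution, even when the barrier errors, so the intents get cleaned up and drop out of rangefeed tracking.

package main

import (
    "context"
    "errors"
    "fmt"
)

// Hypothetical stand-ins for the barrier and intent-resolution steps; they do
// not correspond to real CockroachDB functions.
var errBarrierFailed = errors.New("range barrier failed, range split")

func sendBarrier(ctx context.Context) error    { return errBarrierFailed }
func resolveIntents(ctx context.Context) error { return nil }

// Before: an early return on the barrier error meant the intents were never
// resolved, so the rangefeed kept re-pushing the same aborted transaction.
// After: attempt resolution regardless, then still surface the barrier error.
func pushAndResolve(ctx context.Context) error {
    barrierErr := sendBarrier(ctx)
    if resolveErr := resolveIntents(ctx); resolveErr != nil {
        return errors.Join(barrierErr, resolveErr)
    }
    return barrierErr
}

func main() {
    fmt.Println(pushAndResolve(context.Background()))
}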

@erikgrinaker (Contributor) commented:

And another candidate fix:

@erikgrinaker (Contributor) commented:

After discussing, we're going to go with #119512. We just have to confirm that this will reliably update the range cache first, and add another test.
