Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: weekly/tpcc/headroom failed #122439

Open
cockroach-teamcity opened this issue Apr 16, 2024 · 4 comments · May be fixed by #122944
Open

roachtest: weekly/tpcc/headroom failed #122439

cockroach-teamcity opened this issue Apr 16, 2024 · 4 comments · May be fixed by #122944
Assignees
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 16, 2024

roachtest.weekly/tpcc/headroom failed with artifacts on master @ f117eea22dd7be380c7141cdf6cd7aba92dd9c70:

(monitor.go:154).Wait: monitor failure: full command output in run_073655.363388731_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-37891

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Apr 16, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Apr 16, 2024
@renatolabs
Copy link
Collaborator

Workload saw the following error

Error: error in delivery: ERROR: result is ambiguous: error=r265/3:(n3,s3) is unavailable (circuit breaker tripped): context canceled [propagate] (last error: intent missing: "sql txn" meta={id=58a41f40 key=/Table/111/1/945/1 iso=Serializable pri=0.00913027 epo=0 ts=1713274007.574041942,0 min=1713274007.574041942,0 seq=0} lock=true stat=PENDING rts=1713274007.574041942,0 wto=false gul=1713274008.074041942,0) (SQLSTATE 40003)

Which is the same error observed in last week's failure (#121997). This failure includes #122255, so I'm guessing that change didn't completely fix the issue. Reassigning to KV.

@renatolabs renatolabs added the T-kv KV Team label Apr 16, 2024
@blathers-crl blathers-crl bot added this to Incoming in KV Apr 16, 2024
@renatolabs renatolabs removed the T-testeng TestEng Team label Apr 16, 2024
@renatolabs renatolabs removed this from Triage in Test Engineering Apr 16, 2024
@arulajmani arulajmani added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 16, 2024
@arulajmani
Copy link
Collaborator

arulajmani commented Apr 16, 2024

@nvanbenschoten, coming back to this again, and I'm wondering if the issue here has to do with the ambiguous nature of the error being returned. The rationale behind why we're marking the error ambiguous is here:

// The circuit breaker may have tripped while a commit proposal was in
// flight, so we have to mark it as ambiguous as well.
if withCommit && ambiguousError == nil {
ambiguousError = br.Error.GoError()
}

That doesn't apply to pre-commit QueryIntentRequests though, right? If we're getting an IntentMissingError, we'd expect it to be handled by divideAndSendParallelCommit correctly. This also comes back to what you said on Slack:

"I’m also a little confused about why we set withCommit for the “pre-commit query intent” sub-batch and how this is supposed to work. The batch is read-only, so we should be free to retry it. We should then be handling ambiguity up above in divideAndSendParallelCommit."

I still don't follow the last bit about handling ambiguity in divideAndSendParallelCommit, but I was wondering if a diff like below would work:

--- a/pkg/kv/kvclient/kvcoord/dist_sender.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender.go
@@ -1393,7 +1393,8 @@ func (ds *DistSender) divideAndSendParallelCommit(

                // Send the batch with withCommit=true since it will be inflight
                // concurrently with the EndTxn batch below.
-               reply, pErr := ds.divideAndSendBatchToRanges(ctx, qiBa, qiRS, qiIsReverse, true /* withCommit */, qiBatchIdx)
+               // TODO(XXX): update comment ^.
+               reply, pErr := ds.divideAndSendBatchToRanges(ctx, qiBa, qiRS, qiIsReverse, false /* withCommit */, qiBatchIdx)
                qiResponseCh <- response{reply: reply, positions: positions, pErr: pErr}
        }); err != nil {
                return nil, kvpb.NewError(err)

@arulajmani arulajmani self-assigned this Apr 16, 2024
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 24, 2024
In certain cases, requests may return an error after successful
evaluation. One example is `IntentMissingErrors` returned by
`QueryIntent` requests when the queried intent is missing. In such
cases, from the perspective of the client, the request is successful.
It shouldn't be retried on another replica and any state maintained
from prior attempts should be forgotten.

The former was already handled, likely by unintentionally. This patch
fixes the latter. In particular, any ambiguous error tracking from
previous attempts is forgotten, and in cases where the request evaluated
successfully but returned an error, the error is no longer wrapped as
ambiguous.

Fixes cockroachdb#122439

Epic: none

Release note: None
@arulajmani arulajmani added branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 and removed GA-blocker branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 labels Apr 24, 2024
@arulajmani
Copy link
Collaborator

Removing the GA-blocker because this issue is caused by DistSender circuit breakers, which are being disabled by default. See #122983.

@miraradeva miraradeva added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 24, 2024
@cockroach-teamcity
Copy link
Member Author

roachtest.weekly/tpcc/headroom failed with artifacts on master @ c4ab095c4f65b9140661ed57adddc690b1e3ce3f:

(monitor.go:154).Wait: monitor failure: full command output in run_081416.890588441_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@arulajmani arulajmani added the P-2 Issues/test failures with a fix SLA of 3 months label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects
KV
Incoming
Development

Successfully merging a pull request may close this issue.

5 participants