Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/grpcutil: Can no longer detect unambiguous failures #19708

Closed
cockroach-teamcity opened this issue Nov 1, 2017 · 27 comments · Fixed by #105885
Closed

util/grpcutil: Can no longer detect unambiguous failures #19708

cockroach-teamcity opened this issue Nov 1, 2017 · 27 comments · Fixed by #105885
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-kv KV Team
Projects

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/98f1623d60254505b65c457728617af5c5fec33b

Parameters:

TAGS=
GOFLAGS=

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=394944&tab=buildLog

I171101 07:09:42.489844 41 vendor/google.golang.org/grpc/resolver_conn_wrapper.go:64  dialing to target with scheme: ""
I171101 07:09:42.489870 41 vendor/google.golang.org/grpc/resolver_conn_wrapper.go:70  could not get resolver for scheme: ""
W171101 07:09:42.490535 56 vendor/google.golang.org/grpc/server.go:839  grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
W171101 07:09:42.490640 21 vendor/google.golang.org/grpc/transport/log.go:36  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:49561->127.0.0.1:33576: use of closed network connection
W171101 07:09:42.491825 52 vendor/google.golang.org/grpc/clientconn.go:934  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:49561: operation was canceled"; Reconnecting to {127.0.0.1:49561 0  <nil>}
W171101 07:09:42.491862 52 vendor/google.golang.org/grpc/clientconn.go:1028  grpc: addrConn.transportMonitor exits due to: context canceled
	grpc_util_test.go:117: request should not have started, but got rpc error: code = Unavailable desc = transport is closing


ERROR: exit status 1

40000 runs completed, 1 failures, over 4m25s
context canceled
Makefile:276: recipe for target 'stress' failed
make: *** [stress] Error 1
@cockroach-teamcity cockroach-teamcity added O-robot Originated from a bot. C-test-failure Broken test (automatically or manually discovered). labels Nov 1, 2017
@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@bdarnell
Copy link
Member

bdarnell commented Nov 9, 2017

The GRPC updates in #19619 have changed the way errors are being reported here in some cases (several of the assertions in TestRequestDidNotStart are failing, both that we get an error that passes RequestDidNotStart when we didn't expect to and that we didn't get one when we should). Note that even though this has been failing every night since those changes went in, it takes a 10+ minutes to reproduce on a gceworker.

I haven't been able to identify exactly what's going on on these error paths, but we knew we were relying on implementation details because GRPC hasn't provided a clean way to tell whether the RPC has been sent or not (grpc/grpc-go#1443).

This affects AmbiguousResultErrors: it appears that we could be returning AmbiguousResultErrors when we don't need to and returning unambiguous errors when we shouldn't. We could fix this conservatively by making all GRPC errors ambiguous, but I think that would make ambiguous errors way too common. We may need to downgrade GRPC to 1.6 instead.

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

@cockroach-teamcity
Copy link
Member Author

bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 15, 2017
In gRPC 1.6, we could identify certain unambiguous failures with
string matching on the error. In 1.7, the same error message is used
for both ambiguous and unambiguous cases, so we must assume it is
ambiguous.

Release note (bug fix): Fixed a case in which ambiguous errors were
treated as unambiguous and led to inappropriate retries.

Updates cockroachdb#19708
@bdarnell bdarnell changed the title util/grpcutil: TestRequestDidNotStart failed under stress util/grpcutil: Can no longer detect unambiguous failures Nov 15, 2017
@bdarnell
Copy link
Member

PR #20073 makes the conservative change to RequestDidNotStart to assume that all rpc errors are ambiguous, which fixes the potential bug here. However, this will increase the prevalence of ambiguous errors, so I'm leaving this issue open to track alternative solutions. This will probably require either grpc/grpc-go#1443 upstream or reverting to grpc 1.6.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jan 10, 2018
This is similar to what grpc does internally for fail-fast rpcs, but
lets us control the error returned to work around grpc/grpc-go#1443

Fixes cockroachdb#19708

Release note (performance improvement): Reduced the occurrence of
ambiguous errors when a node is down.
@rafiss rafiss added this to Incoming in KV via automation Jun 27, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 27, 2023
@rafiss
Copy link
Collaborator

rafiss commented Jun 27, 2023

reopening as a skipped test references this issue

@rafiss rafiss reopened this Jun 27, 2023
@erikgrinaker erikgrinaker self-assigned this Jun 30, 2023
@craig craig bot closed this as completed in 17814d8 Jul 6, 2023
KV automation moved this from Incoming to Closed Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-kv KV Team
Projects
KV
Closed
Development

Successfully merging a pull request may close this issue.

4 participants