Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic when an in flight request is canceled #36

Open
joao-r-reis opened this issue Oct 21, 2021 · 7 comments
Open

Panic when an in flight request is canceled #36

joao-r-reis opened this issue Oct 21, 2021 · 7 comments

Comments

@joao-r-reis
Copy link
Contributor

joao-r-reis commented Oct 21, 2021

I was trying to reproduce a panic race condition in a different code when I ran into 2 different "panics" so I figured it would be best to report it here. In the code that reproduces this, we are opening and closing a lot of connections fast like this:

newCtx, _ := context.WithTimeout(ctx, time.Duration(rand.Intn(200)) * time.Millisecond)
tempCqlConn, err := cqlClient.ConnectAndInit(newCtx, primitive.ProtocolVersion4, 0)

But we are also sending a ton of requests concurrently like this (on another connection):

reqFrame3 := frame.NewFrame(primitive.ProtocolVersion4, 0, queryMsg)
inflightRequest3, err := cqlConn.Send(reqFrame3)
panic: send on closed channel

goroutine 242664 [running]:
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).Send(0xc0008a8480, 0xc0003c7420, 0x0, 0x0, 0xc000a97f60, 0xc001991e50)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:513 +0x26c
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).SendAndReceive(0xc0008a8480, 0xc0003c7420, 0xc0003c7420, 0x0, 0x0)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:546 +0x40
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).InitiateHandshake(0xc0008a8480, 0xe304, 0xc000983f80, 0xc0008a8480)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/handshake.go:62 +0x119
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClient).ConnectAndInit(0xc001ac2120, 0xd6e3a0, 0xc000983f80, 0xe304, 0xc000983f80, 0xc000776ef0, 0xc0030cab00)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:136 +0x9c
github.com/riptano/cloud-gate/integration-tests.TestStressShutdown.func1.2(0xc0007066f0, 0xd6e320, 0xc00007f200, 0xc001ac2120, 0xc000a5be50)
	C:/Github/Riptano/cloud-gate/integration-tests/shutdown_test.go:191 +0x10c
created by github.com/riptano/cloud-gate/integration-tests.TestStressShutdown.func1
	C:/Github/Riptano/cloud-gate/integration-tests/shutdown_test.go:187 +0x55b
panic: send on closed channel

goroutine 283516 [running]:
github.com/datastax/go-cassandra-native-protocol/client.(*inFlightRequest).onFrameReceived(0xc001784b00, 0xc002c256e0, 0x0, 0x0)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/inflight.go:270 +0x127
github.com/datastax/go-cassandra-native-protocol/client.(*inFlightRequestsHandler).onIncomingFrameReceived(0xc000716e40, 0xc002c256e0, 0x1f, 0xc001fd5f20)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/inflight.go:122 +0x132
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).processIncomingFrame(0xc000184240, 0xc002c256e0, 0xc001ee0c40)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:428 +0x312
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).readFrame(0xc000184240, 0xd5ea20, 0xc001ee0c40, 0x0)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:374 +0xc7
github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).incomingLoop.func1(0xc000184240)
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:257 +0x85
created by github.com/datastax/go-cassandra-native-protocol/client.(*CqlClientConnection).incomingLoop
	C:/Users/JoaoReis/go/pkg/mod/github.com/datastax/go-cassandra-native-protocol@v0.0.0-20210929152221-b2edfb76bdf8/client/client.go:249 +0xe4

@adutra
Copy link
Contributor

adutra commented Oct 25, 2021

For the first case: I cannot reproduce it, even if I place some wait groups to simulate the case where the Send method sends on a channel that has been closed immediately before in another goroutine. The way the Close method is designed expects the outgoing channel to be nilled out before it is closed, see here. In my attempts to reproduce this, I always get a nil channel, not a closed one. I think we might be misusing Go's memory model here. According to this, the write to c.outgoing = nil may not be immediately visible by other go routines. This IMO explains the issue.

For the second case: I didn't explore so deeply, but indeed it seems possible. I think the design suffers from the same memory visibility issue, see here.

I therefore suggest 2 things:

  • Fix the memory visibility issues and use a lock to acquire the channel. This seems a bit convoluted though.
  • Alternatively: do we need to close those channels? Can't we simply leave them open and close the whole connection only?

@joao-r-reis
Copy link
Contributor Author

joao-r-reis commented Oct 25, 2021

I think that one of the main issues is that the connection calls close() itself when the context is closed. We should probably always put that responsibility on the caller so that the caller stops sending requests before actually closing the connection.

This won't fix everything though because there still might be in flight requests and responses may arrive while a close() is happening. To fix this issue, we can probably do it by having a waitGroup that waits until every loop is done and closing the channels only after Wait() returns. So we can start by closing the connection object to force the read loop to return and this read loop will start closing the channels that it writes to, therefore initiating the shutdown in a "cascade" way and when everything is closed then the request channels can be closed (the ones that the user indirectly writes to when a request is sent)

@adutra
Copy link
Contributor

adutra commented Oct 25, 2021

I would start by fixing the write visibility issues, then see if there are more issues to fix.

Leaving those channels open could be a quick win in fact, I don't see any compelling reason to close them when closing the connection.

@joao-r-reis
Copy link
Contributor Author

As long as those channels are not buffered then it could be a quick win indeed.

@adutra
Copy link
Contributor

adutra commented Oct 25, 2021

Hmm they are buffered. What is the problem with them being buffered?

@joao-r-reis
Copy link
Contributor Author

I thought that it would create memory issues but nevermind, they should get garbage collected regardless

@joao-r-reis
Copy link
Contributor Author

We do have to make sure we don't leak goroutines otherwise the buffered channel might never get cleaned up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants