
Clean callbacks of connection after run loop stopped (#239) #248

Merged

merged 3 commits into apache:master on Aug 17, 2020

Conversation

rueian
Contributor

@rueian rueian commented May 13, 2020

Fixes #239

Motivation

As @wolfstudy pointed out here #239 (comment)

There is a race on the callbacks of pendingReqs when closing the connection while the run loop is still running, which can lead to a callback being invoked up to two times:

for _, req := range c.pendingReqs {
	req.callback(nil, errors.New("connection closed"))
}

Modifications

Introduce a runLoopStoppedCh to make sure that the run loop has already stopped before Close() cleans up the callbacks in pendingReqs.

@rueian rueian force-pushed the call-pending-callback-after-run-loop branch 2 times, most recently from a6b5467 to 38a16c2 Compare May 13, 2020 19:22
c.pingTicker.Stop()
c.pingCheckTicker.Stop()

for _, listener := range c.listeners {
	listener.ConnectionClosed()
}

for _, req := range c.pendingReqs {
if c.runLoopStoppedCh != nil {
Contributor

can you reuse the close channel here? Why create another channel?

Contributor Author

@rueian rueian May 14, 2020

Do you mean reusing the closeCh?

It seems that the closeCh is a signal to instruct the run() loop to exit. However we would not know when the run() has finished its job without a dedicated signal.

Therefore, the runLoopStoppedCh is the dedicated signal responsible for indicating that the run() loop has stopped.
Once runLoopStoppedCh is closed, we can safely clean up the resources shared with the run() loop.

Contributor

Why does the run loop have to exit before removing the pending callback? How could a callback be called twice?

Contributor Author

@rueian rueian May 14, 2020

As @wolfstudy pointed out, the problem comes from the fact that internalWriteData may call Close():
https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/connection.go#L331

Meanwhile, the run loop will proceed to its next iteration regardless of whether Close() was called inside the loop or from another goroutine. At that point, the run loop can still take the incomingCmdCh branch even though c.closeCh has already been closed.

If the run loop takes the incomingCmdCh branch, the callbacks in pendingReqs that Close() has already invoked will be invoked a second time.

Contributor

So the issue is connection.Close() is called multiple times?

After Close() is the connection struct ever used again?

If we want to ensure Close() is only called once can we use a sync.Once and do something similar to what the consumers do?

https://github.com/apache/pulsar-client-go/blob/master/pulsar/consumer_impl.go#L362

Contributor

I'm not sure whether ordering is required (whether the pendingReqs close callbacks must happen before the c.listeners and c.consumerHandlers callbacks); maybe @merlimat would know better.

I think only the pendingReqs handling needs to move into the run loop. The other structures are protected by locks. Have you tried moving the pendingReqs code from Close() into the run loop and then running some tests?

Contributor Author

Yes, moving just the pendingReqs cleanup into the run loop still passes the existing tests, but it is really hard to write new race tests for this without them being flaky.

I would suggest keeping the cleanup code in only one place, preferably triggered by the run loop. However, Close() would then become an async operation.

Hi @merlimat, would you have any feedback?

Contributor Author

After tracing the call path of Close(), I found that it is called directly by Close() in client_impl.go.

When performing a graceful shutdown, we normally call producer.Close()/consumer.Close() first and client.Close() at the end. Even so, I think it is a good idea to keep Close() a synchronous operation, so its behavior stays similar to producer.Close()/consumer.Close().

We could move the cleanup code for c.pendingReqs, c.listeners and c.consumerHandlers into the run loop. In that case, however, I would suggest keeping the runLoopStoppedCh channel to inform Close() that the cleanup has finished.

What do you think? @cckellogg

Contributor

Do you see the run loop ever getting stuck for a period of time when close is called? I think we should probably do only the pendingReqs cleanup in the run loop, since that map is only accessed in that goroutine. We could keep the others where they are, since they are protected by locks. We just need to add some comments explaining why this is the case and why the runLoopStoppedCh is needed.

Thoughts?

Contributor Author

Do you see the run loop ever getting stuck for a period of time when close is called?

No, they are separate to begin with. And to keep the original behavior, the runLoopStoppedCh is not needed if we move the pendingReqs cleanup to the end of the run loop.

I just updated the PR: it moves the pendingReqs cleanup into the run loop and adds some explanatory comments.

Please let me know if it looks good to you.

@rueian rueian force-pushed the call-pending-callback-after-run-loop branch from 38a16c2 to 9702aff Compare May 14, 2020 10:14
@rueian rueian changed the title Call pending callbacks of connection after run loop stopped (#239) Clean pending callbacks of connection after run loop stopped (#239) May 19, 2020
@rueian rueian changed the title Clean pending callbacks of connection after run loop stopped (#239) Clean callbacks of connection after run loop stopped (#239) May 19, 2020
@wolfstudy wolfstudy added this to the 0.1.1 milestone Jun 9, 2020
@wolfstudy
Member

wolfstudy commented Jun 9, 2020

The change LGTM +1, ping @cckellogg @merlimat PTAL

if c.cnx != nil {
c.cnx.Close()
}
c.TriggerClose()
Contributor

@cckellogg cckellogg Jun 9, 2020

If a client calls this, won't it trigger another call to Close() if the run loop is still going? That would cause all the listeners to be triggered again. Are there potential issues with that?

Contributor Author

This method is guarded by a mutex and a c.state == connectionClosed check, so I think another call to Close() will have no effect.

@wolfstudy
Member

Move the milestones to 0.2.0

@wolfstudy wolfstudy modified the milestones: 0.1.1, 0.2.0 Jun 11, 2020
@wolfstudy wolfstudy merged commit 5d57012 into apache:master Aug 17, 2020
Development

Successfully merging this pull request may close these issues.

panic during rapid send of many large-sized payloads: "panic: send on closed channel"