
client: Handle sending/receiving in separate goroutines #94

Merged: 1 commit, Oct 30, 2021

Conversation

@kevpar (Member) commented Oct 14, 2021

Late last year I created #73 in an attempt to solve #72. That PR received some feedback that I never had a chance to properly address. I've now revisited this area and taken a different approach, which I think is best presented as a fresh PR, so that's what we have here.

Originally I had changed both the client and server code to prevent this deadlock on both ends. However, after thinking on this recently, I believe the server code is alright as is. The core issue is that the client would block sending to the server, and the server would block sending back to the client; thus neither would read from the other, and we would deadlock. This is definitely undesirable behavior from the client, but on the server end I think it is reasonable to block reading from a client that is not draining its end of the connection. So this new approach only changes the client, allowing sending and receiving to operate independently.
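To make the failure mode concrete, here is a standalone sketch (using net.Pipe, not actual TTRPC code) of two endpoints that both block writing because neither is reading:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// net.Pipe gives a synchronous, unbuffered in-memory connection:
	// a Write blocks until the other side Reads.
	clientConn, serverConn := net.Pipe()

	go func() {
		// "Server" pushes a response but never reads requests.
		serverConn.Write(make([]byte, 1))
	}()

	done := make(chan struct{})
	go func() {
		// "Client" pushes a request but never reads responses.
		clientConn.Write(make([]byte, 1))
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("no deadlock")
	case <-time.After(2 * time.Second):
		fmt.Println("both sides are blocked writing: deadlock")
	}
}
```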

I intend to close out #73 assuming this new approach is reasonable.

Commit description

Changes the TTRPC client logic so that sending and receiving with the
server are in completely independent goroutines, with shared state
guarded by a mutex. Previously, sending/receiving were tied together by
reliance on a coordinator goroutine. This led to issues where if the
server was not reading from the connection, the client could get stuck
sending a request, causing the client to not read responses from the
server. See [1] for more details.

The new design sets up separate sending/receiving goroutines. These
share state in the form of the set of active calls that have been made
to the server. This state is encapsulated in the callMap type and access
is guarded by a mutex.

The main event loop in `run` previously handled a lot of state
management for the client. Now that most state is tracked by the
callMap, it mostly exists to notice when the client is closed and take
appropriate action to clean up.

Also did some minor code cleanup. For instance, the code was previously
written to support multiple receiver goroutines, though this was not
actually used. I've removed this for now, since the code is simpler this
way, and it's easy to add back if we actually need it in the future.

[1] #72

Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
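As a rough illustration of the shared-state design described above, here is a minimal sketch; the type, field, and method names (callMap, callInfo, set, get) are assumptions for illustration and may not match the PR's actual code exactly:

```go
package client

import (
	"errors"
	"sync"
)

// callInfo tracks one in-flight request that is waiting on a response.
// (Illustrative only; the real type likely carries more, e.g. the response
// message and the call's context.)
type callInfo struct {
	errCh chan error // signalled by the receiver goroutine
}

// callMap is the state shared between the sender and receiver goroutines.
type callMap struct {
	mu     sync.Mutex
	closed bool
	calls  map[uint32]*callInfo // active calls, keyed by stream ID
}

func newCallMap() *callMap {
	return &callMap{calls: make(map[uint32]*callInfo)}
}

// set registers an in-flight call; it fails if the client has been closed.
func (cm *callMap) set(id uint32, ci *callInfo) error {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	if cm.closed {
		return errors.New("ttrpc: client is closed")
	}
	cm.calls[id] = ci
	return nil
}

// get removes and returns the call matching a received response's stream ID.
func (cm *callMap) get(id uint32) (*callInfo, bool) {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	ci, ok := cm.calls[id]
	if ok {
		delete(cm.calls, id)
	}
	return ci, ok
}
```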

@codecov-commenter commented

Codecov Report

Merging #94 (4f0aeb5) into main (77fc8a4) will increase coverage by 0.47%.
The diff coverage is 75.34%.


@@            Coverage Diff             @@
##             main      #94      +/-   ##
==========================================
+ Coverage   72.88%   73.35%   +0.47%     
==========================================
  Files          11       11              
  Lines         708      728      +20     
==========================================
+ Hits          516      534      +18     
+ Misses        155      153       -2     
- Partials       37       41       +4     
Impacted Files Coverage Δ
client.go 79.60% <75.34%> (-1.07%) ⬇️
server.go 77.65% <0.00%> (+1.46%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dcantah (Member) commented Oct 18, 2021

What did validation for this consist of? The main reason I'm asking is that the sample program linked in the issue takes me to 404 land :/

client.go: resolved review thread
@kevpar (Member, Author) commented Oct 19, 2021

> What did validation for this consist of? The main reason I'm asking is that the sample program linked in the issue takes me to 404 land :/

That repo was made private in the last few months. I've put up a newer version of the test tool here: https://github.com/kevpar/ttrpc-deadlock

For validation, I've run my test tool (which reproduces the deadlock) for ~10 minutes and seen no deadlock occur. Other than that, the main validation is what we get through CI.

@dcantah (Member) left a comment

LGTM. I can't speak on removing the multiple receiver goroutines as I have no history here.

// Receives calls from dispatch, adds them to the set of active calls, and sends them
// to the server.
go func() {
	var streamID uint32 = 1
Reviewer (Member) commented:

This is not entirely new, but since you've modified this code and may have fresh memory: how does this work if streamID grows larger than what a uint32 can represent?

@kevpar (Member, Author) replied:

I think it will only be a problem if a request with a previous ID is still pending. Given that a uint32 lets us do 100 requests per second for roughly 500 days before we overflow, I think it's not an issue in practice.

@kevpar (Member, Author) replied:

Sorry, I forgot we only use odd numbers for requests, so it's ~250 days, not 500. Still a long time though. :)
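For reference, a quick back-of-the-envelope check of that estimate (assuming only odd stream IDs are usable, i.e. half of the uint32 space, at a sustained 100 calls per second):

```go
package main

import "fmt"

func main() {
	const oddIDs = (1 << 32) / 2 // only odd stream IDs are usable
	const callsPerSecond = 100.0
	seconds := float64(oddIDs) / callsPerSecond
	// Prints ~249 days, i.e. roughly the ~250 days mentioned above.
	fmt.Printf("~%.0f days until the stream ID wraps\n", seconds/86400)
}
```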

@dmcgowan (Member) left a comment

LGTM

I think some further refactoring could be done to make the code a bit easier to follow, and the number of goroutines still seems higher than necessary given the addition of a shared locking object.
