Hang in tokio multi-threaded runtime #27

Closed
bin opened this issue Jan 15, 2022 · 2 comments · Fixed by #28

Comments

bin commented Jan 15, 2022

I suspect this may be some sort of deadlock, since the hang goes away when the same code runs in a single-threaded runtime. You will find minimal reproducible examples of the multi-threaded and single-threaded behaviors at https://github.com/bin/fred_broken.git

aembke (Owner) commented Jan 15, 2022

Hi @bin,

Good find. I took a look at this and I don't think it's a deadlock, but it is clearly a race condition.

I have a similar setup in the pipeline_test, but with one key difference - I'm spawning a new task to run each concurrent command there instead of issuing them from a single loop.

The deadlock-detection logic in parking_lot hasn't picked anything up, and there should only be one lock in play in the code paths your example uses. That usage is limited to single-line push_back and pop_front calls on a VecDeque, so I doubt the lock is the issue.
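
For reference, a minimal sketch of that locking pattern (the types and names here are hypothetical, not fred's actual internals). With every guard scoped to a single statement and no other lock acquired while it's held, this shape shouldn't be able to deadlock on its own:

```rust
use std::collections::VecDeque;
use parking_lot::Mutex;

// Hypothetical stand-in for the client's queued command frames.
struct Command;

struct CommandQueue {
    inner: Mutex<VecDeque<Command>>,
}

impl CommandQueue {
    fn push(&self, cmd: Command) {
        // The guard only lives for this one statement.
        self.inner.lock().push_back(cmd);
    }

    fn pop(&self) -> Option<Command> {
        // Same here: no await points and no nested locks while the
        // guard is alive.
        self.inner.lock().pop_front()
    }
}
```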

Looking at the network traffic, I noticed that when this does repro it always happens when the client decides not to flush the socket after writing some bytes. That flush-skipping behavior is just an optimization, and callers can disable it by setting the pipeline flag in the config to false. When I disable pipelining this no longer repros for me either, which makes sense: if the server never receives all the bytes in a frame it won't respond.
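
To illustrate why a missing flush turns into a hang (this is plain tokio with a buffered writer, not fred's write path): bytes passed to write_all can sit in the client-side buffer until flush is called, so a skipped flush can leave the server holding a partial frame that it will never answer.

```rust
use tokio::io::{AsyncWriteExt, BufWriter};
use tokio::net::TcpStream;

async fn send_frame(stream: TcpStream, frame: &[u8]) -> std::io::Result<()> {
    let mut writer = BufWriter::new(stream);

    // For a frame smaller than the buffer's capacity, this only copies
    // the bytes into the in-memory buffer.
    writer.write_all(frame).await?;

    // Without this flush the frame may never reach the server, which is
    // exactly the "server never received all the bytes" situation above.
    writer.flush().await?;
    Ok(())
}
```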

I also found an issue in the logic that decides whether to flush the socket, so I think that's the root cause. My suspicion is that it's an atomic ordering issue in the counter that determines whether to flush some bytes, which would explain why it doesn't repro on a single thread.
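
A rough sketch of the mechanism I suspect (illustrative only, not the actual code): the writer defers flushing while a counter says more commands are pending, so if that counter can be observed out of step with the queue from another thread, the writer can skip the final flush and nothing afterwards will ever push the remaining bytes out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical bookkeeping for the write path.
struct Counters {
    // Commands submitted to the write path but not yet written.
    pending: AtomicUsize,
}

impl Counters {
    /// Writer side: called after one command's bytes have been written.
    /// Returns true if the socket should be flushed now.
    fn should_flush(&self) -> bool {
        // Only the write of the last pending command flushes; earlier
        // writes rely on a later one to push their bytes out. If the
        // increment for a command lands in the wrong spot relative to
        // this check, that "later write" may never come.
        self.pending.fetch_sub(1, Ordering::SeqCst) == 1
    }
}
```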

I'll need to spend some time with loom, but if that's the issue I'll hopefully get an update out this weekend.
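
In case it's useful, here's the rough shape of a loom model for that kind of counter placement (test-only, assuming loom as a dev-dependency; the types and assertion are hypothetical). loom runs the closure under every legal interleaving of the threads, so a placement bug surfaces as a failed assertion instead of a rare hang.

```rust
#[cfg(test)]
mod loom_tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::{Arc, Mutex};
    use loom::thread;
    use std::collections::VecDeque;

    #[test]
    fn counted_before_visible() {
        loom::model(|| {
            let pending = Arc::new(AtomicUsize::new(0));
            let queue = Arc::new(Mutex::new(VecDeque::new()));

            let producer = {
                let (pending, queue) = (Arc::clone(&pending), Arc::clone(&queue));
                thread::spawn(move || {
                    // Count the command first, then publish it.
                    pending.fetch_add(1, Ordering::SeqCst);
                    queue.lock().unwrap().push_back(1u8);
                })
            };

            // Writer side: anything we can pop must already be counted.
            if queue.lock().unwrap().pop_front().is_some() {
                assert!(pending.load(Ordering::SeqCst) >= 1);
            }

            producer.join().unwrap();
        });
    }
}
```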

aembke (Owner) commented Jan 15, 2022

Yeah, that is the issue. It's not a matter of using the wrong Ordering - I just had the incr line in the wrong location. After patching this locally I can no longer repro the issue.
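
For anyone following along, the shape of the fix looks roughly like this (illustrative, reusing the hypothetical counter from my earlier comment rather than the real diff): the increment has to land before the command becomes visible to the writer task, so the writer's flush check always accounts for the frame it just wrote.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use parking_lot::Mutex;

// Hypothetical write path, not the actual patch.
struct WritePath {
    queue: Mutex<VecDeque<Vec<u8>>>,
    pending: AtomicUsize,
}

impl WritePath {
    fn submit(&self, frame: Vec<u8>) {
        // Count the command first...
        self.pending.fetch_add(1, Ordering::SeqCst);
        // ...then make it visible to the writer. In the opposite order
        // the writer can pop and write the frame before it is counted,
        // conclude it wasn't the last pending command, and skip the
        // flush that no later write will ever make up for.
        self.queue.lock().push_back(frame);
    }
}
```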

I'll get a patch up for 4.x this weekend.

aembke mentioned this issue Jan 15, 2022