
Unix Domain Socket: peer-peer connections appear to hang #1349

Open

djm1329 opened this issue Nov 27, 2018 · 14 comments

Comments

djm1329 commented Nov 27, 2018

I have sample code for a peer-to-peer app that sends messages every 10 ms in both directions. With TCP the code runs "forever", but with a Unix domain socket one side (usually the "client") almost always stops receiving after a short while. The reproducer code is in https://github.com/djm1329/AkkaStreamSocketTests.

The sample is run by starting a server, then a client. The client connects to the server over a domain socket, then both sides start sending messages to each other. After a short while, one side (usually the client) apparently stops receiving messages. Replacing the domain socket with a TCP socket (details on this are in the repo) seems to work fine.
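For reference, the traffic pattern described above can be sketched outside of Akka Streams using the JDK's own Unix-domain-socket support (Java 16+). This is not the reproducer itself (which uses Akka Streams and the JNR-backed connector); it is just a minimal sketch of two peers exchanging messages over a domain socket, with all names illustrative:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class UdsPingPong {

    // Sends `count` messages over a Unix domain socket and returns the
    // upper-cased echoes received back from the "server" peer.
    public static List<String> exchange(int count) throws IOException, InterruptedException {
        Path path = Files.createTempDirectory("uds-demo").resolve("demo.sock");
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(path);

        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(addr);

            // "Server" peer: echo each message back in upper case.
            Thread serverThread = new Thread(() -> {
                try (SocketChannel peer = server.accept()) {
                    ByteBuffer buf = ByteBuffer.allocate(64);
                    while (peer.read(buf) > 0) {
                        buf.flip();
                        String msg = StandardCharsets.UTF_8.decode(buf).toString();
                        peer.write(ByteBuffer.wrap(
                            msg.toUpperCase().getBytes(StandardCharsets.UTF_8)));
                        buf.clear();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();

            // "Client" peer: send a message, then read the echo before sending the next.
            List<String> replies = new ArrayList<>();
            try (SocketChannel client = SocketChannel.open(addr)) {
                for (int i = 1; i <= count; i++) {
                    client.write(ByteBuffer.wrap(
                        ("hello world " + i).getBytes(StandardCharsets.UTF_8)));
                    ByteBuffer reply = ByteBuffer.allocate(64);
                    client.read(reply);
                    reply.flip();
                    replies.add(StandardCharsets.UTF_8.decode(reply).toString());
                }
            }
            serverThread.join();
            return replies;
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws Exception {
        exchange(3).forEach(System.out::println);
    }
}
```

With the JDK channels this runs to completion; the issue reported here is that the JNR-backed connector's equivalent of the client side stops receiving.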

huntc (Contributor) commented Nov 28, 2018

Thanks for the report and for the reproducer. Here's what I see when running on OS X (what's your platform?):

Server

Just one of these:

[info] hello world 1

Client

Lots of these:

[info] [DEBUG] [11/28/2018 12:28:36.599] [UDSTest-akka.actor.default-dispatcher-6] [akka.stream.Log(akka://UDSTest/system/StreamSupervisor-0)] [incoming] Element: HELLO WORLD 13056

djm1329 (Author) commented Nov 28, 2018

Sorry, I should have specified the platform. It is macOS 10.13.6.

Yes, I found the behaviour varies a bit from run to run. For me, the client usually stops receiving after a number of "HELLO WORLD <n>" from the server. The server keeps on receiving and printing "hello world <n>". But I have also seen the reverse, as you report.

For my latest run for example, I get up to

[info] [DEBUG] [11/27/2018 20:49:10.601] [UDSTest-akka.actor.default-dispatcher-2] [akka.stream.Log(akka://UDSTest/system/StreamSupervisor-0)] [incoming] Element: HELLO WORLD 203

on the client, and then it stops. The server keeps going; when I stopped it, it was at

[info] hello world 2413.

But always one side or the other seems to stop.

(sorry for the inconsistent output format between client and server)

huntc (Contributor) commented Nov 28, 2018

Thanks for the additional info.

I’m in the middle of a prod rollout this week and next, so time will be tight. However, I’ll look at it ASAP. My first move will be to create a unit test that reproduces yours, so if you feel like raising a PR for that then it’d save me some time. No worries if you can’t though.

djm1329 (Author) commented Nov 28, 2018

Thanks so much! No rush for this on my side. Time is short for me for the next couple of weeks, but I will try to produce a unit-test PR if I get cycles before you do.

huntc (Contributor) commented Feb 26, 2019

Again, sorry for the delay. I will take a look at this now.

djm1329 (Author) commented Feb 26, 2019 via email

huntc (Contributor) commented Feb 26, 2019

Hey Doug - I'm unsure if this will help, but I've finally been able to get a long-outstanding PR complete: #1297. I'm now going to try it with some code I have, but perhaps you could build it locally and take it for a spin with your test case. I'm not convinced that it'll fix things, but there's an outside chance.

huntc (Contributor) commented Feb 26, 2019

UPDATE: I just tried your sample project, @djm1329 - it seems to work for me... and I didn't update anything. :-( It'd be good to learn of your results.

djm1329 (Author) commented Feb 26, 2019 via email

huntc (Contributor) commented Feb 26, 2019

Ok, I’ll dig in more. Sounds like a race condition.

huntc (Contributor) commented Feb 27, 2019

Actually, I did manage to reproduce locally - I just didn't realise what the output should have been in your examples. :-)

I've spent another day on this and ended up with another commit to that PR: b0b1617. However, I don't think I've been able to fix the issue here. You'll see that there's a new test that is now ignored. This test is designed to reproduce the condition you see, and it does indeed fail by running forever. The crux of the issue is that the code wants to write some data and registers interest in doing so with NIO (OP_WRITE). However, NIO doesn't appear to honour this request and never permits the data to be written.
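The register-interest-then-write contract described above can be sketched with the JDK's own selector over a loopback TCP channel (JDK selectors don't accept JNR's Unix-socket channels, and channel names here are illustrative). The report is that the JNR-backed equivalent of the select step never fires:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class WriteInterestDemo {
    // Returns true if the selector reports the channel writable after we
    // register OP_WRITE interest -- the contract the connector depends on.
    public static boolean writableAfterRegistering() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel out = SocketChannel.open(server.getLocalAddress());
                 SocketChannel in = server.accept()) {
                out.configureBlocking(false);
                // Step 1: we have data to write, so register interest in OP_WRITE.
                SelectionKey key = out.register(selector, SelectionKey.OP_WRITE);
                // Step 2: the selector should report the key writable almost
                // immediately, since the send buffer is empty.
                selector.select(1000);
                boolean writable = key.isValid() && key.isWritable();
                if (writable) {
                    out.write(ByteBuffer.wrap("data".getBytes()));
                    // Step 3: once the pending data is flushed, OP_WRITE interest
                    // is normally cleared to avoid spinning the selector.
                    key.interestOps(0);
                }
                return writable;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writableAfterRegistering());
    }
}
```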

My next move is to abandon JNR in favour of providing my own JNA implementation that calls out to libc - in the same way that Nailgun does: https://github.com/facebook/nailgun/blob/master/nailgun-server/src/main/java/com/facebook/nailgun/NGUnixDomainSocketLibrary.java. I'd also have to abstract kqueue and epoll for the integration with NIO. Having more control over the underlying socket and notification mechanism would give a greater degree of confidence. I've found a few weird things with JNR along the way, and it could even be that jnr/jnr-unixsocket#68 fixes things for us. Another oddity is that we're only able to send 8K buffers despite the code trying to send 64K; I don't know whether JNR contributes to that.
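On the short-write observation: with any non-blocking socket, a single write() only accepts as much as the kernel send buffer has room for, so an offer of 64K can legitimately come back as a shorter write even before JNR is involved. A sketch of that behaviour over a loopback TCP channel (names illustrative, peer deliberately never reads):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ShortWriteDemo {
    // Offers `total` bytes to a non-blocking socket whose peer never reads,
    // and returns how many bytes the kernel actually accepted.
    public static int bytesAccepted(int total) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel out = SocketChannel.open(server.getLocalAddress());
                 SocketChannel in = server.accept()) {
                out.configureBlocking(false);
                ByteBuffer buf = ByteBuffer.allocate(total);
                int written = 0;
                int n;
                // Keep writing until the send buffer fills and write() returns 0;
                // the caller would then register OP_WRITE and wait to resume.
                while (buf.hasRemaining() && (n = out.write(buf)) > 0) {
                    written += n;
                }
                return written;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int total = 64 * 1024 * 1024;
        System.out.println("accepted " + bytesAccepted(total) + " of " + total + " bytes");
    }
}
```

Whether the 8K cut-off itself comes from the socket buffers or from something JNR does internally is exactly the open question above.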

Naturally, there could well be more bugs associated with the code I have here. But I've been through this code with a fine-tooth comb quite a few times now, so I'm increasingly suspicious of the code that we depend on.

Also, unfortunately, I've no more time to spend on this project during the work week, so it will be best-effort in my own time. That said, it bothers me that the code doesn't work fully as advertised, so I'm motivated to fix it. :-)

2m (Member) commented Feb 28, 2019

Great analysis. We really appreciate the work you put into this!

ennru (Member) commented Mar 20, 2019

With #1297 merged and 1.0-RC1 released, is this still an issue?

huntc (Contributor) commented Mar 20, 2019

I’d say it is still a problem. My view is to replace JNR.
