What is the "proper" way to respond to a socket timeout when sending PINGREQ? #563
It's interesting that it's the sockpair mechanism that is timing out. I'm just reading your linked bug and see you've come to the same conclusion. Regardless of the outcome of your bug, we should still try to prevent this type of error from killing the whole thread.
@ralight - Unfortunately, the device is at a customer site on a different continent, so I can't reach it directly. The customer has to jump through hoops to deploy code and still has to wait for the bug to reproduce. We may not be able to find a root cause. Regardless, I'll work on a PR to protect this with a try/except.
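For illustration of the kind of try/except guard being discussed (this is not the actual PR): a sketch against the paho 1.x API that subclasses Client and catches socket errors raised while the keepalive ping is sent, so they don't propagate out of the network thread. _check_keepalive is an internal paho method, so overriding it is fragile, and exactly how the error should be surfaced is what the real fix has to decide; closing the socket here is only an assumption that lets the normal reconnect path take over.

```python
import socket

import paho.mqtt.client as mqtt


class GuardedClient(mqtt.Client):
    """Illustrative only: keep a failed PINGREQ from killing the network thread."""

    def _check_keepalive(self):
        try:
            # Internal paho method that sends the PINGREQ once the keepalive
            # interval has elapsed; this is where the TimeoutError surfaced.
            super()._check_keepalive()
        except (socket.timeout, BrokenPipeError, OSError):
            # Assumption: closing the socket makes the next loop iteration
            # see a lost connection, so loop_forever()'s reconnect handling
            # takes over instead of the thread dying on an unhandled error.
            sock = self.socket()
            if sock is not None:
                try:
                    sock.close()
                except OSError:
                    pass
```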
We are seeing similar issues, although in a slightly different setting. In our case, we are using asyncio-mqtt as a wrapper around paho-mqtt, and asyncio-mqtt doesn't call […]. I think the handling of sockpairR is broken in the sense that publish triggers a write of a single byte and loop() only dequeues a single byte, but there is no guarantee that:
As a hint towards a potential solution, I think that:
Optionally, the entire sockpair mechanism should be used only in threading mode, giving an extra speedup in non-threaded mode. (See the sketch just below for the buffer behaviour in question.)
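To make the single-byte write/read imbalance concrete, here is a minimal stdlib-only sketch (no paho involved). socket.socketpair() is typically an AF_UNIX pair rather than the loopback TCP connection paho uses, and the one-read-per-three-writes ratio is an arbitrary assumption, but the effect is the same: the kernel buffer between the two ends eventually fills and further one-byte writes fail.

```python
import socket

# Stand-in for paho's internal wakeup pair: one end is written to on every
# publish(), the other end is drained one byte at a time by loop().
reader, writer = socket.socketpair()
writer.setblocking(False)

writes = reads = 0
try:
    while True:
        writer.send(b"\x00")       # one byte per "publish"
        writes += 1
        if writes % 3 == 0:        # simulate loop() running less often...
            reader.recv(1)         # ...and draining only one byte per call
            reads += 1
except BlockingIOError:
    # The kernel buffer between the two ends is full: no more one-byte
    # wakeups can be queued until the reader catches up.
    print(f"buffer full after {writes} writes and only {reads} reads")
finally:
    reader.close()
    writer.close()
```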
I've made some changes that should fix this; if you're able to confirm whether it works for you, I'd be grateful. The changes are in the […].
@ralight, I assume you're pointing at 4910b78? If so, that's not going to help. Let me try to elaborate a bit more.

The sockpair is a regular TCP/IP connection going over the loopback interface. This means the kernel will allocate memory for transmit and receive buffers. What paho does (when not using threaded mode and […]) means those kernel-allocated read/write buffers will fill up. However, these buffers (obviously) have a fixed size.

What happens in the kernel is that, when the receive buffer is full, the TCP window size will be advertised as 0 to the sending end of the socket. When that happens, a timeout mechanism kicks in: if the receiving end does not make progress within a certain time, the socket is forcefully closed on the transmit side and the next send operation gets an ETIMEDOUT error. See TCP_USER_TIMEOUT.

So only catching the exception won't help; it's likely the exception will never even be raised at all, since the above mechanism kicks in when the TCP receive buffer is full, but then there is still the entire TCP send buffer to fill up, one byte at a time, before EWOULDBLOCK is returned. It's not unlikely that the connection is aborted by the kernel before the send buffer fills up too.

As a confirmation of my thesis above, I created a script (using our MQTT layer above asyncio-mqtt) to send messages to MQTT as fast as possible. Using the command […]
At this point, the connection has sent >7M messages, but the recv-Q is full and the timeout has started to kick in. Using […]
Ultimately, this test took about 10M sent messages before stopping. The timeout mechanism started at around 5.4M messages. As I also tried to explain above, this library is likely to suffer from this issue too, although at a lower rate, when threaded mode is used. From the looks of it, the recv/send operations aren't guaranteed to be balanced, so the same effect will play out at an even slower pace. Still, I think that using netstat and ss, one should be able to detect this issue fairly easily.
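As a rough sketch of the detection idea above (in Python, to keep one language across the examples): shell out to ss and flag established loopback connections whose Recv-Q is non-empty. The column layout assumed here is the usual State Recv-Q Send-Q Local Peer of ss -tn; adjust if your version prints differently. netstat -tn shows equivalent information.

```python
import subprocess


def loopback_backlogs():
    """Print established loopback TCP connections whose Recv-Q is non-empty."""
    out = subprocess.run(
        ["ss", "-tn"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    for line in out[1:]:                  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        state, recv_q, send_q, local, peer = fields[:5]
        # A loopback connection with a growing Recv-Q is the symptom of the
        # wakeup socket not being drained fast enough.
        if local.startswith("127.") and recv_q.isdigit() and int(recv_q) > 0:
            print(f"{local} <- {peer}: Recv-Q={recv_q} Send-Q={send_q}")


if __name__ == "__main__":
    loopback_backlogs()
```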
@stijn-devriendt The commit in question is immediately after the one you highlight, but I had neglected to push it to the repository. I've done that now, sorry for the confusion. Thanks very much for the extra information. I understood the principle, but hadn't been able to trigger it - I simply hadn't sent enough messages. Having expected numbers really helps.
@ralight That one looks good. I had a quick run and (obviously) there's no more sockpair in our use case. Our other tests also succeed with this version. I'm now running the original test I had written for my previous comment on this thread, but I don't really expect any hiccups. Currently at 4M messages... As an added bonus, I think it's not unlikely that this is a slight performance improvement too, not having to perform extra reads or writes. Just as an informational query, when is a release planned on the 1.6.x branch? If that isn't on the table yet, we'll need to schedule work here to backport the patch ourselves for our own S/W. Either way works, no pressure ;) Update: I just stopped the test at >14M messages. Netstat output looks good (no backlog even on the MQTT socket itself).
@stijn-devriendt Great, thanks for the feedback. I don't have a particular plan for the 1.6 release as yet. The Python client has languished unloved for a while, so there is a bit of a backlog of issues. I'd like to get that a bit more under control first. Now that I'm thinking about it, perhaps a planned release date would be good. I'm going to say around the end of September. I wish it could be earlier, but I recognise this isn't my highest priority.
(Updated my previous comment at the same time as your comment.) Great, I'll discuss internally whether we'll wait or move ahead with a backported version. Thanks a lot for your help!
I guess this one can be closed now that 1.6 has been released?
I guess so :) Thanks very much.
I'm seeing a case where the underlying socket is raising a TimeoutError when sending a PINGREQ. This is causing the loop_forever thread to crash. If we call publish after this, it raises a BrokenPipeError.

2 questions:

1. […] — if we only handle this around publish, we lose the advantage of the keep_alive messages.
2. Should this be handled inside paho, e.g. by wrapping _check_keepalive in a try/except, or maybe by wrapping the call to _send_pingreq?

(This is the root cause of Azure/azure-iot-sdk-python#747)
-Bert
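For completeness, a minimal sketch of the application-side workaround implied by the first question, written against the paho 1.x API: run loop_forever() yourself, catch whatever escapes it, and reconnect. The broker host and back-off value are placeholders, and this is only a stopgap compared with handling the error inside paho, since keepalive handling is dead from the moment the loop crashes until the reconnect succeeds.

```python
import time

import paho.mqtt.client as mqtt

BROKER = "broker.example.com"   # placeholder host

client = mqtt.Client()
client.connect(BROKER, 1883, keepalive=60)

# loop_forever() runs the network loop (including the PINGREQ keepalives) in
# the calling thread.  If a socket error such as TimeoutError escapes it, the
# loop is dead; restarting it after a reconnect is the crude application-side
# workaround, as opposed to paho catching the error internally.
while True:
    try:
        client.loop_forever(retry_first_connection=True)
        break                      # clean disconnect() was requested
    except (TimeoutError, BrokenPipeError, OSError):
        time.sleep(5)              # placeholder back-off
        try:
            client.reconnect()
        except OSError:
            pass                   # broker unreachable, retry on the next pass
```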