
What is the "proper" way to respond to a socket timeout when sending PINGREQ? #563

Closed
BertKleewein opened this issue Apr 6, 2021 · 11 comments · Fixed by home-assistant/core#59339

Comments

@BertKleewein
Contributor

I'm seeing a case where the underlying socket is raising a TimeoutError when sending a PINGREQ. This is causing the loop_forever thread to crash. If we call publish after this, it raises a BrokenPipeError.

2 questions:

  1. How would we (as consumers of Paho) proactively detect this situation and recover? If we can't detect this failure until we call publish, we lose the advantage of the keep_alive messages.
  2. Should Paho be modified to handle this internally, maybe by wrapping the call to _check_keepalive in a try/except or maybe by wrapping the call to _send_pingreq?

(This is the root cause of Azure/azure-iot-sdk-python#747)

-Bert

Mar 24 14:49:25  Traceback (most recent call last):
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
Mar 24 14:49:25  self.run()
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/threading.py", line 864, in run
Mar 24 14:49:25  self._target(*self._args, **self._kwargs)
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 3452, in _thread_main
Mar 24 14:49:25  self.loop_forever(retry_first_connection=True)
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 1779, in loop_forever
Mar 24 14:49:25  rc = self.loop(timeout, max_packets)
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 1201, in loop
Mar 24 14:49:25  return self.loop_misc()
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 1628, in loop_misc
Mar 24 14:49:25  self._check_keepalive()
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 2418, in _check_keepalive
Mar 24 14:49:25  self._send_pingreq()
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 2460, in _send_pingreq
Mar 24 14:49:25  rc = self._send_simple_command(PINGREQ)
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 2586, in _send_simple_command
Mar 24 14:49:25  return self._packet_queue(command, packet, 0, 0)
Mar 24 14:49:25  File "/media/maps/slip/slip/common/usr/lib/python3.6/site-packages/paho/mqtt/client.py", line 2911, in _packet_queue
Mar 24 14:49:25  self._sockpairW.send(sockpair_data)
Mar 24 14:49:25  TimeoutError: [Errno 110] Connection timed out
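
As an illustration of what question 1 could look like from the consumer side, here is a minimal sketch. It is hypothetical: the broker address and retry delay are placeholders, and it assumes we manage the network thread ourselves instead of using loop_start(), which is not necessarily Paho's recommended pattern.

# Consumer-side recovery sketch (hypothetical). On an OSError (TimeoutError
# and BrokenPipeError are subclasses) the network loop is restarted instead
# of the thread dying silently.
import threading
import time

import paho.mqtt.client as mqtt

def network_thread(client):
    while True:
        try:
            client.loop_forever(retry_first_connection=True)
            return                      # disconnect() was called cleanly
        except OSError as exc:
            print("network loop died: %r, reconnecting" % exc)
            time.sleep(5)               # placeholder back-off
            try:
                client.reconnect()
            except OSError:
                pass                    # try again on the next iteration

client = mqtt.Client()
client.connect("broker.example.invalid", 1883, keepalive=60)
threading.Thread(target=network_thread, args=(client,), daemon=True).start()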
@ralight
Contributor

ralight commented Apr 14, 2021

It's interesting that it's the sockpair mechanism that is timing out. I'm just reading your linked bug and see you've come to the same conclusion. Regardless of the outcome of your bug, we should still try to prevent this type of error from killing the whole thread.

@BertKleewein
Contributor Author

@ralight - Unfortunately, the device is at a customer site on a different continent, so I can't reach it directly. The customer has to jump through hoops to deploy code and still has to wait for the bug to reproduce. We may not be able to find a root cause.

Regardless, I'll work on a PR to protect this with a try/except.

@stijn-devriendt

We are seeing similar issues, although in a slightly different setting. In our case, we are using asyncio-mqtt as a wrapper around paho-mqtt.

asyncio-mqtt doesn't call loop(); instead it calls loop_read(), loop_write() and loop_misc() when needed. However, the read-handling of sockpairR lives in loop(), so sockpairR is never serviced. After a while (we push a relatively high volume of messages through MQTT), sockpairW.send() gives a timeout, followed by a broken pipe. I believe this means the kernel buffers (both send side and receive side) fill up and no forward progress is made; at that point the timeout fires and the connection is lost.
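
Roughly, that calling pattern looks like this (a simplified sketch of the external-event-loop approach, not asyncio-mqtt's actual code); note that nothing here ever reads from sockpairR, which loop() would normally service:

# Simplified sketch of driving paho from an external asyncio event loop
# (illustrative only). Only loop_read(), loop_write() and loop_misc() are
# called; the internal sockpairR is never drained.
import asyncio

import paho.mqtt.client as mqtt

async def drive(client):
    loop = asyncio.get_running_loop()
    sock = client.socket()
    loop.add_reader(sock, client.loop_read)                  # incoming MQTT traffic
    client.on_socket_register_write = lambda c, u, s: loop.add_writer(s, c.loop_write)
    client.on_socket_unregister_write = lambda c, u, s: loop.remove_writer(s)
    while True:
        client.loop_misc()                                   # keepalive / PINGREQ
        await asyncio.sleep(1)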

I think the handling of sockpairR is broken in the sense that publish triggers a write of a single byte and loop() only dequeues a single byte, but there is no guarantee that:

  • loop() is ever called (as with asyncio-mqtt)
  • The number of MQTT packets written to the MQTT socket corresponds to the number of bytes dequeued from sockpairR.
    This imbalance happens when loop_write() writes more than one packet to the MQTT broker while loop() only dequeues a single byte;
    in that case the number of bytes queued in the sockpair slowly builds up.

As a hint towards a potential solution (a rough sketch follows below), I think that:

  • loop_write() should perform the handling of sockpairR, and
  • either sockpairR should always have its entire backlog drained, with loop_write() writing into sockpairW again if messages are still pending when it returns,
  • or sockpairR should only be drained entirely when there are no packets pending when loop_write() returns.

Optionally, the entire sockpair mechanism could be restricted to threaded mode, which would also give a small speedup in non-threaded mode.
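
A rough sketch of what the first two suggestions could look like (hypothetical code; _sockpairR, _sockpairW are paho's private attributes and may differ between versions):

# Hypothetical sketch of draining sockpairR from the write path. Each queued
# byte is only a wake-up signal, so its content does not matter. If packets
# are still pending after loop_write() returns, one wake-up byte is re-queued
# so the signal is not lost (the second bullet above).
import socket

def loop_write_with_drain(client, max_packets=1):
    rc = client.loop_write(max_packets)
    try:
        while client._sockpairR.recv(4096):
            pass                                # drain the whole backlog
    except (BlockingIOError, socket.timeout):
        pass                                    # nothing left to read
    if client.want_write():                     # packets still queued
        client._sockpairW.send(b"\x00")         # re-arm the wake-up signal
    return rc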

@ralight
Contributor

ralight commented Aug 20, 2021

I've made some changes that should fix this, if you're able to confirm whether it works for you I'd be grateful. The changes are in the 1.6.x branch.

@stijn-devriendt

@ralight, I assume you're pointing at 4910b78? If so, that's not going to help.

Let me try to elaborate a bit more. The sockpair is a regular TCP/IP connection going over the loopback interface. This means the kernel will allocate memory for transmit and receive buffers. What paho does (when not using threaded mode and loop() is never called) is write to the socket but never read from it.

This means those kernel-allocated read/write buffers fill up, and they (obviously) have a fixed size. When the receive buffer is full, the kernel advertises a TCP window size of 0 to the sending end of the socket, and a timeout mechanism kicks in. If the receiving end does not make progress within a certain time, the socket is forcefully closed on the transmit side and the next send operation gets an ETIMEDOUT error.

See TCP_USER_TIMEOUT.

So just catching the exception won't help; an EWOULDBLOCK is likely never even raised at all, since the mechanism above kicks in once the TCP receive buffer is full, while the entire TCP send buffer still has to fill up, one byte at a time, before EWOULDBLOCK would be returned. It's quite likely the kernel aborts the connection before the send buffer fills up too.
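
As a standalone illustration of that mechanism (a sketch, not code from this thread): a TCP pair over loopback, built roughly the way paho builds its internal socketpair, where one end writes single bytes and the other never reads. With a non-blocking sender the buffers fill and the send fails quickly; with a blocking sender the kernel's zero-window probing eventually aborts the connection with ETIMEDOUT as described above.

# Demo: write to a loopback TCP pair that is never read. The reader's receive
# buffer fills first (zero window), then the writer's send buffer; with a
# non-blocking writer the send then raises BlockingIOError (EWOULDBLOCK).
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
writer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
writer.connect(listener.getsockname())
reader, _ = listener.accept()
listener.close()

writer.setblocking(False)
sent = 0
try:
    while True:
        writer.send(b"\x00")        # one wake-up byte at a time, never read
        sent += 1
except BlockingIOError:
    print("send blocked after %d unread bytes" % sent)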

As a confirmation of my thesis above, I created a script (using our MQTT layer on top of asyncio-mqtt) to send messages to MQTT as fast as possible. Using netstat -anp one can see the send and receive queues for the sockpair connection:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp        0  34442 127.0.0.1:39385         127.0.0.1:35532         ESTABLISHED 6801/python3.7       on (10,05/11/0)
tcp   5401554      0 127.0.0.1:35532         127.0.0.1:39385         ESTABLISHED 6801/python3.7       off (0.00/0/0)

At this point, the connection has sent >7M messages, but the recv-Q is full and the timeout has started to kick in.
In this example, the next probe will be sent in 10s and 11 of them have already been sent. Default settings, I believe, cause a connection abort after 15 probes.

Using ss -tmp one can see the socket memory in a bit more detail:

State       Recv-Q    Send-Q      Local Address:Port        Peer Address:Port   Process
ESTAB       0         34442           127.0.0.1:39385          127.0.0.1:35532   users:(("python3.7",pid=6801,fd=8))
	 skmem:(r0,rb131072,t0,tb2626560,f1910,w2627722,o0,bl0,d0)
ESTAB       5401554   0               127.0.0.1:35532          127.0.0.1:39385   users:(("python3.7",pid=6801,fd=7))
	 skmem:(r6291760,rb6291456,t0,tb2626560,f3792,w0,o0,bl0,d113)

Ultimately, this test took about 10M sent messages before stopping. The timeout mechanism started at around 5.4M messages.

As I also tried to explain above, this library is likely to suffer from the same issue in threaded mode as well, although at a lower rate. From the looks of it, the recv/send operations aren't guaranteed to be balanced, so the same effect will play out, just at an even slower pace.

Still, I think using netstat and ss, one should be able to easily detect this issue.
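
For in-process detection (rather than shelling out to netstat/ss), something like the following could report the backlog on the pair's read end. This is hypothetical, Linux-only, and reaches into paho's private _sockpairR attribute:

# Hypothetical backlog check: FIONREAD returns the number of unread bytes
# queued on the read end of the socket pair, roughly the Recv-Q value shown
# by netstat above. Linux-only; _sockpairR is a private paho attribute.
import fcntl
import struct
import termios

def sockpair_backlog(client):
    buf = fcntl.ioctl(client._sockpairR.fileno(), termios.FIONREAD, b"\x00" * 4)
    return struct.unpack("=i", buf)[0]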

@ralight
Contributor

ralight commented Aug 23, 2021

@stijn-devriendt The commit in question is immediately after the one you highlight, but I had neglected to push it to the repository. I've done that now, sorry for the confusion.

Thanks very much for the extra information. I understood the principle, but hadn't been able to trigger it - I simply hadn't sent enough messages. Having expected numbers really helps.

@stijn-devriendt

stijn-devriendt commented Aug 24, 2021

@ralight That one looks good. I had a quick run and (obviously) there's no more sockpair in our use-case. Our other tests also succeed with this version. I'm now running the original test I had written for my previous comment on this thread, but I don't really expect any hiccups. Currently at 4M messages... As an added bonus, I think it's not unlikely that this is a slight performance improvement too, not having to perform extra reads or writes.

Just as an informational query, when is there a release planned on the 1.6.x branch? If that isn't on the table yet, we'll need to schedule work here to backport the patch ourselves for our own S/W. Either way works, no pressure ;)

Update: I just stopped the test at >14M messages. Netstat output looks good (no backlog even on the MQTT socket itself).

@ralight
Contributor

ralight commented Aug 24, 2021

@stijn-devriendt Great, thanks for the feedback.

I don't have a particular plan for the 1.6 release as yet. The Python client has languished unloved for a while, so there is a bit of a backlog of issues. I'd like to get that a bit more under control first. Now I'm thinking about it, perhaps a planned release date would be good. I'm going to say around the end of September. I wish it could be earlier, but I recognise this isn't my highest priority.

@stijn-devriendt

(Updated my previous comment at the same time as your comment)

Great, I'll discuss internally if we'll wait or move ahead with a backported version. Thanks a lot for your help!

@stijn-devriendt

I guess this one can be closed as 1.6 has been released?

@ralight
Contributor

ralight commented Nov 25, 2021

I guess so :) Thanks very much.
