
behavior change, now I get timeouts on ping #175

Open
schaefer01 opened this issue Aug 29, 2023 · 19 comments
Comments

@schaefer01

Hi,
I've got CircuitPython v7 running on a red Featherwing, loaded with MiniMQTT from the v7 bundle (around December 2022), and v8 running on a black Featherwing, loaded with MiniMQTT from the v8 bundle.

The v7 board, with the earlier MiniMQTT, runs without ping timeouts while I'm waiting and doing nothing.

The v8 board does this after a short period of time:
File "code.py", line 1776, in
File "adafruit_minimqtt/adafruit_minimqtt.py", line 1007, in loop
File "adafruit_minimqtt/adafruit_minimqtt.py", line 1030, in _wait_for_msg
File "adafruit_minimqtt/adafruit_minimqtt.py", line 1126, in _sock_exact_recv
MMQTTException: Unable to receive 1 bytes within 60 seconds.

The problem I see is that the exception raised in response to a ping should be handled in ping() itself, which has its own timeout-handling code. That code is never reached, but even if it were, it would also raise an unhandled exception. Alternatively, it should be handled in the loop that contains the ping, which would return an error code for the app to receive and act on.

Questions:

  1. Why doesn't the earlier version of MiniMQTT have this problem?
  2. In any case, why aren't ping timeout exceptions handled? Rather than raising an exception, return an error code the app can act on, deciding whether it wants to restart or not, and be accurate in that the problem is a ping timeout rather than an "expected 1 byte(s)" timeout.
@schaefer01
Author

I put loop() in a try/except block.
ping() raises exceptions at a rate of one per minute, which makes me believe that ping is broken, but...

I don't know if this is a MiniMQTT problem or a CircuitPython v8 problem, because both differ between my board running v7 with the v7 MiniMQTT and my board running v8 with the latest MiniMQTT.
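Wrapping loop() in try/except looks roughly like the sketch below. This is a hedged example: `safe_loop` and the fallback exception class are my own names, not part of MiniMQTT; only the `MMQTTException` import path comes from the library.

```python
# Sketch of wrapping client.loop() so an MMQTTException does not kill the
# application. `safe_loop` is a hypothetical helper, not a MiniMQTT API.
try:
    from adafruit_minimqtt.adafruit_minimqtt import MMQTTException
except ImportError:
    class MMQTTException(Exception):
        """Fallback so the sketch can run off-device."""

def safe_loop(client, timeout=1.0):
    """Run one loop() iteration; return the rcs list, or None on timeout."""
    try:
        return client.loop(timeout)
    except MMQTTException as exc:
        # e.g. "Unable to receive 1 bytes within 60 seconds."
        print("Client exception:", exc)
        return None
```

The application can then decide per iteration whether to reconnect, retry, or carry on with its time-sensitive work.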

@schaefer01
Author

More information: I instrumented the code. The loop is supposed to exit after one second, but when a ping occurs, the loop holds its context until the ping timeout.

My application's main loop has time-sensitive actions and shouldn't be delayed for one minute by a bad ping.

entering client.loop()
Client exception: Unable to receive 1 bytes within 60 seconds.
exited client.loop()
entering client.loop()
client rcs = [208]
exited client.loop()
entering client.loop()
Client exception: Unable to receive 1 bytes within 60 seconds.
exited client.loop()
entering client.loop()
client rcs = [208]
exited client.loop()
entering client.loop()
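The instrumentation above can be reproduced with a small timing wrapper (a sketch; `timed_loop` is my name, and I'm assuming `client.loop()` returns a list of response codes as in the log):

```python
import time

def timed_loop(client, timeout=1.0):
    """Time one client.loop() call to see whether it honors its timeout
    or holds the context until a longer (ping) timeout fires."""
    print("entering client.loop()")
    start = time.monotonic()
    rcs = client.loop(timeout)
    elapsed = time.monotonic() - start
    if rcs:
        print("client rcs =", rcs)
    print("exited client.loop() after %.2f s" % elapsed)
    return elapsed, rcs
```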

@schaefer01
Author

Got it. It's not ping. I instrumented the source available after 12/5/2023.

In loop(), the call to receive raises an exception if nothing is received, but it's OK for nothing to be received.

_wait_for_msg() ->_sock_exact_recv(1)

@schaefer01
Author

In the same function, _sock_exact_recv(), the timeout isn't the receive timeout but the keep-alive:

read_timeout = self.keep_alive
...
if self.get_monotonic_time() - stamp > read_timeout:
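For contrast, a receive helper that honors a dedicated receive timeout instead of reusing keep_alive might look like this. This is a sketch under my own names and assumptions, not the library's code:

```python
import errno
import time

def sock_exact_recv(sock, nbytes, recv_timeout):
    """Read exactly nbytes from sock, giving up after recv_timeout seconds.
    The deadline uses a dedicated receive timeout, not self.keep_alive."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    got = 0
    stamp = time.monotonic()
    while got < nbytes:
        try:
            got += sock.recv_into(view[got:], nbytes - got)
        except OSError as err:
            # A socket timeout with no data is not fatal by itself.
            if err.errno not in (errno.ETIMEDOUT, errno.EAGAIN):
                raise
        if got < nbytes and time.monotonic() - stamp > recv_timeout:
            raise TimeoutError(
                "Unable to receive %d bytes within %s seconds" % (nbytes, recv_timeout)
            )
    return bytes(buf)
```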

@schaefer01
Author

Problem solved with a hack inside _wait_for_msg().

Note: first I instrumented the code to discover the path, then I shortened the timeout for the first receive (which is OK for the first time in) and lengthened the timeout if something was received. This is a hack that lets me keep going, but it isn't the right thing to do for the general case.

Note: keep_alive is the wrong value for this function; keep_alive is for ping. There is a receive timeout on self that is set but never used.

    else:  # socketpool, esp32spi
        # print("2: self._sock_exact_recv(), keep_alive=", self.keep_alive)
        self.keep_alive = 1
        try:
            res = self._sock_exact_recv(1)
        except OSError as error:
            if error.errno in (errno.ETIMEDOUT, errno.EAGAIN):
                # raised by a socket timeout if 0 bytes were present
                return None
            raise MMQTTException from error
        else:
            self.keep_alive = 60

@schaefer01
Author

schaefer01 commented Dec 7, 2023

The above is not optimal because it causes ping to fail. I'm now adding a parameter to _wait_for_msg() to indicate whether a response is definitely expected, or might or might not be expected, and setting the first wait accordingly.
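The idea reduces to picking the first-byte wait from the caller's expectation. A tiny illustrative helper (hypothetical names, not the library's API):

```python
def first_recv_timeout(response_expected, poll_timeout=1, keep_alive=60):
    """Pick the timeout for the first _sock_exact_recv(1) call:
    short when a message may or may not arrive (idle polling in loop()),
    long when a response is definitely expected (e.g. after PINGREQ)."""
    return keep_alive if response_expected else poll_timeout
```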

@schaefer01
Author

Solved it. Contact me if you want to know what I did.
The lower limit on the timeout passed to loop() without dropping messages is 2 seconds.

@vladak
Contributor

vladak commented Jan 20, 2024

2. In any case, why aren't ping timeout exceptions handled? Rather than raising an exception, return an error code the app can act on, deciding whether it wants to restart or not, and be accurate in that the problem is a ping timeout rather than an "expected 1 byte(s)" timeout.

Here's my understanding: the trouble is that once the receive loop (common to all MQTT message processing, not just PING) starts reading data into the buffer and hits the receive timeout (modulo #189), the MQTT session is pretty much useless. If a partial read was already done, that portion of the MQTT message was lost with the buffer (it is a local variable); if the client tried to recover, it would get the remaining portion of the message, which would be interpreted as garbage w.r.t. its current state, hence the exception. Maybe it could special-case a receive timeout with an empty buffer, but it should still give the consumer an indication (a different exception, perhaps) that the receive timeout was exceeded. Possibly not worth the trouble, as a receive timeout usually means the connection is dead at the TCP level anyway.

@schaefer01
Author

My belief was that the ping timeout was colliding with the "normal" timeout. There is also a timeout between the first element received and the second element received. I disabled the ping because I wanted a short timeout for the equivalent of polling (if a ping is wanted, it should be sent outside the polling loop), and I then extended the timeout between the first and second received elements, or else the function would exit too soon. That is, the first element was received but the second timed out, when the function should have waited longer.

@vladak
Contributor

vladak commented Jan 22, 2024

My belief was that the ping timeout was colliding with the "normal" timeout. There is also a timeout between the first element received and the second element received. I disabled the ping because I wanted a short timeout for the equivalent of polling (if a ping is wanted, it should be sent outside the polling loop), and I then extended the timeout between the first and second received elements, or else the function would exit too soon. That is, the first element was received but the second timed out, when the function should have waited longer.

Could you expand on the timeout collision?

Anyhow, with the fix for #198, the PINGREQ will be sent only if there has been no outgoing message traffic to the broker for the duration of the keep-alive timeout or longer (assuming loop() is called often enough; this could be further improved by making the check in various functions).
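The bookkeeping described for #198 can be sketched like this (an illustrative class of my own, not the PR's code):

```python
import time

class KeepAliveTracker:
    """Track outgoing traffic; PINGREQ is needed only when nothing has
    been sent to the broker for keep_alive seconds or longer."""

    def __init__(self, keep_alive=60):
        self.keep_alive = keep_alive
        self._last_send = time.monotonic()

    def note_send(self):
        # Call whenever any MQTT packet goes out; it resets the timer,
        # making an explicit PINGREQ unnecessary for now.
        self._last_send = time.monotonic()

    def ping_due(self):
        return time.monotonic() - self._last_send >= self.keep_alive
```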

@schaefer01
Author

The "collision" is an assumption on my part. I was seeing timeouts that did not make sense with respect to the test code I was running. The test code was a call to client.loop() with a 2-second timeout, where I modified the code to pass the passed-in timeout down to _wait_for_msg(), to be used not on the first call to _sock_exact_recv() but on the second call inside _wait_for_msg().

I did not instrument/diagnose the path that generated the timeout, but when I disabled ping the unwanted timeouts went away.

My assumption is that enough time passed that the ping timeout was generated (even though a ping may or may not have been sent). And I'm not sure the code is "smart enough" not to invoke a ping in the middle of an MQTT transaction.

@SRMAX

SRMAX commented Jan 25, 2024

2. In any case, why aren't ping timeout exceptions handled? Rather than raising an exception, return an error code the app can act on, deciding whether it wants to restart or not, and be accurate in that the problem is a ping timeout rather than an "expected 1 byte(s)" timeout.

Here's my understanding: the trouble is that once the receive loop (common to all MQTT message processing, not just PING) starts reading data into the buffer and hits the receive timeout (modulo #189), the MQTT session is pretty much useless. If a partial read was already done, that portion of the MQTT message was lost with the buffer (it is a local variable); if the client tried to recover, it would get the remaining portion of the message, which would be interpreted as garbage w.r.t. its current state, hence the exception. Maybe it could special-case a receive timeout with an empty buffer, but it should still give the consumer an indication (a different exception, perhaps) that the receive timeout was exceeded. Possibly not worth the trouble, as a receive timeout usually means the connection is dead at the TCP level anyway.

I agree an error code should ALWAYS be returned so the program can handle what happens next, whether that is a restart or a loop that waits until the connection can be re-established. This just makes for very unreliable programs. If you ever want CircuitPython to be taken seriously, then escalate issues like this one and get it right. This is only one of several error conditions that are not handled properly.

@schaefer01
Author

While important people might be listening on this thread: if you want CircuitPython to be taken seriously, you really need to:

  1. buckle down and do the work to implement the secure socket layer so it can be on the Internet, so you aren't upsetting the professional and hardworking IT folks who support the host organizations of developers.
  2. make the Wiznet API look more like Unix/Linux, including making it easier to access/set the MAC address rather than DE.AD.BE.EF, again to not upset those same IT folks, and not crash (or put in a plan to rapidly fix the problem) when there's a new update to macOS.
  3. stop telling developers, when there's a problem you don't want to fix, that CircuitPython is for hobbyists.

@vladak
Contributor

vladak commented Jan 25, 2024

I did not instrument/diagnose the path that generated the timeout, but when I disabled ping the unwanted timeouts went away.

My assumption is that enough time passed that the ping timeout was generated (even though a ping may or may not have been sent). And I'm not sure the code is "smart enough" not to invoke a ping in the middle of an MQTT transaction.

Currently the need to send PINGREQ is based on whether more than the keep-alive time has passed since sending the last PINGREQ (which is wrong, but let's continue), and it is done as the first thing in loop(). Now, ping() itself has a _wait_for_msg() cycle very similar to the cycle in loop(); that might be the cause of the behavior you were seeing if you did not pass the timeout all the way to the cycle inside ping(). The cycle in ping() is interrupted only when PINGRESP is received or the keep-alive timeout has passed (which is wrong). Feel free to grab the code from my 2 PRs and see if it makes any difference. I'm not sure it makes sense to check the need to send PINGREQ outside loop(), because the rest of the public functions seem to send some sort of MQTT message to the broker, which should reset the need for PINGREQ.

Also, thinking about this, it seems to me that the timeout specified in loop() should be higher than the socket timeout, because that is the granularity of _sock_exact_recv().
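The ordering constraint can be stated as a small check. This is illustrative only: the parameter names mirror the MQTT constructor's socket_timeout and recv_timeout keywords, but the rule itself is my reading of the comment above, not something the library enforces:

```python
def check_timeouts(socket_timeout, recv_timeout, keep_alive, loop_timeout):
    """Return True when the timeouts are consistently ordered: each socket
    read waits socket_timeout, so every longer wait (recv_timeout,
    keep_alive, the loop() timeout) should exceed that granularity."""
    return (socket_timeout <= recv_timeout <= keep_alive
            and loop_timeout > socket_timeout)
```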

@schaefer01
Author

schaefer01 commented Jan 25, 2024

I have it working/hacked well enough to get by until the next CircuitPython official release.

I found, for the RP2040, that timeouts of less than 2 seconds led to timeouts while completing a nominally good message/handshake; for my application, 2 seconds is good enough.

MQTT is just one step in a long chain of things that have to get done for my project. Too many things to do, too little time.

@vladak
Contributor

vladak commented Jan 26, 2024

I have it working/hacked well enough to get by until the next CircuitPython official release.

I found, for the RP2040, that timeouts of less than 2 seconds led to timeouts while completing a nominally good message/handshake; for my application, 2 seconds is good enough.

Do you explicitly set socket_timeout and recv_timeout when initializing the MQTT object?

MQTT is just one step in a long chain of things that have to get done for my project. Too many things to do, too little time.

I understand. That said, it reminds me of https://xkcd.com/2347/

@schaefer01
Author

I made sure the timeout parameter at the loop() API made it down the function call chain to where it needed to be applied. I may or may not have used the existing variable names; I likely did not, to prevent timeout-dependency side effects similar to what I saw with PING.
Re: xkcd, this is true for the social structure as well, also known as de jure and de facto.
I'm pretty sure Jon Postel's death by heart attack was caused by the addition of lawyers to the running of the Internet. https://en.wikipedia.org/wiki/Jon_Postel

@kevin-tritz

kevin-tritz commented Mar 21, 2024

FYI, in get_monotonic_time(), monotonic_ns()/1000000000 doesn't seem to maintain precision any better than plain monotonic() (at least on the RP2040, which I tested), so short timeout values (10-100 ms) will start to run into errors within hours or a few days of uptime.
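One workaround sketch: keep deadlines in integer nanoseconds and compare only integers, so precision does not degrade as monotonic_ns() grows (hypothetical helper names; the single-precision-float caveat applies to CircuitPython boards, not desktop Python):

```python
import time

def deadline_ns(timeout_s):
    """Compute a deadline in integer nanoseconds. Integer arithmetic keeps
    full precision however large the uptime, unlike converting through
    monotonic_ns() / 1_000_000_000 on boards with single-precision floats."""
    return time.monotonic_ns() + int(timeout_s * 1_000_000_000)

def expired(deadline):
    """True once the monotonic clock has passed the deadline."""
    return time.monotonic_ns() >= deadline
```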

@vladak
Contributor

vladak commented Mar 21, 2024

FYI, in get_monotonic_time(), monotonic_ns()/1000000000 doesn't seem to maintain precision any better than plain monotonic() (at least on the RP2040, which I tested), so short timeout values (10-100 ms) will start to run into errors within hours or a few days of uptime.

Is this because of the limited range that can be represented in the resulting float? Could you create a new issue (this one is already overloaded), please?
