
Small percentage of messages lost (via Mosquitto) - high frequency messages from single publisher #442

Closed
MattBrittan opened this issue Jul 16, 2020 · 7 comments · Fixed by #443


@MattBrittan
Contributor

I have an application that is publishing a QOS1 message every 250ms to a Mosquitto instance via a satellite link. The messages are consumed by another golang app that is local to the broker. This configuration worked fine when tested using a LAN & cell link, but following the move to the satellite link messages are being lost periodically. The distribution seems random; sometimes all messages are received for a couple of minutes and sometimes there are multiple breaks within 10s, with 1-3 messages lost each time this happens. While I have been testing, the link has been stable, with no errors/disconnects reported in the mosquitto log (all logging enabled).
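For reference, the publish loop is essentially the following (a minimal sketch using paho.mqtt.golang; the broker address, topic and payload encoding are placeholders rather than the real application):

```go
package main

import (
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder address
		SetClientID("telemetry-publisher")          // placeholder client ID

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}

	for range time.Tick(250 * time.Millisecond) {
		payload := buildPayload() // stand-in for the protobuf encoder
		// QOS 1: the client persists the message (FileStore in my case) until a PUBACK arrives.
		token := client.Publish("telemetry/example", 1, false, payload)
		token.Wait()
		if token.Error() != nil {
			log.Println("publish error:", token.Error())
		}
	}
}

// buildPayload is a stub; the real application encodes a protobuf message
// that carries the identifier of the previous message.
func buildPayload() []byte { return nil }
```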

The messages are protobuf encoded and include an identifier from the previous message, enabling the receiver to check for missing messages (this is mainly used to reorder messages before passing them on for processing). I have added debugging code to confirm that all messages are being submitted to Publish() on the sending side and confirmed that some are not being received by the receiver. I am using FileStore and the persistence folder is not growing, so either the messages are never being stored OR they are being deleted (which probably means a PUBACK is being received from the broker). At this point I have not been able to confirm whether the issue lies with this package or with Mosquitto (I'm confident that my mosquitto config is OK and all components are on their latest releases).
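The gap check on the receiving side amounts to something like this (a sketch only; the protobuf schema is not shown, so `seq`/`lastSeq` stand in for the identifier carried in each message):

```go
// detectGaps returns the sequence numbers skipped between the previously
// received message (lastSeq) and the current one (seq). A non-empty result
// means at least one message was lost or has arrived out of order.
func detectGaps(lastSeq, seq uint64) []uint64 {
	var missing []uint64
	for s := lastSeq + 1; s < seq; s++ {
		missing = append(missing, s)
	}
	return missing
}
```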

Due to the high frequency of messages, debugging is a little difficult (enabling debugging in this package would lead to huge files), especially given that it's not 100% clear whether the issue is with the sender, mosquitto or the receiver. My first thought was to watch the IDs in the Mosquitto log to confirm delivery, but unfortunately the way this package allocates packet identifiers made this problematic because the IDs are reused rapidly (there are between 3-15 messages in flight at any one time based upon the filestore folder, so the ID was only ever getting up to around 15).

To enable further investigation I have altered this package to allocate packet IDs sequentially (rather than starting from 1 each time), meaning message IDs will not be reused for around 4 hours. The aim was to make it simpler to identify where the issue lies (by checking for missing packet IDs in the mosquitto log). Unfortunately (or maybe fortunately...) making this change has fixed the issue, so it looks like one of the components is not handling packet IDs correctly (I have reviewed the code in this package and did not spot any issues). Given that this fixes the issue (and makes tracing this kind of thing easier) I will submit a pull request (this way of allocating IDs is likely to be a little more efficient in systems with a lot of inflight messages and may avoid bugs in brokers).

@justinclift

making this change has fixed the issue

Kind of sounds like a race condition problem?

@MattBrittan
Contributor Author

Kind of sounds like a race condition problem?

Possibly, but I could not spot one (and have not been able to confirm that the issue lies in this package and not in Mosquitto). In any event I believe that allocating MIDs in the way my pull request does makes more sense (it makes it a lot easier to correlate the logs between the client and the broker). I think it's also worth noting that Mosquitto appears to use a similar approach.

@pmalhaire

I have a quite similar issue.
I am seeing what look like collisions in message IDs. I ended up doing a bold thing and used a CRC of the message content instead of message IDs.

@MattBrittan
Contributor Author

@pmalhaire sounds similar; have you had any luck tracing the cause? I have not looked at this again since implementing PR #443 (which has fixed it for me but was only intended to assist with diagnosis). How are you making use of CRC for this? (I can see how you could use that to trace message delivery but are you then passing the CRC back in a separate message to verify delivery?) Also - which broker are you using?

@pmalhaire

I am using ActiveMQ as the broker.
I have not been able to trace the exact cause yet.
I had the issue while testing connection loss using two methods:

  • hanging the consumer in a never-stopping thread
  • restarting the broker

I used QOS 2 and relied on its duplicate-message handling.
I then switched to QOS 1 and added client-side detection of duplicate messages using the messageID (that was fine for my test case).
MessageIDs sometimes showed what I assumed to be collisions.
I did a simple CRC of the message content instead of using the messageID and everything went fine.
I did not dig into how messageIDs are generated. Maybe a simple random generator would work.

@MattBrittan
Contributor Author

@pmalhaire the current code in master maintains a map of used message IDs. Every time a new message ID is needed it starts at 1 and counts up until it finds an unused message ID. This means that message IDs are reused rapidly, and when the broker is responding quickly it's likely that they will always stay low (say 0-5). My PR #443 changes this so that a record is retained of the last message ID allocated and the search for a free ID starts from that point (meaning that it will be a considerable amount of time before an ID is reused).
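Roughly, the change in allocation strategy looks like this (a simplified sketch of the idea rather than the actual code in the package; names such as `messageIDs` and `getID` are illustrative):

```go
package messageids

import "sync"

const (
	midMin uint16 = 1     // 0 is not a valid MQTT packet identifier
	midMax uint16 = 65535
)

type messageIDs struct {
	mu     sync.Mutex
	inUse  map[uint16]bool
	lastID uint16 // remembered between calls; this is the essence of PR #443
}

func newMessageIDs() *messageIDs {
	return &messageIDs{inUse: make(map[uint16]bool)}
}

// getID returns the next free packet identifier. Starting the search just
// after the last issued ID (rather than always at midMin) means an ID is not
// reused until the whole 1..65535 range has been cycled through.
func (m *messageIDs) getID() uint16 {
	m.mu.Lock()
	defer m.mu.Unlock()
	candidate := m.lastID
	for i := 0; i < int(midMax); i++ {
		candidate++
		if candidate < midMin { // wrapped past 65535
			candidate = midMin
		}
		if !m.inUse[candidate] {
			m.inUse[candidate] = true
			m.lastID = candidate
			return candidate
		}
	}
	return 0 // every ID is in flight
}
```

At one message every 250ms this cycles through the 65535 available IDs in roughly 65535 × 0.25s ≈ 4.5 hours, which is where the "around 4 hours" figure above comes from.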

If your subscriber is receiving duplicate message IDs then these are either coming from the broker (which generates the IDs for anything it sends out) or due to an issue with the way this library is persisting the inbound messages (note that it is possible that the same message will be received twice at QOS1). With QOS1 the message ID is available for reuse immediately after the ACK is received, so it is quite conceivable that you would receive messages with the same ID fairly rapidly (which makes using the message ID for other purposes problematic).

At QOS=2 things are a bit more complicated because the process is:

--> PUBLISH
<-- PUBREC
--> PUBREL
<-- PUBCOMP

I believe there is currently the potential to receive a duplicate PUBLISH at QOS 2. If the broker has not received the PUBREC then it will resend the PUBLISH, and I cannot see any checks in this package for that situation (it should really check whether there is a PUBREC for the packet in the outbound store and ignore the PUBLISH if that is the case). I also suspect that the PUBLISH may stay in the inbound store longer than it should (based on the comment here), meaning that it may be resubmitted to your code after a reconnect.
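The missing check I have in mind is along these lines (a rough sketch only; the state tracking and function names are simplified stand-ins, not the package's actual internals):

```go
// handleInboundPublish sketches the duplicate check for QOS 2: if a PUBREC has
// already been sent for this packet ID, the incoming PUBLISH is a broker
// retransmission, so acknowledge it again but do not deliver it a second time.
func handleInboundPublish(pubrecSent map[uint16]bool, packetID uint16,
	sendPubrec func(uint16), deliver func()) {

	if pubrecSent[packetID] {
		sendPubrec(packetID) // re-acknowledge, but suppress the duplicate delivery
		return
	}
	pubrecSent[packetID] = true
	sendPubrec(packetID)
	deliver()
}
```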

@pmalhaire

@MattBrittan Thank you for this very detailed information.
I'll test your PR and look deeper as soon as I get time to.
