
In a weak network environment, the MQTT Broker is unable to detect disconnections #1883

Closed
gmiloveyou opened this issue Nov 17, 2023 · 8 comments · Fixed by #1891
Labels: bug (Something isn't working)

Comments

gmiloveyou commented Nov 17, 2023

Describe the bug

1. The broker is unable to detect client disconnections in a weak network environment.

Which component is your bug related to?

- Server

To Reproduce

Steps to reproduce the behavior:

  1. Use MQTTnet version v4.3.0.858.
  2. Set up the server with a logger attached and publish messages (see the sketch after this list; screenshot attached).
  3. Run the server.
  4. Simulate a weak network environment with "clumsy" (screenshot attached).
  5. Connect a client to the broker.
  6. After approximately 30 minutes to an hour, the server does not fire a disconnected event even when the client's network cable has been unplugged.
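
For reference, a minimal sketch of the kind of server setup described in step 2, assuming the MQTTnet v4 API surface (MqttFactory, MqttServerOptionsBuilder, ClientDisconnectedAsync); the logger and publish code from the screenshot are not reproduced here:

```csharp
using System;
using System.Threading.Tasks;
using MQTTnet;
using MQTTnet.Server;

class Program
{
    static async Task Main()
    {
        // Default TCP endpoint on port 1883.
        var serverOptions = new MqttServerOptionsBuilder()
            .WithDefaultEndpoint()
            .Build();

        var server = new MqttFactory().CreateMqttServer(serverOptions);

        // This is the event that never fires in the weak-network scenario.
        server.ClientDisconnectedAsync += e =>
        {
            Console.WriteLine($"Client disconnected: {e.ClientId} ({e.DisconnectType})");
            return Task.CompletedTask;
        };

        await server.StartAsync();
        Console.WriteLine("Broker running, press Enter to exit.");
        Console.ReadLine();
        await server.StopAsync();
    }
}
```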

Expected behavior

The client-disconnected event should be fired.

Additional context / logging

We added some extra log points; the messages stop here, in MQTTnet\Server\Internal\MqttClient.cs (screenshot attached).


--
2023/11/22 Update
With additional logging points, we finally found that the code gets stuck in
MQTTnet-master\Source\MQTTnet\Implementations\MqttTcpChannel.cs
-> public async Task ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
at this call (screenshot attached):

await stream.ReadAsync(buffer.AsMemory(offset, count), cancellationToken).ConfigureAwait(false);

Our guess is that the CancellationToken carries no timeout setting, so the read waits for TCP packets forever.
We will apply the modification described in the link below:
https://stackoverflow.com/questions/12421989/networkstream-readasync-with-a-cancellation-token-never-cancels
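
A minimal sketch of the kind of workaround discussed in that link, assuming we only want to bound a single stream read with a timeout (ReadWithTimeoutAsync is our own helper name, not an MQTTnet or BCL API). Because NetworkStream.ReadAsync may ignore the token once a read is pending, the timeout also closes the stream, which faults the hanging read:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class StreamReadExtensions
{
    // Bounds a stream read with a timeout. If the timeout elapses, the stream is
    // closed, which forces the pending ReadAsync to complete with an exception
    // instead of waiting for TCP data forever.
    public static async Task<int> ReadWithTimeoutAsync(
        this Stream stream, byte[] buffer, int offset, int count,
        TimeSpan timeout, CancellationToken cancellationToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(timeout);

        using (cts.Token.Register(() => stream.Close()))
        {
            try
            {
                return await stream.ReadAsync(buffer.AsMemory(offset, count), cts.Token)
                    .ConfigureAwait(false);
            }
            catch (Exception) when (cts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
            {
                throw new TimeoutException($"Read did not complete within {timeout}.");
            }
        }
    }
}
```

Applied inside MqttTcpChannel.ReadAsync, the timeout would presumably have to be derived from the client's keep-alive interval rather than hard-coded.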

We have already started a new experiment to verify it. Hopefully this will work.

@gmiloveyou added the bug label on Nov 17, 2023
@gmiloveyou changed the title from "In a weak network environment, the MQTT Broker is unable to detect disconnections and enter an unknown infinite loop of Ping" to "In a weak network environment, the MQTT Broker is unable to detect disconnections" on Nov 17, 2023

gmiloveyou commented Nov 17, 2023

Updated the description: the infinite Ping loop is from another team's client, but the disconnected event is still not raised.


gmiloveyou commented Nov 20, 2023

Update: with more logging points,

In MQTTnet.Adapter > public sealed class MqttChannelAdapter > private async Task ReceiveAsync, we discovered that the code never reaches the "finally" part and stays stuck in the "do while" loop, so the IsReadingPacket flag is always true (screenshot attached).

In that case, in MQTTnet.Server > public sealed class MqttServerKeepAliveMonitor > TryProcessClient, the method always returns early at the IsReadingPacket check, which means the broker never raises the disconnected event (screenshot attached).
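
To make the interaction concrete, here is a simplified sketch (not the actual MQTTnet source; the types and names are approximations based on what the logging showed) of why a hanging socket read blocks the keep-alive check:

```csharp
using System;

// Approximation of the channel adapter state observed in the logs.
class ChannelAdapter
{
    // Set when ReceiveAsync starts reading a packet; it is only reset in the
    // "finally" block that the stuck socket read never reaches.
    public bool IsReadingPacket { get; set; }
}

class ConnectedClient
{
    public ChannelAdapter Adapter { get; } = new();
    public DateTime LastPacketReceived { get; set; } = DateTime.UtcNow;
    public TimeSpan KeepAlive { get; set; } = TimeSpan.FromSeconds(60);
    public void Disconnect() => Console.WriteLine("disconnected");
}

class KeepAliveMonitor
{
    // Called periodically for every connected client.
    public void TryProcessClient(ConnectedClient client, DateTime now)
    {
        // As long as the adapter claims a packet is still being read, the
        // keep-alive timeout below is never evaluated. With a read that hangs
        // forever, this early return is taken on every tick, so the broker
        // never raises the disconnected event.
        if (client.Adapter.IsReadingPacket)
        {
            return;
        }

        if (now - client.LastPacketReceived > client.KeepAlive * 1.5)
        {
            client.Disconnect(); // would raise the client-disconnected event
        }
    }
}
```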

But we still do not know:

  1. exactly what causes this situation;
  2. how to fix it.

We will keep trying to figure it out. Any help would be greatly appreciated.

chkr1011 (Collaborator) commented:

We may need to add an additional check here. Right now the "Receive" call at socket level does not get cancelled because there is no timeout check. We may need to insert an additional timeout so that receiving a message is no longer allowed to take more time than the keep alive interval. But this would require users to use higher keep alive values if they want to transfer larger payloads over slower connections.

I'll create a branch for this ticket, then we can test.

chkr1011 (Collaborator) commented:

I added a new setting in the server options to alter the handling of keep alive.
The problem is that users already wanted the existing behavior and now we need to support the other way as well, so in my opinion there will be no single solution to this. @gmiloveyou Please let me know if setting the new property to true fixes the issue for you.

logicaloud (Contributor) commented:

The MQTT 5 spec says:

The Keep Alive is a Two Byte Integer which is a time interval measured in seconds. It is the maximum time interval that is permitted to elapse between the point at which the Client finishes transmitting one MQTT Control Packet and the point it starts sending the next. It is the responsibility of the Client to ensure that the interval between MQTT Control Packets being sent does not exceed the Keep Alive value. If Keep Alive is non-zero and in the absence of sending any other MQTT Control Packets, the Client MUST send a PINGREQ packet [MQTT-3.1.2-20].

If the Keep Alive value is non-zero and the Server does not receive an MQTT Control Packet from the Client within one and a half times the Keep Alive time period, it MUST close the Network Connection to the Client as if the network had failed [MQTT-3.1.2-22].

And the disconnect reason may be:

The Connection is closed because no packet has been received for 1.5 times the Keepalive time.

My interpretation is that clients must begin sending another packet within the keep-alive interval, with the intent that the packet is completely received on the server side no later than 1.5 times the keep-alive timeout (for example, with a 60-second Keep Alive the server must see a complete control packet within 90 seconds).

That would mean that being in the process of receiving a packet is not enough to keep the client alive. The current more tolerant behavior for large packets probably makes life easier for clients, but such clients would also have the option to request a longer keep alive timeout.

So to me, it looks like enabling the new KeepAliveOptions.DisconnectClientWhenReadingPayload option makes the server follow the specification more closely.
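
For illustration, a hypothetical usage sketch; only the DisconnectClientWhenReadingPayload name comes from this thread, and where the option actually lives on the server options is an assumption that may differ from the branch:

```csharp
using MQTTnet;
using MQTTnet.Server;

var serverOptions = new MqttServerOptionsBuilder()
    .WithDefaultEndpoint()
    .Build();

// Hypothetical placement: disconnect a client whose keep-alive expires even
// while a packet payload is still being read, per [MQTT-3.1.2-22].
serverOptions.KeepAliveOptions.DisconnectClientWhenReadingPayload = true;

using var server = new MqttFactory().CreateMqttServer(serverOptions);
await server.StartAsync();
```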

Maybe that could be expressed in the naming of the option somehow, but I'm not sure what that would be.


gmiloveyou commented Nov 27, 2023

@chkr1011 Sorry for my delayed response.

Before today, to verify that my idea was workable (just adding an extra timeout; screenshot of the patch attached), I built a test environment and kept it running for several days.

After a week it really works and has saved my team temporarily, although it is not a complete solution.

--
I will now use the version from your branch for another experiment and will post the results here afterward.

gmiloveyou (Author) commented:

@chkr1011 After a five-day long-running experiment, setting the new property to true fixes the issue.
Thanks for the help.


chkr1011 commented Dec 5, 2023

@gmiloveyou Thanks for testing this. I would stick with @logicaloud's opinion and remove the code entirely so that it matches the MQTT spec.
