
Client ID not unique for a connection #9409

Closed
ttulka opened this issue Nov 22, 2022 · 8 comments

ttulka commented Nov 22, 2022

What happened?

When a client sets a keep-alive timeout greater than the idle_timeout of zone:mqtt, multiple connections with the same client ID are kept active.

[screenshot: emqx-active-connections, showing several active connections with the same client ID]

What did you expect to happen?

Only one active connection per client ID should exist as specified by MQTT.

How can we reproduce it (as minimally and precisely as possible)?

A client with a keep-alive timeout greater than the broker's idle_timeout connects multiple times.
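
For illustration, one way to exercise this with mosquitto_sub; the host, client ID, topic, keep-alive value, and the broker-side zone name/idle_timeout below are placeholders matching the setup described above, not exact values:

# broker side (etc/emqx.conf), assuming the listener uses a zone named "mqtt":
#   zone.mqtt.idle_timeout = 15s
# client side: connect with a keep-alive much longer than that, then kill the
# client without a clean DISCONNECT and start it again with the same client ID
$ mosquitto_sub -h broker.example.com -i device-42 -k 300 -t 'sensors/#'
# after a few such reconnects, check for duplicated sessions:
$ ./bin/emqx_ctl clients show device-42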

Anything else we need to know?

No response

EMQX version

$ ./bin/emqx_ctl broker
sysdescr  : EMQ X Broker
version   : 4.4.10
uptime    : 20 days, 3 hours, 0 minutes, 41 seconds
datetime  : 2022-11-22 16:10:49

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux prod-mb-3 4.19.0-22-cloud-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Log files
ttulka added the BUG label Nov 22, 2022
lafirest (Member) commented

@ttulka
Thanks for your input, but I can't reproduce this with MQTTX. Maybe I missed something important; could you provide a detailed log?

ttulka (Author) commented Nov 24, 2022

@lafirest Thanks for the reply. I have a debug log, but it contains some semi-sensitive information I don't want to share publicly on GitHub. Can I send you the log via a private channel (email or similar)?

lafirest (Member) commented

Yes, of course. My email is blankalupo@163.com

gbrehmer commented Mar 8, 2023

We still have a lot of duplicates with 4.4.15!
And there is no difference between clients with higher or lower keep-alive settings.

There are duplicates on each node; username and client_id are the same for each session. If we try to kick out the client via the web admin action, the DELETE API call succeeds, but the session stays alive. Those clients probably have a bad internet/WLAN connection, which produces a lot of reconnects, sometimes one reconnect per minute. Could this high reconnect rate for the same client produce such faulty sessions? Those clients also receive old retained messages after each reconnect (messages that were already cleared several hours earlier).
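
The kick was triggered from the dashboard; it corresponds roughly to the v4 management API call sketched below (host, port 8081, and the default admin:public credentials are assumptions and may differ per setup):

$ curl -u admin:public -X DELETE "http://localhost:8081/api/v4/clients/<clientid>"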

@lafirest do you have any suggestions on how we can gather more details about the cause to help you fix the bug? Thanks!

Search result from one node (All clients are using MQTT v3.1.1):


lafirest (Member) commented Mar 9, 2023

@gbrehmer
It seems the old session hung for some reason, which is why the kick action didn't work as expected.
I will seek help from my team; please wait a while.

zmstone (Member) commented Mar 9, 2023

There have been a few similar issues reported on different versions, and we have failed to reproduce them so far.
If possible, I'd like to have an online troubleshooting session with screen sharing.
You can reach me by email, which can be found in my GitHub profile (the same as my GitHub handle, at gmail.com).

zmstone self-assigned this Mar 9, 2023
zmstone (Member) commented Mar 13, 2023

Had an online session with @gbrehmer. With some debugging commands we were able to identify the root cause of this:

The deferred worker pool member is overloaded by retained-message scans/dispatches, which also delays the clean-up of stale connections (the clean-up relies on the same worker pool).

Steps to confirm it's the same cause when similar behaviour is observed:

  1. Identify which node the client with the duplicated ID is connected to (this can be done from the dashboard by choosing the node name in the drop-down)
  2. Attach to the node with emqx eval 'observer_cli:start().' to see if there is any process named emqx_pool_N with a long message queue (see the example below)
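
A sketch of that check; whether the pool workers are locally registered as emqx_pool_1, emqx_pool_2, ... may vary between releases, so treat the second command as an assumption:

$ emqx eval 'observer_cli:start().'
# inside observer_cli, sort processes by message queue length and look for emqx_pool_N;
# alternatively, query one suspected worker directly:
$ emqx eval 'erlang:process_info(whereis(emqx_pool_1), message_queue_len).'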

In such a case, the stale clients visible in the dashboard will eventually disappear; it does not affect anything else.

To fix it, there are two TODOs for the EMQX team:

  1. Do not return "connected" clients when the Pid is already dead
  2. Do not defer the clean-up to the pool worker (or create a new pool for this)

zmstone changed the title from "Client ID not unique for a connection when timeout misconfigured" to "Client ID not unique for a connection" Mar 13, 2023
gbrehmer commented Mar 15, 2023

@zmstone thanks again for the session. During it we identified another problem on our side that produces a lot of uncleared retained messages (2.7M messages = 1.8 GB ETS DB size, expected message count: < 100k). The load has now been reduced by cleaning up these old messages.
In addition, we plan to migrate to 5.x, because the load was also caused by many wildcard subscriptions on those retained messages. With 5.x it is possible to create "indexes" for the topic tree, which should speed up those parts a lot.
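
For reference, a retained message can be cleared per topic by publishing an empty retained payload; a minimal sketch with mosquitto_pub, where the host and topic are only placeholders:

$ mosquitto_pub -h broker.example.com -t 'sensors/device-42/state' -r -n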
