
Client ID not unique for a connection #9409

Closed
ttulka opened this issue Nov 22, 2022 · 8 comments

ttulka commented Nov 22, 2022

What happened?

When a client sets a keep-alive timeout greater than the idle_timeout of zone:mqtt, multiple connections with the same client ID are kept active.

[screenshot: emqx-active-connections, showing several active connections with the same client ID]

What did you expect to happen?

Only one active connection per client ID should exist as specified by MQTT.

How can we reproduce it (as minimally and precisely as possible)?

A client with a keep-alive timeout greater than the broker's idle_timeout connects multiple times.
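
For illustration, one way to exercise this with mosquitto_sub; the host, client ID, topic, keep-alive value, and the broker-side zone name/idle_timeout below are placeholders matching the setup described above, not exact values:

# broker side (etc/emqx.conf), assuming the listener uses a zone named "mqtt":
#   zone.mqtt.idle_timeout = 15s
# client side: connect with a keep-alive much longer than that, then kill the
# client without a clean DISCONNECT and start it again with the same client ID
$ mosquitto_sub -h broker.example.com -i device-42 -k 300 -t 'sensors/#'
# after a few such reconnects, check for duplicated sessions:
$ ./bin/emqx_ctl clients show device-42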

Anything else we need to know?

No response

EMQX version

$ ./bin/emqx_ctl broker
sysdescr  : EMQ X Broker
version   : 4.4.10
uptime    : 20 days, 3 hours, 0 minutes, 41 seconds
datetime  : 2022-11-22 16:10:49

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux prod-mb-3 4.19.0-22-cloud-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Log files
ttulka added the BUG label Nov 22, 2022
lafirest (Member) commented

@ttulka
Thanks for your input, but I can't reproduce this with MQTTX. Maybe I missed something important; could you provide a detailed log?

ttulka (Author) commented Nov 24, 2022

@lafirest Thanks for the reply. I have a debug log, but it contains some semi-sensitive information I don't want to share publicly on GitHub. Can I send you the log via a private channel (email or similar)?

lafirest (Member) commented

Yes, of course. My email is blankalupo@163.com

gbrehmer commented Mar 8, 2023

We still have a lot of duplicates with 4.4.15!
And there is no difference between clients with higher or lower keep-alive settings.

There are duplicates on each node; username and client_id are the same for each session. If we try to kick out the client via the web admin action, the DELETE API call succeeds, but the session stays alive. Those clients probably have a bad internet/WLAN connection, which produces a lot of reconnects, sometimes one reconnect per minute. Could this high reconnect rate for the same client produce such faulty sessions? Those clients also receive old retained messages after each reconnect (messages that were already cleared several hours earlier).
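
The kick was triggered from the dashboard; it corresponds roughly to the v4 management API call sketched below (host, port 8081, and the default admin:public credentials are assumptions and may differ per setup):

$ curl -u admin:public -X DELETE "http://localhost:8081/api/v4/clients/<clientid>"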

@lafirest do you have any suggestions on how we can gather more details about the cause to help you fix the bug? Thanks!

Search result from one node (All clients are using MQTT v3.1.1):


lafirest (Member) commented Mar 9, 2023

@gbrehmer
It seems the old session hung for some reason, which is why the kick action didn't work as expected.
I will seek help from my team; please wait a while.

zmstone (Member) commented Mar 9, 2023

There have been a few similar issues reported on different versions, and we have failed to reproduce them so far.
If possible, I'd like to have an online troubleshooting session with screen sharing.
You can reach me by email, which can be found in my GitHub profile (the same as my GitHub handle, at gmail.com).

zmstone self-assigned this Mar 9, 2023
zmstone (Member) commented Mar 13, 2023

Had an online session with @gbrehmer. With some debugging commands we were able to identify the root cause of this:

The deferred worker pool member is overloaded by retained-message scans/dispatches, which also delays the clean-up of stale connections (the clean-up relies on the same worker pool).

Steps to confirm it's the same cause when similar behaviour is observed:

  1. Identify which node the client with the duplicated ID is connected to (this can be done from the dashboard by choosing the node name in the drop-down)
  2. Attach to the node with emqx eval 'observer_cli:start().' to see if there is any process named emqx_pool_N with a long message queue (see the example below)
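
A sketch of that check; whether the pool workers are locally registered as emqx_pool_1, emqx_pool_2, ... may vary between releases, so treat the second command as an assumption:

$ emqx eval 'observer_cli:start().'
# inside observer_cli, sort processes by message queue length and look for emqx_pool_N;
# alternatively, query one suspected worker directly:
$ emqx eval 'erlang:process_info(whereis(emqx_pool_1), message_queue_len).'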

In such a case, the stale clients visible in the dashboard will eventually disappear; it does not affect anything else.

To fix it, there are two TODOs for the EMQX team:

  1. Do not return "connected" clients when the Pid is already dead
  2. Do not defer the clean-up to the pool worker (or create a new pool for this)

zmstone changed the title from "Client ID not unique for a connection when timeout misconfigured" to "Client ID not unique for a connection" Mar 13, 2023
gbrehmer commented Mar 15, 2023

@zmstone thanks again for the session. During it we identified another problem on our side that produces a lot of uncleared retained messages (2.7M messages = 1.8 GB ETS DB size, expected message count: < 100k). The load has now been reduced by cleaning up these old messages.
In addition, we plan to migrate to 5.x, because the load was also caused by many wildcard subscriptions on those retained messages. With 5.x it is possible to create "indexes" for the topic tree, which should speed up those parts a lot.
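
For reference, a retained message can be cleared per topic by publishing an empty retained payload; a minimal sketch with mosquitto_pub, where the host and topic are only placeholders:

$ mosquitto_pub -h broker.example.com -t 'sensors/device-42/state' -r -n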
