Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MQTT pubsub component can deadlock on disconnect. #1053

Closed
CodeMonkeyLeet opened this issue Aug 3, 2021 · 5 comments · Fixed by #1192
Closed

MQTT pubsub component can deadlock on disconnect. #1053

CodeMonkeyLeet opened this issue Aug 3, 2021 · 5 comments · Fixed by #1192
Labels
kind/bug Something isn't working

Comments

@CodeMonkeyLeet
Copy link
Contributor

CodeMonkeyLeet commented Aug 3, 2021

Expected Behavior

TestPubsubConformance/pubsub/mqtt-vernemq should pass consistently in automation.

Actual Behavior

The test fails by hitting the 15 min timeout occasionally as seen in:
https://github.com/dapr/components-contrib/runs/3235599504?check_suite_focus=true
https://github.com/dapr/components-contrib/runs/3229595617?check_suite_focus=true
https://github.com/dapr/components-contrib/runs/3226893816?check_suite_focus=true

This seems to be due to not all of the messages expected by the test being delivered at least once.

Steps to Reproduce the Problem

Run the TestPubsubConformance/pubsub/mqtt-vernemq as seen in automation.

Running tests for TestPubsubConformance/pubsub/mqtt-vernemq ... 
=== RUN   TestPubsubConformance
=== RUN   TestPubsubConformance/mqtt-vernemq
=== RUN   TestPubsubConformance/mqtt-vernemq/init
time="2021-08-03T03:04:23.560339616Z" level=debug msg="mqtt message bus initialization complete" instance=fv-az180-356 scope=testLogger type=log ver=unknown
=== RUN   TestPubsubConformance/mqtt-vernemq/subscribe
=== RUN   TestPubsubConformance/mqtt-vernemq/publish
time="2021-08-03T03:04:28.563000768Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 49]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.563848828Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 50]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.565000409Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 51]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.565924474Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 52]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.568451552Z" level=debug msg="Processing MQTT message testTopic/1" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.568509256Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 53]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.568835879Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 54]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.56955543Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 55]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.570036464Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 56]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.570333185Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 57]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:28.57112054Z" level=debug msg="mqtt publishing topic testTopic with data: [109 101 115 115 97 103 101 45 53 51 98 51 54 56 53 99 45 97 98 56 97 45 52 48 99 49 45 98 51 54 102 45 102 48 54 49 100 52 51 49 57 48 98 99 45 49 48]" instance=fv-az180-356 scope=testLogger type=log ver=unknown
=== RUN   TestPubsubConformance/mqtt-vernemq/verify_read
    pubsub.go:194: waiting for 1m0s to complete read
time="2021-08-03T03:04:29.568771448Z" level=error msg="Error processing MQTT message: testTopic/1. Retrying..." instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:34.569113514Z" level=debug msg="Processing MQTT message testTopic/1" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:40.56966179Z" level=debug msg="Processing MQTT message testTopic/1" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:40.569780199Z" level=info msg="Successfully processed MQTT message after it previously failed: testTopic/1" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:40.569810201Z" level=debug msg="Processing MQTT message testTopic/2" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:40.569886306Z" level=debug msg="Processing MQTT message testTopic/3" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:41.570162938Z" level=error msg="Error processing MQTT message: testTopic/3. Retrying..." instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:46.57049212Z" level=debug msg="Processing MQTT message testTopic/3" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:46.570614929Z" level=info msg="Successfully processed MQTT message after it previously failed: testTopic/3" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:46.570653231Z" level=debug msg="Processing MQTT message testTopic/4" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:46.570695834Z" level=debug msg="Processing MQTT message testTopic/5" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:46.571103163Z" level=debug msg="Processing MQTT message testTopic/6" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:46.571323179Z" level=debug msg="Processing MQTT message testTopic/7" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:47.57165318Z" level=error msg="Error processing MQTT message: testTopic/7. Retrying..." instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572007436Z" level=debug msg="Processing MQTT message testTopic/7" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572125445Z" level=info msg="Successfully processed MQTT message after it previously failed: testTopic/7" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572151647Z" level=debug msg="Processing MQTT message testTopic/8" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.57220465Z" level=debug msg="Processing MQTT message testTopic/9" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572317258Z" level=debug msg="Processing MQTT message testTopic/10" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572355561Z" level=debug msg="Processing MQTT message testTopic/7" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572674983Z" level=debug msg="Processing MQTT message testTopic/8" instance=fv-az180-356 scope=testLogger type=log ver=unknown
time="2021-08-03T03:04:52.572781091Z" level=debug msg="Processing MQTT message testTopic/9" instance=fv-az180-356 scope=testLogger type=log ver=unknown
panic: test timed out after 15m0s

Release Note

RELEASE NOTE: UPDATE paho.mqtt.golang to v1.3.5 to resolve deadlocks on disconnect.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 2, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale label Sep 2, 2021
@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 9, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@CodeMonkeyLeet
Copy link
Contributor Author

CodeMonkeyLeet commented Sep 9, 2021

Bug remains active. More generally, all the MQTT bindings conformance tests are also flakey, with the Vernemq profile being the most common failure, followed by EMQX then Mosquitto.

@dapr-bot dapr-bot removed the stale label Sep 9, 2021
@dapr-bot
Copy link
Collaborator

dapr-bot commented Oct 9, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale label Oct 9, 2021
@CodeMonkeyLeet
Copy link
Contributor Author

This looks like a robustness issue in the underlying paho.mqtt.golang library:

  • The test has a default 1 minute timeout for the Read operation that is not getting triggered. What kills the test is the test framework itself detecting the deadlocked process after 15 min, hence the long timeout.
  • The hang happens in the mqttMsg.Ack() call itself, outside of any locking maintained by dapr or the MQTT pubsub component.
  • This corroborates with the MQTT conformance test bindings failures as well, where most of them are also hitting the 15 minute time out.

After some further debugging, it closely resembles the known issue eclipse/paho.mqtt.golang#501, which has a fix in the v1.3.5 release (Dapr is currently on v1.3.2 which has other known hang issues apparently). Currently running the upgrade on repeated test loop to see if it addresses the problem.

@dapr-bot dapr-bot removed the stale label Oct 11, 2021
CodeMonkeyLeet pushed a commit to CodeMonkeyLeet/components-contrib that referenced this issue Oct 11, 2021
There are a couple of fixes for deadlock issues in paho.mqtt.golang in
versions 1.3.4 and 1.3.5, which are good robustness upgrades for the
Dapr MQTT components in addition to reduce the incidence of deadlock
failures in the conformance tests for MQTT pubsub and bindings.

Resolves dapr#1053
@CodeMonkeyLeet CodeMonkeyLeet changed the title TestPubsubConformance/pubsub/mqtt-vernemq is unreliable MQTT pubsub component can deadlock on disconnect. Oct 12, 2021
artursouza pushed a commit that referenced this issue Oct 12, 2021
There are a couple of fixes for deadlock issues in paho.mqtt.golang in
versions 1.3.4 and 1.3.5, which are good robustness upgrades for the
Dapr MQTT components in addition to reduce the incidence of deadlock
failures in the conformance tests for MQTT pubsub and bindings.

Resolves #1053
amimimor pushed a commit to amimimor/components-contrib that referenced this issue Dec 9, 2021
There are a couple of fixes for deadlock issues in paho.mqtt.golang in
versions 1.3.4 and 1.3.5, which are good robustness upgrades for the
Dapr MQTT components in addition to reduce the incidence of deadlock
failures in the conformance tests for MQTT pubsub and bindings.

Resolves dapr#1053
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants