Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mqtt + mqtt5 tests together sometimes "hang" forever #250

Closed
thomas-roos opened this issue Jan 11, 2023 · 5 comments
Closed

mqtt + mqtt5 tests together sometimes "hang" forever #250

thomas-roos opened this issue Jan 11, 2023 · 5 comments

Comments

@thomas-roos
Copy link

Found some strange issue causing the aws-c-mqtt tests to hang. It occurs on randomly on all architectures we (meta-aws) test. Seems this happens a lot with a CodeBuild instance, not that often on my EC2. I only can say that this hang is infinitely (stopped by 8h from CodeBuild).
This script does run the tests: https://github.com/aws4embeddedlinux/meta-aws/blob/999c6b8e077d6e0bc104714e92a437e82187641f/recipes-sdk/aws-c-mqtt/aws-c-mqtt/run-ptest#L288
This test seems to be hanging in this case: mqtt_connection_unsub_timeout
I have just this attached log showing some traces, making me think in the direction of a hanging wait something on a socket?!

version 0.8.4.

aws-c-mqtt_ptest.log

@TingDaoK
Copy link
Contributor

You mean your tests does run the MQTT_TESTS that commented out?

But the log you attached are ending with all the mqtt5 client service tasks,

[mqtt5-client] - id=0x56fef0: scheduled service task for time 171967973840                                                                                                                                                                                 |
[event-loop] - id=0x56c118: activity on fd 17, invoking handler.                                                                                                                                                                                           |
[event-loop] - id=0x56d7f0: detected more scheduled tasks with the next occurring at 0, using timeout of 0.                                                                                                                                                |
[event-loop] - id=0x56d7f0: waiting for a maximum of 0 ms                                                                                                                                                                                                  |
[event-loop] - id=0x56d7f0: wake up with 0 events to process.                                                                                                                                                                                              |
[socket] - id=0xb5500a98 fd=17: is readable                                                                                                                                                                                                                |
[event-loop] - id=0x56d7f0: running scheduled tasks.                                                                                                                                                                                                       |
[socket-handler] - id=0xb5511c40: socket is now readable                                                                                                                                                                                                   |
[task-scheduler] - id=0xb4b00ea8: Running socket_written_task task with <Running> status                                                                                                                                                                   |
[socket-handler] - id=0xb5511c40: invoking read. Downstream window 4294967295, max_to_read 16384                                                                                                                                                           |
[socket-handler] - static: write of size 4096, completed on channel 0xb4b01ac8                                                                                                                                                                             |
[channel] - id=0xb5501140: acquired message 0xb5509620 of capacity 16384 from pool 0xb5501348. Requested size was 16384                                                                                                                                    |
[event-loop] - id=0x56d7f0: cancelling task 0x56ff08                                                                                                                                                                                                       |
[task-scheduler] - id=0x56ff08: Running Mqtt5Service task with <Canceled> status                                                                                                                                                                           |
[mqtt5-client] - id=0x56fef0: cancelling previously scheduled service task                                                                                                                                                                                 |
[event-loop] - id=0x56d7f0: scheduling task 0x56ff08 in-thread for timestamp 144325383731                                                                                                                                                                  |
[socket] - id=0xb5500a98 fd=17: read of 4096                                                                                                                                                                                                               |
[task-scheduler] - id=0x56ff08: Scheduling Mqtt5Service task for future execution at time 144325383731                                                                                                                                                     |
[socket-handler] - id=0xb5511c40: read 4096 from socket                                                                                                                                                                                                    |
[mqtt5-client] - id=0x56fef0: scheduled service task for time 144325383731                                                                                                                                                                                 |
[channel] - id=0xb5501140: sending read message of size 4096, from slot 0xb5505598 to slot 0xb5505570 with handler 0xb5511a78.                                                                                                                             |
[event-loop] - id=0x56d7f0: detected more scheduled tasks with the next occurring at 7060819, using timeout of 7.                                                                                                                                          |
[channel] - id=0xb5501140: acquired message 0xb5509620 of capacity 12288 from pool 0xb5501348. Requested size was 12288                                                                                                                                    |
[event-loop] - id=0x56d7f0: waiting for a maximum of 7 ms                                                                                                                                                                                                  |
[socket] - id=0xb5500a98 fd=17: read of -1                                                                                                                                                                                                                 |
[socket] - id=0xb5500a98 fd=17: read would block                                                                                                                                                                                                           |
[event-loop] - id=0x56d7f0: wake up with 1 events to process.                                                                                                                                                                                              |
[socket-handler] - id=0xb55055a8: total read on this tick 4096                                                                                                                                                                                             |
[event-loop] - id=0x56d7f0: activity on fd 16, invoking handler.                                                                                                                                                                                           |
[socket-handler] - id=0xb5511c40: out of data to read on socket. Waiting on event-loop notification.      

This test seems to be hanging in this case: mqtt_connection_unsub_timeout

Why do you think it hangs from mqtt_connection_unsub_timeout?

Can you provide more details?

@thomas-roos thomas-roos changed the title mqtt_connection_unsub_timeout sometimes "hangs" forever mqtt + mqtt5 tests together sometimes "hang" forever Jan 18, 2023
@thomas-roos
Copy link
Author

Thank you to look into this - retried and found when disabling those tests it does not hang anymore - so I changed the title a bit:
https://github.com/aws4embeddedlinux/meta-aws/blob/812452dcaa4b5016ad66e117bc3775a3ab692659/recipes-sdk/aws-c-mqtt/aws-c-mqtt/run-ptest#L261-L287

@thomas-roos
Copy link
Author

mqtt_connection_unsub_timeout was my guess because it was the last I saw in the logs, sadly I do not have them anymore - the 10000 lines limit exporting logs...
But what I'm basically do in this test is running your unit tests and if they are supposed to run all after one another, it seems that there is either a bug in the test mock code or a bug in aws-c-mqtt itself.

@thomas-roos
Copy link
Author

thomas-roos commented Jan 18, 2023

did spend some time to reproduce the bug - and I have something.
What I did is to run the tests in a while true loop and check if it hangs - after an hour it hangs on my EC2.
Should be easy to reproduce with the script (all tests enabled), and this command:

while true; do ./run-ptest > log.log  2>&1; done &
tail -f -s 10 log.log 

Here is the logfile.
mqtt-hang.log

And top shows that it hangs on this test: mqtt_connection_not_resend_packets_on_healthy_connection

@bretambrose
Copy link
Contributor

The underlying insta-reconnect is still possible and fixing it with a refactoring of the 311 client is still on our roadmap. But in the meantime, we discovered an error in the 311 client's reconnect timing that was causing instant reconnects instead of minimum delay reconnects. In fixing that error, the CI issues look to have disappeared. We will make the long-term fix/refactor as soon as we can, but for now, I believe the direct issue is addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants