Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: zenoh_session_multicast test #1004

Merged

Conversation

YuanYuYuan
Copy link
Contributor

Closes #1002.

@YuanYuYuan YuanYuYuan marked this pull request as draft May 2, 2024 09:12
@YuanYuYuan
Copy link
Contributor Author

In the above repeated tests, an unexpectedly slow run is caught. https://github.com/eclipse-zenoh/zenoh/actions/runs/8921424650/job/24501502386?pr=1004#step:7:113

@YuanYuYuan YuanYuYuan force-pushed the fix/zenoh-session-multicast-test branch 2 times, most recently from 3dff7a4 to 1b152a7 Compare May 2, 2024 14:31
@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented May 2, 2024

As the timeout of nextest is disabled, we found that the ztimeout! within the test appears to be ineffective. We might fail into the issue of tokio::timeout(fut).await never yields whenever the future fut is a blocking call per se.

@YuanYuYuan YuanYuYuan force-pushed the fix/zenoh-session-multicast-test branch from 091f568 to 45e07fd Compare May 2, 2024 16:49
@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented May 6, 2024

Two issues have been identified:

  1. Passed with panicking error.
2024-05-06T04:23:29.255530Z TRACE                    rx-0 ThreadId(10) zenoh_link_udp::multicast: Read error on UDP link 192.168.68.114:55606 => 224.0.0.1:17448: A Tokio 1.x context was found, but it is being shutdown. at io/zenoh-links/zenoh-link-udp/src/multicast.rs:97.
2024-05-06T04:23:29.255557Z DEBUG                    rx-0 ThreadId(10) zenoh_transport::multicast::link: Read error on UDP link 192.168.68.114:55606 => 224.0.0.1:17448: A Tokio 1.x context was found, but it is being shutdown. at io/zenoh-links/zenoh-link-udp/src/multicast.rs:97.

And then the error of "thread rx-0 panicked at ... The hashmap should contain net after initialization..." may occur. It's due to spawning another termination task after the global ZRuntime has been clear.

https://github.com/ZettaScaleLabs/zenoh/blob/936bde67d95d102d6d582e884a1f931956141bd3/io/zenoh-transport/src/multicast/link.rs?plain=1#L384

Log file

  1. Failed with timeout error.

zenoh::session::close -> zenoh::net::routing::dispatcher::tables::close_face deadlock found.

https://github.com/ZettaScaleLabs/zenoh/blob/936bde67d95d102d6d582e884a1f931956141bd3/zenoh/src/net/routing/dispatcher/tables.rs?plain=1#L177

(zlock! is the blocking one.)

Log file

To reproduce:

export RUST_LOG="zenoh=trace"
cargo nextest run -p zenoh --test session -E "test(=zenoh_session_multicast)" -F unstable --no-capture

@YuanYuYuan
Copy link
Contributor Author

Regarding the first issue, the removal of atexit callback is proposed after the discussion. The reasons are listed below,

  1. Previously with async-std, we don't have such kind of cleanup on the global runtime.
  2. Since Rust doesn't drop the static variable, we need to add this cleanup callback to pass the memory leak check with Valgrind.
    However, we can workaround it by the previous solution "a runtime guard" as well.
    The downside of the previous solution is users may find the memory leak reported by Valgrind in our examples or the bindings that no runtime guard is used by default.
  3. We found that the unsafe atexit is unsound. For example, we got the following error if we tried to log with tracing while dropping ZRuntimePool.
thread '<unnamed>' panicked at 'use of std::thread::current() is not possible after the thread's local data has been destroyed', library/std/src/thread/mod.rs:711:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Moreover, we still have an unresolved issue with Windows DLL. See #973 for details.

Once we remove the atexit cleanup, the first issue should disappear including the panics found on rmw_zenoh.

@YuanYuYuan
Copy link
Contributor Author

The first issue has been addressed via #1015

@YuanYuYuan YuanYuYuan marked this pull request as ready for review May 6, 2024 15:38
@YuanYuYuan YuanYuYuan force-pushed the fix/zenoh-session-multicast-test branch from 28b79e7 to 1072be1 Compare May 6, 2024 15:38
@Mallets Mallets merged commit 7e5d5e8 into eclipse-zenoh:main May 6, 2024
11 checks passed
@Mallets Mallets deleted the fix/zenoh-session-multicast-test branch May 6, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sporadic test failure of zenoh_session_multicast
3 participants