
Fix tracing spans #99

Merged: 5 commits merged into awslabs:main on May 3, 2023
Conversation

@sarsko (Contributor) commented Mar 30, 2023

I guess this is easiest described by showing the output. Here are two lines of output with RUST_LOG=trace turned on, before this change:

2023-03-30T01:30:25.407978Z TRACE execution{i=0}:spawn_some_futures_and_set_tag{num_threads=0}:spawn_some_threads_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=4}:spawn_some_threads_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=6}:spawn_some_threads_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=7}:spawn_some_futures_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=10}:spawn_some_threads_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=12}:spawn_some_threads_and_set_tag{num_threads=9}:spawn_some_threads_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=13}:spawn_some_threads_and_set_tag{num_threads=4}:spawn_some_threads_and_set_tag{num_threads=14}:spawn_some_threads_and_set_tag{num_threads=16}:spawn_some_threads_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=18}:spawn_some_threads_and_set_tag{num_threads=17}:spawn_some_threads_and_set_tag{num_threads=15}:spawn_some_threads_and_set_tag{num_threads=19}:spawn_some_futures_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=4}:spawn_some_threads_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=6}:spawn_some_futures_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=7}:spawn_some_threads_and_set_tag{num_threads=9}:spawn_some_futures_and_set_tag{num_threads=4}:spawn_some_futures_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=10}:spawn_some_threads_and_set_tag{num_threads=7}:spawn_some_threads_and_set_tag{num_threads=6}:spawn_some_threads_and_set_tag{num_threads=16}:spawn_some_threads_and_set_tag{num_threads=14}:spawn_some_threads_and_set_tag{num_threads=9}:spawn_some_threads_and_set_tag{num_threads=15}:spawn_some_threads_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=12}:spawn_some_futures_and_set_tag{num_threads=7}:spawn_some_futures_and_set_tag{num_threads=6}:spawn_some_threads_and_set_tag{num_threads=10}:spawn_some_threads_and_set_tag{num_threads=13}:spawn_some_threads_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=17}:spawn_some_threads_and_set_tag{num_threads=18}:spawn_some_futures_and_set_tag{num_threads=9}:spawn_some_threads_and_set_tag{num_threads=19}:spawn_some_threads_and_set_tag{num_threads=16}:spawn_some_threads_and_set_tag{num_threads=14}:spawn_some_futures_and_set_tag{num_threads=10}:spawn_some_futures_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=15}:spawn_some_futures_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=12}:spawn_some_threads_and_set_tag{num_threads=13}: shuttle::thread: done dropping thread locals
2023-03-30T01:30:25.410818Z TRACE execution{i=0}:spawn_some_futures_and_set_tag{num_threads=0}:spawn_some_threads_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=4}:spawn_some_threads_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=6}:spawn_some_threads_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=7}:spawn_some_futures_and_set_tag{num_threads=1}:spawn_some_threads_and_set_tag{num_threads=10}:spawn_some_threads_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=12}:spawn_some_threads_and_set_tag{num_threads=9}:spawn_some_threads_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=13}:spawn_some_threads_and_set_tag{num_threads=4}:spawn_some_threads_and_set_tag{num_threads=14}:spawn_some_threads_and_set_tag{num_threads=16}:spawn_some_threads_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=18}:spawn_some_threads_and_set_tag{num_threads=17}:spawn_some_threads_and_set_tag{num_threads=15}:spawn_some_threads_and_set_tag{num_threads=19}:spawn_some_futures_and_set_tag{num_threads=2}:spawn_some_threads_and_set_tag{num_threads=4}:spawn_some_threads_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=6}:spawn_some_futures_and_set_tag{num_threads=3}:spawn_some_threads_and_set_tag{num_threads=7}:spawn_some_threads_and_set_tag{num_threads=9}:spawn_some_futures_and_set_tag{num_threads=4}:spawn_some_futures_and_set_tag{num_threads=5}:spawn_some_threads_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=10}:spawn_some_threads_and_set_tag{num_threads=7}:spawn_some_threads_and_set_tag{num_threads=6}:spawn_some_threads_and_set_tag{num_threads=16}:spawn_some_threads_and_set_tag{num_threads=14}:spawn_some_threads_and_set_tag{num_threads=9}:spawn_some_threads_and_set_tag{num_threads=15}:spawn_some_threads_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=12}:spawn_some_futures_and_set_tag{num_threads=7}:spawn_some_futures_and_set_tag{num_threads=6}:spawn_some_threads_and_set_tag{num_threads=10}:spawn_some_threads_and_set_tag{num_threads=13}:spawn_some_threads_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=17}:spawn_some_threads_and_set_tag{num_threads=18}:spawn_some_futures_and_set_tag{num_threads=9}:spawn_some_threads_and_set_tag{num_threads=19}:spawn_some_threads_and_set_tag{num_threads=16}:spawn_some_threads_and_set_tag{num_threads=14}:spawn_some_futures_and_set_tag{num_threads=10}:spawn_some_futures_and_set_tag{num_threads=11}:spawn_some_threads_and_set_tag{num_threads=15}:spawn_some_futures_and_set_tag{num_threads=8}:spawn_some_threads_and_set_tag{num_threads=12}:spawn_some_threads_and_set_tag{num_threads=13}: shuttle::runtime::execution: runnable=[TaskId(0), TaskId(48), TaskId(56), TaskId(67), TaskId(73), TaskId(91), TaskId(110), TaskId(112), TaskId(113), TaskId(600)] next_task=Some(TaskId(113))

and after:

2023-03-30T01:33:26.796829Z TRACE spawn_some_futures_and_set_tag{num_threads=19}: shuttle::runtime::execution: runnable=[TaskId(51), TaskId(106), TaskId(123), TaskId(130), TaskId(133), TaskId(705), TaskId(706), TaskId(707), TaskId(708)] next_task=Some(TaskId(51))
2023-03-30T01:33:26.796928Z TRACE spawn_some_threads_and_set_tag{num_threads=15}: shuttle::runtime::execution: runnable=[TaskId(51), TaskId(106), TaskId(123), TaskId(130), TaskId(133), TaskId(705), TaskId(706), TaskId(707), TaskId(708)] next_task=Some(TaskId(130))

As for testing: I started turning the test I used during development into a fully automatic one by turning on tracing and capturing the output. That turned out to be quite hacky and required pulling in extra dependencies, so I decided to abandon it. If this code breaks for some reason, it does not pose a correctness issue, so there is no big need for a test.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jamesbornholt (Member) commented:

This is definitely better, but I'm not sure it's quite right when the test code uses spans itself. I wrote this test:

use shuttle::{thread, check_random};
use shuttle::sync::{Arc, Mutex};
use tracing::{warn_span, warn};
use test_log::test;

#[test]
fn tracing_nested_spans() {
    check_random(|| {
        let lock = Arc::new(Mutex::new(0));
        let threads: Vec<_> = (0..3).map(|i| {
            let lock = lock.clone();
            thread::spawn(move || {
                let outer = warn_span!("outer", tid=i);
                let _outer = outer.enter();
                {
                    let mut locked = lock.lock().unwrap();
                    let inner = warn_span!("inner", tid=i);
                    let _inner = inner.enter();
                    warn!("incrementing from {}", *locked);
                    *locked += 1;
                }
            })
        }).collect();

        for thread in threads {
            thread.join().unwrap();
        }
    }, 10)
}

and in the trace output, all the step spans are gone, and the execution span is missing half the time:

2023-03-30T04:04:44.953212Z TRACE execution{i=0}: shuttle::runtime::execution: runnable=[TaskId(0)] next_task=Some(TaskId(0))
2023-03-30T04:04:44.953315Z TRACE execution{i=0}: shuttle::runtime::execution: runnable=[TaskId(0), TaskId(1)] next_task=Some(TaskId(1))
2023-03-30T04:04:44.953360Z TRACE outer{tid=0}: shuttle::sync::mutex: waiting to acquire mutex 0x6000033c0100 holder=None waiters=TaskSet { }
2023-03-30T04:04:44.953386Z TRACE outer{tid=0}: shuttle::sync::mutex: acquired mutex 0x6000033c0100 waiters=TaskSet { }
2023-03-30T04:04:44.953416Z TRACE outer{tid=0}: shuttle::runtime::execution: runnable=[TaskId(0), TaskId(1)] next_task=Some(TaskId(1))
2023-03-30T04:04:44.953441Z  WARN outer{tid=0}:inner{tid=0}: r#mod::basic::tracing: incrementing from 0
2023-03-30T04:04:44.953468Z TRACE outer{tid=0}: shuttle::sync::mutex: releasing mutex 0x6000033c0100 waiters=TaskSet { }
2023-03-30T04:04:44.953491Z TRACE outer{tid=0}: shuttle::runtime::execution: runnable=[TaskId(0), TaskId(1)] next_task=Some(TaskId(0))
2023-03-30T04:04:44.953518Z TRACE execution{i=0}: shuttle::runtime::execution: runnable=[TaskId(0), TaskId(1), TaskId(2)] next_task=Some(TaskId(1))
...

I think we probably have to do something smarter than just exiting the top span every context switch. Since we know spans form a tree, maybe we do something like remember the entire path below some known sentinel span, exit all those at a context switch, and re-enter them when we switch back? All of Shuttle's spans are above any application spans in the tree, so this should work unless we end up violating that property.
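
A minimal sketch of that save/restore idea, assuming hypothetical names (TaskSpanStack, suspend, resume) rather than Shuttle's actual internals, and leaving out how the stack gets populated (for example, a custom layer could record spans as the task enters them):

use tracing::Span;

/// Hypothetical per-task bookkeeping: the spans this task has entered
/// (outermost first) below Shuttle's own spans.
struct TaskSpanStack {
    stack: Vec<Span>,
}

impl TaskSpanStack {
    /// When the scheduler switches away from this task, exit its spans
    /// innermost-first so the subscriber's notion of the current span falls
    /// back to the sentinel (Shuttle) span.
    fn suspend(&self) {
        tracing::dispatcher::get_default(|dispatch| {
            for span in self.stack.iter().rev() {
                if let Some(id) = span.id() {
                    dispatch.exit(&id);
                }
            }
        });
    }

    /// When the scheduler resumes this task, re-enter the saved spans
    /// outermost-first so its events land back under its own span path.
    fn resume(&self) {
        tracing::dispatcher::get_default(|dispatch| {
            for span in self.stack.iter() {
                if let Some(id) = span.id() {
                    dispatch.enter(&id);
                }
            }
        });
    }
}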

Testing-wise, you could just write your own mock Subscriber implementation that records the enter/exit events and checks they work the way we expect them to. But that's still kind of annoying, so maybe not worth it; up to you.
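
For reference, a bare-bones recording Subscriber along those lines could look like the sketch below (RecordingSubscriber and SpanAction are made-up names, not part of Shuttle or tracing):

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use tracing::span::{Attributes, Id, Record};
use tracing::{Event, Metadata, Subscriber};

/// What the test cares about: which span ids were entered and exited, in order.
#[derive(Debug, PartialEq)]
enum SpanAction {
    Enter(u64),
    Exit(u64),
}

/// A minimal Subscriber that only records enter/exit calls so a test can
/// assert the span stack is saved and restored across context switches.
#[derive(Default)]
struct RecordingSubscriber {
    next_id: AtomicU64,
    actions: Mutex<Vec<SpanAction>>,
}

impl Subscriber for RecordingSubscriber {
    fn enabled(&self, _metadata: &Metadata<'_>) -> bool {
        true
    }

    fn new_span(&self, _attrs: &Attributes<'_>) -> Id {
        // Span ids must be nonzero, so start handing out ids at 1.
        Id::from_u64(self.next_id.fetch_add(1, Ordering::Relaxed) + 1)
    }

    fn record(&self, _span: &Id, _values: &Record<'_>) {}
    fn record_follows_from(&self, _span: &Id, _follows: &Id) {}
    fn event(&self, _event: &Event<'_>) {}

    fn enter(&self, span: &Id) {
        self.actions.lock().unwrap().push(SpanAction::Enter(span.into_u64()));
    }

    fn exit(&self, span: &Id) {
        self.actions.lock().unwrap().push(SpanAction::Exit(span.into_u64()));
    }
}

A test could install it with tracing::subscriber::with_default around a check_random call and then assert on the recorded sequence of enters and exits.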

@sarsko (Contributor, Author) commented Mar 31, 2023

WIP: pushing this now since I will be out next week, so others can take a look or build on it if they want.

Made another attempt at it. This one is based on how Loom does it. It seems better than the previous version, but it may still have issues. I have not tested it with futures yet. I left a couple of TODOs marking where I will pick back up.
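
Roughly, the Loom-style idea is for each task to carry the span its user code should run inside, with the executor entering that span for the duration of every step the task is scheduled. A hypothetical sketch (the Task and run_step names here are illustrative, not Shuttle's actual types):

use tracing::Span;

/// Illustrative only: each task stores the span its user code should run in.
struct Task {
    span: Span,
}

impl Task {
    /// The executor enters the stored span around each scheduled step, so
    /// events emitted during the step attach to this task's span rather than
    /// to whichever task happened to run last.
    fn run_step(&self, step: impl FnOnce()) {
        let _guard = self.span.enter();
        step();
    }
}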

@sarsko (Contributor, Author) commented Mar 31, 2023

Still haven't tested whether it gives the right output for futures, but it should at least no longer crash when they are involved.

@@ -227,3 +227,36 @@ fn iterate_over_hash_set(num_entries: u64) {
fn hashset_without_set_seed() {
check_uncontrolled_nondeterminism(|| iterate_over_hash_set(10), 1000);
}

@sarsko (Contributor, Author) commented on the diff, Mar 31, 2023

Lines 230-262 are going to be removed.

@sarsko (Contributor, Author) commented Mar 31, 2023

Hmm, so there seem to be no issues with futures.

However, the scheme is not correct. What I believe we should be seeing (when run outside of Shuttle):

2023-03-31T13:53:58.419852Z  WARN outer{tid=0}:inner{tid=0}: tracing: incrementing from 0
2023-03-31T13:53:58.419912Z  WARN outer{tid=1}:inner{tid=1}: tracing: incrementing from 1
2023-03-31T13:53:58.419969Z  WARN outer{tid=2}:inner{tid=2}: tracing: incrementing from 2

And what we are seeing:

2023-03-31T13:25:17.276058Z  WARN execution{i=9}:step{i=2 task=2}:inner{tid=1}: r#mod::basic::trace: incrementing from 2

(lacking the outer{tid=1} span)

So we seem to be seeing the same traces as Loom, but neither is displaying the correct traces.

I'll look at this more once I am back. Sorry for the noise.

@sarsko (Contributor, Author) commented Apr 18, 2023

Cleaned up the proposed fix. I want to invoke the power of frugality and suggest merging this (after any comments). It is not perfect, but it is far better than the current handling of spans, which is pretty much useless. I want to spend time fixing it properly sometime in the future, but I think my time is currently better spent elsewhere.

@sarsko (Contributor, Author) commented Apr 18, 2023

Of course, if someone else wants to take a look at it, go ahead.

@sarsko mentioned this pull request Apr 20, 2023
@sarsko (Contributor, Author) commented Apr 20, 2023

Added an issue for the follow-up fix here: #104

src/runtime/task/mod.rs (review thread, outdated, resolved)
src/runtime/execution.rs (review thread, outdated, resolved)
src/runtime/execution.rs (review thread, outdated, resolved)
@jamesbornholt merged commit 6974706 into awslabs:main on May 3, 2023
jorajeev pushed a commit that referenced this pull request Feb 29, 2024
* Fix tracing spans

* Address issues with previous attempt to fix the tracing spans

* Make the tracing fix not cause crashes for futures

* Cleanup the tracing span fix

* Address comments from @jamesbornholt