Generators --> Corosensei #204

Merged
sarsko merged 3 commits into awslabs:main from dylanjwolff:corosensei-leak
Oct 23, 2025

Conversation

@dylanjwolff
Contributor

@dylanjwolff dylanjwolff commented Aug 22, 2025

This PR switches Shuttle continuations to use corosensei rather than the generators library. In the context-switch heavy micro-benchmarks (for example, shuttle/benches/counter.rs with the RandomScheduler), this results in 5-20% faster scheduling throughput because of reduced context-switching overhead.

Major points for discussion:

  1. There is an API change with corosensei: we now need to pass an explicit yielder in order to suspend a coroutine. Because continuations are kept in a pool and reused, this requires us to exfiltrate the yielder from the closure passed to Coroutine::with_stack and persist it for the lifetime of the Continuation. Currently the yielder is carried around as a raw pointer on the Task struct, but I suspect there might be a nicer way to do this. Unfortunately, naively embedding it in the Continuation struct is not possible, because the Continuation is already borrowed by its resume method when switch is called.

  2. There is an issue where corosensei expects to catch a ForcedUnwind panic when dropping coroutines. Something in the way Shuttle handles panics is intercepting the ForcedUnwind and not propagating it up to where corosensei expects it, causing it to panic. This only arises on an initial panic, or during cleanup if some Tasks are detached. The current solution is to call force_reset instead of force_unwind in these cases, which does not attempt to unwind the coroutine and therefore may leak resources allocated on the coroutine. In these two scenarios we probably don't care about leaking resources, because the whole test is about to exit. UPDATE: this is fixed upstream by Amanieu/corosensei#57, which addresses the force_unwind panic "the ForcedUnwind panic was caught and not rethrown".

TODOs before this is ready for merging:

  • test on some client projects over a longer period of time
  • update the ignored test (depends on how we handle (1))
  • general cleanup (remove unnecessary pub, delete commented-out println!s, etc.)
  • refactor the ContinuationState enum to contain the coroutine function itself -- I'd like to move this refactor to a future PR so that we can land this PR sooner rather than later

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sarsko
Contributor

sarsko commented Sep 17, 2025

Sorry for doing an oopsie on your PR history.

Regarding Continuation::drop:
As far as I can tell, force_unwind with a caught panic, as in:

loop {
    let res = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| self.coroutine.force_unwind())); 
    if res.is_ok() {
        break;
    }
}

seems to work.

A few notes:

  1. In my limited testing the first call is always a panic (becoming an Err) and then the second call is an Ok. We should probably limit the number of calls to 5 or something, and just error log and force_reset if it fails the 5th time.
  2. On an Ok we should be able to reuse the continuation, meaning that PooledContinuation should be such that we call the "try_reset" functionality, then if that succeeds we reuse the continuation.
  3. We should info or debug trace before and after doing continuation cleanups (ie we should instrument ExecutionState::cleanup)
  4. We need to take the panic hook before doing this, if not it will be invoked repeatedly.
  5. We don't wanna do cleanup if we're panicking. Just leak memory and get outta there (ie force_reset)

@sarsko
Contributor

sarsko commented Sep 17, 2025

Okay, never mind on the loop; as per Amanieu/corosensei#57, only the first call matters.

@dylanjwolff dylanjwolff force-pushed the corosensei-leak branch 2 times, most recently from a0e2074 to 40cc2a5, October 13, 2025 19:47
Comment thread shuttle/src/runtime/thread/continuation.rs Outdated
Ready, // has a suspended function in its cell; waiting for input about what to do next
Running, // currently inside a user-provided function
FinishedIteration, // has finished the previous function, can be initialized with a new one
Exited, // the internal coroutine has exited its loop and cannot receive new functions to execute
Contributor

Why do we need FinishedIteration and Exited?

Contributor Author

FinishedIteration is similar to NotReady, except that it has an old function in its cell, instead of no function at all.

Exited indicates that the coroutine has exited the inner loop and cannot be reused or resumed.

I think the new states make the state-machine easier to understand, even if they aren't strictly necessary (obviously they aren't because we got away with only three states before).

Comment thread shuttle/src/lib.rs
pub fn new() -> Self {
    Self {
-       stack_size: 0x8000,
+       stack_size: 0xf000,
Contributor

?

Contributor Author

In my testing, the corosensei coroutines were occasionally running out of stack space. I guess there is some slight constant memory-usage overhead compared to generators.

Contributor Author

Per the comment on the longer benchmarking time with backtraces, this is probably because corosensei produces deeper call stacks than generators.

Comment thread shuttle/src/runtime/thread/continuation.rs Outdated
}

impl Drop for ContinuationPool {
fn drop(&mut self) {
Contributor

Why is it fine to remove this?

Contributor Author

This "cheat" was necessary because of an internal implementation detail of the generators library. As far as I can see, corosensei does not internally use thread-locals in the same way. I think the "cheat" is a bit difficult to reason about (overloading a state, messing with drop behavior), so if it is not necessary I believe it should be removed.

@dylanjwolff dylanjwolff marked this pull request as ready for review October 13, 2025 21:54
@sarsko
Contributor

sarsko commented Oct 14, 2025

Benchmarks take a gargantuan amount of time. Not uncommon for them to take long, but they shouldn't time out?

@dylanjwolff
Contributor Author

dylanjwolff commented Oct 14, 2025

> Benchmarks take a gargantuan amount of time. Not uncommon for them to take long, but they shouldn't time out?

I reran the CI job last night with #213 and it finished successfully in 30m:

dylanjwolff#11

Corosensei is faster across the board.

Must be that capturing a backtrace from corosensei is more expensive (deeper stack).

@sarsko
Contributor

sarsko commented Oct 21, 2025

Could you rebase so that benchmarks get run? (I merged the backtrace fix.)

@sarsko sarsko enabled auto-merge (squash) October 23, 2025 22:00
@sarsko sarsko disabled auto-merge October 23, 2025 22:00
@sarsko sarsko merged commit 8ec3c44 into awslabs:main Oct 23, 2025
4 of 5 checks passed