Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sequencer-relayer): avoid hanging while waiting for submitter task to return #1206

Merged
merged 1 commit into from
Jun 27, 2024

Conversation

Fraser999
Copy link
Contributor

Summary

This fixes a bug whereby the relayer hangs during shutdown due to the submitter task not being instructed to shut down.

Background

Bugfix.

Changes

  • Rather than cloning a single cancellation token, we now have the following hierarchy:
    • SequencerRelayer::new constructs a ShutdownHandle which wraps the top-level cancellation token. This handle is held as a member variable of SequencerRelayer and is used to cancel all relayer child tasks if the API task exits. A clone of the ShutdownHandle is also held in main and is used to cancel all tasks if a SIGTERM is received.
    • A child of ShutdownHandle's token is provided to the relayer task, named relayer_shutdown_token. It is never actively cancelled - only ever via the parent token.
    • A child of relayer_shutdown_token is provided to the submitter task and also held as a member of Relayer. It is actively cancelled after the relayer's main select! loop has exited if the submitter task is still running.

Testing

Manual test only. I could see no way to artificially cause the relayer's select! loop to exit. We could potentially try dropping in some #[cfg(test)] gated code in the loop to force this, but that would also mean we couldn't use the blackbox tests for it.

Related Issues

Closes #1095.

@Fraser999 Fraser999 requested a review from a team as a code owner June 25, 2024 18:00
@github-actions github-actions bot added the sequencer-relayer pertaining to the astria-sequencer-relayer crate label Jun 25, 2024
info!("received shutdown signal");
break Ok("shutdown signal received");
}

_ = &mut submitter_task => break Err(eyre!("Celestia submission task returned")),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unfortunate side-effect of changing to use child cancellation tokens - the submitter task (child) is no longer able to cancel the relayer task (parent). Since the submitter task can exit with error during init, we need to poll for that here or else we hang in this select! loop.

This also means we need to fuse the submitter task and outside the loop we need to check whether it's already terminated or not before awaiting it, or else we'd hang there.

There are other potential approaches to address that, but I couldn't see anything cleaner than this (except for my still-preferred approach of a single cancellation token for all tasks rather than this hierarchy).

@@ -607,6 +608,7 @@ impl TestSequencerRelayer {
.and_then(|value: serde_json::Value| serde_json::from_value::<ZPage>(value).ok())
.map_or("unknown".to_string(), |zpage| zpage.status);

error!("timed out; context: `{context}`, state: `{state}`, healthz: `{healthz}`");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps diagnose the point at which the test fails, since we can get several further log lines after this point before the panic is printed.

@Fraser999 Fraser999 added this pull request to the merge queue Jun 27, 2024
Merged via the queue into main with commit 2daebe5 Jun 27, 2024
38 checks passed
@Fraser999 Fraser999 deleted the fraser/exit-relayer-if-reader-loop-ends branch June 27, 2024 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sequencer-relayer pertaining to the astria-sequencer-relayer crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(sequencer-relayer): exit process if reader task ends
2 participants