Skip to content

Comments

ref(service): Add task spawning with panic isolation to StorageService#322

Merged
jan-auer merged 11 commits intomainfrom
ref/task-spawning
Feb 20, 2026
Merged

ref(service): Add task spawning with panic isolation to StorageService#322
jan-auer merged 11 commits intomainfrom
ref/task-spawning

Conversation

@jan-auer
Copy link
Member

@jan-auer jan-auer commented Feb 20, 2026

Each StorageService operation now runs in a spawned tokio task, communicating results via a oneshot channel. This gives two guarantees:

  • run-to-completion (operations finish even if the caller is cancelled, critical for tombstone consistency)
  • panic isolation (a backend panic yields Error::Panic instead of propagating).

Business logic moved from StorageService to a private StorageServiceInner struct. StorageService is now a thin public API that takes owned values (ObjectId, Metadata), spawns each operation in a separate task, and delegates to the inner type. Existing tests target StorageServiceInner methods directly (same-module access), so they required no changes. New tests exercise the public spawn-based API including panic recovery and receiver-drop completion.

Next steps

  • Two-tier concurrency limiting: Add a high web-request limit (middleware) and a lower service-operation limit (semaphore in ServiceRunner), replacing the current InFlightRequestsLayer with purpose-specific controls.
  • Bounded task queue: Replace direct tokio::spawn with a bounded channel dispatching to a fixed worker pool, enabling backpressure, fire-and-forget operations, and priority scheduling.
  • Cancellation: Use CancellationToken to actively cancel backend operations on client disconnect, with safety invariants protecting tombstone writes from cancellation.

Ref FS-171

Public methods now take owned values and spawn each operation in an
isolated task. This ensures run-to-completion on caller cancellation
(critical for tombstone consistency) and prevents backend panics from
propagating.
@jan-auer jan-auer requested a review from a team as a code owner February 20, 2026 14:51
Integration tests (real backends) now use the public StorageService API.
Unit tests use StorageServiceInner directly.
Copy link
Member Author

@jan-auer jan-auer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am considering to split the public and inner types in two modules eventually. For now, they can remain in one.

/// isolated from panics in backend code — a failure in one operation does not
/// bring down other in-flight work. See [`Error::Panic`].
#[derive(Clone, Debug)]
pub struct StorageService(Arc<StorageServiceInner>);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This merely moved so that the struct definition and impl block aren't separated, and another section on task isolation was added.

{
let (tx, rx) = tokio::sync::oneshot::channel();
tokio::spawn(async move {
let result = std::panic::AssertUnwindSafe(f)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a panic handler in requests, too. This one allows the request handler to catch panics individually, such as in the batch endpoint.

long_term_backend: BoxedBackend,
}

impl StorageServiceInner {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All actual business logic has been moved to StorageServiceInner now. This avoid an indirection via self.0 in the implementation.

.unwrap();

let (_metadata, stream) = service.get_object(&key).await.unwrap().unwrap();
let (_metadata, stream) = service.get_object(key).await.unwrap().unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration tests use the public API while all lower-level logic tests use the inner service.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@jan-auer jan-auer merged commit 4f5938f into main Feb 20, 2026
20 checks passed
@jan-auer jan-auer deleted the ref/task-spawning branch February 20, 2026 18:52
jan-auer added a commit that referenced this pull request Feb 21, 2026
Builds on [#322](#322)
(task spawning with panic isolation).

Adds a semaphore-based concurrency limit to `StorageService` that caps
the total number of in-flight backend operations. When the limit is
reached, new operations are rejected immediately with HTTP 429 rather
than queueing, preventing backend overload during traffic bursts.

The limit is acquired before spawning each operation task and the permit
is held until the task completes (including after panics). Configured
via `service.max_concurrency` (default: 500, effectively unlimited
without configuration).

Includes tests for rejection at capacity and permit release after
panics.
@linear
Copy link

linear bot commented Feb 23, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants