feat(upload): Retry objectstore requests by jjbayer · Pull Request #5836 · getsentry/relay

jjbayer · 2026-04-14T14:39:43Z

If necessary, retry requests to objectstore.

For now, a simple loop with one second between attempts. Later on we might implement exponential backoff or similar.
Only 429, 502, 503, 504 responses, connection errors, and timeout errors are retried.
For streams, add a new wrapper that makes them retriable until they are polled the first time.

fixes: INGEST-826
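
The retry policy described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: `UploadError`, `retry`, and the attempt count are hypothetical names, and `std::thread::sleep` stands in for the `tokio::time::sleep` call that appears in the diff.

```rust
use std::num::NonZeroU32;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type; the real one in relay-server differs.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum UploadError {
    Status(u16),
    Connect,
    Timeout,
    Other(&'static str),
}

impl UploadError {
    /// Retry only on 429, 502, 503, 504, connection errors, and timeouts.
    fn is_retriable(&self) -> bool {
        match self {
            UploadError::Status(code) => matches!(*code, 429 | 502 | 503 | 504),
            UploadError::Connect | UploadError::Timeout => true,
            UploadError::Other(_) => false,
        }
    }
}

/// Calls `attempt` up to `max_attempts` times, sleeping `interval`
/// after each retriable failure that still has attempts left.
fn retry<T>(
    max_attempts: NonZeroU32,
    interval: Duration,
    mut attempt: impl FnMut() -> Result<T, UploadError>,
) -> Result<T, UploadError> {
    let mut remaining = max_attempts.get();
    loop {
        remaining -= 1;
        match attempt() {
            Err(e) if e.is_retriable() && remaining > 0 => sleep(interval),
            // Success, a non-retriable error, or the last attempt: return as-is.
            result => return result,
        }
    }
}

fn main() {
    let mut calls = 0;
    let result = retry(NonZeroU32::new(3).unwrap(), Duration::from_millis(10), || {
        calls += 1;
        if calls < 3 { Err(UploadError::Status(503)) } else { Ok("stored") }
    });
    println!("{result:?} after {calls} calls");
}
```

Returning the last result unchanged means the caller sees the final error rather than a synthetic "retries exhausted" value.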

linear-code · 2026-04-14T15:29:26Z

INGEST-826 Implement retries for objectstore requests

Dav1dde

I think in an ideal world you only need this:

A wrapper which keeps track of whether a stream was polled or not. This is just another Stream combinator.
The wrapper allows recovery if it hasn't been polled yet.
You pass a &mut wrapper_stream to the client to transfer it.
The request fails, you attempt to recover from the wrapper via wrapper_stream.recover() -> Option<S>
If you get None, the connection failed; if you get Some, you can attempt to retry.

There is just a small problem where the objectstore client requires BoxStream<'static, ..>, which I assume can't be changed because of reqwest requirements (?).

But you can keep a similar API and functionality; you now need to employ interior mutability, which leads to something similar to what you have with the API surface. But instead of doing the Drop/oneshot dance I'd look into using a Mutex<Option<T>> (or fancier atomic based variant) and take()'ing the item on poll instead of trying to recover on Drop.

This avoids the oneshot and drop dance, which depends on the client actually dropping the stream. There is no guarantee that it does, since you're giving it full ownership.

jjbayer · 2026-04-15T08:20:20Z

But instead of doing the Drop/oneshot dance I'd look into using a Mutex<Option<T>> (or fancier atomic based variant) and take()'ing the item on poll instead of trying to recover on Drop.

My first iteration had an Arc<Mutex<Option<T>>>. I thought the channel implementation was more elegant, but you are right that the API is awkward. E.g. I would never wait on the channel Receiver, always call try_recv instead. I'll give this another go and will let you know how it goes.

Dav1dde · 2026-04-15T08:33:56Z

My first iteration had an Arc<Mutex<Option<T>>>. [...]

It's a bit awkward to deal with directly, but I think with the right amount of wrappers (maybe some kind of TakeOnce(Mutex<Option<T>>)) it could turn out quite nice.
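
A wrapper along these lines might look like the sketch below. It is only an illustration of the `Mutex<Option<T>>` idea from this thread, under the assumption that the slot is shared through an `Arc`. The `TakeOnce` that was eventually merged may have a different shape and API.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical shared slot whose value can be taken by exactly one owner.
struct TakeOnce<T>(Arc<Mutex<Option<T>>>);

impl<T> Clone for TakeOnce<T> {
    fn clone(&self) -> Self {
        TakeOnce(Arc::clone(&self.0))
    }
}

impl<T> TakeOnce<T> {
    fn new(value: T) -> Self {
        TakeOnce(Arc::new(Mutex::new(Some(value))))
    }

    /// Takes the item, making it inaccessible for other owners.
    /// `&self` is enough because the mutex provides interior mutability.
    fn take(&self) -> Option<T> {
        // Keep working even if another owner panicked while holding the lock.
        self.0.lock().unwrap_or_else(|e| e.into_inner()).take()
    }
}

fn main() {
    let ours = TakeOnce::new(vec![1u8, 2, 3]);
    let theirs = ours.clone(); // would be handed to the client with the request

    // The request failed before the body was consumed, so we can recover it.
    let recovered = ours.take();
    println!("recovered: {recovered:?}");

    // The other owner now finds the slot empty.
    println!("second take: {:?}", theirs.take());
}
```

Whichever owner calls `take` first wins; nothing depends on the other side dropping its handle.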

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jjbayer · 2026-04-15T13:15:06Z

            }
            Objectstore::TraceAttachment(attachment) => {
-                self.handle_trace_attachment(attachment).await
+                let _ = self.handle_trace_attachment(attachment).await;


This is another case where the function ideally wouldn't even return anything. I didn't want to touch that function in this PR though.

jjbayer · 2026-04-15T13:15:45Z

-                    relay_statsd::metric!(
-                        counter(RelayCounters::AttachmentUpload) += 1,
-                        result = match &result {
-                            Ok(_) => "success",
-                            Err(e) => e.as_str(),
-                        },
-                        type = "envelope",
-                    );


This metric is now logged further down the call stack, so it disappears in a few places.

jjbayer · 2026-04-15T13:23:09Z

+pub enum RetriableStream<S> {
+    /// The stream has not been polled.
+    /// Other owners of `S` can recover it by calling [`TakeOnce::take`].
+    New(TakeOnce<S>),
+    /// The stream has been polled at least once and is no longer retriable.
+    ///
+    /// This state is an optimization such that the stream does not have to lock a mutex
+    /// on every poll.
+    Used(S),
+}


The reason why I added a new type, rather than implementing Stream for TakeOnce<S: Stream>, is that there is no point in having a mutex after the first poll -> the internal state of the stream transitions to a simple wrapper.
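
The state transition can be illustrated with a standard-library analogue. The sketch below uses `Iterator` in place of `Stream` so that it runs without external crates. The `Slot` alias and the `Taken` variant are assumptions added for the illustration and are not part of the enum shown in the diff.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `TakeOnce<S>`.
type Slot<S> = Arc<Mutex<Option<S>>>;

/// Iterator analogue of the `RetriableStream` state machine.
enum RetriableIter<I> {
    /// Not advanced yet; another holder of the slot may still take the value back.
    New(Slot<I>),
    /// Advanced at least once; owned directly, so later calls never lock.
    Used(I),
    /// Assumption for this sketch: the other holder took the value first.
    Taken,
}

impl<I: Iterator> Iterator for RetriableIter<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        if let RetriableIter::New(slot) = self {
            // First call: lock once and move the inner iterator out of the slot.
            let inner = slot.lock().unwrap_or_else(|e| e.into_inner()).take();
            *self = match inner {
                Some(inner) => RetriableIter::Used(inner),
                None => RetriableIter::Taken,
            };
        }
        match self {
            RetriableIter::Used(inner) => inner.next(),
            _ => None,
        }
    }
}

fn main() {
    let slot: Slot<std::vec::IntoIter<u8>> =
        Arc::new(Mutex::new(Some(vec![1, 2].into_iter())));
    let mut body = RetriableIter::New(Arc::clone(&slot));

    // Until the first call, the sender could still recover the value from `slot`.
    assert_eq!(body.next(), Some(1));
    // Afterwards the slot is empty, so a retry is no longer possible.
    assert!(slot.lock().unwrap().is_none());
    assert_eq!(body.next(), Some(2));
}
```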

Dav1dde · 2026-04-15T13:29:21Z

+    }
+
+    /// Takes the item, making it inaccessible for other owners.
+    pub fn take(&mut self) -> Option<T> {


Shouldn't the signature here omit the `mut`?

Suggested change

-    pub fn take(&mut self) -> Option<T> {
+    pub fn take(&self) -> Option<T> {

Dav1dde · 2026-04-15T13:36:34Z

+        retention_hours: Option<u16>,
+    ) -> Result<ObjectstoreKey, Error> {
+        tokio::time::timeout(self.timeout, async {
+            let mut result = Err(Error::Config("zero retries configured"));


Use a NonZero* type in the config to make that state impossible?
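
A small sketch of that suggestion, with a hypothetical `UploadConfig` struct and `max_attempts` field (the real config names are not shown in this thread):

```rust
use std::num::NonZeroU32;

/// Hypothetical config struct; the field name is illustrative only.
struct UploadConfig {
    max_attempts: NonZeroU32,
}

/// Rejects zero once, at the configuration boundary.
fn parse_max_attempts(raw: u32) -> Result<NonZeroU32, String> {
    NonZeroU32::new(raw).ok_or_else(|| "max_attempts must be at least 1".to_owned())
}

fn main() {
    assert!(parse_max_attempts(0).is_err());

    let config = UploadConfig {
        max_attempts: parse_max_attempts(3).unwrap(),
    };
    // Downstream code can rely on at least one attempt, so it needs no
    // placeholder error such as "zero retries configured".
    for attempt in 1..=config.max_attempts.get() {
        println!("attempt {attempt}");
    }
}
```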

Dav1dde · 2026-04-15T13:37:10Z

+
+                match &result {
+                    Err(e) if e.is_retriable() => {
+                        tokio::time::sleep(self.retry_interval).await;


We have a RetryBackoff type, maybe worth also using here.

👍 , will consider it in case the simple loop doesn't work.
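
For comparison with the fixed one-second interval, a capped exponential delay could be computed as in the sketch below. `backoff_delay` is a hypothetical helper written for this illustration; it is not Relay's `RetryBackoff` type and makes no claim about that type's API.

```rust
use std::time::Duration;

/// Hypothetical helper: delay before retry number `attempt` (0-based).
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // Double the delay on each attempt; saturate instead of overflowing,
    // then cap the result at `max`.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max)
}

fn main() {
    let base = Duration::from_secs(1);
    let max = Duration::from_secs(10);
    for attempt in 0..6 {
        println!("retry {attempt}: wait {:?}", backoff_delay(base, max, attempt));
    }
}
```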

cursor · 2026-04-15T14:37:13Z

+                            | StatusCode::TOO_MANY_REQUESTS
+                            | StatusCode::SERVICE_UNAVAILABLE
+                    )
+                );


Retries 503 but PR specifies only 429, 502, 504

Low Severity

is_retriable includes StatusCode::SERVICE_UNAVAILABLE (503) in the set of retriable status codes, but the PR description explicitly states "Only 429, 502, 504 responses and connection errors are retried." While 503 is arguably a reasonable retriable status, it wasn't part of the documented intent and could lead to unexpected retry behavior for service-unavailable responses.

Reviewed by Cursor Bugbot for commit 4317a26.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 5259894.

sentry · 2026-04-16T09:15:07Z

+                    counter(RelayCounters::AttachmentUpload) += 1,
+                    result = e.as_str(),


Bug: In handle_trace_attachment, a session creation failure causes an early return, preventing the AttachmentUpload metric from being emitted and silently dropping the error.
Severity: MEDIUM

Suggested Fix

Modify the error handling in handle_trace_attachment to ensure the AttachmentUpload metric is emitted when session creation fails. This can be done by explicitly handling the session creation error and emitting the metric before returning, similar to how it's handled in handle_envelope and handle_event_attachment.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: relay-server/src/services/objectstore.rs#L413-L414 Potential issue: When creating a session for a trace attachment fails within `do_handle_trace_attachment`, the error is wrapped as `Error::UploadFailed` and the function returns early. This prevents the `upload()` function from being called, which is where the `AttachmentUpload` metric is supposed to be emitted. Consequently, session creation failures for trace attachments are not reported, creating an observability gap. This behavior is inconsistent with other handlers like `handle_envelope` and `handle_event_attachment`, which correctly emit metrics in this scenario.

I will follow up.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jjbayer added 3 commits April 14, 2026 15:54

basic

9023d07

ref: use it

28d0617

file

5bd71f0

jjbayer marked this pull request as ready for review April 14, 2026 15:28

jjbayer requested a review from a team as a code owner April 14, 2026 15:28

Dav1dde reviewed Apr 15, 2026

Comment thread relay-server/src/services/objectstore.rs Outdated

jjbayer marked this pull request as draft April 15, 2026 08:20

jjbayer and others added 8 commits April 15, 2026 12:13

use & rewrite

9f02261

test: add unit tests for RetriableStream

f8e85a7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

retry condition

c552b75

test: add integration test proving upload retries on connect error

adcff88

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor

2241366

retries everywhere

2f09824

self review

905142d

self review 2

299b151

jjbayer commented Apr 15, 2026

jjbayer changed the title from "feat(upload): Retriable streams" to "feat(upload): Retry objectstore requests" Apr 15, 2026

lint

c5a9be6

jjbayer marked this pull request as ready for review April 15, 2026 13:27

jjbayer requested a review from Dav1dde April 15, 2026 13:27

sentry Bot reviewed Apr 15, 2026

Comment thread relay-server/src/services/objectstore.rs Outdated

Dav1dde reviewed Apr 15, 2026

more review

62e2f8a

cursor Bot reviewed Apr 15, 2026

Comment thread relay-server/src/services/objectstore.rs

Comment thread relay-server/src/services/objectstore.rs

Comment thread relay-server/src/services/objectstore.rs Outdated

fix potential panic

0d9dd1d

sentry Bot reviewed Apr 15, 2026

Comment thread relay-server/src/services/objectstore.rs

NonZero

4317a26

cursor Bot reviewed Apr 15, 2026

Dav1dde approved these changes Apr 16, 2026

jjbayer and others added 5 commits April 16, 2026 09:45

review comments

61f339d

rename retriable -> retryable

b594bd9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

restore missing metric

5dd5029

Merge remote-tracking branch 'origin/master' into feat/retriable-stream

a54ee49

changelog

5259894

sentry Bot reviewed Apr 16, 2026

Comment thread relay-server/src/services/objectstore.rs Outdated

cursor Bot reviewed Apr 16, 2026

Comment thread relay-server/src/services/objectstore.rs

meta: show black diff to debug CI failure

b0b4efa

sentry Bot reviewed Apr 16, 2026

Comment thread relay-server/src/services/objectstore.rs

more exceptions, I should clean this up

511ad1a

jjbayer enabled auto-merge April 16, 2026 09:12

sentry Bot reviewed Apr 16, 2026

jjbayer added this pull request to the merge queue Apr 16, 2026

Merged via the queue into master with commit 216f360 Apr 16, 2026
30 checks passed

jjbayer deleted the feat/retriable-stream branch April 16, 2026 09:46

jjbayer mentioned this pull request Apr 17, 2026

ref(upload): Clean up objectstore error handling #5851

Merged

constantinius pushed a commit that referenced this pull request Apr 23, 2026

feat(upload): Retry objectstore requests (#5836)

cb130b0

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
