Skip to content

feat(upload): Retry objectstore requests#5836

Merged
jjbayer merged 22 commits intomasterfrom
feat/retriable-stream
Apr 16, 2026
Merged

feat(upload): Retry objectstore requests#5836
jjbayer merged 22 commits intomasterfrom
feat/retriable-stream

Conversation

@jjbayer
Copy link
Copy Markdown
Member

@jjbayer jjbayer commented Apr 14, 2026

If necessary, retry requests to objectstore.

  • For now, a simple loop with one second between attempts. Later on we might implement exponential backoff or similar.
  • Only 429, 502, 503, 504 responses, connection errors, and timeout errors are retried.
  • For streams, add a new wrapper that makes them retriable until they are polled the first time.

fixes: INGEST-826

@jjbayer jjbayer marked this pull request as ready for review April 14, 2026 15:28
@jjbayer jjbayer requested a review from a team as a code owner April 14, 2026 15:28
@linear-code
Copy link
Copy Markdown

linear-code Bot commented Apr 14, 2026

Copy link
Copy Markdown
Member

@Dav1dde Dav1dde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in an ideal world you only need this:

  • A wrapper which keeps track of whether a stream was polled or not. This is just another Stream combinator.
  • The wrapper allows recovery if it hasn't been polled yet.
  • You pass a &mut wrapper_stream to the client to transfer it.
  • The request fails, you attempt to recover from the wrapper via wrapper_stream.reocver() -> Option<S>
  • If you get None, the connection failed, if you get Some you can attempt to retry

There is just a small problem where the objectstore client requires BoxStream<'static, ..>, which I assume can't be changed because of reqwest requirements (?).

But you can can keep a similar API and functionality, you now need to employ interior mutability. Which leads to something similar what you have with the API surface. But instead of doing the Drop/oneshot dance I'd look into using a Mutex<Option<T>> (or fancier atomic based variant) and take()'ing the item on poll instead of trying to recover on Drop.

This avoids the oneshot and drop dance, which actually depends on the client dropping the stream where there is no guarantee it actually has to since you're giving full ownership.

Comment thread relay-server/src/services/objectstore.rs Outdated
@jjbayer
Copy link
Copy Markdown
Member Author

jjbayer commented Apr 15, 2026

But instead of doing the Drop/oneshot dance I'd look into using a Mutex<Option<T>> (or fancier atomic based variant) and take()'ing the item on poll instead of trying to recover on Drop.

My first iteration had an Arc<Mutex<Option<T>>>. I thought the channel implementation was more elegant, but you are right that the API is awkward. E.g. I would never wait on the channel Receiver, always call try_recv instead. I'll give this another go and will let you know how it goes.

@jjbayer jjbayer marked this pull request as draft April 15, 2026 08:20
@Dav1dde
Copy link
Copy Markdown
Member

Dav1dde commented Apr 15, 2026

My first iteration had an Arc<Mutex<Option>>. [...]

It's a bit awkward to deal with directly, but I think with the right amount of wrappers (maybe some kind of TakeOnce(Mutex<Option<T>>)) it could turn out quite nice.

}
Objectstore::TraceAttachment(attachment) => {
self.handle_trace_attachment(attachment).await
let _ = self.handle_trace_attachment(attachment).await;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another case where the function ideally wouldn't even return anything. I didn't want to touch that function in this PR though.

Comment on lines -336 to -343
relay_statsd::metric!(
counter(RelayCounters::AttachmentUpload) += 1,
result = match &result {
Ok(_) => "success",
Err(e) => e.as_str(),
},
type = "envelope",
);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is now logged further down the call stack, so it disappears in a few places.

Comment on lines +38 to +47
pub enum RetriableStream<S> {
/// The stream has not been polled.
/// Other owners of `S` can recover it by calling [`TakeOnce::take`].
New(TakeOnce<S>),
/// The stream has been polled at least once and is no longer retriable.
///
/// This state is an optimization such that the stream does not have to lock a mutex
/// on every poll.
Used(S),
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I added a new type, rather than implementing Stream for TakeOnce<S: Stream>, is that there is no point in having a mutex after the first poll -> the internal state of the stream transitions to a simple wrapper.

Comment thread tests/integration/test_upload.py Outdated
@jjbayer jjbayer changed the title feat(upload): Retriable streams feat(upload): Retry objectstore requests Apr 15, 2026
@jjbayer jjbayer marked this pull request as ready for review April 15, 2026 13:27
@jjbayer jjbayer requested a review from Dav1dde April 15, 2026 13:27
Comment thread relay-server/src/services/objectstore.rs Outdated
}

/// Takes the item, making it inaccessible for other owners.
pub fn take(&mut self) -> Option<T> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the signature here not include the mut?:

Suggested change
pub fn take(&mut self) -> Option<T> {
pub fn take(&self) -> Option<T> {

retention_hours: Option<u16>,
) -> Result<ObjectstoreKey, Error> {
tokio::time::timeout(self.timeout, async {
let mut result = Err(Error::Config("zero retries configured"));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NonZero* type in the config and make that state impossible?


match &result {
Err(e) if e.is_retriable() => {
tokio::time::sleep(self.retry_interval).await;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a RetryBackoff type, maybe worth also using here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 , will consider it in case the simple loop doesn't work.

Comment thread relay-server/src/services/objectstore.rs
Comment thread relay-server/src/services/objectstore.rs
Comment thread relay-server/src/services/objectstore.rs Outdated
Comment thread relay-server/src/services/objectstore.rs
Comment thread relay-server/src/services/objectstore.rs Outdated
| StatusCode::TOO_MANY_REQUESTS
| StatusCode::SERVICE_UNAVAILABLE
)
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retries 503 but PR specifies only 429, 502, 504

Low Severity

is_retriable includes StatusCode::SERVICE_UNAVAILABLE (503) in the set of retriable status codes, but the PR description explicitly states "Only 429, 502, 504 responses and connection errors are retried." While 503 is arguably a reasonable retriable status, it wasn't part of the documented intent and could lead to unexpected retry behavior for service-unavailable responses.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4317a26. Configure here.

Comment thread relay-server/src/services/objectstore.rs Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5259894. Configure here.

Comment thread relay-server/src/services/objectstore.rs
Comment thread relay-server/src/services/objectstore.rs
@jjbayer jjbayer enabled auto-merge April 16, 2026 09:12
Comment on lines +413 to +414
counter(RelayCounters::AttachmentUpload) += 1,
result = e.as_str(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: In handle_trace_attachment, a session creation failure causes an early return, preventing the AttachmentUpload metric from being emitted and silently dropping the error.
Severity: MEDIUM

Suggested Fix

Modify the error handling in handle_trace_attachment to ensure the AttachmentUpload metric is emitted when session creation fails. This can be done by explicitly handling the session creation error and emitting the metric before returning, similar to how it's handled in handle_envelope and handle_event_attachment.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: relay-server/src/services/objectstore.rs#L413-L414

Potential issue: When creating a session for a trace attachment fails within
`do_handle_trace_attachment`, the error is wrapped as `Error::UploadFailed` and the
function returns early. This prevents the `upload()` function from being called, which
is where the `AttachmentUpload` metric is supposed to be emitted. Consequently, session
creation failures for trace attachments are not reported, creating an observability gap.
This behavior is inconsistent with other handlers like `handle_envelope` and
`handle_event_attachment`, which correctly emit metrics in this scenario.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will follow-up.

@jjbayer jjbayer added this pull request to the merge queue Apr 16, 2026
Merged via the queue into master with commit 216f360 Apr 16, 2026
30 checks passed
@jjbayer jjbayer deleted the feat/retriable-stream branch April 16, 2026 09:46
constantinius pushed a commit that referenced this pull request Apr 23, 2026
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants