
Conversation

@giortzisg (Collaborator) commented Nov 3, 2025

DESCRIBE YOUR PR

Add the docs for the backend telemetry buffer

IS YOUR CHANGE URGENT?

Help us prioritize incoming PRs by letting us know when the change needs to go live.

  • Urgent deadline (GA date, etc.):
  • Other deadline:
  • None: Not urgent, can wait up to 1 week+

@giortzisg giortzisg self-assigned this Nov 3, 2025
vercel bot commented Nov 3, 2025

The latest updates on your projects:

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| develop-docs | Ready | Preview | Comment | Nov 25, 2025 11:05am |

1 Skipped Deployment

| Project | Deployment | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| sentry-docs | Ignored | Preview | Nov 25, 2025 11:05am |

@giortzisg giortzisg force-pushed the feat/telemetry-buffers branch from f84179f to e89c19b Compare November 7, 2025 11:19
@giortzisg giortzisg changed the title add initial buffer docs feat(develop-docs): add Backend Telemetry Buffer Nov 7, 2025
@giortzisg giortzisg marked this pull request as ready for review November 7, 2025 11:21
@giortzisg giortzisg force-pushed the feat/telemetry-buffers branch from 799c68b to d677af4 Compare November 7, 2025 11:49
- `Flush(timeout)`
- `Close(timeout)`
Contributor commented:

[attached screenshot]

Text is being cut off in this diagram.

Collaborator (author) replied:

Changed back from the mermaid chart to a plain ASCII chart when I saw the cut-off, so it should be OK now.

Contributor @sfanahata left a comment:

Just the diagram is a little off; everything else looks great. 🎸

#### How the Buffer works

- **Smart batching**: Logs are batched into single requests; errors, transactions, and monitors are sent immediately.
- **Pre-send rate limiting**: The scheduler checks rate limits before dispatching, avoiding unnecessary requests while keeping items buffered.
Member commented:

We should clarify what "while keeping items buffered" means. Today we drop data if we are rate-limited; is the plan to keep items in the buffer?

Collaborator (author) replied:

We should drop, to avoid filling up the buffer.
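The behavior agreed above (check rate limits before dispatching, and drop rather than retain limited items) can be sketched roughly as follows. This is an illustration only; the `RateLimiter` type and the `dispatch` function are hypothetical names, not the SDK's actual API:

```go
package main

import "fmt"

// Category identifies a telemetry type (hypothetical name).
type Category string

// RateLimiter tracks which categories are currently rate-limited.
type RateLimiter struct {
	limited map[Category]bool
}

func (r *RateLimiter) IsLimited(c Category) bool { return r.limited[c] }

// dispatch checks rate limits before sending. Rate-limited items are
// dropped (and would be counted for client reports), not kept buffered,
// so a long-lived limit cannot fill the buffer.
func dispatch(rl *RateLimiter, c Category, items []string) (sent, dropped int) {
	if rl.IsLimited(c) {
		return 0, len(items) // drop to avoid filling up the buffer
	}
	return len(items), 0 // a real implementation hands off to the transport here
}

func main() {
	rl := &RateLimiter{limited: map[Category]bool{"log": true}}
	s, d := dispatch(rl, "log", []string{"a", "b"})
	fmt.Println(s, d) // logs are limited: everything dropped
	s, d = dispatch(rl, "error", []string{"e1"})
	fmt.Println(s, d) // errors are not limited: sent
}
```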

### Configuration

#### Buffer Options
- **Capacity**: 100 items for errors and check-ins, 10*BATCH_SIZE for logs, 1000 for transactions.
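The per-category capacities above might be expressed as an options struct along these lines. This is a sketch with illustrative field names, not the SDK's real configuration surface; `logBatchSize` stands in for the `BATCH_SIZE` constant mentioned above (100 for logs, per the batching options later in this page):

```go
package main

import "fmt"

const logBatchSize = 100 // assumed value of BATCH_SIZE for logs

// BufferOptions holds per-category capacities (illustrative names).
type BufferOptions struct {
	ErrorCapacity       int
	CheckInCapacity     int
	LogCapacity         int
	TransactionCapacity int
}

// defaultOptions mirrors the defaults listed above:
// 100 for errors and check-ins, 10*BATCH_SIZE for logs, 1000 for transactions.
func defaultOptions() BufferOptions {
	return BufferOptions{
		ErrorCapacity:       100,
		CheckInCapacity:     100,
		LogCapacity:         10 * logBatchSize,
		TransactionCapacity: 1000,
	}
}

func main() {
	fmt.Printf("%+v\n", defaultOptions())
}
```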
Member commented:

Do we have multiple buffers per telemetry type, or just one that changes its capacity based on which telemetry the SDK is configured to emit?


```go
func (s *Scheduler) flush() {
	// should process all store buffers and send to transport
}
```

### Priorities
- CRITICAL: Error, Feedback.
- HIGH: Session, CheckIn.
- MEDIUM: Transaction, ClientReport, Span.
Member commented:

Not sure if this is already set in stone, but I feel that logs should be at least as important as spans, thinking about it from a debuggability point of view.

Collaborator (author) replied:

Since logs are batched, I don't think it makes much sense to give them the same priority. Also, the main point of the design was to avoid starvation of the other types by an overabundance of logs.
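A strict-priority pick loop of the kind described (drain higher priorities first, so batched logs cannot starve errors) might look like this. The `Priority` values mirror the list above; placing logs at an assumed `Low` level and the queue layout are illustrative assumptions:

```go
package main

import "fmt"

// Priority levels, mirroring the list above (logs assumed to be LOW).
type Priority int

const (
	Critical Priority = iota // Error, Feedback
	High                     // Session, CheckIn
	Medium                   // Transaction, ClientReport, Span
	Low                      // Log batches (assumed)
)

// nextBatch pops the front item of the highest-priority non-empty queue.
func nextBatch(queues map[Priority][]string) (string, bool) {
	for _, p := range []Priority{Critical, High, Medium, Low} {
		if q := queues[p]; len(q) > 0 {
			item := q[0]
			queues[p] = q[1:]
			return item, true
		}
	}
	return "", false
}

func main() {
	queues := map[Priority][]string{
		Low:      {"log-batch-1"},
		Critical: {"error-1"},
	}
	// Errors drain before log batches, regardless of arrival order.
	for {
		item, ok := nextBatch(queues)
		if !ok {
			break
		}
		fmt.Println(item)
	}
}
```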

- **Bounded capacity**: Default to 100 items for errors, logs, and monitors; 1000 for transactions. This prevents unbounded memory growth regardless of telemetry volume and backpressure handling.
- **Overflow policies**:
  - `drop_oldest` (default): Evicts the oldest item when the buffer is full, making room for new data.
  - `drop_newest`: Rejects incoming items when full, preserving what's already queued.
Member commented:

Maybe this can be made optional, as I don't foresee a need to use this one.

- `batchSize`: Number of items to combine into a single batch (1 for errors, transactions, and monitors; 100 for logs).
- `timeout`: Maximum time to wait before sending a partial batch (5 seconds for logs).
- **Bucketed Storage Support**: The storage interface should satisfy both bucketed and single-item implementations, allowing sending spans per trace id (required for Span First).
- **Observability**: Each store tracks dropped item counts for client reports.
Member commented:

Does it just track them and create the client report some time later, or does it generate the client report immediately on each dropped envelope/item?

Collaborator (author) replied:

Actually, this is a nice catch. The correct approach would be to add to the ClientReports buffer and then send the reports when the scheduler selects that priority. I'm planning to update both this page and the client reports page when we implement Client Reports in Go, so I'll remove the line for now.
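The `drop_oldest` policy described above can be sketched as a bounded FIFO that evicts from the front when full and counts evictions for later client reports. This is an illustration under those assumptions, not the SDK's implementation:

```go
package main

import "fmt"

// boundedQueue is a capacity-limited FIFO with a drop_oldest policy.
type boundedQueue struct {
	items    []string
	capacity int
	dropped  int // evicted-item count, e.g. feeding client reports later
}

// push appends an item, evicting the oldest one when the queue is full.
func (q *boundedQueue) push(item string) {
	if len(q.items) >= q.capacity {
		q.items = q.items[1:] // drop_oldest: evict the front
		q.dropped++
	}
	q.items = append(q.items, item)
}

func main() {
	q := &boundedQueue{capacity: 2}
	q.push("a")
	q.push("b")
	q.push("c") // full: evicts "a"
	fmt.Println(q.items, q.dropped)
}
```

A `drop_newest` variant would instead return early from `push` when full, leaving `items` untouched.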

@giortzisg giortzisg force-pushed the feat/telemetry-buffers branch from 02c36ee to 2cad4c9 Compare November 25, 2025 10:34
```go
func (b *Buffer) Flush(timeout time.Duration) {
	scheduler.flush()
	transport.flush(timeout)
}
```
Bug: Flush timeout not respected by scheduler

The Buffer.Flush() method accepts a timeout parameter but only passes it to transport.flush(), not to scheduler.flush(). If the scheduler takes significant time processing all buffers and building envelopes, the transport may not have sufficient time remaining to complete within the specified timeout, causing the overall flush operation to exceed the intended duration.



s.mu.Unlock()
s.processNextBatch()
}

Bug: Scheduler never exits on context cancellation

The scheduler's run() method has an infinite outer loop that never checks if the context is cancelled. While the inner loop at line 267 checks s.ctx.Err() == nil, when the context is cancelled, the inner loop exits but the outer loop continues indefinitely. This prevents the scheduler from shutting down properly, causing goroutine leaks and blocking graceful shutdown. The outer loop needs a break condition or context check to exit when s.ctx.Err() != nil.


@giortzisg giortzisg merged commit 14a50f9 into master Nov 25, 2025
14 checks passed
@giortzisg giortzisg deleted the feat/telemetry-buffers branch November 25, 2025 11:08