# feat(develop-docs): add Backend Telemetry Buffer #15381
## Conversation
> - Flush(timeout).
> - Close(timeout).
Changed back from the Mermaid chart to a plain ASCII chart when I saw the cut-off, so it should be OK now.
**sfanahata** left a comment:
Just the diagram is a little off. Everything else looks great. 🎸
> #### How the Buffer works
>
> - **Smart batching**: Logs are batched into single requests; errors, transactions, and monitors are sent immediately.
> - **Pre-send rate limiting**: The scheduler checks rate limits before dispatching, avoiding unnecessary requests while keeping items buffered.
We should clarify what "while keeping items buffered" means. Today we drop data if we are rate-limited; is the plan to keep it in the buffer?
We should drop, to avoid filling up the buffer.
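
For illustration, a minimal sketch of that drop-on-rate-limit behavior in Go; `RateLimiter`, `Store`, `Transport`, and `dispatch` are hypothetical names, not taken from the PR:

```go
package buffer

import "time"

// Category identifies a telemetry type for rate-limit bookkeeping.
type Category string

// RateLimiter reports whether a category is currently rate limited.
type RateLimiter interface {
	IsLimited(c Category, now time.Time) bool
}

// Transport delivers items to Sentry.
type Transport interface{ Send(item any) }

// Store is a minimal per-category item queue.
type Store struct {
	category Category
	items    []any
}

// Pop removes and returns the oldest buffered item, if any.
func (st *Store) Pop() (any, bool) {
	if len(st.items) == 0 {
		return nil, false
	}
	item := st.items[0]
	st.items = st.items[1:]
	return item, true
}

type Scheduler struct {
	limiter   RateLimiter
	transport Transport
	dropped   map[Category]int // dropped counts feed client reports
}

// dispatch drains one store, dropping rate-limited items so the buffer
// never fills up with data that cannot currently be delivered.
func (s *Scheduler) dispatch(st *Store) {
	for {
		item, ok := st.Pop()
		if !ok {
			return // store drained
		}
		if s.limiter.IsLimited(st.category, time.Now()) {
			if s.dropped == nil {
				s.dropped = make(map[Category]int)
			}
			s.dropped[st.category]++ // drop, do not re-buffer
			continue
		}
		s.transport.Send(item)
	}
}
```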
> ### Configuration
>
> #### Buffer Options
>
> - **Capacity**: 100 items for errors and check-ins, 10*BATCH_SIZE for logs, 1000 for transactions.
Do we have multiple buffers, one per telemetry type, or just one that changes its capacity based on which telemetry the SDK is configured to emit?
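
One possible reading of the capacity table, sketched for illustration: a separate bounded store per telemetry category, each sized independently. All names below are illustrative assumptions, not the PR's:

```go
package buffer

const batchSize = 100 // the log batch size from the excerpt

type Category string

const (
	CategoryError       Category = "error"
	CategoryCheckIn     Category = "check_in"
	CategoryLog         Category = "log"
	CategoryTransaction Category = "transaction"
)

// defaultCapacity mirrors the numbers in the excerpt above.
var defaultCapacity = map[Category]int{
	CategoryError:       100,
	CategoryCheckIn:     100,
	CategoryLog:         10 * batchSize, // 10*BATCH_SIZE
	CategoryTransaction: 1000,
}

type Store struct {
	capacity int
	items    []any
}

// NewStores builds one bounded store per category, which keeps capacity
// a per-type property instead of a single global knob.
func NewStores() map[Category]*Store {
	stores := make(map[Category]*Store, len(defaultCapacity))
	for cat, n := range defaultCapacity {
		stores[cat] = &Store{capacity: n}
	}
	return stores
}
```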
```go
func (s *Scheduler) flush() {
	// should process all store buffers and send to transport
```
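
The diff fragment above cuts off after the comment. A minimal sketch of what that `flush` body could expand to, assuming illustrative `Store`, `Envelope`, and `Transport` types (not the PR's actual ones):

```go
package buffer

// Envelope groups items for a single transport request.
type Envelope struct{ Items []any }

type Transport interface{ SendEnvelope(Envelope) }

type Store struct{ items []any }

// Drain removes and returns everything currently buffered.
func (st *Store) Drain() []any {
	out := st.items
	st.items = nil
	return out
}

type Scheduler struct {
	stores    []*Store
	transport Transport
}

// flush walks every store, drains it, and hands the items to the
// transport. Batched categories (e.g. logs) go out as one envelope;
// single-item categories would produce one envelope each.
func (s *Scheduler) flush() {
	for _, st := range s.stores {
		items := st.Drain()
		if len(items) == 0 {
			continue
		}
		s.transport.SendEnvelope(Envelope{Items: items})
	}
}
```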
> ### Priorities
>
> - CRITICAL: Error, Feedback.
> - HIGH: Session, CheckIn.
> - MEDIUM: Transaction, ClientReport, Span.
Not sure if this is already set in stone, but I feel logs should be at least as important as spans, thinking about it from a debuggability point of view.
Since logs are batched, I don't think it makes much sense to give them the same priority. Also, the main point of the design was to avoid starvation of the other types by an overabundance of logs.
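
For illustration, a sketch of strict-priority store selection using the tiers from the excerpt, with batched logs placed below spans; `Priority`, `Store`, and `nextStore` are assumed names:

```go
package buffer

type Priority int

const (
	PriorityCritical Priority = iota // Error, Feedback
	PriorityHigh                     // Session, CheckIn
	PriorityMedium                   // Transaction, ClientReport, Span
	PriorityLow                      // Log (batched separately, so it
	                                 // cannot flood the scheduler)
)

type Store struct {
	priority Priority
	items    []any
}

// nextStore picks the non-empty store with the highest priority;
// lower Priority values rank higher.
func nextStore(stores []*Store) *Store {
	var best *Store
	for _, st := range stores {
		if len(st.items) == 0 {
			continue
		}
		if best == nil || st.priority < best.priority {
			best = st
		}
	}
	return best
}
```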
> - **Bounded capacity**: Defaults to 100 items for errors, logs, and monitors; 1000 for transactions. This prevents unbounded memory growth regardless of telemetry volume and backpressure handling.
> - **Overflow policies**:
>   - `drop_oldest` (default): Evicts the oldest item when the buffer is full, making room for new data.
>   - `drop_newest`: Rejects incoming items when full, preserving what's already queued.
Maybe this one can be made optional, as I don't foresee a need to use it.
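
A sketch of how the two policies could differ at the insertion point; `Store`, `Add`, and `OverflowPolicy` are hypothetical names:

```go
package buffer

type OverflowPolicy int

const (
	DropOldest OverflowPolicy = iota // default: evict the head
	DropNewest                       // reject the incoming item
)

type Store struct {
	capacity int
	policy   OverflowPolicy
	items    []any
	dropped  int // tracked for client reports
}

// Add enqueues an item, applying the overflow policy when full.
// It reports whether the new item was accepted.
func (st *Store) Add(item any) bool {
	if len(st.items) < st.capacity {
		st.items = append(st.items, item)
		return true
	}
	st.dropped++
	if st.policy == DropNewest {
		return false // reject incoming, preserve what is queued
	}
	st.items = append(st.items[1:], item) // evict oldest, accept new
	return true
}
```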
> - `batchSize`: Number of items to combine into a single batch (1 for errors, transactions, and monitors; 100 for logs).
> - `timeout`: Maximum time to wait before sending a partial batch (5 seconds for logs).
> - **Bucketed Storage Support**: The storage interface should satisfy both bucketed and single-item implementations, allowing spans to be sent per trace ID (required for Span First).
> - **Observability**: Each store tracks dropped item counts for client reports.
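
As a sketch of the `batchSize`/`timeout` interaction described in the first two bullets: a batch goes out when it reaches `batchSize` items, or when `timeout` elapses with a partial batch. The `Batcher` type and its methods are assumptions for illustration:

```go
package buffer

import (
	"sync"
	"time"
)

type Batcher struct {
	mu        sync.Mutex
	batchSize int           // e.g. 100 for logs, 1 for errors
	timeout   time.Duration // e.g. 5s for logs
	items     []any
	timer     *time.Timer
	emit      func([]any) // hands a finished batch to the scheduler
}

// Add buffers an item, emitting a batch when it is full and arming a
// timer on the first item so partial batches go out after timeout.
func (b *Batcher) Add(item any) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.items = append(b.items, item)
	if len(b.items) == 1 && b.batchSize > 1 {
		b.timer = time.AfterFunc(b.timeout, b.onTimeout)
	}
	if len(b.items) >= b.batchSize {
		b.flushLocked()
	}
}

func (b *Batcher) onTimeout() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.flushLocked()
}

// flushLocked emits whatever is buffered; the caller holds the lock.
func (b *Batcher) flushLocked() {
	if b.timer != nil {
		b.timer.Stop()
		b.timer = nil
	}
	if len(b.items) == 0 {
		return
	}
	batch := b.items
	b.items = nil
	b.emit(batch)
}
```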
Does it just track them and then some time later create the client report or does it generate the client report immediately on each dropped envelope/item?
Actually, this is a nice catch. The correct way to do this would be to add to the ClientReports buffer and then send the reports when the scheduler selects that priority. I am planning to update both this page and the client reports page when we implement client reports in Go, so I will remove the line for now.
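
A sketch of that agreed flow, where drops are only counted and the report is materialized when the scheduler picks the client-report priority; all names here are hypothetical:

```go
package buffer

// DiscardedEvent mirrors the shape of a client-report entry.
type DiscardedEvent struct {
	Reason   string // e.g. "queue_overflow", "ratelimit_backoff"
	Category string // e.g. "error", "log_item"
	Quantity int
}

// ClientReportStore accumulates drop counts instead of generating a
// report per dropped envelope/item.
type ClientReportStore struct {
	counts map[[2]string]int // keyed by (reason, category)
}

// RecordDropped bumps the counter for a (reason, category) pair.
func (c *ClientReportStore) RecordDropped(reason, category string, n int) {
	if c.counts == nil {
		c.counts = make(map[[2]string]int)
	}
	c.counts[[2]string{reason, category}] += n
}

// TakeReport snapshots and resets the counts; the scheduler would call
// this when the ClientReport priority is selected, not on every drop.
func (c *ClientReportStore) TakeReport() []DiscardedEvent {
	events := make([]DiscardedEvent, 0, len(c.counts))
	for k, n := range c.counts {
		events = append(events, DiscardedEvent{Reason: k[0], Category: k[1], Quantity: n})
	}
	c.counts = nil
	return events
}
```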
```go
func (b *Buffer) Flush(timeout time.Duration) {
	scheduler.flush()
	transport.flush(timeout)
}
```
**Bug: Flush timeout not respected by scheduler**

The `Buffer.Flush()` method accepts a timeout parameter but only passes it to `transport.flush()`, not to `scheduler.flush()`. If the scheduler takes significant time processing all buffers and building envelopes, the transport may not have sufficient time remaining to complete within the specified timeout, causing the overall flush operation to exceed the intended duration.
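
One way to address this, sketched under the assumption that the scheduler can accept a context (the `flushCtx` method is invented for illustration): derive a shared deadline and give the transport only the remaining budget.

```go
package buffer

import (
	"context"
	"time"
)

// Buffer's collaborators are stubbed as anonymous interfaces; the
// method names are assumptions, not the PR's API.
type Buffer struct {
	scheduler interface{ flushCtx(ctx context.Context) }
	transport interface{ flush(timeout time.Duration) }
}

// Flush carves one deadline across both phases, so the scheduler and
// transport together cannot exceed the caller's timeout.
func (b *Buffer) Flush(timeout time.Duration) {
	deadline := time.Now().Add(timeout)
	ctx, cancel := context.WithDeadline(context.Background(), deadline)
	defer cancel()

	b.scheduler.flushCtx(ctx) // scheduler stops once the deadline passes

	if remaining := time.Until(deadline); remaining > 0 {
		b.transport.flush(remaining) // transport gets the leftover budget
	}
}
```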
```go
	s.mu.Unlock()
	s.processNextBatch()
}
```
**Bug: Scheduler never exits on context cancellation**

The scheduler's `run()` method has an infinite outer loop that never checks whether the context is cancelled. While the inner loop (line 267 of the reviewed file) checks `s.ctx.Err() == nil`, when the context is cancelled the inner loop exits but the outer loop continues indefinitely. This prevents the scheduler from shutting down properly, causing goroutine leaks and blocking graceful shutdown. The outer loop needs a break condition or context check to exit when `s.ctx.Err() != nil`.
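
A sketch of that fix: select on `ctx.Done()` in the outer loop so cancellation terminates the goroutine. The `wake` channel and the `processNextBatch` body are stand-ins for the PR's internals:

```go
package buffer

import "context"

type Scheduler struct {
	ctx  context.Context
	wake chan struct{} // signalled when new items are enqueued
}

// processNextBatch sends one batch and reports whether work remains.
func (s *Scheduler) processNextBatch() bool {
	// ... build an envelope from the stores and hand it to the transport.
	return false
}

func (s *Scheduler) run() {
	for {
		select {
		case <-s.ctx.Done():
			return // graceful shutdown: no goroutine leak
		case <-s.wake:
		}
		// Drain until the stores are empty or the context is cancelled.
		for s.ctx.Err() == nil && s.processNextBatch() {
		}
	}
}
```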

**Describe your PR**

Add the docs for the backend telemetry buffer.