Summary
This tracks the end-to-end work to expose task execution exception context from the worker to taskbroker logs and Sentry.
Today, when a task fails, the worker may know the exception details, but taskbroker logs often do not include enough structured context to understand the failure without digging into worker logs or reproducing the task manually.
This work adds a bounded `TaskError` envelope to the status update flow and surfaces structured exception context consistently across:
- worker logs
- taskbroker logs
- Sentry events
Goal
When a task fails or retries because of a caught exception, we want to be able to answer all of the following quickly:
- which task failed
- which activation failed
- which namespace it belonged to
- what the top-level exception was
- what the deepest/root cause was
- what the worker observed at execution time
- what taskbroker received and logged
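As a sketch of the data needed to answer these questions, the error envelope could look roughly like the Python dataclass below. The field names and size bounds are illustrative assumptions, not the released sentry-protos definition:

```python
import traceback as tb
from dataclasses import dataclass

# Hypothetical bounds; the real limits belong in the sentry-protos definition.
MAX_MESSAGE_LEN = 1024
MAX_TRACEBACK_LEN = 8 * 1024


@dataclass
class TaskError:
    exception_type: str
    exception_message: str
    root_cause_type: str
    root_cause_message: str
    traceback: str

    @classmethod
    def from_exception(cls, exc: BaseException) -> "TaskError":
        # Walk __cause__/__context__ to the deepest root cause,
        # guarding against cycles in the exception chain.
        root, seen = exc, set()
        while True:
            nxt = root.__cause__ or root.__context__
            if nxt is None or id(nxt) in seen:
                break
            seen.add(id(root))
            root = nxt
        text = "".join(tb.format_exception(type(exc), exc, exc.__traceback__))
        return cls(
            exception_type=type(exc).__name__,
            exception_message=str(exc)[:MAX_MESSAGE_LEN],
            root_cause_type=type(root).__name__,
            root_cause_message=str(root)[:MAX_MESSAGE_LEN],
            # Keep the tail of the traceback; the raise site lives there.
            traceback=text[-MAX_TRACEBACK_LEN:],
        )
```

Bounding both the messages and the traceback keeps the gRPC status payload small regardless of how noisy the original exception was.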
Scope
This is being delivered in 3 PRs:
- getsentry/sentry-protos
  - add `TaskError`
  - add optional `error` field to `SetTaskStatusRequest`
- getsentry/taskbroker
  - taskbroker server logs structured failure context from `SetTaskStatusRequest.error`
  - python taskbroker client exposes `error_hook`
  - worker captures exceptions and passes `TaskError` into `SetTaskStatusRequest`
- getsentry/sentry
  - add sentry-side `TaskErrorCaptureHook`
  - wire it into the taskworker runtime
  - add tests
  - consume the released `taskbroker-client` version that includes `error_hook`
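The `error_hook` contract can be as simple as a callable the client invokes before sending a failure status. The sketch below is a hypothetical illustration of that wiring; the class and parameter names are stand-ins, not the taskbroker-client API:

```python
from typing import Callable, List, Optional


# Hypothetical types standing in for the generated protobuf classes.
class TaskError:
    def __init__(self, exception_type: str, exception_message: str):
        self.exception_type = exception_type
        self.exception_message = exception_message


class SetTaskStatusRequest:
    def __init__(self, task_id: str, status: str, error: Optional[TaskError] = None):
        self.task_id = task_id
        self.status = status
        self.error = error


ErrorHook = Callable[[str, TaskError], None]


class Client:
    """Minimal stand-in for the python taskbroker client."""

    def __init__(self, error_hook: Optional[ErrorHook] = None):
        self.error_hook = error_hook
        self.sent: List[SetTaskStatusRequest] = []

    def set_task_status(
        self, task_id: str, status: str, exc: Optional[BaseException] = None
    ) -> None:
        error = None
        if exc is not None:
            error = TaskError(type(exc).__name__, str(exc))
            if self.error_hook is not None:
                # e.g. a sentry-side TaskErrorCaptureHook capturing the
                # exception context before the status update is sent
                self.error_hook(task_id, error)
        self.sent.append(SetTaskStatusRequest(task_id, status, error))
```

Keeping the hook a plain callable means the Sentry-side capture logic lives entirely in the sentry repo, and the client stays dependency-free.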
Desired behavior
Worker side
Worker logs should emit a single-line structured summary on task failures, suitable for ingestion by Vector/log pipelines, for example:
```
taskworker.task_failed task_id="..." taskname="..." namespace="..." exception_type="..." exception_message="..." root_cause_type="..." root_cause_message="..."
```
Worker should still send the bounded traceback in the `TaskError` envelope and capture the original exception to Sentry.
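A minimal sketch of building that single-line summary follows; the helper names are illustrative, and the real worker would emit the result through its existing logger:

```python
def _root_cause(exc: BaseException) -> BaseException:
    # Follow __cause__/__context__ to the deepest chained exception.
    seen = set()
    while True:
        nxt = exc.__cause__ or exc.__context__
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(exc))
        exc = nxt


def task_failed_line(
    task_id: str, taskname: str, namespace: str, exc: BaseException
) -> str:
    """Render one logfmt-style line suitable for Vector/log pipelines."""
    root = _root_cause(exc)
    return (
        f'taskworker.task_failed task_id="{task_id}" taskname="{taskname}" '
        f'namespace="{namespace}" exception_type="{type(exc).__name__}" '
        f'exception_message="{exc}" root_cause_type="{type(root).__name__}" '
        f'root_cause_message="{root}"'
    )


# The worker would emit it as a single log record, e.g.:
# logger.error(task_failed_line(task_id, taskname, namespace, exc))
```

Because the line never embeds the traceback, the stdout log stays single-line while the full (bounded) traceback travels only in the `TaskError` envelope.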
Broker side
Taskbroker should log structured context when it receives a failure/retry status update with an attached error envelope, for example:
```
task reported failure task_id=... taskname=... namespace=... status=Failure attempts=... exception_type="..." exception_message="..."
```
Non-goals
- no new high-cardinality metric labels
- no persistence of failure reasons in the inflight store
- no DLQ wire-format change
- no synchronous Kafka produce in the gRPC status path
- no changes to `@instrumented_task` / `@retry`
- no raw multiline traceback in worker stdout logs
Rollout order
1. Merge and release `sentry-protos`
2. Merge and release `taskbroker` / `taskbroker-client`
3. Update Sentry to consume the released client version
Validation
- unit tests in all repos
- local smoke test with a deliberately crashing task
- verify:
  - worker emits structured `taskworker.task_failed`
  - broker emits structured `task reported failure`
  - task IDs match across worker/broker
  - Sentry event is captured with task tags
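The task-ID cross-check can be scripted against the two example lines earlier in this document. The parser below is a minimal sketch that handles only the `key=value` / `key="value"` shapes shown, not general logfmt:

```python
import re


def parse_logfmt(line: str) -> dict:
    # Minimal parser for key="value" and key=value pairs, enough for
    # the example worker/broker log lines in this document.
    out = {}
    for key, quoted, bare in re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', line):
        out[key] = quoted or bare
    return out


# Hypothetical captured lines from a local smoke test:
worker_line = (
    'taskworker.task_failed task_id="abc123" taskname="emails.send" '
    'namespace="emails" exception_type="RuntimeError" exception_message="boom"'
)
broker_line = (
    "task reported failure task_id=abc123 taskname=emails.send "
    'namespace=emails status=Failure attempts=2 exception_type="RuntimeError"'
)
```

Comparing `parse_logfmt(worker_line)["task_id"]` with the broker side turns the "task IDs match" check into one assertion instead of an eyeball diff.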
Tracking
- `taskbroker-client` with `error_hook` support