
Expose taskworker exception context end-to-end in taskbroker and Sentry #114010

@s-starostin

Description

Summary

This tracks the end-to-end work to expose task execution exception context from the worker to taskbroker logs and Sentry.

Today, when a task fails, the worker has the exception details, but taskbroker logs often lack enough structured context to understand the failure without digging into worker logs or reproducing the task manually.

This work adds a bounded TaskError envelope to the status update flow and surfaces structured exception context consistently across:

  • worker logs
  • taskbroker logs
  • Sentry events
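
A minimal Python sketch (illustrative only) of what the bounded TaskError envelope could carry; the real message is defined in sentry-protos, and the field names and size limit below are assumptions:

    from dataclasses import dataclass

    MAX_TRACEBACK_BYTES = 16 * 1024  # assumed bound; the real limit lives with the proto definition

    @dataclass
    class TaskErrorSketch:
        """Illustrative stand-in for the TaskError message attached to SetTaskStatusRequest."""
        exception_type: str       # e.g. "RuntimeError"
        exception_message: str
        root_cause_type: str      # deepest exception in the chain, see Goal below
        root_cause_message: str
        traceback: str            # bounded so a pathological error cannot bloat the status update

        def bounded(self) -> "TaskErrorSketch":
            # Keep the tail of the traceback, which usually contains the most useful frames.
            if len(self.traceback) > MAX_TRACEBACK_BYTES:
                self.traceback = self.traceback[-MAX_TRACEBACK_BYTES:]
            return self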

Goal

When a task fails or retries because of a caught exception, we want to be able to answer all of the following quickly:

  • which task failed
  • which activation failed
  • which namespace it belonged to
  • what the top-level exception was
  • what the deepest/root cause was
  • what the worker observed at execution time
  • what taskbroker received and logged

Scope

This is being delivered in three PRs:

  1. getsentry/sentry-protos

    • add TaskError
    • add optional error field to SetTaskStatusRequest
  2. getsentry/taskbroker

    • taskbroker server logs structured failure context from SetTaskStatusRequest.error
    • python taskbroker client exposes error_hook
    • worker captures exceptions and passes TaskError into SetTaskStatusRequest
  3. getsentry/sentry

    • add sentry-side TaskErrorCaptureHook (see the sketch after this list)
    • wire it into taskworker runtime
    • add tests
    • consume the released taskbroker-client version that includes error_hook
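
A rough Python sketch of how the client-side pieces could fit together; the error_hook signature and TaskErrorCaptureHook wiring shown here are assumptions for illustration, not the released taskbroker-client API:

    import sentry_sdk

    def task_error_capture_hook(activation, exc: BaseException) -> None:
        # Hypothetical sentry-side hook: report the original exception with task tags
        # so the Sentry event can be matched to the worker/broker log lines.
        with sentry_sdk.push_scope() as scope:
            scope.set_tag("taskworker.taskname", activation.taskname)    # attribute names assumed
            scope.set_tag("taskworker.namespace", activation.namespace)  # attribute names assumed
            sentry_sdk.capture_exception(exc)

    # Hypothetical wiring: the taskbroker client invokes the hook with the failed
    # activation and exception before attaching TaskError to SetTaskStatusRequest.
    # worker = TaskWorker(..., error_hook=task_error_capture_hook)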

Desired behavior

Worker side

The worker should emit a single-line structured log summary on task failure, suitable for ingestion by Vector/log pipelines, for example:

taskworker.task_failed task_id="..." taskname="..." namespace="..." exception_type="..." exception_message="..." root_cause_type="..." root_cause_message="..."

The worker should still send the bounded traceback in the TaskError envelope and capture the original exception to Sentry.
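
A sketch of the worker-side behavior, assuming the root cause is found by walking Python's exception chain; the field names match the example line above, while the helper names are hypothetical:

    import logging

    logger = logging.getLogger("taskworker")

    def root_cause(exc: BaseException) -> BaseException:
        # Follow explicit (__cause__) and implicit (__context__) chaining to the deepest exception.
        seen = {id(exc)}
        while True:
            nxt = exc.__cause__ or exc.__context__
            if nxt is None or id(nxt) in seen:
                return exc
            seen.add(id(nxt))
            exc = nxt

    def log_task_failed(task_id: str, taskname: str, namespace: str, exc: BaseException) -> None:
        root = root_cause(exc)
        # One line, key=value fields only; the raw multiline traceback stays out of stdout (see Non-goals).
        logger.error(
            "taskworker.task_failed",
            extra={
                "task_id": task_id,
                "taskname": taskname,
                "namespace": namespace,
                "exception_type": type(exc).__name__,
                "exception_message": str(exc),
                "root_cause_type": type(root).__name__,
                "root_cause_message": str(root),
            },
        )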

Broker side

Taskbroker should log structured context when it receives a failure/retry status update with an attached error envelope, for example:

task reported failure task_id=... taskname=... namespace=... status=Failure attempts=... exception_type="..." exception_message="..."

Non-goals

  • no new high-cardinality metric labels
  • no persistence of failure reasons in the inflight store
  • no DLQ wire-format change
  • no synchronous Kafka produce in the gRPC status path
  • no changes to @instrumented_task / @retry
  • no raw multiline traceback in worker stdout logs

Rollout order

  1. Merge and release sentry-protos
  2. Merge and release taskbroker / taskbroker-client
  3. Update Sentry to consume the released client version

Validation

  • unit tests in all repos
  • local smoke test with a deliberately crashing task (see the sketch after this list)
  • verify:
    • worker emits structured taskworker.task_failed
    • broker emits structured task reported failure
    • task IDs match across worker/broker
    • Sentry event is captured with task tags
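
One way to drive the local smoke test; the task body below is illustrative and would need to be registered in an existing taskworker namespace before triggering it locally:

    def smoke_test_crash() -> None:
        # Deliberately chain two exceptions so the root-cause fields are exercised:
        # exception_type should come out as RuntimeError, root_cause_type as ValueError.
        try:
            int("not-a-number")
        except ValueError as err:
            raise RuntimeError("taskworker smoke test failure") from err

Triggering it once should produce one taskworker.task_failed line from the worker, one task reported failure line from the broker with the same task_id, and one Sentry event tagged with the task name.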
