observability: add structured diagnostics to CeleryExecutor sync gap … #67086
Open
prince8273 wants to merge 3 commits into
…events

CeleryExecutor.update_all_workload_states() and update_task_state() had two silent failure modes with no structured logging or metrics:

- When a tracked Celery task returns a falsy/None state from the broker (reachable on AMQP and non-KV backends where BulkStateFetcher does not guarantee PENDING conversion), the task was silently skipped with no log line and no metric.
- When update_task_state() hit the unexpected-state branch, the existing log.info emitted only the TaskInstanceKey and the raw state string, with no worker hostname, no Celery info payload, and no metric.

Changes (additive only, no behavior change):

- Add else-branch in update_all_workload_states() loop: warning log with celery_task_id + key, Stats.incr(celery.task_not_found_in_broker)
- Replace single-line log.info in update_task_state() unexpected-state branch: warning log with key + celery_state + worker hostname (extracted from info dict when available) + raw info payload, Stats.incr(celery.task_unexpected_state)
- Level bump log.info -> log.warning for unexpected-state: this branch represents a state the executor has no handler for

No new imports required -- Stats already present via airflow.providers.common.compat.sdk.

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
- celery.task_not_found_in_broker
- celery.task_unexpected_state

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
Summary
Two silent failure modes in CeleryExecutor with no structured logging or metrics at the executor layer.
Gap 1 — task absent from broker state
When BulkStateFetcher.get_many() returns a falsy/None state, the executor silently skips the task inside the if state: guard. Reachable on AMQP and non-KV backends when a worker dies mid-task.
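To make the shape of this gap concrete, the loop and the proposed else-branch can be sketched roughly as below. This is a standalone illustration, not the executor's actual code: `FakeStats` is a hypothetical stand-in for Airflow's `Stats`, the function signature and the dict-based inputs are assumptions, and only the metric name and the logged fields (`celery_task_id`, `key`) come from the PR description.

```python
import logging
from collections import Counter

log = logging.getLogger("celery_executor_sketch")


class FakeStats:
    """Hypothetical stand-in for Airflow's Stats; it only counts increments."""

    counters = Counter()

    @classmethod
    def incr(cls, name, count=1):
        cls.counters[name] += count


def update_all_workload_states(tasks, state_and_info_by_id, update_task_state, stats=FakeStats):
    """Sketch of the sync loop.

    tasks: maps a task key to its celery_task_id (assumed shape).
    state_and_info_by_id: maps celery_task_id to a (state, info) tuple (assumed shape).
    """
    for key, celery_task_id in tasks.items():
        state, info = state_and_info_by_id.get(celery_task_id, (None, None))
        if state:
            update_task_state(key, state, info)
        else:
            # Previously this case fell through silently; now it is logged and counted.
            log.warning(
                "No state returned by broker for celery_task_id=%s key=%s",
                celery_task_id,
                key,
            )
            stats.incr("celery.task_not_found_in_broker")
```

With one task that has a state and one that does not, only the first reaches `update_task_state`, and the counter for the second is incremented once.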
Gap 2 — unexpected state with no context
The unexpected-state branch emitted a single log.info with only TaskInstanceKey and raw state: no worker hostname, no Celery info payload, no metric.
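A minimal sketch of the richer unexpected-state logging follows. It assumes the worker hostname sits under a `"hostname"` key when `info` is a dict; the PR text only says the hostname is extracted from the info dict when available, so that key name, the helper names, and the module-level `counters` object (a stand-in for `Stats.incr`) are all illustrative.

```python
import logging
from collections import Counter

log = logging.getLogger("celery_executor_sketch")
counters = Counter()  # hypothetical stand-in for Stats.incr


def worker_hostname(info):
    """Return the hostname from info if it is a dict carrying one, else None.

    The "hostname" key is an assumption; info may also be an exception or None.
    """
    if isinstance(info, dict):
        return info.get("hostname")
    return None


def report_unexpected_state(key, state, info):
    """Log full context for a state the executor has no handler for, and count it."""
    hostname = worker_hostname(info)
    log.warning(
        "Unexpected state for key=%s celery_state=%s hostname=%s info=%r",
        key,
        state,
        hostname,
        info,
    )
    counters["celery.task_unexpected_state"] += 1
    return hostname
```

Guarding on `isinstance(info, dict)` keeps the log call from raising when the broker hands back an exception object or nothing at all in place of a metadata dict.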
Changes
Additive only — no behavior change, no new imports.
- update_all_workload_states(): else-branch on falsy state: log.warning with celery_task_id + key, Stats.incr("celery.task_not_found_in_broker")
- update_task_state(): replace log.info in unexpected-state branch: log.warning with key, celery_state, worker hostname, raw info, Stats.incr("celery.task_unexpected_state")
- Stats already imported via airflow.providers.common.compat.sdk.