Cascade Delete between Trigger and TaskInstance #32403
Replies: 1 comment 2 replies
I am not sure, but I think this is likely connected to a few issues we recently encountered where triggers keep running and yielding events even after their tasks have finished, either because they run in endless loops or because they yield events faster than they can be consumed, needlessly building up the queue of events to handle. Example related issues:
I am not sure if I am interpreting it correctly - but maybe others who were involved in solving it and implementing deferrables can help diagnose this. It might simply be a badly written trigger, or maybe something we could prevent from happening in the trigger event loop. I do not know all the details, but it feels like a bigger part of our deferrable approach that we should fix. Calling for those who are much more knowledgeable about it than I am: @andrewgodwin @hussein-awala @uranusjr @pankajastro @syedahsn @vandonr-amz
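To illustrate the "yielding events faster than they can be consumed" failure mode, here is a minimal asyncio sketch. This is not Airflow's actual triggerer loop; it only shows how an unbounded event queue lets a fast producer pile up events needlessly, while a bounded queue makes `put` block and applies backpressure:

```python
import asyncio


async def trigger(queue: asyncio.Queue, n: int) -> None:
    # A trigger-like producer. With a bounded queue, `put` blocks when
    # the queue is full, so the producer cannot outrun the consumer.
    for i in range(n):
        await queue.put({"event": i})
    await queue.put(None)  # sentinel: no more events


async def handler(queue: asyncio.Queue, handled: list) -> None:
    # The consumer drains events one by one.
    while (event := await queue.get()) is not None:
        await asyncio.sleep(0)  # simulate per-event handling cost
        handled.append(event)


async def main() -> int:
    queue: asyncio.Queue = asyncio.Queue(maxsize=5)  # bounded -> backpressure
    handled: list = []
    await asyncio.gather(trigger(queue, 50), handler(queue, handled))
    return len(handled)


handled_count = asyncio.run(main())
```

With `maxsize=5` the queue depth never exceeds five in-flight events; with the default unbounded `asyncio.Queue()` the producer can enqueue all 50 before the consumer handles the first one.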
Hi everybody,
we encountered some very strange behavior in our Airflow instance related to deferred tasks and their associated triggers.
What happened
We use deferrable HTTP operators to query some APIs within our DAGs. As these HTTP calls might take some time to execute, the way to go was deferrable tasks with a corresponding trigger that fires an event on success or failure.
As all the called APIs are under our own control, we have a good overview of what is actually happening.
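For readers unfamiliar with the pattern: a trigger is an async generator that the triggerer runs until it yields an event, at which point the deferred task resumes. The real base classes live in `airflow.triggers.base` (`BaseTrigger`, `TriggerEvent`); the following is a self-contained toy version, with `poll` as a hypothetical stand-in for the actual HTTP call:

```python
import asyncio


class TriggerEvent:
    """Toy stand-in for airflow.triggers.base.TriggerEvent."""

    def __init__(self, payload):
        self.payload = payload


class HttpStatusTrigger:
    """Toy trigger that polls a fake endpoint until it settles."""

    def __init__(self, poll):
        self.poll = poll  # hypothetical callable returning the API status

    async def run(self):
        # A well-behaved trigger yields exactly one event and returns,
        # so the triggerer can clean it up afterwards.
        while True:
            status = self.poll()
            if status in ("success", "failure"):
                yield TriggerEvent(status)
                return
            await asyncio.sleep(0)  # still pending: yield control, poll again


async def wait_for_event(trigger):
    async for event in trigger.run():
        return event.payload


responses = iter(["pending", "pending", "success"])
outcome = asyncio.run(wait_for_event(HttpStatusTrigger(lambda: next(responses))))
```

The key property is that `run` terminates after yielding its final event; a trigger that keeps yielding (or never returns) is exactly the kind that can keep hammering an API.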
One day we realized that, for whatever reason, the same two API endpoints were being called at ~1 req/sec (sometimes even 5 req/sec).
Looking at the Airflow web UI, the corresponding DAG run and its task instances had already finished with success. From that perspective there was no reason for the Airflow triggerer to keep calling the API, neither at this frequency nor at all.
It took some time to figure out the reason for this misbehavior.
The Reason
As soon as we found out which DAG and task were responsible for these API calls, we tried to find corresponding log lines within our triggerer pod.
We were looking for a line like
which we did indeed find, very often, in our triggerer logs. That puzzled us, since most of them had been reported as success (except those that had to wait due to internal API overload).
So we took a look into the database, and what I found was a bit curious to me:
Looking into the task_instance table, which references the trigger table, there were no entries with the corresponding trigger_id set.
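The check we did can be sketched as a query for trigger rows that no task instance references any more. The snippet below uses sqlite with a heavily simplified stand-in schema (these are not the real Airflow table definitions, just enough columns to show the join):

```python
import sqlite3

# Simplified stand-ins for Airflow's `trigger` and `task_instance`
# tables (not the real schema; column set reduced for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trigger (id INTEGER PRIMARY KEY, classpath TEXT);
    CREATE TABLE task_instance (task_id TEXT, trigger_id INTEGER);
    INSERT INTO trigger VALUES (1, 'HttpTrigger'), (2, 'HttpTrigger');
    -- Only trigger 1 is still referenced; trigger 2 is orphaned.
    INSERT INTO task_instance VALUES ('call_api', 1), ('finished_task', NULL);
""")

# Triggers that no task instance references any more: these are the
# rows that can keep firing even though every task has already finished.
orphans = conn.execute("""
    SELECT t.id
    FROM trigger t
    LEFT JOIN task_instance ti ON ti.trigger_id = t.id
    WHERE ti.trigger_id IS NULL
""").fetchall()
```

In our case the equivalent query against the real database surfaced exactly the trigger rows behind the unwanted API calls.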
Temporary resolution
After we deleted these two rows from our trigger table, the API calls stopped immediately, resulting in less load on the API and the triggerer (which for me is a clear indication that I triggered (haha) the right action).
Unclear to me
Improvement proposals
Long story, but I hope one can understand my problem. I opened a discussion only because I cannot reproduce this behavior at the moment.
What do you folks think about this? Am I misinterpreting Airflow behavior or the DB relations? Have you ever encountered something similar?
BR Daniel (and @mkBGD - who was also very helpful on investigation)