Cascade Delete between Trigger and TaskInstance #32403
Replies: 1 comment 2 replies
I am not sure, but I think this is likely connected to a few issues we recently encountered where triggers keep running and yielding events even after their tasks have finished, either because they run in endless loops or because they yield events faster than they can be consumed, needlessly building up the queue of events to handle. Example related issues:
I am not sure if I am interpreting it correctly - but maybe others who were involved in solving it and implementing deferrables can help diagnose this. It might simply be a badly written trigger, or maybe something we could prevent from happening in the trigger event loop. I do not know all the details, but it feels like a bigger part of our deferrable approach that we should fix. Calling for those who are much more knowledgeable about it than I am: @andrewgodwin @hussein-awala @uranusjr @pankajastro @syedahsn @vandonr-amz
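To illustrate the "yielding events faster than they can be consumed" failure mode, here is a minimal asyncio sketch. This is not Airflow's actual triggerer loop; it only shows how an unbounded event queue lets a fast producer pile up events needlessly, while a bounded queue makes `put` block and applies backpressure:

```python
import asyncio


async def trigger(queue: asyncio.Queue, n: int) -> None:
    # A trigger-like producer. With a bounded queue, `put` blocks when
    # the queue is full, so the producer cannot outrun the consumer.
    for i in range(n):
        await queue.put({"event": i})
    await queue.put(None)  # sentinel: no more events


async def handler(queue: asyncio.Queue, handled: list) -> None:
    # The consumer drains events one by one.
    while (event := await queue.get()) is not None:
        await asyncio.sleep(0)  # simulate per-event handling cost
        handled.append(event)


async def main() -> int:
    queue: asyncio.Queue = asyncio.Queue(maxsize=5)  # bounded -> backpressure
    handled: list = []
    await asyncio.gather(trigger(queue, 50), handler(queue, handled))
    return len(handled)


handled_count = asyncio.run(main())
```

With `maxsize=5` the queue depth never exceeds five in-flight events; with the default unbounded `asyncio.Queue()` the producer can enqueue all 50 before the consumer handles the first one.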
Hi everybody,
we encountered some very strange behavior in our Airflow instance related to deferred tasks and their associated triggers.
What happened
We use deferrable HTTP operators to query some APIs within our DAGs. As these HTTP calls might take some time to execute, the way to go was deferrable tasks with a corresponding trigger that fires an event on success or failure.
As all the called APIs are under our own control, we have a good overview of what is actually happening.
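For readers unfamiliar with the pattern: a trigger is an async generator that the triggerer runs until it yields an event, at which point the deferred task resumes. The real base classes live in `airflow.triggers.base` (`BaseTrigger`, `TriggerEvent`); the following is a self-contained toy version, with `poll` as a hypothetical stand-in for the actual HTTP call:

```python
import asyncio


class TriggerEvent:
    """Toy stand-in for airflow.triggers.base.TriggerEvent."""

    def __init__(self, payload):
        self.payload = payload


class HttpStatusTrigger:
    """Toy trigger that polls a fake endpoint until it settles."""

    def __init__(self, poll):
        self.poll = poll  # hypothetical callable returning the API status

    async def run(self):
        # A well-behaved trigger yields exactly one event and returns,
        # so the triggerer can clean it up afterwards.
        while True:
            status = self.poll()
            if status in ("success", "failure"):
                yield TriggerEvent(status)
                return
            await asyncio.sleep(0)  # still pending: yield control, poll again


async def wait_for_event(trigger):
    async for event in trigger.run():
        return event.payload


responses = iter(["pending", "pending", "success"])
outcome = asyncio.run(wait_for_event(HttpStatusTrigger(lambda: next(responses))))
```

The key property is that `run` terminates after yielding its final event; a trigger that keeps yielding (or never returns) is exactly the kind that can keep hammering an API.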
One day we realized that, for whatever reason, the same two API endpoints were being called at ~1 req/sec (sometimes even 5 req/sec).
Looking at the Airflow web UI, the corresponding DAG run and its task instances had already finished with success. From that perspective there was no reason for the Airflow triggerer to keep calling the API, neither at this frequency nor at all.
It took some time to figure out the reason for this misbehavior.
The Reason
As soon as we found out which DAG and task were responsible for these API calls, we tried to find corresponding log lines within our triggerer pod.
We were looking for a line like
which we did indeed find, very often, in our triggerer logs. That puzzled us, since most of them had been reported as success (except those that had to wait due to internal API overload).
So we took a look into the database, and what I found was a bit curious to me:
Looking into the task_instance table, which references the trigger table, there were no entries with the corresponding trigger_id set.
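The check we did can be sketched as a query for trigger rows that no task instance references any more. The snippet below uses sqlite with a heavily simplified stand-in schema (these are not the real Airflow table definitions, just enough columns to show the join):

```python
import sqlite3

# Simplified stand-ins for Airflow's `trigger` and `task_instance`
# tables (not the real schema; column set reduced for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trigger (id INTEGER PRIMARY KEY, classpath TEXT);
    CREATE TABLE task_instance (task_id TEXT, trigger_id INTEGER);
    INSERT INTO trigger VALUES (1, 'HttpTrigger'), (2, 'HttpTrigger');
    -- Only trigger 1 is still referenced; trigger 2 is orphaned.
    INSERT INTO task_instance VALUES ('call_api', 1), ('finished_task', NULL);
""")

# Triggers that no task instance references any more: these are the
# rows that can keep firing even though every task has already finished.
orphans = conn.execute("""
    SELECT t.id
    FROM trigger t
    LEFT JOIN task_instance ti ON ti.trigger_id = t.id
    WHERE ti.trigger_id IS NULL
""").fetchall()
```

In our case the equivalent query against the real database surfaced exactly the trigger rows behind the unwanted API calls.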
Temporary resolution
After we deleted these two rows from our trigger table, the API calls stopped immediately, resulting in less load on the API and the triggerer (which for me is a clear indication that I triggered (haha) the right action).
Unclear to me
Improvement proposals
Long story, but I hope one can understand my problem. I opened a discussion only because I cannot reproduce this behavior at the moment.
What do you folks think about this? Am I misinterpreting Airflow behavior or the DB relations? Have you ever encountered something similar?
BR Daniel (and @mkBGD - who was also very helpful on investigation)