[AIRFLOW-1623] Trigger on_kill method in operators #2975
jgao54 wants to merge 1 commit into apache:master
Conversation
|
@jgao54 I agree, but I'm still not convinced this will work in a distributed setup. |
Force-pushed from f133220 to eb2fa58
|
@milanvdm Thanks for taking a look! I'm not 100% convinced myself either since I haven't run the code with CeleryExecutor :P but I am hopeful that this could potentially solve the issue in the distributed setup as well. The main idea is that the shutdown is detected via the db, so the LocalTaskJob should be able to successfully terminate the running process. Would be great to have someone running CeleryExecutor, or with experience with Celery, to test this out. I'll see what I can do on my end as well. |
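The db-based shutdown detection described above can be sketched as a small polling loop. All names below are hypothetical stand-ins, not Airflow's actual LocalTaskJob code; the point is only that the trigger comes from the database state, not from a signal, which is why it should work the same way under a distributed executor.

```python
import time


class TaskRunnerStub:
    """Hypothetical stand-in for the process actually running the task."""

    def __init__(self):
        self.terminated = False

    def terminate(self):
        self.terminated = True


def supervise(get_db_state, runner, poll_interval=0.0):
    """Poll the TaskInstance state in the db; if an external actor
    (scheduler, UI, another worker) has moved the task out of the
    'running' state, terminate the local process. Sketch only."""
    while True:
        state = get_db_state()
        if state != "running":
            runner.terminate()
            return state
        time.sleep(poll_interval)
```

Because the only shared channel is the database row, the same loop works whether the worker is local or remote.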
|
@jgao54 We have a Celery setup so I will try to find some time for testing this in a distributed context :) |
|
@jgao54 I've set aside some time next week to test this in a distributed setup :) |
|
@milanvdm awesome, thanks! |
|
I found a few spare moments across the day and this change looks good to me. An extra pair of eyes would be useful to double-check, as this was done in moments snatched here and there. Full log below (with the log format adjusted to show the PID just after the timestamp). The interesting lines are the ones showing that pid 1898 is the |
|
Please rebase against master to trigger a new build. The CI should be fixed. |
Force-pushed from db6d916 to f24b18a
|
@jgao54 I tested and it works! :) |
|
Looking forward to your contribution to the SparkSubmitOperator @milanvdm |
|
Hi @Fokko! I rebased a few times, but Travis still isn't happy in some environments. Are you familiar with the issue? |
|
Hi @jgao54, this looks like failing tests on Python 2.7 Postgres, and the same for Python 3. I think you need to fix some tests ;) |
airflow/utils/helpers.py
Outdated
What are the benefits of using the shell to do this? I only see downsides.
@bolkedebruin the alternative could be along the lines of what @milanvdm proposed in #2877. I think that's a more flexible/safe solution, assuming we remove this shell-based code altogether. My only concern is having to do a db migration on TaskInstance, which could be difficult for long-running DBs.
The other alternative would be os.kill(), or, given we already have psutil.Process objects, descendant.terminate() and then descendant.kill().
See also http://psutil.readthedocs.io/en/latest/#terminate-my-children
@ashb ah gotcha. I assume that's what @bolkedebruin meant too. Sure let me take a look.
|
@jgao54 Any updates on this? |
|
Haven't had a chance to revisit this yet, will do sometime this week for sure |
|
@Fokko one unit test is failing in some environments; it doesn't look related, and I can't replicate it locally. |
|
So, please rebase onto master. The Python 3 failures are unrelated; the Python 2.7 one feels related to me: |
|
Let me check if I can figure out what's happening |
|
So I'm able to replicate it on my local machine: |
|
So I suspect that somewhere in the code, the TaskInstance is upserted with the failed state. |
|
This makes sense: db816ee. The test is from two years back: 7c0f837. And the ticket: https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-234. Personally I have the strong suspicion that the query returns more than one result, and therefore fetches a TI with state |
|
When I run the test isolated on Python 3 locally, I get the same error as with Python 2.7 |
|
Cuuurious. I wonder if this might be related to some of the sporadic reports we get of "losing track of tasks"? |
|
If the task gets "loose", i.e. is deleted from the database, the task will be killed. |
|
I can't think of a situation where the |
|
Will rebase, just in case this is no longer an issue. |
|
@Fokko Thanks for looking into this! I traced back the warning message you provided, and definitely agree that the failure is caused by my PR. In the past we only sent SIGTERM to the child processes of Also note that the query you mentioned above should not return more than one row, because the primary key of the TaskInstance is task_id + dag_id + execution_date, so the db guarantees its uniqueness. |
I think moving the setting of the state to failed into the "monitor" process actually sounds like a plus, as there are other cases (such as seg fault/OOM kill). Not likely, but... yay? I haven't looked at the code to see if that makes sense though.
airflow/utils/helpers.py
Outdated
Please add an else with an exception if they don't match
|
I'm still a bit confused about how this test works out. Since we delete the record, I don't suspect that it gets upserted again with the failed state. I must be missing something, since the test seemed to work before. |
@@ -183,87 +183,65 @@ def f(t):
    return s
Please add :)
"""
Kills a list of processes with a given signal, with a configurable timout in seconds
Returns a tuple of (gone, alive) lists of processes based on their state
:param logger: logger
:type logger: logging.Logger
:param processes: process id of the root process
:type processes: list of int
:param sig: The signal to send, either SIGTERM or SIGKILL (Default: SIGKILL)
:type sig: signal
:param timeout: time (seconds) to wait on a process to terminate before
attempting to SIGKILL
:type timeout: int
"""
[x for x in root_process.children(recursive=True) if x.is_running()]

if len(descendant_processes) != 0:
if len(running_descendants) != 0:
Can we make this a > instead of !=?
|
So here's what happens: in the case of the test there are no descendants. After some tests I noticed that the exit handler is not invoked. Maybe this is because we've changed the way of killing the processes. |
|
I have a strong suspicion that the test is not working properly; when I run the individual test on master, I get the same result. I would suggest altering the test. Maybe @bolkedebruin or @mistercrunch has some input on this? |
|
@bolkedebruin @mistercrunch Kind reminder :) |
1 similar comment
|
@bolkedebruin @mistercrunch Kind reminder :) |
That can't happen by definition of @jgao54 The signal handler of a |
logger.debug("There are no descendant processes to kill")
logger.debug("There are no descendant processes to terminate.")

logger.info("Terminating root process PID: {}.".format(root_process.pid))
@jgao54 I don't understand this part. The root_process is the monitoring --local task, which is the parent of the --raw task and its underlying processes. So it is trying to SIGTERM itself here, and eventually SIGKILL itself. That doesn't make sense.
If --local finds the need to terminate its children, it should exit itself gracefully, maybe with a non-zero exit code.
sig=signal.SIGTERM, timeout=timeout)

if running_root or running_descendants:
kill_processes(logger, running_root + running_descendants, sig=signal.SIGKILL)
|
@jgao54 maybe close this one for now, since @bolkedebruin pushed a fix in #3204 |
Make sure you have checked all steps below.
JIRA
Description
Fixing a previous TODO item to also kill the root process whenever the process tree is being killed. By sending a SIGTERM signal to the root process (airflow run --raw ...), the signal handler will catch the signal and complete the necessary cleanups (i.e. the operator's on_kill).
Tests
Commits
My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
Passes
git diff upstream/master -u -- "*.py" | flake8 --diff

@milanvdm @ashb
I think this approach (that @ashb suggested in #2877) is simpler. I've tested it with LocalExecutor and it behaves as expected, so it should at least fix some of the issues folks have been experiencing.