[AIRFLOW-1623] Trigger on_kill method in operators #3204

Closed · wants to merge 1 commit into master

Contributor

bolkedebruin commented Apr 9, 2018

Make sure you have checked all steps below.

JIRA

  • My PR addresses the following Airflow JIRA issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

  • Here are some details about my PR, including screenshots of any UI changes:

on_kill methods were not triggered, because processes
were not being properly terminated. This was due to the fact
that the runners use a shell, which is then replaced by the
child process, whose pid is unknown to Airflow.
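The failure mode can be sketched outside Airflow: if the runner only knows the shell's pid, signalling that pid can miss the real child. Starting the command in its own session with os.setsid and signalling the whole process group reaches every descendant. This is a minimal illustration of the idea, not Airflow's actual runner code:

```python
import os
import signal
import subprocess

# Start the command in its own session/group, so the whole group can be
# signalled even when the shell execs or spawns children whose pids the
# parent never saw.
proc = subprocess.Popen(
    ["bash", "-c", "sleep 1000"],
    preexec_fn=os.setsid,  # child becomes leader of a new session/group
)

# The group id equals the leader's pid, so every descendant can be reached:
pgid = os.getpgid(proc.pid)

# Terminate the entire group; SIGTERM lets handlers (e.g. on_kill) run first.
os.killpg(pgid, signal.SIGTERM)
proc.wait()
```

POSIX only; on Linux/macOS the leader's returncode will reflect the terminating signal.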

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

new and updated tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":

    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

cc: @jgao54 @Fokko @milanvdm

This is an updated and cleaned-up version of @jgao54's version.

codecov-io commented Apr 9, 2018

Codecov Report

Merging #3204 into master will increase coverage by 0.12%.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3204      +/-   ##
==========================================
+ Coverage   74.23%   74.35%   +0.12%     
==========================================
  Files         193      193              
  Lines       14441    14415      -26     
==========================================
- Hits        10720    10718       -2     
+ Misses       3721     3697      -24
Impacted Files Coverage Δ
airflow/task/task_runner/base_task_runner.py 79.31% <ø> (ø) ⬆️
airflow/task/task_runner/bash_task_runner.py 100% <100%> (ø) ⬆️
airflow/utils/helpers.py 62.83% <84.21%> (+8.81%) ⬆️
airflow/jobs.py 82.89% <0%> (-0.18%) ⬇️
airflow/models.py 87.32% <0%> (-0.05%) ⬇️
airflow/utils/log/logging_mixin.py 97.46% <0%> (+2.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44ce9ca...88581a7. Read the comment docs.


Contributor

Fokko commented Apr 10, 2018

@jgao54 @milanvdm thoughts?

[AIRFLOW-1623] Trigger on_kill method in operators
on_kill methods were not triggered, due to processes
not being properly terminated. This was due to the fact
the runners use a shell which is then replaced by the
child pid, which is unknown to Airflow.
Contributor

bolkedebruin commented Apr 10, 2018

Tests pass, error is a bit spurious (and unrelated). https://travis-ci.org/bolkedebruin/airflow/builds/364482267

Contributor

jgao54 commented Apr 10, 2018

taking a look

Contributor

jgao54 commented Apr 10, 2018

@bolkedebruin, thanks for the clean up.

To address your comment on #2975, I believe the root_process of the task_runner is always the --raw task. In base_task_runner.py, I see that the command to spin up the process always has raw set to True:

        self._command = popen_prepend + self._task_instance.command_as_list(
            raw=True,
            pickle_id=local_task_job.pickle_id,
            mark_success=local_task_job.mark_success,
            job_id=local_task_job.id,
            pool=local_task_job.pool,
            cfg_path=cfg_path,
        )

I agree with you that it doesn't make sense to have the root process killing itself, and your PR addresses this with:

if pid == os.getpid():
    raise RuntimeError("I refuse to kill myself")

In addition, I think we should delete the signal handler in the --raw task process. I'm not sure why it's there to begin with, but like you said, we shouldn't let the root process attempt to kill itself.


jgao54 approved these changes Apr 11, 2018

Contributor

bolkedebruin commented Apr 11, 2018

Hi @jgao54

I'm not sure that deleting the signal handler from the --raw task is the right thing to do. The signal handler is there to trigger the on_kill method of the operator; if it were absent, on_kill would no longer run. The BashOperator and the SparkSubmitOperator both make use of this.

On your comment about the "root" process: I don't think you are correct, and your implementation worked by coincidence. If you run "bash -c 'sleep 1000'", you will find 'sleep' in your process list but you won't find 'bash' running anymore. However, the parent pid of 'sleep' still refers to the pid of bash. So your implementation worked for a TaskRunner that used something like bash to execute the command. If you ran "sleep" directly (so without bash), we ourselves would be the parent of the "sleep" process and would thus kill ourselves.

Using setsid makes sure Airflow sees all processes attached to the new session/group regardless. The exception is when they run setsid themselves, which happens for example in the BashOperator; the BashOperator therefore becomes responsible for cleaning up after itself.
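To make the role of that signal handler concrete, here is a hedged sketch: the --raw process installs a SIGTERM handler, and when the group kill arrives, that handler is the only hook that turns the signal into an on_kill call. FakeOperator and handle_sigterm are illustrative stand-ins, not Airflow's real classes:

```python
import os
import signal

class FakeOperator:
    """Illustrative stand-in for an operator such as BashOperator."""
    def __init__(self):
        self.killed = False

    def on_kill(self):
        # A real operator would clean up here, e.g. terminate its child
        # process (BashOperator) or cancel the job (SparkSubmitOperator).
        self.killed = True

task = FakeOperator()

def handle_sigterm(signum, frame):
    # The handler is the bridge from "process group was signalled"
    # to "operator cleanup ran".
    task.on_kill()

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the group kill hitting this process.
os.kill(os.getpid(), signal.SIGTERM)
```

Without the handler installed, the default SIGTERM action would terminate the process before any cleanup runs.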

asfgit closed this in 39b7d7d Apr 11, 2018

Contributor

jgao54 commented Apr 11, 2018

> I'm not sure that deleting the signal handler from the --raw task is the right thing to do.

You are correct, silly of me. It looks like that's the only place the operator's on_kill method is triggered. (I somehow remembered that the TaskRunner would explicitly call operator.on_kill upon termination, but that's not the case at all.)

> However, the parent pid of 'sleep' still refers to the pid of bash.

I believe that when bash -c stops running, sleep 1000's parent pid will get reassigned to the grandparent (or nearest ancestor).

Perhaps there's a bit of misunderstanding about the root process. What I meant by the "root process" is the process attribute of the TaskRunner, which is always the pid of the airflow run --raw process, while the reap_process_group method (executed in the airflow run --local process) takes in that pid as the "root process". So I think the "root process kills itself" scenario is not possible after all (?).

This is mostly me thinking out loud. The setsid makes sense to me and so does this PR, but I'm still a bit lost on the whole picture.

Contributor

bolkedebruin commented Apr 11, 2018

@jgao54

> I believe that when bash -c stops running, sleep 1000's parent pid will get reassigned to the grandparent (or nearest ancestor).

I verified this (on a Mac) and that is not the case. The parent pid is just not there anymore.

> while the reap_process_group method (executed in the airflow run --local process) takes in that pid as the "root process". So I think the "root process kills itself" scenario is not possible after all (?).

reap_process_group does not really care about the root process. It could take any of the pids and get the process group id (PGID) from there. By convention, when you call setsid, the OS (POSIX) sets the PGID to the pid of the process that is the leader of the group. In reap_process_group we only add the parent pid to ensure we can watch all processes in the process group; otherwise we could quit right away, thinking we are done.
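The mechanics described here can be sketched as follows. This is a simplified, hypothetical version of the reaping logic (the real reap_process_group in airflow/utils/helpers.py also waits on the group's members and escalates to SIGKILL on a timeout); it shows that any member pid yields the shared group id, plus a guard against signalling our own group, mirroring the PR's "I refuse to kill myself" check:

```python
import os
import signal
import subprocess

def reap_process_group(pid, sig=signal.SIGTERM):
    """Signal every process in pid's group; pid may be any group member."""
    pgid = os.getpgid(pid)  # any member pid returns the shared PGID
    if pgid == os.getpgid(0):
        # Simplified analogue of the PR's "I refuse to kill myself" guard.
        raise RuntimeError("refusing to signal our own process group")
    os.killpg(pgid, sig)
    return pgid

# The leader calls setsid, so its pid becomes the group's PGID.
leader = subprocess.Popen(["bash", "-c", "sleep 1000"], preexec_fn=os.setsid)
pgid = reap_process_group(leader.pid)
leader.wait()  # reaps the leader; returncode reflects SIGTERM
```

Any descendant's pid could be passed instead of leader.pid and the same group would be signalled, which is the point bolkedebruin makes above.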

albertocalderari pushed a commit to albertocalderari/incubator-airflow that referenced this pull request May 13, 2018

[AIRFLOW-1623] Trigger on_kill method in operators

on_kill methods were not triggered, due to processes not being properly
terminated. This was due to the fact the runners use a shell which is then
replaced by the child pid, which is unknown to Airflow.

Closes #3204 from bolkedebruin/AIRFLOW-1623

dlebech added a commit to trustpilot/incubator-airflow that referenced this pull request Sep 11, 2018

[AIRFLOW-1623] Trigger on_kill method in operators

on_kill methods were not triggered, due to processes not being properly
terminated. This was due to the fact the runners use a shell which is then
replaced by the child pid, which is unknown to Airflow.

Closes #3204 from bolkedebruin/AIRFLOW-1623