Introduce `ansible-runner worker cleanup` as cleanup of last-resort for remote execution nodes #846

AlanCoding · 2021-09-21T17:51:01Z

As of creating the draft PR, I might have all the test cases I brainstormed taken care of. I had some content for manually running demos:

https://gist.github.com/AlanCoding/99de3b85031cae3b7f2c7c8f81560359#file-cleanup_demo-md

I don't want to merge this until AWX changes have been finished (in basic terms) and tested locally. I do want to make sure runner CI is running and that I don't get any surprises.

EDIT: I am now pretty happy with the testing in conjunction with AWX.

AlanCoding · 2021-09-29T15:57:03Z

ansible_runner/cleanup.py

+
+def prune_images(runtime='podman'):
+    """Run the prune images command and return changed status"""
+    stdout = run_command([runtime, 'image', 'prune', '-f'])


@fosterseth in AWX we did something a little different, and I'm not totally sure why. It kind of manually deleted dangling images instead of using the prune command.

https://github.com/ansible/awx/blob/7c9626b0e7c0b7835c8ff42970c6f1740f3b10fe/awx/main/tasks.py#L397

This might have been to fix some podman-specific bug, but I lack the knowledge to reproduce that bug. We could add it here using the same commands as you used. Wouldn't hurt.

Prune here is probably better -- in AWX it looks like we wanted to get proper stdout for the images being removed, so it first grabs a list of images, then removes them. Maybe I wasn't able to get it to work properly with prune at the time.
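For comparison, the two approaches discussed in this thread can be sketched roughly as below. This is a minimal sketch, not the code in this PR or in AWX: the injectable `run` parameter and the `remove_dangling_images` helper are hypothetical, and treating Docker's `Total reclaimed space: 0B` summary as "nothing changed" is an assumption about the CLI output.

```python
import subprocess


def run_command(cmd):
    """Run cmd (a list of strings) and return its stripped stdout."""
    return subprocess.check_output(cmd).decode("utf-8").strip()


def prune_images(runtime="podman", run=run_command):
    """Single prune call; True if the output suggests something was removed."""
    stdout = run([runtime, "image", "prune", "-f"])
    # Assumption: empty output, or Docker's zero-space summary, means no change.
    return bool(stdout) and stdout != "Total reclaimed space: 0B"


def remove_dangling_images(runtime="podman", run=run_command):
    """List dangling images first, then remove them one by one (hypothetical)."""
    stdout = run([runtime, "images", "--filter", "dangling=true", "--format", "{{.ID}}"])
    removed = []
    for image_id in stdout.split():
        run([runtime, "rmi", image_id])
        removed.append(image_id)  # the caller gets the exact list of removed IDs
    return removed
```

The second form costs extra subprocess calls but reports exactly which images were removed, which matches the motivation given for the AWX version.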

docs/remote_jobs.rst

test/integration/containerized/test_cleanup_images.py

samdoran · 2021-09-29T20:29:48Z

test/integration/containerized/test_cleanup_images.py

+            'FROM {}'.format(default_container_image),
+            'RUN echo {} > /tmp/for_test.txt'.format(special_string)
+        ]))
+    image_name = 'quay.io/fortest/hasfile:latest'


Is this unique enough so that multiple versions of this test running simultaneously on the same host will not encounter errors when the image is later removed?

If this were a fixture used by multiple tests I would have randomized the name. But why would the same test run in parallel on the same host? Yes, tests will run in parallel but I imagined it would never run the same test more than once. I still don't have a problem adding randomization because it won't hurt anything, I just wanted to explain myself.

Because of the use of @pytest.mark.parametrize on this test, it's possible this test could be running in parallel (once with podman and once with docker). We aren't currently using podman for these tests, only docker, but the plan is to eventually enable podman as well.

The other thing is multiple instances of the same tests with different Python versions running at the same time.
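One way to randomize the name is sketched below. The `unique_image_name` helper is hypothetical and only illustrates the idea; the commit that randomizes the image name in this PR may do it differently.

```python
import uuid


def unique_image_name(repo="quay.io/fortest/hasfile"):
    """Return an image reference whose tag is unique per call."""
    # A random hex tag keeps parallel runs (different runtimes or Python
    # versions on the same host) from removing each other's image.
    return "{}:{}".format(repo, uuid.uuid4().hex[:12])
```

Each parametrized run would then build and later remove its own image rather than sharing `quay.io/fortest/hasfile:latest`.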

test/unit/test_cleanup.py

samdoran · 2021-09-29T20:50:29Z

ansible_runner/cleanup.py

+def delete_associated_folders(dir):
+    """Where dir is the private_data_dir for a completed job, this deletes related tmp folders it used"""
+    for ident in project_idents(dir):
+        registry_auth_pattern = f'/tmp/{registry_auth_prefix}{ident}_*'


Is hardcoding /tmp here problematic? What if the system is using a different tmp directory?

>>> from tempfile import gettempdir
>>> gettempdir()
'/tmp'

Just drop that in instead?

If there isn't a config option in runner for specifying the temp dir, then yes, tempfile.gettempdir() is a better option than hard coding /tmp.
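A minimal sketch of the suggestion, assuming the prefix value shown elsewhere in this PR. The function names `registry_auth_pattern` and `find_registry_auth_dirs` are illustrative only and are not taken from the diff.

```python
import glob
import os
import tempfile

REGISTRY_AUTH_PREFIX = "ansible_runner_registry_"


def registry_auth_pattern(ident):
    """Build the glob pattern under the system temp dir instead of a literal /tmp."""
    return os.path.join(tempfile.gettempdir(), "{}{}_*".format(REGISTRY_AUTH_PREFIX, ident))


def find_registry_auth_dirs(ident):
    """Return the paths currently matching the pattern for this ident."""
    return glob.glob(registry_auth_pattern(ident))
```

`tempfile.gettempdir()` honors `TMPDIR`, `TEMP`, and `TMP`, so the pattern follows wherever `tempfile.mkdtemp()` would create directories by default.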

ansible_runner/cleanup.py

samdoran · 2021-09-29T20:55:10Z

ansible_runner/cleanup.py

+__all__ = ['add_cleanup_args', 'run_cleanup']
+
+
+GRACE_PERIOD_DEFAULT = 60  # minutes


Do we have a better place for storing configuration values rather than setting a global in just this file?

there's ansible_runner.defaults, which is fine with me

Let's go with that. Keep it all caps since it is a global (the other values in that file should be all caps too, but that's for another PR).
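As a rough illustration of how a shared constant could feed the CLI default, here is a sketch. The `build_parser` function, the parser name, and the `float` type are assumptions; only the constant name, its value, and the `--grace-period` option come from this PR.

```python
import argparse

# Stands in for a value that would live in ansible_runner.defaults.
GRACE_PERIOD_DEFAULT = 60  # minutes


def build_parser():
    """Hypothetical parser showing the constant used as the option default."""
    parser = argparse.ArgumentParser(prog="cleanup-sketch")
    parser.add_argument(
        "--grace-period",
        type=float,  # assumption: fractional minutes are allowed
        default=GRACE_PERIOD_DEFAULT,
        help="Minutes to wait before a directory is eligible (default: %(default)s)",
    )
    return parser
```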

samdoran · 2021-09-29T20:59:52Z

ansible_runner/utils/__init__.py

@@ -27,19 +27,21 @@
 from six import string_types, PY2, PY3, text_type, binary_type


-def _cleanup_folder(folder):
+def cleanup_folder(folder):


This needs tests, especially if it's going to be public.

test/unit/test_cleanup.py

AlanCoding · 2021-09-30T15:04:10Z

@samdoran thank you so much for the review comments. I hope that most of these will be resolved, but there's a question or two that still seems unresolved, like what pattern we want for invoking a subprocess.

ansible_runner/cleanup.py

chrismeyersfsu · 2021-09-30T19:54:38Z

ansible_runner/utils/__init__.py

@@ -27,19 +27,21 @@
 from six import string_types, PY2, PY3, text_type, binary_type


-def _cleanup_folder(folder):
+def cleanup_folder(folder):
+    """Deletes folder, returns True or False based on whether a change happened."""
    try:
        shutil.rmtree(folder)


For systems that we care about, it seems that rmtree() (by default) will NOT follow symlinks. Which I like because security.

https://docs.python.org/3/library/shutil.html#shutil.rmtree.avoids_symlink_attacks
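The behaviour can be checked with a small sketch like the one below. The `cleanup_folder` body is a guess at the shape of the function from its docstring (the `except OSError` clause in particular is an assumption), and `demo_symlink_is_not_followed` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def cleanup_folder(folder):
    """Delete folder; return True if it was removed, False if nothing changed."""
    try:
        shutil.rmtree(folder)
        return True
    except OSError:
        return False


def demo_symlink_is_not_followed():
    """Remove a dir containing a symlink and report whether the target survived."""
    base = tempfile.mkdtemp()
    outside = os.path.join(base, "outside")
    job_dir = os.path.join(base, "job")
    os.mkdir(outside)
    os.mkdir(job_dir)
    keep = os.path.join(outside, "keep.txt")
    with open(keep, "w") as f:
        f.write("still here")
    # The job dir links to a directory outside itself.
    os.symlink(outside, os.path.join(job_dir, "link"))
    changed = cleanup_folder(job_dir)
    survived = os.path.exists(keep)
    shutil.rmtree(base)
    return changed, survived
```

Only the link itself is unlinked; the directory it points to is left alone. `shutil.rmtree.avoids_symlink_attacks` can be inspected at runtime to see whether the platform uses the symlink-attack-resistant implementation.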

Co-authored-by: Sam Doran <sdoran@redhat.com>

Shrews

I think all of Sam's comments here have been addressed. If not, those can be addressed in followup PRs.

ansible-zuul

LGTM!

samdoran

No blockers, just some suggestions.

samdoran · 2021-10-04T20:04:41Z

ansible_runner/__main__.py

 from ansible_runner import run
 from ansible_runner import output
+from ansible_runner import cleanup


Nit: combine multiple imports.

Suggested change

-from ansible_runner import run
-from ansible_runner import output
-from ansible_runner import cleanup
+from ansible_runner import (
+    cleanup,
+    output,
+    run,
+)

samdoran · 2021-10-04T20:07:44Z

ansible_runner/cleanup.py

+    )
+
+
+def run_command(cmd):


Ok. I was talking about run_command() and run_command_async() but those do a bit more than run commands since they also instantiate a Runner object.

We should add a generic and simple public function for running commands at a later date.
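Such a helper might look roughly like the sketch below. Everything here is hypothetical (`run_simple_command`, `CommandResult`, and the `timeout` parameter are not part of ansible-runner); it only shows a plain subprocess wrapper that does not instantiate a Runner object.

```python
import subprocess
import sys
from collections import namedtuple

CommandResult = namedtuple("CommandResult", ["rc", "stdout", "stderr"])


def run_simple_command(cmd, timeout=None):
    """Run cmd (a list of strings) without a shell and capture its output."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=False,  # leave return-code handling to the caller
    )
    return CommandResult(
        proc.returncode,
        proc.stdout.decode("utf-8", "replace").strip(),
        proc.stderr.decode("utf-8", "replace").strip(),
    )


# Usage: run the current interpreter so the example does not depend on podman.
result = run_simple_command([sys.executable, "-c", "print('hi')"])
```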

samdoran · 2021-10-04T20:09:56Z

ansible_runner/defaults.py

@@ -1,2 +1,6 @@
 default_process_isolation_executable = 'podman'
 default_container_image = 'quay.io/ansible/ansible-runner:devel'
+registry_auth_prefix = 'ansible_runner_registry_'


Nit: this should be all caps as well since it's a global.

Suggested change

-registry_auth_prefix = 'ansible_runner_registry_'
+REGISTRY_AUTH_PREFIX = 'ansible_runner_registry_'

samdoran · 2021-10-04T20:11:06Z

docs/remote_jobs.rst

+When running `ansible-runner worker`, if no `--private-data-dir` is given,
+it will extract the contents to a temporary directory which is deleted at the end of execution.
+You can use the `--delete` flag in conjunction with `--private-data-dir` to assure that
+the provided directory is deleted at the end of execution.


For RST, double backticks are needed for inline code formatting.

Suggested change

-When running `ansible-runner worker`, if no `--private-data-dir` is given,
-it will extract the contents to a temporary directory which is deleted at the end of execution.
-You can use the `--delete` flag in conjunction with `--private-data-dir` to assure that
-the provided directory is deleted at the end of execution.
+When running ``ansible-runner worker``, if no ``--private-data-dir`` is given,
+it will extract the contents to a temporary directory which is deleted at the end of execution.
+You can use the ``--delete`` flag in conjunction with ``--private-data-dir`` to assure that
+the provided directory is deleted at the end of execution.

samdoran · 2021-10-04T20:12:03Z

docs/remote_jobs.rst

+
+The following command offers out-of-band cleanup.
+
+    $ ansible-runner worker cleanup --file-pattern=/tmp/foo_*


Probably need to quote this to avoid shell expansion.

Suggested change

-    $ ansible-runner worker cleanup --file-pattern=/tmp/foo_*
+    $ ansible-runner worker cleanup --file-pattern='/tmp/foo_*'
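When the command line is assembled from Python rather than typed by hand, the same quoting can be produced with the standard library. A small sketch, assuming Python 3.8+ for `shlex.join`:

```python
import shlex

pattern = "/tmp/foo_*"
argv = ["ansible-runner", "worker", "cleanup", "--file-pattern=" + pattern]

# shlex.join quotes any argument containing shell metacharacters such as "*",
# so the pattern reaches ansible-runner literally instead of being globbed.
command_line = shlex.join(argv)
print(command_line)
```

Passing `argv` as a list to `subprocess.run` without `shell=True` avoids the issue entirely, since no shell is involved.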

samdoran · 2021-10-04T20:12:44Z

docs/remote_jobs.rst

+`ansible-runner worker --private_data_dir=/tmp/foo_3`, for example.
+NOTE: see the `--grace-period` option, which sets the time window.
+
+This command also takes a `--remove-images` option to run the podman or docker `rmi` command.
+There is otherwise no automatic cleanup of images used by a run,
+even if `container_auth_data` is used to pull from a private container registry.
+To be sure that layers are deleted as well, the `--image-prune` flag is necessary.


Suggested change

-`ansible-runner worker --private_data_dir=/tmp/foo_3`, for example.
-NOTE: see the `--grace-period` option, which sets the time window.
-
-This command also takes a `--remove-images` option to run the podman or docker `rmi` command.
-There is otherwise no automatic cleanup of images used by a run,
-even if `container_auth_data` is used to pull from a private container registry.
-To be sure that layers are deleted as well, the `--image-prune` flag is necessary.
+``ansible-runner worker --private_data_dir=/tmp/foo_3``, for example.
+NOTE: see the ``--grace-period`` option, which sets the time window.
+
+This command also takes a ``--remove-images`` option to run the podman or docker ``rmi`` command.
+There is otherwise no automatic cleanup of images used by a run,
+even if ``container_auth_data`` is used to pull from a private container registry.
+To be sure that layers are deleted as well, the ``--image-prune`` flag is necessary.

samdoran · 2021-10-04T20:21:01Z

test/unit/test_cleanup.py

+    old_dir = str(tmp_path / 'modtime_old_xyz')
+    new_dir = str(tmp_path / 'modtime_new_abc')
+    os.mkdir(old_dir)
+    time.sleep(1)
+    os.mkdir(new_dir)
+    ct = cleanup_dirs(pattern=str(tmp_path / 'modtime_*_*'), grace_period=1. / 60.)
+    assert ct == 1
+    assert not os.path.exists(old_dir)
+    assert os.path.exists(new_dir)


You could keep these as Path objects.

Suggested change

-    old_dir = str(tmp_path / 'modtime_old_xyz')
-    new_dir = str(tmp_path / 'modtime_new_abc')
-    os.mkdir(old_dir)
-    time.sleep(1)
-    os.mkdir(new_dir)
-    ct = cleanup_dirs(pattern=str(tmp_path / 'modtime_*_*'), grace_period=1. / 60.)
-    assert ct == 1
-    assert not os.path.exists(old_dir)
-    assert os.path.exists(new_dir)
+    old_dir = tmp_path / 'modtime_old_xyz'
+    new_dir = tmp_path / 'modtime_new_abc'
+    old_dir.mkdir()
+    time.sleep(1)
+    new_dir.mkdir()
+    ct = cleanup_dirs(pattern=str(tmp_path / 'modtime_*_*'), grace_period=1. / 60.)
+    assert ct == 1
+    assert not old_dir.exists()
+    assert new_dir.exists()

Side note: using time.sleep() in tests is usually asking for instability. It's better to just set a/c/mtimes explicitly rather than slow down the tests unnecessarily.
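Backdating a directory with `os.utime` is one way to do that. In the sketch below, `age_directory` and `is_older_than` are hypothetical helpers; `is_older_than` stands in for the grace-period check on the assumption that it is based on modification time.

```python
import os
import tempfile
import time


def age_directory(path, seconds):
    """Backdate path's access and modification times by `seconds`."""
    past = time.time() - seconds
    os.utime(path, (past, past))


def is_older_than(path, minutes):
    """Hypothetical stand-in for a grace-period check based on mtime."""
    return (time.time() - os.path.getmtime(path)) > minutes * 60


# Usage: no sleep needed to get one "old" and one "new" directory.
base = tempfile.mkdtemp()
old_dir = os.path.join(base, "modtime_old_xyz")
new_dir = os.path.join(base, "modtime_new_abc")
os.mkdir(old_dir)
os.mkdir(new_dir)
age_directory(old_dir, 120)  # pretend it was last modified two minutes ago
```

With the times set explicitly, the test can use a grace period of whole minutes and does not depend on how long the test itself takes to run.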

samdoran · 2021-10-04T20:24:40Z

test/unit/test_utils.py

+def test_cleanup_folder(tmp_path):
+    folder_path = tmp_path / 'a_folder'
+    folder_path.mkdir()
+    assert folder_path.exists()  # sanity


I think this assertion and the same on line 32 are testing pathlib or the state of the test environment, not our function, and are unnecessary.

AlanCoding mentioned this pull request Sep 21, 2021

Cleanup unused images in execution nodes ansible/awx#10701

Closed

AlanCoding force-pushed the cleanup_command branch from 8fa836c to afce151 Compare September 21, 2021 19:22

AlanCoding mentioned this pull request Sep 28, 2021

Consolidate cleanup actions under new ansible-runner worker cleanup command ansible/awx#11160

Merged

AlanCoding changed the title ~~[WIP] Introduce ansible-runner worker cleanup as cleanup of last-resort for remote execution nodes~~ Introduce ansible-runner worker cleanup as cleanup of last-resort for remote execution nodes Sep 29, 2021

AlanCoding marked this pull request as ready for review September 29, 2021 14:37

AlanCoding requested a review from a team as a code owner September 29, 2021 14:37

AlanCoding force-pushed the cleanup_command branch from cc0f232 to 57f64a9 Compare September 29, 2021 14:57

AlanCoding mentioned this pull request Sep 29, 2021

Background job to cleanup private_data_dir files left by interrupted job ansible/awx#10948

Closed

2 tasks

AlanCoding commented Sep 29, 2021

Shrews reviewed Sep 29, 2021

samdoran suggested changes Sep 29, 2021

AlanCoding force-pushed the cleanup_command branch from a61bca6 to 6eaf4f7 Compare September 30, 2021 14:37

Shrews reviewed Sep 30, 2021

chrismeyersfsu reviewed Sep 30, 2021

AlanCoding and others added 15 commits October 1, 2021 09:37

Add arguments for new cleanup command

ca0a046

Build out python logic for cleanup command

57f89ca

Adjustments to make verbosity better behaved

89a124b

Implement plausible solution for registry auth dir recognition

9123b92

Print if images are cleaned up

8350f21

Initial set of tests, get registry auth cleanup working

ac6e90f

Save current work for integration tests

82e2d3c

Get integration test running

ecaaca1

Pass runtime to prune operation too

73f022f

Add a grace period to the worker cleanup command

adf13e4

Allow multiple args for cleanup options

ab9fb2c

Correct example in help text

66381c5

Create new docs section for remote job dir cleanup

aec55b5

Add some minimal path validation to prevent user footguns

59c793d

Apply suggestions from code review

b8e306b

Co-authored-by: Sam Doran <sdoran@redhat.com>

AlanCoding and others added 8 commits October 1, 2021 09:41

Keep dockerfile_path var as a Path object until needed

33fb333

Co-authored-by: Sam Doran <sdoran@redhat.com>

Syntax cleanup from github commits

d3e7a43

Use whatever the system tmp dir is

1eaead2

Add a simple set of unit tests for cleanup_folder

b4fd318

Move grace period default to constants module

bfce8c0

Randomize image name in test

1cdd27e

Elaborate on the necessity of the command by expanding on image removal

982d1af

Fix spelling in skip message

50401dd

AlanCoding force-pushed the cleanup_command branch from 52e9136 to 50401dd Compare October 1, 2021 13:42

Shrews approved these changes Oct 1, 2021

Shrews added the gate label Oct 1, 2021

ansible-zuul bot approved these changes Oct 1, 2021

beeankha approved these changes Oct 1, 2021

samdoran approved these changes Oct 4, 2021

ansible-zuul bot merged commit 1a8c1c5 into ansible:devel Oct 4, 2021
