Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ansible-runner worker cleanup as cleanup of last-resort for remote execution nodes #846

Merged
merged 23 commits into from
Oct 4, 2021

Conversation

AlanCoding
Copy link
Member

@AlanCoding AlanCoding commented Sep 21, 2021

As of creating the draft PR, I might have all the test cases I brainstormed taken care of. I had some content for manually running demos:

https://gist.github.com/AlanCoding/99de3b85031cae3b7f2c7c8f81560359#file-cleanup_demo-md

I don't want to merge this until AWX changes have been finished (in basic terms) and tested locally. I do want to make sure runner CI is running and that I don't get any surprises.


EDIT: I am now pretty happy with the testing in conjunction with AWX.

@AlanCoding AlanCoding changed the title [WIP] Introduce ansible-runner worker cleanup as cleanup of last-resort for remote execution nodes Introduce ansible-runner worker cleanup as cleanup of last-resort for remote execution nodes Sep 29, 2021
@AlanCoding AlanCoding marked this pull request as ready for review September 29, 2021 14:37
@AlanCoding AlanCoding requested a review from a team as a code owner September 29, 2021 14:37

def prune_images(runtime='podman'):
"""Run the prune images command and return changed status"""
stdout = run_command([runtime, 'image', 'prune', '-f'])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fosterseth in AWX we did something a little different, and I'm not totally sure why. It kind of manually deleted dangling images instead of using the prune command.

https://github.com/ansible/awx/blob/7c9626b0e7c0b7835c8ff42970c6f1740f3b10fe/awx/main/tasks.py#L397

This might have been to fix some podman-specific bug, but I lack the knowledge to reproduce that bug. We could add it here using the same commands as you used. Wouldn't hurt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prune here is probably better -- in awx it looks like we wanted to get proper stdout for the images being removed, so it first grabs a list of images, then removes them. Maybe I wasn't able to get it work properly with prune at the time

test/integration/containerized/test_cleanup_images.py Outdated Show resolved Hide resolved
test/integration/containerized/test_cleanup_images.py Outdated Show resolved Hide resolved
test/integration/containerized/test_cleanup_images.py Outdated Show resolved Hide resolved
'FROM {}'.format(default_container_image),
'RUN echo {} > /tmp/for_test.txt'.format(special_string)
]))
image_name = 'quay.io/fortest/hasfile:latest'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this unique enough so that multiple versions of this test running simultaneously on the same host will not encounter errors when the image is later removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this were a fixture used by multiple tests I would have randomized the name. But why would the same test run in parallel on the same host? Yes, tests will run in parallel but I imagined it would never run the same test more than once. I still don't have a problem adding randomization because it won't hurt anything, I just wanted to explain myself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the use of @pytest.mark.parametrize on this test, it's possible this test could be running in parallel (once with podman and once with docker). We aren't currently using podman for these tests, only docker, but the plan is to eventually enable podman as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing is multiple instances of the same tests with different Python versions running at the same time.

test/unit/test_cleanup.py Outdated Show resolved Hide resolved
def delete_associated_folders(dir):
"""Where dir is the private_data_dir for a completed job, this deletes related tmp folders it used"""
for ident in project_idents(dir):
registry_auth_pattern = f'/tmp/{registry_auth_prefix}{ident}_*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is hardcoding /tmp here problematic? What if the system is using a different tmp directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> from tempfile import gettempdir
>>> gettempdir()
'/tmp'

Just drop that in instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there isn't a config option in runner for specifying the temp dir, then yes, tempfile.gettempdir() is a better option than hard coding /tmp.

ansible_runner/cleanup.py Outdated Show resolved Hide resolved
__all__ = ['add_cleanup_args', 'run_cleanup']


GRACE_PERIOD_DEFAULT = 60 # minutes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a better place for storing configuration values rather than setting a global in just this file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's ansible_runner.defaults, which is fine with me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with that. Keep it all caps since it is a global (the other values in that file should be all caps too, but that's for another PR).

@@ -27,19 +27,21 @@
from six import string_types, PY2, PY3, text_type, binary_type


def _cleanup_folder(folder):
def cleanup_folder(folder):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests, especially if it's going to be public.

test/unit/test_cleanup.py Outdated Show resolved Hide resolved
@AlanCoding
Copy link
Member Author

@samdoran thank you so much for the review comments. I hope that most of these will be resolved, but there's a question or two that still seems unresolved, like what pattern we want for invoking a subprocess.

@@ -27,19 +27,21 @@
from six import string_types, PY2, PY3, text_type, binary_type


def _cleanup_folder(folder):
def cleanup_folder(folder):
"""Deletes folder, returns True or False based on whether a change happened."""
try:
shutil.rmtree(folder)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For systems that we care about, it seems that rmtree() (by default) will NOT follow symlinks. Which I like because security.

https://docs.python.org/3/library/shutil.html#shutil.rmtree.avoids_symlink_attacks

Copy link
Contributor

@Shrews Shrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of Sam's comments here have been addressed. If not, those can be addressed in followup PRs.

@Shrews Shrews added the gate label Oct 1, 2021
Copy link
Contributor

@ansible-zuul ansible-zuul bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@samdoran samdoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blockers, just some suggestions.

Comment on lines 40 to +42
from ansible_runner import run
from ansible_runner import output
from ansible_runner import cleanup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: combine multiple imports.

Suggested change
from ansible_runner import run
from ansible_runner import output
from ansible_runner import cleanup
from ansible_runner import (
cleanup,
output,
run,
)

)


def run_command(cmd):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I was talking about run_command() and run_command_async() but those do a bit more than run commands since they also instantiate a Runner object.

We should add a generic and simple public function for running commands at a later date.

@@ -1,2 +1,6 @@
default_process_isolation_executable = 'podman'
default_container_image = 'quay.io/ansible/ansible-runner:devel'
registry_auth_prefix = 'ansible_runner_registry_'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should be all caps as well since it's a global.

Suggested change
registry_auth_prefix = 'ansible_runner_registry_'
REGISTRY_AUTH_PREFIX = 'ansible_runner_registry_'

Comment on lines +39 to +42
When running `ansible-runner worker`, if no `--private-data-dir` is given,
it will extract the contents to a temporary directory which is deleted at the end of execution.
You can use the `--delete` flag in conjunction with `--private-data-dir` to assure that
the provided directory is deleted at the end of execution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For RST, double backticks are needed for inline code formatting.

Suggested change
When running `ansible-runner worker`, if no `--private-data-dir` is given,
it will extract the contents to a temporary directory which is deleted at the end of execution.
You can use the `--delete` flag in conjunction with `--private-data-dir` to assure that
the provided directory is deleted at the end of execution.
When running ``ansible-runner worker``, if no ``--private-data-dir`` is given,
it will extract the contents to a temporary directory which is deleted at the end of execution.
You can use the ``--delete`` flag in conjunction with ``--private-data-dir`` to assure that
the provided directory is deleted at the end of execution.


The following command offers out-of-band cleanup.

$ ansible-runner worker cleanup --file-pattern=/tmp/foo_*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to quote this to avoid shell expansion.

Suggested change
$ ansible-runner worker cleanup --file-pattern=/tmp/foo_*
$ ansible-runner worker cleanup --file-pattern='/tmp/foo_*'

Comment on lines +50 to +56
`ansible-runner worker --private_data_dir=/tmp/foo_3`, for example.
NOTE: see the `--grace-period` option, which sets the time window.

This command also takes a `--remove-images` option to run the podman or docker `rmi` command.
There is otherwise no automatic cleanup of images used by a run,
even if `container_auth_data` is used to pull from a private container registry.
To be sure that layers are deleted as well, the `--image-prune` flag is necessary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`ansible-runner worker --private_data_dir=/tmp/foo_3`, for example.
NOTE: see the `--grace-period` option, which sets the time window.
This command also takes a `--remove-images` option to run the podman or docker `rmi` command.
There is otherwise no automatic cleanup of images used by a run,
even if `container_auth_data` is used to pull from a private container registry.
To be sure that layers are deleted as well, the `--image-prune` flag is necessary.
``ansible-runner worker --private_data_dir=/tmp/foo_3``, for example.
NOTE: see the ``--grace-period`` option, which sets the time window.
This command also takes a ``--remove-images`` option to run the podman or docker ``rmi`` command.
There is otherwise no automatic cleanup of images used by a run,
even if ``container_auth_data`` is used to pull from a private container registry.
To be sure that layers are deleted as well, the ``--image-prune`` flag is necessary.

Comment on lines +38 to +46
old_dir = str(tmp_path / 'modtime_old_xyz')
new_dir = str(tmp_path / 'modtime_new_abc')
os.mkdir(old_dir)
time.sleep(1)
os.mkdir(new_dir)
ct = cleanup_dirs(pattern=str(tmp_path / 'modtime_*_*'), grace_period=1. / 60.)
assert ct == 1
assert not os.path.exists(old_dir)
assert os.path.exists(new_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could keep these as Path objects.

Suggested change
old_dir = str(tmp_path / 'modtime_old_xyz')
new_dir = str(tmp_path / 'modtime_new_abc')
os.mkdir(old_dir)
time.sleep(1)
os.mkdir(new_dir)
ct = cleanup_dirs(pattern=str(tmp_path / 'modtime_*_*'), grace_period=1. / 60.)
assert ct == 1
assert not os.path.exists(old_dir)
assert os.path.exists(new_dir)
old_dir = tmp_path / 'modtime_old_xyz'
new_dir = tmp_path / 'modtime_new_abc'
old_dir.mkdir()
time.sleep(1)
new_dir.mkdir()
ct = cleanup_dirs(pattern=str(tmp_path / 'modtime_*_*'), grace_period=1. / 60.)
assert ct == 1
assert not old_dir.exists()
assert new_dir.exists()

Side note: using time.sleep() in tests is usually asking for instability. It's better to just set a/c/mtimes explicitly rather than slow down the tests unnecessary.

def test_cleanup_folder(tmp_path):
folder_path = tmp_path / 'a_folder'
folder_path.mkdir()
assert folder_path.exists() # sanity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assertion and the same on line 32 are testing pathlib or the state of the test environment, not our function, and are unnecessary.

@ansible-zuul ansible-zuul bot merged commit 1a8c1c5 into ansible:devel Oct 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants