Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Handle disappearing hosts #57129

Open
wants to merge 4 commits into
base: devel
Choose a base branch
from

Conversation

bcoca
Copy link
Member

@bcoca bcoca commented May 29, 2019

fixes #55669

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME

core

@bcoca bcoca requested a review from jimi-c May 29, 2019 15:26
@ansibot ansibot added affects_2.9 This issue/PR affects Ansible v2.9 bug This issue/PR relates to a bug. core_review In order to be merged, this PR must follow the core review workflow. needs_triage Needs a first human triage before being processed. support:community This issue/PR relates to code supported by the Ansible community. support:core This issue/PR relates to code supported by the Ansible Engineering Team. labels May 29, 2019
@goneri goneri removed the needs_triage Needs a first human triage before being processed. label May 30, 2019
@ansibot ansibot added needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. core_review In order to be merged, this PR must follow the core review workflow. and removed core_review In order to be merged, this PR must follow the core review workflow. needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. labels May 30, 2019
@nitzmahone
Copy link
Member

nitzmahone commented May 30, 2019

I built a reproducer for this with an inventory plugin (intending to turn it into a test for this PR), but as-is doesn't fix the issue (piqued my interest as I've often wondered how we'd handle disappearing hosts)...

@bcoca
Copy link
Member Author

bcoca commented May 30, 2019

@nitzmahone what is the reproducer? original issue was mostly a confluence of race conditions and i was having a hard time making it happen.

@nitzmahone
Copy link
Member

vanish.py

from ansible.plugins.inventory import BaseInventoryPlugin

runcount = 0


class InventoryModule(BaseInventoryPlugin):

    NAME = 'vanish'

    def verify_file(self, path):
        if not path.endswith('.vanish.yml'):
            return False
        return super(InventoryModule, self).verify_file(path)

    def parse(self, inventory, loader, path, cache=True):
        global runcount
        runcount += 1
        group = 'vanishgroup'
        inventory.add_group(group)
        inventory.add_host('h1', group=group)
        print('runcount is %d' % runcount)
        if runcount <= 1:
            inventory.add_host('disappearing_host', group=group)
        inventory.set_variable(group, 'ansible_connection', 'local')

disappearing_host.yml

- hosts: vanishgroup
  gather_facts: no
  tasks:
  - shell: echo yo
    notify: dude
  - meta: refresh_inventory
  - ping:
  handlers:
  - name: dude
    debug: msg=howdy

disappear.vanish.yml

plugin: vanish

ansible-playbook -i disappear.vanish.yml disappearing_host.yml

reproduces it 100% for me

@nitzmahone
Copy link
Member

I suspect there are probably a number of other potential hazards for this, but handler exec is a big one. It seems to work as expected without a handler and using the default callback.

@bcoca
Copy link
Member Author

bcoca commented May 31, 2019

this looks like diff case than the one i was handling: free strat + one of disappearing hosts being slow to return previous result when inventory (run_once) was refreshed.

so we probably have multiple cases that cause errors when a host disappears mid play.

@nitzmahone
Copy link
Member

Yeah, I thought about strategy: free too, though I didn't see anything mentioning that in the original issue... Reproducing that case under free should be pretty easy with this- just do a shell: sleep {{ bla }} and have the disappearing host's value of bla be longer than the others.

@nitzmahone nitzmahone closed this May 31, 2019
@nitzmahone nitzmahone reopened this May 31, 2019
@nitzmahone
Copy link
Member

I'm also wondering if we should leave "removed" hosts in place and just have an active flag on them that play_iterator et al would just ignore when asking for the next host... Would probably solve a lot of these kinds of problems, since other things like task completion and stats that need to look up hosts would still find them. Otherwise this is probably going to be a long series of race condition whack-a-mole...

@bcoca
Copy link
Member Author

bcoca commented May 31, 2019

discussed a lot in irc with original author so ticket does not have good reflection of full issue, still unsure that is the case since i could not easily reproduce.

as for ignoring the host, that is kind of what this does, sets state to 'nothing to do' and puts in removed_keys() so it is skipped even if iterator has it in preset objects

need to check why handlers still run, but also need to discuss --force-handlers in this case as well as meta: clear_host_errors

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jun 8, 2019
@bcoca bcoca changed the title Handle disappearing hosts [WIP] Handle disappearing hosts Jun 22, 2019
@ansibot ansibot added the WIP This issue/PR is a work in progress. Nevertheless it was shared for getting input from peers. label Jun 22, 2019
@ansibot

This comment has been minimized.

@ansibot ansibot removed the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jun 24, 2019
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jul 3, 2019
@ansibot ansibot added the needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html label Jul 19, 2019
@ansibot ansibot removed needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Feb 21, 2020
@ansibot
Copy link
Contributor

ansibot commented Feb 21, 2020

The test ansible-test sanity --test pylint [explain] failed with 1 error:

lib/ansible/executor/task_queue_manager.py:158:80: undefined-variable: Undefined variable 'AnsibleCollectionRef'

click here for bot help

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Feb 29, 2020
@ansibot ansibot removed the core_review In order to be merged, this PR must follow the core review workflow. label Aug 31, 2020
@ansibot ansibot added pre_azp This PR was last tested before migration to Azure Pipelines. and removed stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Dec 5, 2020
@ansibot ansibot removed the support:community This issue/PR relates to code supported by the Ansible community. label Mar 5, 2021
@KFoxder
Copy link

KFoxder commented Apr 11, 2021

@bcoca Is there any workaround for this issue. Our playbook fails consistently now that we are on Ansible 2.9.19...

@bcoca
Copy link
Member Author

bcoca commented Apr 12, 2021

@KFoxder not really, the iterator was not designed for hosts to disappear mid run.

I'm still not sure how to fix this since we don't want to add hosts to run retroactively, it has many weird consequences.

my current thought is to invalidate the run for the hosts that were removed with a new status to avoid it being cleared by meta
and any other way of reintroducing the hosts into the play

I have not figured out how to deal with the magic vars as they should both reflect the hosts the play started out targeting but also avoid trying to use/connect to removed hosts.

Co-authored-by: David Shrewsbury <Shrews@users.noreply.github.com>
@ansibot ansibot removed the pre_azp This PR was last tested before migration to Azure Pipelines. label Apr 26, 2021
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label May 5, 2021
@ansibot ansibot added the needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html label May 13, 2021
@bcoca
Copy link
Member Author

bcoca commented May 4, 2022

new state 'removed' for host so iterator skips, but still reports on it (remove from hostvars?)

@ansibot ansibot added needs_ci This PR requires CI testing to be performed. Please close and re-open this PR to trigger CI. and removed stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Jun 26, 2023
@ansibot ansibot added stale_review Updates were made after the last review and the last review is more than 7 days old. has_issue labels Jul 12, 2023
@ansibot ansibot added needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html and removed needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html needs_ci This PR requires CI testing to be performed. Please close and re-open this PR to trigger CI. labels Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects_2.9 This issue/PR affects Ansible v2.9 bug This issue/PR relates to a bug. has_issue needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html stale_review Updates were made after the last review and the last review is more than 7 days old. support:core This issue/PR relates to code supported by the Ansible Engineering Team. WIP This issue/PR is a work in progress. Nevertheless it was shared for getting input from peers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Randomly receiving 'NoneType' object has no attribute 'name' after inventory refresh
8 participants