race condition with fact cache and smart fact gathering causes undefined variable errors. #14456

Closed
MichaelBaydoun opened this issue Feb 12, 2016 · 5 comments
Labels
bug This issue/PR relates to a bug.
@MichaelBaydoun
Contributor

Issue Type: Bug Report
Ansible Version: 2.0.0.2
Ansible Configuration:

No configuration changes. Fact caching worked in 1.9.4 but has problems in 2.0.0.2.

Fact caching settings in ansible.cfg

fact_caching = jsonfile
fact_caching_connection = ~/.ansiblecachedir
fact_caching_timeout = 10800
gathering = smart

Environment:

Control server is RedHat 6, target hosts are a collection of rhel5/6 and win2008/2012

Summary:

Fact caching worked in 1.9.4, but is having issues in 2.0.0.2.
Site runs fail at random points and on random hosts with undefined-variable errors. The variables in question are facts that were defined and used earlier in the same site run.

Steps To Reproduce:

During a long site.yml run (ours takes almost two hours), we gather facts at the top and cache them. The cache timeout is set to three hours.
Every two-hour site run fails at random points and on random hosts with undefined-variable errors for facts that were defined and used without error earlier in the run.
Disabling fact caching makes the errors go away.
Limiting the site run to a subset of hosts or tags, which makes runs much faster, also avoids the problem.

Expected Results:

Fact variables are not forgotten

Actual Results:

During longer runs, hosts fail with undefined-variable errors for facts that were defined earlier in the run and that shouldn't have expired.

@bcoca bcoca added this to the stable-2.0 milestone Feb 15, 2016
@MichaelBaydoun
Contributor Author

MichaelBaydoun commented May 9, 2016

I figured out what's happening here, but not how to fix it. There is a flaw in the smart fact gathering design.

At the start of a multi-play playbook run such as the typical site.yml, if previously cached facts have not expired, they will not be regathered. Then, if they expire later during the playbook run, no attempt is made to refresh them, and the undefined variable error occurs.

From the documentation: "The value ‘smart’ means each new host that has no facts discovered will be scanned, but if the same host is addressed in multiple plays it will not be contacted again in the playbook run."

Whatever value is used for fact_caching_timeout, the potential for this condition still exists.
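The sequence described above can be reproduced with a toy model. This is only a sketch: `ExpiringFactCache`, `FakeClock`, `smart_gather`, and `simulate_run` are hypothetical names invented for illustration and are not Ansible's code. The model assumes a cache that refuses entries older than the timeout and a "smart" gather step that skips any host with a still-valid entry.

```python
class FakeClock:
    """Manually advanced clock so the example needs no real waiting."""
    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ExpiringFactCache:
    """Toy timeout-based cache (hypothetical, not Ansible's implementation)."""
    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock
        self._store = {}  # host -> (stored_at, facts)

    def set(self, host, facts):
        self._store[host] = (self.clock(), facts)

    def contains(self, host):
        entry = self._store.get(host)
        return entry is not None and self.clock() - entry[0] < self.timeout

    def get(self, host):
        if not self.contains(host):
            raise KeyError(host)
        return self._store[host][1]


def smart_gather(cache, host, gather):
    """Gather facts only if the cache has no valid entry. True if gathered."""
    if cache.contains(host):
        return False
    cache.set(host, gather(host))
    return True


def simulate_run():
    host = "wdv-servicessbx"
    clock = FakeClock()
    cache = ExpiringFactCache(timeout=10800, clock=clock)
    cache.set(host, {"ansible_system": "Linux"})  # cached by an earlier run

    clock.advance(9960)  # 2h46m later: a new run starts, entry still valid
    gathered = smart_gather(cache, host, lambda h: {"ansible_system": "Linux"})
    early = cache.get(host)["ansible_system"]  # fact is usable early in the run

    clock.advance(2034)  # 33m54s into the run: entry is now past the timeout
    try:
        cache.get(host)
        late = "defined"
    except KeyError:
        late = "undefined"
    return gathered, early, late
```

In this model `simulate_run()` reports that no gathering happened at the start, that the fact was readable early on, and that it was undefined later in the same run.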

Evidence that this happened: we are now running 2.0.1.0, turned fact caching back on to test, and the issue reappeared.

One of the hosts that just failed, with an undefined ansible_system fact:

12:33:54            TASK [splunk : install splunk] *************************************************
12:33:54            Monday 09 May 2016  12:33:54 -0400 (0:00:00.535)       0:33:37.157 ************ 
12:33:54            fatal: [wdv-servicessbx]: FAILED! => {"failed": true, "msg": "The conditional check 'ansible_system == 'Linux'' failed. The error was: error while evaluating conditional (ansible_system == 'Linux'): 'ansible_system' is undefined\n\nThe error appears to have been in '/home/icansible/ansible/roles/splunk/tasks/main.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- include: linux.yml\n  ^ here\n"}

Looking in the cache directory, the failed server's file hadn't been updated in the most recent execution, which launched at 12:00, because the file was not quite old enough yet: it was 2 hours 46 minutes old, and our timeout is set to 3 hours.

-rw-rw----.  1 icansible icansible   926 May  9 09:14 wdv-servicessbx

But the site run failed at 12:33 because by then the fact file was over 3 hours (10,800 seconds) old.
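The timing can be checked with a few lines of arithmetic. This is a sketch under two assumptions: the file's modification time is taken as exactly 09:14:00 (the listing shows no seconds), and the run start is taken as exactly 12:00:00. `cache_age_status` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(seconds=10800)  # fact_caching_timeout from ansible.cfg


def cache_age_status(mtime, run_start, failure_time, timeout=TIMEOUT):
    """Compare the cache file's age with the timeout at two points in time."""
    age_at_start = run_start - mtime
    age_at_failure = failure_time - mtime
    return {
        "age_at_start": age_at_start,
        "valid_at_start": age_at_start < timeout,
        "age_at_failure": age_at_failure,
        "valid_at_failure": age_at_failure < timeout,
    }


status = cache_age_status(
    mtime=datetime(2016, 5, 9, 9, 14),              # from the ls output (seconds assumed 0)
    run_start=datetime(2016, 5, 9, 12, 0),          # run launched at 12:00
    failure_time=datetime(2016, 5, 9, 12, 33, 54),  # timestamp of the failed task
)
```

With these inputs the file is 2 h 46 min (9,960 s) old at launch, so it is still valid and not regathered, and 3 h 19 min 54 s (11,994 s) old at the failed task, which is past the 10,800 s timeout.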

The only behavior I can't explain is why we never saw this problem in 1.9.4 but see it frequently starting with 2.0.

@MichaelBaydoun MichaelBaydoun changed the title Facts being forgotten before cache timeout in 2.0.0.2 Smart fact gathering causes undefined variable errors. May 9, 2016
@MichaelBaydoun
Contributor Author

Changed issue title to more accurately reflect the issue, now that it's been identified.

@bcoca bcoca changed the title Smart fact gathering causes undefined variable errors. race condition with fact cache and smart fact gathering causes undefined variable errors. May 10, 2016
@jctanner
Contributor

jctanner commented May 10, 2016

Suspected culprit: 0aa0183, which fixed #12722.

bcoca added a commit that referenced this issue May 10, 2016
fixes #14456, now it won't expire keys in middle of a play when they
were 'valid' at 'gather time'.
@bcoca bcoca closed this as completed in f576082 May 10, 2016
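The idea in the commit message, not expiring keys mid-play when they were valid at gather time, could look roughly like the following. This is an illustrative sketch only and is not the code from f576082; `PinnedFactCache`, `preload`, and `demo` are hypothetical names. The assumption is that a key is "pinned" for the rest of the run the first time it passes the timeout check.

```python
class PinnedFactCache:
    """Toy cache that stops expiring a key once it has been seen valid
    during the current run (hypothetical, not the actual Ansible fix)."""

    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock
        self._store = {}         # host -> (stored_at, facts)
        self._validated = set()  # hosts seen valid during this run

    def preload(self, host, facts, stored_at):
        """Simulate an entry left behind by a previous run."""
        self._store[host] = (stored_at, facts)

    def set(self, host, facts):
        self._store[host] = (self.clock(), facts)
        self._validated.add(host)

    def contains(self, host):
        if host in self._validated:
            return True  # valid at gather time: keep serving it for this run
        entry = self._store.get(host)
        if entry is None:
            return False
        if self.clock() - entry[0] >= self.timeout:
            del self._store[host]  # already expired before first use
            return False
        self._validated.add(host)
        return True

    def get(self, host):
        if not self.contains(host):
            raise KeyError(host)
        return self._store[host][1]


def demo():
    now = [0]
    cache = PinnedFactCache(10800, lambda: now[0])
    cache.preload("wdv-servicessbx", {"ansible_system": "Linux"}, stored_at=0)
    cache.preload("stale-host", {"ansible_system": "Linux"}, stored_at=0)

    now[0] = 9960   # gather time: entry is 2h46m old, still valid
    valid_at_gather = cache.contains("wdv-servicessbx")

    now[0] = 11994  # later in the run: past the timeout
    late_value = cache.get("wdv-servicessbx")["ansible_system"]
    stale_ok = cache.contains("stale-host")  # first checked only after expiry
    return valid_at_gather, late_value, stale_ok
```

Here the host checked at gather time stays readable after the timeout passes, while an entry first checked after it expired is still rejected and would be regathered.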
@MichaelBaydoun
Contributor Author

MichaelBaydoun commented May 10, 2016

Note: We are using file (json) caching. For anyone having this problem, until the commit makes it into the 2.1 release, a suggested workaround is to switch gathering from smart to explicit, and explicitly call setup once at the start of your playbook.

I'm not entirely clear on how redis or other caching systems work, but I believe they will see the same issue, and that the fix committed above only covers json caching behavior.

@prutseltje
Contributor

Generating my /etc/hosts file fails in Ansible 2, but only when the playbook has been running for a few minutes before it generates the file. When I run the task directly with the corresponding tag, /etc/hosts is generated without any issue.

fatal: [host]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_ssh_host'"}

ansible (2.1.0.0)
gathering = smart
fact_caching = redis
fact_caching_timeout = 86400

gescheit pushed a commit to gescheit/ansible that referenced this issue Jun 3, 2016
fixes ansible#14456, now it won't expire keys in middle of a play when they
were 'valid' at 'gather time'.
@ansibot ansibot added bug This issue/PR relates to a bug. and removed bug_report labels Mar 7, 2018
@ansible ansible locked and limited conversation to collaborators Apr 25, 2019