Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

awx intermittently reports non-complete playbook as failed and stops outputing log #12400

Open
5 of 9 tasks
tree4125 opened this issue Jun 21, 2022 · 8 comments
Open
5 of 9 tasks

Comments

@tree4125
Copy link

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Shortly after starting a playbook using the AWX UI it is marked as failed. log output is incomplete and the job is still running in the background.

AWX version

21.0.0

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

This has been happening randomly for the last few versions of AWX.

Expected results

Playbook should be marked as successful and have full log output in AWX.

Actual results

Log output incomplete or empty and playbook is marked as failed.

Additional information

2022-06-09 17:55:18,024 ERROR [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch Worker failed to run task awx.main.tasks.jobs.RunJob(*[21317], **{}
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 587, in run
self.final_run_hook(self.instance, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 962, in final_run_hook
super(RunJob, self).final_run_hook(job, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 376, in final_run_hook
instance.log_lifecycle("finalize_run")
AttributeError: 'NoneType' object has no attribute 'log_lifecycle'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 93, in _wrapped
return f(self, args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
logger.exception('{} Final run hook errored.'.format(self.instance.log_format))
AttributeError: 'NoneType' object has no attribute 'log_format'
2022-06-09 17:55:18,028 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch task 93552643-cd13-40d3-a0d6-688f6b4e06c1 starting awx.main.tasks.system.handle_work_error(
['93552643-cd13-40d3-a0d6-688f6b4e06c1'])
2022-06-09 17:55:18,029 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.tasks.system Executing error task id 93552643-cd13-40d3-a0d6-688f6b4e06c1, subtasks: [{'type': 'job', 'id': 21317}]
2022-06-09 17:55:18,031 ERROR [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch Worker failed to run task awx.main.tasks.system.handle_work_error(
['93552643-cd13-40d3-a0d6-688f6b4e06c1'], **{'subtasks': [{'type': 'job', 'id': 21317}]}
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 587, in run
self.final_run_hook(self.instance, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 962, in final_run_hook
super(RunJob, self).final_run_hook(job, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 376, in final_run_hook
instance.log_lifecycle("finalize_run")
AttributeError: 'NoneType' object has no attribute 'log_lifecycle'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 93, in _wrapped
return f(self, *args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
logger.exception('{} Final run hook errored.'.format(self.instance.log_format))
AttributeError: 'NoneType' object has no attribute 'log_format'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 685, in handle_work_error
instance = UnifiedJob.get_instance_by_type(each_task['type'], each_task['id'])
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 495, in get_instance_by_type
return model.objects.get(id=job_id)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 431, in get
num = len(clone)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 262, in len
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/polymorphic/query.py", line 58, in _polymorphic_iterator
o = next(base_iter)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 51, in iter
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1173, in execute_sql
cursor = self.connection.cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in exit
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2022-06-09 17:55:18,034 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch task 93552643-cd13-40d3-a0d6-688f6b4e06c1 starting awx.main.tasks.system.handle_work_success(
[])
2022-06-09 17:55:18,037 ERROR [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch Worker failed to run task awx.main.tasks.system.handle_work_success(
[], **{'task_actual': {'type': 'job', 'id': 21317}}
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 666, in handle_work_success
instance = UnifiedJob.get_instance_by_type(task_actual['type'], task_actual['id'])
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 495, in get_instance_by_type
return model.objects.get(id=job_id)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 431, in get
num = len(clone)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 262, in len
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/polymorphic/query.py", line 58, in _polymorphic_iterator
o = next(base_iter)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 51, in iter
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1173, in execute_sql
cursor = self.connection.cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in exit
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2022-06-09 17:55:18,075 DEBUG [42e499902d394546855280338c664d6d] awx.main.dispatch task 9ee34e0c-b973-457a-870e-0664ef46542d starting awx.main.scheduler.tasks.run_task_manager(
[])
2022-06-09 17:55:18,077 DEBUG [42e499902d394546855280338c664d6d] awx.main.scheduler Running task manager.
2022-06-09 17:55:18,105 DEBUG [42e499902d394546855280338c664d6d] awx.main.scheduler Starting Scheduler
2022-06-09 17:55:18,485 DEBUG [42e499902d394546855280338c664d6d] awx.main.scheduler Finishing Scheduler
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(6)
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(2)
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(6)
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(2)
2022-06-09 17:55:28,058 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.dispatch task fbd882e9-126b-4aeb-a874-23035b40ebb4 starting awx.main.tasks.system.awx_periodic_scheduler(
[])
2022-06-09 17:55:28,057 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.dispatch scaling down worker pid:101979
2022-06-09 17:55:28,061 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch worker exiting gracefully pid:101979
2022-06-09 17:55:28,090 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.tasks.system Starting periodic scheduler
2022-06-09 17:55:28,097 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.tasks.system Last scheduler run was: 2022-06-09 17:54:58.081189+00:00
2022-06-09 17:55:28,570 ERROR [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.dispatch job 21317 (failed) is no longer running; reaping

@kialam
Copy link
Member

kialam commented Jun 22, 2022

Possible duplicate of #12297

@fosterseth
Copy link
Member

looks like we want to provide extra error handling when creating the object instead of just returning a None type, @AlanCoding knows the relevant code around this

@AlanCoding
Copy link
Member

It's a plain method that's causing the trouble. If the update_model method fails to return here because the retry count has ran out, then it does nothing.

if _attempt < _max_attempts:
time.sleep(5)
return update_model(model, pk, _attempt=_attempt + 1, _max_attempts=_max_attempts, **updates)
else:
logger.error('Failed to update %s after %d retries.', model._meta.object_name, _attempt)

That means that the method completes with no return, in other words, returning None. This is a very very bad thing to do. It would be much more helpful to re-raise the exception.

We may need to be careful about any calls to update_model in the tasks/jobs.py module that are outside of a try-except block. Fundamentally though, if we don't have database access, we can't update the status. So we should yell and scream in the logs, but nonetheless exit.

@muchafel
Copy link

muchafel commented Jul 12, 2022

Hi there,

super helpful for me would be if you also could add a debug message when an attempt fails:

if _attempt < _max_attempts: 
     time.sleep(5) 
     logger.debug('attempt failed. Will Rretry (attempt s%/d%',  _attempt, _max_attempts) 
     return update_model(model, pk, _attempt=_attempt + 1, _max_attempts=_max_attempts, **updates) 
 else: 
     logger.error('Failed to update %s after %d retries.', model._meta.object_name, _attempt) 

also is there a way to increase the _max_attempts via configuration?

(I am struggling with this error for quite a while now, imho it is not uncommon that you have unstable connections, if you rely on cheap DAaaS)

@AlanCoding
Copy link
Member

With some recent merges, that method should be raising the exception and give some better error details, which is ultimately probably a database error. I hope we can get this out in a new release soon.

@muchafel
Copy link

maybe some additional observation: I can reliably reproduce the issue, if I go to the server-side and drop all client connections. Issue persist as long as I don't restart awx. In my view there is something wrong with how django handles the connection in its pool.Helpful would be a way to configure max_conn_age

@CMPluto
Copy link

CMPluto commented Mar 3, 2023

any update on this issue? I am using
AWX 21.5.0

@tree4125
Copy link
Author

tree4125 commented Mar 11, 2023

We increased the amount of reserved cpu and memory to the containers and it helped, we still see it from time to time on larger jobs. i think there is an issue where if awx task container loses it's database connection for any reason instead of reconnecting it gets stuck in a failed state indefinitely. I believe this could be fixed with some django database options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants