-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
awx intermittently reports non-complete playbook as failed and stops outputing log #12400
Comments
Possible duplicate of #12297 |
looks like we want to provide extra error handling when creating the object instead of just returning a None type, @AlanCoding knows the relevant code around this |
It's a plain method that's causing the trouble. If the awx/awx/main/utils/update_model.py Lines 39 to 43 in 29d6084
That means that the method completes with no return, in other words, returning We may need to be careful about any calls to |
Hi there, super helpful for me would be if you also could add a debug message when an attempt fails:
also is there a way to increase the _max_attempts via configuration? (I am struggling with this error for quite a while now, imho it is not uncommon that you have unstable connections, if you rely on cheap DAaaS) |
With some recent merges, that method should be raising the exception and give some better error details, which is ultimately probably a database error. I hope we can get this out in a new release soon. |
maybe some additional observation: I can reliably reproduce the issue, if I go to the server-side and drop all client connections. Issue persist as long as I don't restart awx. In my view there is something wrong with how django handles the connection in its pool.Helpful would be a way to configure |
any update on this issue? I am using |
We increased the amount of reserved cpu and memory to the containers and it helped, we still see it from time to time on larger jobs. i think there is an issue where if awx task container loses it's database connection for any reason instead of reconnecting it gets stuck in a failed state indefinitely. I believe this could be fixed with some django database options. |
Please confirm the following
Bug Summary
Shortly after starting a playbook using the AWX UI it is marked as failed. log output is incomplete and the job is still running in the background.
AWX version
21.0.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
This has been happening randomly for the last few versions of AWX.
Expected results
Playbook should be marked as successful and have full log output in AWX.
Actual results
Log output incomplete or empty and playbook is marked as failed.
Additional information
2022-06-09 17:55:18,024 ERROR [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch Worker failed to run task awx.main.tasks.jobs.RunJob(*[21317], **{}
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 587, in run
self.final_run_hook(self.instance, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 962, in final_run_hook
super(RunJob, self).final_run_hook(job, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 376, in final_run_hook
instance.log_lifecycle("finalize_run")
AttributeError: 'NoneType' object has no attribute 'log_lifecycle'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 93, in _wrapped
return f(self, args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
logger.exception('{} Final run hook errored.'.format(self.instance.log_format))
AttributeError: 'NoneType' object has no attribute 'log_format'
2022-06-09 17:55:18,028 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch task 93552643-cd13-40d3-a0d6-688f6b4e06c1 starting awx.main.tasks.system.handle_work_error(['93552643-cd13-40d3-a0d6-688f6b4e06c1'])
2022-06-09 17:55:18,029 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.tasks.system Executing error task id 93552643-cd13-40d3-a0d6-688f6b4e06c1, subtasks: [{'type': 'job', 'id': 21317}]
2022-06-09 17:55:18,031 ERROR [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch Worker failed to run task awx.main.tasks.system.handle_work_error(['93552643-cd13-40d3-a0d6-688f6b4e06c1'], **{'subtasks': [{'type': 'job', 'id': 21317}]}
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 587, in run
self.final_run_hook(self.instance, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 962, in final_run_hook
super(RunJob, self).final_run_hook(job, status, private_data_dir, fact_modification_times)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 376, in final_run_hook
instance.log_lifecycle("finalize_run")
AttributeError: 'NoneType' object has no attribute 'log_lifecycle'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 93, in _wrapped
return f(self, *args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 589, in run
logger.exception('{} Final run hook errored.'.format(self.instance.log_format))
AttributeError: 'NoneType' object has no attribute 'log_format'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 685, in handle_work_error
instance = UnifiedJob.get_instance_by_type(each_task['type'], each_task['id'])
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 495, in get_instance_by_type
return model.objects.get(id=job_id)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 431, in get
num = len(clone)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 262, in len
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/polymorphic/query.py", line 58, in _polymorphic_iterator
o = next(base_iter)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 51, in iter
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1173, in execute_sql
cursor = self.connection.cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in exit
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2022-06-09 17:55:18,034 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch task 93552643-cd13-40d3-a0d6-688f6b4e06c1 starting awx.main.tasks.system.handle_work_success([])
2022-06-09 17:55:18,037 ERROR [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch Worker failed to run task awx.main.tasks.system.handle_work_success([], **{'task_actual': {'type': 'job', 'id': 21317}}
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
result = self.run_callable(body)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
return _call(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 666, in handle_work_success
instance = UnifiedJob.get_instance_by_type(task_actual['type'], task_actual['id'])
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/unified_jobs.py", line 495, in get_instance_by_type
return model.objects.get(id=job_id)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 431, in get
num = len(clone)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 262, in len
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/polymorphic/query.py", line 58, in _polymorphic_iterator
o = next(base_iter)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 51, in iter
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1173, in execute_sql
cursor = self.connection.cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in exit
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 237, in _cursor
return self._prepare_cursor(self.create_cursor(name))
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 236, in create_cursor
cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2022-06-09 17:55:18,075 DEBUG [42e499902d394546855280338c664d6d] awx.main.dispatch task 9ee34e0c-b973-457a-870e-0664ef46542d starting awx.main.scheduler.tasks.run_task_manager([])
2022-06-09 17:55:18,077 DEBUG [42e499902d394546855280338c664d6d] awx.main.scheduler Running task manager.
2022-06-09 17:55:18,105 DEBUG [42e499902d394546855280338c664d6d] awx.main.scheduler Starting Scheduler
2022-06-09 17:55:18,485 DEBUG [42e499902d394546855280338c664d6d] awx.main.scheduler Finishing Scheduler
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(6)
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(2)
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(6)
2022-06-09 17:55:19,065 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(2)
2022-06-09 17:55:28,058 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.dispatch task fbd882e9-126b-4aeb-a874-23035b40ebb4 starting awx.main.tasks.system.awx_periodic_scheduler([])
2022-06-09 17:55:28,057 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.dispatch scaling down worker pid:101979
2022-06-09 17:55:28,061 DEBUG [7fffc6e497e7415c90df446bd2ff7720] awx.main.dispatch worker exiting gracefully pid:101979
2022-06-09 17:55:28,090 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.tasks.system Starting periodic scheduler
2022-06-09 17:55:28,097 DEBUG [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.tasks.system Last scheduler run was: 2022-06-09 17:54:58.081189+00:00
2022-06-09 17:55:28,570 ERROR [abc4094c5fe74a9b9adfe1e5d3f8f819] awx.main.dispatch job 21317 (failed) is no longer running; reaping
The text was updated successfully, but these errors were encountered: