Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job stuck "running", Post run hook errored, DB connection crashed? #11027

Closed
3 tasks done
azrdev opened this issue Sep 6, 2021 · 13 comments
Closed
3 tasks done

Job stuck "running", Post run hook errored, DB connection crashed? #11027

azrdev opened this issue Sep 6, 2021 · 13 comments

Comments

@azrdev
Copy link

azrdev commented Sep 6, 2021

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

I have a WFJ which has completed all tasks but is stuck in the "running" state. kubectl logs shows traceback(s):

2021-09-06 10:21:20,837 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.analytics.job_lifecycle job-209 preparing playbook
2021-09-06 10:21:21,000 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.analytics.job_lifecycle job-209 running playbook
...
2021-09-06 11:04:56,228 ERROR    [-] awx.conf.settings Database settings are not available, using defaults.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 81, in _ctit_db_wrapper
    yield
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 411, in __getattr__
    value = self._get_local(name)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 333, in _get_local
    self._preload_cache()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 296, in _preload_cache
    for setting in Setting.objects.filter(key__in=settings_to_cache.keys(), user__isnull=True).order_by('pk'):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 274, in __iter__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
RESULT 2
OKREADY

...
2021-09-06 11:05:32,664 DEBUG    [f96d53f7b799434ca4f4a35cb069321e] awx.main.dispatch task 98e4ce8a-396e-4330-8f8c-eb50906f4516 starting awx.main.tasks.handle_success_and_failure_notifications(*[209])
2021-09-06 11:05:32,823 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks job 209 (running) Exception occurred while running task
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1384, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2900, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2941, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1036, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2982, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 249, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 251, in run
    self.event_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 214, in event_callback
    should_write = self.event_handler(event_data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1203, in event_handler
    self.instance.save(update_fields=['artifacts'])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 815, in save
    self_before = self.__class__.objects.get(pk=self.pk)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

2021-09-06 11:05:32,828 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks job 209 (running) finished running, producing 147 events.
2021-09-06 11:05:32,830 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks job 209 (running) Post run hook errored.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 164, in __get__
    rel_obj = self.field.get_cached_value(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/mixins.py", line 13, in get_cached_value
    return instance._state.fields_cache[cache_name]
KeyError: 'unified_job_template'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.ProtocolViolation: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1410, in run
    self.post_run_hook(self.instance, status)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1105, in post_run_hook
    instance.log_lifecycle("post_run")
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 1471, in log_lifecycle
    if self.unified_job_template:
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 178, in __get__
    rel_obj = self.get_object(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 145, in get_object
    return qs.get(self.field.get_reverse_related_filter(instance))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DatabaseError: server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

2021-09-06 11:05:32,839 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch Worker failed to run task awx.main.tasks.RunJob(*[209], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 759, in _wrapped
    return f(self, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1420, in run
    self.instance = self.update_model(pk)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 790, in update_model
    with transaction.atomic():
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/transaction.py", line 201, in __enter__
    connection.set_autocommit(False, force_begin_transaction_with_broken_autocommit=True)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 405, in set_autocommit
    self._set_autocommit(autocommit)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
django.db.utils.InterfaceError: connection already closed
2021-09-06 11:05:32,847 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch task 67c935a9-4d24-4652-b14c-9bd62b93fcda starting awx.main.tasks.handle_work_error(*['67c935a9-4d24-4652-b14c-9bd62b93fcda'])
2021-09-06 11:05:32,848 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.tasks Executing error task id 67c935a9-4d24-4652-b14c-9bd62b93fcda, subtasks: [{'type': 'job', 'id': 209}]
2021-09-06 11:05:32,850 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch Worker failed to run task awx.main.tasks.handle_work_error(*['67c935a9-4d24-4652-b14c-9bd62b93fcda'], **{'subtasks': [{'type': 'job', 'id': 209}]}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 759, in _wrapped
    return f(self, *args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1420, in run
    self.instance = self.update_model(pk)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 790, in update_model
    with transaction.atomic():
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/transaction.py", line 201, in __enter__
    connection.set_autocommit(False, force_begin_transaction_with_broken_autocommit=True)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 405, in set_autocommit
    self._set_autocommit(autocommit)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 239, in _set_autocommit
    self.connection.autocommit = autocommit
django.db.utils.InterfaceError: connection already closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 575, in handle_work_error
    instance = UnifiedJob.get_instance_by_type(each_task['type'], each_task['id'])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 496, in get_instance_by_type
    return model.objects.get(id=job_id)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2021-09-06 11:05:32,851 DEBUG    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch task 67c935a9-4d24-4652-b14c-9bd62b93fcda starting awx.main.tasks.handle_work_success(*[])
2021-09-06 11:05:32,855 ERROR    [8f2c583bfb674c6d9b3edaacd8a62ac6] awx.main.dispatch Worker failed to run task awx.main.tasks.handle_work_success(*[], **{'task_actual': {'type': 'job', 'id': 209}}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 556, in handle_work_success
    instance = UnifiedJob.get_instance_by_type(task_actual['type'], task_actual['id'])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/models/unified_jobs.py", line 496, in get_instance_by_type
    return model.objects.get(id=job_id)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 402, in get
    num = len(clone)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 256, in __len__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/polymorphic/query.py", line 60, in _polymorphic_iterator
    o = next(base_iter)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
django.db.utils.InterfaceError: connection already closed
2021-09-06 11:05:33,665 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-09-06 11:05:33,665 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2021-09-06 11:05:33,666 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(3)
2021-09-06 11:05:33,666 DEBUG    [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(3)
2021-09-06 11:05:37,724 WARNING  [f96d53f7b799434ca4f4a35cb069321e] awx.main.tasks Failed to even try to send notifications for job '2021-09-06 10:21:10.827687+00:00-209-running' due to job not being in finished state.

AWX version

19.2.0

Installation method

kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

If I knew /o\

Expected results

the job to terminate

Actual results

keeps "running"

Additional information

Customization: our database is external to the k8s cluster.

(Maybe) related tickets:

@Seb0042
Copy link

Seb0042 commented Sep 17, 2021

It has been corrected in 19.2.2.

@chrismeyersfsu
Copy link
Member

@azrdev is this fixed in >= 19.2.2 for you?

@azrdev
Copy link
Author

azrdev commented Sep 27, 2021

I'll tell you as soon as our AWX is updated, sorry I'll have to wait some days for that to happen

@azrdev
Copy link
Author

azrdev commented Oct 6, 2021

@chrismeyersfsu @Seb0042 on 19.3.0 we still see our long-running jobs' state not correctly synced with AWX, even though its stdout is displayed fully.

The UI-visible error has changed, though: it's not stuck "running" anymore, but fails with status

Task was marked as running but was not present in the job queue, so it has been marked as failed.

Is this the same issue as #10211 and #11087 ?

@shanemcd
Copy link
Member

shanemcd commented Oct 8, 2021

Are you using the database provisioned by the operator, or connecting to your own?

@shanemcd
Copy link
Member

shanemcd commented Oct 8, 2021

A couple more questions:

  • Is it possible that some control plane pods are getting evicted and re-launching?
  • Are you maybe scaling the control plane up and down around the time this happens?

@showblue
Copy link

I have the same issue with a fresh new 19.3.0 and 19.4.0 Kubernetes setup with external Postgres 12 DB. Every job that runs longer than 5 min will result in the same failed status except that the playbook runs were all successful.

$ awx jobs list -k -f human --filter id,name,created,finished,status,controller_node,job_explanation
id name         created                     finished                    status     controller_node      job_explanation
== ============ =========================== =========================== ========== ==================== =================================================================================================
4  linux-baseos 2021-10-14T10:11:54.567438Z 2021-10-14T10:14:09.641713Z failed     awx-6dc7478d4-vpcsw
5  linux-baseos 2021-10-14T10:19:06.411401Z 2021-10-14T10:24:28.104199Z failed     awx-6dc7478d4-vpcsw  Task was marked as running but was not present in the job queue, so it has been marked as failed.
7  linux-baseos 2021-10-18T03:13:13.772004Z 2021-10-18T04:23:28.737044Z successful awx-686f6c755c-7gn28
11 linux-baseos 2021-10-18T06:47:03.580629Z 2021-10-18T06:47:19.530979Z failed     awx-686f6c755c-7gn28
13 linux-baseos 2021-10-18T06:51:37.844249Z 2021-10-18T07:03:12.246307Z failed     awx-686f6c755c-7gn28 Task was marked as running but was not present in the job queue, so it has been marked as failed.
14 linux-baseos 2021-10-18T07:21:01.486229Z 2021-10-18T07:38:05.904964Z failed     awx-686f6c755c-7gn28 Task was marked as running but was not present in the job queue, so it has been marked as failed.
15 linux-baseos 2021-10-18T08:42:38.580710Z 2021-10-18T09:02:33.781595Z failed     awx-5b4f4f57c5-xhgpc Task was marked as running but was not present in the job queue, so it has been marked as failed.
17 linux-baseos 2021-10-18T12:00:44.663579Z 2021-10-18T12:23:20.193032Z canceled   awx-5b4f4f57c5-xhgpc
18 linux-baseos 2021-10-18T12:23:23.649428Z 2021-10-18T12:42:54.849812Z failed     awx-7bfcf8d59c-ggsbd Task was marked as running but was not present in the job queue, so it has been marked as failed.

Job 20 log

Oct 19 09:42:45  awx-7bfcf8d59c-ggsbd  awx-task DEBUG   [abc4f02e214443f3911319adb23d6591] awx.main.dispatch task 6b084733-eb45-45ca-9f26-577854b5ee10 starting awx.main.tasks.handle_success_and_failure_notifications(*[20])
Oct 19 09:42:45  awx-7bfcf8d59c-ggsbd  awx-task INFO    [00faf08544f440d98ab3e5e88a20dd4a] awx.main.commands.run_callback_receiver Event processing is finished for Job 20, sending notifications
Oct 19 09:42:45  awx-7bfcf8d59c-ggsbd  awx-task INFO    [00faf08544f440d98ab3e5e88a20dd4a] awx.main.commands.run_callback_receiver Event processing is finished for Job 20, sending notifications
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task ERROR   [00faf08544f440d98ab3e5e88a20dd4a] awx.main.tasks job 20 (running) Exception occurred while running task
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/related_descriptors.py", line 164, in __get__
    rel_obj = self.field.get_cached_value(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/fields/mixins.py", line 13, in get_cached_value
    return instance._state.fields_cache[cache_name]
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  KeyError: 'instance_group'
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  During handling of the above exception, another exception occurred:
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
Oct 19 09:42:46  awx-7bfcf8d59c-ggsbd  awx-task  psycopg2.OperationalError: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Complete job 20 log -->
job20-awx.txt

@azrdev
Copy link
Author

azrdev commented Oct 27, 2021

@shanemcd sorry for the delay after your questions:

Are you using the database provisioned by the operator, or connecting to your own?

We use an external database.
We're currently looking into the possibility that something else (database, firewall,...) drops long-lived (maybe unused) connections, but given there are the two named other tickets + showblue here, I'm not sure it's not a AWX's fault.

  • Is it possible that some control plane pods are getting evicted and re-launching?

  • Are you maybe scaling the control plane up and down around the time this happens?

no, this is not the case.

@azrdev
Copy link
Author

azrdev commented Oct 28, 2021

The DB admins told us that indeed the DB closes unused connections after 10 minutes.
Does AWX offer some kind of keepalive for the DB connection, in case we cannot change this on the DB side?

@CWollinger
Copy link
Contributor

Same here. With AWX version 19.4. Database and AWX launched by the AWX operator in the same k8s namespace

@chrismeyersfsu
Copy link
Member

AWX does not have a feature to explicitly keep the database connection alive. Maybe there is a Django feature you can tweak to do this for you?

@azrdev
Copy link
Author

azrdev commented Dec 2, 2021

@chrismeyersfsu django has CONN_MAX_AGE and mentions the problem explicitly:

If your database terminates idle connections after some time, you should set CONN_MAX_AGE to a lower value, so that Django doesn’t attempt to use a connection that has been terminated by the database server

on https://docs.djangoproject.com/en/3.2/ref/databases/#persistent-connections

I cannot see where AWX (specifically awx_task, I guess) sets its django config/options, which would also be the place to override CONN_MAX_AGE to something smaller than our external DB has -- by default it's apparently on a high value or None aka unlimited. Could you point to the relevant AWX code?

@shanemcd
Copy link
Member

Hello. Based on the few number of folks seeing this, it seems likely that the problem is with your environment. If you need help troubleshooting or are looking for help using AWX, try our mailing list or IRC channel:

#ansible-awx on https://libera.chat/
https://groups.google.com/forum/#!forum/awx-project

If after further troubleshooting you still think this is a bug in AWX, please open a new issue with any information you find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants