null value in column "ip_address" of relation "main_instance" violates not-null constraint #14486

Closed
5 of 11 tasks
renanguilhermef opened this issue Sep 27, 2023 · 7 comments


@renanguilhermef

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

After upgrading AWX from 22.6.0 to 23.2.0, awx-task crashes and restarts after deployment with the error below:

  [wait-for-migrations] Waiting for database migrations...
  [wait-for-migrations] Attempt 1 of 30
  Traceback (most recent call last):
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
      return self.cursor.execute(sql, params)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/cursor.py", line 723, in execute
      raise ex.with_traceback(None)
  psycopg.errors.NotNullViolation: null value in column "ip_address" of relation "main_instance" violates not-null constraint
  DETAIL:  Failing row contains (41, 36b83143-4f22-424e-bd15-2919c4ae7106, awx-task-6db5b9659f-hcvq2, 2023-09-27 17:37:18.521329+00, 2023-09-27 17:37:18.521353+00, 0, 22.6.0, 1.00, 4.0, 16819769344, 0, 0, t, t, null, control, 2023-09-27 17:40:23.983469+00, Instance received normal shutdown signal, 2023-09-27 17:40:22.51428+00, null, unavailable, null, f).
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django_pglocks/__init__.py", line 74, in advisory_lock
      yield acquired
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/pglock.py", line 14, in advisory_lock
      yield internal_lock
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/managers.py", line 131, in register
      other_inst.save(update_fields=['ip_address'])
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/ha.py", line 53, in save
      super(BaseModel, self).save(*args, **kwargs)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 814, in save
      self.save_base(
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 877, in save_base
      updated = self._save_table(
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 990, in _save_table
      updated = self._do_update(
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/base.py", line 1054, in _do_update
      return filtered._update(values) > 0
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/query.py", line 1231, in _update
      return query.get_compiler(self.db).execute_sql(CURSOR)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1984, in execute_sql
      cursor = super().execute_sql(result_type)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1562, in execute_sql
      cursor.execute(sql, params)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
      return self._execute_with_wrappers(
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
      return executor(sql, params, many, context)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
      return self.cursor.execute(sql, params)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 91, in __exit__
      raise dj_exc_value.with_traceback(traceback) from exc_value
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 89, in _execute
      return self.cursor.execute(sql, params)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/cursor.py", line 723, in execute
      raise ex.with_traceback(None)
  django.db.utils.IntegrityError: null value in column "ip_address" of relation "main_instance" violates not-null constraint
  DETAIL:  Failing row contains (41, 36b83143-4f22-424e-bd15-2919c4ae7106, awx-task-6db5b9659f-hcvq2, 2023-09-27 17:37:18.521329+00, 2023-09-27 17:37:18.521353+00, 0, 22.6.0, 1.00, 4.0, 16819769344, 0, 0, t, t, null, control, 2023-09-27 17:40:23.983469+00, Instance received normal shutdown signal, 2023-09-27 17:40:22.51428+00, null, unavailable, null, f).
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
      sys.exit(manage())
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 200, in manage
      execute_from_command_line(sys.argv)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
      utility.execute()
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
      self.fetch_command(subcommand).run_from_argv(self.argv)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
      self.execute(*args, **cmd_options)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
      output = self.handle(*args, **options)
    File "/usr/lib64/python3.9/contextlib.py", line 79, in inner
      return func(*args, **kwds)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/provision_instance.py", line 61, in handle
      self._register_hostname(options.get('hostname'), options.get('node_type'), options.get('uuid'))
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/provision_instance.py", line 38, in _register_hostname
      (changed, instance) = Instance.objects.register(ip_address=os.environ.get('MY_POD_IP'), node_type='control', node_uuid=settings.SYSTEM_UUID)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/managers.py", line 177, in register
      instance = self.create(hostname=hostname, ip_address=ip_address, node_type=node_type, **create_defaults, **uuid_option)
    File "/usr/lib64/python3.9/contextlib.py", line 137, in __exit__
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/pglock.py", line 14, in advisory_lock
      yield internal_lock
    File "/usr/lib64/python3.9/contextlib.py", line 137, in __exit__
      self.gen.throw(typ, value, traceback)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django_pglocks/__init__.py", line 80, in advisory_lock
      cursor.execute(command)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 67, in execute
      return self._execute_with_wrappers(
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
      return executor(sql, params, many, context)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/utils.py", line 83, in _execute
      self.db.validate_no_broken_transaction()
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 531, in validate_no_broken_transaction
      raise TransactionManagementError(
  django.db.transaction.TransactionManagementError: An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.

As a workaround, I dropped the NOT NULL constraint on the ip_address column of the main_instance table, and AWX stopped crashing. This is the table data:

[Screenshot of the main_instance table data]

AWX version

23.2.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

Kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

Firefox, Chrome, Edge

Steps to reproduce

kustomization.yml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/ansible/awx-operator/config/default?ref=2.6.0
  - awxsmt-secret.yml
  - awxsmt-deploy.yml

images:
  - name: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.0
    newName: xxxxxxx/xxxxxxxxx/awx/kube-rbac-proxy
    newTag: v0.8.0
  - name: quay.io/ansible/awx-operator:latest
    newName: xxxxxxxxxxxx/xxxxxxxxxawx-operator
    newTag: 2.6.0
namespace: awxm

awxsmt-deploy.yml

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  image: xxxxxxxxxxxx/awx
  image_version: 23.2.0
  redis_image: xxxxxxxxxxx/redis
  redis_image_version: latest
  ee_images:
  - name: smart-awx-ee
    image: xxxxxxxxx/smart-awx-ee:latest
  control_plane_ee_image: xxxxxxxxxxxxxxxx/smart-awx-ee:latest
  init_container_image: xxxxxxxxxxxxxxxxxxxx/smart-awx-ee
  init_container_image_version: "latest" 
  init_projects_container_image: xxxxxxxxxxxxxxxxxxxxx/xxxxxx-centos:latest
  postgres_configuration_secret: awxsmt-postgres-config
  ingress_tls_secret: awxsmt-tls-config
  ingress_type: ingress
  ingress_annotations: |
    kubernetes.io/ingress.allow-http: "false"
  hostname: xxxxxxxxxxxxxxxxxxxxx
  service_type: nodeport
  projects_persistence: true
  projects_storage_access_mode: ReadWriteOnce
  secret_key_secret: awxsmt-secret-key-config
  admin_password_secret: awxsmt-admin-password-config
  ipv6_disabled: true

Stopped AWX:

 kubectl delete -k .
namespace "awxm" deleted
customresourcedefinition.apiextensions.k8s.io "awxbackups.awx.ansible.com" deleted
customresourcedefinition.apiextensions.k8s.io "awxrestores.awx.ansible.com" deleted
customresourcedefinition.apiextensions.k8s.io "awxs.awx.ansible.com" deleted
serviceaccount "awx-operator-controller-manager" deleted
role.rbac.authorization.k8s.io "awx-operator-awx-manager-role" deleted
role.rbac.authorization.k8s.io "awx-operator-leader-election-role" deleted
clusterrole.rbac.authorization.k8s.io "awx-operator-metrics-reader" deleted
clusterrole.rbac.authorization.k8s.io "awx-operator-proxy-role" deleted
rolebinding.rbac.authorization.k8s.io "awx-operator-awx-manager-rolebinding" deleted
rolebinding.rbac.authorization.k8s.io "awx-operator-leader-election-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "awx-operator-proxy-rolebinding" deleted
configmap "awx-operator-awx-manager-config" deleted
secret "awxsmt-admin-password-config" deleted
secret "awxsmt-postgres-config" deleted
secret "awxsmt-secret-key-config" deleted
secret "awxsmt-tls-config" deleted
service "awx-operator-controller-manager-metrics-service" deleted
deployment.apps "awx-operator-controller-manager" deleted
Error from server (NotFound): error when deleting ".": the server could not find the requested resource (delete awxs.awx.ansible.com awx)

Started AWX with the new images:

kubectl apply -k .      
namespace/awxm created
customresourcedefinition.apiextensions.k8s.io/awxbackups.awx.ansible.com created
customresourcedefinition.apiextensions.k8s.io/awxrestores.awx.ansible.com created
customresourcedefinition.apiextensions.k8s.io/awxs.awx.ansible.com created
serviceaccount/awx-operator-controller-manager created
role.rbac.authorization.k8s.io/awx-operator-awx-manager-role created
role.rbac.authorization.k8s.io/awx-operator-leader-election-role created
clusterrole.rbac.authorization.k8s.io/awx-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/awx-operator-proxy-role created
rolebinding.rbac.authorization.k8s.io/awx-operator-awx-manager-rolebinding created
rolebinding.rbac.authorization.k8s.io/awx-operator-leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/awx-operator-proxy-rolebinding created
configmap/awx-operator-awx-manager-config created
secret/awxsmt-admin-password-config created
secret/awxsmt-postgres-config created
secret/awxsmt-secret-key-config created
secret/awxsmt-tls-config created
service/awx-operator-controller-manager-metrics-service created
deployment.apps/awx-operator-controller-manager created
awx.awx.ansible.com/awx created

Expected results

NAME                                               READY   STATUS    RESTARTS   AGE
awx-operator-controller-manager-7967797cb4-x77tk   2/2     Running   0          71s
awx-task-5b5fb6b9d-9z4vs                           4/4     Running   0          33s
awx-web-6cdfdb4f-vn2tb                             3/3     Running   0          12s

Actual results

Without the workaround:

NAME                                               READY   STATUS             RESTARTS         AGE
awx-operator-controller-manager-7967797cb4-gxh85   2/2     Running            0                41m
awx-task-65c7c4c86-9ps92                           3/4     CrashLoopBackOff   12 (2m45s ago)   40m
awx-web-778797f5df-jgnhd                           3/3     Running            0                40m

With the workaround:

NAME                                               READY   STATUS    RESTARTS         AGE
awx-operator-controller-manager-7967797cb4-gxh85   2/2     Running   0                46m
awx-task-65c7c4c86-9ps92                           4/4     Running   13 (7m39s ago)   45m
awx-web-778797f5df-jgnhd                           3/3     Running   0                45m

Additional information

The images are pulled from a local registry and are otherwise unchanged.

The awx-ee image is customized with the following:

requirements.txt

hszinc==1.3.1
hvac==0.11.2
pyhaystack==3.0.0
urllib3

requirements.yml

collections:
  - name: awx.awx
  - name: community.vmware
  - name: kubernetes.core
  - name: ansible.posix
  - name: ansible.windows
  - name: community.windows
  - name: community.general
@h3poteto

I have the same issue.

@erik-de-neve

Same here

@fosterseth fosterseth self-assigned this Oct 4, 2023
@fosterseth
Member

When the task containers come up, they should be adding an IP address based on the MY_POD_IP container environment variable:

https://github.com/fosterseth/awx/blob/81e06dace2db58b6d9625c5edf73e2298a00cf15/awx/main/management/commands/provision_instance.py#L40C19-L40C19

Can someone shell into the task container, run printenv, and check whether that environment variable is set?
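The traceback in the issue shows provision_instance passing os.environ.get('MY_POD_IP') straight into Instance.objects.register, so an unset variable arrives as None. A minimal sketch of the same lookup, assuming a Python interpreter is available in the task container; the function name pod_ip_from_env is hypothetical and not part of AWX:

```python
import os
from typing import Mapping, Optional


def pod_ip_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the pod IP that provisioning would see, or None if it is missing.

    Mirrors the os.environ.get('MY_POD_IP') lookup in provision_instance.
    Treating an empty or whitespace-only value as missing is an assumption
    of this sketch, not documented AWX behaviour.
    """
    source = os.environ if env is None else env
    value = (source.get("MY_POD_IP") or "").strip()
    return value or None


if __name__ == "__main__":
    ip = pod_ip_from_env()
    print(f"MY_POD_IP={ip}" if ip else "MY_POD_IP is not set")
```

Passing a mapping instead of reading os.environ directly keeps the check easy to exercise outside the container.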

@fosterseth
Member

Also, can folks who are experiencing this let us know which version you are upgrading from and which version you are upgrading to?

Also, please provide the output of kubectl get pods -o yaml.

This will help us dig into it.
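The same check can be made from outside the container by looking for an env entry named MY_POD_IP in each pod's spec. A sketch, assuming you save the output of kubectl get pods -o json (JSON rather than YAML, so only the standard library is needed) to a file; the helper containers_missing_env is hypothetical:

```python
import json
import sys
from typing import Dict, List


def containers_missing_env(pods: dict, var: str = "MY_POD_IP") -> Dict[str, List[str]]:
    """Map each pod name to the containers whose spec defines no env entry `var`.

    `pods` is the parsed output of `kubectl get pods -o json`: a List object
    whose `items` are Pod objects (metadata.name, spec.containers[].env[].name).
    Only spec.containers and explicit `env` entries are inspected.
    """
    missing: Dict[str, List[str]] = {}
    for pod in pods.get("items", []):
        pod_name = pod["metadata"]["name"]
        for container in pod.get("spec", {}).get("containers", []):
            env_names = {entry.get("name") for entry in container.get("env") or []}
            if var not in env_names:
                missing.setdefault(pod_name, []).append(container["name"])
    return missing


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: kubectl get pods -o json > pods.json; python check_pod_env.py pods.json
    with open(sys.argv[1]) as fh:
        for pod_name, names in containers_missing_env(json.load(fh)).items():
            print(f"{pod_name}: MY_POD_IP not defined in {', '.join(names)}")
```

Not every container in a pod necessarily needs MY_POD_IP, and variables injected through envFrom are not detected, so treat the output as a starting point rather than a verdict.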

@fosterseth
Member

Okay, I know the problem now.

https://github.com/fosterseth/awx/blob/441336301e401d2af860636883f040a68e3ef360/awx/main/managers.py#L133

When a conflicting IP address is detected, we set the old one to null instead of "", which violates the NOT NULL constraint on main_instance.ip_address.
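The failure mode can be reproduced outside AWX. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL and a simplified, hypothetical main_instance table with only hostname and ip_address columns. clear_conflicting_ips is a hypothetical function loosely modeled on the conflict handling described above, not AWX's actual code. Clearing the old row with NULL trips the NOT NULL constraint, while an empty string does not:

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for main_instance (simplified, hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE main_instance ("
        " hostname TEXT PRIMARY KEY,"
        " ip_address TEXT NOT NULL DEFAULT '')"
    )
    # An old task pod whose IP Kubernetes has since handed to the new pod.
    conn.execute("INSERT INTO main_instance VALUES ('awx-task-old', '10.42.0.17')")
    conn.commit()
    return conn


def clear_conflicting_ips(conn: sqlite3.Connection, hostname: str,
                          ip_address: str, cleared: Optional[str]) -> None:
    """Reset ip_address on every *other* instance holding `ip_address`.

    cleared=None mimics the buggy behaviour (NULL); cleared='' mimics the fix.
    """
    with conn:  # one transaction; rolled back if the UPDATE fails
        conn.execute(
            "UPDATE main_instance SET ip_address = ?"
            " WHERE ip_address = ? AND hostname != ?",
            (cleared, ip_address, hostname),
        )


if __name__ == "__main__":
    conn = make_db()
    try:
        clear_conflicting_ips(conn, "awx-task-new", "10.42.0.17", cleared=None)
    except sqlite3.IntegrityError as exc:
        print("clearing with NULL failed:", exc)
    clear_conflicting_ips(conn, "awx-task-new", "10.42.0.17", cleared="")
    print(conn.execute("SELECT * FROM main_instance").fetchall())
```

In the issue's traceback the failed UPDATE runs inside the advisory-lock transaction, which appears to be why the error surfaces a second time as TransactionManagementError when the lock is released.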

@fosterseth
Member

@erik-de-neve
@h3poteto
@renanguilhermef

Workaround
Delete the task pod and let a new one come up. Hopefully Kubernetes assigns it a new IP that doesn't conflict with the old pod's IP address.

If that doesn't work, follow these steps:

  1. Find the IP address of the new task pod (substitute your task pod's name):
    kubectl get pod awx-task-7f466b5947-22cqr -o jsonpath={.status.podIP}

  2. kubectl exec into the web pod (awx-web container) and run awx-manage shell_plus.

  3. In the Django shell, run Instance.objects.filter(ip_address='<insert_ip_address_from_step_1>').update(ip_address='')

Now the new pod should be able to come online and provision itself.
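Step 3 works because it does by hand what the fix does: once the stale row's ip_address is "", no other row matches the new pod's IP, so the code path that writes NULL never touches a row. A self-contained sketch of that reasoning, again with sqlite3 standing in for PostgreSQL; the schema and the functions setup, manual_cleanup and buggy_register are simplified and hypothetical:

```python
import sqlite3


def setup() -> sqlite3.Connection:
    # Simplified, hypothetical stand-in for main_instance.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE main_instance (hostname TEXT PRIMARY KEY,"
                 " ip_address TEXT NOT NULL DEFAULT '')")
    conn.execute("INSERT INTO main_instance VALUES ('awx-task-old', '10.42.0.17')")
    conn.commit()
    return conn


def manual_cleanup(conn: sqlite3.Connection, new_pod_ip: str) -> int:
    """Step 3 of the workaround: blank out any row holding the new pod's IP."""
    with conn:
        cur = conn.execute("UPDATE main_instance SET ip_address = ''"
                           " WHERE ip_address = ?", (new_pod_ip,))
        cleared = cur.rowcount
    return cleared


def buggy_register(conn: sqlite3.Connection, hostname: str, ip: str) -> None:
    """Hypothetical model of the reported behaviour: conflicts are cleared to NULL."""
    with conn:
        conn.execute("UPDATE main_instance SET ip_address = NULL"
                     " WHERE ip_address = ? AND hostname != ?", (ip, hostname))
        conn.execute("INSERT INTO main_instance VALUES (?, ?)", (hostname, ip))


if __name__ == "__main__":
    conn = setup()
    print("rows cleared:", manual_cleanup(conn, "10.42.0.17"))
    # No conflicting row is left, so the NULL-writing UPDATE matches nothing.
    buggy_register(conn, "awx-task-new", "10.42.0.17")
    print(conn.execute("SELECT * FROM main_instance ORDER BY hostname").fetchall())
```

Skipping manual_cleanup and calling buggy_register directly raises sqlite3.IntegrityError, matching the crash in the issue.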

@renanguilhermef
Author

Thanks, deleting the pod worked for me.
