AWX DB migration gets in loop and upgrade never ends #14314

Closed
7 of 11 tasks
elibogomolnyi opened this issue Aug 3, 2023 · 1 comment · Fixed by #14566


elibogomolnyi commented Aug 3, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

When upgrading from 22.3 to 22.5, we noticed that the upgrade took much longer than expected, was stuck in the DB migration process, and didn't allow AWX to start. This upgrade included one of the database migration tasks.

This issue happened because of a bug in AWX DB migration logic. When we upgrade AWX, the AWX operator starts the DB migration. The following script is responsible for the retry mechanism.

The script makes a total of 30 retries to check the status of AWX migration. It will wait for a maximum of 60 seconds (TIMEOUT=60) for each attempt. However, an exponential backoff strategy dynamically calculates the waiting time between attempts.

The next_sleep function calculates the waiting time before the next attempt. It starts with MIN_SLEEP (0.5 seconds) and doubles the value with each attempt until it reaches MAX_SLEEP (30 seconds). So, the waiting times between attempts will be as follows:


Attempt 1: 0.5 seconds
Attempt 2: 1 second
Attempt 3: 2 seconds
Attempt 4: 4 seconds
Attempt 5: 8 seconds
Attempt 6: 16 seconds
Attempt 7: 30 seconds (maximum reached)
Attempt 8 and beyond: 30 seconds (maximum reached)


After the 7th attempt, the waiting time stays constant at 30 seconds, since it has reached the configured maximum. If none of the attempts succeeds within its TIMEOUT period (60 seconds), the script fails with the error message "ERROR: Database migrations not applied", returns exit code 1, and stops the migration task.
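
As a rough illustration of the schedule listed above, the doubling-and-capping rule can be sketched as follows. This is a minimal sketch and not the actual AWX script: only MIN_SLEEP, MAX_SLEEP and the name next_sleep come from the description, while print_schedule is a hypothetical helper added for the demo.

```shell
#!/usr/bin/env bash
# Sketch of the backoff schedule described above; not the actual AWX script.
MIN_SLEEP=0.5   # seconds, value quoted in the issue
MAX_SLEEP=30    # seconds, value quoted in the issue

# Double the previous sleep and cap it at MAX_SLEEP.
# awk is used because shell arithmetic is integer-only.
next_sleep() {
    awk -v prev="$1" -v max="$MAX_SLEEP" \
        'BEGIN { ns = prev * 2; if (ns > max) ns = max; print ns }'
}

# Print the sleep used for each of the first N attempts (hypothetical helper).
print_schedule() {
    local attempts="$1"
    local sleep_for="$MIN_SLEEP"
    local attempt=1
    while [ "$attempt" -le "$attempts" ]; do
        echo "Attempt ${attempt}: ${sleep_for}s"
        sleep_for="$(next_sleep "$sleep_for")"
        attempt=$((attempt + 1))
    done
}

print_schedule 8
```

The 7th value would be 32 seconds without the cap, which is why it is clamped to 30.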

This means that with a large DB, where a particular migration task needs more time than the script allows, the migration never finishes: it is interrupted and started again in a loop.
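
To get a feel for how much time a migration has before the script gives up, the numbers above can be added up. The sketch below assumes that every attempt uses its full TIMEOUT and that one sleep follows every attempt; the real script may differ on both points, so treat the result as a rough upper bound only. The function name worst_case_seconds is hypothetical.

```shell
#!/usr/bin/env bash
# Rough upper bound on the total wait, using the values quoted in the issue.
# Assumptions: each attempt runs for its full TIMEOUT, and one sleep follows
# every attempt. The real script may differ in both respects.
ATTEMPTS=30
TIMEOUT=60
MIN_SLEEP=0.5
MAX_SLEEP=30

worst_case_seconds() {
    awk -v attempts="$ATTEMPTS" -v timeout="$TIMEOUT" \
        -v min="$MIN_SLEEP" -v max="$MAX_SLEEP" 'BEGIN {
            total = 0; s = min
            for (i = 1; i <= attempts; i++) {
                total += timeout + s              # one check plus the sleep after it
                s = (s * 2 > max) ? max : s * 2   # double, capped at MAX_SLEEP
            }
            print total
        }'
}

echo "Worst case: $(worst_case_seconds) seconds"
```

Under these assumptions the budget is about 42.5 minutes, so a migration step that runs longer than that on a large database would be cut off.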

The workaround:
Together with the AWX community, we found and implemented a workaround for this issue. To apply it, do the following:

  1. In the K8s deployments in the AWX namespace:
    Scale in awx-operator-controller-manager replicas to 0
    Scale in awx-task replicas to 0
    Scale in awx-web replicas to 0
  2. In awx-task deployment, find the following section:
    - name: awx-task
      image: pegasus-docker.artifactory.cyberng.com/pegasus-awx:22.5.0
      args:
        - /usr/bin/launch_awx_task.sh
      env:
        - name: AWX_COMPONENT
          value: task
        - name: SUPERVISOR_CONFIG_PATH
          value: /etc/supervisord_task.conf
        - name: AWX_SKIP_MIGRATIONS
          value: '1'
        - name: MY_POD_UID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.uid

Replace the args section with the following:

    - name: awx-task
      image: pegasus-docker.artifactory.cyberng.com/pegasus-awx:22.5.0
      args:
        - awx-manage
        - migrate
        - --noinput
      env:
        - name: AWX_COMPONENT
          value: task
        - name: SUPERVISOR_CONFIG_PATH
          value: /etc/supervisord_task.conf
        - name: AWX_SKIP_MIGRATIONS
          value: '1'
        - name: MY_POD_UID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.uid
  3. Scale out the awx-task deployment to 1 replica. You should see the following message in the task container logs:
    [image]
  4. Wait until the migration is finished. The pod should finish running, and you should see the following log, which shows that the migration is finished:
    [image]
  5. Scale out the awx-operator-controller-manager deployment to 1 replica. It will automatically delete the task you modified and scale out the AWX deployments.
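
For reference, the scaling steps of the workaround could be scripted with kubectl roughly as follows. This is a sketch under assumptions: the namespace name "awx" is a guess (the steps only say "the AWX namespace"), the function names are hypothetical, and the args change from step 2 still has to be made by hand, for example with `kubectl edit deployment awx-task`.

```shell
#!/usr/bin/env bash
# Sketch of the workaround as kubectl commands. Assumptions: the namespace
# is "awx" and the deployments carry the names used in the steps above.
NAMESPACE="${NAMESPACE:-awx}"

# Step 1: scale the operator, task and web deployments to 0 replicas.
scale_down_awx() {
    kubectl scale deployment awx-operator-controller-manager --replicas=0 -n "$NAMESPACE"
    kubectl scale deployment awx-task --replicas=0 -n "$NAMESPACE"
    kubectl scale deployment awx-web --replicas=0 -n "$NAMESPACE"
}

# Step 3: start one task pod. Run this only after the args were edited (step 2).
start_migration_pod() {
    kubectl scale deployment awx-task --replicas=1 -n "$NAMESPACE"
}

# Step 4: follow the migration output. This may need a retry until the
# container has actually started.
follow_migration_logs() {
    kubectl logs -f deployment/awx-task -c awx-task -n "$NAMESPACE"
}

# Step 5: bring the operator back so that it restores the AWX deployments.
restore_operator() {
    kubectl scale deployment awx-operator-controller-manager --replicas=1 -n "$NAMESPACE"
}
```

The functions are only defined here, not called, so each step can be run and checked one at a time.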

AWX version

22.5

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Upgrade the AWX with the large database from 22.3 to 22.5

Expected results

DB migration task is running without interruption

Actual results

DB migration task is interrupted with the pod restart

Additional information

No response

@TheRealHaoLiu
Member

#14311 proposed solution

@fosterseth fosterseth changed the title from "AWX DB migration gets in loop and upgrade newer ends" to "AWX DB migration gets in loop and upgrade never ends" Aug 9, 2023
TheRealHaoLiu added a commit to TheRealHaoLiu/awx that referenced this issue Oct 11, 2023
TheRealHaoLiu added a commit to TheRealHaoLiu/awx-operator that referenced this issue Oct 11, 2023
related to ansible/awx#14314 and
ansible/awx#14566

each step of the migration can run for undetermined amount of time

there's no output between migration steps so the exec connection might be killed due to idle

adding background keepalive to prevent exec to be killed due to idle connection
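
The keepalive idea in this commit message can be sketched in shell as follows. This is only an illustration of the general technique and not the code from the referenced commit; with_keepalive and KEEPALIVE_INTERVAL are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: run a long, silent command while a background loop prints a
# line at a fixed interval, so the connection carrying the output is never
# idle. with_keepalive and KEEPALIVE_INTERVAL are hypothetical names.
with_keepalive() {
    local interval="${KEEPALIVE_INTERVAL:-10}"
    # Background loop that produces output until it is killed.
    ( while true; do echo "keepalive"; sleep "$interval"; done ) &
    local keepalive_pid=$!
    # Run the wrapped command in the foreground and remember its exit code.
    "$@"
    local rc=$?
    # Stop the background loop and reap it quietly.
    kill "$keepalive_pid" 2>/dev/null
    wait "$keepalive_pid" 2>/dev/null
    return "$rc"
}
```

A call such as `with_keepalive awx-manage migrate --noinput` would then emit a line every 10 seconds while the migration runs and return the exit code of the migration itself.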
TheRealHaoLiu added a commit to TheRealHaoLiu/awx that referenced this issue Oct 12, 2023
TheRealHaoLiu added a commit to TheRealHaoLiu/awx that referenced this issue Oct 13, 2023
TheRealHaoLiu added a commit to TheRealHaoLiu/awx that referenced this issue Oct 13, 2023
fixes ansible#14314

Removing retry attempt limit when waiting for migration to complete.