AWXRestore fails - never ending loop of 'awxbackup-restore-db-management'. #879
Comments
It seems I'm having exactly the same problem (or very similar). |
Seeing the same behavior on 21.0.0. |
I think it's related to the backup_dir name containing the time. When I rename the backup folder to remove the time, the restore works correctly. |
Thank you for your reply.
However, the error remained identical. Kind regards |
When I remove
This is just a backup of vanilla awx installation with the operator, nothing fancy or custom, just like in OP. |
So based on this error message, it happens when we try to restore on top of an existing postgres database. If we remove the PV with the postgres data before the restore, it works... |
I should also mention that the PV needs to be removed in a way that does not give the awx pod a chance to re-create the schema. I scale down the awx pod as soon as it comes up during the restore process, delete the PV, and scale the awx pod back up after the restore is completed. |
I'm not sure if I understand your approach. I have an empty cluster in my scenario where I create only the awx-operator and the backup PVC. I then run the AWXRestore job to have a completely new AWX instance created from it. At this point I have no already deployed instance and no existing postgres database in the cluster. |
Yeah, so what happens is that the restore process creates an instance of the awx CRD from your backup. The operator picks up that instance and brings up awx and postgres. awx then sees an empty postgres database and initializes it. After that, the restore tries to apply the pgdump on top of that database, which fails. If I make sure that the restore process runs on top of an empty postgres instance, as opposed to the awx database initialized by the awx pod, it works. Note that I'm not implying that this is how it is supposed to operate; it is clearly a bug. I'm just sharing the workaround that worked for me, and hopefully providing some insight for the devs that might help fix it. It could be a race: if the restore kicks in before awx has had the chance to initialize the db, it might work out of the box, although I have never seen it work like that. |
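One way to see whether the restore would land on an empty or an already-initialized database, as the comment above describes, is to count the tables in the `public` schema before the restore runs. This is only a sketch: the function names are hypothetical, the pod name is whatever your postgres pod is called, and the `awx` user and database names are assumed defaults. It also assumes `psql` inside the postgres container can connect without a password, which may not hold for every image.

```shell
# Hypothetical helper: count tables in the public schema of the AWX database.
# Arguments (all assumptions): namespace, postgres pod name,
# database user (default "awx"), database name (default "awx").
awx_db_table_count() {
  tc_ns="$1"; tc_pod="$2"; tc_user="${3:-awx}"; tc_db="${4:-awx}"
  kubectl -n "$tc_ns" exec "$tc_pod" -- \
    psql -U "$tc_user" -d "$tc_db" -tAc \
    "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
}

# Succeeds (exit 0) only when the database has no tables yet.
# Returns 2 if the count could not be obtained at all.
awx_db_is_empty() {
  tc_count="$(awx_db_table_count "$@")" || return 2
  [ "$tc_count" -eq 0 ]
}
```

If `awx_db_is_empty <namespace> <postgres-pod>` fails with a non-zero count, the awx pod has probably already initialized the schema, which is the situation the comment above reports as failing.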
Wow, thanks a lot, this workaround actually works. However, since the operator continuously scales the AWX deployment back up to 1, it is a matter of good timing of:
In this case the restore job can go through. Once the restore pod is done, the AWX deployment can be scaled back to 1, and all data from the backup is finally restored. However, it would be really good to record this as a known bug and fix it: as long as an AWX restore job is running, the operator should not be allowed to start AWX instances. |
Interesting, maybe they "fixed" this in the last release. When I was testing last week, the operator did not scale the deployment back to 1. |
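Because the operator keeps scaling the deployment back up, the timing described above could be taken out of human hands with a small loop that re-applies the scale-down at a fixed interval while the restore runs. A minimal sketch follows; the function name, the attempt count, and the interval are hypothetical, and the deployment name depends on your installation.

```shell
# Hypothetical helper: repeatedly force a deployment to zero replicas.
# Arguments (assumptions): namespace, deployment name,
# number of attempts (default 30), seconds between attempts (default 10).
hold_deployment_at_zero() {
  hz_ns="$1"; hz_deploy="$2"; hz_attempts="${3:-30}"; hz_interval="${4:-10}"
  hz_i=0
  while [ "$hz_i" -lt "$hz_attempts" ]; do
    # The operator may scale the deployment up again, so re-apply each round.
    kubectl -n "$hz_ns" scale deployment "$hz_deploy" --replicas=0
    hz_i=$((hz_i + 1))
    sleep "$hz_interval"
  done
}
```

Something like `hold_deployment_at_zero <namespace> <awx-deployment> 60 5` would hold the deployment down for roughly five minutes. The loop does not detect when the restore has finished, so the deployment still has to be scaled back to 1 by hand afterwards.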
Hi folks, sorry for the delayed response here. Is anyone still seeing this on the latest version? |
I just did...
Scaled awx and postgres to 0, cleaned up the pvc/pv, scaled postgres back to 1, let the restore complete, and then scaled awx back up. |
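The sequence in the comment above might look roughly like this with kubectl. It is a sketch under assumptions: the function name is hypothetical, postgres is assumed to run as a StatefulSet (so its PVC is re-created from the volume claim template when it scales back up), and all resource names are passed in because they differ between operator versions.

```shell
# Hypothetical helper: give postgres a fresh, empty volume before a restore.
# Arguments (assumptions): namespace, AWX deployment name,
# postgres StatefulSet name, postgres PVC name.
reset_postgres_storage() {
  rp_ns="$1"; rp_awx="$2"; rp_sts="$3"; rp_pvc="$4"
  # Stop AWX first so it cannot re-initialize the schema.
  kubectl -n "$rp_ns" scale deployment "$rp_awx" --replicas=0 &&
  # Stop postgres so the PVC is no longer in use and can be deleted.
  kubectl -n "$rp_ns" scale statefulset "$rp_sts" --replicas=0 &&
  # Drop the old data; this blocks until no pod uses the claim.
  kubectl -n "$rp_ns" delete pvc "$rp_pvc" &&
  # Bring postgres back on an empty volume.
  kubectl -n "$rp_ns" scale statefulset "$rp_sts" --replicas=1
}
```

After the restore has completed, the AWX deployment is scaled back with `kubectl -n <namespace> scale deployment <awx-deployment> --replicas=1`. If the PV uses the `Retain` reclaim policy, it is not removed together with the PVC and has to be deleted separately.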
I just had time to re-test it with operator 1.1.0 and it appears to be working now. Sorry for the delay. |
I am facing the same issue with operator version 2.16.1 and AWX v24.3.1. The only difference is that I am just spinning up AWXBackup and pod |
I have the same issue: the backup job starts a psql container and then kills it. I haven't seen any error in the logs or in the pod describe output. Have you found a solution? |
No, I haven't. Not sure how to proceed at the moment. There is documentation about it, but it simply doesn't work! It's been a heck of a ride trying and testing different approaches to simply get a backup of AWX Tower and restore from that backup! |
Well, I upgraded the operator from 2.2.1 to the latest, 2.17.0, and it started working! Not sure what the issue was. |
@batchenr did you use this backup role & restore role without many modifications to back up & restore a complete AWX Tower? And if you had to customize the backup/restore files, it would be great if you could share them. |
Did you delete and update all the CRDs (if you are using helm)?
backup_pvc.yaml
backup.yaml
restore.yaml
I ran kubectl apply -f on the pvc and then on the backup. |
I am using a kustomization file. Do I need to add those pvc, backup & restore files to Kustomization.yaml?
My pvc.yaml
backup.yaml
restore.yaml
|
If you use kustomization, it's supposed to be the same. What is the error you're getting? |
@batchenr the pod: |
I see the same behavior as @vivekshete9 describes, endless re-spinning up. I use the most basic example as described in the docs.
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: <name>
  namespace: <namespace>
spec:
  deployment_name: <deploymentname>
  backup_storage_class: "<storageclass>"
  backup_storage_requirements: "1Gi"
  backup_pvc_namespace: "<namespace>"
  image_pull_policy: "IfNotPresent"
  clean_backup_on_delete: false
  no_log: true
With a container connected to the backup pvc, after a backup is "created", I see a lot of
Even with
|
I have a problem with the restore of an instance. This is an initial demo setup; we need to ensure a working restore before we can put AWX into production.
I installed a new instance and then created a SourceCode credential, a Project and a Template. I then ran a backup job. After that I deleted the whole AWX installation, including the operator, PVC and PV. Nothing exists in the AWX namespace anymore; from the AWX point of view it is an empty Kubernetes environment.
Then I created new PVs for backup and postgres, and a PVC for backup. The new backup PVC contains the backup from the previous run. I deployed the operator (without an instance) again. Then I tried to do a restore from the existing backup.
Expected Behavior
The restore will rebuild the postgres database as well as the instance and everything will be up to date with the backup.
Current Behavior
The restore job builds the postgres container and the 4 instance containers.
However, the migration has an error and then restarts both the migration-container and the 4 instance containers in a never-ending loop.
Environment
Steps to Reproduce
Detailed Description
AWX instance yaml
AWXBackup yaml
AWXRestore yaml
Error Message
A more detailed error message can be found in the attachment.
restore_awx_output.txt
Thank you very much for your help.