Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: #638 Add cronjob config for database restore #674

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

MCatherine1994
Copy link
Collaborator

@MCatherine1994 MCatherine1994 commented Aug 16, 2024

Full documentation is at: https://github.com/bcgov/nr-fom/wiki/Database-Restore


Thanks for the PR!

Any successful deployments (not always required) will be available below.

Once merged, code will be promoted and handed off to following workflow run.


Thanks for the PR!

Any successful deployments (not always required) will be available below.

Once merged, code will be promoted and handed off to following workflow run.

@MCatherine1994 MCatherine1994 changed the title Feat/638 test db backup feat: #638 Add cronjob config for database restore Aug 19, 2024
@MCatherine1994 MCatherine1994 marked this pull request as ready for review August 19, 2024 23:33
db/openshift.deploy.yml Outdated Show resolved Hide resolved
db/openshift.deploy.yml Outdated Show resolved Hide resolved
echo "Running SQL file: $sql_file"
psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -d ${POSTGRES_DB} -f $sql_file
echo "Finish database restore"
fi
Copy link
Collaborator Author

@MCatherine1994 MCatherine1994 Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now the logic is that:

  1. Try to find the sql file first (in case already unzip it in the past)
  2. If not found, try to find the zipped sql file, and unzip it
  3. Get the sql file again, if not found then print "no backup sql found" and do nothing
  4. If found then run restore script

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should exit with error if no backup SQL file is found. Should check the return value of the psql commands and report errors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restore finishes with the old schema existing, which is fine. Part of the recovery process could be to remove that. Also, maybe this process should perform a backup first before doing the restore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we should remove that, otherwise we can't run the restore multiple times. Yeah, I like the idea to the cleanup before run the restore, thanks!!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when no backup file is found, it will just print "Not found" and do nothing for now. I'll check how to report errors.
Screen Shot 2024-08-20 at 2 41 06 PM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So rather than having the cron job finish successfully in the case of no backup SQL file found, we should have the job run marked as a failure (which I assume will happen with a non-zero exit code from the script).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When no backup file found, exit 1:
Screen Shot 2024-08-21 at 4 05 42 PM

When psql command failed, exit 1 (the following example failure happened when I pass an empty file psql -h fom-24-db -U ${POSTGRES_USER} -d ${POSTGRES_DB} -f ''):

Screen Shot 2024-08-21 at 4 12 54 PM

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to run backup before restore, but not sure if we want to include that in this script as well, it's getting complicated....

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably sufficient for now to just have our documented process say to do a manual backup before trying to do the restore. (and bring the app down before doing the backup).

app: ${NAME}-${ZONE}
cronjob: ${NAME}-${ZONE}
spec:
backoffLimit: ${{JOB_BACKOFF_LIMIT}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are these limits set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are set for the db backup cronjob, I thought maybe I can use them here as well

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure if that makes sense, since we have retry = never.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

um... this I'm not so sure... just follow the one for db backup

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to 0. The restart policy sets to never will ensure when the pod fails, it won't be restarted. But we still need to set backofflimit to make sure the job will not be retired.

echo "Running SQL file: $sql_file"
psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -d ${POSTGRES_DB} -f $sql_file
echo "Finish database restore"
fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restore finishes with the old schema existing, which is fine. Part of the recovery process could be to remove that. Also, maybe this process should perform a backup first before doing the restore?

@basilv
Copy link
Collaborator

basilv commented Aug 28, 2024

Reading the wiki process: "Do a manual backup of the current database (can do that in the database pod, so this temporary backup will be stored in the database volume)" -> the specific commands to run should be provided. I'm a little concerned by doing this versus the normal backup process as if you need to restore from this special backup you need a special restore process, but on the other hand if the regular restore process fails, this approach should still succeed. Just want to be careful that if this special backup isn't stored within a PVC, then if the DB pod restarts the backup will be lost.

# use the same image as our database, so we can run the psql command
image: image-registry.apps.silver.devops.gov.bc.ca/${OC_NAMESPACE}/${NAME}-${ZONE}-${COMPONENT}:${ZONE}-db
command: ["/bin/sh", "-c"]
args:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we talked about making this a script file that's loaded into the database image. But I'm okay if it is left like this.

exit 1
else
echo "Found SQL file, rename existing database, and create a new empty database"
psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -c "DROP DATABASE IF EXISTS ${OLD_FOM_DATABASE_NAME};" -c "ALTER DATABASE fom RENAME TO ${OLD_FOM_DATABASE_NAME};" -c "CREATE DATABASE fom;"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not checking return code from this psql call.

exit 1
else
echo "Found SQL file, rename existing database, and create a new empty database"
psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -c "DROP DATABASE IF EXISTS ${OLD_FOM_DATABASE_NAME};" -c "ALTER DATABASE fom RENAME TO ${OLD_FOM_DATABASE_NAME};" -c "CREATE DATABASE fom;"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process documented in the wiki could talk more about what to do if the restore fails (or wasn't the correct restore point and needs to be redone), which could mean picking a different value for old_fom_database_name, and/or one option for reverting the restore would be to drop the fom database and rename old_fom to fom...

@webgismd
Copy link
Contributor

Team Evergreen has in their backlog to look at DB restore/backup as part of STRA-- goal was to not use PVC but object storage. they had buckets created for this purpose, but I am not sure how far they have gone with it yet. @craigyu or @DerekRoberts may know?

@DerekRoberts
Copy link
Member

@webgismd Yup! @RMCampos is working on that. I'll be sidekicking and making it easier to repeat.

@basilv
Copy link
Collaborator

basilv commented Aug 30, 2024

@webgismd for FOM at least, the database is small and we store a limited number of backups so storage demands are limited. But there are some nice aspects to using object storage instead of a PVC - better resistance to ransomware-style attacks, especially if you can structure your object storage permissions to be insert-only... I'd be more worried though about every team developing a completely different backup/restore process, versus having a defined method for OpenShift Postgres DBs that all the teams can leverage.

@webgismd
Copy link
Contributor

I 100% agree here @basilv , FDS is essentially on its own to develop a pattern for itself. I don't see alot of tactile support from other teams in this area. Perhaps something to discuss with @RMCampos @jazzgrewal @craigyu @paulushcgcj @abschwenker @DerekRoberts

@DerekRoberts
Copy link
Member

@webgismd @basilv Big yes! Let's solve this and get it out there. Our teams, quickstart, developer.gov, stink overflow, etc.

@paulushcgcj
Copy link

@webgismd for FOM at least, the database is small and we store a limited number of backups so storage demands are limited. But there are some nice aspects to using object storage instead of a PVC - better resistance to ransomware-style attacks, especially if you can structure your object storage permissions to be insert-only... I'd be more worried though about every team developing a completely different backup/restore process, versus having a defined method for OpenShift Postgres DBs that all the teams can leverage.

The base here was derived from client. In our case, we will keep both a PVC and an S3 backup, as the PVC is a "hot" copy, that will be the more recent one while the S3 will have all past and all current ones, but will be handled as a "cold" copy. This was at least the idea/suggestion I gave and is what we're aiming to do on client.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants