feat: #638 Add cronjob config for database restore #674

MCatherine1994 · 2024-08-16T22:14:28Z

Added a YAML file that sets up a cronjob for database restore. Refs: Test recovery of FOM database #638

Full documentation is at: https://github.com/bcgov/nr-fom/wiki/Database-Restore

Thanks for the PR!

Any successful deployments (not always required) will be available below.

Once merged, code will be promoted and handed off to following workflow run.

Main Merge Workflow

MCatherine1994 · 2024-08-20T21:51:29Z

db/openshift.deploy.yml

+                      echo "Running SQL file: $sql_file"
+                      psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -d ${POSTGRES_DB} -f $sql_file
+                      echo "Finish database restore"
+                    fi


So now the logic is:

Try to find the SQL file first (in case it was already unzipped in the past).

If not found, try to find the zipped SQL file and unzip it.

Look for the SQL file again; if it is still not found, print "no backup sql found" and do nothing.

If found, run the restore script.
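The lookup order above can be sketched as a small shell function. This is only an illustration, not the script in db/openshift.deploy.yml: the function name find_backup_sql, the .sql.gz extension, the use of gunzip, and picking the lexicographically last file are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the path of a restorable .sql file found in "$1".
# Returns non-zero when neither a plain nor a gzipped dump exists.
find_backup_sql() {
  backup_dir="$1"

  # 1. Prefer a dump that was already unzipped by an earlier run.
  sql_file=$(find "$backup_dir" -type f -name '*.sql' | sort | tail -n 1)

  if [ -z "$sql_file" ]; then
    # 2. Otherwise look for a gzipped dump and unzip it, keeping the original.
    gz_file=$(find "$backup_dir" -type f -name '*.sql.gz' | sort | tail -n 1)
    if [ -n "$gz_file" ]; then
      gunzip -c "$gz_file" > "${gz_file%.gz}" || return 1
      sql_file="${gz_file%.gz}"
    fi
  fi

  # 3. Still nothing to restore from: report it and fail.
  if [ -z "$sql_file" ] || [ ! -f "$sql_file" ]; then
    echo "no backup sql found in $backup_dir" >&2
    return 1
  fi

  echo "$sql_file"
}
```

A caller could then write sql_file=$(find_backup_sql "$BACKUP_DIR") and branch on the return code, where BACKUP_DIR is again a placeholder name.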

Should exit with error if no backup SQL file is found. Should check the return value of the psql commands and report errors.

The restore finishes with the old schema existing, which is fine. Part of the recovery process could be to remove that. Also, maybe this process should perform a backup first before doing the restore?

Yeah, we should remove that, otherwise we can't run the restore multiple times. I like the idea of doing the cleanup before running the restore, thanks!!

when no backup file is found, it will just print "Not found" and do nothing for now. I'll check how to report errors.

So rather than having the cron job finish successfully in the case of no backup SQL file found, we should have the job run marked as a failure (which I assume will happen with a non-zero exit code from the script).

When no backup file is found, exit 1:

When the psql command fails, exit 1 (the following example failure happened when I passed an empty file name: psql -h fom-24-db -U ${POSTGRES_USER} -d ${POSTGRES_DB} -f ''):
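A rough sketch of that failure handling, under a few assumptions: run_restore is a made-up function name, DB_HOST stands in for ${NAME}-${ZONE}-${COMPONENT}, and the ON_ERROR_STOP setting is an addition not shown in this PR. It makes psql stop at the first failing statement and exit with a non-zero status.

```shell
#!/bin/sh
# Hypothetical wrapper: run one SQL file and turn any failure into a
# non-zero return code so the cron job run is marked as failed.
run_restore() {
  sql_file="$1"

  # Refuse to continue when the file is missing or empty.
  if [ ! -s "$sql_file" ]; then
    echo "Backup SQL file missing or empty: $sql_file" >&2
    return 1
  fi

  echo "Running SQL file: $sql_file"
  if ! psql -v ON_ERROR_STOP=1 -h "$DB_HOST" -U "$POSTGRES_USER" \
            -d "$POSTGRES_DB" -f "$sql_file"; then
    echo "Database restore failed" >&2
    return 1
  fi
  echo "Finish database restore"
}
```

In the job script this would be called as run_restore "$sql_file" || exit 1, so that a failed restore ends the container with a non-zero exit code.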

I think it's better to run backup before restore, but not sure if we want to include that in this script as well, it's getting complicated....

Yes, probably sufficient for now to just have our documented process say to do a manual backup before trying to do the restore. (and bring the app down before doing the backup).

db/openshift.deploy.yml

…test-db-backup

basilv · 2024-08-21T17:28:14Z

db/openshift.deploy.yml

+            app: ${NAME}-${ZONE}
+            cronjob: ${NAME}-${ZONE}
+        spec:
+          backoffLimit: ${{JOB_BACKOFF_LIMIT}}


Where are these limits set?

these are set for the db backup cronjob, I thought maybe I can use them here as well

Hmm, not sure if that makes sense, since we have retry = never.

um... this I'm not so sure... just follow the one for db backup

I changed it to 0. Setting the restart policy to Never ensures that when the pod fails, it won't be restarted. But we still need to set backoffLimit to make sure the job will not be retried.

basilv · 2024-08-21T17:33:19Z

db/openshift.deploy.yml

+                      echo "Running SQL file: $sql_file"
+                      psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -d ${POSTGRES_DB} -f $sql_file
+                      echo "Finish database restore"
+                    fi


The restore finishes with the old schema existing, which is fine. Part of the recovery process could be to remove that. Also, maybe this process should perform a backup first before doing the restore?

…test-db-backup

…, refs: #638

basilv · 2024-08-28T17:26:36Z

Reading the wiki process: "Do a manual backup of the current database (can do that in the database pod, so this temporary backup will be stored in the database volume)" -> the specific commands to run should be provided. I'm a little concerned by doing this versus the normal backup process as if you need to restore from this special backup you need a special restore process, but on the other hand if the regular restore process fails, this approach should still succeed. Just want to be careful that if this special backup isn't stored within a PVC, then if the DB pod restarts the backup will be lost.
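One possible shape for those manual backup commands, offered only as a sketch: manual_backup is a hypothetical function name, the file naming is invented, and the target directory must be one that is actually mounted from the database PVC, otherwise the dump is lost when the pod restarts.

```shell
#!/bin/sh
# Hypothetical helper, run inside the database pod: dump the current
# database to a timestamped plain-SQL file in "$1" and print its path.
manual_backup() {
  target_dir="$1"
  out="$target_dir/manual-backup-$(date +%Y%m%d-%H%M%S).sql"

  if ! pg_dump -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f "$out"; then
    echo "Manual backup failed" >&2
    rm -f "$out"   # do not leave a partial dump behind
    return 1
  fi

  echo "$out"
}
```

A plain-SQL dump like this can be loaded with psql -f, the same mechanism the restore cronjob uses, so in principle it would not need a separate restore path.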

basilv · 2024-08-28T17:28:02Z

db/openshift.deploy.yml

+                  # use the same image as our database, so we can run the psql command
+                  image: image-registry.apps.silver.devops.gov.bc.ca/${OC_NAMESPACE}/${NAME}-${ZONE}-${COMPONENT}:${ZONE}-db
+                  command: ["/bin/sh", "-c"]
+                  args:


I thought we talked about making this a script file that's loaded into the database image. But I'm okay if it is left like this.

basilv · 2024-08-28T17:28:36Z

db/openshift.deploy.yml

+                      exit 1
+                    else
+                      echo "Found SQL file, rename existing database, and create a new empty database"
+                      psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -c "DROP DATABASE IF EXISTS ${OLD_FOM_DATABASE_NAME};" -c "ALTER DATABASE fom RENAME TO ${OLD_FOM_DATABASE_NAME};" -c "CREATE DATABASE fom;"


Still not checking return code from this psql call.
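One way to check it, sketched under assumptions: prepare_empty_database is a hypothetical name, DB_HOST stands in for ${NAME}-${ZONE}-${COMPONENT}, and the three statements are run one at a time so the failing step can be reported.

```shell
#!/bin/sh
# Hypothetical wrapper: move the current "fom" database aside under the
# name given in "$1" and create an empty "fom", stopping at the first error.
prepare_empty_database() {
  old_db="$1"

  for stmt in \
    "DROP DATABASE IF EXISTS ${old_db};" \
    "ALTER DATABASE fom RENAME TO ${old_db};" \
    "CREATE DATABASE fom;"
  do
    if ! psql -v ON_ERROR_STOP=1 -h "$DB_HOST" -U "$POSTGRES_USER" -c "$stmt"; then
      echo "Failed: $stmt" >&2
      return 1
    fi
  done
}
```

The job script would call prepare_empty_database "${OLD_FOM_DATABASE_NAME}" || exit 1 before running the SQL file.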

basilv · 2024-08-28T17:30:39Z

db/openshift.deploy.yml

+                      exit 1
+                    else
+                      echo "Found SQL file, rename existing database, and create a new empty database"
+                      psql -h ${NAME}-${ZONE}-${COMPONENT} -U ${POSTGRES_USER} -c "DROP DATABASE IF EXISTS ${OLD_FOM_DATABASE_NAME};" -c "ALTER DATABASE fom RENAME TO ${OLD_FOM_DATABASE_NAME};" -c "CREATE DATABASE fom;"


The process documented in the wiki could talk more about what to do if the restore fails (or wasn't the correct restore point and needs to be redone), which could mean picking a different value for old_fom_database_name, and/or one option for reverting the restore would be to drop the fom database and rename old_fom to fom...
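The revert option mentioned here could look roughly like the following. It is a sketch only: revert_restore is an invented name, DB_HOST stands in for ${NAME}-${ZONE}-${COMPONENT}, and it assumes nothing is connected to either database while it runs.

```shell
#!/bin/sh
# Hypothetical revert: drop the freshly restored "fom" database and rename
# the preserved copy (name given in "$1") back to "fom".
revert_restore() {
  old_db="$1"

  if ! psql -v ON_ERROR_STOP=1 -h "$DB_HOST" -U "$POSTGRES_USER" \
            -c "DROP DATABASE IF EXISTS fom;"; then
    echo "Could not drop restored database fom" >&2
    return 1
  fi

  if ! psql -v ON_ERROR_STOP=1 -h "$DB_HOST" -U "$POSTGRES_USER" \
            -c "ALTER DATABASE ${old_db} RENAME TO fom;"; then
    echo "Could not rename ${old_db} back to fom" >&2
    return 1
  fi
}
```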

webgismd · 2024-08-30T03:52:24Z

Team Evergreen has in their backlog to look at DB restore/backup as part of STRA. The goal was to not use a PVC but object storage. They had buckets created for this purpose, but I am not sure how far they have gone with it yet. @craigyu or @DerekRoberts may know?

DerekRoberts · 2024-08-30T04:22:45Z

@webgismd Yup! @RMCampos is working on that. I'll be sidekicking and making it easier to repeat.

basilv · 2024-08-30T04:45:11Z

@webgismd for FOM at least, the database is small and we store a limited number of backups so storage demands are limited. But there are some nice aspects to using object storage instead of a PVC - better resistance to ransomware-style attacks, especially if you can structure your object storage permissions to be insert-only... I'd be more worried though about every team developing a completely different backup/restore process, versus having a defined method for OpenShift Postgres DBs that all the teams can leverage.

webgismd · 2024-08-30T16:41:24Z

I 100% agree here @basilv, FDS is essentially on its own to develop a pattern for itself. I don't see a lot of tactile support from other teams in this area. Perhaps something to discuss with @RMCampos @jazzgrewal @craigyu @paulushcgcj @abschwenker @DerekRoberts

DerekRoberts · 2024-08-30T18:40:36Z

@webgismd @basilv Big yes! Let's solve this and get it out there. Our teams, quickstart, developer.gov, stink overflow, etc.

paulushcgcj · 2024-09-03T22:31:38Z

@webgismd for FOM at least, the database is small and we store a limited number of backups so storage demands are limited. But there are some nice aspects to using object storage instead of a PVC - better resistance to ransomware-style attacks, especially if you can structure your object storage permissions to be insert-only... I'd be more worried though about every team developing a completely different backup/restore process, versus having a defined method for OpenShift Postgres DBs that all the teams can leverage.

The base here was derived from client. In our case, we will keep both a PVC and an S3 backup, as the PVC is a "hot" copy, that will be the more recent one while the S3 will have all past and all current ones, but will be handled as a "cold" copy. This was at least the idea/suggestion I gave and is what we're aiming to do on client.

MCatherine1994 added 15 commits August 13, 2024 16:53

test(638): trigger dev deployment

2dc4793

fix: try to not stop the pod after cron job

efd93e6

fix(638): added db restore cronjob , refs: #638

d9f6f0a

fix(638): fix yml format error, refs: #638

2dbc136

fix(638): fix yml format error, refs: #638

b07619d

fix(638): add placehoder for customize env variables, refs: #638

5ec33a0

fix: add a deployment config for testing

36deccc

another fix to try

e58b583

another fix to try

b11645f

update the yaml file

4eadd0b

fix and try

9a7db64

fix(638): fix database backup cron job, refs: #638

3878448

feat(638): add comment to restore yaml, refs: #638

527197f

fix(638): fix db backup schedule back to original time, refs: #638

cb46007

fix(638): add more comment to the db restore yaml, refs: #638

6563786

MCatherine1994 changed the title ~~Feat/638 test db backup~~ feat: #638 Add cronjob config for database restore Aug 19, 2024

MCatherine1994 requested review from ianliuwk1019 and basilv August 19, 2024 23:31

merge from main

bb5964b

MCatherine1994 marked this pull request as ready for review August 19, 2024 23:33