Recovered from a botched cluster without a backup 🤦‍♀️ - Was this the right way of doing it? #2810
Replies: 1 comment 1 reply
I'd also like to share my attempt at restoring a dead cnpg DB, in a similar way to yours; the main difference is that I ran a Postgres docker container directly on the NAS and exported the DB using pgAdmin. In the end I was not successful in restoring my old Nextcloud data - I'm not sure why - but perhaps the method will be useful to others. What happened in my case was that after upgrading TrueNAS SCALE, my Nextcloud chart failed to deploy. I didn't understand what was going on and attempted a few fixes with zero knowledge, one of which was to downgrade NC to a previous version. In my defense, I didn't know that would damage the DB, and changing versions seemed to be the only way to get NC unstuck from deploying indefinitely.
```shell
sudo docker run -it --rm -d --name=pgrecover \
  -e POSTGRES_HOST_AUTH_METHOD=trust \
  -e PGDATA=/var/lib/postgresql/data \
  -v ${PWD}/pgdata:/var/lib/postgresql/data \
  -v ${PWD}/pg_wal:/var/lib/postgresql/wal/pg_wal \
  -v ${PWD}/controller:/controller \
  --network host \
  postgres:15
sudo docker logs pgrecover
```
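For anyone retrying this: the line worth waiting for in the logs is postgres's stock readiness message. A small sketch of the check I was effectively doing by eye (it only composes the command, since it needs the container above to be running; the grep target is the standard postgres startup message):

```shell
# Compose (not run) a readiness check against the 'pgrecover' container
# started above; the message is postgres's standard startup log line.
READY_MSG="database system is ready to accept connections"
CHECK_CMD="sudo docker logs pgrecover 2>&1 | grep -F \"$READY_MSG\""
echo "$CHECK_CMD"
```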
After this, I shut down the container and tried to reinstall and restore Nextcloud following the migration guide. I got an error on the DB restore operation, but the guide suggests this may be safe to ignore. I did all the other steps (occ etc.). However, my luck seems to have run out. Perhaps main2 would've been the right DB to restore, or perhaps these errors could've been fixed had I contacted TrueCharts support (again), but at this point I gave up. My Nextcloud instance hadn't been used that much yet, so losing the DB was not a huge blow, and I decided not to waste more time on trying to save it instead of setting it up from scratch. I still have the DB backup, so if someone tells me where I went wrong, I might be willing to give it another try just to see whether this method could've been successful.
I came here asking questions, but while gathering my thoughts and presenting the data, I actually figured it out on my own.
TL;DR:
I managed to spin up a postgres docker container and do a `pg_dump` from the PVCs of a botched cnpg cluster.
If you're interested in the details, you can expand below and see the revised process with the success.
I still have some open questions though...
If I was able to spin up a container manually that allowed me to read the database, isn't there another way to restore a broken cnpg cluster?
Is what I described above really the only way I had?
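On that open question: the cnpg kubectl plugin does ship a `promote` subcommand, so one thing I could perhaps have tried first is a manual promotion of the surviving instance. A sketch of what that would look like with my names (it only echoes the command, since I don't know whether a promotion can succeed on a cluster as broken as mine):

```shell
# Compose (not run) a promotion of the surviving instance with the cnpg
# kubectl plugin. Cluster/instance names are the ones from this post;
# whether this works on a cluster in my broken state is untested.
NAMESPACE=ix-myappname
CLUSTER=myappname-cnpg-main
INSTANCE=myappname-cnpg-main-1
PROMOTE_CMD="kubectl cnpg promote --namespace=$NAMESPACE $CLUSTER $INSTANCE"
echo "$PROMOTE_CMD"
```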
My background
I'm completely new to cloudnative-pg, and I'm no master of Kubernetes either, although I believe I can navigate instructions well enough.
I'm completely comfortable with docker and docker-compose; I've been using them for years.
How I got here
Plain naivety and ignorance (see above).
The cloudnative-pg deployment I'm using is based on helm charts by TrueCharts, deployed on a TrueNAS SCALE machine.
TrueNAS SCALE ships with a k3s cluster, and TrueCharts apps are in essence helm charts based on their templates; the DB of choice there is cloudnative-pg. Suffice to say, I didn't actually deploy it myself - it was deployed automatically.
After a power outage on the machine, which preceded the machine running out of disk space - the combination proved lethal for one of the apps, specifically that app's `cnpg` cluster. Mix that with me thinking I could figure it out on my own, and I deleted pods manually (yes, yes I know - stupid!).
I take full responsibility for the stupidity. I know it is my fault, but the fact of the matter is - I'm stuck with it right now.
Since then I've stopped, started reading (what I should have done first), and I think I understand cnpg better, but that doesn't mean I know how to get out of this situation.
What I've got
The backups I have are too old (another stupid mistake), and so I'm invested in recovering the data.
I am left with the PVCs and a cluster that refuses to start.
According to `kubectl cnpg status` (see below), the cluster's primary is set to `myappname-cnpg-main-2`, but that pod is missing, and switching over to `myappname-cnpg-main-1` is failing.
Where I was struggling
I'm currently struggling to execute my plan for recovery (based on this SO answer):
I would like to spin up a `postgres` docker container attached to the data in the PVCs so I can `pg_dump` the database, and then I can reinstall the app and recover it.
What I tried and eventually succeeded with
Copied over the contents of the `myappname-cnpg-main-1` and `myappname-cnpg-main-1-wal` PVCs into a local dir (on a different machine), preserving the permissions; call these `pgdata` and `pg_wal` in my local dir respectively.
The original copy contained a `custom.conf` file with postgres cluster/replication configuration and SSL settings. Once I figured out that this was stopping me from succeeding, I moved the file away and placed a blank file instead:
Spun up a postgres docker container, mapped to the pgdata and wal properly.
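A minimal sketch of the copy-and-blank steps above, demonstrated on throwaway directories (in the real recovery the source is the PVC contents; `cp -a` is what preserves ownership, permissions and timestamps, which postgres insists on at startup):

```shell
#!/bin/sh
set -eu
# Throwaway stand-ins: SRC plays the role of the copied-out PVC data,
# DST the local recovery dir the container will mount as pgdata.
SRC=$(mktemp -d)
DST=$(mktemp -d)/pgdata
echo "ssl = on" > "$SRC/custom.conf"   # the config that blocked startup
echo "15" > "$SRC/PG_VERSION"
# -a keeps ownership, permissions and timestamps intact
mkdir -p "$DST"
cp -a "$SRC/." "$DST/"
# Move the cluster/replication config aside and leave a blank file in its
# place, so postgres still finds the file but loads nothing from it
mv "$DST/custom.conf" "$DST/custom.conf.orig"
: > "$DST/custom.conf"
```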
There were a few warnings about SSL being switched on while `pg_hba.conf` was processed, but I finally got the message I wanted:
Ctrl-C to stop and remove the container, then I spun it up again, this time with `--detach` so it would run in the background.
Then I ran `pg_dump` on the active background container, for both the `postgres` and `myappname` databases, into local files.
Stopped and removed the container, and I could continue with the recovery process.
(my old) Question (see top for a revised version of this one)
Is this approach viable? Is there a different way to recover this cluster without losing the data?
Technical Details / cli output
existing pods
Output of `k3s kubectl get --namespace=ix-myappname pods | sort`:
cnpg status output
Output of `kubectl cnpg status --namespace=ix-myappname myappname-cnpg-main`:
list of pvc
Output of `k3s kubectl get --namespace=ix-myappname pvc | sort`:
Output of `zfs list -o ix:pvc-name,name,avail,usedbydataset,used,quota,canmount,mounted,mountpoint -d3 zpool/ix-applications/releases/myappname`:
logs of the cnpg-main-1
Output of `kubectl logs --namespace ix-myappname myappname-cnpg-main-1`:
PVC contents
Content of the pgdata PVC: