feat(postgres): install wal-e for continuous backups #24
Conversation
Force-pushed from 4854311 to ba05ddc.
Force-pushed from 80c5cf1 to 2a398de.
Force-pushed from dd2cff9 to c2613cf.
I'm going to remove the tests while I figure out the minio failures in another issue (incoming), but this is ready for review.
Removing the tests for now; they'll be implemented later in #41 so we can get this out the door.
Force-pushed from a375516 to ac13ac0.
Pinging for reviews.
Force-pushed from abd3c69 to c2dcb91.
Manual testing is fine; I meant to use that script to validate the PR. I think we need more testing of the backup/recovery scenarios before this is merged. Just ensuring the database has recovered in a simple case doesn't fully exercise the numerous disaster scenarios we've seen in production.
etcd's no longer a dependency, which means half of the network flakiness issues are gone. It's just postgres and minio, so we've got that going for us, which is nice. :) I agree that a better backup/recovery scenario is required for testing. What kind of scenario would you suggest we test? Keep in mind that the beta target is just "Testing: single node failure, recovery", so I only requested the bare minimum here to pass acceptance criteria.
Right, but I don't think we've fully tested recovery. Testing single-node backup/recovery under load is what I'm interested in.
Sure. I'll write up a SQL script to dump a bunch of stuff into a database, kill it off, and see what happens :)
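A minimal sketch of what such a load-generation script might look like — this is hypothetical, not the script actually used in this PR; the table name, batch sizes, and connection settings are all assumptions:

```sh
#!/usr/bin/env bash
# Hypothetical load generator: hammer the database with inserts, then kill it
# mid-write (CTRL+C) to exercise WAL archiving and recovery. All names and
# values below are placeholders, not taken from this PR.
set -euo pipefail

export PGHOST=${PGHOST:-localhost}
export PGUSER=${PGUSER:-postgres}
export PGDATABASE=${PGDATABASE:-deis}

psql -c "CREATE TABLE IF NOT EXISTS load_test (
  id serial PRIMARY KEY,
  payload text,
  created_at timestamptz DEFAULT now()
);"

# Insert batches in a loop; interrupting at any point simulates the failure.
for i in $(seq 1 10000); do
  psql -q -c "INSERT INTO load_test (payload)
              SELECT md5(random()::text) FROM generate_series(1, 100);"
done
```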
Force-pushed from 4a6c581 to 4a8d6f9.
Manually tested and working with no data loss (woop!)
There are 101 lines because of the [...]. Some changes that were necessary to get this working:
Ready for another round of reviews.
Force-pushed from 4a8d6f9 to 8ce7ae9.
Force-pushed from 53a0e83 to 156c5bf.
So the problem @mboersma was seeing with the database being restarted was because the [...]. I've gone ahead and changed the health check behaviour to only check [...].
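The actual check referenced above is elided, but as an illustration only, a liveness check that merely verifies postgres is accepting connections might look like the following, using pg_isready (the host, port, and user are assumptions):

```sh
#!/usr/bin/env bash
# Hypothetical liveness check: succeed as soon as postgres accepts connections,
# rather than gating on anything heavier, so Kubernetes doesn't restart the pod
# while wal-e is still pushing or fetching WAL segments.
exec pg_isready --host=localhost --port=5432 --username=postgres
```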
After these changes I was still able to confirm that less than 0.1% data loss had occurred. I CTRL+C'd a job after a total of 34,551 applications and 35 users had been created with this script:
This was the final log on the controller side before CTRL+C'ing the job:
After killing the database and restarting it, I then jumped into the controller:
And here's the output from the client:
It looks like client number 15 lost 6 applications: a total of 0.02% data loss (6 of 34,551).
Keep in mind that this is also an extreme example. Realistically we've only seen clusters with at most 500 containers running, which is only one or two MB of data at most. This particular pg_dump file is 16MB, and the database was killed while undergoing heavy processing. This likely won't happen in the wild, but as a worst-case scenario I'd say this test went well above and beyond what was expected.
I am getting the following error:
That looks like the database is running. Are you running the database with deis/charts#117?
@jchauncey I wasn't able to reproduce your issue of not being able to see the [...].
OK, I'll try testing it on my AWS cluster.
Alright, it seems to work just fine on AWS, so it's definitely something about running this on kube-solo.
feat(postgres): install wal-e for continuous backups
To test, build the image, modify the deis-database-rc manifest in the deis-dev chart, then run it:
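The exact commands are elided above; a hypothetical sequence along those lines might be (the image name, registry, namespace, and manifest path are all placeholders, not taken from this PR):

```sh
# Build and push a dev image, point the manifest at it, then recreate the pod.
# Every name and path below is a placeholder.
docker build -t myregistry/postgres:dev .
docker push myregistry/postgres:dev

# Edit the image field in the deis-database-rc manifest, then:
kubectl delete rc deis-database --namespace=deis
kubectl create -f deis-database-rc.yaml --namespace=deis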
After that, ensure the new database pod comes back up and has the `foo` database:

closes #25
closes #31
closes #32
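For context, this is roughly how wal-e hooks into postgres for continuous archiving. The sketch below follows wal-e's documented usage (backup-push/wal-push for archiving, backup-fetch/wal-fetch for recovery), not the configuration shipped in this PR; the envdir path is the conventional one from wal-e's README:

```sh
# Take a base backup of the cluster and push it to object storage (minio here).
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/data

# In postgresql.conf, continuous archiving ships each WAL segment as it fills:
#   wal_level = archive
#   archive_mode = on
#   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# To recover, fetch the latest base backup and replay WAL via recovery.conf:
envdir /etc/wal-e.d/env wal-e backup-fetch /var/lib/postgresql/data LATEST
#   restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
```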