Allow for separating data parsing operations from backup #31
Simple answer: it shouldn't be a problem. I've used it on approx 30GB datasets without issue, and there are no real limitations that I'm aware of. Most filesystems support files of this size without issue. I'd recommend doing the backup on the local machine rather than across the network if possible.

If it does happen to stop part way through the backup, it may be worth us looking into limiting the _all_docs API endpoint with start and end points so the backup can be done in batches. That way we could restart a single segment rather than needing to restart the entire export (i.e. splitting the backup into pieces for transfer). Not sure if that's even possible off the top of my head; but I digress :)

What's the total document count, and deleted document count? Do you use attachments in your DB?

Please do let us know how you get on (and how long it takes, out of interest!)
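Something along these lines might work, for example. Rough sketch only, not part of the script: the host, credentials, database name and batch size below are made up, and skip-based paging gets slow at large offsets (keying each batch off the previous batch's last key would scale better):

```bash
#!/bin/bash
# Sketch only: batched export of _all_docs using limit/skip paging, so an
# interrupted run can resume from the last completed batch. Host, credentials,
# database name and batch size are placeholders.
set -e
DB_URL="http://admin:password@127.0.0.1:5984/mydb"
BATCH=100000

# Rough total document count from the DB info document (ignores writes that
# happen while the dump is running).
TOTAL=$(curl -sS "$DB_URL" | grep -o '"doc_count":[0-9]*' | cut -d: -f2)

PART=0
SKIP=0
while [ "$SKIP" -lt "$TOTAL" ]; do
    OUT=$(printf 'mydb.part-%05d.json' "$PART")
    # Skip batches that already completed in an earlier, interrupted run.
    if [ ! -s "$OUT" ]; then
        curl -sS "${DB_URL}/_all_docs?include_docs=true&limit=${BATCH}&skip=${SKIP}" -o "$OUT"
    fi
    SKIP=$((SKIP + BATCH))
    PART=$((PART + 1))
    sleep 5   # short pause between batches to ease CPU/IO pressure
done
```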
PS. Be aware that the raw export file requires processing after extraction. This means that if your exported DB backup file is 200GB on disk, you'll need 400GB in total to cover the processing overhead. It's probably going to take a while too!
PPS. The export is CPU and disk IO hungry; try to run it on an unused/unloaded node for best results.
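If it does have to share a box with a busy CouchDB, wrapping the run in nice/ionice at least keeps it out of the database's way. Sketch only; both angle-bracketed items are placeholders, not real names or flags:

```bash
# Run the backup at the lowest CPU priority and in the "idle" IO class so it
# only consumes spare capacity. <backup script> and <usual options> are
# placeholders for whatever you actually run.
nice -n 19 ionice -c 3 ./<backup script> <usual options>
```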
Thanks a lot for your replies. I'm going to investigate using this on said database and report back.

I think being able to chunk up the backup would be very helpful, especially considering the high CPU and IO impact. Chunking it up would also allow us to introduce a short sleep between chunks, which would lessen the impact further. I do plan on running this locally, however, so the risk of the backup being interrupted should be minimal.

Out of curiosity, is it possible to add compression somewhere? With such a large database, using gzip would help greatly. Of course I can always gzip the file after the fact, but that adds significant time on top of the backup, versus being able to compress during it. Any thoughts on this?
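Something like this is what I have in mind (just a sketch, with made-up host, credentials and database name), rather than gzipping the finished file:

```bash
# Compress the raw _all_docs stream as it arrives, so the uncompressed dump
# never has to sit on disk. Host, credentials and database name are placeholders.
curl -sS "http://admin:password@127.0.0.1:5984/mydb/_all_docs?include_docs=true" \
  | gzip -c > mydb.raw.json.gz
```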
@hany I'm happy you find this tool useful, but we must also acknowledge @dalgibbard, who's doing a lot of the work on this tool! Issues #32 and #33 are great: backup size really is an issue when dealing with huge DBs, so compression and CPU/IO load should be handled somehow.

@dalgibbard, thinking ahead, in the future we might build services on top of this script: e.g. a GUI (maybe a couch app?) or other software using it (a RESTful web service?). So it could be convenient not to code everything into one script, but to create single scripts/segments that each do one job and can be piped together, and then launch Script3, which pipes script1 | script2 N times... OK, just brainstorming about possible evolutions of the script. Any thoughts?
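Just to make the brainstorming concrete, something like this is what I'm imagining (all of the script names and options below are hypothetical; none of them exist today):

```bash
# Purely hypothetical decomposition -- none of these scripts exist yet.
# Each stage does exactly one job and hands its output to the next via a pipe:
#   couchdb-fetch.sh : stream the raw _all_docs rows from the server
#   couchdb-parse.sh : turn the raw rows into a _bulk_docs-importable stream
#   gzip             : compress the result on its way to disk
./couchdb-fetch.sh --db mydb | ./couchdb-parse.sh | gzip -c > mydb.backup.json.gz
```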
@danielebailo I noticed the credit to @dalgibbard in the script, so thank you as well!

Unfortunately, running this backup tool took far too long against our large data set, and the increased CPU and IO caused some performance issues on the server. The slowest part appears to be the sed processing. Is it necessary to run all those additional operations during the backup cycle, after the initial dump? Since you already have the source dump on disk at that point, could that processing be separated out and run later?

Just my $0.02.
The sed statements drastically reduce the final output file on disk, as well as making it actually importable. (During restore it makes sense not to mangle the input, in case someone is trying to feed in non-genuine backup data; otherwise it could have undesirable effects.) Note that the number of threads used during the sed stage is configurable.

With regards to modularising the script: absolutely, yes. It's a bit monolithic in the way it's grown, but it does its job :-) I'd have concerns about how reasonable it would be to continue using bash if we get to the stage of rewriting it, though; alternatives have much nicer means of editing/compressing/sorting the data on the fly, and probably with better code cleanliness too. Being honest, though, I don't see myself pursuing those options much. The current code does the job within the known limitations, and I'd much rather the CouchDB devs provide/manage backup functionality internally... not that I see that happening, though.
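For reference, here's a simplified illustration of what that parsing stage conceptually does (these are not the script's actual sed commands): CouchDB prints one _all_docs row per line, and the processing essentially strips the id/key/value wrapper from each row and rewraps the lot as a _bulk_docs-style payload.

```bash
# Simplified illustration only -- NOT the script's real sed commands.
# Input (one _all_docs row per line, as CouchDB emits it):
#   {"total_rows":2,"offset":0,"rows":[
#   {"id":"a","key":"a","value":{"rev":"1-x"},"doc":{"_id":"a","_rev":"1-x","n":1}},
#   {"id":"b","key":"b","value":{"rev":"1-y"},"doc":{"_id":"b","_rev":"1-y","n":2}}
#   ]}
sed -e 's/^{"total_rows":.*,"rows":\[$/{"docs":[/' \
    -e 's/^{"id":.*,"doc"://' \
    -e 's/}},$/},/' \
    -e 's/}}$/}/' \
    raw_dump.json > bulk_docs.json
# Output (a _bulk_docs-style payload):
#   {"docs":[
#   {"_id":"a","_rev":"1-x","n":1},
#   {"_id":"b","_rev":"1-y","n":2}
#   ]}
# Note: the greedy match on ',"doc":' is fine for this illustration, but would
# need more care for documents that themselves contain that string.
```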
@dalgibbard For sure, and I don't disagree with you. My comments regarding the sed processing come down to its cost: with such a big, busy DB, our options are quite limited, and our focus has been on just getting a timely backup done. I love the way this script works, but the heavy post-processing is making it unusable for us. Some ideas (roughly the split sketched below):

- Backup: performs a base backup (essentially just running the raw `curl` dump), with the parsing/processing available as a separate step that can be run later.
- Restore: standard restore, however it requires the parsed/processed file as input.

At this point, we've resorted to using plain old `curl` for now. Just food for thought.
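Concretely, the split I have in mind looks something like this (sketch only; host, credentials, paths and the `parse.sed` name are all placeholders):

```bash
# Step 1 -- during the backup window: raw dump only, minimal CPU, straight to disk.
curl -sS "http://admin:password@127.0.0.1:5984/mydb/_all_docs?include_docs=true" \
  -o /backups/mydb.raw.json

# Step 2 -- later, off-peak (or on another machine entirely): parse and compress.
# "parse.sed" stands in for whatever the script's processing becomes once it
# can be invoked on its own.
sed -f parse.sed /backups/mydb.raw.json | gzip -c > /backups/mydb.bulk.json.gz
```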
It's definitely do-able, but what I'd suggest we do is split things up. The main issue with the compression stuff is that we currently push the data to disk from curl before running any other jobs on it. I wonder what the speed would be like if we just piped that straight out to the seds, followed by compression at the end? One assumes the main bottleneck is the output rate from CouchDB, so this might keep up with the usual dump rate without the need for additional processing (and disk IO) afterwards. Hmm.
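i.e. something along these lines (sketch; `parse.sed` is a stand-in for the actual parsing stage, and the host/credentials/database name are made up):

```bash
# Stream straight from CouchDB through the parse stage into gzip, so the raw
# uncompressed dump never touches disk.
curl -sS "http://admin:password@127.0.0.1:5984/mydb/_all_docs?include_docs=true" \
  | sed -f parse.sed \
  | gzip -c > mydb.backup.json.gz
```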
Note: not ...
Amended the title to more accurately represent the issue now at hand. Task:
This tool looks fantastic! Thanks @danielebailo for putting it together.
We have a very large CouchDB installation (~400GB in size). Are there any downsides to running this tool against a large data set like this?