Allow for separating data parsing operations from backup #31

Open
hany opened this issue Jun 24, 2015 · 12 comments

Comments

@hany

hany commented Jun 24, 2015

This tool looks fantastic! Thanks @danielebailo for putting it together.

We have a very large CouchDB installation (~400GB in size). Are there any downsides to running this tool against a large data set like this?

@dalgibbard
Collaborator

Simple answer, shouldn't be a problem.

I've used it on approx 30GB datasets without issue; there are no real limitations that I'm aware of.

Most filesystems support files of this size without issue. I'd recommend doing the backup on the local machine rather than across the network if possible.

If it does happen to stop part way through the backup, it may be worth us looking into limiting the _all_docs API endpoint with start and end points so the backup can run in batches. That way we could restart a single segment rather than needing to restart the entire export (i.e. splitting the backup into pieces for transfer). Not sure if that's even possible off the top of my head; but I digress :)
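To sketch the idea (none of this is in the script today; limit, skip and startkey are standard _all_docs query parameters, everything else is purely illustrative):

    # Illustrative batching loop only -- not part of couchdb-backup.sh.
    DB="http://127.0.0.1:5984/my-db"
    LIMIT=10000
    START=""
    CHUNK=0
    while :; do
        curl -sS "${DB}/_all_docs?include_docs=true&limit=${LIMIT}${START}" > "chunk_${CHUNK}.json"
        # naive extraction of the last row id; a real implementation would use a JSON parser
        LAST_ID=$(grep -o '"id":"[^"]*"' "chunk_${CHUNK}.json" | tail -1 | cut -d'"' -f4)
        [ -z "$LAST_ID" ] && break
        START="&startkey=%22${LAST_ID}%22&skip=1"
        CHUNK=$((CHUNK + 1))
    done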

What's the total document count, and the deleted documents count? Do you use attachments in your DB?

Please do let us know how you get on (and, out of interest, how long it takes!)

@dalgibbard
Collaborator

PS. Be aware that the raw export file requires processing after extraction - this means that if your exported DB backup file is 200GB on disk, you'll need 400GB in total to cover the processing overhead. It's probably going to take a while too!

@dalgibbard
Collaborator

PPS. Export is CPU and disk IO hungry; try and run it on an unused/unloaded node for best results.

@hany
Author

hany commented Jun 24, 2015

Thanks a lot for your replies. I'm going to investigate using this on said database and report back.

I think being able to chunk up the backup would be very helpful, especially considering the high CPU and IO impact. Chunking it up will also allow us to introduce a short sleep between each chunk, which can lessen the impact. I do plan on running this locally however, so the risks of backups being interrupted will be minimized.

Out of curiosity, is it possible to add compression somewhere? With such a large database, gzip would help greatly. Of course I can always gzip it after the fact, but that adds significant time to the backup, versus being able to compress during the backup.
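To illustrate what I'm thinking - a minimal sketch, assuming the dump boils down to a single _all_docs GET (I realise the script does more than this):

    # Sketch only: stream the dump straight through gzip instead of writing raw JSON first.
    curl -sS "http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true" \
        | gzip > dumpedDB.json.gz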

Any thoughts on this?

@dalgibbard
Collaborator

Raised #32 and #33 to consider the possibilities on these two points :)

@danielebailo
Owner

@hany I'm happy you find this tool useful. But we must also acknowledge @dalgibbard, who's actually doing a lot of the work on this tool!

Issues #32 and #33 are great: the backup size is really an issue when dealing with huge DBs, so compression and CPU / IO load should be handled somehow.

@dalgibbard thinking ahead, in the future we might build services on top of this script: e.g. a GUI (maybe a couch app?) or other software using it (a RESTful web service?), so it could be convenient not to code everything into one script, but to create single scripts/segments, each doing one job, that can be piped together.
E.g.

  1. a script to download docs (predefined chunk size; the user can set a custom chunk size)
  2. a script for compression (works on single chunks)
  3. a script to parallelize (launches N instances of script 1 to download, and manages latencies, delays, CPU usage and IO in an adaptive way)

and then launching script 3, which pipes script 1 | script 2 N times...
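Roughly like this (all of the script names and flags here are hypothetical):

    # Hypothetical modular pipeline -- none of these scripts exist yet.
    # script 1 dumps one chunk of docs; script 2 compresses it.
    ./script1-dump-chunk.sh -H 127.0.0.1 -d my-db --chunk 0 --chunk-size 10000 \
        | ./script2-compress.sh > chunk_0.json.gz

    # script 3 launches N dump|compress pipelines and paces them to limit CPU/IO load.
    ./script3-parallelize.sh -H 127.0.0.1 -d my-db -n 4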

...OK, just brainstorming about the possible evolution of the script...

Any thoughts?

@hany
Author

hany commented Jun 25, 2015

@danielebailo I noticed the credit to @dalgibbard in the script, so thank you as well!

Unfortunately, it appears that running this backup tool took way too long against our large data set, and the increased CPU and IO caused some performance issues with the server. The slowest part appears to be the sed operations that occur after the initial dump. The multiple passes over the large .json files seem to be adding a lot of overhead.

Is it necessary to run all those additional operations during the backup cycle after the initial dump? Since you have the source .json files, could you not save those operations for the restoration process instead? Backups need to occur on a regular basis, but restorations are less frequent. Granted, saving those operations for restoration will certainly increase the time it takes to restore, which may be during a critical period. For large data sets, however, having timely backups may be more important than restoration times (at least the data was backed up).

Just my $0.02.

@dalgibbard
Collaborator

The sed statements drastically reduce the size of the final output file on disk, as well as making it actually importable (during restore it makes sense not to mangle the input, in case someone is trying to import data that isn't a genuine backup; otherwise it may have an undesirable effect). Note that the number of threads used during the sed stage is configurable.

With regards to modularising the script: absolutely, yes. It's a bit monolithic in the way that it's grown, but it does its job :-)

I'd have concerns about how reasonable it would be to continue using bash if we get to the stage of rewriting it, though; alternatives have much nicer ways to edit/compress/sort the data on the fly etc., and probably with better code cleanliness too.

Being honest though - I don't see myself pursuing those options much. The current code does the job within the known limitations, and I'd much rather the CouchDB devs provided/managed backup functionality internally... not that I see that happening though.

@hany
Author

hany commented Jun 26, 2015

@dalgibbard for sure, and I don't disagree with you. My comments regarding the sed operations were not to say that it shouldn't be run, but rather that it should be split off so it can be run at another time. Even with threading, the sed operations take about 10x longer than the actual document dump, not to mention driving up the load considerably due to the extra CPU cycles.

With such a big, busy DB, our options are quite limited, so our focus has been on just getting a timely backup done. I love the way this script works, but the heavy operations are making it unusable for us.

Some ideas:

bash couchdb-backup.sh -b -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

Performs a base backup (essentially just running the curl command).

bash couchdb-backup.sh -p -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

-p is for "prepare", where it runs stages 1, 2, 3, and 4.

bash couchdb-backup.sh -r -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

Standard restore; however, it requires the "prep" stage to have been run first.

At this point, we've resorted to using plain old tar with gzip compression. We've had to lower the compression level in order to allow the backup to finish in a reasonable time.
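In case it's useful, it's roughly along these lines (the data directory path is just illustrative; gzip -1 trades compression ratio for speed):

    # Roughly our current approach: file-level backup of the CouchDB data directory with fast gzip.
    tar -cf - /var/lib/couchdb | gzip -1 > couchdb-backup.tar.gz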

Just food for thought.

@dalgibbard
Collaborator

It's definitely doable, but what I'd suggest we do is:

  • -b does the backup and parsing as usual
  • -b -S does the raw backup and skips parsing
  • -P does the standalone parsing only.

The main issue with the compression stuff is that we push it to disk from curl before running any other jobs on it. I wonder what the speed would be like if we just piped that out through the seds, followed by compression at the end? One assumes that the main bottleneck is the output from CouchDB - this might keep up at the usual dump rate without the need for additional processing (and disk IO) afterwards. Hmm.
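Roughly the shape I have in mind (where "parse.sed" just stands in for the script's existing sed expressions; it's not a real file):

    # Shape of the idea only: dump, parse and compress in one stream, no intermediate raw file.
    curl -sS "http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true" \
        | sed -f parse.sed \
        | gzip > dumpedDB.json.gz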

@dalgibbard
Collaborator

Note: not -P, as that's already used to pass non-standard port numbers :) but you get the gist

@dalgibbard changed the title from "Does this work with large data sets?" to "Allow for separating data parsing operations from backup" on Jul 8, 2015
@dalgibbard
Collaborator

Amended the title to more accurately represent the issue now at hand.

Task:

  • Add flags around the data parsing during import to enable it to be managed as a standalone operation.
