Allow for separating data parsing operations from backup #31

Open
hany opened this issue Jun 24, 2015 · 12 comments

Comments

@hany

hany commented Jun 24, 2015

This tool looks fantastic! Thanks @danielebailo for putting it together.

We have a very large CouchDB installation (~400GB in size). Are there any downsides to running this tool against a large data set like this?

@dalgibbard
Collaborator

Simple answer, shouldn't be a problem.

I've used it on approx 30GB datasets without issue; there are no real limitations that I'm aware of.

Most filesystems support files of this size without issue. I'd recommend doing the backup on the local machine rather than across the network if possible.

If it does happen to stop part way through the backup, it may be worth us looking into limiting the _all_docs API endpoint with start and end points so the backup can run in batches. That way we could restart a single segment rather than needing to restart the entire export (i.e. splitting the backup into pieces for transfer). Not sure if that's even possible off the top of my head; but I digress :)
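To sketch the idea (none of this is in the script today; limit, skip and startkey are standard _all_docs query parameters, everything else is purely illustrative):

    # Illustrative batching loop only -- not part of couchdb-backup.sh.
    DB="http://127.0.0.1:5984/my-db"
    LIMIT=10000
    START=""
    CHUNK=0
    while :; do
        curl -sS "${DB}/_all_docs?include_docs=true&limit=${LIMIT}${START}" > "chunk_${CHUNK}.json"
        # naive extraction of the last row id; a real implementation would use a JSON parser
        LAST_ID=$(grep -o '"id":"[^"]*"' "chunk_${CHUNK}.json" | tail -1 | cut -d'"' -f4)
        [ -z "$LAST_ID" ] && break
        START="&startkey=%22${LAST_ID}%22&skip=1"
        CHUNK=$((CHUNK + 1))
    done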

What's the total document count, and the deleted documents count? Do you use attachments in your DB?

Please do let us know how you get on (and, out of interest, how long it takes!)

@dalgibbard
Collaborator

PS. Be aware that the raw export file requires processing after extraction - this means that if your exported DB backup file is 200GB on disk, you'll need 400GB in total to cover the processing overhead. It's probably going to take a while too!

@dalgibbard
Collaborator

PPS. Export is CPU and disk IO hungry; try and run it on an unused/unloaded node for best results.

@hany
Author

hany commented Jun 24, 2015

Thanks a lot for your replies. I'm going to investigate using this on said database and report back.

I think being able to chunk up the backup would be very helpful, especially considering the high CPU and IO impact. Chunking it up will also allow us to introduce a short sleep between each chunk, which can lessen the impact. I do plan on running this locally however, so the risks of backups being interrupted will be minimized.

Out of curiosity, is it possible to add compression somewhere? With such a large database, gzip would help greatly. Of course I can always gzip it after the fact, but that adds significant time to the backup, versus being able to compress during the backup.
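To illustrate what I'm thinking - a minimal sketch, assuming the dump boils down to a single _all_docs GET (I realise the script does more than this):

    # Sketch only: stream the dump straight through gzip instead of writing raw JSON first.
    curl -sS "http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true" \
        | gzip > dumpedDB.json.gz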

Any thoughts on this?

@dalgibbard
Collaborator

Raised #32 and #33 to consider the possibilities on these two points :)

@danielebailo
Owner

@hany I'm happy you find this tool useful. But we must also acknowledge @dalgibbard, who's actually doing a lot of the work on this tool!

Issues #32 and #33 are great: the backup size is really an issue when dealing with huge DBs, so compression and CPU / IO load should be handled somehow.

@dalgibbard thinking ahead, in the future we might build services on top of this script: e.g. a GUI (maybe a couch app?) or other software using it (a RESTful web service?), so it could be convenient not to code everything into one script, but to create single scripts/segments, each doing one job, that can be piped together.
E.g.

  1. a script to download docs (predefined chunk size; the user can set a custom chunk size)
  2. a script for compression (works on single chunks)
  3. a script to parallelize (launches N instances of script 1 to download, and manages latencies, delays, CPU usage and IO in an adaptive way)

and then launching script 3, which pipes script 1 | script 2 N times...
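Roughly like this (all of the script names and flags here are hypothetical):

    # Hypothetical modular pipeline -- none of these scripts exist yet.
    # script 1 dumps one chunk of docs; script 2 compresses it.
    ./script1-dump-chunk.sh -H 127.0.0.1 -d my-db --chunk 0 --chunk-size 10000 \
        | ./script2-compress.sh > chunk_0.json.gz

    # script 3 launches N dump|compress pipelines and paces them to limit CPU/IO load.
    ./script3-parallelize.sh -H 127.0.0.1 -d my-db -n 4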

...OK, just brainstorming about the possible evolution of the script...

Any thoughts?

@hany
Author

hany commented Jun 25, 2015

@danielebailo I noticed the credit to @dalgibbard in the script, so thank you as well!

Unfortunately, it appears that running this backup tool took way too long against our large data set, and the increased CPU and IO caused some performance issues with the server. The slowest part appears to be the sed operations that occur after the initial dump. The multiple passes over the large .json files seem to be adding a lot of overhead.

Is it necessary to run all those additional operations during the backup cycle after the initial dump? Since you have the source .json files, could you not save those operations for the restoration process instead? Backups need to occur on a regular basis, but restorations are less frequent. Granted, saving those operations for restoration will certainly increase the time it takes to restore, which may be during a critical period. For large data sets, however, having timely backups may be more important than restoration times (at least the data was backed up).

Just my $0.02.

@dalgibbard
Collaborator

The sed statements drastically reduce the size of the final output file on disk, as well as making it actually importable (during restore it makes sense not to mangle the input, in case someone is trying to import data that isn't a genuine backup; otherwise it may have an undesirable effect). Note that the number of threads used during the sed stage is configurable.

With regards to modularising the script: absolutely, yes. It's a bit monolithic in the way that it's grown, but it does its job :-)

I'd have concerns about how reasonable it would be to continue using bash if we get to the stage of rewriting it, though; alternatives have much nicer ways to edit/compress/sort the data on the fly etc., and probably with better code cleanliness too.

Being honest though - I don't see myself pursuing those options much. The current code does the job within the known limitations, and I'd much rather the CouchDB devs provided/managed backup functionality internally... not that I see that happening though.

@hany
Author

hany commented Jun 26, 2015

@dalgibbard for sure, and I don't disagree with you. My comments regarding the sed operations were not to say that it shouldn't be run, but rather that it should be split off so it can be run at another time. Even with threading, the sed operations take about 10x longer than the actual document dump, not to mention driving up the load considerably due to the extra CPU cycles.

With such a big, busy DB, our options are quite limited, so our focus has been on just getting a timely backup done. I love the way this script works, but the heavy operations are making it unusable for us.

Some ideas:

bash couchdb-backup.sh -b -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

Performs a base backup (essentially just running the curl command).

bash couchdb-backup.sh -p -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

-p is for "prepare", where it runs stages 1, 2, 3, and 4.

bash couchdb-backup.sh -r -H 127.0.0.1 -d my-db -f dumpedDB.json -u admin -p password

Standard restore; however, it requires the "prep" stage to have been run first.

At this point, we've resorted to using plain old tar with gzip compression. We've had to lower the compression level in order to allow the backup to finish in a reasonable time.
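In case it's useful, it's roughly along these lines (the data directory path is just illustrative; gzip -1 trades compression ratio for speed):

    # Roughly our current approach: file-level backup of the CouchDB data directory with fast gzip.
    tar -cf - /var/lib/couchdb | gzip -1 > couchdb-backup.tar.gz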

Just food for thought.

@dalgibbard
Collaborator

It's definitely doable, but what I'd suggest we do is:

  • -b does the backup and parsing as usual
  • -b -S does the raw backup and skips parsing
  • -P does the standalone parsing only.

The main issue with the compression stuff is that we push it to disk from curl before running any other jobs on it. I wonder what the speed would be like if we just piped that out through the seds, followed by compression at the end? One assumes that the main bottleneck is the output from CouchDB - this might keep up at the usual dump rate without the need for additional processing (and disk IO) afterwards. Hmm.
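Roughly the shape I have in mind (where "parse.sed" just stands in for the script's existing sed expressions; it's not a real file):

    # Shape of the idea only: dump, parse and compress in one stream, no intermediate raw file.
    curl -sS "http://admin:password@127.0.0.1:5984/my-db/_all_docs?include_docs=true" \
        | sed -f parse.sed \
        | gzip > dumpedDB.json.gz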

@dalgibbard
Collaborator

Note: not -P, as that's already used to pass non-standard port numbers :) but you get the gist

@dalgibbard changed the title from "Does this work with large data sets?" to "Allow for separating data parsing operations from backup" on Jul 8, 2015
@dalgibbard
Collaborator

Amended the title to more accurately represent the issue now at hand.

Task:

  • Add flags around the data parsing during import to enable it to be managed as a standalone operation.
