Duplicacy 'check' command can be very slow #397

Open
jeffaco opened this Issue Apr 10, 2018 · 25 comments

jeffaco commented Apr 10, 2018

The duplicacy check command is very slow with specific storage backends. In particular, Google Drive is quite slow with large backup sets. See the two following logs.

First Azure:

$ time duplicacy check -storage azure -all
Storage set to azure://XXXXX/YYYYY
Listing all chunks
All chunks referenced by snapshot taltos-direct at revision 1 exist
All chunks referenced by snapshot taltos-direct at revision 2 exist
All chunks referenced by snapshot taltos-direct at revision 3 exist
All chunks referenced by snapshot taltos-direct at revision 4 exist
All chunks referenced by snapshot taltos at revision 1 exist
All chunks referenced by snapshot taltos at revision 2 exist
All chunks referenced by snapshot taltos at revision 3 exist
All chunks referenced by snapshot taltos at revision 4 exist

real    1m24.446s
user    0m17.736s
sys 0m0.821s
$

Now Google Drive:

$ time duplicacy check -storage gcd -all
Storage set to gcd://XXXXX/YYYYY
Listing all chunks
All chunks referenced by snapshot taltos at revision 1 exist
All chunks referenced by snapshot taltos at revision 2 exist
All chunks referenced by snapshot taltos at revision 3 exist
All chunks referenced by snapshot taltos at revision 4 exist

real    31m22.371s
user    0m6.230s
sys 0m0.542s
$

Note that Azure took 1 minute 24 seconds for the operation. Meanwhile, GCD took a whopping 31 1/2 minutes.

Investigation showed that the difference lies in the storage backends. Azure implements an API that lets Duplicacy get all the chunk metadata in a very small number of API calls. Meanwhile, GCD chunks are generally spread across 256 directories, and each of those directories needs its own listing operation. Network/backend latency ultimately causes a dramatic speed difference.

The duplicacy check command should support the -threads parameter so that multiple threads can collect the chunk data with considerably less delay. Yes, GCD has a limit on the number of concurrent operations, but some smallish number of threads (like 8-16) should stay under those limits.
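As a rough sketch of the idea (illustrative only, not duplicacy's actual code -- the listDir helper, channel layout, and worker count are all assumptions), a threaded listing pass over the 256 chunk subdirectories could look something like this in Go:

package main

import (
    "fmt"
    "sync"
)

// listDir stands in for a backend's "list one directory" call (hypothetical;
// the real storage interface in duplicacy differs).
func listDir(dir string) []string {
    return []string{dir + "/chunk1", dir + "/chunk2"}
}

// listChunksParallel fans the per-directory listing calls out across a small
// worker pool instead of issuing them one at a time.
func listChunksParallel(dirs []string, threads int) []string {
    jobs := make(chan string)
    results := make(chan []string)

    var wg sync.WaitGroup
    for i := 0; i < threads; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for dir := range jobs {
                results <- listDir(dir)
            }
        }()
    }

    go func() {
        // Feed every directory to the workers, then close everything down
        // once they are all done.
        for _, dir := range dirs {
            jobs <- dir
        }
        close(jobs)
        wg.Wait()
        close(results)
    }()

    var all []string
    for chunkNames := range results {
        all = append(all, chunkNames...)
    }
    return all
}

func main() {
    // The 256 two-hex-digit subdirectories described above.
    var dirs []string
    for i := 0; i < 256; i++ {
        dirs = append(dirs, fmt.Sprintf("chunks/%02x", i))
    }
    chunks := listChunksParallel(dirs, 8) // 8 threads: well under GCD's rate limits
    fmt.Println("listed", len(chunks), "chunks")
}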

TheBestPessimist commented Apr 11, 2018

Related: #208, #230 and #231.

Basically everything should be multithreaded.

Also: I am using Drive File Stream (so basically a "local drive") instead of the Google API, and it works well enough for me (tested with 100 threads, multiple backups and prunes in parallel, etc.). If you want, you can try that!

jeffaco commented Apr 11, 2018

It sounds like Google Drive just has a really crummy API, resulting in really crummy backup operations relative to other storage backends. But other things (like copy) should be multi-threaded regardless.

Google Drive has the benefit of very low cost with high storage limits, relative to other storage backends. So it's hard to ignore it.

If I understand what you are doing, @TheBestPessimist: you install the Google Drive app (so it syncs all your files), you back up to that, and then you expect the desktop app to sync your backup to the cloud? That's pretty distasteful, as it forces a local copy of the data when I don't otherwise want/need that. Or are you doing something different?

TheBestPessimist commented Apr 11, 2018

You install the Google Drive app

No no no no no :D

I am using Google Drive File Stream, not Google Backup and Sync

Here's a comparison: https://support.google.com/drive/answer/7638428?hl=en.

Basically, Drive File Stream acts like a drive (C:, D:, E:, Z:, etc.) on your system:

  • you can ask it to download any file you want for offline use (like normal Google Drive)
  • if a file you access is not already downloaded (synced), it magically fetches it on demand when you have an internet connection, and you can use it (basically think of it like a really slow hard drive -- because that's how I see it)
  • you can use it offline (you see ALL your files, but only those synced offline are available)
  • if you are offline and copy something new, it stores all your data until you get back online, then syncs/uploads
  • it also caches some files it thinks you might need based on your usage.

jeffaco commented Apr 11, 2018

@TheBestPessimist Interesting, I didn't know that even existed. I'll need to see if my .edu domain allows that (hopefully they do).

But I'm not quite clear on how Google Drive File Stream works. Say, for example, it's installed, and then you run a duplicacy check command. It will take the file stream a nontrivial amount of time to read the data and store it locally, and meanwhile duplicacy has no idea that the actual files aren't in place yet. Thus, it seems like duplicacy would error out because files it expects to be there haven't been downloaded yet.

It seems like it would be great for a backup/copy operation, as the chunks that were backed up would still be on disk for the copy operation (assuming you had enough "cache" in the file stream, I imagine). But once you start doing restore operations and whatnot, I don't fully understand how duplicacy would be able to cope with that.

Care to elaborate? 😄

TheBestPessimist commented Apr 11, 2018

It just works™.

GDFS (Google Drive File Stream) has a local cache (6GB for me, with 3TB of data in the drive) where it stores info about all the files and their metadata. Duplicacy uses this metadata, since check only verifies that the chunks exist.

Furthermore: GDFS creates a drive on my computer with all the files you have in there (even though they aren't!) => to any application (including duplicacy) it appears that everything is in place. Then Google does its magic and downloads (streams, as they call it) everything as needed.

I would say: just test it!

TheBestPessimist commented Apr 11, 2018

So basically:
GDFS creates a local drive with all your files => to any program it looks like the file is there => the program accesses the file => the file is downloaded => the program uses it.

tophee commented Apr 11, 2018

I am using Google Drive File Stream, not Google Backup and Sync

Interesting. I didn't know Google had that too, as pCloud is making such a fuss about the exact same feature of theirs.

The difference is apparently:

Google: It just works™.
pCloud: It just doesn't work reliably.

Well, and to be fair:

Google: only with business accounts (right?)
pCloud: even with free accounts

TheBestPessimist commented Apr 11, 2018

Well, the technical details sound like this: they created a virtual filesystem where the kernel space handles file interaction and makes requests to a user-space app, which downloads/uploads the files as needed and handles the cache.

That's at least how I understand it 😵

jeffaco commented Apr 12, 2018

I didn't have time last night to look at this, and I still plan on doing so. However, another concern has come up: designing a backup system dependent on something like Google Drive File Stream can make it very difficult to deal with new versions of Mac OS/X (or Windows for major updates, etc). When a new O/S comes out, sometimes software doesn't run properly under it, and you need to wait for software vendors to update software to work with newer releases.

In the case of Google, this is "risky" at best. At times they don't update very quickly, leaving users "stranded" for some period of time (often many months or longer). Having kernel drivers in an O/S is a huge risk in this regard, and all of it just to work around the fact that duplicacy doesn't support threading in all commands.

Regardless of using Google Drive File Stream (GDFS) to help work around the slow performance of duplicacy check, I think duplicacy should still support the -threads option. Besides, other parameters to duplicacy check (like -files) would make GDFS useless as a workaround for the performance issue. Yet duplicacy check -threads x would still be a huge benefit.

TheBestPessimist commented Apr 12, 2018

Well, it's up to Gilbert to finish the 3.0-web-gui-with-all-the-bells-and-whistles, and afterwards we'll start getting the backend enhancements.

He is, after all, only one person developing this software!

jeffaco commented May 28, 2018

Hi @TheBestPessimist, I have installed Google Drive File Stream, and it's downloading files. I'm running it on Mac OS/X, and I don't see a lot of options. There is an icon for it, and I can select:

  • About
  • Help
  • Send Feedback
  • Pause Syncing
  • Switch Account
  • Sign Out
  • Quit

How do you do things like select the cache size (you mentioned you're using 6GB for 3TB of data)? I can see where the files are, I see how to start/stop it, etc. But other than that, I have minimal control. Is it that Google File Stream is less configurable on the Mac, or is it the same on Windows and it just sort of picks what it thinks are good sizes?

Next, once GDFS is installed, how do I make duplicacy use it? Do I need to change to back up to that location instead, or do I do something different to duplicacy so it'll work? If I need to change duplicacy in some way, specific duplicacy commands to execute would be greatly appreciated.

I have logs of backups, so I know exactly how long backups take and I know exactly how long prune and check operations take. It'll be interesting to see how this improves things.

I am still concerned about GDFS continuing to work going forward. But if I can quickly switch duplicacy between using it and going straight through the API, then I'm much less concerned.

By the way, it would be GREAT if I didn't have to re-seed backups to use GDFS, as I already have a good amount of data there, AND I sync between two cloud sources, which would complicate things if I needed to re-seed.

Thanks in advance for your help!

TheBestPessimist commented May 28, 2018

Okay then,

Let's learn by example:

gcd (web API) preferences:

[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "gcd://backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]

gdfs preferences:

[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "G:/My Drive/backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]

Note that the only difference is the path: on Windows, GDFS is mounted by default as a drive with the letter G (for Google).
As I said: it appears just like a normal drive from the point of view of applications.


How do you do things like select the cache size (you mentioned you're using 6GB for 3TB of data)? I can see where the files are, I see how to start/stop it, etc. But other than that, I have minimal control. Is it that Google File Stream is less configurable on the Mac, or is it the same on Windows and it just sort of picks what it thinks are good sizes?

I was giving the cache size as an example: it doesn't fill up your space unless you tell it to keep stuff offline. By default everything is "online only" (which means it is downloaded (streamed) on an as-needed basis and kept in cache; GDFS clears old stuff from the cache without you doing anything -- all automated). So yes, GDFS picks whatever seems good for your data.

How do you tell GDFS what to also store offline? On Windows, at least, you go to the folder you want offline, right-click it, and there's a submenu with two options:

  • Available offline
  • Online only

Pick the offline option (d'uh :D).


Also see this: https://support.google.com/a/answer/7644837

jeffaco commented May 28, 2018

Hi @TheBestPessimist, sorry, I have a bunch more questions!

So GDFS has been installed for hours now, and it's still syncing; it's currently at 22GB. I didn't set anything to be offline, but it's still copying a LOT of data. I have about 4.83TB of data on Google Drive right now, although I think I can cut that roughly in half soon, after some final testing of restoring data from Duplicacy.

I'm not terribly concerned if GDFS uses a lot of space (like perhaps 100GB or so); I have the space to burn, so it doesn't matter much. As long as it does stop and decide that "enough is enough" at some point!

  1. How long, after you installed GDFS, did it finish syncing and stabilize?
  2. Did you wait for GDFS to stabilize before you started using duplicacy against it? Or did you start using duplicacy against it when it was still doing the initial sync?
  3. What's the best way, with duplicacy, to update my preferences? Should I just edit the preferences file and do a backup? Or is there a duplicacy command that I should use to update the preferences file?
  4. When the preferences file is updated, is the updated file copied to storage? Or should I take caution to save the old/new information?

I guess that's it for now. You can tell where I am right now; let me know the next steps that you suggest. Thanks so much!

TheBestPessimist commented May 28, 2018

How long, after you installed GDFS, did it finish syncing and stabilize?

No idea. I installed it a long time ago. From what I remember, however, I think it didn't take more than 10 minutes to sync the metadata for the files, and that was about it. Maybe your installation derped. Try reinstalling?

Did you wait for GDFS to stabilize before you started using duplicacy against it? Or did you start using duplicacy against it when it was still doing the initial sync?

Since it only took a little time, I assume that yes, I did wait for GDFS to "stabilise".

What's the best way, with duplicacy, to update my preferences? Should I just edit the preferences file and do a backup? Or is there a duplicacy command that I should use to update the preferences file?

Back up the old preferences file under the name preferences_web, then modify the path of your GCD storage from gcd:// to /Volumes/[whatever GDFS is named on Mac]. Try to do a list and a backup and see if it works.
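As a sketch, assuming GDFS mounts at its default /Volumes/GoogleDrive path on macOS (and reusing the backups/duplicacy folder from my example above), the edited entry would look roughly like this:

[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "/Volumes/GoogleDrive/My Drive/backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]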

When the preferences file is updated, is the updated file copied to storage? Or should I take caution to save the old/new information?

On the storage side (the Google Drive remote) you should not do anything. All changes are done locally. Duplicacy is built so that the storage is independent of the way it is accessed.


In case something doesn't work for your backup, just restore the backed-up preferences file (and perhaps save the new one in case you want to retry some other time).

jeffaco commented May 28, 2018

Well, that was a little "exciting" ...

I tried changing the preferences file, did a backup (up to revision 49) followed by a prune, and then did a duplicacy check. Much to my horror, I started getting errors. Here's a snippet:

Chunk 3228c9690a0872b71fba3436208803fe99c2dfef5e9a8d6af195f7419df988b0 referenced by snapshot taltos at revision 1 does not exist
Chunk 025c63689348b5630d1505fb5ffacfdbb97c43dad50cb290ab92bd778dc86a57 referenced by snapshot taltos at revision 1 does not exist
Chunk 1a3c3f4ec8a693864711ef54356c1e5b5ee16af289b70119d9a759f95b10afbc referenced by snapshot taltos at revision 1 does not exist
Chunk f164d6d50eb75a1d3ba3a8e7d17becf50150594a9a66bb7be29866aca71f046a referenced by snapshot taltos at revision 1 does not exist
Chunk da115c15250526a9acfbf2d580e0dd0e36d12dea3f7d5d6e72ba9380990f24ec referenced by snapshot taltos at revision 1 does not exist
Chunk d3d55850ea6334c936c839a0ab8095cf95794361e1f354c2bb39a833179ad392 referenced by snapshot taltos at revision 1 does not exist
Chunk a96379db6d03422c372fab3c3b9c20f806eb46ef93e8efcf318e90ae2373ff03 referenced by snapshot taltos at revision 1 does not exist

Tons and tons of chunks were reported missing.

I then noted that GDFS was in some sort of weird state. Its icon animation indicated that it was syncing but, judging from the log, it wasn't doing a thing. That was strange.

Wondering whether my backup set on Google was actually okay, I changed the preferences file back to the gcd:// path and did a 30+ minute duplicacy check on it (note the log at the top of this issue). This, fortunately, was fine. Phew.

After that was done, I noted that GDFS was unwedged. Feeling emboldened since duplicacy check passed against the raw data, I switched the preferences file back to the GDFS path and did another duplicacy check operation:

Office-iMac:Storage jeff$ time ~/Applications/duplicacy check -all -storage gcd
Storage set to /Volumes/GoogleDrive/My Drive/XXXXX
Listing all chunks
All chunks referenced by snapshot taltos at revision 1 exist
All chunks referenced by snapshot taltos at revision 11 exist
All chunks referenced by snapshot taltos at revision 19 exist
All chunks referenced by snapshot taltos at revision 27 exist
All chunks referenced by snapshot taltos at revision 28 exist
All chunks referenced by snapshot taltos at revision 29 exist
All chunks referenced by snapshot taltos at revision 30 exist
All chunks referenced by snapshot taltos at revision 31 exist
All chunks referenced by snapshot taltos at revision 32 exist
All chunks referenced by snapshot taltos at revision 33 exist
All chunks referenced by snapshot taltos at revision 34 exist
All chunks referenced by snapshot taltos at revision 35 exist
All chunks referenced by snapshot taltos at revision 37 exist
All chunks referenced by snapshot taltos at revision 38 exist
All chunks referenced by snapshot taltos at revision 40 exist
All chunks referenced by snapshot taltos at revision 41 exist
All chunks referenced by snapshot taltos at revision 42 exist
All chunks referenced by snapshot taltos at revision 43 exist
All chunks referenced by snapshot taltos at revision 44 exist
All chunks referenced by snapshot taltos at revision 45 exist
All chunks referenced by snapshot taltos at revision 46 exist
All chunks referenced by snapshot taltos at revision 47 exist
All chunks referenced by snapshot taltos at revision 48 exist
All chunks referenced by snapshot taltos at revision 49 exist

real    1m37.710s
user    0m29.332s
sys     0m7.736s
Office-iMac:Storage jeff$

A duplicacy check operation in 1 minute 37 seconds is awesome, and is phenomenally better than 30+ minutes. This brings the check operation for Google in line with Azure.

I very much like the fact that I can switch back with a simple edit to the preferences file, which goes a long way toward relieving my concern about GDFS not working properly on a future version of Mac OS/X.

I'll monitor over the next few days and see if it's okay over time. Thanks so much for the advice, @TheBestPessimist !

TheBestPessimist commented May 29, 2018

If it works for you @jeffaco (for me it does), how should I add this to the wiki, @gilbertchen?

Until threaded commands are the norm, this would be really helpful for Google Drive users.

jeffaco commented May 29, 2018

I'm having more problems.

I'd say, so far, that Google Drive File Stream is nice when it works, but it's quite finicky (your mileage may vary). It was working great for me for all of half a day, but then I deleted a bunch of content (1.4TB) from Google Drive, from an old Arq backup, via the web interface, and that made duplicacy check start failing again with missing chunks. The source is fine (when using GCD), but GDFS reports failures.

I tried logging out and back in to GDFS (which flushes the cache), but that didn't help. At this point I'll just let it sync for a few days and then try again, but I've abandoned it for now. If, in a few days, it fails again, I'll just give up.

The backup/copy speeds are reasonable (they are multi-threaded), and prune performance against Google Cloud Drive is reasonable as well. It's just the duplicacy check command that's exceedingly slow. I'll probably just counter that by running it much less often (like once or twice a week rather than with every daily backup).
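For example, something like this in cron (the schedule and paths are just placeholders -- a weekly check on Sunday at 03:00):

# run a full check once a week instead of after every daily backup
0 3 * * 0 cd /path/to/repository && /usr/local/bin/duplicacy check -all -storage gcd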

TheBestPessimist commented May 29, 2018

I'm sorry it doesn't work for you :-/. For me it really, really changed everything. I can even schedule backups every minute, and all my data looks all right. (I back up 3 different machines to the same GCD storage.)

@martinpengellyphillips

This comment has been minimized.

Copy link

martinpengellyphillips commented Jul 28, 2018

I'm also seeing a very slow check command using sftp storage (remote latency 70ms).

It has been stuck on "Listing all chunks" for a few hours now.

Is there a way to get more info / speed this up?

jeffaco commented Jul 31, 2018

I had diagnosed the problem (getting chunk listings from Google Cloud Drive) by using the debug option in duplicacy. With that output, it was very clear where the time was being spent, and why.

In the case of Google Cloud Drive: most cloud storage providers (Azure, S3, etc.) can list all stored chunks with a single API call, or a handful of them. But Google Cloud Drive presents a "directory"-style protocol, where you must request a listing per directory. As I recall, duplicacy creates a large number of directories to hash the chunks into, so doing a list operation on each individual directory really adds up, particularly when it's not multi-threaded (i.e., issue an API call, wait for the response, over and over again for each directory).

In your case, I'd start with debug output and see if that helps explain things.
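If memory serves, the debug flag is a global option that goes before the command, so something along these lines (substitute your own storage name):

$ duplicacy -d check -all -storage <storage name>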

martinpengellyphillips commented Jul 31, 2018

Thanks @jeffaco

You are right that the slowdown is caused by having to list many directories on the remote storage. This is exacerbated by the fact that the storage was created pre-2.0.10 and so has two levels of directories, for a total of 65536 directories.

jeffaco commented Jul 31, 2018

Hmm. I always back up to multiple storages (just in case something goes wrong). That gives me the ability to nuke a storage and copy again (although it will take quite a while).

What happened in pre-2.0.10 vs. after that? As I recall, with Google Cloud Drive, duplicacy created 256 directories at the top level and 256 directories under each of those (65536 directories in total). Is this different post-2.0.10? Would this positively affect duplicacy check performance?

Of course, the "right" solution is implementing duplicacy check --thread, but it not clear if that will be done by the author anytime soon (he's busy with a LOT of other things). Being open source, someone else can contribute that - PRs against duplicacy are accepted. But I haven't looked enough at duplicacy to understand just how much work that would be.

martinpengellyphillips commented Jul 31, 2018

I have Nearline as well, so I could nuke and re-copy. It is about 3TB though, so a lot of data to shift offsite on my connection.

The author mentioned to me that the directory structure is different post-2.0.10, but I'm not clear on the exact details. Waiting to hear more.

And yes, -threads for the check command would be good.

gboudreau commented Jul 31, 2018

#259 is what was changed in 2.0.10, regarding nesting levels. From what I understand, that means all remote storages created with 2.0.10 and later will only have 256 folders, instead of 256x256.
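So, roughly (paths illustrative, using the leading hex characters of the chunk ID as directory names):

Pre-2.0.10 (two nesting levels, 256 x 256 = 65536 directories to list):
    chunks/3a/7f/0c9e...rest-of-chunk-id

2.0.10 and later (one nesting level, 256 directories):
    chunks/3a/7f0c9e...rest-of-chunk-id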

jeffaco commented Nov 6, 2018

So I noticed that the prune command now takes a -threads option.

Any plans to add this to the check command as well (the point of this issue)?
