Add a -filters option to the backup command #314

Open
Nelvin opened this issue Dec 26, 2017 · 10 comments

@Nelvin

commented Dec 26, 2017

Optionally allow the use of different filter sets, defined in separate files, to enable more flexible backup strategies.

Big Outlook .pst files, for instance, create relatively huge amounts of chunk data and network traffic when using cloud storage. In my case it's often about 100-200 MB even when I just did a simple mail check and received only a few KB of actual mail data (I understand it's just the nature of binary data).

The other extreme is basic .txt files; they're so small they could be backed up multiple times an hour without any issue.

Another situation is a huge repository where some files change only very rarely but others change constantly (a huge archive of photos vs. the articles you're currently writing, etc.). One option is to have multiple repositories, but another would be to have a single repository and use different filters with different scheduled intervals, for instance:

filters_fullbackup (used once a week)
filters_daily (daily)
filters_active_projects (once an hour)
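
For illustration, such a setup might be driven from a scheduler roughly like this. This is only a sketch of the proposed syntax: the -filters flag does not exist yet, and the file names are the ones from the list above (-stats is an existing backup option).

```
# Hypothetical invocations of the proposed option
duplicacy backup -filters filters_fullbackup -stats       # once a week
duplicacy backup -filters filters_daily -stats            # daily
duplicacy backup -filters filters_active_projects -stats  # once an hour
```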

@williamgs

commented Jan 9, 2018

I like the idea of per-backup filters in addition to the repository's own filter settings.

Obviously the 'restore' command would benefit from the same ability (it sort of works this way already, but with explicit filters instead of named presets).

@droolio

commented Aug 17, 2018

I'd like to propose an enhancement to this idea. Actually, I'm not sure if it's the opposite functionality, but it seems to have major overlap...

When I first discovered Duplicacy, coming from CrashPlan, one of the features on my want list was a "watch file system in real-time" option. Now, this clearly won't work (directly at least) with the CLI version and, as it turned out, Duplicacy is pretty damn fast at iterating the folder structure on my system anyway. However, my concern is that the current methodology may not scale very well the more folders you have. This concern also stems from the fact that CrashPlan requires a regular "verify selection" step to cover any files missed while the file system was being 'watched', which took ages.

Duplicacy is already very nippy, but what if you could feed it a filter list or a simple list of folders to 'update' only?

It wouldn't exclude files and folders outside this filter list - it would still include the metadata for all those files, unchanged from the previous successful backup, in the next snapshot - but it would only check the specified filtered paths for changes.

So instead of iterating over the folder structure and detecting changes through modification dates etc., it would look at the specified 'include filter' only and check file modification times. This alternative scan method could also be combined with the -hash option.

This wouldn't be entirely useful in the CLI version, but I guess you could combine it with a separate program, service or tray icon tool, that would watch for changes in the repository and compile a filter list for next invocation of a backup job. Or the user could simply pick out their own filter where they knows files will be modified regularly. They could run this filtered backup perhaps hourly and regular unfiltered backup once a day, to catch anything that was modified that day, but not within the filter.

The GUI version could definitely benefit from a "watch folders" option. It already has a tray icon. It can monitor changes within each repository and you could run backup jobs much more frequently - say, every 15 minutes, as you could with CrashPlan - without heavily impacting the performance of other tasks (especially if you could throttle the CPU as well). It could potentially complete backups of Windows user profiles, for example, within seconds rather than 2-3 minutes (depending on the quantity of data, obviously).

@Nelvin

Author

commented Aug 17, 2018

Great idea too, but even though it feels kind of similar, I think it's a very different feature.

Monitoring the file system, for instance, seems to be contrary to the generally stateless way Duplicacy works, which I think is a great way to write reliable software. The more state you have to track and keep in sync with whatever data you track, the more chances there are for something to go wrong (resulting in even more complex code, as seen in CrashPlan with the verify selection step).

IMO the folder scanning in Duplicacy is so fast that it really doesn't matter even for quite a lot of files (and with optional filters you'd limit the scanning to your area of interest anyway). What I think is a much more limiting factor for many users is the upload speed, which is one of the primary reasons I'd like to see the optional filters.

Scanning the folders only takes a few seconds, but if you have lots of files the metadata for a snapshot can become relatively big (for me it's probably going to be more than 50 MB), and especially for backups to a cloud service it would be better to keep this as small as possible.

Of course there are as many needs as there are users, so filesystem monitoring may be a much better option in your situation.

@TheBestPessimist

Contributor

commented Aug 17, 2018

-1 for @droolio's feature as well: having live monitoring the way you presented it doesn't seem to fit the philosophy of "unmodifiable snapshots" which Duplicacy follows, as it implies that somehow only some folders may be included in a snapshot.

It may also be that the code needed for such functionality would be pretty damn complex for the (in my opinion) little gain it provides.

This is at least how I see it.

@droolio

commented Aug 17, 2018

"unmodifiable snapshots" which duplicacy follows as it implies that somehow only some folders may be included in a snapshot.

Not sure what you mean by this?

The idea is that ALL files and folders existing in the previous snapshot carry forward into the next snapshot - as would happen anyway if they were unmodified - and only files that were seen to change (because the repository is being monitored by the GUI service/tray) would be checked to see whether the file modification times, size, etc. (or hash, if using -hash) had actually changed.

Once a day it could run an unfiltered (normal) backup, so it would iterate through all the folders. No additional complex code to verify the selection is necessary, as it's a normal backup. This is just to catch any missed modifications, as the API on Windows for detecting file changes apparently isn't 100% accurate. (Apart from CrashPlan, SyncTrayzor/Syncthing and numerous other programs do this.)

It doesn't change the nature of snapshot contents; it's a performance enhancement.

@TheBestPessimist

Contributor

commented Aug 17, 2018

> and only files that were seen to change would be checked to see whether the file modification times, size, etc. (or hash, if using -hash) had actually changed

So basically just run the normal backup? :D

(You seem to want Duplicacy to be notified by Windows if anything changed, then have Duplicacy itself check whether Windows lied or not, and then back up everything. That just sounds weird.)

@droolio

commented Aug 17, 2018

For clarity, I should probably break my idea out into its own issue; it's clearly not the same idea as -filters (but it could use a filter-like generated file to update only a given set of files or paths).

But to reiterate - I'm talking about a performance enhancement here, and it's not a weird idea. :) Look up watched folders in Syncthing/SyncTrayzor. CrashPlan has a "Watch file system in real-time" option for the same reason.

Under the hood (CLI), you'd need a way to pass a filtered list of files/folders to rescan, instead of iterating through the file tree. This is primarily a benefit for the GUI version, because it can capture the vast majority of modifications and can be run on a much more regular basis.

@TheBestPessimist

Contributor

commented Aug 18, 2018

Then go for a new issue, because right now I'm still lost and I don't see the usefulness (for a backup program).

@Nelvin

Author

commented Aug 28, 2018

Given the new -repository option (as discussed in https://forum.duplicacy.com/t/repository-init-and-add-options/1195/30), IMO this feature request is not required anymore, as it's now possible to do what I had in mind in a better/more transparent way.
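
For anyone finding this later, the approach I mean looks roughly like this; the paths, snapshot IDs and storage URL are made up, and the exact init syntax should be checked against the forum thread above:

```
# Two separate preference directories backing up the same data,
# each with its own .duplicacy/filters file and its own schedule.
cd ~/backup-configs/daily
duplicacy init -repository /data/projects projects-daily b2://my-bucket
# put the daily include/exclude rules into ~/backup-configs/daily/.duplicacy/filters

cd ~/backup-configs/hourly
duplicacy init -repository /data/projects projects-hourly b2://my-bucket
# put the active-project rules into ~/backup-configs/hourly/.duplicacy/filters
```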

Maybe someone else is still in need of this, or @gilbertchen has already started the implementation? If not, @gilbertchen, please close the issue.

@Nelvin

Author

commented Aug 29, 2018

I stand corrected, partly by my own reply over at the previously linked thread :)
It seems it would still be great to have a -filters option and/or an option to add it to an entry in the preferences file, as that seems to be the last small missing piece to make the configuration of backup sets extremely flexible ... so ... @gilbertchen, keep going :)
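
For illustration, an entry in the preferences file with such an option could look roughly like this; the "filters" key is the suggested addition (it does not exist at the time of writing) and the other fields are only meant to resemble the usual .duplicacy/preferences layout:

```
[
    {
        "name": "daily",
        "id": "projects-daily",
        "repository": "/data/projects",
        "storage": "b2://my-bucket",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "filters": "/home/user/backup-configs/filters_daily"
    }
]
```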
